# Catalog Sync CLI
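`catalogsync` is a collection, sync, and download pipeline that runs independently of the GUI. Its goal is to extract the "playlist square" and "toplist" sources from the GUI's Discover page and turn them into a command-line tool that can run as unattended batch jobs.

Platform support is split into two tiers:

- Playlist collection sources:
  - `netease`
  - `qq`
  - `kuwo`
- Download resolution sources:
  - `qq`
  - `kuwo`
  - `migu`
  - `qianqian`
  - `kugou`
  - `netease`

Design priorities:

- Persist "playlist pool -> playlist -> song" to SQLite
- While syncing playlist songs, derive and update "artist pool -> artist -> song"
- Deduplicate downloads by song primary key and valid file location
- Keep one file-location abstraction that covers local disk, cloud drives, and object storage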
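## Document guide

This file covers four kinds of information:

- Project purpose and run pipeline (`collect -> sync -> download -> upload`)
- Code architecture (CLI, collection and sync, download and upload, Ops Console)
- Database design (business entities, file mapping, job orchestration)
- Server deployment and operations (NAS/Linux directory conventions, scripts, logs, restarts)

If you are taking over the project for the first time, read in this order:

1. "Code architecture" and "Database design overview"
2. "Commands" and "NAS / Linux deployment conventions"
3. The Ops Console update notes at the end of this file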
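## Code architecture

The system has four layers: command entry point, domain services, repository layer, and background job console. The core goal is to split collect, sync, download, and upload into a pipeline whose steps are composable, recoverable, and observable.

### Directories and responsibility boundaries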
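```text
musicdl/catalogsync/
  cli.py            # command entry point and argument parsing; assembles the Application
  runtime.py        # runtime path/port/directory conventions (env -> config)
  db.py             # SQLite schema, indexes, add-column migrations, connection settings
  models.py         # domain models and metadata extraction
  repository.py     # catalog-side reads/writes (playlists/songs/files/statistics)
  services.py       # collect + sync orchestration (playlist -> songs -> artists)
  downloader.py     # download planning + multi-source candidate selection + write to disk + dedupe on insert
  resolver.py       # cross-platform candidate search, scoring, fallback strategy
  uploader.py       # object storage backfill, upload queue consumption, presence refresh
  collectors/       # playlist source collectors (NetEase/QQ/Kuwo)
  ops/
    web.py          # FastAPI pages and API (dashboard/playlists/jobs)
    repository.py   # ops-side job repository (job/stage/item/worker)
    runner.py       # background scheduler (lanes, preemption, recovery, convergence)
    executors.py    # stage executors (collect/sync/download/upload)
    maintenance.py  # local duplicate-file scan and dedupe
    config.py       # environment config read/write-back/revision snapshots
    models.py       # Job/Stage/Item status enums and data structures
```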
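Boundary constraints:

- `services.py` handles business orchestration only; it does not touch UI or job scheduling
- `repository.py` handles SQL reads and writes; it knows nothing about download or upload strategy
- `ops/runner.py` decides how jobs run; it does not define collection or download rules
- `ops/executors.py` decides how a single item executes, and updates status via CAS (compare-and-set)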
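
The CAS status update mentioned above can be illustrated with a minimal sketch. It assumes a simplified `job_items` table with only `id` and `status` columns and a hypothetical helper name `cas_update_status`; the real schema and executor code may differ.

```python
import sqlite3


def cas_update_status(conn, item_id, expected, new):
    """Move an item from `expected` to `new`; return True only if this caller won."""
    cur = conn.execute(
        "UPDATE job_items SET status = ? WHERE id = ? AND status = ?",
        (new, item_id, expected),
    )
    conn.commit()
    # rowcount is 0 when another worker already changed the status.
    return cur.rowcount == 1


# Simplified, hypothetical schema for the demo only.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE job_items (id INTEGER PRIMARY KEY, status TEXT NOT NULL)")
conn.execute("INSERT INTO job_items (id, status) VALUES (1, 'pending')")

first = cas_update_status(conn, 1, "pending", "running")
second = cas_update_status(conn, 1, "pending", "running")
```

The second call fails because the row no longer holds the expected status, which is what lets concurrent workers claim an item without a separate lock.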
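### Two main paths

1. Direct CLI path (offline batch processing)
   - `cli.py` -> `CatalogSyncApplication`
   - `collect/sync/download/run/upload` call `services/downloader/uploader` directly
   - Suited to scripted batch jobs or one-off command runs
2. Ops job path (visual, with pause and resume)
   - `ops/web.py` accepts job creation requests (`/api/jobs`, `/api/playlists/*`)
   - `ops/runner.py` splits a job into stages by `job_type` and schedules them by polling
   - `ops/executors.py` executes item by item and writes results back to the `job_*` tables
   - The front end reads live status through the dashboard API and SSE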
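### Key call sequence (using a "sync then download" job as the example)

1. The web side creates a `sync_download` job and writes it to `job_runs`
2. The runner creates the `job_stages` (`sync -> download`)
3. The sync stage generates one `job_items` row per playlist and runs `services.sync_playlist_row`
4. The download stage generates `job_items` rows for songs and runs `downloader.download_song_row`
5. On a successful download it writes `file_assets` + `file_locations` and refreshes the aggregated playlist status
6. The runner sums the stage and item counters and updates `job_runs` to `completed` or `completed_with_errors`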
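### Job concurrency and recovery model

- Two-lane scheduling:
  - `download` lane: exclusive, with limited concurrency to avoid disk and network contention
  - `general` lane: used for collect/sync/upload, allows higher concurrency
- Concurrency within a stage:
  - Controlled by the worker count (10 by default for downloads, configurable)
  - Worker heartbeat, speed, and current item are written to `job_workers`
- Resuming after interruption:
  - On startup the runner scans for recoverable jobs
  - Items that were running are marked `interrupted`
  - Recoverable items are re-queued, and the job moves to `paused` or continues as `running`
- Command control:
  - pause/resume/cancel/retry are written to `job_commands`
  - The runner is the single consumer of commands, which avoids concurrent write conflicts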
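### Extension points (look here when adding a platform or a storage backend)

- New playlist source: implement it under `collectors/*` and register it in `services.py`
- New download source: extend candidate search and scoring in `resolver.py`
- New storage backend: extend the backend adapter and locator semantics in `uploader.py`
- New job type: add the stage sequence and display name in `ops/jobdefs.py`
- New operations capability: add the API in `ops/web.py` and the state model in `ops/repository.py`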
### Job state transition diagram (JobStatus)

The diagram below corresponds to `JobStatus` in `ops/models.py`:
```mermaid
stateDiagram-v2
[*] --> queued
queued --> running: runner claim
queued --> canceled: cancel
running --> pause_requested: pause command
pause_requested --> paused: all running items drained
paused --> running: resume command
running --> completed: all items success/skipped
running --> completed_with_errors: some items failed
running --> failed: unrecoverable error
running --> canceled: cancel
pause_requested --> canceled: cancel
completed --> [*]
completed_with_errors --> [*]
failed --> [*]
canceled --> [*]
```
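
The same transitions can be written as a lookup table. This is a minimal sketch derived only from the diagram above; the names `ALLOWED_TRANSITIONS` and `can_transition` are hypothetical and are not taken from `ops/models.py`.

```python
# Transition table copied from the state diagram; terminal states map to an empty set.
ALLOWED_TRANSITIONS = {
    "queued": {"running", "canceled"},
    "running": {
        "pause_requested",
        "completed",
        "completed_with_errors",
        "failed",
        "canceled",
    },
    "pause_requested": {"paused", "canceled"},
    "paused": {"running"},
    "completed": set(),
    "completed_with_errors": set(),
    "failed": set(),
    "canceled": set(),
}


def can_transition(current, target):
    """Return True if the diagram contains an edge current -> target."""
    return target in ALLOWED_TRANSITIONS.get(current, set())


def is_terminal(status):
    """A status is terminal when it is known and has no outgoing edges."""
    return status in ALLOWED_TRANSITIONS and not ALLOWED_TRANSITIONS[status]
```

A guard like `can_transition` is one way to reject commands that do not apply to the current status, for example a `resume` sent to a job that is already `completed`.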
## Commands

Initialize the database:
```bash
musicdl-catalogsync init-db --db D:\catalogsync\catalogsync.db --library-root E:\MusicLibrary
```
Collect the "playlist square" and "toplist" sources:
```bash
musicdl-catalogsync collect --db D:\catalogsync\catalogsync.db --sources netease,qq,kuwo
```
Sync playlists that already exist in the database:
```bash
musicdl-catalogsync sync --db D:\catalogsync\catalogsync.db --sources netease,qq,kuwo --limit 20
```
Download songs that are pending download:
```bash
musicdl-catalogsync download --db D:\catalogsync\catalogsync.db --library-root E:\MusicLibrary --sources netease,qq,kuwo --download-sources qq,kuwo,migu,qianqian,kugou,netease --limit 20 --workers 10
```
Run the whole default pipeline in one go:
```bash
musicdl-catalogsync run --db D:\catalogsync\catalogsync.db --library-root E:\MusicLibrary --sources netease,qq,kuwo --download-sources qq,kuwo,migu,qianqian,kugou,netease --limit 20 --workers 10
```
Run directly from a playlist file:
```bash
musicdl-catalogsync run --db D:\catalogsync\catalogsync.db --library-root E:\MusicLibrary --playlist-file D:\catalogsync\playlists.txt --download-sources qq,kuwo,migu,qianqian,kugou,netease --workers 10
```
Register an object storage backend:
```bash
musicdl-catalogsync register-object-backend ^
--db D:\catalogsync\catalogsync.db ^
--backend main-s3 ^
--bucket music-bucket ^
--endpoint https://s3.example.com ^
--region auto ^
--base-prefix music ^
--credential-env-prefix CATALOGSYNC_MAIN_S3
```
Backfill locally downloaded files to object storage:
```bash
musicdl-catalogsync upload --db D:\catalogsync\catalogsync.db --backend main-s3 --workers 4
musicdl-catalogsync upload --db D:\catalogsync\catalogsync.db --backend main-s3 --sources netease,qq --limit 200
musicdl-catalogsync upload --db D:\catalogsync\catalogsync.db --backend main-s3 --playlist-ids 12,15 --workers 4
```
Start the ops web console (FastAPI + uvicorn):
```bash
musicdl-catalogsync serve --db D:\catalogsync\catalogsync.db --env-file D:\catalogsync\catalogsync.env --host 127.0.0.1 --port 18080
```
You can also invoke the CLI as a module:
```bash
python -m musicdl.catalogsync.cli --help
```
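## `--playlist-file` behavior

When `--playlist-file` is passed, `run` takes a narrow branch:

1. Skip `collect`
2. Read the playlist URLs from the file
3. Parse and deduplicate them
4. Write them to the database as a `manual_file` pool
5. Sync only these playlists
6. Download only the songs linked to these playlists

Without `--playlist-file`, `run` keeps the original default behavior of `collect -> sync -> download`.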
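## `--sources` and `--download-sources`

- `--sources`
  - Controls which canonical platforms' songs are collected, synced, and filtered
  - Currently used mainly for the three playlist sources `netease`, `qq`, and `kuwo`
- `--download-sources`
  - Controls which platforms are searched again before download to resolve a direct link
  - Defaults to the same six platforms as the GUI: `qq,kuwo,migu,qianqian,kugou,netease`

What the download stage actually does:

1. Take the song name, singers, and original snapshot from the canonical song in the database
2. Search the `--download-sources` allowlist again for downloadable candidates
3. Sort candidates by match confidence, then audio quality / file size, then your configured source order
4. Download only after the best candidate has been chosen

This means:

- A song from a NetEase playlist is not necessarily downloaded from NetEase
- When the original platform's official direct link has expired or is unavailable, other download sources are searched automatically for a candidate with the same title and singer
- As long as the match is trustworthy, the higher-quality candidate is preferred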
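
The ordering described above can be sketched as a sort key. This is an illustration under stated assumptions, not the logic in `resolver.py`: the `Candidate` fields, the `min_score` threshold of 0.8, and the integer `quality_rank` are all hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Candidate:
    source: str          # download platform, e.g. "qq"
    match_score: float   # hypothetical 0..1 confidence that this is the same song
    quality_rank: int    # hypothetical rank, higher means better audio quality
    size_bytes: int


def pick_best(candidates, source_order, min_score=0.8):
    """Keep trusted matches from allowed sources, then prefer quality, size, source order."""
    trusted = [
        c for c in candidates
        if c.match_score >= min_score and c.source in source_order
    ]
    if not trusted:
        return None
    return min(
        trusted,
        key=lambda c: (-c.quality_rank, -c.size_bytes, source_order.index(c.source)),
    )


order = ["qq", "kuwo", "migu", "qianqian", "kugou", "netease"]
best = pick_best(
    [
        Candidate("netease", 0.95, 1, 4_000_000),
        Candidate("qq", 0.90, 2, 30_000_000),
        Candidate("kuwo", 0.50, 3, 40_000_000),  # rejected: low confidence
    ],
    order,
)
tie = pick_best(
    [Candidate("kugou", 0.9, 2, 10), Candidate("kuwo", 0.9, 2, 10)],
    order,
)
```

Treating match confidence as a pass/fail gate rather than a sort key is a design choice in this sketch: it keeps small score differences from overriding a clear quality difference.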
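Starting with this version, the `sync` stage also no longer requires the original platform to return a downloadable direct link on the spot:

- As long as the playlist API still returns song metadata, `sync` writes the complete song snapshot to the database
- These songs are stored as "deferred resolution" snapshots; a usable direct link is resolved at download time according to `--download-sources`
- This prevents songs from being dropped during ingestion because of copyright restrictions or temporarily invalid direct links on NetEase / QQ / Kuwo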
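### File format

One entry per line. Three kinds of lines are supported (comment, `platform,URL`, and bare URL):

```text
# comment line
https://music.163.com/#/playlist?id=17745989905
qq,https://y.qq.com/n/ryqq/playlist/7707261125
https://y.qq.com/n/ryqq/toplist/26
https://www.kuwo.cn/rankList?bangId=16
```

Rules:

- Blank lines are ignored
- Lines starting with `#` are ignored
- `platform,URL` is supported
- A bare URL is also supported; the platform is then detected automatically
- Duplicate playlists within the same file are deduplicated automatically
- Platforms currently supported for automatic URL detection are `netease`, `qq`, and `kuwo`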
### Supported URL types

- NetEase regular playlist: `https://music.163.com/#/playlist?id=...`
- QQ regular playlist: `https://y.qq.com/n/ryqq/playlist/...`
- QQ toplist: `https://y.qq.com/n/ryqq/toplist/...`
- Kuwo regular playlist: `https://www.kuwo.cn/playlist_detail/...`
- Kuwo toplist: `https://www.kuwo.cn/rankList?bangId=...`
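
The file rules and URL types above can be sketched as a small parser. This is an illustration only: the host-to-platform table and the function name `parse_playlist_lines` are assumptions, and deduplication here is by exact `(platform, URL)` pair, whereas the real tool may deduplicate by playlist ID.

```python
from urllib.parse import urlparse

# Assumed host mapping, based on the example URLs in this document.
HOST_TO_PLATFORM = {
    "music.163.com": "netease",
    "y.qq.com": "qq",
    "www.kuwo.cn": "kuwo",
}
KNOWN_PLATFORMS = set(HOST_TO_PLATFORM.values())


def parse_playlist_lines(lines):
    """Return unique (platform, url) pairs, skipping blanks, comments, and unknown hosts."""
    seen, result = set(), []
    for raw in lines:
        line = raw.strip()
        if not line or line.startswith("#"):
            continue
        prefix, sep, rest = line.partition(",")
        if sep and prefix.strip() in KNOWN_PLATFORMS:
            platform, url = prefix.strip(), rest.strip()  # "platform,URL" form
        else:
            url = line
            platform = HOST_TO_PLATFORM.get(urlparse(url).netloc)  # bare URL form
        if platform is None:
            continue
        key = (platform, url)
        if key not in seen:
            seen.add(key)
            result.append(key)
    return result


entries = parse_playlist_lines([
    "# comment line",
    "",
    "https://music.163.com/#/playlist?id=17745989905",
    "qq,https://y.qq.com/n/ryqq/playlist/7707261125",
    "https://y.qq.com/n/ryqq/playlist/7707261125",
    "https://www.kuwo.cn/rankList?bangId=16",
])
```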
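## Database design overview

The database is SQLite, with the following connection policy:

- `PRAGMA journal_mode=WAL`
- `PRAGMA busy_timeout=30000`
- `PRAGMA synchronous=NORMAL`
- All tables are defined centrally in `db.py`, and add-column migrations run at initialization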
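
A minimal sketch of opening a connection with these three PRAGMAs, using Python's standard `sqlite3` module. The function name `open_connection` and the temporary database path are assumptions for the demo; the actual code in `db.py` may be organized differently.

```python
import os
import sqlite3
import tempfile


def open_connection(db_path):
    """Open a SQLite connection with the PRAGMAs listed above."""
    conn = sqlite3.connect(db_path, timeout=30)
    conn.execute("PRAGMA journal_mode=WAL")      # readers do not block the writer
    conn.execute("PRAGMA busy_timeout=30000")    # wait up to 30 s on a locked database
    conn.execute("PRAGMA synchronous=NORMAL")    # fewer fsyncs; commonly paired with WAL
    return conn


# WAL needs a file-backed database, so the demo uses a temporary file.
db_path = os.path.join(tempfile.mkdtemp(), "demo.db")
conn = open_connection(db_path)
journal_mode = conn.execute("PRAGMA journal_mode").fetchone()[0]
busy_timeout = conn.execute("PRAGMA busy_timeout").fetchone()[0]
```

`journal_mode=WAL` is stored in the database file, while `busy_timeout` and `synchronous` apply per connection, so every new connection has to set them again.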
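Design goals:

1. Strong deduplication: one entity per platform and remote ID
2. Loose coupling: a song's logical asset is separate from its physical storage location
3. Recoverable: the job state machine is persisted and can resume after a restart
4. Observable: jobs, workers, logs, and events are all stored in tables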
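### Table domains (four domains)

1. Catalog entities (Catalog Core)
   - `playlist_pools`: playlist source pools (square / toplist / manual_file)
   - `playlists`: playlists (platform, remote ID, policy, play count)
   - `songs`: songs (platform, remote ID, name, singers, format, snapshot)
   - `artists`: artists (normalized name + platform dimension)
2. Relationship mapping (Association)
   - `pool_playlists`: many-to-many between pools and playlists
   - `playlist_songs`: many-to-many between playlists and songs (with position)
   - `pool_artists`: many-to-many between pools and artists
   - `artist_songs`: many-to-many between artists and songs
3. File assets (Storage)
   - `storage_backends`: storage backend definitions (local_fs / object_storage / cloud_drive)
   - `file_assets`: logical file versions of a song (quality / format / size / checksum)
   - `file_locations`: physical locations (backend + locator + status + primary copy flag)
   - `song_backend_presence`: aggregated presence of a song on a backend (speeds up queries)
   - `download_tasks` / `upload_tasks`: download and upload queues
4. Job orchestration (Ops)
   - `job_runs`: job overview (type, status, scope, config snapshot)
   - `job_stages`: per-stage counters (collect/sync/download/upload)
   - `job_items`: smallest unit of execution (playlist item / song item / file item)
   - `job_workers`: live worker status, throughput, speed
   - `job_commands`: pause/resume/cancel/retry command queue
   - `job_events` / `job_logs`: audit events and execution logs
   - `config_revisions`: environment config revision snapshots and rollback records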
### Deduplication and consistency constraints (core)

Unique keys (hard constraints):

- `playlists(platform, remote_playlist_id)`
- `songs(platform, remote_song_id)`
- `file_locations(file_asset_id, backend_id, locator)`
- `upload_tasks(file_asset_id, target_backend_id, target_locator)`
- `job_items(job_stage_id, item_key)`

Consistency rules (business layer):

- One `song_id` can map to several `file_asset` rows (different quality or format)
- One `file_asset` can have several `file_location` rows (local + cloud)
- `song_backend_presence` is derived from `file_locations` and is not a source of truth
- A playlist's downloaded / not downloaded / partial status is aggregated from `playlist_songs` plus active local `file_locations`
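
How a unique key turns repeated ingestion into a no-op can be shown with a small sketch. The `songs` table here is a cut-down, hypothetical version with only the columns needed for the demo, and `get_or_create_song` is an illustrative helper, not a function from `repository.py`.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Cut-down demo table; the real `songs` table has more columns.
conn.execute(
    """
    CREATE TABLE songs (
        id INTEGER PRIMARY KEY,
        platform TEXT NOT NULL,
        remote_song_id TEXT NOT NULL,
        name TEXT,
        UNIQUE (platform, remote_song_id)
    )
    """
)


def get_or_create_song(conn, platform, remote_song_id, name):
    """Insert the song if its (platform, remote_song_id) is new; always return its id."""
    conn.execute(
        "INSERT OR IGNORE INTO songs (platform, remote_song_id, name) VALUES (?, ?, ?)",
        (platform, remote_song_id, name),
    )
    row = conn.execute(
        "SELECT id FROM songs WHERE platform = ? AND remote_song_id = ?",
        (platform, remote_song_id),
    ).fetchone()
    return row[0]


a = get_or_create_song(conn, "qq", "123", "Song A")
b = get_or_create_song(conn, "qq", "123", "Song A (seen again)")
c = get_or_create_song(conn, "netease", "123", "Song A")
total = conn.execute("SELECT COUNT(*) FROM songs").fetchone()[0]
```

The same remote ID on a different platform is a separate row, which matches the "one entity per platform and remote ID" goal.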
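### Hot read/write paths (focus for troubleshooting)

1. Collect stage
   - Writes: `playlist_pools`, `playlists`, `pool_playlists`
   - Typical problem: the pool contains playlists but `playlists.collected_song_count` was not backfilled
2. Sync stage
   - Writes: `songs`, `playlist_songs`, `artists`, `pool_artists`, `artist_songs`
   - Typical problem: a playlist is synced but has 0 songs (distinguish "source returned nothing" from "parsing failed")
3. Download stage
   - Writes: `file_assets`, `file_locations`, `download_tasks`
   - Reads: `songs` snapshot + download-source candidates
   - Typical problem: the same file written to disk more than once, with `(1)/(2)` suffixes piling up
4. Upload stage
   - Writes: `upload_tasks`, `file_locations`, `song_backend_presence`
   - Typical problem: upload succeeded but presence was not refreshed, so the UI still shows "not uploaded"
5. Task center
   - Writes: `job_runs/stages/items/workers/commands/events/logs`
   - Reads: dashboard summary, doing/done tree, worker speed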
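### Migrations and backward compatibility

- On every startup `initialize_database()`:
  - runs `CREATE TABLE IF NOT EXISTS`
  - runs the required `ALTER TABLE ADD COLUMN` statements (for example `play_count` and the worker throughput fields)
- This lets an old database be upgraded in place without running SQL migration scripts by hand
- Back up `catalogsync.db` before upgrading, especially before changing the dedupe strategy or running bulk maintenance

### Core ER diagram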
```mermaid
erDiagram
PLAYLIST_POOLS ||--o{ POOL_PLAYLISTS : links
PLAYLISTS ||--o{ POOL_PLAYLISTS : belongs_to
PLAYLISTS ||--o{ PLAYLIST_SONGS : contains
SONGS ||--o{ PLAYLIST_SONGS : appears_in
ARTIST_POOLS ||--o{ POOL_ARTISTS : links
ARTISTS ||--o{ POOL_ARTISTS : belongs_to
ARTISTS ||--o{ ARTIST_SONGS : sings
SONGS ||--o{ ARTIST_SONGS : performed_by
SONGS ||--o{ FILE_ASSETS : has_versions
FILE_ASSETS ||--o{ FILE_LOCATIONS : stored_at
STORAGE_BACKENDS ||--o{ FILE_LOCATIONS : hosts
SONGS ||--o{ SONG_BACKEND_PRESENCE : has_presence
STORAGE_BACKENDS ||--o{ SONG_BACKEND_PRESENCE : summarized_on
JOB_RUNS ||--o{ JOB_STAGES : has
JOB_STAGES ||--o{ JOB_ITEMS : has
JOB_RUNS ||--o{ JOB_WORKERS : owns
JOB_RUNS ||--o{ JOB_COMMANDS : receives
JOB_RUNS ||--o{ JOB_EVENTS : emits
JOB_RUNS ||--o{ JOB_LOGS : writes
```
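## Tables

### Playlist pool -> playlist -> song

- `playlist_pools`
  - Platform source pools, for example `playlist_square`, `toplist`, `manual_file`
- `playlists`
  - A concrete playlist or toplist
- `pool_playlists`
  - Mapping between playlist pools and playlists
- `songs`
  - Main song table; unique key is `(platform, remote_song_id)`
- `playlist_songs`
  - Mapping between playlists and songs

The main song table stores these core fields:

- `remote_song_id`
- `name`
- `singers`
- `ext`
- `file_size_bytes`
- `quality_label`
- `metadata_json`
  - Contains the `SongInfo` snapshot, which can later be restored and handed to the original downloader to continue the download

### Derived artist pools + lazy backfill

- `artist_pools`
  - Artist pools derived from playlist pools
- `artists`
  - Main artist table
- `pool_artists`
  - Mapping between artist pools and artists
- `artist_songs`
  - Mapping between artists and songs

Syncing playlist songs also updates the artist pools, which satisfies the requirement that artist pools are updated whenever playlist pools are updated.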
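## Download deduplication and file mapping

### Logical asset layer

- `file_assets`
  - Represents one file version of one song
  - The usual dimensions are `song_id + quality_label + ext + file_size_bytes`
  - `ext / quality_label / file_size_bytes` reflect the source file that was actually downloaded, and are not tied to the canonical platform

### Physical location layer

- `storage_backends`
  - Describes a storage backend
  - `local_fs` is implemented today
  - Can be extended to cloud drives and object storage
- `file_locations`
  - Records where a file asset currently exists

One way to think about it:

- `file_assets` answers "what file is this"
- `file_locations` answers "where is this file now"

If a song is first downloaded locally and later uploaded to a cloud drive or object storage, the same `file_asset` is reused; only the corresponding `file_location` is added or updated.

### Upload queue and backend presence

- `song_backend_presence`
  - Derived summary table that records whether a song already has an active file on a given backend
  - Commonly used to check quickly whether a song has already been backfilled to main-s3
- `upload_tasks`
  - Upload task queue table
  - One task = one local `file_asset` uploaded to one target backend/key
  - Statuses are `pending`, `uploading`, `succeeded`, `failed`, and `skipped`

An important distinction:

- `file_locations` remains the source of truth
- `song_backend_presence` exists only for fast lookups and does not replace `file_locations`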
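## Behavior when disk space runs out

The downloader checks the free space of the target directory first.
If there is not enough space, it prompts for a new download directory (the prompt reads "Not enough disk space, enter a new download directory to continue:"):

```text
磁盘空间不足,请输入新的下载目录继续:
```

The new directory can be on a different drive. The program will:

- Download songs to the new directory
- Automatically create or reuse a `storage_backend` for the new directory
- Write the new file location back to `file_locations`

With `--workers > 1`, the prompt still appears only once, globally. After a successful switch, download tasks that have not yet started all continue in the new directory.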
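## Object storage upload

A first version of object storage upload is implemented. Backends are treated as S3-compatible.

### Key conventions

1. After a local download completes, a local `file_location` is written first
2. After a successful upload, a remote `file_location` is added for the same `file_asset`
3. The local file is kept, and the local `file_location.is_primary = 1`
4. The remote object storage record has `is_primary = 0`
5. By default the database state is trusted; no extra `HEAD` check is made against the remote object
6. If a song has several active local file versions, all of them are queued for upload

### key / locator rules

The object storage key mirrors the local relative path.
For example:

- Local locator: `qq/Singer A/song-a.flac`
- Backend `base_prefix`: `music`
- Remote locator: `music/qq/Singer A/song-a.flac`

The benefits are:

- The directory structure matches the local one
- Later migration or re-mapping is simpler
- The same locator semantics are easier to reuse when uploading to a CDN or cloud drive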
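
The mirroring rule can be sketched as a small pure function. The name `remote_locator` and the handling of empty prefixes and Windows-style separators are assumptions of this sketch, not documented behavior of `uploader.py`.

```python
def remote_locator(base_prefix, local_locator):
    """Join the backend prefix and the local relative path into an object key."""
    # Normalize Windows-style separators so keys always use forward slashes.
    normalized = local_locator.replace("\\", "/")
    parts = [p.strip("/") for p in (base_prefix or "", normalized)]
    # Drop empty segments so an empty prefix does not produce a leading slash.
    return "/".join(p for p in parts if p)


key = remote_locator("music", "qq/Singer A/song-a.flac")
no_prefix = remote_locator("", "qq/Singer A/song-a.flac")
windows_style = remote_locator("music/", "qq\\Singer A\\song-a.flac")
```

Because the function is deterministic, the remote key can be recomputed from the local locator at any time instead of being looked up.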
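### Backend configuration and credential model

Non-sensitive settings are stored in `storage_backends.config_json`, for example:

- `endpoint`
- `region`
- `base_prefix`
- `addressing_style`
- `public_base_url`
- `credential_env_prefix`

Sensitive credentials are never stored in the database; they come only from environment variables.
For example, with `credential_env_prefix = CATALOGSYNC_MAIN_S3`: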
```dotenv
CATALOGSYNC_MAIN_S3_ACCESS_KEY_ID=your-access-key
CATALOGSYNC_MAIN_S3_SECRET_ACCESS_KEY=your-secret-key
CATALOGSYNC_MAIN_S3_SESSION_TOKEN=optional-session-token
```
If `public_base_url` is configured, a successful upload also writes the derived `public_url` back to the remote `file_location`.
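
Reading credentials by prefix can be sketched as follows. The variable suffixes come from the example above; the function name `load_credentials`, the returned dictionary keys, and the choice to raise on missing values are assumptions of this sketch.

```python
import os


def load_credentials(prefix, environ=None):
    """Read <prefix>_ACCESS_KEY_ID / _SECRET_ACCESS_KEY / _SESSION_TOKEN from the environment."""
    env = os.environ if environ is None else environ
    access_key = env.get(f"{prefix}_ACCESS_KEY_ID", "")
    secret_key = env.get(f"{prefix}_SECRET_ACCESS_KEY", "")
    if not access_key or not secret_key:
        raise KeyError(f"missing credentials for prefix {prefix!r}")
    creds = {"access_key_id": access_key, "secret_access_key": secret_key}
    token = env.get(f"{prefix}_SESSION_TOKEN", "")
    if token:  # the session token is optional
        creds["session_token"] = token
    return creds


# Demo with an explicit mapping instead of the real process environment.
fake_env = {
    "CATALOGSYNC_MAIN_S3_ACCESS_KEY_ID": "your-access-key",
    "CATALOGSYNC_MAIN_S3_SECRET_ACCESS_KEY": "your-secret-key",
}
creds = load_credentials("CATALOGSYNC_MAIN_S3", fake_env)
```

Only the prefix is stored in `config_json`, so the database can be copied or backed up without carrying any secret.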
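### Default behavior of the upload command

By default `upload` does three things:

1. Finds active local files that are still missing on the target backend
2. Deduplicates them, then inserts or reuses rows in `upload_tasks`
3. Runs the uploads with a bounded number of concurrent workers and writes results back to the database

The scope can be narrowed with:

- `--sources`
- `--playlist-ids`
- `--limit`
- `--workers`

Recommended defaults:

- Download: `--workers 10`
- Upload: `--workers 4`

### What the database updates after an upload

- `file_locations`
  - Adds or updates the remote object location
- `song_backend_presence`
  - Refreshes the active summary of the song on the target backend
- `upload_tasks`
  - Records the queued, running, succeeded, or failed status of the task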
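## Reserved for cloud drive compatibility

Recommended conventions:

- Local files:
  - `backend_type=local_fs`
  - `locator` stores the relative path
- Object storage:
  - `backend_type=object_storage`
  - `container_name` stores the bucket
  - `locator` stores the key
- Cloud drive backends:
  - `backend_type=cloud_drive`
  - `remote_file_id` stores the platform file ID
  - `locator` stores the remote directory path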
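## Current implementation notes

- The collection layer already covers the "playlist square" and "toplist" sources of the GUI Discover page
- Special toplist parsing is supported for:
  - `netease_toplist`
  - `qq_toplist`
  - `kuwo_toplist`
- The download path decouples "playlist source" from "download source"
- At download time, songs are searched again on the platforms given by `--download-sources`
- Candidate selection strategy:
  - High-confidence matches first
  - Among high-confidence candidates, prefer higher audio quality / larger files
  - When quality is similar, the order of `--download-sources` decides priority
- The default download sources are the same six platforms as the GUI: `qq,kuwo,migu,qianqian,kugou,netease`
- Object storage upload currently consists of two commands: `register-object-backend` and `upload`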
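## Running recommendations

- For the first batch run, start with a single platform, for example `--sources netease`
- Run `sync` and `download` with `--limit` first as a smoke test
- To run only a few specific playlists, prefer `run --playlist-file`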
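## NAS / Linux deployment conventions

### Directory responsibilities

- `/volume4/Music_Cloud/library`
  - Holds only the final music files (download output)
- `/volume4/Music_Cloud/catalogsync`
  - Holds only the catalogsync application and its runtime data (code, script copies, config, database, inputs, logs)

Recommended fixed layout:

```text
/volume4/Music_Cloud/
  library/
  catalogsync/
    app/
    bin/
    config/
    data/
    inputs/
    logs/
```

### Download layout

The default download layout is:

```text
<LIBRARY_DIR>/<platform>/<first_artist>/<filename>
```

`DOWNLOAD_LAYOUT=platform_first_artist` corresponds to this directory structure.
Here `<platform>` means the download source platform that actually served the file, not the playlist source platform.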
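
The layout can be sketched as a path builder. How the first artist is extracted from the `singers` field is not specified in this document, so the comma split and the `Unknown Artist` fallback below are assumptions, as is the function name `target_path`.

```python
from pathlib import PurePosixPath


def target_path(library_dir, platform, singers, filename):
    """Build <LIBRARY_DIR>/<platform>/<first_artist>/<filename>."""
    # Assumption: multiple singers are comma-separated and the first one names the folder.
    first_artist = singers.split(",")[0].strip() or "Unknown Artist"
    return PurePosixPath(library_dir) / platform / first_artist / filename


path = target_path(
    "/volume4/Music_Cloud/library",
    "qq",  # the download source that actually served the file
    "Singer A, Singer B",
    "song-a.flac",
)
fallback = target_path("/volume4/Music_Cloud/library", "kuwo", "", "song-b.mp3")
```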
### Key `catalogsync.env` settings (example)
```dotenv
ROOT_DIR=/volume4/Music_Cloud
APP_HOME=/volume4/Music_Cloud/catalogsync
LIBRARY_DIR=/volume4/Music_Cloud/library
DB_PATH=/volume4/Music_Cloud/catalogsync/data/catalogsync.db
INPUT_DIR=/volume4/Music_Cloud/catalogsync/inputs
LOG_DIR=/volume4/Music_Cloud/catalogsync/logs
ENV_FILE=/volume4/Music_Cloud/catalogsync/config/catalogsync.env
WEB_HOST=127.0.0.1
WEB_PORT=18080
PYTHON_BIN=python3
VENV_DIR=/volume4/Music_Cloud/catalogsync/app/.venv
DOWNLOAD_LAYOUT=platform_first_artist
DOWNLOAD_SOURCES=qq,kuwo,migu,qianqian,kugou,netease
CATALOG_EXPORT_COMMAND=bash /volume4/Music_Cloud/Music_Server/scripts/catalog-export.sh
CATALOG_EXPORT_WORKDIR=/volume4/Music_Cloud/Music_Server
OBJECT_BACKEND_NAME=main-s3
OBJECT_BUCKET=music-bucket
OBJECT_ENDPOINT=https://s3.example.com
OBJECT_REGION=auto
OBJECT_BASE_PREFIX=music
OBJECT_ADDRESSING_STYLE=
OBJECT_PUBLIC_BASE_URL=
OBJECT_CREDENTIAL_ENV_PREFIX=CATALOGSYNC_MAIN_S3
UPLOAD_WORKERS=4
UPLOAD_SOURCES=
UPLOAD_PLAYLIST_IDS=
UPLOAD_LIMIT=
CATALOGSYNC_MAIN_S3_ACCESS_KEY_ID=
CATALOGSYNC_MAIN_S3_SECRET_ACCESS_KEY=
CATALOGSYNC_MAIN_S3_SESSION_TOKEN=
```
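### One-command deployment from Windows to NAS (recommended)

If you develop locally on Windows and deploy to a fixed NAS, use a single command:

```powershell
.\deploy-catalogsync.ps1
```

The command chains these steps:

1. Upload the local `musicdl/catalogsync` to the staging directory on the NAS
2. Overwrite `serve_console.sh` and `deploy_and_restart.sh` on the NAS with the latest versions
3. Run the atomic deployment script on the NAS (backup -> sync -> stop old -> start new -> health check)
4. If the health check or single-instance check fails, roll back to the previous version automatically and return a non-zero exit code

Optional parameter:

```powershell
.\deploy-catalogsync.ps1 -SkipHealthCheck
```

Script locations:

- Repository shortcut: `deploy-catalogsync.ps1`
- NAS deployment trigger: `scripts/catalogsync/deploy_to_nas.ps1`
- NAS deployment executor: `scripts/catalogsync/templates/deploy_and_restart.sh`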
### NAS-side deployment script behavior (`deploy_and_restart.sh`)

Default target paths:

- Code target: `/volume4/Music_Cloud/catalogsync/app/musicdl/catalogsync`
- Staging: `/volume4/Music_Cloud/catalogsync/deploy/staging/catalogsync`
- Backups: `/volume4/Music_Cloud/catalogsync/deploy/backups/catalogsync_YYYYMMDD_HHMMSS`

Stability mechanisms:

- Deployment lock: `/volume4/Music_Cloud/catalogsync/run/deploy.lock`
- Service PID: `/volume4/Music_Cloud/catalogsync/run/serve.pid`
- Health check: `http://127.0.0.1:${WEB_PORT}/dashboard` by default
- Rollback on failure: restores the most recent backup automatically, then restarts and verifies
- Backup retention: keeps the 5 most recent versions by default (adjustable with `--keep-backups`)
### Using `scripts/catalogsync/bootstrap_to_linux.ps1`

Run on the Windows side (it initializes the target machine's directories over `ssh/scp` and distributes the code and script templates):
```powershell
powershell -ExecutionPolicy Bypass -File .\scripts\catalogsync\bootstrap_to_linux.ps1 `
-RemoteHost 192.168.1.10 `
-Port 22 `
-User xiaoming `
-RootDir /volume4/Music_Cloud
```
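Afterwards, on the target machine, copy `catalogsync.env.example` to `catalogsync.env` and adjust it to the machine's actual paths.

### Run `install_runtime.sh` first on the target machine

After the first deployment to the target machine, run this once: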
```bash
bash /volume4/Music_Cloud/catalogsync/bin/install_runtime.sh
```
The script does the following automatically:

- Creates `VENV_DIR` using `PYTHON_BIN`
- Upgrades `pip/setuptools/wheel`
- Generates `/volume4/Music_Cloud/catalogsync/app/requirements.nas.txt` from `/volume4/Music_Cloud/catalogsync/app/requirements.txt`
- Filters out `nodejs-wheel` automatically
- Installs the dependencies that the current `catalogsync` download/upload path needs
- Runs an editable install in `/volume4/Music_Cloud/catalogsync/app` so that `python -m musicdl.catalogsync.cli ...` can be run directly

Logs are written to:
```text
/volume4/Music_Cloud/catalogsync/logs/install_runtime_YYYYMMDD_HHMMSS.log
```
### Using `download_all.sh` / `download_from_file.sh` on the target machine

Prepare the env file on the target machine first:
```bash
cp /volume4/Music_Cloud/catalogsync/config/catalogsync.env.example \
/volume4/Music_Cloud/catalogsync/config/catalogsync.env
```
Full pipeline (equivalent to `musicdl.catalogsync.cli run`):
```bash
bash /volume4/Music_Cloud/catalogsync/bin/download_all.sh --sources netease,qq,kuwo --limit 20
```
Run from a playlist file (skips collect):
```bash
bash /volume4/Music_Cloud/catalogsync/bin/download_from_file.sh \
/volume4/Music_Cloud/catalogsync/inputs/playlists.txt
```
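This script corresponds to the `run --playlist-file` branch (which skips `collect`), so the example does not pass `--sources`.
Both download scripts read `DOWNLOAD_SOURCES` from `catalogsync.env` automatically and pass it to the CLI as `--download-sources ...`.
Both download scripts prefer `VENV_DIR/bin/python` and fall back to `PYTHON_BIN` only if the virtual environment is not ready yet.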
### Catalog export after download (recommended when integrating with the NAS)

To have the read-only `Music_Server` database `catalog_read.db` refreshed automatically after downloads, configure the following in `catalogsync.env`:

- `CATALOG_EXPORT_COMMAND=bash /volume4/Music_Cloud/Music_Server/scripts/catalog-export.sh`
- `CATALOG_EXPORT_WORKDIR=/volume4/Music_Cloud/Music_Server`

Behavior:

- Triggered once each time a `download` stage reaches a terminal state (at most once per stage)
- When `CATALOG_EXPORT_COMMAND` is not configured, the export is marked as `skipped`
- `job_events` records the following events:
  - `catalog_export_started`
  - `catalog_export_skipped`
  - `catalog_export_succeeded`
  - `catalog_export_failed`
### Using `upload_all.sh` on the target machine

The object storage upload script is located at:
```text
/volume4/Music_Cloud/catalogsync/bin/upload_all.sh
```
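It first runs `register-object-backend` using the settings in `catalogsync.env` and then runs `upload`, so after changing the bucket, endpoint, or CDN base URL you do not need to register the backend again by hand.
The simplest way to run it: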
```bash
bash /volume4/Music_Cloud/catalogsync/bin/upload_all.sh
```
To backfill only specific sources or specific playlists, append CLI arguments after the script:
```bash
bash /volume4/Music_Cloud/catalogsync/bin/upload_all.sh --sources netease,qq --limit 200
bash /volume4/Music_Cloud/catalogsync/bin/upload_all.sh --playlist-ids 12,15 --workers 6
```
This script also prefers `VENV_DIR/bin/python` and falls back to `PYTHON_BIN` only if the virtual environment does not exist.
The script depends on the following env settings:

- `OBJECT_BACKEND_NAME`
- `OBJECT_BUCKET`
- `OBJECT_ENDPOINT`
- `OBJECT_REGION`
- `OBJECT_BASE_PREFIX`
- `OBJECT_ADDRESSING_STYLE`
- `OBJECT_PUBLIC_BASE_URL`
- `OBJECT_CREDENTIAL_ENV_PREFIX`
- `${OBJECT_CREDENTIAL_ENV_PREFIX}_ACCESS_KEY_ID`
- `${OBJECT_CREDENTIAL_ENV_PREFIX}_SECRET_ACCESS_KEY`
- `${OBJECT_CREDENTIAL_ENV_PREFIX}_SESSION_TOKEN`
- `UPLOAD_WORKERS`
- `UPLOAD_SOURCES`
- `UPLOAD_PLAYLIST_IDS`
- `UPLOAD_LIMIT`

Logs are written to:
```text
/volume4/Music_Cloud/catalogsync/logs/upload_all_YYYYMMDD_HHMMSS.log
```
### Using `serve_console.sh` on the target machine

The ops console script is located at:
```text
/volume4/Music_Cloud/catalogsync/bin/serve_console.sh
```
Example:
```bash
bash /volume4/Music_Cloud/catalogsync/bin/serve_console.sh
```
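The script reads `DB_PATH`, `ENV_FILE`, `WEB_HOST`, and `WEB_PORT` from `catalogsync.env` automatically and passes them through to `musicdl.catalogsync.cli serve`.
Single-instance protection:

- Lock directory: `/volume4/Music_Cloud/catalogsync/run/serve.lock`
- PID file: `/volume4/Music_Cloud/catalogsync/run/serve.pid`
- If a live instance already exists, the script fails and exits immediately to avoid a duplicate start

Logs are written to: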
```text
/volume4/Music_Cloud/catalogsync/logs/serve_console_YYYYMMDD_HHMMSS.log
```
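### Notes on installing dependencies on the NAS

The system Python on this NAS is `Python 3.8`, and the NAS lacks the native build toolchain that `nodejs-wheel-binaries` requires.
The current `catalogsync` path for download, object storage upload, and `netease/qq/kuwo` does not depend on `nodejs-wheel`, so use `install_runtime.sh` as described above. It generates and installs the filtered `requirements.nas.txt` automatically, so there is no need to run `grep` by hand.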
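## `/playlists` playlist pool management page (selective download)

`/playlists` now serves as the playlist pool management page. It targets the operations workflow "filter playlists -> select targets -> run a batch action".

Supported filter parameters:

- `platform`
- `pool_kind`
- `status`
- `keyword`
- `wanted_only`
- `page_size`

The list supports selecting rows on the current page, with select-all and clear for the whole page.

Four batch actions are currently supported:

- Download the selected, already-synced playlists
- Sync and then download the selected playlists
- Add to the wanted list
- Remove from the wanted list

Playlist status semantics:

- Unsynced: the playlist has not finished syncing
- Not downloaded: synced, but songs are still pending download
- Downloading: a download task is in progress
- Partially downloaded: some songs are on disk, others are not finished
- Downloaded: every song in the playlist meets the "downloaded" criterion

"Downloaded" criterion: for a given `song_id`, the song counts as downloaded as soon as an `active` `local_fs` file exists locally.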
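
The status semantics above can be sketched as a pure function over per-playlist counters. The status strings match those used by the bulk export API later in this document; the function name, its parameters, and the precedence between `downloading` and the other states are assumptions of this sketch.

```python
def playlist_status(synced, total_songs, downloaded_songs, running_songs):
    """Derive a playlist status from counts of total, downloaded, and in-progress songs."""
    if not synced:
        return "unsynced"
    if running_songs > 0:
        return "downloading"          # assumption: an active download wins over partial
    if total_songs > 0 and downloaded_songs >= total_songs:
        return "downloaded"
    if downloaded_songs > 0:
        return "partial"
    return "not_downloaded"


# downloaded_songs counts songs with at least one active local_fs file.
examples = {
    "a": playlist_status(False, 0, 0, 0),
    "b": playlist_status(True, 20, 0, 0),
    "c": playlist_status(True, 20, 5, 3),
    "d": playlist_status(True, 20, 5, 0),
    "e": playlist_status(True, 20, 20, 0),
}
```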
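Page actions still reuse the existing job system:

- Download the selected, already-synced playlists -> `download_only`
- Sync and then download the selected playlists -> `sync_download`
- What distinguishes these two kinds of jobs is `playlist_scope.playlist_ids`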
## Operations Console Update
As of `2026-04-16`, the operations console behavior has changed in three important ways:
1. `musicdl-catalogsync serve` now starts the web console together with an embedded ops runner.
2. `/dashboard` now exposes a create-job form plus live job/download summary, active workers, and running items.
3. `/jobs/{id}` now exposes a command form for `pause`, `resume`, `cancel`, `retry_item`, and `force_retry_item`, together with worker and running-item detail.
Current job type to stage mapping:
- `catalog_sync`: `collect -> sync -> download`
- `collect_only`: `collect`
- `sync_only`: `sync`
- `sync_download`: `sync -> download`
- `download_only`: `download`
- `upload_only`: `upload`
- `download_upload`: `download -> upload`
Collector behavior update:
- playlist square collection now paginates for `netease` and `kuwo`
- `qq` playlist-square failures are isolated so other sources continue
This means the console is no longer read-only: creating a job from the dashboard should enqueue work that the embedded runner can execute without starting a second process.
As of `2026-04-17`, the deployed NAS console was verified again and the following operational fixes are also part of the live behavior:
1. `/dashboard` now exposes `Quick Launch`, `Active Job`, `Running Songs`, and `Playlist Coverage`, and the `Active Job` / `Recent Jobs` blocks now provide direct `pause` / `resume` / `cancel` buttons, so the operator can both observe progress and control the current queue from one page.
2. `/jobs/{id}` now exposes direct action buttons for `pause`, `resume`, `cancel`, `retry_item`, and `force_retry_item` instead of only relying on a generic command dropdown.
3. Collect-stage workers now emit page-level progress text such as `page N: +X, total Y`, which makes it clear whether collection is advancing or stuck.
Collector and runtime hardening in this round:
- `QQCollector` playlist-square requests now send the required `Referer` and `Origin` headers, which restored non-zero QQ playlist-square collection on NAS.
- `netease` and `kuwo` playlist-square pagination now stops when the upstream explicitly reports `has_more = false` or when a page is entirely duplicate playlists, preventing long-running repeated-page loops.
- NAS runtime compatibility was extended for Python `3.8` by removing runtime-evaluated built-in generic aliases from the serve import path.
- SQLite connections now enable `busy_timeout` and `journal_mode=WAL`, which prevents the operations console from intermittently failing with `database is locked` while the embedded runner is writing progress.
Observed NAS verification snapshot after redeploying these fixes:
- `GET http://192.168.5.43:18080/dashboard` returned `200 OK` with the new controls visible.
- Ten consecutive requests to `/api/dashboard` returned `200 OK` while `collect_only` job `3` was running.
- Total playlists on NAS grew from the earlier `811` baseline to `1441` during live verification.
- QQ playlists on NAS grew from `25` to `629+` during the same verification window, confirming that QQ playlist-square collection was no longer stuck at zero.
## 2026-04-17 NAS Restart Note
During the `2026-04-17` restart verification on NAS, the web console and the embedded runner did not recover equally:
- the web process restarted and continued serving `/dashboard`, `/jobs/{id}`, and `/api/dashboard`
- a stale duplicate `serve` process had to be removed manually before the NAS converged back to a single web instance
- after duplicate cleanup, the embedded runner still failed to advance queued work even though manual `OpsRepository` / `OpsRunner` recovery calls succeeded against the same database
Operational workaround used on NAS:
- web console kept running as `/volume4/Music_Cloud/catalogsync/app/.venv/bin/python -m musicdl.catalogsync.cli serve ...`
- a separate emergency runner process was started to execute `OpsRunner.run_forever()` against the same SQLite database
- verification after the workaround showed `job 5` resume correctly and `downloaded_songs` increase from `82` to `85`
Temporary NAS-only emergency runner details:
- PID: `17516`
- log: `/volume4/Music_Cloud/catalogsync/logs/ops_runner_20260417_101958.log`
Resolution on `2026-04-17 10:29`:
- `musicdl/catalogsync/ops/web.py` now supervises the embedded runner thread and automatically restarts it after transient exceptions instead of letting the web process continue without background execution
- local regression coverage now includes an embedded-runner recovery test that forces one loop failure and verifies that queued work is still completed after automatic restart
- NAS was redeployed with this fix and the temporary emergency runner was removed
- after restart, NAS converged back to a single live `serve` process on port `18080`
- the restarted web process recovered the interrupted download job back to `paused`, accepted a `resume` command, and then continued downloading without any standalone runner
- live verification on NAS showed `downloaded_songs` increase from `100` to `102` under the single embedded-runner setup
## 2026-04-17 Progress Visibility Update
- the playlists page now renders a `Progress` column with `downloaded / total`, a percentage bar, and the current running-song count
- the job detail page now renders a `Playlist Progress` table for playlist-scoped jobs
- job playlist progress is derived from playlist-song links, active local files, and download-stage job items of the current job
- songs that were already present locally before the job started still count as completed progress for that playlist
- empty boolean-like filters such as `/playlists?wanted_only=` and `/api/playlists?wanted_only=` are accepted and treated as `false`
## 2026-04-17 Non-Music Skip + Task Center Tree
- download stage now classifies QQ toplist fallback entries (`remote_song_id` starts with `qqtop_` or metadata marks `qq_toplist_fallback`) as `skipped` instead of `failed`
- skipped toplist entries are annotated with `非音乐资源(有声榜条目)` ("non-music resource (audio chart entry)")
- new API: `GET /api/jobs/{job_id}/playlists/{playlist_id}/songs` returns per-song progress rows for one playlist inside one job
- dashboard Task Center removed the old `Open` jump link and keeps operations inline
- task detail now supports hierarchical expansion:
- task -> playlist progress rows
- playlist row -> lazy-loaded song progress rows
- song rows explicitly show the `非音乐资源` ("non-music resource") tag when matched
## 2026-04-17 Stable Task Tree Refresh
- dashboard `Task Center` no longer renders the embedded `Summary / Stages / Workers / Running Items` detail tables
- the dashboard now presents one stable tree:
- task
- playlist
- song
- task lifecycle transitions such as `paused`, `completed`, `completed_with_errors`, and `canceled` keep the same task node visible in Task Center instead of making the row disappear immediately
- live refresh updates task nodes in place so expanded tasks and expanded playlists can remain open across refresh cycles
## 2026-04-18 Dashboard Maintenance: Local Duplicate Scan / Dedupe
- `Dashboard` now includes a `Maintenance` card for local duplicate inspection.
- `Scan Duplicate Local Copies` calls `GET /api/maintenance/local-duplicates`.
- `Run Local Dedupe` calls `POST /api/maintenance/local-duplicates/dedupe`.
- The scan groups active local duplicate rows by `(file_asset_id, backend_id)`.
- Keep rule priority:
1. existing file wins
2. non-`(1)` / non-`(2)` canonical locator wins
3. shorter locator wins
4. smaller `file_locations.id` wins
- Dedupe execution updates references before inactivation:
- repoint `upload_tasks.source_location_id`
- repoint `job_items.file_location_id`
- mark duplicate `file_locations.status = 'inactive'`
- delete duplicate local files when they still exist on disk
- refresh `song_backend_presence`
- Safety guard:
- dedupe is rejected with `409` while any `job_runs.status = 'running'` or `job_items.status = 'running'`
- this avoids colliding with active download / upload execution
- The dashboard renders results inline and does not jump away from the page.
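
The keep-rule priority above can be expressed as a single sort key. This is a sketch under assumptions: rows are plain dictionaries with `id`, `locator`, and a precomputed `exists` flag, and the copy-suffix pattern matches any `(N)` right before the extension rather than only `(1)` and `(2)`. It is not the code in `ops/maintenance.py`.

```python
import re

# Matches a "(N)" copy suffix at the end of the file name, before the extension if any.
_COPY_SUFFIX = re.compile(r"\(\d+\)(?=\.[^./]+$|$)")


def keep_key(row):
    """Lower tuples win: existing file, canonical name, shorter locator, smaller id."""
    return (
        not row["exists"],                        # 1. existing file wins
        bool(_COPY_SUFFIX.search(row["locator"])),  # 2. no copy suffix wins
        len(row["locator"]),                      # 3. shorter locator wins
        row["id"],                                # 4. smaller id wins
    )


def choose_keeper(rows):
    """Pick the row to keep within one (file_asset_id, backend_id) group."""
    return min(rows, key=keep_key)


group = [
    {"id": 1, "locator": "qq/Singer A/song-a (1).flac", "exists": True},
    {"id": 2, "locator": "qq/Singer A/song-a.flac", "exists": True},
    {"id": 3, "locator": "qq/A/song-a.flac", "exists": False},
]
keeper = choose_keeper(group)
duplicates = [r["id"] for r in group if r["id"] != keeper["id"]]
```

Encoding the rules as one tuple keeps the priority order explicit and makes ties fall through to the next rule automatically.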
## 2026-04-18 Playlist Export Pipeline Update
- `playlists/` directory generation is no longer triggered by `sync`.
- `CatalogSyncService.sync_playlist_row()` now only handles playlist-song linking and play-count backfill.
- Playlist export artifacts are refreshed from the download side for scoped playlist jobs:
- `download_only`
- `sync_download`
- The runner refreshes export folders when an individual scoped playlist finishes downloading, instead of waiting for the whole download job to finish.
- On runner restart / recovery, scoped download stages also backfill export folders for playlists whose items were already completed before the restart.
- Stage-final export refresh is still kept as the last safety net, including the `0`-pending-items case where all files already existed locally.
- Existing single-playlist export remains available:
- `GET /api/playlists/{playlist_id}/export-folder`
- it refreshes the folder from current database state only
- it does not auto-download missing songs
- New bulk export API:
- `POST /api/playlists/export`
- routes selected playlists by current state
- `downloaded` -> export immediately
- `unsynced` -> create `sync_download` job
- `not_downloaded` / `partial` / `downloading` -> create `download_only` job
- Playlists page adds `Export Selected Playlists`:
- already-downloaded playlists can be exported without re-downloading songs
- not-yet-synced or not-yet-downloaded playlists are queued into the appropriate job automatically
## 2026-04-19 Local ZIP Export + Adaptive Download
- Playlists page no longer shows a standalone `Sync Then Download` button.
- `Download Selected Playlists` is now adaptive:
- `unsynced` playlists are routed to `sync_download`
- already-synced but incomplete playlists are routed to `download_only`
- mixed selections may create both a `download_job` and a `sync_download_job`
- already-downloaded playlists can be skipped without forcing a re-download
- Export semantics now mean browser download to the operator's local machine:
- modal `Export` downloads `GET /api/playlists/{playlist_id}/export.zip`
- list `Export Selected` calls `POST /api/playlists/export-zip`
- when every selected playlist is ready, the API returns `status=ready` plus `download_url`
- when any selected playlist is not ready, the API returns `status=queued` plus job details instead of a partial ZIP
- Prepared bundle downloads are served by:
- `GET /api/exports/bundles/{bundle_name}.zip`
- `GET /api/playlists/{playlist_id}/export-folder` remains available as an internal server-side folder refresh / inspection endpoint, but it is no longer the user-facing export action.