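# Catalog Sync CLI

`catalogsync` is a collection, sync, and download pipeline that runs independently of the GUI. Its goal is to extract the "Playlist Square" and "Toplist" sources from the Discover page and turn them into a command-line tool that can run as an automated batch job.

Supported platforms fall into two tiers:

- Playlist collection sources:
  - `netease`
  - `qq`
  - `kuwo`
- Download resolution sources:
  - `qq`
  - `kuwo`
  - `migu`
  - `qianqian`
  - `kugou`
  - `netease`

Design priorities:

- Persist "playlist pool -> playlist -> song" to SQLite
- Derive and update "artist pool -> artist -> song" while syncing playlist songs
- Deduplicate downloads by song primary key and valid file location
- Keep one file-location abstraction for local disk, cloud drives, and object storage

## Document Guide

This file covers four kinds of information:

- Project purpose and run pipeline (`collect -> sync -> download -> upload`)
- Code architecture (CLI, collection and sync, download and upload, Ops Console)
- Database design (business entities, file mapping, job orchestration)
- Server deployment and operations (NAS/Linux directory conventions, scripts, logs, restarts)

If you are taking over the project for the first time, read in this order:

1. Start with "Code Architecture" and "Database Design Overview"
2. Then read "Commands" and "NAS / Linux Deployment Conventions"
3. Finish with the Ops Console update notes at the end

## Code Architecture

The system has four layers: command entry, domain services, repository layer, and background job console. The core goal is to split collect / sync / download / upload into a pipeline that is composable, recoverable, and observable.

### Directory Layout and Responsibility Boundaries

```text
musicdl/catalogsync/
  cli.py          # Command entry and argument parsing; assembles the Application
  runtime.py      # Runtime path/port/directory conventions (env -> config)
  db.py           # SQLite schema, indexes, add-column migrations, connection settings
  models.py       # Domain models and metadata extraction
  repository.py   # Catalog-side reads and writes (playlists/songs/files/stats)
  services.py     # Collection + sync orchestration (playlist -> songs -> artists)
  downloader.py   # Download planning + multi-source candidate selection + write to disk + deduplicated insert
  resolver.py     # Cross-platform candidate search, scoring, fallback strategy
  uploader.py     # Object storage backfill upload, upload queue consumption, presence refresh
  collectors/     # Playlist source collectors (NetEase/QQ/Kuwo)
  ops/
    web.py          # FastAPI pages and API (dashboard/playlists/jobs)
    repository.py   # Ops-side job repository (job/stage/item/worker)
    runner.py       # Background scheduler (lanes, preemption, recovery, convergence)
    executors.py    # Stage executors (collect/sync/download/upload)
    maintenance.py  # Local duplicate file scan and dedupe
    config.py       # Env config read / write-back / revision snapshots
    models.py       # Job/Stage/Item status enums and data structures
```

Boundary constraints:

- `services.py` handles business orchestration only; it does not do UI or job scheduling
- `repository.py` handles SQL reads and writes and is unaware of download/upload strategy
- `ops/runner.py` decides how jobs run; it does not define collection or download rules
- `ops/executors.py` decides how a single item executes and updates status through compare-and-swap (CAS)

### Two Main Pipelines

1. CLI direct-run pipeline (offline batch)
   - `cli.py` -> `CatalogSyncApplication`
   - `collect/sync/download/run/upload` call `services/downloader/uploader` directly
   - Suited to scripted batch jobs or one-off commands
2. Ops job pipeline (visual, with pause and resume)
   - `ops/web.py` accepts job creation (`/api/jobs`, `/api/playlists/*`)
   - `ops/runner.py` splits each job into stages by `job_type` and schedules them by polling
   - `ops/executors.py` executes item by item and writes back to the `job_*` tables
   - The front end reads live status through the dashboard API + SSE

### Key Call Sequence (Using a "Sync Then Download" Job as the Example)

1. 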
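   The web console creates a `sync_download` job and writes it to `job_runs`
2. The runner creates `job_stages`: `sync -> download`
3. The sync stage generates one `job_items` row per playlist and runs `services.sync_playlist_row`
4. The download stage generates `job_items` for songs and runs `downloader.download_song_row`
5. After a download succeeds, it writes `file_assets` + `file_locations` and refreshes the playlist status aggregate
6. The runner totals the stage/item counts and moves `job_runs` to `completed/completed_with_errors`

### Job Concurrency and Recovery Model

- Dual-lane scheduling:
  - `download` lane: exclusive, with limited concurrency to avoid disk and network contention
  - `general` lane: used for collect/sync/upload, supports higher concurrency
- Concurrency within a stage:
  - Controlled by the worker count (download defaults to 10, configurable)
  - Worker heartbeat / speed / current item are written to `job_workers`
- Checkpoint recovery:
  - On startup the runner scans for recoverable jobs
  - Items that were running are marked `interrupted`
  - Recoverable items are re-queued, and the job status moves to `paused` or continues as `running`
- Command control:
  - pause/resume/cancel/retry are written to `job_commands`
  - The runner consumes commands centrally to avoid concurrent write conflicts

### Extension Points (Look Here When Adding Platforms or Storage)

- New playlist source: implement it under `collectors/*` and register it in `services.py`
- New download source: extend the candidate search and scoring strategy in `resolver.py`
- New storage backend: extend the backend adapters and locator semantics in `uploader.py`
- New job type: add the stage sequence and display name in `ops/jobdefs.py`
- New operations capability: add the API in `ops/web.py` and persist the state model in `ops/repository.py`

### Job Status Transition Diagram (JobStatus)

The diagram below corresponds to `JobStatus` in `ops/models.py`:

```mermaid
stateDiagram-v2
    [*] --> queued
    queued --> running: runner claim
    queued --> canceled: cancel
    running --> pause_requested: pause command
    pause_requested --> paused: all running items drained
    paused --> running: resume command
    running --> completed: all items success/skipped
    running --> completed_with_errors: some items failed
    running --> failed: unrecoverable error
    running --> canceled: cancel
    pause_requested --> canceled: cancel
    completed --> [*]
    completed_with_errors --> [*]
    failed --> [*]
    canceled --> [*]
```

## Commands

Initialize the database:

```bash
musicdl-catalogsync init-db --db D:\catalogsync\catalogsync.db --library-root E:\MusicLibrary
```

Collect the "Playlist Square" and "Toplist" sources:

```bash
musicdl-catalogsync collect --db D:\catalogsync\catalogsync.db --sources netease,qq,kuwo
```

Sync playlists already in the database:

```bash
musicdl-catalogsync sync --db D:\catalogsync\catalogsync.db --sources netease,qq,kuwo --limit 20
```

Download pending songs:

```bash
musicdl-catalogsync download --db ^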
  D:\catalogsync\catalogsync.db --library-root E:\MusicLibrary --sources netease,qq,kuwo --download-sources qq,kuwo,migu,qianqian,kugou,netease --limit 20 --workers 10
```

Run the whole default pipeline in one go:

```bash
musicdl-catalogsync run --db D:\catalogsync\catalogsync.db --library-root E:\MusicLibrary --sources netease,qq,kuwo --download-sources qq,kuwo,migu,qianqian,kugou,netease --limit 20 --workers 10
```

Run directly from a playlist file:

```bash
musicdl-catalogsync run --db D:\catalogsync\catalogsync.db --library-root E:\MusicLibrary --playlist-file D:\catalogsync\playlists.txt --download-sources qq,kuwo,migu,qianqian,kugou,netease --workers 10
```

Register an object storage backend:

```bash
musicdl-catalogsync register-object-backend ^
  --db D:\catalogsync\catalogsync.db ^
  --backend main-s3 ^
  --bucket music-bucket ^
  --endpoint https://s3.example.com ^
  --region auto ^
  --base-prefix music ^
  --credential-env-prefix CATALOGSYNC_MAIN_S3
```

Backfill locally downloaded files to object storage:

```bash
musicdl-catalogsync upload --db D:\catalogsync\catalogsync.db --backend main-s3 --workers 4
musicdl-catalogsync upload --db D:\catalogsync\catalogsync.db --backend main-s3 --sources netease,qq --limit 200
musicdl-catalogsync upload --db D:\catalogsync\catalogsync.db --backend main-s3 --playlist-ids 12,15 --workers 4
```

Start the ops web console (FastAPI + uvicorn):

```bash
musicdl-catalogsync serve --db D:\catalogsync\catalogsync.db --env-file D:\catalogsync\catalogsync.env --host 127.0.0.1 --port 18080
```

You can also launch the CLI as a module:

```bash
python -m musicdl.catalogsync.cli --help
```

## `--playlist-file` Behavior

When `--playlist-file` is passed, `run` takes a narrow branch:

1. Skip `collect`
2. Read the playlist URLs from the file
3. Parse and deduplicate them
4. Write them to the database as a `manual_file` pool
5. Sync only these playlists
6. Download only the songs linked to these playlists

Without `--playlist-file`, `run` keeps the original `collect -> sync -> download` default behavior.

## `--sources` and `--download-sources`

- `--sources`
  - Controls which canonical-platform songs are collected / synced / filtered
  - Currently used mainly for the three playlist sources `netease`, `qq`, and `kuwo`
- `--download-sources`
  - Controls which platforms are searched again for the song and resolved to a direct link before downloading
  - Defaults to the same six platforms as the GUI: `qq,kuwo,migu,qianqian,kugou,netease`

The download stage actually behaves as follows:

1. Read the song name, artists, and original snapshot from the canonical song in the database
2. 
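   Search the `--download-sources` allowlist again for downloadable candidates
3. Rank candidates by "match confidence -> audio quality / file size -> your configured source order"
4. Pick the best candidate, then actually download it

This means:

- A song from a NetEase Cloud Music playlist is not necessarily downloaded from NetEase
- When the original platform's official direct link has expired or is unavailable, the other download sources are searched automatically for a candidate with the same title and artist
- As long as the match is trustworthy, the higher-quality candidate is preferred

Starting with this version, the `sync` stage also no longer requires the original platform to provide a downloadable direct link on the spot:

- As long as the playlist API still returns song metadata, `sync` writes the full song snapshot to the database
- These songs are stored as "deferred resolution" snapshots; a usable direct link is resolved at download time through `--download-sources`
- This prevents songs from being dropped early at ingestion because of copyright restrictions or temporarily expired direct links on NetEase / QQ / Kuwo

### File Format

Each line of the `--playlist-file` input holds one entry, and three kinds of lines are supported:

```text
# comment line
https://music.163.com/#/playlist?id=17745989905
qq,https://y.qq.com/n/ryqq/playlist/7707261125
https://y.qq.com/n/ryqq/toplist/26
https://www.kuwo.cn/rankList?bangId=16
```

Rules:

- Blank lines are ignored
- Lines starting with `#` are ignored
- `platform,URL` is supported
- A bare URL is also supported; the platform is then detected automatically
- Duplicate playlists within the same file are deduplicated automatically
- URL platforms currently auto-detected are `netease`, `qq`, and `kuwo`

### Supported URL Types

- NetEase regular playlist: `https://music.163.com/#/playlist?id=...`
- QQ regular playlist: `https://y.qq.com/n/ryqq/playlist/...`
- QQ toplist: `https://y.qq.com/n/ryqq/toplist/...`
- Kuwo regular playlist: `https://www.kuwo.cn/playlist_detail/...`
- Kuwo toplist: `https://www.kuwo.cn/rankList?bangId=...`

## Database Design Overview

The database is SQLite, with this connection policy:

- `PRAGMA journal_mode=WAL`
- `PRAGMA busy_timeout=30000`
- `PRAGMA synchronous=NORMAL`
- All tables are defined centrally in `db.py`, and add-column migrations run at initialization

Design goals:

1. Strong deduplication: only one entity is kept per remote ID per platform
2. Loose coupling: a song's logical asset is separate from its physical storage location
3. Recoverable: the job state machine is persisted and supports resuming after a restart
4. Observable: jobs, workers, logs, and events are all stored in tables

### Table Domains (Four Domains)

1. Catalog entity domain (Catalog Core)
   - `playlist_pools`: playlist source pools (square / toplist / manual_file)
   - `playlists`: playlist records (platform, remote ID, policy, play count)
   - `songs`: song records (platform, remote ID, name, artists, format, snapshot)
   - `artists`: artist records (normalized name + platform dimension)
2. Association domain (Association)
   - `pool_playlists`: many-to-many between pools and playlists
   - `playlist_songs`: many-to-many between playlists and songs (includes position)
   - `pool_artists`: many-to-many between pools and artists
   - `artist_songs`: many-to-many between artists and songs
3. File asset domain (Storage)
   - `storage_backends`: storage backend definitions (local_fs/object_storage/cloud_drive)
   - `file_assets`: logical file versions of a song (quality/format/size/checksum)
   - `file_locations`: physical locations (backend + locator + status + primary copy)
   - `song_backend_presence`: aggregated presence of a song on a backend (speeds up queries)
   - `download_tasks` / `upload_tasks`: download and upload queues
4. 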
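   Job orchestration domain (Ops)
   - `job_runs`: job overview (type, status, scope, config snapshot)
   - `job_stages`: stage (collect/sync/download/upload) counters
   - `job_items`: smallest execution unit (playlist item / song item / file item)
   - `job_workers`: live worker status, throughput, speed
   - `job_commands`: pause/resume/cancel/retry command queue
   - `job_events` / `job_logs`: audit events and execution logs
   - `config_revisions`: env config revision snapshots and rollback records

### Deduplication and Consistency Constraints (Core)

Unique keys (hard constraints):

- `playlists(platform, remote_playlist_id)`
- `songs(platform, remote_song_id)`
- `file_locations(file_asset_id, backend_id, locator)`
- `upload_tasks(file_asset_id, target_backend_id, target_locator)`
- `job_items(job_stage_id, item_key)`

Consistency rules (business layer):

- One `song_id` can map to multiple `file_asset` rows (different quality/format)
- One `file_asset` can have multiple `file_location` rows (local + cloud)
- `song_backend_presence` is derived from `file_locations` and is not a source of truth
- A playlist's "downloaded / not downloaded / partial" status is computed by aggregating `playlist_songs + active local file_locations`

### High-Frequency Read/Write Paths (Troubleshooting Focus)

1. Collection stage
   - Writes: `playlist_pools`, `playlists`, `pool_playlists`
   - Typical issue: the pool contains playlists but `playlists.collected_song_count` was not backfilled
2. Sync stage
   - Writes: `songs`, `playlist_songs`, `artists`, `pool_artists`, `artist_songs`
   - Typical issue: the playlist is synced but its song count is 0 (distinguish "source returned empty" from "parsing failed")
3. Download stage
   - Writes: `file_assets`, `file_locations`, `download_tasks`
   - Reads: `songs` snapshot + download source candidates
   - Typical issue: duplicate files written to disk, `(1)/(2)` filename bloat
4. Upload stage
   - Writes: `upload_tasks`, `file_locations`, `song_backend_presence`
   - Typical issue: the upload succeeded but presence was not refreshed, so the UI still shows the song as not uploaded
5. 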
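   Job center
   - Writes: `job_runs/stages/items/workers/commands/events/logs`
   - Reads: dashboard summaries, doing/done tree, worker speed

### Migration and Backward Compatibility

- `initialize_database()` does the following on every startup:
  - Runs `CREATE TABLE IF NOT EXISTS`
  - Runs the necessary `ALTER TABLE ADD COLUMN` statements (such as `play_count` and the worker throughput fields)
- This lets an old database upgrade in place without hand-run SQL migration scripts
- Back up `catalogsync.db` before upgrading, especially before changing the dedup strategy or running bulk maintenance

### Core ER Diagram

```mermaid
erDiagram
    PLAYLIST_POOLS ||--o{ POOL_PLAYLISTS : links
    PLAYLISTS ||--o{ POOL_PLAYLISTS : belongs_to
    PLAYLISTS ||--o{ PLAYLIST_SONGS : contains
    SONGS ||--o{ PLAYLIST_SONGS : appears_in
    ARTIST_POOLS ||--o{ POOL_ARTISTS : links
    ARTISTS ||--o{ POOL_ARTISTS : belongs_to
    ARTISTS ||--o{ ARTIST_SONGS : sings
    SONGS ||--o{ ARTIST_SONGS : performed_by
    SONGS ||--o{ FILE_ASSETS : has_versions
    FILE_ASSETS ||--o{ FILE_LOCATIONS : stored_at
    STORAGE_BACKENDS ||--o{ FILE_LOCATIONS : hosts
    SONGS ||--o{ SONG_BACKEND_PRESENCE : has_presence
    STORAGE_BACKENDS ||--o{ SONG_BACKEND_PRESENCE : summarized_on
    JOB_RUNS ||--o{ JOB_STAGES : has
    JOB_STAGES ||--o{ JOB_ITEMS : has
    JOB_RUNS ||--o{ JOB_WORKERS : owns
    JOB_RUNS ||--o{ JOB_COMMANDS : receives
    JOB_RUNS ||--o{ JOB_EVENTS : emits
    JOB_RUNS ||--o{ JOB_LOGS : writes
```

## Data Tables

### Playlist Pool -> Playlist -> Song

- `playlist_pools`
  - Platform source pools, such as `playlist_square`, `toplist`, `manual_file`
- `playlists`
  - A specific playlist or toplist
- `pool_playlists`
  - Mapping between playlist pools and playlists
- `songs`
  - Song master table; the unique key is `(platform, remote_song_id)`
- `playlist_songs`
  - Mapping between playlists and songs

The song master table stores these core fields:

- `remote_song_id`
- `name`
- `singers`
- `ext`
- `file_size_bytes`
- `quality_label`
- `metadata_json`
  - Contains the `SongInfo` snapshot, which can later be restored directly so the original downloader can continue the download

### Derived Artist Pool + Lazy Backfill

- `artist_pools`
  - Artist pools derived from playlist pools
- `artists`
  - Artist master table
- `pool_artists`
  - Mapping between artist pools and artists
- `artist_songs`
  - Mapping between artists and songs

Syncing playlist songs also updates the artist pool, which satisfies the requirement that "when a playlist pool is updated, the artist pool is updated too."

## Download Dedup and File Mapping

### Logical Asset Layer

- `file_assets`
  - Represents "one particular file version of a song"
  - The usual dimensions are `song_id + quality_label + ext + file_size_bytes`
  - `ext / quality_label / file_size_bytes` follow the audio file that was actually downloaded and are not tied to the canonical platform

### Physical Location Layer

- `storage_backends`
  - Describes a storage backend
  - `local_fs` is currently implemented
  - Can later be extended to cloud drives and object storage
- `file_locations`
  - 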
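    Records where a given file asset actually resides right now

One way to think about it:

- `file_assets` answers "what is this file"
- `file_locations` answers "where is this file stored now"

If a song is first downloaded locally and later uploaded to a cloud drive or object storage, the same `file_asset` can be reused; you only append or update the corresponding `file_location`.

### Upload Queue and Backend Reachability

- `song_backend_presence`
  - Derived summary table indicating whether a song already has an active file on a given backend
  - Often used to quickly answer "has this song already been backfilled to main-s3"
- `upload_tasks`
  - Upload task queue table
  - One task = one local `file_asset` uploaded to one target backend/key
  - Statuses include `pending`, `uploading`, `succeeded`, `failed`, `skipped`

Note the distinction:

- `file_locations` remains the source of truth
- `song_backend_presence` exists only for fast queries and does not replace `file_locations`

## Behavior When Disk Space Runs Low

The downloader checks the remaining space in the target directory first.

If space is insufficient, it prompts for a new download directory (the prompt reads "Insufficient disk space, enter a new download directory to continue:"):

```text
磁盘空间不足,请输入新的下载目录继续:
```

The new directory may be on a different drive. The program will:

- Download songs to the new directory
- Automatically create or reuse a `storage_backend` for the new directory
- Write the new file location back to `file_locations`

With `--workers > 1`, the global prompt still appears only once. After a successful switch, download tasks that have not started yet all continue in the new directory.

## Object Storage Upload

A first version of object storage upload is implemented; backend semantics follow S3-compatible behavior.

### Key Conventions

1. After a local download completes, a local `file_location` is written first
2. After a successful upload, a remote `file_location` is added for the same `file_asset`
3. The local file is kept, and the local `file_location.is_primary = 1`
4. The remote object storage record has `is_primary = 0`
5. By default the database state is trusted; no extra `HEAD` check is made against the remote object
6. If a song has multiple active local file versions, all of them are queued for upload

### key / locator Rules

The object storage key mirrors the local relative path.

For example:

- Local locator: `qq/Singer A/song-a.flac`
- Backend `base_prefix`: `music`
- Remote locator: `music/qq/Singer A/song-a.flac`

Benefits of this approach:

- The directory structure matches the local one
- Later migration or rebuilding of the mapping is simpler
- The same locator semantics are easier to reuse when uploading to a CDN / cloud drive

### Backend Configuration and Secret Model

Non-sensitive settings live in `storage_backends.config_json`, for example:

- `endpoint`
- `region`
- `base_prefix`
- `addressing_style`
- `public_base_url`
- `credential_env_prefix`

Sensitive secrets are never stored in the database; they come only from environment variables.

For example, with `credential_env_prefix = CATALOGSYNC_MAIN_S3`:

```dotenv
CATALOGSYNC_MAIN_S3_ACCESS_KEY_ID=your-access-key
CATALOGSYNC_MAIN_S3_SECRET_ACCESS_KEY=your-secret-key
CATALOGSYNC_MAIN_S3_SESSION_TOKEN=optional-session-token
```

If `public_base_url` is configured, the derivable `public_url` is also written back to the remote `file_location` after a successful upload.

### Default Behavior of the upload Command

`upload` does three things by default:

1. Find the local active files still missing on the target backend
2. Deduplicate, then insert into or reuse `upload_tasks`
3. 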
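   Run the uploads with a bounded pool of concurrent workers and write the results back to the database

The run can be narrowed along these dimensions:

- `--sources`
- `--playlist-ids`
- `--limit`
- `--workers`

Recommended defaults:

- Download: `--workers 10`
- Upload: `--workers 4`

### What the Database Updates After Upload

- `file_locations`
  - Adds or updates the remote object location
- `song_backend_presence`
  - Refreshes the song's active summary on the target backend
- `upload_tasks`
  - Records the queued, running, succeeded, or failed state of this task

## Cloud Drive Compatibility (Reserved)

Recommended conventions:

- Local files:
  - `backend_type=local_fs`
  - `locator` stores the relative path
- Object storage:
  - `backend_type=object_storage`
  - `container_name` stores the bucket
  - `locator` stores the key
- Cloud-drive backends:
  - `backend_type=cloud_drive`
  - `remote_file_id` stores the platform file ID
  - `locator` stores the remote directory path

## Current Implementation Notes

- The collection layer already covers the "Playlist Square" and "Toplist" sources on the GUI Discover page
- Special toplist parsing is supported for:
  - `netease_toplist`
  - `qq_toplist`
  - `kuwo_toplist`
- The download pipeline decouples "playlist source" from "download source"
- Downloads search again for the song on the platforms given by `--download-sources`
- The candidate selection strategy is:
  - High-confidence matches first
  - Among high-confidence candidates, prefer higher quality / larger files
  - When quality is similar, the order of `--download-sources` decides priority
- The default download sources are the same six platforms as the GUI: `qq,kuwo,migu,qianqian,kugou,netease`
- Object storage upload currently provides two commands: `register-object-backend` + `upload`

## Running Recommendations

- For a first batch run, start with a single platform, for example `--sources netease`
- Run `sync` and `download` with `--limit` first as a smoke test
- To run only a few specific playlists, prefer `run --playlist-file`

## NAS / Linux Deployment Conventions

### Directory Responsibilities

- `/volume4/Music_Cloud/library`
  - Holds only final music files (download output)
- `/volume4/Music_Cloud/catalogsync`
  - Holds only the catalogsync application and runtime data (code, script copies, config, database, inputs, logs)

Recommended fixed structure:

```text
/volume4/Music_Cloud/
  library/
  catalogsync/
    app/
    bin/
    config/
    data/
    inputs/
    logs/
```

### Download Layout

The default download layout is:

```text
///
```

`DOWNLOAD_LAYOUT=platform_first_artist` corresponds to the directory structure above.

The platform segment in this layout is the download source platform that actually served the file, not the playlist source platform.

### `catalogsync.env` Key Settings Example

```dotenv
ROOT_DIR=/volume4/Music_Cloud
APP_HOME=/volume4/Music_Cloud/catalogsync
LIBRARY_DIR=/volume4/Music_Cloud/library
DB_PATH=/volume4/Music_Cloud/catalogsync/data/catalogsync.db
INPUT_DIR=/volume4/Music_Cloud/catalogsync/inputs
LOG_DIR=/volume4/Music_Cloud/catalogsync/logs
ENV_FILE=/volume4/Music_Cloud/catalogsync/config/catalogsync.env
WEB_HOST=127.0.0.1
WEB_PORT=18080
PYTHON_BIN=python3
VENV_DIR=/volume4/Music_Cloud/catalogsync/app/.venv
DOWNLOAD_LAYOUT=platform_first_artist
DOWNLOAD_SOURCES=qq,kuwo,migu,qianqian,kugou,netease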
CATALOG_EXPORT_COMMAND=bash /volume4/Music_Cloud/Music_Server/scripts/catalog-export.sh
CATALOG_EXPORT_WORKDIR=/volume4/Music_Cloud/Music_Server
OBJECT_BACKEND_NAME=main-s3
OBJECT_BUCKET=music-bucket
OBJECT_ENDPOINT=https://s3.example.com
OBJECT_REGION=auto
OBJECT_BASE_PREFIX=music
OBJECT_ADDRESSING_STYLE=
OBJECT_PUBLIC_BASE_URL=
OBJECT_CREDENTIAL_ENV_PREFIX=CATALOGSYNC_MAIN_S3
UPLOAD_WORKERS=4
UPLOAD_SOURCES=
UPLOAD_PLAYLIST_IDS=
UPLOAD_LIMIT=
CATALOGSYNC_MAIN_S3_ACCESS_KEY_ID=
CATALOGSYNC_MAIN_S3_SECRET_ACCESS_KEY=
CATALOGSYNC_MAIN_S3_SESSION_TOKEN=
```

### One-Command Deployment from Windows to NAS (Recommended)

If you develop locally on Windows and deploy to a fixed NAS, a single command is recommended:

```powershell
.\deploy-catalogsync.ps1
```

The command chains these steps:

1. Upload the local `musicdl/catalogsync` to the NAS staging directory
2. Overwrite the NAS copies with the latest `serve_console.sh` and `deploy_and_restart.sh`
3. Run the atomic deployment script on the NAS (backup -> sync -> stop old -> start new -> health check)
4. If the health check or the single-instance verification fails, automatically roll back to the previous version and return non-zero

Optional parameter:

```powershell
.\deploy-catalogsync.ps1 -SkipHealthCheck
```

Script locations:

- Repository shortcut entry: `deploy-catalogsync.ps1`
- NAS deployment trigger: `scripts/catalogsync/deploy_to_nas.ps1`
- NAS deployment executor: `scripts/catalogsync/templates/deploy_and_restart.sh`

### NAS-Side Deployment Script Behavior (`deploy_and_restart.sh`)

Default target paths:

- Code target: `/volume4/Music_Cloud/catalogsync/app/musicdl/catalogsync`
- staging: `/volume4/Music_Cloud/catalogsync/deploy/staging/catalogsync`
- Backups: `/volume4/Music_Cloud/catalogsync/deploy/backups/catalogsync_YYYYMMDD_HHMMSS`

Stability mechanisms:

- Deployment lock: `/volume4/Music_Cloud/catalogsync/run/deploy.lock`
- Service PID: `/volume4/Music_Cloud/catalogsync/run/serve.pid`
- Health check: defaults to `http://127.0.0.1:${WEB_PORT}/dashboard`
- Rollback on failure: automatically restores the most recent backup, then restarts and verifies
- Backup retention: keeps the 5 most recent versions by default (adjustable with `--keep-backups`)

### `scripts/catalogsync/bootstrap_to_linux.ps1` Usage

Run on the Windows side (it initializes the target machine's directories over `ssh/scp` and distributes the code and script templates):

```powershell
powershell -ExecutionPolicy Bypass -File .\scripts\catalogsync\bootstrap_to_linux.ps1 `
  -RemoteHost 192.168.1.10 `
  -Port 22 `
  -User xiaoming `
  -RootDir /volume4/Music_Cloud
```

Afterwards, on the target machine, copy `catalogsync.env.example` to `catalogsync.env` and adjust it to the machine's actual paths.

### Run `install_runtime.sh` on the Target Machine First
After the first deployment to the target machine, run this once first:

```bash
bash /volume4/Music_Cloud/catalogsync/bin/install_runtime.sh
```

The script automatically:

- Creates `VENV_DIR` using `PYTHON_BIN`
- Upgrades `pip/setuptools/wheel`
- Generates `/volume4/Music_Cloud/catalogsync/app/requirements.nas.txt` from `/volume4/Music_Cloud/catalogsync/app/requirements.txt`
- Filters out `nodejs-wheel`
- Installs the dependencies needed by the current `catalogsync` download/upload pipeline
- Runs an editable install on `/volume4/Music_Cloud/catalogsync/app`, so that `python -m musicdl.catalogsync.cli ...` runs directly

Logs are written to:

```text
/volume4/Music_Cloud/catalogsync/logs/install_runtime_YYYYMMDD_HHMMSS.log
```

### Target Machine `download_all.sh` / `download_from_file.sh` Usage

Before running on the target machine, prepare the env file:

```bash
cp /volume4/Music_Cloud/catalogsync/config/catalogsync.env.example \
  /volume4/Music_Cloud/catalogsync/config/catalogsync.env
```

Full flow (equivalent to `musicdl.catalogsync.cli run`):

```bash
bash /volume4/Music_Cloud/catalogsync/bin/download_all.sh --sources netease,qq,kuwo --limit 20
```

Run from a playlist file (skips collect):

```bash
bash /volume4/Music_Cloud/catalogsync/bin/download_from_file.sh \
  /volume4/Music_Cloud/catalogsync/inputs/playlists.txt
```

This script maps to the `run --playlist-file` branch (which skips `collect`), so the example no longer carries `--sources`.

Both download scripts automatically read `DOWNLOAD_SOURCES` from `catalogsync.env` and pass it to the CLI as `--download-sources ...`.

Both download scripts prefer `VENV_DIR/bin/python`; they fall back to `PYTHON_BIN` only when the virtual environment is not ready yet.

### Post-Download Catalog Export (Recommended for NAS Integration)

So that the `Music_Server` read-only database `catalog_read.db` refreshes automatically after downloads, configure the following in `catalogsync.env`:

- `CATALOG_EXPORT_COMMAND=bash /volume4/Music_Cloud/Music_Server/scripts/catalog-export.sh`
- `CATALOG_EXPORT_WORKDIR=/volume4/Music_Cloud/Music_Server`

Behavior:

- Triggered once each time a `download` stage reaches a terminal state (only once per stage)
- When `CATALOG_EXPORT_COMMAND` is not configured, the export is marked `skipped`
- `job_events` records the following events:
  - `catalog_export_started`
  - `catalog_export_skipped`
  - `catalog_export_succeeded`
  - `catalog_export_failed`

### Target Machine `upload_all.sh` Usage

The object storage upload script is located at:

```text
/volume4/Music_Cloud/catalogsync/bin/upload_all.sh
```

It first runs `register-object-backend` automatically from the settings in `catalogsync.env`, then runs `upload`. After changing the bucket, endpoint, or CDN base URL, you therefore do not need to re-register the backend manually.
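
As a rough illustration of the register-then-upload order described above, the sketch below builds the two CLI invocations from `catalogsync.env` style values. It is only a sketch under stated assumptions: `build_upload_commands` is a hypothetical helper, not part of the shipped script, and it uses only flags and env names that appear elsewhere in this document. How the real script passes `OBJECT_ADDRESSING_STYLE` and `OBJECT_PUBLIC_BASE_URL` is not documented here, so both are left out.

```python
from typing import Dict, List


def build_upload_commands(env: Dict[str, str]) -> List[List[str]]:
    """Return [register_cmd, upload_cmd] built from env-style settings.

    Hypothetical helper for illustration; the real upload_all.sh may differ.
    """
    base = ["musicdl-catalogsync"]
    # Step 1: (re)register the backend so bucket/endpoint changes take effect.
    register = base + [
        "register-object-backend",
        "--db", env["DB_PATH"],
        "--backend", env["OBJECT_BACKEND_NAME"],
        "--bucket", env["OBJECT_BUCKET"],
        "--endpoint", env["OBJECT_ENDPOINT"],
        "--region", env["OBJECT_REGION"],
        "--base-prefix", env["OBJECT_BASE_PREFIX"],
        "--credential-env-prefix", env["OBJECT_CREDENTIAL_ENV_PREFIX"],
    ]
    # Step 2: run the upload against the same backend.
    upload = base + [
        "upload",
        "--db", env["DB_PATH"],
        "--backend", env["OBJECT_BACKEND_NAME"],
    ]
    # Optional flags are appended only when the env value is non-empty.
    optional = [
        ("UPLOAD_WORKERS", "--workers"),
        ("UPLOAD_SOURCES", "--sources"),
        ("UPLOAD_PLAYLIST_IDS", "--playlist-ids"),
        ("UPLOAD_LIMIT", "--limit"),
    ]
    for key, flag in optional:
        value = env.get(key, "").strip()
        if value:
            upload += [flag, value]
    return [register, upload]
```

Returning argument lists rather than one shell string keeps values with spaces intact if the commands are later handed to `subprocess.run`.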
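The simplest way to run it:

```bash
bash /volume4/Music_Cloud/catalogsync/bin/upload_all.sh
```

To backfill only specific sources or specific playlists, append CLI arguments directly after the script:

```bash
bash /volume4/Music_Cloud/catalogsync/bin/upload_all.sh --sources netease,qq --limit 200
bash /volume4/Music_Cloud/catalogsync/bin/upload_all.sh --playlist-ids 12,15 --workers 6
```

This script also prefers `VENV_DIR/bin/python`; it falls back to `PYTHON_BIN` only when the virtual environment does not exist.

The script depends on these env variables:

- `OBJECT_BACKEND_NAME`
- `OBJECT_BUCKET`
- `OBJECT_ENDPOINT`
- `OBJECT_REGION`
- `OBJECT_BASE_PREFIX`
- `OBJECT_ADDRESSING_STYLE`
- `OBJECT_PUBLIC_BASE_URL`
- `OBJECT_CREDENTIAL_ENV_PREFIX`
- `${OBJECT_CREDENTIAL_ENV_PREFIX}_ACCESS_KEY_ID`
- `${OBJECT_CREDENTIAL_ENV_PREFIX}_SECRET_ACCESS_KEY`
- `${OBJECT_CREDENTIAL_ENV_PREFIX}_SESSION_TOKEN`
- `UPLOAD_WORKERS`
- `UPLOAD_SOURCES`
- `UPLOAD_PLAYLIST_IDS`
- `UPLOAD_LIMIT`

Logs are written to:

```text
/volume4/Music_Cloud/catalogsync/logs/upload_all_YYYYMMDD_HHMMSS.log
```

### Target Machine `serve_console.sh` Usage

The ops console script is located at:

```text
/volume4/Music_Cloud/catalogsync/bin/serve_console.sh
```

Example run:

```bash
bash /volume4/Music_Cloud/catalogsync/bin/serve_console.sh
```

The script automatically reads `DB_PATH`, `ENV_FILE`, `WEB_HOST`, and `WEB_PORT` from `catalogsync.env` and passes them through to `musicdl.catalogsync.cli serve`.

Single-instance protection:

- Lock directory: `/volume4/Music_Cloud/catalogsync/run/serve.lock`
- PID file: `/volume4/Music_Cloud/catalogsync/run/serve.pid`
- If an active instance already exists, the script fails and exits immediately to avoid a duplicate start

Logs are written to:

```text
/volume4/Music_Cloud/catalogsync/logs/serve_console_YYYYMMDD_HHMMSS.log
```

### NAS Dependency Installation Notes

The system Python on this NAS is `Python 3.8`, and the NAS lacks the native build toolchain that `nodejs-wheel-binaries` requires.

The current `catalogsync` download, object storage upload, and `netease/qq/kuwo` pipeline does not depend on `nodejs-wheel`, so use the `install_runtime.sh` described above directly. It generates and installs the filtered `requirements.nas.txt` automatically, so no manual `grep` is needed.

## `/playlists` Playlist Pool Management Page (Selective Download)

`/playlists` now serves as the playlist pool management page, aimed at the operations flow "filter playlists -> select targets -> run a bulk action."

Supported filter parameters:

- `platform`
- `pool_kind`
- `status`
- `keyword`
- `wanted_only`
- `page_size`

The list supports checking rows on the current page and offers select-all / clear for the whole page.

Four bulk actions are currently supported:

- Download selected synced playlists
- Sync then download selected playlists
- Add to the to-download list
- Remove from the to-download list

Playlist status semantics:

- Unsynced: the playlist has not finished syncing yet
- Not downloaded: synced, but songs are still pending download
- Downloading: a download job is in progress
- 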
  Partially downloaded: some songs are on disk, others remain unfinished
- Downloaded: every song in the playlist meets the "downloaded" criterion

"Downloaded" criterion: for a given `song_id`, the song counts as downloaded as soon as an `active` `local_fs` file exists locally.

Page actions ultimately still reuse the existing job system:

- Download selected synced playlists -> `download_only`
- Sync then download selected playlists -> `sync_download`
- The difference for the two job types above lies in `playlist_scope.playlist_ids`

## Operations Console Update

As of `2026-04-16`, the operations console behavior has changed in three important ways:

1. `musicdl-catalogsync serve` now starts the web console together with an embedded ops runner.
2. `/dashboard` now exposes a create-job form plus live job/download summary, active workers, and running items.
3. `/jobs/{id}` now exposes a command form for `pause`, `resume`, `cancel`, `retry_item`, and `force_retry_item`, together with worker and running-item detail.

Current job type to stage mapping:

- `catalog_sync`: `collect -> sync -> download`
- `collect_only`: `collect`
- `sync_only`: `sync`
- `sync_download`: `sync -> download`
- `download_only`: `download`
- `upload_only`: `upload`
- `download_upload`: `download -> upload`

Collector behavior update:

- playlist square collection now paginates for `netease` and `kuwo`
- `qq` playlist-square failures are isolated so other sources continue

This means the console is no longer read-only: creating a job from the dashboard should enqueue work that the embedded runner can execute without starting a second process.

As of `2026-04-17`, the deployed NAS console was verified again and the following operational fixes are also part of the live behavior:

1. `/dashboard` now exposes `Quick Launch`, `Active Job`, `Running Songs`, and `Playlist Coverage`, and the `Active Job` / `Recent Jobs` blocks now provide direct `pause` / `resume` / `cancel` buttons, so the operator can both observe progress and control the current queue from one page.
2. `/jobs/{id}` now exposes direct action buttons for `pause`, `resume`, `cancel`, `retry_item`, and `force_retry_item` instead of only relying on a generic command dropdown.
3. 
   Collect-stage workers now emit page-level progress text such as `page N: +X, total Y`, which makes it clear whether collection is advancing or stuck.

Collector and runtime hardening in this round:

- `QQCollector` playlist-square requests now send the required `Referer` and `Origin` headers, which restored non-zero QQ playlist-square collection on NAS.
- `netease` and `kuwo` playlist-square pagination now stops when the upstream explicitly reports `has_more = false` or when a page is entirely duplicate playlists, preventing long-running repeated-page loops.
- NAS runtime compatibility was extended for Python `3.8` by removing runtime-evaluated built-in generic aliases from the serve import path.
- SQLite connections now enable `busy_timeout` and `journal_mode=WAL`, which prevents the operations console from intermittently failing with `database is locked` while the embedded runner is writing progress.

Observed NAS verification snapshot after redeploying these fixes:

- `GET http://192.168.5.43:18080/dashboard` returned `200 OK` with the new controls visible.
- Ten consecutive requests to `/api/dashboard` returned `200 OK` while `collect_only` job `3` was running.
- Total playlists on NAS grew from the earlier `811` baseline to `1441` during live verification.
- QQ playlists on NAS grew from `25` to `629+` during the same verification window, confirming that QQ playlist-square collection was no longer stuck at zero.
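
The pagination stop rule above (stop when the upstream reports `has_more = false`, or when a page contains only already-seen playlists) can be sketched roughly as follows. This is an illustrative sketch, not the collectors' actual code: `collect_playlist_ids`, the `fetch_page` callable, and the `max_pages` cap are hypothetical names introduced here. Only the two stop conditions and the `page N: +X, total Y` progress text come from the notes above.

```python
from typing import Callable, List, Set, Tuple

# Hypothetical fetcher type: takes a 1-based page number and returns
# (playlist_ids_on_that_page, has_more).
PageFetcher = Callable[[int], Tuple[List[str], bool]]


def collect_playlist_ids(
    fetch_page: PageFetcher, max_pages: int = 1000
) -> Tuple[List[str], List[str]]:
    """Collect unique playlist IDs page by page.

    Returns (ordered_unique_ids, progress_lines). `max_pages` is an assumed
    extra safety cap, not something the original notes describe.
    """
    seen: Set[str] = set()
    ordered: List[str] = []
    progress: List[str] = []
    for page in range(1, max_pages + 1):
        ids, has_more = fetch_page(page)
        added = 0
        for playlist_id in ids:
            if playlist_id not in seen:
                seen.add(playlist_id)
                ordered.append(playlist_id)
                added += 1
        progress.append(f"page {page}: +{added}, total {len(ordered)}")
        # Stop condition 1: the upstream explicitly says there is no more data.
        if not has_more:
            break
        # Stop condition 2: the page added nothing new (all duplicates, or
        # empty), which would otherwise loop over repeated pages.
        if added == 0:
            break
    return ordered, progress


# Usage with a fake upstream that never reports has_more = False.
_pages = {1: ["a", "b"], 2: ["b", "c"], 3: ["a", "c"]}
ids, lines = collect_playlist_ids(lambda n: (_pages.get(n, []), True))
```

In this usage the third page contains only duplicates, so collection stops there even though the fake upstream keeps claiming more pages exist.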

## 2026-04-17 NAS Restart Note

During the `2026-04-17` restart verification on NAS, the web console and the embedded runner did not recover equally:

- the web process restarted and continued serving `/dashboard`, `/jobs/{id}`, and `/api/dashboard`
- a stale duplicate `serve` process had to be removed manually before the NAS converged back to a single web instance
- after duplicate cleanup, the embedded runner still failed to advance queued work even though manual `OpsRepository` / `OpsRunner` recovery calls succeeded against the same database

Operational workaround used on NAS:

- web console kept running as `/volume4/Music_Cloud/catalogsync/app/.venv/bin/python -m musicdl.catalogsync.cli serve ...`
- a separate emergency runner process was started to execute `OpsRunner.run_forever()` against the same SQLite database
- verification after the workaround showed `job 5` resume correctly and `downloaded_songs` increase from `82` to `85`

Temporary NAS-only emergency runner details:

- PID: `17516`
- log: `/volume4/Music_Cloud/catalogsync/logs/ops_runner_20260417_101958.log`

Resolution on `2026-04-17 10:29`:

- `musicdl/catalogsync/ops/web.py` now supervises the embedded runner thread and automatically restarts it after transient exceptions instead of letting the web process continue without background execution
- local regression coverage now includes an embedded-runner recovery test that forces one loop failure and verifies that queued work is still completed after automatic restart
- NAS was redeployed with this fix and the temporary emergency runner was removed
- after restart, NAS converged back to a single live `serve` process on port `18080`
- the restarted web process recovered the interrupted download job back to `paused`, accepted a `resume` command, and then continued downloading without any standalone runner
- live verification on NAS showed `downloaded_songs` increase from `100` to `102` under the single embedded-runner setup

## 2026-04-17 Progress Visibility Update

- the playlists page now renders a `Progress` column with `downloaded / total`, a percentage bar, and the current running-song count
- the job detail page now renders a `Playlist Progress` table for playlist-scoped jobs
- job playlist progress is derived from playlist-song links, active local files, and download-stage job items of the current job
- songs that were already present locally before the job started still count as completed progress for that playlist
- empty boolean-like filters such as `/playlists?wanted_only=` and `/api/playlists?wanted_only=` are accepted and treated as `false`

## 2026-04-17 Non-Music Skip + Task Center Tree

- download stage now classifies QQ toplist fallback entries (`remote_song_id` starts with `qqtop_` or metadata marks `qq_toplist_fallback`) as `skipped` instead of `failed`
- skipped toplist entries are annotated with `非音乐资源(有声榜条目)`
- new API: `GET /api/jobs/{job_id}/playlists/{playlist_id}/songs` returns per-song progress rows for one playlist inside one job
- dashboard Task Center removed the old `Open` jump link and keeps operations inline
- task detail now supports hierarchical expansion:
  - task -> playlist progress rows
  - playlist row -> lazy-loaded song progress rows
- song rows explicitly show `非音乐资源` tag when matched

## 2026-04-17 Stable Task Tree Refresh

- dashboard `Task Center` no longer renders the embedded `Summary / Stages / Workers / Running Items` detail tables
- the dashboard now presents one stable tree:
  - task
  - playlist
  - song
- task lifecycle transitions such as `paused`, `completed`, `completed_with_errors`, and `canceled` keep the same task node visible in Task Center instead of making the row disappear immediately
- live refresh updates task nodes in place so expanded tasks and expanded playlists can remain open across refresh cycles

## 2026-04-18 Dashboard Maintenance: Local Duplicate Scan / Dedupe

- `Dashboard` now includes a `Maintenance` card for local duplicate inspection.
- `Scan Duplicate Local Copies` calls `GET /api/maintenance/local-duplicates`.
- `Run Local Dedupe` calls `POST /api/maintenance/local-duplicates/dedupe`.
- The scan groups active local duplicate rows by `(file_asset_id, backend_id)`.
- Keep rule priority:
  1. existing file wins
  2. non-`(1)` / non-`(2)` canonical locator wins
  3. shorter locator wins
  4. smaller `file_locations.id` wins
- Dedupe execution updates references before inactivation:
  - repoint `upload_tasks.source_location_id`
  - repoint `job_items.file_location_id`
  - mark duplicate `file_locations.status = 'inactive'`
  - delete duplicate local files when they still exist on disk
  - refresh `song_backend_presence`
- Safety guard:
  - dedupe is rejected with `409` while any `job_runs.status = 'running'` or `job_items.status = 'running'`
  - this avoids colliding with active download / upload execution
- The dashboard renders results inline and does not jump away from the page.

## 2026-04-18 Playlist Export Pipeline Update

- `playlists/` directory generation is no longer triggered by `sync`.
- `CatalogSyncService.sync_playlist_row()` now only handles playlist-song linking and play-count backfill.
- Playlist export artifacts are refreshed from the download side for scoped playlist jobs:
  - `download_only`
  - `sync_download`
- The runner refreshes export folders when an individual scoped playlist finishes downloading, instead of waiting for the whole download job to finish.
- On runner restart / recovery, scoped download stages also backfill export folders for playlists whose items were already completed before the restart.
- Stage-final export refresh is still kept as the last safety net, including the `0`-pending-items case where all files already existed locally.
- Existing single-playlist export remains available:
  - `GET /api/playlists/{playlist_id}/export-folder`
  - it refreshes the folder from current database state only
  - it does not auto-download missing songs
- New bulk export API:
  - `POST /api/playlists/export`
  - routes selected playlists by current state
  - `downloaded` -> export immediately
  - `unsynced` -> create `sync_download` job
  - `not_downloaded` / `partial` / `downloading` -> create `download_only` job
- Playlists page adds `Export Selected Playlists`:
  - already-downloaded playlists can be exported without re-downloading songs
  - not-yet-synced or not-yet-downloaded playlists are queued into the appropriate job automatically

## 2026-04-19 Local ZIP Export + Adaptive Download

- Playlists page no longer shows a standalone `Sync Then Download` button.
- `Download Selected Playlists` is now adaptive:
  - `unsynced` playlists are routed to `sync_download`
  - already-synced but incomplete playlists are routed to `download_only`
  - mixed selections may create both a `download_job` and a `sync_download_job`
  - already-downloaded playlists can be skipped without forcing a re-download
- Export semantics now mean browser download to the operator's local machine:
  - modal `Export` downloads `GET /api/playlists/{playlist_id}/export.zip`
  - list `Export Selected` calls `POST /api/playlists/export-zip`
  - when every selected playlist is ready, the API returns `status=ready` plus `download_url`
  - when any selected playlist is not ready, the API returns `status=queued` plus job details instead of a partial ZIP
- Prepared bundle downloads are served by:
  - `GET /api/exports/bundles/{bundle_name}.zip`
- `GET /api/playlists/{playlist_id}/export-folder` remains available as an internal server-side folder refresh / inspection endpoint, but it is no longer the user-facing export action.
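
The state-based routing described in these notes can be sketched as below. This is a minimal sketch, not the server's implementation: `route_playlists` and `export_zip_status` are hypothetical helper names, and the state strings are borrowed from the bulk export routing list above on the assumption that the adaptive download action uses the same states.

```python
from typing import Dict, Iterable, List, Tuple

# States that the notes above route to a `download_only` job.
DOWNLOAD_ONLY_STATES = {"not_downloaded", "partial", "downloading"}


def route_playlists(playlists: Iterable[Tuple[int, str]]) -> Dict[str, List[int]]:
    """Group (playlist_id, state) pairs by the action they need.

    Hypothetical helper: keys are the two job types named in this document
    plus "ready" for playlists that need no job at all.
    """
    routed: Dict[str, List[int]] = {"sync_download": [], "download_only": [], "ready": []}
    for playlist_id, state in playlists:
        if state == "unsynced":
            routed["sync_download"].append(playlist_id)
        elif state in DOWNLOAD_ONLY_STATES:
            routed["download_only"].append(playlist_id)
        elif state == "downloaded":
            routed["ready"].append(playlist_id)
        else:
            raise ValueError(f"unknown playlist state: {state!r}")
    return routed


def export_zip_status(routed: Dict[str, List[int]]) -> str:
    """Return "ready" only when every selected playlist is already downloaded."""
    needs_job = routed["sync_download"] or routed["download_only"]
    return "queued" if needs_job else "ready"


# A mixed selection creates work for both job types, so no ZIP is returned yet.
selection = [(12, "downloaded"), (15, "unsynced"), (18, "partial")]
routed = route_playlists(selection)
status = export_zip_status(routed)
```

Raising on an unknown state is a choice made for this sketch so that a new state cannot be silently treated as ready.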