Catalog Sync CLI

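catalogsync is a collection, sync, and download pipeline that runs independently of the GUI. Its goal is to extract the "Playlist Square" and "Toplist" sources from the Discover page and turn them into a command-line tool that can run unattended batch jobs.

Supported platforms fall into two tiers:

  • Playlist collection sources: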
    • netease
    • qq
    • kuwo
  • Download resolution sources:
    • qq
    • kuwo
    • migu
    • qianqian
    • kugou
    • netease

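Design priorities:

  • Persist the "playlist pool -> playlist -> song" hierarchy in SQLite
  • While syncing playlist songs, derive and update "artist pool -> artist -> song"
  • Deduplicate downloads by song primary key and active file location
  • Keep one file-location abstraction for local disk, cloud drives, and object storage

Document Guide

This document covers four kinds of information:

  • Project purpose and run pipeline (collect -> sync -> download -> upload)
  • Code architecture (CLI, collection and sync, download and upload, Ops Console)
  • Database design (business entities, file mapping, job orchestration)
  • Server deployment and operations (NAS/Linux directory conventions, scripts, logs, restarts)

If you are new to the project, read in this order:

  1. Start with "Code Architecture" and "Database Design Overview"
  2. Then read "Commands" and "NAS / Linux Deployment Conventions"
  3. Finish with the Ops Console update notes at the end

Code Architecture

The system has four layers: command entry point, domain services, repository layer, and background job console. The core goal is to split collect / sync / download / upload into a pipeline whose steps are composable, recoverable, and observable.

Directory Layout and Responsibilities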

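musicdl/catalogsync/
  cli.py                 # Command entry point and argument parsing; assembles the Application
  runtime.py             # Runtime path/port/directory conventions (env -> config)
  db.py                  # SQLite schema, indexes, add-column migrations, connection settings
  models.py              # Domain models and metadata extraction
  repository.py          # Catalog-side reads and writes (playlists/songs/files/stats)
  services.py            # Collection + sync orchestration (playlist -> songs -> artists)
  downloader.py          # Download planning + multi-source candidate selection + write to disk + dedup bookkeeping
  resolver.py            # Cross-platform candidate search, scoring, fallback strategy
  uploader.py            # Object-storage backfill upload, upload queue consumption, presence refresh
  collectors/            # Playlist source collectors (NetEase/QQ/Kuwo)
  ops/
    web.py               # FastAPI pages and API (dashboard/playlists/jobs)
    repository.py        # Ops-side job repository (job/stage/item/worker)
    runner.py            # Background scheduler (lanes, preemption, recovery, convergence)
    executors.py         # Stage executors (collect/sync/download/upload)
    maintenance.py       # Local duplicate-file inspection and dedup
    config.py            # Environment config read/write-back/revision snapshots
    models.py            # Job/Stage/Item status enums and data structures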

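Boundary constraints:

  • services.py handles business orchestration only; it does no UI work or job scheduling
  • repository.py handles SQL reads and writes and is unaware of download/upload strategy
  • ops/runner.py decides how jobs run; it does not define collection or download rules
  • ops/executors.py decides how a single item executes and updates status via CAS (compare-and-swap)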
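
The CAS status update mentioned for ops/executors.py can be pictured with a conditional UPDATE. This is a minimal sketch, not the project's code: the function name cas_update_status and the job_items columns id and status are assumptions made for illustration.

```python
import sqlite3


def cas_update_status(conn, item_id, expected, new):
    """Move an item from `expected` to `new` only if nobody changed it first.

    Hypothetical helper: the real job_items schema may use other column names.
    """
    cur = conn.execute(
        "UPDATE job_items SET status = ? WHERE id = ? AND status = ?",
        (new, item_id, expected),
    )
    conn.commit()
    # rowcount is 1 when the row still held the expected status, 0 otherwise.
    return cur.rowcount == 1
```

A worker that gets False back knows another worker or a command already changed the item, and can skip it instead of overwriting the newer state.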

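Two Main Pipelines

  1. Direct CLI pipeline (offline batch processing)
    • cli.py -> CatalogSyncApplication
    • collect/sync/download/run/upload call services/downloader/uploader directly
    • Suited to scripted batch jobs or one-off commands
  2. Ops job pipeline (visual, with pause and resume)
    • ops/web.py accepts job creation (/api/jobs, /api/playlists/*)
    • ops/runner.py splits each job into stages by job_type and schedules them by polling
    • ops/executors.py executes item by item and writes back to the job_* tables
    • The frontend reads live status through the dashboard API + SSE

Key Call Sequence (using a "sync then download" job as the example)

  1. The web side creates a sync_download job and writes it to job_runs
  2. The runner creates job_stages (sync -> download)
  3. The sync stage generates one job_items row per playlist and runs services.sync_playlist_row
  4. The download stage generates job_items for songs and runs downloader.download_song_row
  5. A successful download writes file_assets + file_locations and refreshes the playlist status aggregate
  6. The runner totals stage/item counts and updates job_runs (completed/completed_with_errors)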

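Job Concurrency and Recovery Model

  • Two-lane scheduling:
    • download lane: exclusive, with limited concurrency to avoid disk and network contention
    • general lane: used for collect/sync/upload, allows higher concurrency
  • Concurrency within a stage:
    • Controlled by the worker count (download defaults to 10, configurable)
    • Worker heartbeat, speed, and current item are written to job_workers
  • Resume after interruption:
    • On startup the runner scans for recoverable jobs
    • Items that were running are marked interrupted
    • Recoverable items are re-enqueued, and the job moves to paused or stays running
  • Command control:
    • pause/resume/cancel/retry are written to job_commands
    • The runner consumes all commands in one place to avoid concurrent write conflicts

Extension Points (look here when adding a platform or a storage backend)

  • New playlist source: implement it under collectors/* and register it in services.py
  • New download source: extend the candidate search and scoring strategy in resolver.py
  • New storage backend: extend the backend adapters and locator semantics in uploader.py
  • New job type: add the stage sequence and display name in ops/jobdefs.py
  • New ops capability: add the API in ops/web.py and persist the state model in ops/repository.py

Job Status Flow (JobStatus)

The diagram below corresponds to JobStatus in ops/models.py.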

stateDiagram-v2
    [*] --> queued
    queued --> running: runner claim
    queued --> canceled: cancel
    running --> pause_requested: pause command
    pause_requested --> paused: all running items drained
    paused --> running: resume command
    running --> completed: all items success/skipped
    running --> completed_with_errors: some items failed
    running --> failed: unrecoverable error
    running --> canceled: cancel
    pause_requested --> canceled: cancel
    completed --> [*]
    completed_with_errors --> [*]
    failed --> [*]
    canceled --> [*]
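
The edges of the diagram above can also be written as a transition table. The sketch below is illustrative only: the status values come from the diagram, but the enum layout, the ALLOWED table, and can_transition are assumptions and may not match ops/models.py.

```python
from enum import Enum


class JobStatus(str, Enum):
    QUEUED = "queued"
    RUNNING = "running"
    PAUSE_REQUESTED = "pause_requested"
    PAUSED = "paused"
    COMPLETED = "completed"
    COMPLETED_WITH_ERRORS = "completed_with_errors"
    FAILED = "failed"
    CANCELED = "canceled"


# Allowed transitions, copied edge by edge from the state diagram.
ALLOWED = {
    JobStatus.QUEUED: {JobStatus.RUNNING, JobStatus.CANCELED},
    JobStatus.RUNNING: {
        JobStatus.PAUSE_REQUESTED,
        JobStatus.COMPLETED,
        JobStatus.COMPLETED_WITH_ERRORS,
        JobStatus.FAILED,
        JobStatus.CANCELED,
    },
    JobStatus.PAUSE_REQUESTED: {JobStatus.PAUSED, JobStatus.CANCELED},
    JobStatus.PAUSED: {JobStatus.RUNNING},
}


def can_transition(src, dst):
    """Return True when the diagram has an edge src -> dst (hypothetical helper)."""
    return dst in ALLOWED.get(src, set())
```

Terminal states (completed, completed_with_errors, failed, canceled) have no entry in the table, so every transition out of them is rejected.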

Commands

Initialize the database:

musicdl-catalogsync init-db --db D:\catalogsync\catalogsync.db --library-root E:\MusicLibrary

Collect the "Playlist Square" and "Toplist" sources:

musicdl-catalogsync collect --db D:\catalogsync\catalogsync.db --sources netease,qq,kuwo

Sync playlists that are already in the database:

musicdl-catalogsync sync --db D:\catalogsync\catalogsync.db --sources netease,qq,kuwo --limit 20

Download songs that are pending download:

musicdl-catalogsync download --db D:\catalogsync\catalogsync.db --library-root E:\MusicLibrary --sources netease,qq,kuwo --download-sources qq,kuwo,migu,qianqian,kugou,netease --limit 20 --workers 10

Run the whole default pipeline in one go:

musicdl-catalogsync run --db D:\catalogsync\catalogsync.db --library-root E:\MusicLibrary --sources netease,qq,kuwo --download-sources qq,kuwo,migu,qianqian,kugou,netease --limit 20 --workers 10

Run directly from a playlist file:

musicdl-catalogsync run --db D:\catalogsync\catalogsync.db --library-root E:\MusicLibrary --playlist-file D:\catalogsync\playlists.txt --download-sources qq,kuwo,migu,qianqian,kugou,netease --workers 10

Register an object storage backend:

musicdl-catalogsync register-object-backend ^
  --db D:\catalogsync\catalogsync.db ^
  --backend main-s3 ^
  --bucket music-bucket ^
  --endpoint https://s3.example.com ^
  --region auto ^
  --base-prefix music ^
  --credential-env-prefix CATALOGSYNC_MAIN_S3

Backfill already-downloaded local files to object storage:

musicdl-catalogsync upload --db D:\catalogsync\catalogsync.db --backend main-s3 --workers 4
musicdl-catalogsync upload --db D:\catalogsync\catalogsync.db --backend main-s3 --sources netease,qq --limit 200
musicdl-catalogsync upload --db D:\catalogsync\catalogsync.db --backend main-s3 --playlist-ids 12,15 --workers 4

Start the ops web console (FastAPI + uvicorn):

musicdl-catalogsync serve --db D:\catalogsync\catalogsync.db --env-file D:\catalogsync\catalogsync.env --host 127.0.0.1 --port 18080

You can also launch it as a module:

python -m musicdl.catalogsync.cli --help

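--playlist-file Behavior

When --playlist-file is passed, run takes a narrow branch:

  1. Skip collect
  2. Read the playlist URLs from the file
  3. Parse and deduplicate them
  4. Write them to the database as a manual_file pool
  5. Sync only these playlists
  6. Download only the songs linked to these playlists

Without --playlist-file, run keeps the original default behavior: collect -> sync -> download.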

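--sources and --download-sources

  • --sources
    • Controls which canonical platforms' songs are collected / synced / filtered
    • Currently used mainly for the three playlist sources netease, qq, and kuwo
  • --download-sources
    • Controls which platforms are searched again before download to resolve a direct link
    • Defaults to the same six platforms as the GUI: qq,kuwo,migu,qianqian,kugou,netease

What the download stage actually does:

  1. Take the song name, singers, and original snapshot from the canonical song in the database
  2. Search again for downloadable candidates within the --download-sources allowlist
  3. Rank the candidates by match confidence -> audio quality / file size -> your configured source order
  4. Pick the best candidate, and only then download it

This means:

  • A song from a NetEase playlist is not necessarily downloaded from NetEase
  • When the original platform's official direct link has expired or is unavailable, the downloader automatically looks for a candidate with the same title and singer on the other download sources
  • As long as the match is trustworthy, the higher-quality candidate is preferred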
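
The ranking order described above can be sketched as a filter followed by a sort key. Everything here is hypothetical: the Candidate fields, the 0.8 confidence threshold, and the integer quality_rank are illustrative assumptions, not the scoring that resolver.py actually uses.

```python
from dataclasses import dataclass


@dataclass
class Candidate:
    source: str           # download platform, e.g. "qq"
    match_score: float    # assumed 0..1 confidence that this is the same song
    quality_rank: int     # assumed: higher means better audio quality
    file_size_bytes: int


def pick_best(candidates, source_order, min_score=0.8):
    """Pick the best trusted candidate, or None when nothing is trustworthy."""
    # Step 1: keep only allowlisted sources with a high-confidence match.
    trusted = [
        c for c in candidates
        if c.source in source_order and c.match_score >= min_score
    ]
    if not trusted:
        return None
    # Step 2: prefer higher quality, then larger files, then earlier sources.
    return min(
        trusted,
        key=lambda c: (-c.quality_rank, -c.file_size_bytes, source_order.index(c.source)),
    )
```

With this key, a low-confidence candidate never wins on quality alone, and the configured source order only breaks ties between otherwise equal candidates.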

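From this version on, the sync stage also no longer requires the original platform to return a downloadable direct link on the spot:

  • As long as the playlist API still returns song metadata, sync writes the full song snapshot to the database
  • These songs are stored as "deferred resolution" snapshots; a usable direct link is resolved at download time through --download-sources
  • This keeps songs from being dropped early at ingest when NetEase / QQ / Kuwo withholds a link for copyright reasons or a temporary direct link has expired

File Format

One entry per line. Three kinds of line are supported: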

# 注释行
https://music.163.com/#/playlist?id=17745989905
qq,https://y.qq.com/n/ryqq/playlist/7707261125
https://y.qq.com/n/ryqq/toplist/26
https://www.kuwo.cn/rankList?bangId=16

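Rules:

  • Blank lines are ignored
  • Lines starting with # are ignored
  • The platform,URL form is supported
  • A bare URL is also supported; the platform is then detected automatically
  • Duplicate playlists within the same file are deduplicated automatically
  • Platforms currently supported for URL auto-detection: netease, qq, kuwo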
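
The rules above are simple enough to sketch as a parser. This is an illustration under stated assumptions: parse_playlist_lines and HOST_TO_PLATFORM are hypothetical names, and deduplication here compares the exact (platform, URL) pair, whereas the real tool may deduplicate by the parsed remote playlist ID.

```python
from urllib.parse import urlparse

# Assumed host-to-platform mapping, based on the example URLs in this document.
HOST_TO_PLATFORM = {
    "music.163.com": "netease",
    "y.qq.com": "qq",
    "www.kuwo.cn": "kuwo",
}


def parse_playlist_lines(lines):
    """Return a deduplicated list of (platform, url) pairs."""
    seen = set()
    entries = []
    for raw in lines:
        line = raw.strip()
        if not line or line.startswith("#"):
            continue  # blank lines and comments are ignored
        prefix, sep, rest = line.partition(",")
        if sep and not prefix.strip().lower().startswith("http"):
            # Explicit "platform,URL" form.
            platform, url = prefix.strip().lower(), rest.strip()
        else:
            # Bare URL: detect the platform from the host name.
            url = line
            platform = HOST_TO_PLATFORM.get(urlparse(url).netloc.lower())
            if platform is None:
                raise ValueError(f"cannot detect platform for: {url}")
        if (platform, url) not in seen:
            seen.add((platform, url))
            entries.append((platform, url))
    return entries
```

Raising on an unrecognized host is a design choice of this sketch; a batch tool might prefer to log the line and continue.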

Supported URL Types

  • NetEase regular playlist: https://music.163.com/#/playlist?id=...
  • QQ regular playlist: https://y.qq.com/n/ryqq/playlist/...
  • QQ toplist: https://y.qq.com/n/ryqq/toplist/...
  • Kuwo regular playlist: https://www.kuwo.cn/playlist_detail/...
  • Kuwo toplist: https://www.kuwo.cn/rankList?bangId=...

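Database Design Overview

The database is SQLite, with this connection policy:

  • PRAGMA journal_mode=WAL
  • PRAGMA busy_timeout=30000
  • PRAGMA synchronous=NORMAL
  • All tables are defined centrally in db.py, and add-column migrations run at initialization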
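
As a rough illustration of the connection policy, the three PRAGMAs can be applied right after opening a connection with Python's standard sqlite3 module. The helper name open_connection and the use of sqlite3.Row are assumptions; db.py may structure this differently.

```python
import sqlite3


def open_connection(db_path):
    """Open a SQLite connection with the PRAGMAs listed above (hypothetical helper)."""
    conn = sqlite3.connect(db_path)
    conn.row_factory = sqlite3.Row  # assumption: rows addressable by column name
    # WAL lets readers keep reading while a single writer appends to the log.
    conn.execute("PRAGMA journal_mode=WAL")
    # Wait up to 30 seconds for a lock instead of failing immediately.
    conn.execute("PRAGMA busy_timeout=30000")
    # NORMAL trades a little durability for fewer fsync calls under WAL.
    conn.execute("PRAGMA synchronous=NORMAL")
    return conn
```

journal_mode=WAL is stored in the database file, while busy_timeout and synchronous are per-connection settings, so they need to be set on every new connection.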

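Design goals:

  1. Strong deduplication: one entity per platform and remote ID
  2. Loose coupling: a song's logical asset is separate from its physical storage location
  3. Recoverable: the job state machine is persisted and supports resuming after a restart
  4. Observable: jobs, workers, logs, and events are all stored in tables

Table Domains (four domains)

  1. Catalog entity domain (Catalog Core)
    • playlist_pools: playlist source pools (square/toplist/manual_file)
    • playlists: playlist records (platform, remote ID, policy, play count)
    • songs: song records (platform, remote ID, name, singers, format, snapshot)
    • artists: artist records (normalized name + platform dimension)
  2. Relationship mapping domain (Association)
    • pool_playlists: many-to-many between pools and playlists
    • playlist_songs: many-to-many between playlists and songs (with position)
    • pool_artists: many-to-many between pools and artists
    • artist_songs: many-to-many between artists and songs
  3. File asset domain (Storage)
    • storage_backends: storage backend definitions (local_fs/object_storage/cloud_drive)
    • file_assets: logical file versions of a song (quality/format/size/checksum)
    • file_locations: physical locations (backend + locator + status + primary/replica flag)
    • song_backend_presence: aggregated presence of a song on a backend (speeds up queries)
    • download_tasks / upload_tasks: download and upload queues
  4. Job orchestration domain (Ops)
    • job_runs: job overview (type, status, scope, config snapshot)
    • job_stages: per-stage (collect/sync/download/upload) counters
    • job_items: smallest unit of execution (playlist item/song item/file item)
    • job_workers: live worker status, throughput, speed
    • job_commands: pause/resume/cancel/retry command queue
    • job_events / job_logs: audit events and execution logs
    • config_revisions: environment config revision snapshots and rollback records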
Deduplication and Consistency Constraints (core)

Unique keys (hard constraints):

  • playlists(platform, remote_playlist_id)
  • songs(platform, remote_song_id)
  • file_locations(file_asset_id, backend_id, locator)
  • upload_tasks(file_asset_id, target_backend_id, target_locator)
  • job_items(job_stage_id, item_key)

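Consistency rules (business layer):

  • One song_id can map to multiple file_asset rows (different quality/format)
  • One file_asset can have multiple file_location rows (local + cloud)
  • song_backend_presence is derived from file_locations and is not a source of truth
  • A playlist's downloaded / not downloaded / partial status is aggregated from playlist_songs + active local file_locations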

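High-Frequency Read/Write Paths (troubleshooting focus)

  1. Collect stage
    • Writes: playlist_pools, playlists, pool_playlists
    • Typical problem: the pool has playlists but playlists.collected_song_count was not backfilled
  2. Sync stage
    • Writes: songs, playlist_songs, artists, pool_artists, artist_songs
    • Typical problem: the playlist is synced but its song count is 0 (distinguish "source returned nothing" from "parse failed")
  3. Download stage
    • Writes: file_assets, file_locations, download_tasks
    • Reads: songs snapshot + download-source candidates
    • Typical problem: files written to disk more than once, with (1)/(2) filename suffixes piling up
  4. Upload stage
    • Writes: upload_tasks, file_locations, song_backend_presence
    • Typical problem: the upload succeeded but presence was not refreshed, so the UI still shows "not uploaded"
  5. Task center
    • Writes: job_runs/stages/items/workers/commands/events/logs
    • Reads: dashboard summary, doing/done tree, worker speed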

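Migration and Backward Compatibility

  • On every startup, initialize_database():
    • runs CREATE TABLE IF NOT EXISTS
    • runs any required ALTER TABLE ADD COLUMN (for example play_count and the worker throughput columns)
  • This lets an old database upgrade in place without running SQL migration scripts by hand
  • Back up catalogsync.db before upgrading, especially before changing the dedup strategy or running bulk maintenance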
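
An idempotent add-column migration of this kind can be sketched with PRAGMA table_info. The code below is a simplified stand-in rather than the project's initialize_database(): the ensure_column helper, the trimmed playlists schema, and the INTEGER type of play_count are assumptions.

```python
import sqlite3


def ensure_column(conn, table, column, ddl):
    """Add `column` to `table` only when it is missing (hypothetical helper)."""
    # Row index 1 of PRAGMA table_info is the column name.
    existing = {row[1] for row in conn.execute(f"PRAGMA table_info({table})")}
    if column not in existing:
        conn.execute(f"ALTER TABLE {table} ADD COLUMN {column} {ddl}")


def initialize_database(conn):
    """Simplified stand-in: create the table if needed, then backfill columns."""
    conn.execute(
        """
        CREATE TABLE IF NOT EXISTS playlists (
            id INTEGER PRIMARY KEY,
            platform TEXT NOT NULL,
            remote_playlist_id TEXT NOT NULL,
            UNIQUE (platform, remote_playlist_id)
        )
        """
    )
    ensure_column(conn, "playlists", "play_count", "INTEGER")
    conn.commit()
```

Because both steps check before they change anything, calling the function on every startup is safe for new and old databases alike.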

Core ER Diagram (simplified)

erDiagram
    PLAYLIST_POOLS ||--o{ POOL_PLAYLISTS : links
    PLAYLISTS ||--o{ POOL_PLAYLISTS : belongs_to
    PLAYLISTS ||--o{ PLAYLIST_SONGS : contains
    SONGS ||--o{ PLAYLIST_SONGS : appears_in

    ARTIST_POOLS ||--o{ POOL_ARTISTS : links
    ARTISTS ||--o{ POOL_ARTISTS : belongs_to
    ARTISTS ||--o{ ARTIST_SONGS : sings
    SONGS ||--o{ ARTIST_SONGS : performed_by

    SONGS ||--o{ FILE_ASSETS : has_versions
    FILE_ASSETS ||--o{ FILE_LOCATIONS : stored_at
    STORAGE_BACKENDS ||--o{ FILE_LOCATIONS : hosts
    SONGS ||--o{ SONG_BACKEND_PRESENCE : has_presence
    STORAGE_BACKENDS ||--o{ SONG_BACKEND_PRESENCE : summarized_on

    JOB_RUNS ||--o{ JOB_STAGES : has
    JOB_STAGES ||--o{ JOB_ITEMS : has
    JOB_RUNS ||--o{ JOB_WORKERS : owns
    JOB_RUNS ||--o{ JOB_COMMANDS : receives
    JOB_RUNS ||--o{ JOB_EVENTS : emits
    JOB_RUNS ||--o{ JOB_LOGS : writes

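Tables

Playlist pool -> playlist -> song

  • playlist_pools
    • Per-platform source pools, such as playlist_square, toplist, manual_file
  • playlists
    • A specific playlist or toplist
  • pool_playlists
    • Mapping between playlist pools and playlists
  • songs
    • Main song table; the unique key is (platform, remote_song_id)
  • playlist_songs
    • Mapping between playlists and songs

The main song table stores these core fields: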

  • remote_song_id
  • name
  • singers
  • ext
  • file_size_bytes
  • quality_label
  • metadata_json
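    • Contains the SongInfo snapshot, which can later be restored and handed straight back to the original downloader

Derived artist pools + lazy completion

  • artist_pools
    • Artist pools derived from playlist pools
  • artists
    • Main artist table
  • pool_artists
    • Mapping between artist pools and artists
  • artist_songs
    • Mapping between artists and songs

Syncing playlist songs also updates the artist pools, which satisfies the requirement that artist pools are updated whenever a playlist pool is updated.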

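Download Dedup and File Mapping

Logical asset layer

  • file_assets
    • Represents "one particular file version of a song"
    • The usual dimensions are song_id + quality_label + ext + file_size_bytes
    • ext / quality_label / file_size_bytes reflect the source file actually downloaded and are not tied to the canonical platform

Physical location layer

  • storage_backends
    • Describes a storage backend
    • local_fs is implemented today
    • Cloud drives and object storage can be added later
  • file_locations
    • Records where a file asset actually resides right now

One way to think about it:

  • file_assets answers "what is this file"
  • file_locations answers "where is this file stored now"

If a song is first downloaded locally and later uploaded to a cloud drive or object storage, the same file_asset is reused; only the corresponding file_location is added or updated.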

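Upload Queue and Backend Presence

  • song_backend_presence
    • Derived summary table recording whether a song already has an active file on a given backend
    • Commonly used to quickly answer "has this song been backfilled to main-s3 yet"
  • upload_tasks
    • Upload task queue table
    • One task = one local file_asset uploaded to one target backend/key
    • Statuses are pending, uploading, succeeded, failed, skipped

Two things to keep distinct here:

  • file_locations remains the source of truth
  • song_backend_presence exists only for fast lookups and does not replace file_locations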

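Behavior When Disk Space Runs Low

The downloader checks the free space in the target directory first.

If there is not enough space, it prompts for a new download directory: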

磁盘空间不足,请输入新的下载目录继续:

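The new directory can be on a different drive. The program then:

  • downloads the song to the new directory
  • automatically creates or reuses a storage_backend for the new directory
  • writes the new file location back to file_locations

With --workers > 1 the prompt still appears only once, globally. After a successful switch, download tasks that have not started yet all continue in the new directory.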
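
One way to get a single global prompt with several workers is to guard the shared download directory with a lock. The sketch below only illustrates that idea: DownloadRoot, has_free_space, and the injectable prompt and check parameters are hypothetical and are not taken from downloader.py.

```python
import shutil
import threading


def has_free_space(directory, required_bytes):
    """Return True when `directory` has at least `required_bytes` free."""
    return shutil.disk_usage(directory).free >= required_bytes


class DownloadRoot:
    """Shared download directory; only one worker at a time may switch it."""

    def __init__(self, path, prompt=input, check=has_free_space):
        self._path = path
        self._prompt = prompt
        self._check = check
        self._lock = threading.Lock()

    def resolve(self, required_bytes):
        # The lock serializes workers, so the first one that hits a full disk
        # prompts; later workers see the already-switched path and do not.
        with self._lock:
            while not self._check(self._path, required_bytes):
                self._path = self._prompt("磁盘空间不足,请输入新的下载目录继续:").strip()
            return self._path
```

Each worker would call resolve() with the expected file size before starting a download and write into the directory it returns.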

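Object Storage Upload

A first version of object storage upload is implemented; backends are treated as S3-compatible.

Key conventions

  1. When a local download finishes, a local file_location is written first
  2. After a successful upload, a remote file_location is added for the same file_asset
  3. The local file is kept, and the local file_location.is_primary = 1
  4. The remote object storage record has is_primary = 0
  5. Database state is trusted by default; no extra HEAD check is made against the remote object
  6. If a song has several active local file versions, all of them are enqueued for upload

key / locator rules

The object storage key mirrors the local relative path.

For example:

  • local locator: qq/Singer A/song-a.flac
  • backend base_prefix: music
  • remote locator: music/qq/Singer A/song-a.flac

The benefits:

  • The directory structure matches the local one
  • Later migrations or rebuilding the mapping are simpler
  • The same locator semantics are easier to reuse when uploading to a CDN / cloud drive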
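
The mirroring rule amounts to joining base_prefix with the local relative path using forward slashes. A minimal sketch, assuming a hypothetical helper build_remote_locator and that Windows-style separators should be normalized:

```python
import posixpath


def build_remote_locator(base_prefix, local_locator):
    """Mirror a local relative path under the backend's base_prefix."""
    # Object keys use forward slashes regardless of the local OS (assumption).
    relative = local_locator.replace("\\", "/").lstrip("/")
    prefix = base_prefix.strip("/")
    return posixpath.join(prefix, relative) if prefix else relative


print(build_remote_locator("music", "qq/Singer A/song-a.flac"))
# music/qq/Singer A/song-a.flac
```

An empty base_prefix leaves the key identical to the local locator.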

Backend Configuration and Secrets Model

Non-sensitive configuration is stored in storage_backends.config_json, for example:

  • endpoint
  • region
  • base_prefix
  • addressing_style
  • public_base_url
  • credential_env_prefix

Sensitive secrets are never stored in the database; they come from environment variables only.

For example, with credential_env_prefix = CATALOGSYNC_MAIN_S3:

CATALOGSYNC_MAIN_S3_ACCESS_KEY_ID=your-access-key
CATALOGSYNC_MAIN_S3_SECRET_ACCESS_KEY=your-secret-key
CATALOGSYNC_MAIN_S3_SESSION_TOKEN=optional-session-token

If public_base_url is configured, a successful upload also writes the derivable public_url back to the remote file_location.
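
Resolving credentials from a prefix can be sketched as a plain environment lookup. The variable suffixes below come from the example above; the function name load_credentials, the returned dictionary shape, and the error on missing keys are assumptions made for illustration.

```python
import os


def load_credentials(prefix, env=None):
    """Read S3-style credentials from <prefix>_* variables (hypothetical helper)."""
    env = os.environ if env is None else env
    access_key = env.get(f"{prefix}_ACCESS_KEY_ID", "")
    secret_key = env.get(f"{prefix}_SECRET_ACCESS_KEY", "")
    if not access_key or not secret_key:
        raise RuntimeError(f"missing credentials for prefix {prefix}")
    return {
        "access_key_id": access_key,
        "secret_access_key": secret_key,
        # The session token is optional; an empty value is treated as absent.
        "session_token": env.get(f"{prefix}_SESSION_TOKEN") or None,
    }
```

Passing env explicitly keeps the lookup testable without touching the real process environment.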

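Default Behavior of the upload Command

By default, upload does three things:

  1. Find active local files that are still missing on the target backend
  2. Deduplicate, then insert or reuse upload_tasks rows
  3. Run the uploads with a bounded pool of concurrent workers and write results back to the database

The scope can be narrowed along these dimensions: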

  • --sources
  • --playlist-ids
  • --limit
  • --workers

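Recommended defaults:

  • Download: --workers 10
  • Upload: --workers 4

What the Database Updates After an Upload

  • file_locations
    • Adds or updates the remote object location
  • song_backend_presence
    • Refreshes the song's active summary on the target backend
  • upload_tasks
    • Records the queued, running, succeeded, or failed state of this task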

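Cloud Drive Compatibility (reserved)

Recommended conventions:

  • Local files:
    • backend_type=local_fs
    • locator stores the relative path
  • Object storage:
    • backend_type=object_storage
    • container_name stores the bucket
    • locator stores the key
  • Cloud drive backends:
    • backend_type=cloud_drive
    • remote_file_id stores the platform file ID
    • locator stores the remote directory path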

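Current Implementation Notes

  • The collection layer covers the "Playlist Square" and "Toplist" sources from the GUI Discover page
  • Special toplist parsing is supported for:
    • netease_toplist
    • qq_toplist
    • kuwo_toplist
  • The download pipeline decouples the playlist source from the download source
  • At download time, songs are searched again on the platforms given by --download-sources
  • Candidate selection strategy:
    • High-confidence matches first
    • Among high-confidence candidates, prefer higher audio quality / larger files
    • When quality is similar, priority follows the order of --download-sources
  • The default download sources are the same six platforms as the GUI: qq,kuwo,migu,qianqian,kugou,netease
  • Object storage upload currently provides two commands: register-object-backend and upload

Operating Recommendations

  • For a first batch run, start with a single platform, for example --sources netease
  • Run sync and download with --limit first as a smoke test
  • To run only a few specific playlists, prefer run --playlist-file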

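NAS / Linux Deployment Conventions

Directory Responsibilities

  • /volume4/Music_Cloud/library
    • Holds only the final music files (download output)
  • /volume4/Music_Cloud/catalogsync
    • Holds only the catalogsync application and its runtime data (code, script copies, config, database, inputs, logs)

Recommended fixed layout: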

/volume4/Music_Cloud/
  library/
  catalogsync/
    app/
    bin/
    config/
    data/
    inputs/
    logs/

Download Layout

The default download layout is:

<LIBRARY_DIR>/<platform>/<first_artist>/<filename>

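DOWNLOAD_LAYOUT=platform_first_artist corresponds to the directory structure above.

Here <platform> means the download source platform that actually served the file, not the platform the playlist came from.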
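
The platform_first_artist layout can be sketched as a small path builder. The function name, the "Unknown Artist" fallback, and the replacement of path separators inside artist names are assumptions made for this example; the downloader may sanitize names differently.

```python
from pathlib import PurePosixPath


def build_relative_path(platform, singers, filename):
    """Return <platform>/<first_artist>/<filename> as a relative locator."""
    first = singers[0].strip() if singers and singers[0].strip() else "Unknown Artist"
    # Keep the artist name from introducing extra directory levels (assumption).
    first = first.replace("/", "_").replace("\\", "_")
    return str(PurePosixPath(platform) / first / filename)


print(build_relative_path("qq", ["Singer A", "Singer B"], "song-a.flac"))
# qq/Singer A/song-a.flac
```

The platform argument is the download source that actually served the file, which is why a NetEase playlist can end up under a qq/ directory.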

Key catalogsync.env Entries (example)

ROOT_DIR=/volume4/Music_Cloud
APP_HOME=/volume4/Music_Cloud/catalogsync
LIBRARY_DIR=/volume4/Music_Cloud/library
DB_PATH=/volume4/Music_Cloud/catalogsync/data/catalogsync.db
INPUT_DIR=/volume4/Music_Cloud/catalogsync/inputs
LOG_DIR=/volume4/Music_Cloud/catalogsync/logs
ENV_FILE=/volume4/Music_Cloud/catalogsync/config/catalogsync.env
WEB_HOST=127.0.0.1
WEB_PORT=18080
PYTHON_BIN=python3
VENV_DIR=/volume4/Music_Cloud/catalogsync/app/.venv
DOWNLOAD_LAYOUT=platform_first_artist
DOWNLOAD_SOURCES=qq,kuwo,migu,qianqian,kugou,netease
CATALOG_EXPORT_COMMAND=bash /volume4/Music_Cloud/Music_Server/scripts/catalog-export.sh
CATALOG_EXPORT_WORKDIR=/volume4/Music_Cloud/Music_Server
OBJECT_BACKEND_NAME=main-s3
OBJECT_BUCKET=music-bucket
OBJECT_ENDPOINT=https://s3.example.com
OBJECT_REGION=auto
OBJECT_BASE_PREFIX=music
OBJECT_ADDRESSING_STYLE=
OBJECT_PUBLIC_BASE_URL=
OBJECT_CREDENTIAL_ENV_PREFIX=CATALOGSYNC_MAIN_S3
UPLOAD_WORKERS=4
UPLOAD_SOURCES=
UPLOAD_PLAYLIST_IDS=
UPLOAD_LIMIT=
CATALOGSYNC_MAIN_S3_ACCESS_KEY_ID=
CATALOGSYNC_MAIN_S3_SECRET_ACCESS_KEY=
CATALOGSYNC_MAIN_S3_SESSION_TOKEN=

One-Step Deployment from Windows to NAS (recommended)

If you develop locally on Windows and deploy to a fixed NAS, a single command is recommended:

.\deploy-catalogsync.ps1

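The command chains these steps:

  1. Upload the local musicdl/catalogsync to the NAS staging directory
  2. Overwrite serve_console.sh and deploy_and_restart.sh on the NAS with the latest versions
  3. Run the atomic deployment script on the NAS (backup -> sync -> stop old -> start new -> health check)
  4. If the health check or the single-instance check fails, roll back automatically to the previous version and return a non-zero exit code

Optional parameter: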

.\deploy-catalogsync.ps1 -SkipHealthCheck

Script locations:

  • Repository shortcut: deploy-catalogsync.ps1
  • NAS deployment trigger: scripts/catalogsync/deploy_to_nas.ps1
  • NAS deployment executor: scripts/catalogsync/templates/deploy_and_restart.sh

NAS-Side Deployment Script Behavior (deploy_and_restart.sh)

Default target paths:

  • Code target: /volume4/Music_Cloud/catalogsync/app/musicdl/catalogsync
  • Staging: /volume4/Music_Cloud/catalogsync/deploy/staging/catalogsync
  • Backups: /volume4/Music_Cloud/catalogsync/deploy/backups/catalogsync_YYYYMMDD_HHMMSS

Stability mechanisms:

  • Deployment lock: /volume4/Music_Cloud/catalogsync/run/deploy.lock
  • Service PID file: /volume4/Music_Cloud/catalogsync/run/serve.pid
  • Health check: defaults to http://127.0.0.1:${WEB_PORT}/dashboard
  • Rollback on failure: automatically restores the most recent backup, restarts, and re-verifies
  • Backup retention: keeps the 5 most recent versions by default (adjust with --keep-backups)

Using scripts/catalogsync/bootstrap_to_linux.ps1

Run on the Windows side (it uses ssh/scp to initialize the target machine's directories and distribute the code and script templates):

powershell -ExecutionPolicy Bypass -File .\scripts\catalogsync\bootstrap_to_linux.ps1 `
  -RemoteHost 192.168.1.10 `
  -Port 22 `
  -User xiaoming `
  -RootDir /volume4/Music_Cloud

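Afterward, on the target machine, copy catalogsync.env.example to catalogsync.env and adjust it to the machine's actual paths.

Run install_runtime.sh on the Target Machine First

After the first deployment to the target machine, run this once: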

bash /volume4/Music_Cloud/catalogsync/bin/install_runtime.sh

The script automatically:

  • creates VENV_DIR using PYTHON_BIN
  • upgrades pip/setuptools/wheel
  • generates /volume4/Music_Cloud/catalogsync/app/requirements.nas.txt from /volume4/Music_Cloud/catalogsync/app/requirements.txt
  • filters out nodejs-wheel
  • installs the dependencies the current catalogsync download/upload pipeline needs
  • runs an editable install in /volume4/Music_Cloud/catalogsync/app so that python -m musicdl.catalogsync.cli ... can be run directly

Logs are written to:

/volume4/Music_Cloud/catalogsync/logs/install_runtime_YYYYMMDD_HHMMSS.log

Using download_all.sh / download_from_file.sh on the Target Machine

Before running on the target machine, prepare:

cp /volume4/Music_Cloud/catalogsync/config/catalogsync.env.example \
   /volume4/Music_Cloud/catalogsync/config/catalogsync.env

Full pipeline (equivalent to musicdl.catalogsync.cli run):

bash /volume4/Music_Cloud/catalogsync/bin/download_all.sh --sources netease,qq,kuwo --limit 20

Run from a playlist file (skips collect):

bash /volume4/Music_Cloud/catalogsync/bin/download_from_file.sh \
  /volume4/Music_Cloud/catalogsync/inputs/playlists.txt

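This script maps to the run --playlist-file branch (which skips collect), so the example no longer passes --sources.

Both download scripts automatically read DOWNLOAD_SOURCES from catalogsync.env and pass it to the CLI as --download-sources ....

Both download scripts prefer VENV_DIR/bin/python and fall back to PYTHON_BIN only if the virtual environment is not ready yet.

Catalog Export After Download (recommended for NAS integration)

So that Music_Server's read-only database catalog_read.db refreshes automatically after downloads, configure the following in catalogsync.env:

  • CATALOG_EXPORT_COMMAND=bash /volume4/Music_Cloud/Music_Server/scripts/catalog-export.sh
  • CATALOG_EXPORT_WORKDIR=/volume4/Music_Cloud/Music_Server

Behavior:

  • Triggered once each time a download stage reaches a terminal state (only once per stage)
  • When CATALOG_EXPORT_COMMAND is not configured, the export is marked skipped
  • job_events records the following events: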
    • catalog_export_started
    • catalog_export_skipped
    • catalog_export_succeeded
    • catalog_export_failed

Using upload_all.sh on the Target Machine

The object storage upload script is located at:

/volume4/Music_Cloud/catalogsync/bin/upload_all.sh

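It first runs register-object-backend automatically using the settings in catalogsync.env, then runs upload. After changing the bucket, endpoint, or CDN base URL, no separate manual registration is needed.

The simplest way to run it: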

bash /volume4/Music_Cloud/catalogsync/bin/upload_all.sh

To backfill only specific sources or playlists, append CLI arguments directly after the script:

bash /volume4/Music_Cloud/catalogsync/bin/upload_all.sh --sources netease,qq --limit 200
bash /volume4/Music_Cloud/catalogsync/bin/upload_all.sh --playlist-ids 12,15 --workers 6

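This script likewise prefers VENV_DIR/bin/python and falls back to PYTHON_BIN only if the virtual environment does not exist.

The script depends on the following env variables: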

  • OBJECT_BACKEND_NAME
  • OBJECT_BUCKET
  • OBJECT_ENDPOINT
  • OBJECT_REGION
  • OBJECT_BASE_PREFIX
  • OBJECT_ADDRESSING_STYLE
  • OBJECT_PUBLIC_BASE_URL
  • OBJECT_CREDENTIAL_ENV_PREFIX
  • ${OBJECT_CREDENTIAL_ENV_PREFIX}_ACCESS_KEY_ID
  • ${OBJECT_CREDENTIAL_ENV_PREFIX}_SECRET_ACCESS_KEY
  • ${OBJECT_CREDENTIAL_ENV_PREFIX}_SESSION_TOKEN
  • UPLOAD_WORKERS
  • UPLOAD_SOURCES
  • UPLOAD_PLAYLIST_IDS
  • UPLOAD_LIMIT

Logs are written to:

/volume4/Music_Cloud/catalogsync/logs/upload_all_YYYYMMDD_HHMMSS.log

Using serve_console.sh on the Target Machine

The ops console script is located at:

/volume4/Music_Cloud/catalogsync/bin/serve_console.sh

Example:

bash /volume4/Music_Cloud/catalogsync/bin/serve_console.sh

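The script automatically reads DB_PATH, ENV_FILE, WEB_HOST, and WEB_PORT from catalogsync.env and passes them through to musicdl.catalogsync.cli serve.

Single-instance protection:

  • Lock directory: /volume4/Music_Cloud/catalogsync/run/serve.lock
  • PID file: /volume4/Music_Cloud/catalogsync/run/serve.pid
  • If a live instance already exists, the script fails and exits immediately to prevent a duplicate start

Logs are written to: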

/volume4/Music_Cloud/catalogsync/logs/serve_console_YYYYMMDD_HHMMSS.log

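NAS Dependency Installation Notes

The system Python on this NAS is Python 3.8, and the NAS lacks the native build toolchain that nodejs-wheel-binaries needs.

The current catalogsync pipeline (download, object storage upload, netease/qq/kuwo) does not depend on nodejs-wheel, so using install_runtime.sh above is recommended. It generates and installs the filtered requirements.nas.txt automatically, so there is no need to run grep by hand.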

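/playlists Playlist Pool Management Page (selective download)

/playlists now serves as the playlist pool management page, built for the ops workflow "filter playlists -> select targets -> run a bulk action".

Supported filter parameters: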

  • platform
  • pool_kind
  • status
  • keyword
  • wanted_only
  • page_size

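The list supports selecting rows on the current page, with select-all / clear for the whole page.

Four bulk actions are currently supported:

  • Download selected synced playlists
  • Sync then download selected playlists
  • Add to the wanted list
  • Remove from the wanted list

Playlist status semantics:

  • Unsynced: the playlist has not finished syncing
  • Not downloaded: synced, but songs are still pending download
  • Downloading: a download job is in progress
  • Partially downloaded: some songs are on disk, the rest are not finished
  • Downloaded: every song in the playlist meets the "downloaded" criterion

The "downloaded" criterion: for a given song_id, the song counts as downloaded as soon as an active local_fs file exists locally.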
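
The status semantics above can be sketched as a pure function over aggregated counts. The status strings reuse the identifiers that appear in the export routing notes near the end of this document; the function signature and the precedence between "downloading" and the other states are assumptions.

```python
def playlist_status(synced, total_songs, downloaded_songs, has_running_download):
    """Derive a playlist status from aggregate counts (hypothetical helper).

    downloaded_songs counts songs with at least one active local_fs file.
    """
    if not synced:
        return "unsynced"
    if has_running_download:
        return "downloading"
    if total_songs > 0 and downloaded_songs >= total_songs:
        return "downloaded"
    if downloaded_songs > 0:
        return "partial"
    return "not_downloaded"
```

Because the inputs are counts derived from playlist_songs and active local file_locations, the status never needs to be stored and cannot drift from the underlying rows.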

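Page actions ultimately reuse the existing job system:

  • Download selected synced playlists -> download_only
  • Sync then download selected playlists -> sync_download
  • What sets these two job kinds apart is playlist_scope.playlist_ids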

Operations Console Update

As of 2026-04-16, the operations console behavior has changed in three important ways:

  1. musicdl-catalogsync serve now starts the web console together with an embedded ops runner.
  2. /dashboard now exposes a create-job form plus live job/download summary, active workers, and running items.
  3. /jobs/{id} now exposes a command form for pause, resume, cancel, retry_item, and force_retry_item, together with worker and running-item detail.

Current job type to stage mapping:

  • catalog_sync: collect -> sync -> download
  • collect_only: collect
  • sync_only: sync
  • sync_download: sync -> download
  • download_only: download
  • upload_only: upload
  • download_upload: download -> upload

Collector behavior update:

  • playlist square collection now paginates for netease and kuwo
  • qq playlist-square failures are isolated so other sources continue

This means the console is no longer read-only: creating a job from the dashboard should enqueue work that the embedded runner can execute without starting a second process.

As of 2026-04-17, the deployed NAS console was verified again and the following operational fixes are also part of the live behavior:

  1. /dashboard now exposes Quick Launch, Active Job, Running Songs, and Playlist Coverage, and the Active Job / Recent Jobs blocks now provide direct pause / resume / cancel buttons, so the operator can both observe progress and control the current queue from one page.
  2. /jobs/{id} now exposes direct action buttons for pause, resume, cancel, retry_item, and force_retry_item instead of only relying on a generic command dropdown.
  3. Collect-stage workers now emit page-level progress text such as page N: +X, total Y, which makes it clear whether collection is advancing or stuck.

Collector and runtime hardening in this round:

  • QQCollector playlist-square requests now send the required Referer and Origin headers, which restored non-zero QQ playlist-square collection on NAS.
  • netease and kuwo playlist-square pagination now stops when the upstream explicitly reports has_more = false or when a page is entirely duplicate playlists, preventing long-running repeated-page loops.
  • NAS runtime compatibility was extended for Python 3.8 by removing runtime-evaluated built-in generic aliases from the serve import path.
  • SQLite connections now enable busy_timeout and journal_mode=WAL, which prevents the operations console from intermittently failing with database is locked while the embedded runner is writing progress.

Observed NAS verification snapshot after redeploying these fixes:

  • GET http://192.168.5.43:18080/dashboard returned 200 OK with the new controls visible.
  • Ten consecutive requests to /api/dashboard returned 200 OK while collect_only job 3 was running.
  • Total playlists on NAS grew from the earlier 811 baseline to 1441 during live verification.
  • QQ playlists on NAS grew from 25 to 629+ during the same verification window, confirming that QQ playlist-square collection was no longer stuck at zero.

2026-04-17 NAS Restart Note

During the 2026-04-17 restart verification on NAS, the web console and the embedded runner did not recover equally:

  • the web process restarted and continued serving /dashboard, /jobs/{id}, and /api/dashboard
  • a stale duplicate serve process had to be removed manually before the NAS converged back to a single web instance
  • after duplicate cleanup, the embedded runner still failed to advance queued work even though manual OpsRepository / OpsRunner recovery calls succeeded against the same database

Operational workaround used on NAS:

  • web console kept running as /volume4/Music_Cloud/catalogsync/app/.venv/bin/python -m musicdl.catalogsync.cli serve ...
  • a separate emergency runner process was started to execute OpsRunner.run_forever() against the same SQLite database
  • verification after the workaround showed job 5 resume correctly and downloaded_songs increase from 82 to 85

Temporary NAS-only emergency runner details:

  • PID: 17516
  • log: /volume4/Music_Cloud/catalogsync/logs/ops_runner_20260417_101958.log

Resolution on 2026-04-17 10:29:

  • musicdl/catalogsync/ops/web.py now supervises the embedded runner thread and automatically restarts it after transient exceptions instead of letting the web process continue without background execution
  • local regression coverage now includes an embedded-runner recovery test that forces one loop failure and verifies that queued work is still completed after automatic restart
  • NAS was redeployed with this fix and the temporary emergency runner was removed
  • after restart, NAS converged back to a single live serve process on port 18080
  • the restarted web process recovered the interrupted download job back to paused, accepted a resume command, and then continued downloading without any standalone runner
  • live verification on NAS showed downloaded_songs increase from 100 to 102 under the single embedded-runner setup

2026-04-17 Progress Visibility Update

  • the playlists page now renders a Progress column with downloaded / total, a percentage bar, and the current running-song count
  • the job detail page now renders a Playlist Progress table for playlist-scoped jobs
  • job playlist progress is derived from playlist-song links, active local files, and download-stage job items of the current job
  • songs that were already present locally before the job started still count as completed progress for that playlist
  • empty boolean-like filters such as /playlists?wanted_only= and /api/playlists?wanted_only= are accepted and treated as false

2026-04-17 Non-Music Skip + Task Center Tree

  • download stage now classifies QQ toplist fallback entries (remote_song_id starts with qqtop_ or metadata marks qq_toplist_fallback) as skipped instead of failed
  • skipped toplist entries are annotated with 非音乐资源(有声榜条目)
  • new API: GET /api/jobs/{job_id}/playlists/{playlist_id}/songs returns per-song progress rows for one playlist inside one job
  • dashboard Task Center removed the old Open jump link and keeps operations inline
  • task detail now supports hierarchical expansion:
    • task -> playlist progress rows
    • playlist row -> lazy-loaded song progress rows
    • song rows explicitly show 非音乐资源 tag when matched

2026-04-17 Stable Task Tree Refresh

  • dashboard Task Center no longer renders the embedded Summary / Stages / Workers / Running Items detail tables
  • the dashboard now presents one stable tree:
    • task
    • playlist
    • song
  • task lifecycle transitions such as paused, completed, completed_with_errors, and canceled keep the same task node visible in Task Center instead of making the row disappear immediately
  • live refresh updates task nodes in place so expanded tasks and expanded playlists can remain open across refresh cycles

2026-04-18 Dashboard Maintenance: Local Duplicate Scan / Dedupe

  • Dashboard now includes a Maintenance card for local duplicate inspection.
  • Scan Duplicate Local Copies calls GET /api/maintenance/local-duplicates.
  • Run Local Dedupe calls POST /api/maintenance/local-duplicates/dedupe.
  • The scan groups active local duplicate rows by (file_asset_id, backend_id).
  • Keep rule priority:
    1. existing file wins
    2. non-(1) / non-(2) canonical locator wins
    3. shorter locator wins
    4. smaller file_locations.id wins
  • Dedupe execution updates references before inactivation:
    • repoint upload_tasks.source_location_id
    • repoint job_items.file_location_id
    • mark duplicate file_locations.status = 'inactive'
    • delete duplicate local files when they still exist on disk
    • refresh song_backend_presence
  • Safety guard:
    • dedupe is rejected with 409 while any job_runs.status = 'running' or job_items.status = 'running'
    • this avoids colliding with active download / upload execution
  • The dashboard renders results inline and does not jump away from the page.
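
The keep-rule priority listed above maps naturally onto a tuple sort key. This is a sketch, not the code in ops/maintenance.py: LocationRow, keep_priority, and choose_keeper are hypothetical names, and the regular expression generalizes the (1) / (2) suffix to any parenthesized number before the file extension.

```python
import re
from dataclasses import dataclass

# Matches a "(N)" copy suffix directly before the extension or at the very end.
_COPY_SUFFIX = re.compile(r"\(\d+\)(?=(\.[^./]+)?$)")


@dataclass
class LocationRow:
    id: int
    locator: str
    exists_on_disk: bool


def keep_priority(row):
    """Smaller tuples win: existing file, canonical name, shorter locator, lower id."""
    has_copy_suffix = bool(_COPY_SUFFIX.search(row.locator))
    return (not row.exists_on_disk, has_copy_suffix, len(row.locator), row.id)


def choose_keeper(rows):
    """Return the row to keep from one duplicate group."""
    return min(rows, key=keep_priority)
```

Every other row in the group would then be repointed and inactivated as described in the dedupe execution steps.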

2026-04-18 Playlist Export Pipeline Update

  • playlists/ directory generation is no longer triggered by sync.
  • CatalogSyncService.sync_playlist_row() now only handles playlist-song linking and play-count backfill.
  • Playlist export artifacts are refreshed from the download side for scoped playlist jobs:
    • download_only
    • sync_download
  • The runner refreshes export folders when an individual scoped playlist finishes downloading, instead of waiting for the whole download job to finish.
  • On runner restart / recovery, scoped download stages also backfill export folders for playlists whose items were already completed before the restart.
  • Stage-final export refresh is still kept as the last safety net, including the 0-pending-items case where all files already existed locally.
  • Existing single-playlist export remains available:
    • GET /api/playlists/{playlist_id}/export-folder
    • it refreshes the folder from current database state only
    • it does not auto-download missing songs
  • New bulk export API:
    • POST /api/playlists/export
    • routes selected playlists by current state
    • downloaded -> export immediately
    • unsynced -> create sync_download job
    • not_downloaded / partial / downloading -> create download_only job
  • Playlists page adds Export Selected Playlists:
    • already-downloaded playlists can be exported without re-downloading songs
    • not-yet-synced or not-yet-downloaded playlists are queued into the appropriate job automatically
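
The bulk export routing above can be summarized as a lookup from playlist state to action. The state and job type names come from the list; the function names, the "export_now" label, and the plan dictionary shape are illustrative assumptions rather than the API's actual response format.

```python
# State -> action table copied from the routing rules above.
EXPORT_ROUTES = {
    "downloaded": "export_now",
    "unsynced": "sync_download",
    "not_downloaded": "download_only",
    "partial": "download_only",
    "downloading": "download_only",
}


def plan_bulk_export(playlist_states):
    """Group playlist IDs by the action their current state routes to.

    playlist_states maps playlist_id -> state string (hypothetical input shape).
    """
    plan = {"export_now": [], "sync_download": [], "download_only": []}
    for playlist_id, state in playlist_states.items():
        plan[EXPORT_ROUTES[state]].append(playlist_id)
    return plan
```

A mixed selection therefore produces at most one immediate export batch plus one job per job type.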

2026-04-19 Local ZIP Export + Adaptive Download

  • Playlists page no longer shows a standalone Sync Then Download button.
  • Download Selected Playlists is now adaptive:
    • unsynced playlists are routed to sync_download
    • already-synced but incomplete playlists are routed to download_only
    • mixed selections may create both a download_job and a sync_download_job
    • already-downloaded playlists can be skipped without forcing a re-download
  • Export semantics now mean browser download to the operator's local machine:
    • modal Export downloads GET /api/playlists/{playlist_id}/export.zip
    • list Export Selected calls POST /api/playlists/export-zip
    • when every selected playlist is ready, the API returns status=ready plus download_url
    • when any selected playlist is not ready, the API returns status=queued plus job details instead of a partial ZIP
  • Prepared bundle downloads are served by:
    • GET /api/exports/bundles/{bundle_name}.zip
  • GET /api/playlists/{playlist_id}/export-folder remains available as an internal server-side folder refresh / inspection endpoint, but it is no longer the user-facing export action.