Catalogsync Operations Console Design
Goal
Extend musicdl.catalogsync with a NAS-local web operations console that can:
- manage queue-based pipeline jobs for collect, sync, download, and upload
- show playlist pool and playlist execution status as Not finished (未完成) / In progress (进行中) / Completed (已完成) / Error (异常)
- show worker-level live processing state, especially which song each worker is handling
- support global soft pause and resume across all active workers
- survive process crashes or NAS restarts without restarting the whole catalog from scratch
- allow retrying a single failed or interrupted song/item instead of rerunning the whole database
- manage catalogsync.env as the primary operator configuration source
This design targets an internal NAS console, not a public-facing multi-user product.
Scope
In Scope
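- Add a NAS-local web console for catalogsync
- Add a database-backed job queue with exactly one active job at a time
- Support these job templates: Full pipeline (全链路), Collect only (仅采集), Sync only (仅同步), Sync + download (同步+下载), Download only (仅下载), Upload only (仅上传), Download + upload (下载+上传)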
- Track job, stage, item, and worker state in SQLite
- Show dashboard, queue, playlist pool, worker, log, and config views
- Implement soft pause and resume
- Implement crash-safe recovery at job-item granularity
- Implement single-item retry and force-retry
- Version and edit catalogsync.env from the web console
- Reuse existing musicdl.catalogsync collectors, services, downloader, uploader, and storage model as much as possible
Out of Scope
- Multi-user login or permissions
- Public internet exposure or hardened auth
- Multiple active jobs running at the same time
- Cross-machine worker distribution
- Arbitrary user-defined stage graphs
- Provider-specific cloud drive management beyond current object storage support
- Automatic deletion of local or remote files
- Editing business data such as songs or playlists directly from the UI
Constraints
- The console runs on the NAS itself
- catalogsync.env remains the configuration source of truth
- A queued job must freeze the required runtime settings into a config snapshot so later env edits do not mutate in-flight work
- Recovery must resume from unfinished work items instead of rerunning all songs or all playlists
- Existing musicdl.catalogsync CLI and scripts must remain usable
- The first version should optimize for operational stability, inspectability, and recoverability over architecture purity
Operator Model
Deployment Model
The web console runs on the same NAS host that already owns:
- the SQLite database
- the local music library
- the logs directory
- the runtime scripts
- the object storage configuration
This avoids a remote-control architecture for v1 and keeps job control, log access, file state, and recovery local.
Configuration Model
catalogsync.env remains the operator-managed source of truth.
The console may:
- display current env values
- validate and save new env revisions
- apply a previous env revision as the current file
Queued jobs must store a config_snapshot_json copy of the relevant settings so:
- existing queued or running jobs stay deterministic
- later env edits only affect newly created jobs
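The snapshot step could look roughly like the sketch below. It assumes catalogsync.env uses plain KEY=VALUE lines, and the key names in SNAPSHOT_KEYS are hypothetical placeholders, not settings confirmed by this design.

```python
import json

# Hypothetical key names; the real list depends on what catalogsync.env defines.
SNAPSHOT_KEYS = ("LIBRARY_ROOT", "DOWNLOAD_WORKERS", "UPLOAD_WORKERS")


def parse_env(text):
    """Parse simple KEY=VALUE lines, skipping blank lines and # comments."""
    values = {}
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        values[key.strip()] = value.strip().strip('"').strip("'")
    return values


def build_config_snapshot(env_text, keys=SNAPSHOT_KEYS):
    """Return the JSON text to store in job_runs.config_snapshot_json."""
    env = parse_env(env_text)
    frozen = {key: env[key] for key in keys if key in env}
    return json.dumps(frozen, sort_keys=True)
```

Because the snapshot is taken at job creation, a later edit to the env file changes only what the next call returns, not the JSON already stored on queued or running jobs.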
Recommended Architecture
Use four layers:
- Web Console: browser UI for dashboards, queue control, logs, and config management
- Management API: serves data and accepts job or config commands
- Job Orchestrator / Runner: single-process scheduler that owns queue progression, pause, resume, and recovery
- Existing Catalogsync Executors: reuse collect, sync, download, and upload behavior from current package modules
Why Not A Thin Shell Wrapper
Wrapping only download_all.sh and upload_all.sh would not reliably provide:
- worker-level current song visibility
- item-level retry
- fine-grained recovery after process crashes
- stable soft pause and resume
The console therefore needs first-class job and work-item tables instead of depending only on raw shell output.
Job Model
Active Job Policy
- only one job may be running at a time
- additional jobs stay queued
- a paused job may later resume and reclaim the active slot
This keeps:
- pause and resume semantics simple
- resource ownership clear
- crash recovery easier to reason about
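One way the runner might enforce the single active slot is sketched below. Two points are assumptions rather than part of this design: a lower priority number is served first, and only running or pause_requested jobs count as occupying the slot.

```python
import sqlite3


def start_next_job(conn):
    """Promote one queued job to running if no job holds the active slot.

    Returns the promoted job id, or None when nothing was started.
    """
    with conn:  # commit on success, roll back on exception
        active = conn.execute(
            "SELECT 1 FROM job_runs "
            "WHERE status IN ('running', 'pause_requested') LIMIT 1"
        ).fetchone()
        if active is not None:
            return None
        # Assumption: smaller priority value wins, ties broken by insertion order.
        row = conn.execute(
            "SELECT id FROM job_runs WHERE status = 'queued' "
            "ORDER BY priority, id LIMIT 1"
        ).fetchone()
        if row is None:
            return None
        conn.execute(
            "UPDATE job_runs SET status = 'running', "
            "started_at = CURRENT_TIMESTAMP WHERE id = ?",
            (row[0],),
        )
        return row[0]
```

Since the runner is a single process, the check and the update do not need cross-process locking in this sketch.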
Job Templates
Supported templates and stage chains:
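- Full pipeline (全链路): collect -> sync -> download -> upload
- Collect only (仅采集): collect
- Sync only (仅同步): sync
- Sync + download (同步+下载): sync -> download
- Download only (仅下载): download
- Upload only (仅上传): upload
- Download + upload (下载+上传): download -> upload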
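The template-to-stage mapping above can be held in a small lookup table. In the sketch below the template keys (full_pipeline and so on) are hypothetical job_type values chosen for illustration; only the stage names come from this design.

```python
# Hypothetical job_type keys mapped to the stage chains listed above.
STAGE_CHAINS = {
    "full_pipeline": ("collect", "sync", "download", "upload"),
    "collect_only": ("collect",),
    "sync_only": ("sync",),
    "sync_download": ("sync", "download"),
    "download_only": ("download",),
    "upload_only": ("upload",),
    "download_upload": ("download", "upload"),
}


def plan_stages(job_type):
    """Return (seq_no, stage_type) pairs to insert into job_stages."""
    try:
        chain = STAGE_CHAINS[job_type]
    except KeyError:
        raise ValueError(f"unknown job template: {job_type!r}") from None
    return list(enumerate(chain, start=1))
```

Keeping the chains in a fixed table fits the out-of-scope rule on arbitrary user-defined stage graphs: a job can only be created from one of the known templates.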
Job Status
Recommended job statuses:
- queued
- running
- pause_requested
- paused
- completed
- completed_with_errors
- failed
- canceled
Stage Status
Recommended stage statuses:
- pending
- running
- pause_requested
- paused
- completed
- failed
- skipped
Work Item Status
Recommended item statuses:
- pending
- running
- succeeded
- failed
- interrupted
- skipped
- canceled
The work item is the recovery and retry granularity. This is what prevents a single failure from forcing a whole-catalog restart.
Data Model
Existing Table Reuse
Keep current business tables as the catalog truth:
- playlist_pools
- playlists
- pool_playlists
- songs
- playlist_songs
- artists
- song_artists
- file_locations
- object_storage_backends
These continue to answer:
- what playlists exist
- what songs belong to each playlist
- which files exist locally or remotely
The new console layer adds execution truth around them.
New Table: job_runs
Purpose:
- represent one queued or active operator job
Recommended fields:
id INTEGER PRIMARY KEY AUTOINCREMENT
job_type TEXT NOT NULL
status TEXT NOT NULL
priority INTEGER NOT NULL DEFAULT 100
requested_by TEXT
config_snapshot_json TEXT NOT NULL
sources TEXT
download_sources TEXT
playlist_scope_json TEXT
created_at TEXT DEFAULT CURRENT_TIMESTAMP
started_at TEXT
ended_at TEXT
last_error TEXT
resume_token TEXT
New Table: job_stages
Purpose:
- track the stage-level execution status inside one job
Recommended fields:
id INTEGER PRIMARY KEY AUTOINCREMENT
job_run_id INTEGER NOT NULL
stage_type TEXT NOT NULL
status TEXT NOT NULL DEFAULT 'pending'
seq_no INTEGER NOT NULL
total_items INTEGER NOT NULL DEFAULT 0
pending_items INTEGER NOT NULL DEFAULT 0
running_items INTEGER NOT NULL DEFAULT 0
success_items INTEGER NOT NULL DEFAULT 0
failed_items INTEGER NOT NULL DEFAULT 0
skipped_items INTEGER NOT NULL DEFAULT 0
started_at TEXT
ended_at TEXT
last_error TEXT
New Table: job_items
Purpose:
- track the real execution unit for recovery and retry
Granularity by stage:
- collect: one pool/source fetch unit
- sync: one playlist expansion unit
- download: one song download unit
- upload: one file upload unit
Recommended fields:
id INTEGER PRIMARY KEY AUTOINCREMENT
job_stage_id INTEGER NOT NULL
item_type TEXT NOT NULL
item_key TEXT NOT NULL
playlist_pool_id INTEGER
playlist_id INTEGER
song_id INTEGER
file_location_id INTEGER
status TEXT NOT NULL DEFAULT 'pending'
attempt_count INTEGER NOT NULL DEFAULT 0
max_attempts INTEGER NOT NULL DEFAULT 3
worker_id INTEGER
started_at TEXT
ended_at TEXT
last_error TEXT
last_error_code TEXT
payload_json TEXT
UNIQUE(job_stage_id, item_key)
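As an illustration, the fields above translate directly into SQLite DDL, and a worker can claim work with a guarded update. The claim_next_item helper is a hypothetical name, and incrementing attempt_count at claim time is one reading of the retry rules, not a fixed implementation.

```python
import sqlite3

JOB_ITEMS_DDL = """
CREATE TABLE IF NOT EXISTS job_items (
    id INTEGER PRIMARY KEY AUTOINCREMENT,
    job_stage_id INTEGER NOT NULL,
    item_type TEXT NOT NULL,
    item_key TEXT NOT NULL,
    playlist_pool_id INTEGER,
    playlist_id INTEGER,
    song_id INTEGER,
    file_location_id INTEGER,
    status TEXT NOT NULL DEFAULT 'pending',
    attempt_count INTEGER NOT NULL DEFAULT 0,
    max_attempts INTEGER NOT NULL DEFAULT 3,
    worker_id INTEGER,
    started_at TEXT,
    ended_at TEXT,
    last_error TEXT,
    last_error_code TEXT,
    payload_json TEXT,
    UNIQUE(job_stage_id, item_key)
)
"""


def claim_next_item(conn, job_stage_id, worker_id):
    """Move one pending item to running and return its id, or None."""
    with conn:  # commit on success, roll back on exception
        row = conn.execute(
            "SELECT id FROM job_items "
            "WHERE job_stage_id = ? AND status = 'pending' ORDER BY id LIMIT 1",
            (job_stage_id,),
        ).fetchone()
        if row is None:
            return None
        # The status guard keeps the claim safe if another worker got there first.
        cur = conn.execute(
            "UPDATE job_items SET status = 'running', worker_id = ?, "
            "attempt_count = attempt_count + 1, started_at = CURRENT_TIMESTAMP "
            "WHERE id = ? AND status = 'pending'",
            (worker_id, row[0]),
        )
        return row[0] if cur.rowcount == 1 else None
```

The UNIQUE(job_stage_id, item_key) constraint means re-planning a stage cannot create a second row for the same song or file, which is what keeps retries item-scoped.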
New Table: job_workers
Purpose:
- surface live worker state to the UI
- show which song each worker is processing
Recommended fields:
id INTEGER PRIMARY KEY AUTOINCREMENT
job_run_id INTEGER NOT NULL
job_stage_id INTEGER
worker_name TEXT NOT NULL
status TEXT NOT NULL DEFAULT 'idle'
current_job_item_id INTEGER
current_song_id INTEGER
current_playlist_id INTEGER
current_display_text TEXT
heartbeat_at TEXT
last_progress_text TEXT
processed_count INTEGER NOT NULL DEFAULT 0
error_count INTEGER NOT NULL DEFAULT 0
New Table: job_commands
Purpose:
- safely bridge UI actions and runner behavior
Recommended command types:
- pause
- resume
- cancel
- retry_item
- force_retry_item
Recommended fields:
id INTEGER PRIMARY KEY AUTOINCREMENT
job_run_id INTEGER NOT NULL
command_type TEXT NOT NULL
target_item_id INTEGER
status TEXT NOT NULL DEFAULT 'pending'
created_at TEXT DEFAULT CURRENT_TIMESTAMP
applied_at TEXT
payload_json TEXT
New Table: job_events
Purpose:
- structured audit trail for major runner events
Recommended event types include:
- job_started
- stage_started
- item_started
- item_failed
- pause_requested
- resumed
- worker_heartbeat
- recovery_requeued
New Table: job_logs
Purpose:
- queryable log lines for the UI
Recommended fields:
id INTEGER PRIMARY KEY AUTOINCREMENT
job_run_id INTEGER NOT NULL
job_stage_id INTEGER
worker_id INTEGER
level TEXT NOT NULL
message TEXT NOT NULL
created_at TEXT DEFAULT CURRENT_TIMESTAMP
New Table: config_revisions
Purpose:
- keep revision history of catalogsync.env
Recommended fields:
id INTEGER PRIMARY KEY AUTOINCREMENT
source_type TEXT NOT NULL DEFAULT 'env_file'
file_path TEXT NOT NULL
content_text TEXT NOT NULL
content_hash TEXT NOT NULL
created_at TEXT DEFAULT CURRENT_TIMESTAMP
applied_at TEXT
note TEXT
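A minimal sketch of saving a revision into this table follows. SHA-256 is an assumed choice for content_hash (the design does not name an algorithm), and save_revision is a hypothetical helper.

```python
import hashlib
import sqlite3

CONFIG_REVISIONS_DDL = """
CREATE TABLE IF NOT EXISTS config_revisions (
    id INTEGER PRIMARY KEY AUTOINCREMENT,
    source_type TEXT NOT NULL DEFAULT 'env_file',
    file_path TEXT NOT NULL,
    content_text TEXT NOT NULL,
    content_hash TEXT NOT NULL,
    created_at TEXT DEFAULT CURRENT_TIMESTAMP,
    applied_at TEXT,
    note TEXT
)
"""


def save_revision(conn, file_path, content_text, note=None):
    """Store one env revision and return (revision_id, content_hash)."""
    # Assumption: SHA-256 over the UTF-8 bytes of the file content.
    content_hash = hashlib.sha256(content_text.encode("utf-8")).hexdigest()
    with conn:
        cur = conn.execute(
            "INSERT INTO config_revisions "
            "(file_path, content_text, content_hash, note) VALUES (?, ?, ?, ?)",
            (file_path, content_text, content_hash, note),
        )
    return cur.lastrowid, content_hash
```

Storing the hash alongside the text lets the console detect that a save is identical to the latest revision before writing a duplicate row.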
UI Design
Page 1: Dashboard
Show:
- current active job
- queue length
- downloaded song count
- uploaded file count
- failed item count
- per-stage summaries
- recent exceptions
- worker heartbeat overview
Page 2: Job Center
Show:
- queued jobs
- running or paused job
- job template
- scope
- stage progression
- pause, resume, cancel controls
Allow:
- creating a new job from the supported templates
- changing priority of queued jobs if desired
Page 3: Playlist Pools
Show:
- all playlist pools and playlists
- source platform
- pool kind
- song count
- downloaded count
- uploaded count
- main status
- current stage
- last processed time
- latest error summary
Derived Playlist Status Rules
Recommend deriving the main status as:
- Error (异常): any recent failed item exists for the playlist
- In progress (进行中): any running or pause-requested item exists
- Not finished (未完成): unfinished items remain but the playlist is not actively processing
- Completed (已完成): no unfinished item remains in the relevant pipeline scope
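These rules can be expressed as a small pure function. The sketch assumes the listed order is also the precedence order, that the caller has already limited the input to recent items in the relevant pipeline scope, and that a pause-requested stage still reports its in-flight items as running. The returned keys are hypothetical identifiers for the four labels.

```python
UNFINISHED = {"pending", "running", "interrupted"}


def derive_playlist_status(item_statuses):
    """Map a playlist's job item statuses to one main status key."""
    statuses = set(item_statuses)
    if "failed" in statuses:
        return "error"
    if "running" in statuses:
        return "in_progress"
    if statuses & UNFINISHED:
        return "unfinished"
    return "completed"
```

For example, a playlist with one failed and one running item reports error, because the failure rule is checked first.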
Page 4: Song Processing
Show:
- each worker and its current song
- failed songs
- interrupted songs
- retryable items
Allow:
- retry single item
- force-retry single item
- filter by stage, platform, playlist, or error state
Page 5: Logs And Exceptions
Show:
- structured events
- text logs
- job-level and item-level errors
- stack traces or HTTP error summaries where available
Page 6: Config Management
Show:
- current catalogsync.env
- parsed effective values
- validation errors
- revision history
Allow:
- save a new env revision
- re-apply a previous revision
Rule:
- config edits affect only future jobs unless an explicit resume override is supplied
API Surface
Recommended management endpoints:
- GET /api/dashboard
- GET /api/jobs
- POST /api/jobs
- GET /api/jobs/{id}
- POST /api/jobs/{id}/pause
- POST /api/jobs/{id}/resume
- POST /api/jobs/{id}/cancel
- GET /api/jobs/{id}/items
- POST /api/job-items/{id}/retry
- POST /api/job-items/{id}/force-retry
- GET /api/workers
- GET /api/playlists
- GET /api/playlists/{id}
- GET /api/logs
- GET /api/config/env
- PUT /api/config/env
- GET /api/config/revisions
- POST /api/config/revisions/{id}/apply
- GET /api/events/stream
/api/events/stream should use server-sent events so the dashboard and worker pages can refresh without polling every table separately.
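For reference, a server-sent event is a block of text fields terminated by a blank line. The helper below is a framework-independent sketch of that framing; format_sse is a hypothetical name and the payload shape is illustrative only.

```python
import json


def format_sse(event, data, event_id=None):
    """Serialize one server-sent event frame as text."""
    lines = []
    if event_id is not None:
        lines.append(f"id: {event_id}")
    lines.append(f"event: {event}")
    # json.dumps without indent emits no raw newlines, so one data line suffices.
    lines.append(f"data: {json.dumps(data, ensure_ascii=False)}")
    # A blank line terminates the frame.
    return "\n".join(lines) + "\n\n"
```

The stream endpoint would yield such frames with the text/event-stream content type, and the browser side can subscribe per event name, for example to worker_heartbeat.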
Pause, Resume, And Recovery Rules
Soft Pause
The only supported pause mode in v1 is soft pause.
Behavior:
- UI inserts a pause command
- the runner marks the job and current stage as pause_requested
- workers stop claiming new items
- any in-progress item is allowed to finish naturally
- once all workers are idle, the stage becomes paused and then the job becomes paused
This avoids half-written file state and keeps item completion boundaries clean.
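The key property is that the pause flag is checked only between items, never during one. A simplified in-memory sketch of that worker loop follows; it uses a deque and a threading.Event as stand-ins for the job_items table and the pause_requested status.

```python
import threading
from collections import deque


def run_worker(pending, pause_requested, process, done):
    """Process items until the queue drains or a pause is requested.

    The pause flag is only consulted before claiming, so an item that has
    started always runs to completion.
    """
    while not pause_requested.is_set():
        try:
            item = pending.popleft()  # claim the next item
        except IndexError:
            return "drained"
        process(item)  # never interrupted mid-item
        done.append(item)
    return "paused"
```

When every worker has returned "paused", the runner can move the stage and then the job to paused; unclaimed items are still in the queue for resume.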
Resume
Resume behavior:
- UI inserts a resume command
- the runner validates the job can continue
- the runner resets paused stage and job state back to running
- unstarted items stay pending
- succeeded items remain untouched
The resume action may optionally carry a limited override payload, such as a new library root after disk exhaustion.
Crash Recovery
On runner startup:
- find all jobs with status running or pause_requested
- mark those jobs paused
- find all job_items left in running
- convert those items to interrupted
- record a recovery event
After that:
- succeeded items remain done
- pending items remain pending
- interrupted items become eligible for retry or auto-requeue depending on stage policy
- failed items remain failed until explicit retry
This preserves progress without restarting the whole job or whole database.
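The startup steps map onto two UPDATE statements and one event insert, run in a single transaction. In the sketch below the job_events columns (event_type, payload_json) are assumptions, since this design lists event types but not that table's fields.

```python
import json
import sqlite3


def recover_after_crash(conn):
    """Pause orphaned jobs and mark in-flight items as interrupted.

    Returns (jobs_paused, items_interrupted).
    """
    with conn:  # all three statements commit together or not at all
        jobs = conn.execute(
            "UPDATE job_runs SET status = 'paused' "
            "WHERE status IN ('running', 'pause_requested')"
        ).rowcount
        items = conn.execute(
            "UPDATE job_items SET status = 'interrupted' WHERE status = 'running'"
        ).rowcount
        # Assumed job_events columns: event_type, payload_json.
        conn.execute(
            "INSERT INTO job_events (event_type, payload_json) VALUES (?, ?)",
            ("recovery_requeued", json.dumps({"jobs": jobs, "items": items})),
        )
    return jobs, items
```

Succeeded, pending, and failed rows are not touched by either UPDATE, which is what preserves earlier progress.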
Retry Rules
Single Item Retry
When the operator clicks retry for a failed or interrupted item:
- insert job_commands.retry_item
- clear execution fields on the target item
- set status back to pending
- increment attempt_count on the next worker claim
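On the runner side, applying a retry_item command could reduce to one guarded UPDATE. Which columns count as "execution fields" is an assumption here (worker_id, timestamps, and error fields), and retry_item is a hypothetical helper name.

```python
import sqlite3


def retry_item(conn, item_id):
    """Reset a failed or interrupted item to pending; return True if reset."""
    with conn:
        cur = conn.execute(
            "UPDATE job_items SET status = 'pending', worker_id = NULL, "
            "started_at = NULL, ended_at = NULL, "
            "last_error = NULL, last_error_code = NULL "
            "WHERE id = ? AND status IN ('failed', 'interrupted')",
            (item_id,),
        )
    # attempt_count is left alone; it increases when a worker next claims the item.
    return cur.rowcount == 1
```

The status guard makes the command a no-op for succeeded or running items, so a stale click in the UI cannot reopen finished work.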
Force Retry
Force retry is more aggressive:
- download stage may ignore an existing local mapping if the operator requests a fresh re-download
- upload stage may ignore an existing active remote mapping if the operator explicitly wants a re-upload
Force retry must stay item-scoped, never job-scoped.
Disk Exhaustion Handling
If the downloader detects insufficient space:
- fail or interrupt the current download item
- pause the active job with a machine-readable reason such as disk_full
- surface a UI banner asking for a new library root override
After the operator supplies a new directory and clicks resume:
- the job continues only for unfinished items
- completed downloads are not restarted
- the currently failed song can be retried from scratch
This matches the requirement that one song may restart while the whole database must not restart.
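A pre-download space check is one way the downloader might detect this condition. The sketch uses the standard-library shutil.disk_usage; DiskFullError and ensure_free_space are hypothetical names, and how the required size is estimated is left to the caller.

```python
import shutil


class DiskFullError(Exception):
    """Raised when the library root lacks space for the next download."""

    reason = "disk_full"  # machine-readable pause reason


def ensure_free_space(path, required_bytes):
    """Return free bytes at path, or raise DiskFullError if below the need."""
    free = shutil.disk_usage(path).free
    if free < required_bytes:
        raise DiskFullError(
            f"need {required_bytes} bytes at {path}, only {free} free"
        )
    return free
```

The runner can catch DiskFullError, mark the current item failed or interrupted, and pause the job with the exception's reason value.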
Execution Strategy
Stage Executors
Implement separate executor paths for:
- collect
- sync
- download
- upload
Recommended concurrency:
- collect: low concurrency, v1 may stay serial
- sync: low concurrency, v1 may stay serial
- download: configurable worker pool
- upload: configurable worker pool
Reuse Strategy
Prefer reusing current catalogsync modules:
- musicdl.catalogsync.services
- musicdl.catalogsync.downloader
- musicdl.catalogsync.uploader
- musicdl.catalogsync.repository
The runner should orchestrate these modules rather than rewriting the domain logic from scratch.
Technology Choice
Backend
Recommended stack:
- FastAPI
- Jinja2
- SQLite
- SSE for live updates
Frontend
Recommended rendering model:
- server-rendered pages with Jinja2
- HTMX for partial updates and action forms
- a small amount of vanilla JavaScript for log streaming and live worker refresh
Why this fits:
- NAS-local internal tool
- mainly operational tables and actions
- lower dependency and deployment complexity than a separate SPA
- easier to keep aligned with the existing Python-only project
Verification Plan
The implementation should be verified at four levels:
- unit tests
  - state transitions
  - retry rules
  - recovery transforms
- API integration tests
  - job creation
  - pause and resume
  - item retry
  - config revision flow
- fault injection tests
  - kill the runner mid-download and confirm item-level recovery
- NAS smoke tests
  - create jobs
  - pause and resume
  - crash and restart
  - retry a single failed song
  - change library directory after disk-full pause
V1 Delivery Boundary
Must Ship In V1
- queue-based single-active-job runner
- supported job templates
- dashboard, job center, playlist pools, song processing, logs, and config pages
- soft pause and resume
- crash-safe item-level recovery
- single-item retry and force-retry
- env revision history and apply flow
Explicitly Deferred
- authentication
- multi-user permissions
- multiple active jobs
- distributed workers
- arbitrary stage composition
- automatic endless retries
- destructive file cleanup actions
Open Follow-Up Items
Two source-coverage follow-ups remain outside this console design and should stay tracked separately:
- redeploy the local Kuwo toplist fallback fix to the NAS and backfill the missing collection or sync results
- repair QQ playlist square collection after the old endpoint started returning "parameter failed"
These belong to operational backlog work, not to the web console architecture itself.