# Catalogsync Operations Console Design
## Goal
Extend `musicdl.catalogsync` with a NAS-local web operations console that can:
- manage queue-based pipeline jobs for `collect`, `sync`, `download`, and `upload`
- show playlist pool and playlist execution status as `未完成` (unfinished), `进行中` (in progress), `已完成` (completed), or `异常` (error)
- show worker-level live processing state, especially which song each worker is handling
- support global soft pause and resume across all active workers
- survive process crashes or NAS restarts without restarting the whole catalog from scratch
- allow retrying a single failed or interrupted song/item instead of rerunning the whole database
- manage `catalogsync.env` as the primary operator configuration source
This design targets an internal NAS console, not a public-facing multi-user product.
## Scope
### In Scope
- Add a NAS-local web console for `catalogsync`
- Add a database-backed job queue with at most one active job at a time
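- Support these job templates:
  - `全链路` (full pipeline)
  - `仅采集` (collect only)
  - `仅同步` (sync only)
  - `同步+下载` (sync + download)
  - `仅下载` (download only)
  - `仅上传` (upload only)
  - `下载+上传` (download + upload)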
- Track job, stage, item, and worker state in SQLite
- Show dashboard, queue, playlist pool, worker, log, and config views
- Implement soft pause and resume
- Implement crash-safe recovery at job-item granularity
- Implement single-item retry and force-retry
- Version and edit `catalogsync.env` from the web console
- Reuse existing `musicdl.catalogsync` collectors, services, downloader, uploader, and storage model as much as possible
### Out of Scope
- Multi-user login or permissions
- Public internet exposure or hardened auth
- Multiple active jobs running at the same time
- Cross-machine worker distribution
- Arbitrary user-defined stage graphs
- Provider-specific cloud drive management beyond current object storage support
- Automatic deletion of local or remote files
- Editing business data such as songs or playlists directly from the UI
## Constraints
- The console runs on the NAS itself
- `catalogsync.env` remains the configuration source of truth
- A queued job must freeze the required runtime settings into a config snapshot so later env edits do not mutate in-flight work
- Recovery must resume from unfinished work items instead of rerunning all songs or all playlists
- Existing `musicdl.catalogsync` CLI and scripts must remain usable
- The first version should prioritize operational stability, inspectability, and recoverability over architectural purity
## Operator Model
### Deployment Model
The web console runs on the same NAS host that already owns:
- the SQLite database
- the local music library
- the logs directory
- the runtime scripts
- the object storage configuration
This avoids a remote-control architecture for v1 and keeps job control, log access, file state, and recovery local.
### Configuration Model
`catalogsync.env` remains the operator-managed source of truth.
The console may:
- display current env values
- validate and save new env revisions
- apply a previous env revision as the current file
Queued jobs must store a `config_snapshot_json` copy of the relevant settings so:
- existing queued or running jobs stay deterministic
- later env edits only affect newly created jobs
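A minimal sketch of how a snapshot might be frozen when a job is queued. It assumes `catalogsync.env` uses plain `KEY=VALUE` lines; the helper names `parse_env_text` and `build_config_snapshot`, and the example keys, are hypothetical and not part of the existing package.

```python
import json


def parse_env_text(text: str) -> dict:
    """Parse simple KEY=VALUE lines; blank lines and '#' comments are skipped."""
    values = {}
    for raw_line in text.splitlines():
        line = raw_line.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        # Drop optional surrounding quotes around the value.
        values[key.strip()] = value.strip().strip('"').strip("'")
    return values


def build_config_snapshot(env_text: str, relevant_keys: list) -> str:
    """Return a JSON string holding only the settings a job needs.

    The result is meant to be stored in job_runs.config_snapshot_json so that
    later edits to the env file do not change a queued or running job.
    """
    parsed = parse_env_text(env_text)
    snapshot = {key: parsed[key] for key in relevant_keys if key in parsed}
    return json.dumps(snapshot, sort_keys=True, ensure_ascii=False)
```

Sorting keys keeps the stored JSON stable, which makes two snapshots easy to compare when diagnosing why two jobs behaved differently.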
## Recommended Architecture
Use four layers:
1. `Web Console`
   - browser UI for dashboards, queue control, logs, and config management
2. `Management API`
   - serves data and accepts job or config commands
3. `Job Orchestrator / Runner`
   - single-process scheduler that owns queue progression, pause, resume, and recovery
4. `Existing Catalogsync Executors`
   - reuse `collect`, `sync`, `download`, and `upload` behavior from current package modules
### Why Not A Thin Shell Wrapper
Wrapping only `download_all.sh` and `upload_all.sh` would not reliably provide:
- worker-level current song visibility
- item-level retry
- fine-grained recovery after process crashes
- stable soft pause and resume
The console therefore needs first-class job and work-item tables instead of depending only on raw shell output.
## Job Model
### Active Job Policy
- only one job may be `running` at a time
- additional jobs stay `queued`
- a paused job may later resume and reclaim the active slot
This keeps:
- pause and resume semantics simple
- resource ownership clear
- crash recovery easier to reason about
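One way the runner could enforce the single-active-job policy is sketched below. It makes two assumptions that the design does not state: a lower `priority` number runs first, and a job in `running` or `pause_requested` occupies the active slot while a `paused` job does not. The function name `promote_next_job` is hypothetical.

```python
import sqlite3
from typing import Optional


def promote_next_job(conn: sqlite3.Connection) -> Optional[int]:
    """Promote one queued job to running unless the active slot is taken.

    Returns the promoted job id, or None when nothing was promoted.
    """
    with conn:  # commits on success, rolls back on error
        active = conn.execute(
            "SELECT 1 FROM job_runs "
            "WHERE status IN ('running', 'pause_requested') LIMIT 1"
        ).fetchone()
        if active is not None:
            return None
        row = conn.execute(
            "SELECT id FROM job_runs WHERE status = 'queued' "
            "ORDER BY priority, id LIMIT 1"  # assumption: lower number first
        ).fetchone()
        if row is None:
            return None
        conn.execute(
            "UPDATE job_runs SET status = 'running', "
            "started_at = CURRENT_TIMESTAMP WHERE id = ?",
            (row[0],),
        )
        return row[0]
```

Because the runner is a single process, a plain check-then-update inside one transaction is sufficient here; a multi-process runner would need stronger locking.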
### Job Templates
Supported templates and stage chains:
- `全链路`
  - `collect -> sync -> download -> upload`
- `仅采集`
  - `collect`
- `仅同步`
  - `sync`
- `同步+下载`
  - `sync -> download`
- `仅下载`
  - `download`
- `仅上传`
  - `upload`
- `下载+上传`
  - `download -> upload`
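The template-to-stage mapping above could be held as a plain lookup table. The sketch below is illustrative only: `JOB_TEMPLATE_STAGES` and `plan_stages` are hypothetical names, and the returned dictionaries simply mirror the `stage_type`, `seq_no`, and `status` columns proposed for `job_stages`.

```python
# Template label -> ordered stage chain, as listed in this design.
JOB_TEMPLATE_STAGES = {
    "全链路": ("collect", "sync", "download", "upload"),
    "仅采集": ("collect",),
    "仅同步": ("sync",),
    "同步+下载": ("sync", "download"),
    "仅下载": ("download",),
    "仅上传": ("upload",),
    "下载+上传": ("download", "upload"),
}


def plan_stages(template: str) -> list:
    """Expand a template label into ordered job_stages rows (as dicts)."""
    if template not in JOB_TEMPLATE_STAGES:
        raise ValueError(f"unsupported job template: {template!r}")
    return [
        {"stage_type": stage, "seq_no": seq_no, "status": "pending"}
        for seq_no, stage in enumerate(JOB_TEMPLATE_STAGES[template], start=1)
    ]
```

Keeping the chains in one table also makes the "no arbitrary stage graphs" boundary explicit: a template that is not in the table is rejected.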
### Job Status
Recommended job statuses:
- `queued`
- `running`
- `pause_requested`
- `paused`
- `completed`
- `completed_with_errors`
- `failed`
- `canceled`
### Stage Status
Recommended stage statuses:
- `pending`
- `running`
- `pause_requested`
- `paused`
- `completed`
- `failed`
- `skipped`
### Work Item Status
Recommended item statuses:
- `pending`
- `running`
- `succeeded`
- `failed`
- `interrupted`
- `skipped`
- `canceled`
The work item is the recovery and retry granularity. This is what prevents a single failure from forcing a whole-catalog restart.
## Data Model
### Existing Table Reuse
Keep current business tables as the catalog truth:
- `playlist_pools`
- `playlists`
- `pool_playlists`
- `songs`
- `playlist_songs`
- `artists`
- `song_artists`
- `file_locations`
- `object_storage_backends`
These continue to answer:
- what playlists exist
- what songs belong to each playlist
- which files exist locally or remotely
The new console layer adds execution truth around them.
### New Table: `job_runs`
Purpose:
- represent one queued or active operator job
Recommended fields:
```text
id INTEGER PRIMARY KEY AUTOINCREMENT
job_type TEXT NOT NULL
status TEXT NOT NULL
priority INTEGER NOT NULL DEFAULT 100
requested_by TEXT
config_snapshot_json TEXT NOT NULL
sources TEXT
download_sources TEXT
playlist_scope_json TEXT
created_at TEXT DEFAULT CURRENT_TIMESTAMP
started_at TEXT
ended_at TEXT
last_error TEXT
resume_token TEXT
```
### New Table: `job_stages`
Purpose:
- track the stage-level execution status inside one job
Recommended fields:
```text
id INTEGER PRIMARY KEY AUTOINCREMENT
job_run_id INTEGER NOT NULL
stage_type TEXT NOT NULL
status TEXT NOT NULL DEFAULT 'pending'
seq_no INTEGER NOT NULL
total_items INTEGER NOT NULL DEFAULT 0
pending_items INTEGER NOT NULL DEFAULT 0
running_items INTEGER NOT NULL DEFAULT 0
success_items INTEGER NOT NULL DEFAULT 0
failed_items INTEGER NOT NULL DEFAULT 0
skipped_items INTEGER NOT NULL DEFAULT 0
started_at TEXT
ended_at TEXT
last_error TEXT
```
### New Table: `job_items`
Purpose:
- track the real execution unit for recovery and retry
Granularity by stage:
- `collect`
  - one pool/source fetch unit
- `sync`
  - one playlist expansion unit
- `download`
  - one song download unit
- `upload`
  - one file upload unit
Recommended fields:
```text
id INTEGER PRIMARY KEY AUTOINCREMENT
job_stage_id INTEGER NOT NULL
item_type TEXT NOT NULL
item_key TEXT NOT NULL
playlist_pool_id INTEGER
playlist_id INTEGER
song_id INTEGER
file_location_id INTEGER
status TEXT NOT NULL DEFAULT 'pending'
attempt_count INTEGER NOT NULL DEFAULT 0
max_attempts INTEGER NOT NULL DEFAULT 3
worker_id INTEGER
started_at TEXT
ended_at TEXT
last_error TEXT
last_error_code TEXT
payload_json TEXT
UNIQUE(job_stage_id, item_key)
```
### New Table: `job_workers`
Purpose:
- surface live worker state to the UI
- show which song each worker is processing
Recommended fields:
```text
id INTEGER PRIMARY KEY AUTOINCREMENT
job_run_id INTEGER NOT NULL
job_stage_id INTEGER
worker_name TEXT NOT NULL
status TEXT NOT NULL DEFAULT 'idle'
current_job_item_id INTEGER
current_song_id INTEGER
current_playlist_id INTEGER
current_display_text TEXT
heartbeat_at TEXT
last_progress_text TEXT
processed_count INTEGER NOT NULL DEFAULT 0
error_count INTEGER NOT NULL DEFAULT 0
```
### New Table: `job_commands`
Purpose:
- safely bridge UI actions and runner behavior
Recommended command types:
- `pause`
- `resume`
- `cancel`
- `retry_item`
- `force_retry_item`
Recommended fields:
```text
id INTEGER PRIMARY KEY AUTOINCREMENT
job_run_id INTEGER NOT NULL
command_type TEXT NOT NULL
target_item_id INTEGER
status TEXT NOT NULL DEFAULT 'pending'
created_at TEXT DEFAULT CURRENT_TIMESTAMP
applied_at TEXT
payload_json TEXT
```
### New Table: `job_events`
Purpose:
- structured audit trail for major runner events
Recommended event types include:
- `job_started`
- `stage_started`
- `item_started`
- `item_failed`
- `pause_requested`
- `resumed`
- `worker_heartbeat`
- `recovery_requeued`
### New Table: `job_logs`
Purpose:
- queryable log lines for the UI
Recommended fields:
```text
id INTEGER PRIMARY KEY AUTOINCREMENT
job_run_id INTEGER NOT NULL
job_stage_id INTEGER
worker_id INTEGER
level TEXT NOT NULL
message TEXT NOT NULL
created_at TEXT DEFAULT CURRENT_TIMESTAMP
```
### New Table: `config_revisions`
Purpose:
- keep revision history of `catalogsync.env`
Recommended fields:
```text
id INTEGER PRIMARY KEY AUTOINCREMENT
source_type TEXT NOT NULL DEFAULT 'env_file'
file_path TEXT NOT NULL
content_text TEXT NOT NULL
content_hash TEXT NOT NULL
created_at TEXT DEFAULT CURRENT_TIMESTAMP
applied_at TEXT
note TEXT
```
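A sketch of how a revision row might be written. The design does not name a hash algorithm, so SHA-256 is an assumption here, as is the choice to skip inserting a new row when the content matches the latest revision. `save_revision` is a hypothetical helper.

```python
import hashlib
import sqlite3
from typing import Optional


def save_revision(
    conn: sqlite3.Connection,
    file_path: str,
    content_text: str,
    note: Optional[str] = None,
) -> int:
    """Store one env revision and return its id.

    Assumption: identical content to the latest revision is not stored twice.
    """
    content_hash = hashlib.sha256(content_text.encode("utf-8")).hexdigest()
    latest = conn.execute(
        "SELECT id, content_hash FROM config_revisions ORDER BY id DESC LIMIT 1"
    ).fetchone()
    if latest is not None and latest[1] == content_hash:
        return latest[0]
    with conn:  # commit the insert as one transaction
        cur = conn.execute(
            "INSERT INTO config_revisions (file_path, content_text, content_hash, note) "
            "VALUES (?, ?, ?, ?)",
            (file_path, content_text, content_hash, note),
        )
    return cur.lastrowid
```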
## UI Design
### Page 1: Dashboard
Show:
- current active job
- queue length
- downloaded song count
- uploaded file count
- failed item count
- per-stage summaries
- recent exceptions
- worker heartbeat overview
### Page 2: Job Center
Show:
- queued jobs
- running or paused job
- job template
- scope
- stage progression
- pause, resume, cancel controls
Allow:
- creating a new job from the supported templates
- changing priority of queued jobs if desired
### Page 3: Playlist Pools
Show:
- all playlist pools and playlists
- source platform
- pool kind
- song count
- downloaded count
- uploaded count
- main status
- current stage
- last processed time
- latest error summary
#### Derived Playlist Status Rules
Recommend deriving the main status as:
- `异常` (error)
  - any recent failed item exists for the playlist
- `进行中` (in progress)
  - any running or pause-requested item exists
- `未完成` (unfinished)
  - unfinished items remain but the playlist is not actively processing
- `已完成` (completed)
  - no unfinished item remains in the relevant pipeline scope
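The derivation rules can be sketched as a small pure function. Several points are assumptions: the rules are evaluated in the order listed, "recent" is not modeled (the caller is expected to pass only the items it considers recent), and `pending` and `interrupted` count as unfinished. `pause_requested` is defined in this design as a job and stage status, so it is accepted here only in case the caller folds stage state into the input. `derive_playlist_status` is a hypothetical name.

```python
def derive_playlist_status(item_statuses) -> str:
    """Map the item statuses of one playlist to its main display status."""
    statuses = set(item_statuses)
    if "failed" in statuses:
        return "异常"  # error
    if statuses & {"running", "pause_requested"}:
        return "进行中"  # in progress
    if statuses & {"pending", "interrupted"}:
        return "未完成"  # unfinished, but not actively processing
    return "已完成"  # completed: nothing unfinished remains
```

For example, a playlist whose items are `succeeded` and `pending` reports `未完成`, while one with only `succeeded` and `skipped` items reports `已完成`.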
### Page 4: Song Processing
Show:
- each worker and its current song
- failed songs
- interrupted songs
- retryable items
Allow:
- retry single item
- force-retry single item
- filter by stage, platform, playlist, or error state
### Page 5: Logs And Exceptions
Show:
- structured events
- text logs
- job-level and item-level errors
- stack traces or HTTP error summaries where available
### Page 6: Config Management
Show:
- current `catalogsync.env`
- parsed effective values
- validation errors
- revision history
Allow:
- save a new env revision
- re-apply a previous revision
Rule:
- config edits affect only future jobs unless an explicit resume override is supplied
## API Surface
Recommended management endpoints:
- `GET /api/dashboard`
- `GET /api/jobs`
- `POST /api/jobs`
- `GET /api/jobs/{id}`
- `POST /api/jobs/{id}/pause`
- `POST /api/jobs/{id}/resume`
- `POST /api/jobs/{id}/cancel`
- `GET /api/jobs/{id}/items`
- `POST /api/job-items/{id}/retry`
- `POST /api/job-items/{id}/force-retry`
- `GET /api/workers`
- `GET /api/playlists`
- `GET /api/playlists/{id}`
- `GET /api/logs`
- `GET /api/config/env`
- `PUT /api/config/env`
- `GET /api/config/revisions`
- `POST /api/config/revisions/{id}/apply`
- `GET /api/events/stream`
`/api/events/stream` should use server-sent events so the dashboard and worker pages can refresh without polling every table separately.
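For reference, a server-sent event is a block of `field: value` lines terminated by a blank line. The helper below is a minimal sketch of formatting one event; the name `format_sse` and the payload shape are hypothetical, and the event name is taken from the `job_events` types listed earlier.

```python
import json
from typing import Optional


def format_sse(event: str, data: dict, event_id: Optional[int] = None) -> str:
    """Serialize one server-sent event as text for an event stream."""
    lines = []
    if event_id is not None:
        lines.append(f"id: {event_id}")
    lines.append(f"event: {event}")
    # json.dumps escapes newlines inside strings, so the payload is one line.
    lines.append(f"data: {json.dumps(data, ensure_ascii=False)}")
    return "\n".join(lines) + "\n\n"
```

A streaming endpoint would yield such strings with the `text/event-stream` content type; sending an `id` lets a reconnecting browser report the last event it saw.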
## Pause, Resume, And Recovery Rules
### Soft Pause
The only supported pause mode in v1 is soft pause.
Behavior:
- UI inserts a `pause` command
- the runner marks the job and current stage as `pause_requested`
- workers stop claiming new items
- any in-progress item is allowed to finish naturally
- once all workers are idle, the stage becomes `paused` and then the job becomes `paused`
This avoids half-written file state and keeps item completion boundaries clean.
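The soft-pause behavior can be illustrated with a worker loop that checks a shared flag only between items. This is a sketch under assumptions: an in-memory `queue.Queue` stands in for claiming rows from `job_items`, and `worker_loop` and its parameters are hypothetical.

```python
import queue
import threading


def worker_loop(items, pause_requested, process, done) -> None:
    """Process items until the queue is empty or a pause is requested.

    items: queue.Queue of item keys
    pause_requested: threading.Event set by the runner on a pause command
    process: callable handling one item
    done: list collecting finished item keys
    """
    while not pause_requested.is_set():  # checked only between items
        try:
            item = items.get_nowait()
        except queue.Empty:
            return
        process(item)  # an in-progress item always runs to completion
        done.append(item)
```

The flag is never consulted while `process` is running, which is what keeps item boundaries clean: an item either finishes or is never claimed.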
### Resume
Resume behavior:
- UI inserts a `resume` command
- the runner validates the job can continue
- the runner sets the paused stage and job back to `running`
- unstarted items stay `pending`
- succeeded items remain untouched
The resume action may optionally carry a limited override payload, such as a new library root after disk exhaustion.
### Crash Recovery
On runner startup:
1. find all jobs with status `running` or `pause_requested`
2. mark those jobs `paused`
3. find all `job_items` left in `running`
4. convert those items to `interrupted`
5. record a recovery event
After that:
- `succeeded` items remain done
- `pending` items remain pending
- `interrupted` items become eligible for retry or auto-requeue depending on stage policy
- `failed` items remain failed until explicit retry
This preserves progress without restarting the whole job or whole database.
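The startup steps could be applied in a single transaction, as sketched below. The `job_events` columns used here (`event_type`, `payload_json`) are assumptions because this design does not list fields for that table, and `recovery_requeued` is used only because it is the recovery-related event type named earlier. `recover_after_crash` is a hypothetical name.

```python
import json
import sqlite3


def recover_after_crash(conn: sqlite3.Connection) -> dict:
    """Pause jobs and interrupt items left mid-flight by a crashed runner."""
    with conn:  # all steps commit together or not at all
        jobs_paused = conn.execute(
            "UPDATE job_runs SET status = 'paused' "
            "WHERE status IN ('running', 'pause_requested')"
        ).rowcount
        items_interrupted = conn.execute(
            "UPDATE job_items SET status = 'interrupted' WHERE status = 'running'"
        ).rowcount
        summary = {"jobs_paused": jobs_paused, "items_interrupted": items_interrupted}
        # Assumed job_events columns; adjust to the real schema.
        conn.execute(
            "INSERT INTO job_events (event_type, payload_json) VALUES (?, ?)",
            ("recovery_requeued", json.dumps(summary)),
        )
    return summary
```

Items in `succeeded`, `pending`, or `failed` are deliberately not touched, which is what preserves prior progress.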
## Retry Rules
### Single Item Retry
When the operator clicks retry for a failed or interrupted item:
- insert a `retry_item` command into `job_commands`
- clear execution fields on the target item
- set status back to `pending`
- increment `attempt_count` on the next worker claim
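A sketch of the two halves of a retry: resetting the item, then counting the attempt when a worker claims it. Which columns count as "execution fields" is an assumption (here `worker_id`, `started_at`, and `ended_at`; `last_error` is kept for inspection). `apply_retry_item` and `claim_item` are hypothetical names.

```python
import sqlite3


def apply_retry_item(conn: sqlite3.Connection, item_id: int) -> bool:
    """Reset a failed or interrupted item to pending. Returns True if reset."""
    with conn:
        cur = conn.execute(
            "UPDATE job_items SET status = 'pending', worker_id = NULL, "
            "started_at = NULL, ended_at = NULL "
            "WHERE id = ? AND status IN ('failed', 'interrupted')",
            (item_id,),
        )
    return cur.rowcount == 1


def claim_item(conn: sqlite3.Connection, item_id: int, worker_id: int) -> bool:
    """Claim a pending item; the attempt is counted here, not at retry time."""
    with conn:
        cur = conn.execute(
            "UPDATE job_items SET status = 'running', worker_id = ?, "
            "attempt_count = attempt_count + 1, started_at = CURRENT_TIMESTAMP "
            "WHERE id = ? AND status = 'pending'",
            (worker_id, item_id),
        )
    return cur.rowcount == 1
```

Guarding both updates with a `status` condition makes them safe to repeat: retrying a succeeded item or claiming an already running one changes nothing and returns `False`.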
### Force Retry
Force retry is more aggressive:
- download stage may ignore an existing local mapping if the operator requests a fresh re-download
- upload stage may ignore an existing active remote mapping if the operator explicitly wants a re-upload
Force retry must stay item-scoped, never job-scoped.
## Disk Exhaustion Handling
If the downloader detects insufficient space:
- fail or interrupt the current download item
- pause the active job with a machine-readable reason such as `disk_full`
- surface a UI banner asking for a new library root override
After the operator supplies a new directory and clicks resume:
- the job continues only for unfinished items
- completed downloads are not restarted
- the currently failed song can be retried from scratch
This matches the requirement that one song may restart while the whole database must not restart.
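A pre-download space check might look like the sketch below, using the standard library's `shutil.disk_usage`. How the downloader learns the required size, and the optional safety reserve, are assumptions; `pause_reason_for_download` is a hypothetical name.

```python
import shutil
from typing import Optional


def pause_reason_for_download(
    library_root: str, required_bytes: int, reserve_bytes: int = 0
) -> Optional[str]:
    """Return 'disk_full' when the target volume lacks space, else None."""
    free = shutil.disk_usage(library_root).free
    if free < required_bytes + reserve_bytes:
        return "disk_full"  # machine-readable pause reason from this design
    return None
```

The check can only prevent some failures, since another process may fill the disk mid-download, so the downloader would still need to treat a write error as the same `disk_full` condition.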
## Execution Strategy
### Stage Executors
Implement separate executor paths for:
- `collect`
- `sync`
- `download`
- `upload`
Recommended concurrency:
- `collect`
  - low concurrency, v1 may stay serial
- `sync`
  - low concurrency, v1 may stay serial
- `download`
  - configurable worker pool
- `upload`
  - configurable worker pool
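A configurable worker pool can be sketched with `concurrent.futures.ThreadPoolExecutor`; setting `max_workers=1` gives the serial behavior suggested for `collect` and `sync`. The function `run_stage` and its handler contract (raise on failure) are hypothetical, and a real runner would persist statuses to `job_items` rather than return a dictionary.

```python
from concurrent.futures import ThreadPoolExecutor


def run_stage(item_keys, handler, max_workers: int = 1) -> dict:
    """Run handler(item_key) for every item and record a per-item outcome."""
    results = {}
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        futures = {key: pool.submit(handler, key) for key in item_keys}
        for key, future in futures.items():
            try:
                future.result()  # re-raises any exception from the handler
                results[key] = "succeeded"
            except Exception:
                # One failing item does not stop the rest of the stage.
                results[key] = "failed"
    return results
```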
### Reuse Strategy
Prefer reusing current catalogsync modules:
- `musicdl.catalogsync.services`
- `musicdl.catalogsync.downloader`
- `musicdl.catalogsync.uploader`
- `musicdl.catalogsync.repository`
The runner should orchestrate these modules rather than rewriting the domain logic from scratch.
## Technology Choice
### Backend
Recommended stack:
- `FastAPI`
- `Jinja2`
- `SQLite`
- `SSE` for live updates
### Frontend
Recommended rendering model:
- server-rendered pages with `Jinja2`
- `HTMX` for partial updates and action forms
- a small amount of vanilla JavaScript for log streaming and live worker refresh
Why this fits:
- NAS-local internal tool
- mainly operational tables and actions
- lower dependency and deployment complexity than a separate SPA
- easier to keep aligned with the existing Python-only project
## Verification Plan
The implementation should be verified at four levels:
1. unit tests
   - state transitions
   - retry rules
   - recovery transforms
2. API integration tests
   - job creation
   - pause and resume
   - item retry
   - config revision flow
3. fault injection tests
   - kill the runner mid-download and confirm item-level recovery
4. NAS smoke tests
   - create jobs
   - pause and resume
   - crash and restart
   - retry a single failed song
   - change library directory after disk-full pause
## V1 Delivery Boundary
### Must Ship In V1
- queue-based single-active-job runner
- supported job templates
- dashboard, job center, playlist pools, song processing, logs, and config pages
- soft pause and resume
- crash-safe item-level recovery
- single-item retry and force-retry
- env revision history and apply flow
### Explicitly Deferred
- authentication
- multi-user permissions
- multiple active jobs
- distributed workers
- arbitrary stage composition
- automatic endless retries
- destructive file cleanup actions
## Open Follow-Up Items
Two source-coverage follow-ups remain outside this console design and should stay tracked separately:
- redeploy the local Kuwo toplist fallback fix to the NAS and backfill the missing collection or sync results
- repair QQ playlist square collection after the old endpoint started returning `parameter failed`
These belong to operational backlog work, not to the web console architecture itself.