# Catalogsync Operations Console Design

## Goal

Extend `musicdl.catalogsync` with a NAS-local web operations console that can:

- manage queue-based pipeline jobs for `collect`, `sync`, `download`, and `upload`
- show playlist pool and playlist execution status as `未完成 / 进行中 / 已完成 / 异常` (incomplete / in progress / completed / error)
- show worker-level live processing state, especially which song each worker is handling
- support global soft pause and resume across all active workers
- survive process crashes or NAS restarts without restarting the whole catalog from scratch
- allow retrying a single failed or interrupted song/item instead of rerunning the whole database
- manage `catalogsync.env` as the primary operator configuration source

This design targets an internal NAS console, not a public-facing multi-user product.

## Scope

### In Scope

- Add a NAS-local web console for `catalogsync`
- Add a database-backed job queue with exactly one active job at a time
- Support these job templates:
  - `全链路` (full pipeline)
  - `仅采集` (collect only)
  - `仅同步` (sync only)
  - `同步+下载` (sync + download)
  - `仅下载` (download only)
  - `仅上传` (upload only)
  - `下载+上传` (download + upload)
- Track job, stage, item, and worker state in SQLite
- Show dashboard, queue, playlist pool, worker, log, and config views
- Implement soft pause and resume
- Implement crash-safe recovery at job-item granularity
- Implement single-item retry and force-retry
- Version and edit `catalogsync.env` from the web console
- Reuse existing `musicdl.catalogsync` collectors, services, downloader, uploader, and storage model as much as possible

### Out of Scope

- Multi-user login or permissions
- Public internet exposure or hardened auth
- Multiple active jobs running at the same time
- Cross-machine worker distribution
- Arbitrary user-defined stage graphs
- Provider-specific cloud drive management beyond current object storage support
- Automatic deletion of local or remote files
- Editing business data such as songs or playlists directly from the UI

## Constraints

- The console runs on the NAS itself
- `catalogsync.env` remains the configuration source of truth
- A queued job must freeze the required runtime settings into a config snapshot so later env edits do not mutate in-flight work
- Recovery must resume from unfinished work items instead of rerunning all songs or all playlists
- Existing `musicdl.catalogsync` CLI and scripts must remain usable
- The first version should optimize for operational stability, inspectability, and recoverability over architecture purity

## Operator Model

### Deployment Model

The web console runs on the same NAS host that already owns:

- the SQLite database
- the local music library
- the logs directory
- the runtime scripts
- the object storage configuration

This avoids a remote-control architecture for v1 and keeps job control, log access, file state, and recovery local.

### Configuration Model

`catalogsync.env` remains the operator-managed source of truth.

The console may:

- display current env values
- validate and save new env revisions
- apply a previous env revision as the current file

Queued jobs must store a `config_snapshot_json` copy of the relevant settings so:

- existing queued or running jobs stay deterministic
- later env edits only affect newly created jobs

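The snapshot rule above can be illustrated with a minimal sketch. It assumes `catalogsync.env` uses plain `KEY=VALUE` lines; the key names in `SNAPSHOT_KEYS` and both function names are hypothetical, since this design does not list which settings a job depends on.

```python
import json

# Hypothetical allow-list: the real set of job-relevant keys is not defined in this design.
SNAPSHOT_KEYS = ("LIBRARY_ROOT", "DOWNLOAD_WORKERS", "UPLOAD_WORKERS")


def parse_env(text: str) -> dict:
    """Parse simple KEY=VALUE lines, skipping blank lines and # comments."""
    values = {}
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        values[key.strip()] = value.strip().strip('"').strip("'")
    return values


def build_config_snapshot(env_text: str, keys=SNAPSHOT_KEYS) -> str:
    """Return the JSON string a new job would store in config_snapshot_json."""
    env = parse_env(env_text)
    snapshot = {key: env[key] for key in keys if key in env}
    return json.dumps(snapshot, sort_keys=True, ensure_ascii=False)
```

Because the snapshot is a string captured at job creation, a later edit to the env file cannot change what a queued or running job sees; only jobs created afterwards pick up the new values.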
## Recommended Architecture

Use four layers:

1. `Web Console`
   - browser UI for dashboards, queue control, logs, and config management
2. `Management API`
   - serves data and accepts job or config commands
3. `Job Orchestrator / Runner`
   - single-process scheduler that owns queue progression, pause, resume, and recovery
4. `Existing Catalogsync Executors`
   - reuse `collect`, `sync`, `download`, and `upload` behavior from current package modules

### Why Not A Thin Shell Wrapper

Wrapping only `download_all.sh` and `upload_all.sh` would not reliably provide:

- worker-level current song visibility
- item-level retry
- fine-grained recovery after process crashes
- stable soft pause and resume

The console therefore needs first-class job and work-item tables instead of depending only on raw shell output.

## Job Model

### Active Job Policy

- only one job may be `running` at a time
- additional jobs stay `queued`
- a paused job may later resume and reclaim the active slot

This keeps:

- pause and resume semantics simple
- resource ownership clear
- crash recovery easier to reason about

### Job Templates

Supported templates and stage chains:

- `全链路` (full pipeline)
  - `collect -> sync -> download -> upload`
- `仅采集` (collect only)
  - `collect`
- `仅同步` (sync only)
  - `sync`
- `同步+下载` (sync + download)
  - `sync -> download`
- `仅下载` (download only)
  - `download`
- `仅上传` (upload only)
  - `upload`
- `下载+上传` (download + upload)
  - `download -> upload`

### Job Status

Recommended job statuses:

- `queued`
- `running`
- `pause_requested`
- `paused`
- `completed`
- `completed_with_errors`
- `failed`
- `canceled`

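One way to keep job status changes predictable is an explicit transition map. The sketch below is an assumption: this design names the statuses but does not enumerate the allowed edges, so `JOB_TRANSITIONS` and both helper names are illustrative only.

```python
# Assumed edges between the job statuses listed above; not specified by the design.
JOB_TRANSITIONS = {
    "queued": {"running", "canceled"},
    # running -> paused is included for crash recovery and disk-full pauses.
    "running": {"pause_requested", "paused", "completed",
                "completed_with_errors", "failed", "canceled"},
    "pause_requested": {"paused", "canceled"},
    "paused": {"running", "canceled"},
    "completed": set(),
    "completed_with_errors": set(),
    "failed": set(),
    "canceled": set(),
}


def can_transition(current: str, target: str) -> bool:
    """Return True when the assumed map allows moving from current to target."""
    return target in JOB_TRANSITIONS.get(current, set())


def transition(current: str, target: str) -> str:
    """Return the new status, or raise if the move is not in the map."""
    if not can_transition(current, target):
        raise ValueError(f"illegal job transition: {current} -> {target}")
    return target
```

Routing every status write through a helper like `transition` would let the unit tests for state transitions check the map in one place.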
### Stage Status

Recommended stage statuses:

- `pending`
- `running`
- `pause_requested`
- `paused`
- `completed`
- `failed`
- `skipped`

### Work Item Status

Recommended item statuses:

- `pending`
- `running`
- `succeeded`
- `failed`
- `interrupted`
- `skipped`
- `canceled`

The work item is the recovery and retry granularity. This is what prevents a single failure from forcing a whole-catalog restart.

## Data Model

### Existing Table Reuse

Keep current business tables as the catalog truth:

- `playlist_pools`
- `playlists`
- `pool_playlists`
- `songs`
- `playlist_songs`
- `artists`
- `song_artists`
- `file_locations`
- `object_storage_backends`

These continue to answer:

- what playlists exist
- what songs belong to each playlist
- which files exist locally or remotely

The new console layer adds execution truth around them.

### New Table: `job_runs`

Purpose:

- represent one queued or active operator job

Recommended fields:

```text
id INTEGER PRIMARY KEY AUTOINCREMENT
job_type TEXT NOT NULL
status TEXT NOT NULL
priority INTEGER NOT NULL DEFAULT 100
requested_by TEXT
config_snapshot_json TEXT NOT NULL
sources TEXT
download_sources TEXT
playlist_scope_json TEXT
created_at TEXT DEFAULT CURRENT_TIMESTAMP
started_at TEXT
ended_at TEXT
last_error TEXT
resume_token TEXT
```

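The single-active-job policy could also be enforced in the database itself. The sketch below uses a trimmed subset of the `job_runs` fields and a SQLite partial unique index; the index name, the `start_next_job` helper, and the choice that a lower `priority` value runs first are assumptions rather than part of this design.

```python
import sqlite3
from typing import Optional

DDL = """
CREATE TABLE job_runs (
    id INTEGER PRIMARY KEY AUTOINCREMENT,
    job_type TEXT NOT NULL,
    status TEXT NOT NULL,
    priority INTEGER NOT NULL DEFAULT 100,
    config_snapshot_json TEXT NOT NULL,
    created_at TEXT DEFAULT CURRENT_TIMESTAMP
);
-- Assumption: a partial unique index rejects a second row with status 'running'.
CREATE UNIQUE INDEX idx_job_runs_single_running
    ON job_runs(status) WHERE status = 'running';
"""


def start_next_job(conn: sqlite3.Connection) -> Optional[int]:
    """Promote the first queued job to running; return its id, or None."""
    with conn:
        row = conn.execute(
            "SELECT id FROM job_runs WHERE status = 'queued' "
            "ORDER BY priority, id LIMIT 1"
        ).fetchone()
        if row is None:
            return None
        try:
            conn.execute(
                "UPDATE job_runs SET status = 'running', "
                "created_at = created_at WHERE id = ?",
                (row[0],),
            )
        except sqlite3.IntegrityError:
            return None  # another job already holds the active slot
        return row[0]
```

With the index in place, a bug in the runner cannot leave two jobs in `running`; the second promotion fails at the database level instead.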
### New Table: `job_stages`

Purpose:

- track the stage-level execution status inside one job

Recommended fields:

```text
id INTEGER PRIMARY KEY AUTOINCREMENT
job_run_id INTEGER NOT NULL
stage_type TEXT NOT NULL
status TEXT NOT NULL DEFAULT 'pending'
seq_no INTEGER NOT NULL
total_items INTEGER NOT NULL DEFAULT 0
pending_items INTEGER NOT NULL DEFAULT 0
running_items INTEGER NOT NULL DEFAULT 0
success_items INTEGER NOT NULL DEFAULT 0
failed_items INTEGER NOT NULL DEFAULT 0
skipped_items INTEGER NOT NULL DEFAULT 0
started_at TEXT
ended_at TEXT
last_error TEXT
```

### New Table: `job_items`

Purpose:

- track the real execution unit for recovery and retry

Granularity by stage:

- `collect`
  - one pool/source fetch unit
- `sync`
  - one playlist expansion unit
- `download`
  - one song download unit
- `upload`
  - one file upload unit

Recommended fields:

```text
id INTEGER PRIMARY KEY AUTOINCREMENT
job_stage_id INTEGER NOT NULL
item_type TEXT NOT NULL
item_key TEXT NOT NULL
playlist_pool_id INTEGER
playlist_id INTEGER
song_id INTEGER
file_location_id INTEGER
status TEXT NOT NULL DEFAULT 'pending'
attempt_count INTEGER NOT NULL DEFAULT 0
max_attempts INTEGER NOT NULL DEFAULT 3
worker_id INTEGER
started_at TEXT
ended_at TEXT
last_error TEXT
last_error_code TEXT
payload_json TEXT
UNIQUE(job_stage_id, item_key)
```

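A worker needs a safe way to take one `pending` item without two workers taking the same row. The following is a minimal sketch using the `job_items` columns above; the `claim_next_item` name and the oldest-id-first ordering are assumptions.

```python
import sqlite3
from typing import Optional


def claim_next_item(conn: sqlite3.Connection, stage_id: int,
                    worker_id: int) -> Optional[int]:
    """Move one pending item of the stage to running; return its id, or None."""
    with conn:
        row = conn.execute(
            "SELECT id FROM job_items "
            "WHERE job_stage_id = ? AND status = 'pending' "
            "ORDER BY id LIMIT 1",
            (stage_id,),
        ).fetchone()
        if row is None:
            return None
        # The status guard makes the claim a no-op if another worker won the race.
        cur = conn.execute(
            "UPDATE job_items SET status = 'running', worker_id = ?, "
            "attempt_count = attempt_count + 1, started_at = CURRENT_TIMESTAMP "
            "WHERE id = ? AND status = 'pending'",
            (worker_id, row[0]),
        )
        return row[0] if cur.rowcount == 1 else None
```

Incrementing `attempt_count` inside the claim keeps the counter tied to actual executions rather than to operator clicks.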
### New Table: `job_workers`

Purpose:

- surface live worker state to the UI
- show which song each worker is processing

Recommended fields:

```text
id INTEGER PRIMARY KEY AUTOINCREMENT
job_run_id INTEGER NOT NULL
job_stage_id INTEGER
worker_name TEXT NOT NULL
status TEXT NOT NULL DEFAULT 'idle'
current_job_item_id INTEGER
current_song_id INTEGER
current_playlist_id INTEGER
current_display_text TEXT
heartbeat_at TEXT
last_progress_text TEXT
processed_count INTEGER NOT NULL DEFAULT 0
error_count INTEGER NOT NULL DEFAULT 0
```

### New Table: `job_commands`

Purpose:

- safely bridge UI actions and runner behavior

Recommended command types:

- `pause`
- `resume`
- `cancel`
- `retry_item`
- `force_retry_item`

Recommended fields:

```text
id INTEGER PRIMARY KEY AUTOINCREMENT
job_run_id INTEGER NOT NULL
command_type TEXT NOT NULL
target_item_id INTEGER
status TEXT NOT NULL DEFAULT 'pending'
created_at TEXT DEFAULT CURRENT_TIMESTAMP
applied_at TEXT
payload_json TEXT
```

### New Table: `job_events`

Purpose:

- structured audit trail for major runner events

Recommended event types include:

- `job_started`
- `stage_started`
- `item_started`
- `item_failed`
- `pause_requested`
- `resumed`
- `worker_heartbeat`
- `recovery_requeued`

### New Table: `job_logs`

Purpose:

- queryable log lines for the UI

Recommended fields:

```text
id INTEGER PRIMARY KEY AUTOINCREMENT
job_run_id INTEGER NOT NULL
job_stage_id INTEGER
worker_id INTEGER
level TEXT NOT NULL
message TEXT NOT NULL
created_at TEXT DEFAULT CURRENT_TIMESTAMP
```

### New Table: `config_revisions`

Purpose:

- keep revision history of `catalogsync.env`

Recommended fields:

```text
id INTEGER PRIMARY KEY AUTOINCREMENT
source_type TEXT NOT NULL DEFAULT 'env_file'
file_path TEXT NOT NULL
content_text TEXT NOT NULL
content_hash TEXT NOT NULL
created_at TEXT DEFAULT CURRENT_TIMESTAMP
applied_at TEXT
note TEXT
```

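Saving a revision might look like the sketch below. It assumes `content_hash` holds a SHA-256 hex digest of the file text and that saving unchanged content should reuse the latest revision instead of adding a row; neither detail is specified in this design, and `save_env_revision` is a hypothetical name.

```python
import hashlib
import sqlite3


def save_env_revision(conn: sqlite3.Connection, file_path: str,
                      content_text: str, note: str = "") -> int:
    """Store a new revision row unless the latest one has identical content."""
    # Assumption: SHA-256 is used for content_hash.
    content_hash = hashlib.sha256(content_text.encode("utf-8")).hexdigest()
    latest = conn.execute(
        "SELECT id, content_hash FROM config_revisions "
        "WHERE file_path = ? ORDER BY id DESC LIMIT 1",
        (file_path,),
    ).fetchone()
    if latest is not None and latest[1] == content_hash:
        return latest[0]  # nothing changed, reuse the existing revision
    with conn:
        cur = conn.execute(
            "INSERT INTO config_revisions "
            "(file_path, content_text, content_hash, note) VALUES (?, ?, ?, ?)",
            (file_path, content_text, content_hash, note),
        )
    return cur.lastrowid
```

Comparing hashes before inserting keeps the revision history free of no-op saves, which makes the apply-previous-revision list easier to read.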
## UI Design

### Page 1: Dashboard

Show:

- current active job
- queue length
- downloaded song count
- uploaded file count
- failed item count
- per-stage summaries
- recent exceptions
- worker heartbeat overview

### Page 2: Job Center

Show:

- queued jobs
- running or paused job
- job template
- scope
- stage progression
- pause, resume, cancel controls

Allow:

- creating a new job from the supported templates
- changing priority of queued jobs if desired

### Page 3: Playlist Pools

Show:

- all playlist pools and playlists
- source platform
- pool kind
- song count
- downloaded count
- uploaded count
- main status
- current stage
- last processed time
- latest error summary

#### Derived Playlist Status Rules

Recommend deriving the main status as:

- `异常` (error)
  - any recent failed item exists for the playlist
- `进行中` (in progress)
  - any running or pause-requested item exists
- `未完成` (incomplete)
  - unfinished items remain but the playlist is not actively processing
- `已完成` (completed)
  - no unfinished item remains in the relevant pipeline scope

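The rules above can be read as a first-match-wins check. The sketch below assumes that ordering, assumes the caller has already filtered failed items down to the "recent" window, and treats `pending` and `interrupted` as the unfinished statuses; `derive_playlist_status` is a hypothetical helper name.

```python
# Assumption: these item statuses count as "unfinished" for a playlist.
UNFINISHED = ("pending", "interrupted")


def derive_playlist_status(item_statuses) -> str:
    """Map the item statuses of one playlist to its main display status."""
    statuses = list(item_statuses)
    if "failed" in statuses:
        return "异常"
    # Pause-requested is a job/stage status, so in-flight items appear as running.
    if "running" in statuses:
        return "进行中"
    if any(status in UNFINISHED for status in statuses):
        return "未完成"
    return "已完成"
```

For example, a playlist whose items are all `succeeded` or `skipped` falls through every check and is reported as `已完成`.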
### Page 4: Song Processing

Show:

- each worker and its current song
- failed songs
- interrupted songs
- retryable items

Allow:

- retry single item
- force-retry single item
- filter by stage, platform, playlist, or error state

### Page 5: Logs And Exceptions

Show:

- structured events
- text logs
- job-level and item-level errors
- stack traces or HTTP error summaries where available

### Page 6: Config Management

Show:

- current `catalogsync.env`
- parsed effective values
- validation errors
- revision history

Allow:

- save a new env revision
- re-apply a previous revision

Rule:

- config edits affect only future jobs unless an explicit resume override is supplied

## API Surface

Recommended management endpoints:

- `GET /api/dashboard`
- `GET /api/jobs`
- `POST /api/jobs`
- `GET /api/jobs/{id}`
- `POST /api/jobs/{id}/pause`
- `POST /api/jobs/{id}/resume`
- `POST /api/jobs/{id}/cancel`
- `GET /api/jobs/{id}/items`
- `POST /api/job-items/{id}/retry`
- `POST /api/job-items/{id}/force-retry`
- `GET /api/workers`
- `GET /api/playlists`
- `GET /api/playlists/{id}`
- `GET /api/logs`
- `GET /api/config/env`
- `PUT /api/config/env`
- `GET /api/config/revisions`
- `POST /api/config/revisions/{id}/apply`
- `GET /api/events/stream`

`/api/events/stream` should use server-sent events so the dashboard and worker pages can refresh without polling every table separately.

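Server-sent events are plain text frames, so the stream endpoint mostly needs a formatter. The helper below is a minimal stdlib sketch; the `format_sse` name and the payload fields in the usage note are assumptions, and the event name is taken from the `job_events` list.

```python
import json


def format_sse(event: str, data: dict, event_id=None) -> str:
    """Build one SSE frame: optional id, event name, one data line, blank line."""
    payload = json.dumps(data)  # default ASCII output stays on a single line
    lines = []
    if event_id is not None:
        lines.append(f"id: {event_id}")
    lines.append(f"event: {event}")
    lines.append(f"data: {payload}")
    return "\n".join(lines) + "\n\n"
```

A call such as `format_sse("worker_heartbeat", {"worker_id": 3}, event_id=12)` yields one frame that the browser's `EventSource` can dispatch by event name. In FastAPI, a generator yielding such frames could be returned through `StreamingResponse` with `media_type="text/event-stream"`.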
## Pause, Resume, And Recovery Rules

### Soft Pause

The only supported pause mode in v1 is soft pause.

Behavior:

- UI inserts a `pause` command
- the runner marks the job and current stage as `pause_requested`
- workers stop claiming new items
- any in-progress item is allowed to finish naturally
- once all workers are idle, the stage becomes `paused` and then the job becomes `paused`

This avoids half-written file state and keeps item completion boundaries clean.

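The worker side of soft pause reduces to checking a flag only between items. The sketch below models the pending items as an in-memory `deque` and the pause request as a `threading.Event`; both are stand-ins for the database-backed queue and `job_commands` table, and `run_worker` is a hypothetical name.

```python
import threading
from collections import deque


def run_worker(pending: deque, pause_requested: threading.Event, process) -> list:
    """Process items until a pause is requested; never abandon an item midway."""
    done = []
    # The flag is checked only before claiming, so a claimed item always finishes.
    while pending and not pause_requested.is_set():
        item = pending.popleft()
        process(item)
        done.append(item)
    return done
```

If the pause flag is set while an item is being processed, that item still completes and the remaining items stay in `pending` for the next resume.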
### Resume

Resume behavior:

- UI inserts a `resume` command
- the runner validates the job can continue
- the runner resets paused stage and job state back to `running`
- unstarted items stay `pending`
- succeeded items remain untouched

The resume action may optionally carry a limited override payload, such as a new library root after disk exhaustion.

### Crash Recovery

On runner startup:

1. find all jobs with status `running` or `pause_requested`
2. mark those jobs `paused`
3. find all `job_items` left in `running`
4. convert those items to `interrupted`
5. record a recovery event

After that:

- `succeeded` items remain done
- `pending` items remain pending
- `interrupted` items become eligible for retry or auto-requeue depending on stage policy
- `failed` items remain failed until explicit retry

This preserves progress without restarting the whole job or whole database.

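The startup steps above fit in one transaction. The sketch below is illustrative: the `job_events` columns `event_type` and `message` are hypothetical because this design does not list that table's fields, and `recover_on_startup` is an assumed helper name.

```python
import sqlite3


def recover_on_startup(conn: sqlite3.Connection) -> dict:
    """Pause unfinished jobs and mark orphaned running items as interrupted."""
    with conn:  # all three writes commit together or not at all
        jobs = conn.execute(
            "UPDATE job_runs SET status = 'paused' "
            "WHERE status IN ('running', 'pause_requested')"
        ).rowcount
        items = conn.execute(
            "UPDATE job_items SET status = 'interrupted', worker_id = NULL "
            "WHERE status = 'running'"
        ).rowcount
        # Hypothetical job_events columns; only the event type comes from the design.
        conn.execute(
            "INSERT INTO job_events (event_type, message) "
            "VALUES ('recovery_requeued', ?)",
            (f"paused {jobs} job(s), interrupted {items} item(s)",),
        )
    return {"jobs_paused": jobs, "items_interrupted": items}
```

Items in `succeeded`, `pending`, or `failed` are not matched by either update, which is what keeps finished work from being redone.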
## Retry Rules

### Single Item Retry

When the operator clicks retry for a failed or interrupted item:

- insert `job_commands.retry_item`
- clear execution fields on the target item
- set status back to `pending`
- increment `attempt_count` on the next worker claim

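A compact sketch of the retry steps follows. For brevity it records the command and applies it in the same transaction, whereas the design has the UI insert the command and the runner apply it; the `'applied'` command status and the `retry_item` helper name are assumptions.

```python
import sqlite3


def retry_item(conn: sqlite3.Connection, job_run_id: int, item_id: int) -> bool:
    """Reset a failed or interrupted item to pending; return False otherwise."""
    with conn:
        cur = conn.execute(
            "UPDATE job_items SET status = 'pending', worker_id = NULL, "
            "started_at = NULL, ended_at = NULL, "
            "last_error = NULL, last_error_code = NULL "
            "WHERE id = ? AND status IN ('failed', 'interrupted')",
            (item_id,),
        )
        if cur.rowcount != 1:
            return False  # item is missing or not in a retryable status
        # Assumption: 'applied' marks a command the runner has already handled.
        conn.execute(
            "INSERT INTO job_commands "
            "(job_run_id, command_type, target_item_id, status, applied_at) "
            "VALUES (?, 'retry_item', ?, 'applied', CURRENT_TIMESTAMP)",
            (job_run_id, item_id),
        )
    return True
```

The update deliberately leaves `attempt_count` alone, since the rule above increments it only when a worker next claims the item.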
### Force Retry

Force retry is more aggressive:

- download stage may ignore an existing local mapping if the operator requests a fresh re-download
- upload stage may ignore an existing active remote mapping if the operator explicitly wants a re-upload

Force retry must stay item-scoped, never job-scoped.

## Disk Exhaustion Handling

If the downloader detects insufficient space:

- fail or interrupt the current download item
- pause the active job with a machine-readable reason such as `disk_full`
- surface a UI banner asking for a new library root override

After the operator supplies a new directory and clicks resume:

- the job continues only for unfinished items
- completed downloads are not restarted
- the currently failed song can be retried from scratch

This matches the requirement that one song may restart while the whole database must not restart.

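Detecting insufficient space before a download starts could use the standard library. The sketch below is an assumption about how the check might look: the function names and the `reserve_bytes` safety margin are hypothetical, and the existing downloader may detect the condition differently.

```python
import shutil
from typing import Optional

DISK_FULL = "disk_full"  # machine-readable pause reason named in this design


def has_free_space(path: str, required_bytes: int, reserve_bytes: int = 0) -> bool:
    """Return True when the filesystem holding path can fit the download."""
    free = shutil.disk_usage(path).free
    return free >= required_bytes + reserve_bytes


def pause_reason_for(path: str, required_bytes: int,
                     reserve_bytes: int = 0) -> Optional[str]:
    """Return DISK_FULL when the download should not start, else None."""
    if has_free_space(path, required_bytes, reserve_bytes):
        return None
    return DISK_FULL
```

After the operator supplies a new library root, the same check can be run against the new path before the job resumes.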
## Execution Strategy

### Stage Executors

Implement separate executor paths for:

- `collect`
- `sync`
- `download`
- `upload`

Recommended concurrency:

- `collect`
  - low concurrency, v1 may stay serial
- `sync`
  - low concurrency, v1 may stay serial
- `download`
  - configurable worker pool
- `upload`
  - configurable worker pool

### Reuse Strategy

Prefer reusing current catalogsync modules:

- `musicdl.catalogsync.services`
- `musicdl.catalogsync.downloader`
- `musicdl.catalogsync.uploader`
- `musicdl.catalogsync.repository`

The runner should orchestrate these modules rather than rewriting the domain logic from scratch.

## Technology Choice

### Backend

Recommended stack:

- `FastAPI`
- `Jinja2`
- `SQLite`
- `SSE` for live updates

### Frontend

Recommended rendering model:

- server-rendered pages with `Jinja2`
- `HTMX` for partial updates and action forms
- a small amount of vanilla JavaScript for log streaming and live worker refresh

Why this fits:

- NAS-local internal tool
- mainly operational tables and actions
- lower dependency and deployment complexity than a separate SPA
- easier to keep aligned with the existing Python-only project

## Verification Plan

The implementation should be verified at four levels:

1. unit tests
   - state transitions
   - retry rules
   - recovery transforms
2. API integration tests
   - job creation
   - pause and resume
   - item retry
   - config revision flow
3. fault injection tests
   - kill the runner mid-download and confirm item-level recovery
4. NAS smoke tests
   - create jobs
   - pause and resume
   - crash and restart
   - retry a single failed song
   - change library directory after disk-full pause

## V1 Delivery Boundary

### Must Ship In V1

- queue-based single-active-job runner
- supported job templates
- dashboard, job center, playlist pools, song processing, logs, and config pages
- soft pause and resume
- crash-safe item-level recovery
- single-item retry and force-retry
- env revision history and apply flow

### Explicitly Deferred

- authentication
- multi-user permissions
- multiple active jobs
- distributed workers
- arbitrary stage composition
- automatic endless retries
- destructive file cleanup actions

## Open Follow-Up Items

Two source-coverage follow-ups remain outside this console design and should stay tracked separately:

- redeploy the local Kuwo toplist fallback fix to the NAS and backfill the missing collection or sync results
- repair QQ playlist square collection after the old endpoint started returning `parameter failed`

These belong to operational backlog work, not to the web console architecture itself.