# Catalogsync Operations Console Design

## Goal

Extend `musicdl.catalogsync` with a NAS-local web operations console that can:

- manage queue-based pipeline jobs for `collect`, `sync`, `download`, and `upload`
- show playlist pool and playlist execution status as `未完成 / 进行中 / 已完成 / 异常` (incomplete / in progress / completed / error)
- show worker-level live processing state, especially which song each worker is handling
- support global soft pause and resume across all active workers
- survive process crashes or NAS restarts without restarting the whole catalog from scratch
- allow retrying a single failed or interrupted song/item instead of rerunning the whole database
- manage `catalogsync.env` as the primary operator configuration source

This design targets an internal NAS console, not a public-facing multi-user product.

## Scope

### In Scope

- Add a NAS-local web console for `catalogsync`
- Add a database-backed job queue with exactly one active job at a time
- Support these job templates:
  - `全链路` (full pipeline)
  - `仅采集` (collect only)
  - `仅同步` (sync only)
  - `同步+下载` (sync + download)
  - `仅下载` (download only)
  - `仅上传` (upload only)
  - `下载+上传` (download + upload)
- Track job, stage, item, and worker state in SQLite
- Show dashboard, queue, playlist pool, worker, log, and config views
- Implement soft pause and resume
- Implement crash-safe recovery at job-item granularity
- Implement single-item retry and force-retry
- Version and edit `catalogsync.env` from the web console
- Reuse existing `musicdl.catalogsync` collectors, services, downloader, uploader, and storage model as much as possible

### Out of Scope

- Multi-user login or permissions
- Public internet exposure or hardened auth
- Multiple active jobs running at the same time
- Cross-machine worker distribution
- Arbitrary user-defined stage graphs
- Provider-specific cloud drive management beyond current object storage support
- Automatic deletion of local or remote files
- Editing business data such as songs or playlists directly from the UI

## Constraints

- The console runs on the NAS itself
- `catalogsync.env` remains the configuration source of truth
- A queued
job must freeze the required runtime settings into a config snapshot so later env edits do not mutate in-flight work
- Recovery must resume from unfinished work items instead of rerunning all songs or all playlists
- Existing `musicdl.catalogsync` CLI and scripts must remain usable
- The first version should optimize for operational stability, inspectability, and recoverability over architecture purity

## Operator Model

### Deployment Model

The web console runs on the same NAS host that already owns:

- the SQLite database
- the local music library
- the logs directory
- the runtime scripts
- the object storage configuration

This avoids a remote-control architecture for v1 and keeps job control, log access, file state, and recovery local.

### Configuration Model

`catalogsync.env` remains the operator-managed source of truth.

The console may:

- display current env values
- validate and save new env revisions
- apply a previous env revision as the current file

Queued jobs must store a `config_snapshot_json` copy of the relevant settings so:

- existing queued or running jobs stay deterministic
- later env edits only affect newly created jobs

## Recommended Architecture

Use four layers:

1. `Web Console`
   - browser UI for dashboards, queue control, logs, and config management
2. `Management API`
   - serves data and accepts job or config commands
3. `Job Orchestrator / Runner`
   - single-process scheduler that owns queue progression, pause, resume, and recovery
4. `Existing Catalogsync Executors`
   - reuse `collect`, `sync`, `download`, and `upload` behavior from current package modules

### Why Not A Thin Shell Wrapper

Wrapping only `download_all.sh` and `upload_all.sh` would not reliably provide:

- worker-level current song visibility
- item-level retry
- fine-grained recovery after process crashes
- stable soft pause and resume

The console therefore needs first-class job and work-item tables instead of depending only on raw shell output.
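The snapshot rule from the Configuration Model section can be illustrated with a minimal sketch. It assumes `catalogsync.env` uses plain `KEY=VALUE` lines; the helper names `parse_env_text` and `build_config_snapshot` and the keys `LIBRARY_ROOT` and `DOWNLOAD_WORKERS` are hypothetical, since this design does not list the real env keys.

```python
import json


def parse_env_text(text: str) -> dict[str, str]:
    """Parse simple KEY=VALUE lines, skipping blank lines and # comments."""
    values: dict[str, str] = {}
    for raw_line in text.splitlines():
        line = raw_line.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        values[key.strip()] = value.strip().strip('"').strip("'")
    return values


def build_config_snapshot(env_text: str, relevant_keys: list[str]) -> str:
    """Return the JSON string a job could store in config_snapshot_json."""
    parsed = parse_env_text(env_text)
    snapshot = {key: parsed[key] for key in relevant_keys if key in parsed}
    return json.dumps(snapshot, sort_keys=True, ensure_ascii=False)


# Hypothetical keys; the real catalogsync.env keys are not listed in this design.
KEYS = ["LIBRARY_ROOT", "DOWNLOAD_WORKERS"]

env_v1 = "LIBRARY_ROOT=/volume1/music\nDOWNLOAD_WORKERS=4\n# comment\n"
snapshot_json = build_config_snapshot(env_v1, KEYS)

# A later env edit only changes the snapshot taken for newly created jobs.
env_v2 = env_v1.replace("/volume1/music", "/volume2/music")
new_snapshot_json = build_config_snapshot(env_v2, KEYS)
```

Because the snapshot is a string captured at job creation, the queued job keeps the values from `env_v1` even after the file is edited.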
## Job Model

### Active Job Policy

- only one job may be `running` at a time
- additional jobs stay `queued`
- a paused job may later resume and reclaim the active slot

This keeps:

- pause and resume semantics simple
- resource ownership clear
- crash recovery easier to reason about

### Job Templates

Supported templates and stage chains:

- `全链路` (full pipeline)
  - `collect -> sync -> download -> upload`
- `仅采集` (collect only)
  - `collect`
- `仅同步` (sync only)
  - `sync`
- `同步+下载` (sync + download)
  - `sync -> download`
- `仅下载` (download only)
  - `download`
- `仅上传` (upload only)
  - `upload`
- `下载+上传` (download + upload)
  - `download -> upload`

### Job Status

Recommended job statuses:

- `queued`
- `running`
- `pause_requested`
- `paused`
- `completed`
- `completed_with_errors`
- `failed`
- `canceled`

### Stage Status

Recommended stage statuses:

- `pending`
- `running`
- `pause_requested`
- `paused`
- `completed`
- `failed`
- `skipped`

### Work Item Status

Recommended item statuses:

- `pending`
- `running`
- `succeeded`
- `failed`
- `interrupted`
- `skipped`
- `canceled`

The work item is the recovery and retry granularity. This is what prevents a single failure from forcing a whole-catalog restart.

## Data Model

### Existing Table Reuse

Keep current business tables as the catalog truth:

- `playlist_pools`
- `playlists`
- `pool_playlists`
- `songs`
- `playlist_songs`
- `artists`
- `song_artists`
- `file_locations`
- `object_storage_backends`

These continue to answer:

- what playlists exist
- what songs belong to each playlist
- which files exist locally or remotely

The new console layer adds execution truth around them.
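The template-to-stage mapping from the Job Templates section can be sketched as a lookup table plus a small expansion helper. This is an illustration only: `STAGE_CHAINS` and `plan_stages` are hypothetical names, and numbering `seq_no` from 1 is an assumption rather than something the design specifies.

```python
# Template name -> ordered stage chain, copied from the Job Templates list.
STAGE_CHAINS: dict[str, tuple[str, ...]] = {
    "全链路": ("collect", "sync", "download", "upload"),
    "仅采集": ("collect",),
    "仅同步": ("sync",),
    "同步+下载": ("sync", "download"),
    "仅下载": ("download",),
    "仅上传": ("upload",),
    "下载+上传": ("download", "upload"),
}


def plan_stages(job_type: str) -> list[dict[str, object]]:
    """Expand a template name into ordered stage rows, all initially pending."""
    try:
        chain = STAGE_CHAINS[job_type]
    except KeyError:
        raise ValueError(f"unknown job template: {job_type!r}") from None
    # seq_no starting at 1 is an assumption made for this sketch.
    return [
        {"stage_type": stage, "seq_no": seq_no, "status": "pending"}
        for seq_no, stage in enumerate(chain, start=1)
    ]


stages = plan_stages("同步+下载")
```

Keeping the chains in one table means an unsupported template fails at job creation instead of partway through a run, which fits the decision to exclude arbitrary user-defined stage graphs.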
### New Table: `job_runs`

Purpose:

- represent one queued or active operator job

Recommended fields:

```text
id INTEGER PRIMARY KEY AUTOINCREMENT
job_type TEXT NOT NULL
status TEXT NOT NULL
priority INTEGER NOT NULL DEFAULT 100
requested_by TEXT
config_snapshot_json TEXT NOT NULL
sources TEXT
download_sources TEXT
playlist_scope_json TEXT
created_at TEXT DEFAULT CURRENT_TIMESTAMP
started_at TEXT
ended_at TEXT
last_error TEXT
resume_token TEXT
```

### New Table: `job_stages`

Purpose:

- track the stage-level execution status inside one job

Recommended fields:

```text
id INTEGER PRIMARY KEY AUTOINCREMENT
job_run_id INTEGER NOT NULL
stage_type TEXT NOT NULL
status TEXT NOT NULL DEFAULT 'pending'
seq_no INTEGER NOT NULL
total_items INTEGER NOT NULL DEFAULT 0
pending_items INTEGER NOT NULL DEFAULT 0
running_items INTEGER NOT NULL DEFAULT 0
success_items INTEGER NOT NULL DEFAULT 0
failed_items INTEGER NOT NULL DEFAULT 0
skipped_items INTEGER NOT NULL DEFAULT 0
started_at TEXT
ended_at TEXT
last_error TEXT
```

### New Table: `job_items`

Purpose:

- track the real execution unit for recovery and retry

Granularity by stage:

- `collect`
  - one pool/source fetch unit
- `sync`
  - one playlist expansion unit
- `download`
  - one song download unit
- `upload`
  - one file upload unit

Recommended fields:

```text
id INTEGER PRIMARY KEY AUTOINCREMENT
job_stage_id INTEGER NOT NULL
item_type TEXT NOT NULL
item_key TEXT NOT NULL
playlist_pool_id INTEGER
playlist_id INTEGER
song_id INTEGER
file_location_id INTEGER
status TEXT NOT NULL DEFAULT 'pending'
attempt_count INTEGER NOT NULL DEFAULT 0
max_attempts INTEGER NOT NULL DEFAULT 3
worker_id INTEGER
started_at TEXT
ended_at TEXT
last_error TEXT
last_error_code TEXT
payload_json TEXT
UNIQUE(job_stage_id, item_key)
```

### New Table: `job_workers`

Purpose:

- surface live worker state to the UI
- show which song each worker is processing

Recommended fields:

```text
id INTEGER PRIMARY KEY AUTOINCREMENT
job_run_id INTEGER NOT NULL
job_stage_id INTEGER
worker_name TEXT NOT NULL
status TEXT NOT NULL DEFAULT 'idle'
current_job_item_id INTEGER
current_song_id INTEGER
current_playlist_id INTEGER
current_display_text TEXT
heartbeat_at TEXT
last_progress_text TEXT
processed_count INTEGER NOT NULL DEFAULT 0
error_count INTEGER NOT NULL DEFAULT 0
```

### New Table: `job_commands`

Purpose:

- safely bridge UI actions and runner behavior

Recommended command types:

- `pause`
- `resume`
- `cancel`
- `retry_item`
- `force_retry_item`

Recommended fields:

```text
id INTEGER PRIMARY KEY AUTOINCREMENT
job_run_id INTEGER NOT NULL
command_type TEXT NOT NULL
target_item_id INTEGER
status TEXT NOT NULL DEFAULT 'pending'
created_at TEXT DEFAULT CURRENT_TIMESTAMP
applied_at TEXT
payload_json TEXT
```

### New Table: `job_events`

Purpose:

- structured audit trail for major runner events

Recommended event types include:

- `job_started`
- `stage_started`
- `item_started`
- `item_failed`
- `pause_requested`
- `resumed`
- `worker_heartbeat`
- `recovery_requeued`

### New Table: `job_logs`

Purpose:

- queryable log lines for the UI

Recommended fields:

```text
id INTEGER PRIMARY KEY AUTOINCREMENT
job_run_id INTEGER NOT NULL
job_stage_id INTEGER
worker_id INTEGER
level TEXT NOT NULL
message TEXT NOT NULL
created_at TEXT DEFAULT CURRENT_TIMESTAMP
```

### New Table: `config_revisions`

Purpose:

- keep revision history of `catalogsync.env`

Recommended fields:

```text
id INTEGER PRIMARY KEY AUTOINCREMENT
source_type TEXT NOT NULL DEFAULT 'env_file'
file_path TEXT NOT NULL
content_text TEXT NOT NULL
content_hash TEXT NOT NULL
created_at TEXT DEFAULT CURRENT_TIMESTAMP
applied_at TEXT
note TEXT
```

## UI Design

### Page 1: Dashboard

Show:

- current active job
- queue length
- downloaded song count
- uploaded file count
- failed item count
- per-stage summaries
- recent exceptions
- worker heartbeat overview

### Page 2: Job Center

Show:

- queued jobs
- running or paused job
- job template
- scope
- stage progression
- pause, resume, cancel controls

Allow:

- creating a new job from the supported templates
- changing priority of queued jobs if desired

### Page 3: Playlist Pools

Show:

- all playlist pools and playlists
- source platform
- pool kind
- song count
- downloaded count
- uploaded count
- main status
- current stage
- last processed time
- latest error summary

#### Derived Playlist Status Rules

Recommend deriving the main status as:

- `异常` (error)
  - any recent failed item exists for the playlist
- `进行中` (in progress)
  - any running or pause-requested item exists
- `未完成` (incomplete)
  - unfinished items remain but the playlist is not actively processing
- `已完成` (completed)
  - no unfinished item remains in the relevant pipeline scope

### Page 4: Song Processing

Show:

- each worker and its current song
- failed songs
- interrupted songs
- retryable items

Allow:

- retry single item
- force-retry single item
- filter by stage, platform, playlist, or error state

### Page 5: Logs And Exceptions

Show:

- structured events
- text logs
- job-level and item-level errors
- stack traces or HTTP error summaries where available

### Page 6: Config Management

Show:

- current `catalogsync.env`
- parsed effective values
- validation errors
- revision history

Allow:

- save a new env revision
- re-apply a previous revision

Rule:

- config edits affect only future jobs unless an explicit resume override is supplied

## API Surface

Recommended management endpoints:

- `GET /api/dashboard`
- `GET /api/jobs`
- `POST /api/jobs`
- `GET /api/jobs/{id}`
- `POST /api/jobs/{id}/pause`
- `POST /api/jobs/{id}/resume`
- `POST /api/jobs/{id}/cancel`
- `GET /api/jobs/{id}/items`
- `POST /api/job-items/{id}/retry`
- `POST /api/job-items/{id}/force-retry`
- `GET /api/workers`
- `GET /api/playlists`
- `GET /api/playlists/{id}`
- `GET /api/logs`
- `GET /api/config/env`
- `PUT /api/config/env`
- `GET /api/config/revisions`
- `POST /api/config/revisions/{id}/apply`
- `GET /api/events/stream`

`/api/events/stream` should use server-sent events so the dashboard and worker pages can refresh without
polling every table separately.

## Pause, Resume, And Recovery Rules

### Soft Pause

The only supported pause mode in v1 is soft pause.

Behavior:

- UI inserts a `pause` command
- the runner marks the job and current stage as `pause_requested`
- workers stop claiming new items
- any in-progress item is allowed to finish naturally
- once all workers are idle, the stage becomes `paused` and then the job becomes `paused`

This avoids half-written file state and keeps item completion boundaries clean.

### Resume

Resume behavior:

- UI inserts a `resume` command
- the runner validates the job can continue
- the runner resets paused stage and job state back to `running`
- unstarted items stay `pending`
- succeeded items remain untouched

The resume action may optionally carry a limited override payload, such as a new library root after disk exhaustion.

### Crash Recovery

On runner startup:

1. find all jobs with status `running` or `pause_requested`
2. mark those jobs `paused`
3. find all `job_items` left in `running`
4. convert those items to `interrupted`
5. record a recovery event

After that:

- `succeeded` items remain done
- `pending` items remain pending
- `interrupted` items become eligible for retry or auto-requeue depending on stage policy
- `failed` items remain failed until explicit retry

This preserves progress without restarting the whole job or whole database.

## Retry Rules

### Single Item Retry

When the operator clicks retry for a failed or interrupted item:

- insert a `job_commands` row with command type `retry_item`
- clear execution fields on the target item
- set status back to `pending`
- increment `attempt_count` on the next worker claim

### Force Retry

Force retry is more aggressive:

- download stage may ignore an existing local mapping if the operator requests a fresh re-download
- upload stage may ignore an existing active remote mapping if the operator explicitly wants a re-upload

Force retry must stay item-scoped, never job-scoped.
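The crash recovery steps and the single-item retry rule above can be sketched with the standard-library `sqlite3` module. This is a minimal illustration under stated assumptions, not the runner's implementation: the tables below carry only a subset of the recommended fields, the `job_events` columns (`event_type`, `message`) are assumed because the design does not list them, the set of "execution fields" cleared on retry is a guess, and the function names are hypothetical. The sketch shows only the effect the runner applies to the item, not the `job_commands` row the UI inserts first.

```python
import sqlite3


def recover_after_crash(conn: sqlite3.Connection) -> dict[str, int]:
    """Apply the startup recovery steps in one transaction."""
    with conn:  # commits on success, rolls back on an exception
        jobs = conn.execute(
            "UPDATE job_runs SET status = 'paused' "
            "WHERE status IN ('running', 'pause_requested')"
        ).rowcount
        items = conn.execute(
            "UPDATE job_items SET status = 'interrupted' WHERE status = 'running'"
        ).rowcount
        conn.execute(
            "INSERT INTO job_events (event_type, message) VALUES (?, ?)",
            ("recovery_requeued", f"paused {jobs} job(s), interrupted {items} item(s)"),
        )
    return {"jobs_paused": jobs, "items_interrupted": items}


def retry_item(conn: sqlite3.Connection, item_id: int) -> bool:
    """Reset a failed or interrupted item to pending; True if a row changed."""
    with conn:
        cur = conn.execute(
            # attempt_count is left alone; it is incremented on the next claim.
            "UPDATE job_items SET status = 'pending', worker_id = NULL, "
            "started_at = NULL, ended_at = NULL, "
            "last_error = NULL, last_error_code = NULL "
            "WHERE id = ? AND status IN ('failed', 'interrupted')",
            (item_id,),
        )
    return cur.rowcount == 1


# In-memory demo schema: a reduced version of the recommended tables.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE job_runs (id INTEGER PRIMARY KEY AUTOINCREMENT, status TEXT NOT NULL);
CREATE TABLE job_items (
    id INTEGER PRIMARY KEY AUTOINCREMENT,
    status TEXT NOT NULL DEFAULT 'pending',
    attempt_count INTEGER NOT NULL DEFAULT 0,
    worker_id INTEGER, started_at TEXT, ended_at TEXT,
    last_error TEXT, last_error_code TEXT
);
CREATE TABLE job_events (
    id INTEGER PRIMARY KEY AUTOINCREMENT,
    event_type TEXT NOT NULL, message TEXT,
    created_at TEXT DEFAULT CURRENT_TIMESTAMP
);
INSERT INTO job_runs (status) VALUES ('running');
INSERT INTO job_items (status) VALUES ('succeeded'), ('running'), ('pending'), ('failed');
""")

summary = recover_after_crash(conn)
was_reset = retry_item(conn, 2)  # item 2 was interrupted by the simulated crash
statuses = [row[0] for row in conn.execute("SELECT status FROM job_items ORDER BY id")]
```

The `WHERE status IN ('failed', 'interrupted')` guard keeps the retry item-scoped and prevents it from touching a succeeded item, so completed work is never redone by a stray retry request.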
## Disk Exhaustion Handling

If the downloader detects insufficient space:

- fail or interrupt the current download item
- pause the active job with a machine-readable reason such as `disk_full`
- surface a UI banner asking for a new library root override

After the operator supplies a new directory and clicks resume:

- the job continues only for unfinished items
- completed downloads are not restarted
- the currently failed song can be retried from scratch

This matches the requirement that one song may restart while the whole database must not restart.

## Execution Strategy

### Stage Executors

Implement separate executor paths for:

- `collect`
- `sync`
- `download`
- `upload`

Recommended concurrency:

- `collect`
  - low concurrency, v1 may stay serial
- `sync`
  - low concurrency, v1 may stay serial
- `download`
  - configurable worker pool
- `upload`
  - configurable worker pool

### Reuse Strategy

Prefer reusing current catalogsync modules:

- `musicdl.catalogsync.services`
- `musicdl.catalogsync.downloader`
- `musicdl.catalogsync.uploader`
- `musicdl.catalogsync.repository`

The runner should orchestrate these modules rather than rewriting the domain logic from scratch.

## Technology Choice

### Backend

Recommended stack:

- `FastAPI`
- `Jinja2`
- `SQLite`
- `SSE` for live updates

### Frontend

Recommended rendering model:

- server-rendered pages with `Jinja2`
- `HTMX` for partial updates and action forms
- a small amount of vanilla JavaScript for log streaming and live worker refresh

Why this fits:

- NAS-local internal tool
- mainly operational tables and actions
- lower dependency and deployment complexity than a separate SPA
- easier to keep aligned with the existing Python-only project

## Verification Plan

The implementation should be verified at four levels:

1. unit tests
   - state transitions
   - retry rules
   - recovery transforms
2. API integration tests
   - job creation
   - pause and resume
   - item retry
   - config revision flow
3. fault injection tests
   - kill the runner mid-download and confirm item-level recovery
4. NAS smoke tests
   - create jobs
   - pause and resume
   - crash and restart
   - retry a single failed song
   - change library directory after disk-full pause

## V1 Delivery Boundary

### Must Ship In V1

- queue-based single-active-job runner
- supported job templates
- dashboard, job center, playlist pools, song processing, logs, and config pages
- soft pause and resume
- crash-safe item-level recovery
- single-item retry and force-retry
- env revision history and apply flow

### Explicitly Deferred

- authentication
- multi-user permissions
- multiple active jobs
- distributed workers
- arbitrary stage composition
- automatic endless retries
- destructive file cleanup actions

## Open Follow-Up Items

Two source-coverage follow-ups remain outside this console design and should stay tracked separately:

- redeploy the local Kuwo toplist fallback fix to the NAS and backfill the missing collection or sync results
- repair QQ playlist square collection after the old endpoint started returning `parameter failed`

These belong to operational backlog work, not to the web console architecture itself.