musicdl-catalog-sync-suite/catalog-sync/docs/superpowers/specs/2026-04-16-catalogsync-operations-console-design.md

# Catalogsync Operations Console Design

## Goal

Extend `musicdl.catalogsync` with a NAS-local web operations console that can:

- manage queue-based pipeline jobs for `collect`, `sync`, `download`, and `upload`
- show playlist pool and playlist execution status as `未完成` (incomplete), `进行中` (in progress), `已完成` (completed), or `异常` (error)
- show worker-level live processing state, especially which song each worker is handling
- support global soft pause and resume across all active workers
- survive process crashes or NAS restarts without restarting the whole catalog from scratch
- allow retrying a single failed or interrupted song/item instead of rerunning the whole database
- manage `catalogsync.env` as the primary operator configuration source

This design targets an internal NAS console, not a public-facing multi-user product.

## Scope

### In Scope

- Add a NAS-local web console for `catalogsync`
- Add a database-backed job queue with exactly one active job at a time
- Support these job templates:
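  - `全链路` (full pipeline)
  - `仅采集` (collect only)
  - `仅同步` (sync only)
  - `同步+下载` (sync + download)
  - `仅下载` (download only)
  - `仅上传` (upload only)
  - `下载+上传` (download + upload)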
- Track job, stage, item, and worker state in SQLite
- Show dashboard, queue, playlist pool, worker, log, and config views
- Implement soft pause and resume
- Implement crash-safe recovery at job-item granularity
- Implement single-item retry and force-retry
- Version and edit `catalogsync.env` from the web console
- Reuse existing `musicdl.catalogsync` collectors, services, downloader, uploader, and storage model as much as possible

### Out of Scope

- Multi-user login or permissions
- Public internet exposure or hardened auth
- Multiple active jobs running at the same time
- Cross-machine worker distribution
- Arbitrary user-defined stage graphs
- Provider-specific cloud drive management beyond current object storage support
- Automatic deletion of local or remote files
- Editing business data such as songs or playlists directly from the UI

## Constraints

- The console runs on the NAS itself
- `catalogsync.env` remains the configuration source of truth
- A queued job must freeze the required runtime settings into a config snapshot so later env edits do not mutate in-flight work
- Recovery must resume from unfinished work items instead of rerunning all songs or all playlists
- Existing `musicdl.catalogsync` CLI and scripts must remain usable
- The first version should favor operational stability, inspectability, and recoverability over architectural purity

## Operator Model

### Deployment Model

The web console runs on the same NAS host that already owns:

- the SQLite database
- the local music library
- the logs directory
- the runtime scripts
- the object storage configuration

This avoids a remote-control architecture for v1 and keeps job control, log access, file state, and recovery local.

### Configuration Model

`catalogsync.env` remains the operator-managed source of truth.

The console may:

- display current env values
- validate and save new env revisions
- apply a previous env revision as the current file

Queued jobs must store a `config_snapshot_json` copy of the relevant settings so:

- existing queued or running jobs stay deterministic
- later env edits only affect newly created jobs

## Recommended Architecture

Use four layers:

1. `Web Console`
   - browser UI for dashboards, queue control, logs, and config management
2. `Management API`
   - serves data and accepts job or config commands
3. `Job Orchestrator / Runner`
   - single-process scheduler that owns queue progression, pause, resume, and recovery
4. `Existing Catalogsync Executors`
   - reuse `collect`, `sync`, `download`, and `upload` behavior from current package modules

### Why Not A Thin Shell Wrapper

Wrapping only `download_all.sh` and `upload_all.sh` would not reliably provide:

- worker-level current song visibility
- item-level retry
- fine-grained recovery after process crashes
- stable soft pause and resume

The console therefore needs first-class job and work-item tables instead of depending only on raw shell output.

## Job Model

### Active Job Policy

- only one job may be `running` at a time
- additional jobs stay `queued`
- a paused job may later resume and reclaim the active slot

This keeps:

- pause and resume semantics simple
- resource ownership clear
- crash recovery easier to reason about

### Job Templates

Supported templates and stage chains:

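- `全链路` (full pipeline)
  - `collect -> sync -> download -> upload`
- `仅采集` (collect only)
  - `collect`
- `仅同步` (sync only)
  - `sync`
- `同步+下载` (sync + download)
  - `sync -> download`
- `仅下载` (download only)
  - `download`
- `仅上传` (upload only)
  - `upload`
- `下载+上传` (download + upload)
  - `download -> upload`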
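
The template list above maps naturally onto a lookup table. The sketch below assumes the runner expands a template into ordered stage rows when the job is created; `JOB_TEMPLATES` and `build_stage_rows` are illustrative names, not existing package symbols.

```python
# Template label -> ordered stage chain, exactly as listed in this section.
JOB_TEMPLATES: dict[str, tuple[str, ...]] = {
    "全链路": ("collect", "sync", "download", "upload"),
    "仅采集": ("collect",),
    "仅同步": ("sync",),
    "同步+下载": ("sync", "download"),
    "仅下载": ("download",),
    "仅上传": ("upload",),
    "下载+上传": ("download", "upload"),
}


def build_stage_rows(job_run_id: int, template: str) -> list[dict]:
    """Expand a template into rows shaped like job_stages (hypothetical helper)."""
    chain = JOB_TEMPLATES[template]  # raises KeyError for unsupported templates
    return [
        {"job_run_id": job_run_id, "stage_type": stage, "seq_no": seq_no, "status": "pending"}
        for seq_no, stage in enumerate(chain, start=1)
    ]


rows = build_stage_rows(7, "同步+下载")
```

Keeping the chains in one fixed table is consistent with the decision to leave arbitrary user-defined stage graphs out of scope.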

### Job Status

Recommended job statuses:

- `queued`
- `running`
- `pause_requested`
- `paused`
- `completed`
- `completed_with_errors`
- `failed`
- `canceled`

### Stage Status

Recommended stage statuses:

- `pending`
- `running`
- `pause_requested`
- `paused`
- `completed`
- `failed`
- `skipped`

### Work Item Status

Recommended item statuses:

- `pending`
- `running`
- `succeeded`
- `failed`
- `interrupted`
- `skipped`
- `canceled`

The work item is the unit of recovery and retry. Tracking state at this level is what prevents a single failure from forcing a whole-catalog restart.
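
One way to make the item statuses testable is an explicit transition table. The table below is an assumption inferred from the pause, recovery, and retry rules in this document, not a confirmed part of the design; in particular it treats `succeeded`, `skipped`, and `canceled` as terminal.

```python
# Assumed item transitions; adjust if the runner needs additional edges.
ITEM_TRANSITIONS: dict[str, set[str]] = {
    "pending": {"running", "skipped", "canceled"},
    "running": {"succeeded", "failed", "interrupted"},
    "failed": {"pending"},        # explicit operator retry
    "interrupted": {"pending"},   # retry or auto-requeue after recovery
    "succeeded": set(),
    "skipped": set(),
    "canceled": set(),
}


def can_transition(current: str, new: str) -> bool:
    """Return True when moving an item from `current` to `new` is allowed."""
    return new in ITEM_TRANSITIONS.get(current, set())
```

A guard like this could back the state-transition unit tests listed in the verification plan.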

## Data Model

### Existing Table Reuse

Keep current business tables as the catalog truth:

- `playlist_pools`
- `playlists`
- `pool_playlists`
- `songs`
- `playlist_songs`
- `artists`
- `song_artists`
- `file_locations`
- `object_storage_backends`

These continue to answer:

- what playlists exist
- what songs belong to each playlist
- which files exist locally or remotely

The new console layer adds execution truth around them.

### New Table: `job_runs`

Purpose:

- represent one queued or active operator job

Recommended fields:

```text
id INTEGER PRIMARY KEY AUTOINCREMENT
job_type TEXT NOT NULL
status TEXT NOT NULL
priority INTEGER NOT NULL DEFAULT 100
requested_by TEXT
config_snapshot_json TEXT NOT NULL
sources TEXT
download_sources TEXT
playlist_scope_json TEXT
created_at TEXT DEFAULT CURRENT_TIMESTAMP
started_at TEXT
ended_at TEXT
last_error TEXT
resume_token TEXT
```

### New Table: `job_stages`

Purpose:

- track the stage-level execution status inside one job

Recommended fields:

```text
id INTEGER PRIMARY KEY AUTOINCREMENT
job_run_id INTEGER NOT NULL
stage_type TEXT NOT NULL
status TEXT NOT NULL DEFAULT 'pending'
seq_no INTEGER NOT NULL
total_items INTEGER NOT NULL DEFAULT 0
pending_items INTEGER NOT NULL DEFAULT 0
running_items INTEGER NOT NULL DEFAULT 0
success_items INTEGER NOT NULL DEFAULT 0
failed_items INTEGER NOT NULL DEFAULT 0
skipped_items INTEGER NOT NULL DEFAULT 0
started_at TEXT
ended_at TEXT
last_error TEXT
```

### New Table: `job_items`

Purpose:

- track the real execution unit for recovery and retry

Granularity by stage:

- `collect`
  - one pool/source fetch unit
- `sync`
  - one playlist expansion unit
- `download`
  - one song download unit
- `upload`
  - one file upload unit

Recommended fields:

```text
id INTEGER PRIMARY KEY AUTOINCREMENT
job_stage_id INTEGER NOT NULL
item_type TEXT NOT NULL
item_key TEXT NOT NULL
playlist_pool_id INTEGER
playlist_id INTEGER
song_id INTEGER
file_location_id INTEGER
status TEXT NOT NULL DEFAULT 'pending'
attempt_count INTEGER NOT NULL DEFAULT 0
max_attempts INTEGER NOT NULL DEFAULT 3
worker_id INTEGER
started_at TEXT
ended_at TEXT
last_error TEXT
last_error_code TEXT
payload_json TEXT
UNIQUE(job_stage_id, item_key)
```
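
The `UNIQUE(job_stage_id, item_key)` constraint makes item seeding idempotent, which matters when a stage is re-planned after a restart. The sketch below uses a cut-down table and a hypothetical `song:<id>` key format to show the effect with `INSERT OR IGNORE`.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Cut-down version of job_items: only the columns this sketch needs.
conn.execute(
    """
    CREATE TABLE job_items (
        id INTEGER PRIMARY KEY AUTOINCREMENT,
        job_stage_id INTEGER NOT NULL,
        item_type TEXT NOT NULL,
        item_key TEXT NOT NULL,
        song_id INTEGER,
        status TEXT NOT NULL DEFAULT 'pending',
        attempt_count INTEGER NOT NULL DEFAULT 0,
        UNIQUE(job_stage_id, item_key)
    )
    """
)


def count_items(conn: sqlite3.Connection, job_stage_id: int) -> int:
    row = conn.execute(
        "SELECT COUNT(*) FROM job_items WHERE job_stage_id = ?", (job_stage_id,)
    ).fetchone()
    return row[0]


def seed_download_items(conn: sqlite3.Connection, job_stage_id: int, song_ids: list[int]) -> int:
    """Insert one download item per song; return how many rows were actually added."""
    before = count_items(conn, job_stage_id)
    with conn:
        conn.executemany(
            "INSERT OR IGNORE INTO job_items (job_stage_id, item_type, item_key, song_id) "
            "VALUES (?, 'download', ?, ?)",
            [(job_stage_id, f"song:{song_id}", song_id) for song_id in song_ids],
        )
    return count_items(conn, job_stage_id) - before


first_added = seed_download_items(conn, 1, [10, 11, 12])
second_added = seed_download_items(conn, 1, [11, 12, 13])  # only song 13 is new
```

Re-seeding leaves existing rows, including their status and attempt counts, untouched.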

### New Table: `job_workers`

Purpose:

- surface live worker state to the UI
- show which song each worker is processing

Recommended fields:

```text
id INTEGER PRIMARY KEY AUTOINCREMENT
job_run_id INTEGER NOT NULL
job_stage_id INTEGER
worker_name TEXT NOT NULL
status TEXT NOT NULL DEFAULT 'idle'
current_job_item_id INTEGER
current_song_id INTEGER
current_playlist_id INTEGER
current_display_text TEXT
heartbeat_at TEXT
last_progress_text TEXT
processed_count INTEGER NOT NULL DEFAULT 0
error_count INTEGER NOT NULL DEFAULT 0
```
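
The `heartbeat_at` column lets the UI flag workers that have gone quiet. A small sketch, assuming timestamps are stored in the UTC `YYYY-MM-DD HH:MM:SS` form that SQLite's `CURRENT_TIMESTAMP` produces; the 30-second timeout is an arbitrary illustrative value.

```python
from datetime import datetime, timedelta, timezone
from typing import Optional


def is_stale(heartbeat_at: Optional[str], now: datetime, timeout_seconds: int = 30) -> bool:
    """Return True when a worker has no heartbeat or its last one is too old."""
    if heartbeat_at is None:
        return True
    last = datetime.fromisoformat(heartbeat_at)
    if last.tzinfo is None:
        # Assumption: naive timestamps in the table are UTC.
        last = last.replace(tzinfo=timezone.utc)
    return now - last > timedelta(seconds=timeout_seconds)


now = datetime(2026, 4, 16, 10, 1, 0, tzinfo=timezone.utc)
stale = is_stale("2026-04-16 10:00:00", now)   # 60 seconds old
fresh = is_stale("2026-04-16 10:00:45", now)   # 15 seconds old
```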

### New Table: `job_commands`

Purpose:

- safely bridge UI actions and runner behavior

Recommended command types:

- `pause`
- `resume`
- `cancel`
- `retry_item`
- `force_retry_item`

Recommended fields:

```text
id INTEGER PRIMARY KEY AUTOINCREMENT
job_run_id INTEGER NOT NULL
command_type TEXT NOT NULL
target_item_id INTEGER
status TEXT NOT NULL DEFAULT 'pending'
created_at TEXT DEFAULT CURRENT_TIMESTAMP
applied_at TEXT
payload_json TEXT
```
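
A sketch of the command bridge: the API only inserts rows, and the runner drains them in `id` order on each loop. The `applied` status value is an assumption; the design names only the `pending` default and the `applied_at` column.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    """
    CREATE TABLE job_commands (
        id INTEGER PRIMARY KEY AUTOINCREMENT,
        job_run_id INTEGER NOT NULL,
        command_type TEXT NOT NULL,
        target_item_id INTEGER,
        status TEXT NOT NULL DEFAULT 'pending',
        created_at TEXT DEFAULT CURRENT_TIMESTAMP,
        applied_at TEXT,
        payload_json TEXT
    )
    """
)


def enqueue_command(conn, job_run_id, command_type, target_item_id=None):
    """What an API handler would do: record the intent and nothing else."""
    with conn:
        conn.execute(
            "INSERT INTO job_commands (job_run_id, command_type, target_item_id) "
            "VALUES (?, ?, ?)",
            (job_run_id, command_type, target_item_id),
        )


def take_pending_commands(conn, job_run_id):
    """What the runner would do: read pending commands in order, mark them applied."""
    with conn:
        rows = conn.execute(
            "SELECT id, command_type, target_item_id FROM job_commands "
            "WHERE job_run_id = ? AND status = 'pending' ORDER BY id",
            (job_run_id,),
        ).fetchall()
        conn.executemany(
            "UPDATE job_commands SET status = 'applied', applied_at = CURRENT_TIMESTAMP "
            "WHERE id = ?",
            [(row[0],) for row in rows],
        )
    return [(row[1], row[2]) for row in rows]


enqueue_command(conn, 1, "pause")
enqueue_command(conn, 1, "retry_item", target_item_id=42)
first_poll = take_pending_commands(conn, 1)
second_poll = take_pending_commands(conn, 1)  # nothing left to apply
```

Routing every UI action through this table keeps the runner as the only writer of job, stage, and item state.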

### New Table: `job_events`

Purpose:

- structured audit trail for major runner events

Recommended event types include:

- `job_started`
- `stage_started`
- `item_started`
- `item_failed`
- `pause_requested`
- `resumed`
- `worker_heartbeat`
- `recovery_requeued`

### New Table: `job_logs`

Purpose:

- queryable log lines for the UI

Recommended fields:

```text
id INTEGER PRIMARY KEY AUTOINCREMENT
job_run_id INTEGER NOT NULL
job_stage_id INTEGER
worker_id INTEGER
level TEXT NOT NULL
message TEXT NOT NULL
created_at TEXT DEFAULT CURRENT_TIMESTAMP
```

### New Table: `config_revisions`

Purpose:

- keep revision history of `catalogsync.env`

Recommended fields:

```text
id INTEGER PRIMARY KEY AUTOINCREMENT
source_type TEXT NOT NULL DEFAULT 'env_file'
file_path TEXT NOT NULL
content_text TEXT NOT NULL
content_hash TEXT NOT NULL
created_at TEXT DEFAULT CURRENT_TIMESTAMP
applied_at TEXT
note TEXT
```
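
The `content_hash` column can be used to skip saving a revision whose text is identical to the latest one. The sketch assumes SHA-256 over the UTF-8 file content; the design does not name a hash algorithm, and both function names are hypothetical.

```python
import hashlib
from typing import Optional


def content_hash(content_text: str) -> str:
    """Hex digest of the env file content (SHA-256 is an assumption)."""
    return hashlib.sha256(content_text.encode("utf-8")).hexdigest()


def needs_new_revision(new_text: str, latest_hash: Optional[str]) -> bool:
    """Save a revision only when there is no history or the content changed."""
    return latest_hash is None or content_hash(new_text) != latest_hash


latest = content_hash("LIBRARY_ROOT=/volume1/music\n")
unchanged = needs_new_revision("LIBRARY_ROOT=/volume1/music\n", latest)
changed = needs_new_revision("LIBRARY_ROOT=/volume2/music\n", latest)
```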

## UI Design

### Page 1: Dashboard

Show:

- current active job
- queue length
- downloaded song count
- uploaded file count
- failed item count
- per-stage summaries
- recent exceptions
- worker heartbeat overview

### Page 2: Job Center

Show:

- queued jobs
- running or paused job
- job template
- scope
- stage progression
- pause, resume, cancel controls

Allow:

- creating a new job from the supported templates
- changing priority of queued jobs if desired

### Page 3: Playlist Pools

Show:

- all playlist pools and playlists
- source platform
- pool kind
- song count
- downloaded count
- uploaded count
- main status
- current stage
- last processed time
- latest error summary

#### Derived Playlist Status Rules

Derive the main status by checking these rules in order; the first match wins:

- `异常` (error)
  - any recent failed item exists for the playlist
- `进行中` (in progress)
  - any item is `running`, including items still finishing while their stage is `pause_requested`
- `未完成` (incomplete)
  - unfinished items remain but the playlist is not actively processing
- `已完成` (completed)
  - no unfinished item remains in the relevant pipeline scope
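
The rules above can be sketched as a single function over the item statuses of one playlist. Two assumptions: the caller has already limited the input to recent items in the relevant pipeline scope, and `pending` and `interrupted` are the statuses that count as unfinished. Items have no `pause_requested` status of their own, so the sketch keys on `running`.

```python
from collections import Counter
from typing import Iterable


def derive_playlist_status(item_statuses: Iterable[str]) -> str:
    """Return the display status for one playlist; rules are checked in priority order."""
    counts = Counter(item_statuses)
    if counts["failed"]:
        return "异常"
    if counts["running"]:
        return "进行中"
    if counts["pending"] or counts["interrupted"]:
        return "未完成"
    return "已完成"
```

With this ordering a playlist that has one failed song and several running ones is shown as `异常`, so failures are not hidden behind ongoing work.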

### Page 4: Song Processing

Show:

- each worker and its current song
- failed songs
- interrupted songs
- retryable items

Allow:

- retry single item
- force-retry single item
- filter by stage, platform, playlist, or error state

### Page 5: Logs And Exceptions

Show:

- structured events
- text logs
- job-level and item-level errors
- stack traces or HTTP error summaries where available

### Page 6: Config Management

Show:

- current `catalogsync.env`
- parsed effective values
- validation errors
- revision history

Allow:

- save a new env revision
- re-apply a previous revision

Rule:

- config edits affect only future jobs unless an explicit resume override is supplied

## API Surface

Recommended management endpoints:

- `GET /api/dashboard`
- `GET /api/jobs`
- `POST /api/jobs`
- `GET /api/jobs/{id}`
- `POST /api/jobs/{id}/pause`
- `POST /api/jobs/{id}/resume`
- `POST /api/jobs/{id}/cancel`
- `GET /api/jobs/{id}/items`
- `POST /api/job-items/{id}/retry`
- `POST /api/job-items/{id}/force-retry`
- `GET /api/workers`
- `GET /api/playlists`
- `GET /api/playlists/{id}`
- `GET /api/logs`
- `GET /api/config/env`
- `PUT /api/config/env`
- `GET /api/config/revisions`
- `POST /api/config/revisions/{id}/apply`
- `GET /api/events/stream`

`/api/events/stream` should use server-sent events so the dashboard and worker pages can refresh without polling every table separately.
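
For reference, a server-sent event is a small text frame of `field: value` lines ended by a blank line. The helper below is a hypothetical formatter that a streaming endpoint could yield from; the payload fields are illustrative.

```python
import json
from typing import Optional


def format_sse(event: str, data: dict, event_id: Optional[int] = None) -> str:
    """Build one SSE frame: optional id, event name, one data line, blank line."""
    lines = []
    if event_id is not None:
        lines.append(f"id: {event_id}")
    lines.append(f"event: {event}")
    # json.dumps emits a single line by default, so one data field is enough.
    lines.append(f"data: {json.dumps(data, ensure_ascii=False)}")
    return "\n".join(lines) + "\n\n"


frame = format_sse("worker_heartbeat", {"worker": "w1"}, event_id=5)
```

In the browser, an `EventSource` listener registered for the `worker_heartbeat` event name would receive the JSON payload from the `data` field.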

## Pause, Resume, And Recovery Rules

### Soft Pause

The only supported pause mode in v1 is soft pause.

Behavior:

- UI inserts a `pause` command
- the runner marks the job and current stage as `pause_requested`
- workers stop claiming new items
- any in-progress item is allowed to finish naturally
- once all workers are idle, the stage becomes `paused` and then the job becomes `paused`

This avoids half-written file state and keeps item completion boundaries clean.
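
A minimal single-worker sketch of the soft-pause boundary: the pause flag is checked only before claiming the next item, so an item that has started always finishes. The in-memory `deque` and `threading.Event` stand in for the database-backed queue and the `pause_requested` state.

```python
import threading
from collections import deque
from typing import Callable


def run_worker(pending: deque, pause_requested: threading.Event, process: Callable[[str], None]) -> list:
    """Process items until the queue is empty or a pause has been requested."""
    done = []
    while pending and not pause_requested.is_set():
        item = pending.popleft()  # claim the next item
        process(item)             # never interrupted midway
        done.append(item)
    return done


pause_flag = threading.Event()
queue = deque(["song-1", "song-2", "song-3", "song-4"])


def process(item: str) -> None:
    # Simulate the operator pressing pause while song-2 is being handled.
    if item == "song-2":
        pause_flag.set()


finished = run_worker(queue, pause_flag, process)
remaining = list(queue)
```

Here `song-2` still completes even though the pause arrived while it was in progress, and the untouched items stay in the queue for resume.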

### Resume

Resume behavior:

- UI inserts a `resume` command
- the runner validates the job can continue
- the runner resets paused stage and job state back to `running`
- unstarted items stay `pending`
- succeeded items remain untouched

The resume action may optionally carry a limited override payload, such as a new library root after disk exhaustion.

### Crash Recovery

On runner startup:

1. find all jobs with status `running` or `pause_requested`
2. mark those jobs, and any of their stages still in `running` or `pause_requested`, as `paused`
3. find all `job_items` left in `running`
4. convert those items to `interrupted`
5. record a recovery event

After that:

- `succeeded` items remain done
- `pending` items remain pending
- `interrupted` items become eligible for retry or auto-requeue depending on stage policy
- `failed` items remain failed until explicit retry

This preserves progress without restarting the whole job or whole database.
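
The startup transform for jobs and items can be sketched as two updates in one transaction. The tables are cut down to the columns involved, and recording the recovery event is left to the caller because this document does not define `job_events` fields.

```python
import sqlite3


def recover_after_crash(conn: sqlite3.Connection) -> dict:
    """Pause orphaned jobs and mark orphaned running items as interrupted."""
    with conn:  # commits on success, rolls back on error
        jobs = conn.execute(
            "UPDATE job_runs SET status = 'paused' "
            "WHERE status IN ('running', 'pause_requested')"
        ).rowcount
        items = conn.execute(
            "UPDATE job_items SET status = 'interrupted' WHERE status = 'running'"
        ).rowcount
    return {"jobs_paused": jobs, "items_interrupted": items}


# Demo with cut-down tables (only the columns this sketch touches).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE job_runs (id INTEGER PRIMARY KEY, status TEXT NOT NULL)")
conn.execute("CREATE TABLE job_items (id INTEGER PRIMARY KEY, status TEXT NOT NULL)")
conn.execute("INSERT INTO job_runs (status) VALUES ('running')")
conn.executemany(
    "INSERT INTO job_items (status) VALUES (?)",
    [("succeeded",), ("running",), ("running",), ("pending",), ("failed",)],
)
summary = recover_after_crash(conn)
statuses = [row[0] for row in conn.execute("SELECT status FROM job_items ORDER BY id")]
```

Only the two items that were mid-flight change state; everything else keeps the status it had before the crash.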

## Retry Rules

### Single Item Retry

When the operator clicks retry for a failed or interrupted item:

- insert a `job_commands` row with `command_type = 'retry_item'` and `target_item_id` set to the item
- clear execution fields on the target item
- set status back to `pending`
- increment `attempt_count` on the next worker claim
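
A sketch of the item-level part of these steps, using a cut-down `job_items` table. Which columns count as "execution fields" is an assumption here (`worker_id`, `started_at`, `ended_at`); the status guard makes the retry a no-op for items that are not failed or interrupted.

```python
import sqlite3


def retry_item(conn: sqlite3.Connection, item_id: int) -> bool:
    """Reset a failed or interrupted item to pending; return False if not eligible."""
    with conn:
        cur = conn.execute(
            "UPDATE job_items SET status = 'pending', worker_id = NULL, "
            "started_at = NULL, ended_at = NULL "
            "WHERE id = ? AND status IN ('failed', 'interrupted')",
            (item_id,),
        )
    return cur.rowcount == 1


def claim_item(conn: sqlite3.Connection, item_id: int, worker_id: int) -> bool:
    """Worker claim: the attempt counter is incremented here, not at retry time."""
    with conn:
        cur = conn.execute(
            "UPDATE job_items SET status = 'running', worker_id = ?, "
            "attempt_count = attempt_count + 1, started_at = CURRENT_TIMESTAMP "
            "WHERE id = ? AND status = 'pending'",
            (worker_id, item_id),
        )
    return cur.rowcount == 1


conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE job_items (id INTEGER PRIMARY KEY, status TEXT NOT NULL, "
    "attempt_count INTEGER NOT NULL DEFAULT 0, worker_id INTEGER, "
    "started_at TEXT, ended_at TEXT)"
)
conn.execute("INSERT INTO job_items (id, status, attempt_count, worker_id) VALUES (1, 'failed', 1, 3)")
conn.execute("INSERT INTO job_items (id, status) VALUES (2, 'succeeded')")

retried = retry_item(conn, 1)
refused = retry_item(conn, 2)  # succeeded items are not eligible
claimed = claim_item(conn, 1, worker_id=5)
row = conn.execute("SELECT status, attempt_count, worker_id FROM job_items WHERE id = 1").fetchone()
```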

### Force Retry

Force retry is more aggressive:

- download stage may ignore an existing local mapping if the operator requests a fresh re-download
- upload stage may ignore an existing active remote mapping if the operator explicitly wants a re-upload

Force retry must stay item-scoped, never job-scoped.

## Disk Exhaustion Handling

If the downloader detects insufficient space:

- fail or interrupt the current download item
- pause the active job with a machine-readable reason such as `disk_full`
- surface a UI banner asking for a new library root override

After the operator supplies a new directory and clicks resume:

- the job continues only for unfinished items
- completed downloads are not restarted
- the currently failed song can be retried from scratch

This matches the requirement that one song may restart while the whole database must not restart.
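
A pre-download space check could look like the sketch below, which uses the standard-library `shutil.disk_usage`. The function name, the `reserve_bytes` safety margin, and the idea of checking before each item are assumptions; only the `disk_full` reason code comes from this design.

```python
import shutil
from typing import Optional


def check_download_space(library_root: str, required_bytes: int, reserve_bytes: int = 0) -> Optional[str]:
    """Return 'disk_full' when the library root cannot hold the next download, else None."""
    free = shutil.disk_usage(library_root).free
    if free < required_bytes + reserve_bytes:
        return "disk_full"
    return None


ok_reason = check_download_space(".", required_bytes=0)
full_reason = check_download_space(".", required_bytes=10**30)  # far beyond any real disk
```

A non-`None` result would be the signal to fail the current item and pause the job with that reason.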

## Execution Strategy

### Stage Executors

Implement separate executor paths for:

- `collect`
- `sync`
- `download`
- `upload`

Recommended concurrency:

- `collect`
  - low concurrency, v1 may stay serial
- `sync`
  - low concurrency, v1 may stay serial
- `download`
  - configurable worker pool
- `upload`
  - configurable worker pool

### Reuse Strategy

Prefer reusing current catalogsync modules:

- `musicdl.catalogsync.services`
- `musicdl.catalogsync.downloader`
- `musicdl.catalogsync.uploader`
- `musicdl.catalogsync.repository`

The runner should orchestrate these modules rather than rewriting the domain logic from scratch.

## Technology Choice

### Backend

Recommended stack:

- `FastAPI`
- `Jinja2`
- `SQLite`
- `SSE` for live updates

### Frontend

Recommended rendering model:

- server-rendered pages with `Jinja2`
- `HTMX` for partial updates and action forms
- a small amount of vanilla JavaScript for log streaming and live worker refresh

Why this fits:

- NAS-local internal tool
- mainly operational tables and actions
- lower dependency and deployment complexity than a separate SPA
- easier to keep aligned with the existing Python-only project

## Verification Plan

The implementation should be verified at four levels:

1. unit tests
   - state transitions
   - retry rules
   - recovery transforms
2. API integration tests
   - job creation
   - pause and resume
   - item retry
   - config revision flow
3. fault injection tests
   - kill the runner mid-download and confirm item-level recovery
4. NAS smoke tests
   - create jobs
   - pause and resume
   - crash and restart
   - retry a single failed song
   - change library directory after disk-full pause

## V1 Delivery Boundary

### Must Ship In V1

- queue-based single-active-job runner
- supported job templates
- dashboard, job center, playlist pools, song processing, logs, and config pages
- soft pause and resume
- crash-safe item-level recovery
- single-item retry and force-retry
- env revision history and apply flow

### Explicitly Deferred

- authentication
- multi-user permissions
- multiple active jobs
- distributed workers
- arbitrary stage composition
- automatic endless retries
- destructive file cleanup actions

## Open Follow-Up Items

Two source-coverage follow-ups remain outside this console design and should stay tracked separately:

- redeploy the local Kuwo toplist fallback fix to the NAS and backfill the missing collection or sync results
- repair QQ playlist square collection after the old endpoint started returning `parameter failed`

These belong to operational backlog work, not to the web console architecture itself.