musicdl-catalog-sync-suite/catalog-sync/docs/superpowers/specs/2026-04-16-catalogsync-operations-console-design.md

Catalogsync Operations Console Design

Goal

Extend musicdl.catalogsync with a NAS-local web operations console that can:

  • manage queue-based pipeline jobs for collect, sync, download, and upload
  • show playlist pool and playlist execution status as unfinished (未完成) / in progress (进行中) / completed (已完成) / error (异常)
  • show worker-level live processing state, especially which song each worker is handling
  • support global soft pause and resume across all active workers
  • survive process crashes or NAS restarts without restarting the whole catalog from scratch
  • allow retrying a single failed or interrupted song/item instead of rerunning the whole database
  • manage catalogsync.env as the primary operator configuration source

This design targets an internal NAS console, not a public-facing multi-user product.

Scope

In Scope

  • Add a NAS-local web console for catalogsync
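  • Add a database-backed job queue with at most one active job at a time
  • Support these job templates:
    • Full pipeline (全链路)
    • Collect only (仅采集)
    • Sync only (仅同步)
    • Sync + download (同步+下载)
    • Download only (仅下载)
    • Upload only (仅上传)
    • Download + upload (下载+上传)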
  • Track job, stage, item, and worker state in SQLite
  • Show dashboard, queue, playlist pool, worker, log, and config views
  • Implement soft pause and resume
  • Implement crash-safe recovery at job-item granularity
  • Implement single-item retry and force-retry
  • Version and edit catalogsync.env from the web console
  • Reuse existing musicdl.catalogsync collectors, services, downloader, uploader, and storage model as much as possible

Out of Scope

  • Multi-user login or permissions
  • Public internet exposure or hardened auth
  • Multiple active jobs running at the same time
  • Cross-machine worker distribution
  • Arbitrary user-defined stage graphs
  • Provider-specific cloud drive management beyond current object storage support
  • Automatic deletion of local or remote files
  • Editing business data such as songs or playlists directly from the UI

Constraints

  • The console runs on the NAS itself
  • catalogsync.env remains the configuration source of truth
  • A queued job must freeze the required runtime settings into a config snapshot so later env edits do not mutate in-flight work
  • Recovery must resume from unfinished work items instead of rerunning all songs or all playlists
  • Existing musicdl.catalogsync CLI and scripts must remain usable
  • The first version should prioritize operational stability, inspectability, and recoverability over architectural purity

Operator Model

Deployment Model

The web console runs on the same NAS host that already owns:

  • the SQLite database
  • the local music library
  • the logs directory
  • the runtime scripts
  • the object storage configuration

This avoids a remote-control architecture for v1 and keeps job control, log access, file state, and recovery local.

Configuration Model

catalogsync.env remains the operator-managed source of truth.

The console may:

  • display current env values
  • validate and save new env revisions
  • apply a previous env revision as the current file

Queued jobs must store a config_snapshot_json copy of the relevant settings so:

  • existing queued or running jobs stay deterministic
  • later env edits only affect newly created jobs
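
A minimal sketch of freezing a snapshot at enqueue time, assuming catalogsync.env uses plain KEY=VALUE lines. The key names in SNAPSHOT_KEYS and the helpers parse_env and build_config_snapshot are hypothetical and only illustrate the idea:

```python
import json

# Hypothetical subset of keys a job needs; the real list depends on catalogsync.env.
SNAPSHOT_KEYS = ("LIBRARY_ROOT", "DOWNLOAD_WORKERS", "UPLOAD_WORKERS")


def parse_env(text):
    """Parse simple KEY=VALUE lines, skipping blanks and # comments."""
    values = {}
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        values[key.strip()] = value.strip().strip('"').strip("'")
    return values


def build_config_snapshot(env_text, keys=SNAPSHOT_KEYS):
    """Return the JSON string to store in job_runs.config_snapshot_json."""
    env = parse_env(env_text)
    return json.dumps({k: env[k] for k in keys if k in env}, sort_keys=True)


env_v1 = "LIBRARY_ROOT=/volume1/music\nDOWNLOAD_WORKERS=4\n"
snapshot = build_config_snapshot(env_v1)  # frozen when the job is queued

# A later env edit yields a different snapshot for new jobs only.
env_v2 = env_v1.replace("/volume1/music", "/volume2/music")
```

Because the snapshot is a plain JSON string stored with the job, editing the env file afterwards cannot change what an already queued job runs with.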

Architecture

Use four layers:

  1. Web Console
    • browser UI for dashboards, queue control, logs, and config management
  2. Management API
    • serves data and accepts job or config commands
  3. Job Orchestrator / Runner
    • single-process scheduler that owns queue progression, pause, resume, and recovery
  4. Existing Catalogsync Executors
    • reuse collect, sync, download, and upload behavior from current package modules

Why Not A Thin Shell Wrapper

Wrapping only download_all.sh and upload_all.sh would not reliably provide:

  • worker-level current song visibility
  • item-level retry
  • fine-grained recovery after process crashes
  • stable soft pause and resume

The console therefore needs first-class job and work-item tables instead of depending only on raw shell output.

Job Model

Active Job Policy

  • only one job may be running at a time
  • additional jobs stay queued
  • a paused job may later resume and reclaim the active slot

This keeps:

  • pause and resume semantics simple
  • resource ownership clear
  • crash recovery easier to reason about

Job Templates

Supported templates and stage chains:

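  • Full pipeline (全链路)
    • collect -> sync -> download -> upload
  • Collect only (仅采集)
    • collect
  • Sync only (仅同步)
    • sync
  • Sync + download (同步+下载)
    • sync -> download
  • Download only (仅下载)
    • download
  • Upload only (仅上传)
    • upload
  • Download + upload (下载+上传)
    • download -> upload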
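
The template-to-stage mapping above could be held in a small lookup table. This is a sketch only: the template identifiers (for example full_pipeline) and the plan_stages helper are hypothetical names, not existing catalogsync code:

```python
# Hypothetical template identifiers mapped to the stage chains listed above.
JOB_TEMPLATES = {
    "full_pipeline": ("collect", "sync", "download", "upload"),
    "collect_only": ("collect",),
    "sync_only": ("sync",),
    "sync_download": ("sync", "download"),
    "download_only": ("download",),
    "upload_only": ("upload",),
    "download_upload": ("download", "upload"),
}


def plan_stages(job_type):
    """Return job_stages rows (as dicts) for a template, in execution order."""
    try:
        chain = JOB_TEMPLATES[job_type]
    except KeyError:
        raise ValueError(f"unknown job template: {job_type!r}") from None
    return [
        {"stage_type": stage, "seq_no": seq_no, "status": "pending"}
        for seq_no, stage in enumerate(chain, start=1)
    ]


stages = plan_stages("sync_download")
```

Keeping the chains in one table means arbitrary stage graphs stay out of scope: a job can only be created from a known template.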

Job Status

Recommended job statuses:

  • queued
  • running
  • pause_requested
  • paused
  • completed
  • completed_with_errors
  • failed
  • canceled

Stage Status

Recommended stage statuses:

  • pending
  • running
  • pause_requested
  • paused
  • completed
  • failed
  • skipped

Work Item Status

Recommended item statuses:

  • pending
  • running
  • succeeded
  • failed
  • interrupted
  • skipped
  • canceled

The work item is the recovery and retry granularity. This is what prevents a single failure from forcing a whole-catalog restart.
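
One way to keep item state changes honest is an explicit transition table. The design lists the statuses but not the allowed edges, so the table below is an assumption, as is the can_transition helper:

```python
# Assumed transition table; the design names the statuses but not the edges.
ALLOWED_ITEM_TRANSITIONS = {
    "pending": {"running", "skipped", "canceled"},
    "running": {"succeeded", "failed", "interrupted"},
    "failed": {"pending"},       # explicit operator retry
    "interrupted": {"pending"},  # retry or auto-requeue after recovery
    "succeeded": set(),
    "skipped": set(),
    "canceled": set(),
}


def can_transition(current, new):
    """Return True if an item may move from `current` to `new`."""
    return new in ALLOWED_ITEM_TRANSITIONS.get(current, set())
```

Whether force retry may reopen a succeeded item is not specified in this design, so the sketch leaves succeeded as a terminal status.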

Data Model

Existing Table Reuse

Keep current business tables as the catalog truth:

  • playlist_pools
  • playlists
  • pool_playlists
  • songs
  • playlist_songs
  • artists
  • song_artists
  • file_locations
  • object_storage_backends

These continue to answer:

  • what playlists exist
  • what songs belong to each playlist
  • which files exist locally or remotely

The new console layer adds execution truth around them.

New Table: job_runs

Purpose:

  • represent one queued or active operator job

Recommended fields:

id INTEGER PRIMARY KEY AUTOINCREMENT
job_type TEXT NOT NULL
status TEXT NOT NULL
priority INTEGER NOT NULL DEFAULT 100
requested_by TEXT
config_snapshot_json TEXT NOT NULL
sources TEXT
download_sources TEXT
playlist_scope_json TEXT
created_at TEXT DEFAULT CURRENT_TIMESTAMP
started_at TEXT
ended_at TEXT
last_error TEXT
resume_token TEXT

New Table: job_stages

Purpose:

  • track the stage-level execution status inside one job

Recommended fields:

id INTEGER PRIMARY KEY AUTOINCREMENT
job_run_id INTEGER NOT NULL
stage_type TEXT NOT NULL
status TEXT NOT NULL DEFAULT 'pending'
seq_no INTEGER NOT NULL
total_items INTEGER NOT NULL DEFAULT 0
pending_items INTEGER NOT NULL DEFAULT 0
running_items INTEGER NOT NULL DEFAULT 0
success_items INTEGER NOT NULL DEFAULT 0
failed_items INTEGER NOT NULL DEFAULT 0
skipped_items INTEGER NOT NULL DEFAULT 0
started_at TEXT
ended_at TEXT
last_error TEXT

New Table: job_items

Purpose:

  • track the real execution unit for recovery and retry

Granularity by stage:

  • collect
    • one pool/source fetch unit
  • sync
    • one playlist expansion unit
  • download
    • one song download unit
  • upload
    • one file upload unit

Recommended fields:

id INTEGER PRIMARY KEY AUTOINCREMENT
job_stage_id INTEGER NOT NULL
item_type TEXT NOT NULL
item_key TEXT NOT NULL
playlist_pool_id INTEGER
playlist_id INTEGER
song_id INTEGER
file_location_id INTEGER
status TEXT NOT NULL DEFAULT 'pending'
attempt_count INTEGER NOT NULL DEFAULT 0
max_attempts INTEGER NOT NULL DEFAULT 3
worker_id INTEGER
started_at TEXT
ended_at TEXT
last_error TEXT
last_error_code TEXT
payload_json TEXT
UNIQUE(job_stage_id, item_key)
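
A sketch of how a worker might claim the next pending item, using Python's standard sqlite3 module and a trimmed-down job_items table. The claim_next_item function is hypothetical; incrementing attempt_count at claim time follows the retry rules in this design:

```python
import sqlite3

# Trimmed-down job_items table: only the columns this sketch touches.
SCHEMA = """
CREATE TABLE job_items (
    id INTEGER PRIMARY KEY AUTOINCREMENT,
    job_stage_id INTEGER NOT NULL,
    item_key TEXT NOT NULL,
    status TEXT NOT NULL DEFAULT 'pending',
    attempt_count INTEGER NOT NULL DEFAULT 0,
    worker_id INTEGER,
    started_at TEXT,
    UNIQUE(job_stage_id, item_key)
);
"""


def claim_next_item(conn, job_stage_id, worker_id):
    """Move the oldest pending item to running; return its id or None."""
    with conn:  # commit on success, roll back on error
        row = conn.execute(
            "SELECT id FROM job_items "
            "WHERE job_stage_id = ? AND status = 'pending' ORDER BY id LIMIT 1",
            (job_stage_id,),
        ).fetchone()
        if row is None:
            return None
        # The status guard makes the claim a no-op if another worker won the race.
        cur = conn.execute(
            "UPDATE job_items SET status = 'running', worker_id = ?, "
            "attempt_count = attempt_count + 1, started_at = CURRENT_TIMESTAMP "
            "WHERE id = ? AND status = 'pending'",
            (worker_id, row[0]),
        )
        return row[0] if cur.rowcount == 1 else None


conn = sqlite3.connect(":memory:")
conn.executescript(SCHEMA)
conn.executemany(
    "INSERT INTO job_items (job_stage_id, item_key) VALUES (1, ?)",
    [("song:101",), ("song:102",)],
)
first = claim_next_item(conn, job_stage_id=1, worker_id=7)
second = claim_next_item(conn, job_stage_id=1, worker_id=8)
third = claim_next_item(conn, job_stage_id=1, worker_id=7)  # nothing left
```

The UNIQUE(job_stage_id, item_key) constraint keeps the same song or file from being enqueued twice in one stage, which is what makes re-planning after a crash safe to repeat.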

New Table: job_workers

Purpose:

  • surface live worker state to the UI
  • show which song each worker is processing

Recommended fields:

id INTEGER PRIMARY KEY AUTOINCREMENT
job_run_id INTEGER NOT NULL
job_stage_id INTEGER
worker_name TEXT NOT NULL
status TEXT NOT NULL DEFAULT 'idle'
current_job_item_id INTEGER
current_song_id INTEGER
current_playlist_id INTEGER
current_display_text TEXT
heartbeat_at TEXT
last_progress_text TEXT
processed_count INTEGER NOT NULL DEFAULT 0
error_count INTEGER NOT NULL DEFAULT 0

New Table: job_commands

Purpose:

  • safely bridge UI actions and runner behavior

Recommended command types:

  • pause
  • resume
  • cancel
  • retry_item
  • force_retry_item

Recommended fields:

id INTEGER PRIMARY KEY AUTOINCREMENT
job_run_id INTEGER NOT NULL
command_type TEXT NOT NULL
target_item_id INTEGER
status TEXT NOT NULL DEFAULT 'pending'
created_at TEXT DEFAULT CURRENT_TIMESTAMP
applied_at TEXT
payload_json TEXT
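
The runner side of this bridge could be a simple polling step, sketched below with the standard sqlite3 module. The apply_pending_commands function, the handlers mapping, and the applied and rejected status values are assumptions; the design only fixes the pending default:

```python
import sqlite3

SCHEMA = """
CREATE TABLE job_commands (
    id INTEGER PRIMARY KEY AUTOINCREMENT,
    job_run_id INTEGER NOT NULL,
    command_type TEXT NOT NULL,
    target_item_id INTEGER,
    status TEXT NOT NULL DEFAULT 'pending',
    applied_at TEXT
);
"""


def apply_pending_commands(conn, handlers):
    """Apply pending commands oldest-first; return the ids that were applied."""
    applied = []
    rows = conn.execute(
        "SELECT id, job_run_id, command_type, target_item_id "
        "FROM job_commands WHERE status = 'pending' ORDER BY id"
    ).fetchall()
    for command_id, job_run_id, command_type, target_item_id in rows:
        handler = handlers.get(command_type)
        with conn:  # each command is applied in its own transaction
            if handler is None:
                new_status = "rejected"  # assumed status for unknown commands
            else:
                handler(job_run_id, target_item_id)
                new_status = "applied"   # assumed status name
                applied.append(command_id)
            conn.execute(
                "UPDATE job_commands SET status = ?, applied_at = CURRENT_TIMESTAMP "
                "WHERE id = ?",
                (new_status, command_id),
            )
    return applied


conn = sqlite3.connect(":memory:")
conn.executescript(SCHEMA)
conn.executemany(
    "INSERT INTO job_commands (job_run_id, command_type) VALUES (?, ?)",
    [(5, "pause"), (5, "bogus")],
)
seen = []
applied = apply_pending_commands(conn, {"pause": lambda job, item: seen.append(("pause", job))})
```

Because the UI only inserts rows and the runner only consumes them, the two never need to share in-process state.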

New Table: job_events

Purpose:

  • structured audit trail for major runner events

Recommended event types include:

  • job_started
  • stage_started
  • item_started
  • item_failed
  • pause_requested
  • resumed
  • worker_heartbeat
  • recovery_requeued

New Table: job_logs

Purpose:

  • queryable log lines for the UI

Recommended fields:

id INTEGER PRIMARY KEY AUTOINCREMENT
job_run_id INTEGER NOT NULL
job_stage_id INTEGER
worker_id INTEGER
level TEXT NOT NULL
message TEXT NOT NULL
created_at TEXT DEFAULT CURRENT_TIMESTAMP

New Table: config_revisions

Purpose:

  • keep revision history of catalogsync.env

Recommended fields:

id INTEGER PRIMARY KEY AUTOINCREMENT
source_type TEXT NOT NULL DEFAULT 'env_file'
file_path TEXT NOT NULL
content_text TEXT NOT NULL
content_hash TEXT NOT NULL
created_at TEXT DEFAULT CURRENT_TIMESTAMP
applied_at TEXT
note TEXT
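
A sketch of saving a revision with a content hash, using hashlib and sqlite3 from the standard library. Choosing SHA-256 and skipping a save when the content matches the latest revision are assumptions, and save_revision is a hypothetical helper:

```python
import hashlib
import sqlite3

SCHEMA = """
CREATE TABLE config_revisions (
    id INTEGER PRIMARY KEY AUTOINCREMENT,
    source_type TEXT NOT NULL DEFAULT 'env_file',
    file_path TEXT NOT NULL,
    content_text TEXT NOT NULL,
    content_hash TEXT NOT NULL,
    created_at TEXT DEFAULT CURRENT_TIMESTAMP,
    applied_at TEXT,
    note TEXT
);
"""


def save_revision(conn, file_path, content_text, note=None):
    """Store a new revision unless it matches the latest one; return its id."""
    content_hash = hashlib.sha256(content_text.encode("utf-8")).hexdigest()
    with conn:
        latest = conn.execute(
            "SELECT id, content_hash FROM config_revisions ORDER BY id DESC LIMIT 1"
        ).fetchone()
        if latest is not None and latest[1] == content_hash:
            return latest[0]  # unchanged content: reuse the latest revision
        cur = conn.execute(
            "INSERT INTO config_revisions (file_path, content_text, content_hash, note) "
            "VALUES (?, ?, ?, ?)",
            (file_path, content_text, content_hash, note),
        )
        return cur.lastrowid


conn = sqlite3.connect(":memory:")
conn.executescript(SCHEMA)
first = save_revision(conn, "catalogsync.env", "DOWNLOAD_WORKERS=4\n", note="initial")
same = save_revision(conn, "catalogsync.env", "DOWNLOAD_WORKERS=4\n")
second = save_revision(conn, "catalogsync.env", "DOWNLOAD_WORKERS=8\n")
```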

UI Design

Page 1: Dashboard

Show:

  • current active job
  • queue length
  • downloaded song count
  • uploaded file count
  • failed item count
  • per-stage summaries
  • recent exceptions
  • worker heartbeat overview

Page 2: Job Center

Show:

  • queued jobs
  • running or paused job
  • job template
  • scope
  • stage progression
  • pause, resume, cancel controls

Allow:

  • creating a new job from the supported templates
  • changing priority of queued jobs if desired

Page 3: Playlist Pools

Show:

  • all playlist pools and playlists
  • source platform
  • pool kind
  • song count
  • downloaded count
  • uploaded count
  • main status
  • current stage
  • last processed time
  • latest error summary

Derived Playlist Status Rules

Recommend deriving the main status as:

  • error (异常)
    • any recent failed item exists for the playlist
  • in progress (进行中)
    • any item is running, including items still finishing while the job or stage is pause_requested
  • unfinished (未完成)
    • unfinished items remain but the playlist is not actively processing
  • completed (已完成)
    • no unfinished item remains in the relevant pipeline scope
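
The rules above can be read as a priority order, sketched below. The derive_playlist_status function and its English return values are hypothetical, the sketch ignores the "recent" qualifier on failures, and it treats pending and interrupted as the unfinished statuses:

```python
def derive_playlist_status(item_statuses):
    """Collapse a playlist's job_items statuses into one main status."""
    statuses = set(item_statuses)
    if "failed" in statuses:
        return "error"        # shown as 异常
    if "running" in statuses:
        return "in_progress"  # shown as 进行中
    if statuses & {"pending", "interrupted"}:
        return "unfinished"   # shown as 未完成
    return "completed"        # shown as 已完成


examples = {
    "mixed_with_failure": derive_playlist_status(["succeeded", "failed", "running"]),
    "still_downloading": derive_playlist_status(["succeeded", "running", "pending"]),
    "waiting": derive_playlist_status(["succeeded", "interrupted"]),
    "done": derive_playlist_status(["succeeded", "skipped"]),
}
```

Checking failures first means a playlist with one broken song stays visible as an error even while other songs are still downloading.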

Page 4: Song Processing

Show:

  • each worker and its current song
  • failed songs
  • interrupted songs
  • retryable items

Allow:

  • retry single item
  • force-retry single item
  • filter by stage, platform, playlist, or error state

Page 5: Logs And Exceptions

Show:

  • structured events
  • text logs
  • job-level and item-level errors
  • stack traces or HTTP error summaries where available

Page 6: Config Management

Show:

  • current catalogsync.env
  • parsed effective values
  • validation errors
  • revision history

Allow:

  • save a new env revision
  • re-apply a previous revision

Rule:

  • config edits affect only future jobs unless an explicit resume override is supplied

API Surface

Recommended management endpoints:

  • GET /api/dashboard
  • GET /api/jobs
  • POST /api/jobs
  • GET /api/jobs/{id}
  • POST /api/jobs/{id}/pause
  • POST /api/jobs/{id}/resume
  • POST /api/jobs/{id}/cancel
  • GET /api/jobs/{id}/items
  • POST /api/job-items/{id}/retry
  • POST /api/job-items/{id}/force-retry
  • GET /api/workers
  • GET /api/playlists
  • GET /api/playlists/{id}
  • GET /api/logs
  • GET /api/config/env
  • PUT /api/config/env
  • GET /api/config/revisions
  • POST /api/config/revisions/{id}/apply
  • GET /api/events/stream

/api/events/stream should use server-sent events so the dashboard and worker pages can refresh without polling every table separately.
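
For reference, a server-sent event is a small text frame of field lines terminated by a blank line. The helper below is a sketch; format_sse and the worker_progress event name are hypothetical, and the payload shape is only an example:

```python
import json
from typing import Optional


def format_sse(event: str, data: dict, event_id: Optional[int] = None) -> str:
    """Build one server-sent event frame with a single-line JSON payload."""
    lines = []
    if event_id is not None:
        lines.append(f"id: {event_id}")
    lines.append(f"event: {event}")
    # json.dumps emits no newlines by default, so one data: line is enough.
    lines.append(f"data: {json.dumps(data, ensure_ascii=False)}")
    return "\n".join(lines) + "\n\n"  # a blank line terminates the event


frame = format_sse("worker_progress", {"worker": "dl-1", "song_id": 42}, event_id=7)
```

Sending an id with each frame lets the browser report the last event it saw when it reconnects, which helps after a runner restart.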

Pause, Resume, And Recovery Rules

Soft Pause

The only supported pause mode in v1 is soft pause.

Behavior:

  • UI inserts a pause command
  • the runner marks the job and current stage as pause_requested
  • workers stop claiming new items
  • any in-progress item is allowed to finish naturally
  • once all workers are idle, the stage becomes paused and then the job becomes paused

This avoids half-written file state and keeps item completion boundaries clean.
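
The soft pause behavior can be sketched with a shared flag that workers check only between items. This is an in-memory illustration using threading and queue from the standard library; worker_loop and handle are hypothetical, and the real runner would claim items from job_items instead of a queue:

```python
import queue
import threading


def worker_loop(items, pause_requested, handle, done):
    """Process items until the queue is empty or a pause is requested."""
    while not pause_requested.is_set():  # checked only between items
        try:
            item = items.get_nowait()
        except queue.Empty:
            return
        handle(item)       # an in-progress item always runs to completion
        done.append(item)


items = queue.Queue()
for song_id in range(1, 6):
    items.put(song_id)

pause_requested = threading.Event()
done = []


def handle(item):
    if item == 2:
        pause_requested.set()  # simulate the operator pausing mid-item


worker = threading.Thread(target=worker_loop, args=(items, pause_requested, handle, done))
worker.start()
worker.join()  # all workers idle: the stage and job can now become paused

job_status = "paused" if pause_requested.is_set() and not items.empty() else "completed"
```

In this run the worker finishes song 2 even though the pause arrived while it was in progress, and the three remaining songs stay unclaimed for resume.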

Resume

Resume behavior:

  • UI inserts a resume command
  • the runner validates the job can continue
  • the runner resets paused stage and job state back to running
  • unstarted items stay pending
  • succeeded items remain untouched

The resume action may optionally carry a limited override payload, such as a new library root after disk exhaustion.

Crash Recovery

On runner startup:

  1. find all jobs with status running or pause_requested
  2. mark those jobs paused
  3. find all job_items left in running
  4. convert those items to interrupted
  5. record a recovery event

After that:

  • succeeded items remain done
  • pending items remain pending
  • interrupted items become eligible for retry or auto-requeue depending on stage policy
  • failed items remain failed until explicit retry

This preserves progress without restarting the whole job or whole database.
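
The startup steps above map onto two UPDATE statements and one event insert. The sketch below uses minimal stand-in tables; recover_after_crash is a hypothetical function, the job_events columns are assumed because this design does not list them, and using recovery_requeued as the event type for this step is also an assumption:

```python
import json
import sqlite3

SCHEMA = """
CREATE TABLE job_runs (id INTEGER PRIMARY KEY, status TEXT NOT NULL);
CREATE TABLE job_items (id INTEGER PRIMARY KEY, status TEXT NOT NULL);
CREATE TABLE job_events (
    id INTEGER PRIMARY KEY, event_type TEXT NOT NULL, payload_json TEXT
);
"""


def recover_after_crash(conn):
    """Pause orphaned jobs, mark orphaned items interrupted, log one event."""
    with conn:  # all three writes commit together or not at all
        jobs = conn.execute(
            "UPDATE job_runs SET status = 'paused' "
            "WHERE status IN ('running', 'pause_requested')"
        ).rowcount
        items = conn.execute(
            "UPDATE job_items SET status = 'interrupted' WHERE status = 'running'"
        ).rowcount
        summary = {"jobs_paused": jobs, "items_interrupted": items}
        conn.execute(
            "INSERT INTO job_events (event_type, payload_json) VALUES (?, ?)",
            ("recovery_requeued", json.dumps(summary)),
        )
    return summary


conn = sqlite3.connect(":memory:")
conn.executescript(SCHEMA)
conn.executemany("INSERT INTO job_runs (status) VALUES (?)",
                 [("running",), ("queued",), ("completed",)])
conn.executemany("INSERT INTO job_items (status) VALUES (?)",
                 [("succeeded",), ("running",), ("running",), ("pending",)])
summary = recover_after_crash(conn)
```

Running the same function a second time changes nothing, so it is safe to call unconditionally on every runner start.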

Retry Rules

Single Item Retry

When the operator clicks retry for a failed or interrupted item:

  • insert a retry_item command row into job_commands
  • clear execution fields on the target item
  • set status back to pending
  • increment attempt_count on the next worker claim
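
A sketch of the runner applying a retry to one item. The retry_item function is hypothetical, and which columns count as "execution fields" is an assumption; attempt_count is deliberately left alone because it is incremented on the next worker claim:

```python
import sqlite3

SCHEMA = """
CREATE TABLE job_items (
    id INTEGER PRIMARY KEY AUTOINCREMENT,
    status TEXT NOT NULL DEFAULT 'pending',
    attempt_count INTEGER NOT NULL DEFAULT 0,
    worker_id INTEGER,
    started_at TEXT,
    ended_at TEXT,
    last_error TEXT,
    last_error_code TEXT
);
"""


def retry_item(conn, item_id):
    """Reset a failed or interrupted item to pending; return True if reset."""
    with conn:
        cur = conn.execute(
            "UPDATE job_items SET status = 'pending', worker_id = NULL, "
            "started_at = NULL, ended_at = NULL, "
            "last_error = NULL, last_error_code = NULL "
            "WHERE id = ? AND status IN ('failed', 'interrupted')",
            (item_id,),
        )
        return cur.rowcount == 1


conn = sqlite3.connect(":memory:")
conn.executescript(SCHEMA)
conn.executemany(
    "INSERT INTO job_items (status, attempt_count, worker_id, last_error) "
    "VALUES (?, ?, ?, ?)",
    [("failed", 1, 3, "HTTP 503"), ("succeeded", 1, 4, None)],
)
retried = retry_item(conn, 1)   # failed item goes back to pending
refused = retry_item(conn, 2)   # succeeded item is left untouched
```

The status guard in the WHERE clause keeps a plain retry from ever touching succeeded work, which is the difference from force retry.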

Force Retry

Force retry is more aggressive:

  • download stage may ignore an existing local mapping if the operator requests a fresh re-download
  • upload stage may ignore an existing active remote mapping if the operator explicitly wants a re-upload

Force retry must stay item-scoped, never job-scoped.

Disk Exhaustion Handling

If the downloader detects insufficient space:

  • fail or interrupt the current download item
  • pause the active job with a machine-readable reason such as disk_full
  • surface a UI banner asking for a new library root override

After the operator supplies a new directory and clicks resume:

  • the job continues only for unfinished items
  • completed downloads are not restarted
  • the currently failed song can be retried from scratch

This matches the requirement that one song may restart while the whole database must not restart.
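
Detection itself can be a free-space check before each download, sketched here with shutil.disk_usage from the standard library. The check_disk helper and the 512 MiB reserve are assumptions; disk_full is the reason code suggested above:

```python
import shutil
import tempfile

DEFAULT_RESERVE_BYTES = 512 * 1024 * 1024  # assumed safety margin


def check_disk(path, required_bytes, reserve_bytes=DEFAULT_RESERVE_BYTES):
    """Return None if the download fits, else the pause reason 'disk_full'."""
    free = shutil.disk_usage(path).free
    if free >= required_bytes + reserve_bytes:
        return None
    return "disk_full"


library_root = tempfile.gettempdir()  # stand-in for the real library root
ok = check_disk(library_root, required_bytes=0, reserve_bytes=0)
too_big = check_disk(library_root, required_bytes=10**30)
```

Checking before the write starts is what keeps the failure at item scope: the runner can pause with a reason instead of leaving a truncated file behind.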

Execution Strategy

Stage Executors

Implement separate executor paths for:

  • collect
  • sync
  • download
  • upload

Recommended concurrency:

  • collect
    • low concurrency, v1 may stay serial
  • sync
    • low concurrency, v1 may stay serial
  • download
    • configurable worker pool
  • upload
    • configurable worker pool

Reuse Strategy

Prefer reusing current catalogsync modules:

  • musicdl.catalogsync.services
  • musicdl.catalogsync.downloader
  • musicdl.catalogsync.uploader
  • musicdl.catalogsync.repository

The runner should orchestrate these modules rather than rewriting the domain logic from scratch.

Technology Choice

Backend

Recommended stack:

  • FastAPI
  • Jinja2
  • SQLite
  • SSE for live updates

Frontend

Recommended rendering model:

  • server-rendered pages with Jinja2
  • HTMX for partial updates and action forms
  • a small amount of vanilla JavaScript for log streaming and live worker refresh

Why this fits:

  • NAS-local internal tool
  • mainly operational tables and actions
  • lower dependency and deployment complexity than a separate SPA
  • easier to keep aligned with the existing Python-only project

Verification Plan

The implementation should be verified at four levels:

  1. unit tests
    • state transitions
    • retry rules
    • recovery transforms
  2. API integration tests
    • job creation
    • pause and resume
    • item retry
    • config revision flow
  3. fault injection tests
    • kill the runner mid-download and confirm item-level recovery
  4. NAS smoke tests
    • create jobs
    • pause and resume
    • crash and restart
    • retry a single failed song
    • change library directory after disk-full pause

V1 Delivery Boundary

Must Ship In V1

  • queue-based single-active-job runner
  • supported job templates
  • dashboard, job center, playlist pools, song processing, logs, and config pages
  • soft pause and resume
  • crash-safe item-level recovery
  • single-item retry and force-retry
  • env revision history and apply flow

Explicitly Deferred

  • authentication
  • multi-user permissions
  • multiple active jobs
  • distributed workers
  • arbitrary stage composition
  • automatic endless retries
  • destructive file cleanup actions

Open Follow-Up Items

Two source-coverage follow-ups remain outside this console design and should stay tracked separately:

  • redeploy the local Kuwo toplist fallback fix to the NAS and backfill the missing collection or sync results
  • repair QQ playlist square collection after the old endpoint started returning a "parameter failed" error

These belong to operational backlog work, not to the web console architecture itself.