musicdl-catalog-sync-suite/catalog-sync/docs/superpowers/specs/2026-04-16-object-storage-upload-design.md

Object Storage Upload Automation Design

Goal

Extend musicdl.catalogsync with a first-class object storage upload workflow that:

  • uploads downloaded local files to an S3-compatible object storage backend
  • preserves local files after upload
  • mirrors the local relative path into the remote object key
  • records remote locations in the catalog database
  • tracks backend presence per song for fast lookup
  • supports queue-based upload execution and limited concurrency
  • updates docs/catalogsync.md alongside the implementation so operator docs stay current

This sub-project also introduces limited download concurrency so that very large catalogs do not have to be processed fully serially.

Scope

In Scope

  • Add a queue-based upload workflow for object storage backends
  • Reuse storage_backends, file_assets, and file_locations as the primary storage model
  • Add a song/backend presence summary table
  • Add an upload task queue table
  • Add CLI commands to register an object storage backend and upload files to it
  • Support S3-compatible object storage as the first upload backend type
  • Store non-secret backend configuration in the database
  • Read secrets from environment variables at runtime
  • Mirror local relative paths into remote object keys
  • Keep local files after successful upload
  • Mark remote object locations as non-primary while local files remain primary
  • Support queue-based concurrent upload workers
  • Add limited concurrent download workers
  • When download space is exhausted, pause the whole download flow once, prompt for a new directory once, then continue later tasks under the new root
  • Update docs/catalogsync.md to document the upload workflow, object storage backend configuration, and the new commands

Out of Scope

  • 123 cloud implementation
  • Baidu Netdisk implementation
  • Remote HEAD verification before every upload
  • Automatic deletion of local files after upload
  • Multi-backend upload in a single command
  • GUI integration
  • CDN upload orchestration beyond deriving an optional public URL
  • Background daemon / scheduler service

Constraints

  • Keep the current musicdl.catalogsync data model as the source of truth
  • Do not duplicate file location truth into songs
  • Do not store secret access credentials in SQLite
  • First upload backend must be generic S3-compatible object storage
  • Default behavior must trust database state rather than querying remote object existence every time
  • Upload behavior must preserve existing local download behavior
  • Download and upload concurrency must remain limited and operator-controllable

Use the existing storage model as the base:

  • storage_backends
    • backend definition
  • file_assets
    • file-version identity
  • file_locations
    • concrete physical or remote locations

Add two new layers:

  • song_backend_presence
    • fast summary of whether a song has active files on a given backend
  • upload_tasks
    • queue of upload work items per file asset and target backend/key

Implement one new uploader component:

  • S3CompatibleUploader
    • resolves credentials from environment
    • uploads a local file to a configured backend
    • writes the resulting remote file location
    • refreshes backend presence

Keep the user-facing CLI small:

  • register-object-backend
  • upload

Internally, upload should still be queue-driven:

  1. enumerate missing remote uploads
  2. enqueue deduplicated tasks
  3. consume tasks with limited workers

Data Model

Existing Table Reuse

storage_backends

Object storage backends should reuse the current table with the following conventions:

  • backend_type = 'object_storage'
  • name
    • stable operator-facing backend name, for example main-s3
  • container_name
    • object storage bucket name
  • base_path
    • unused for object storage, may remain NULL
  • config_json
    • non-secret configuration only

Recommended config_json keys:

  • endpoint
  • region
  • base_prefix
  • addressing_style
  • public_base_url
  • credential_env_prefix

Secrets must not be stored here.

file_assets

No semantic changes are required.

The upload unit stays aligned with the current model:

  • one file_asset represents one concrete file version for a song
  • if a song has multiple active local file versions, all of them are eligible for upload

file_locations

No structural redesign is required.

For object storage locations:

  • backend_id
    • target object storage backend
  • container_name
    • bucket
  • locator
    • object key
  • absolute_path
    • NULL
  • remote_file_id
    • optional, reserved for future provider-specific remote IDs
  • public_url
    • derived if backend config provides public_base_url
  • download_url
    • optional, first version may keep this NULL
  • status
    • active, deleted, or failed
  • is_primary
    • 0 for remote object storage in the first version

The local location remains:

  • status = 'active'
  • is_primary = 1

New Table: song_backend_presence

Purpose:

  • answer “does this song have active files on backend X?” quickly
  • avoid pushing hard-coded backend presence fields into songs

Recommended schema:

song_id INTEGER NOT NULL
backend_id INTEGER NOT NULL
has_active_file INTEGER NOT NULL DEFAULT 0
active_file_count INTEGER NOT NULL DEFAULT 0
primary_file_location_id INTEGER
updated_at TEXT DEFAULT CURRENT_TIMESTAMP
PRIMARY KEY(song_id, backend_id)

Rules:

  • this is a derived summary table, not the source of truth
  • truth still comes from file_locations
  • refresh this row whenever a location on that song/backend becomes active or inactive

New Table: upload_tasks

Purpose:

  • queue upload work
  • support retries, concurrency, and resumable batch execution

Recommended schema:

id INTEGER PRIMARY KEY AUTOINCREMENT
file_asset_id INTEGER NOT NULL
source_location_id INTEGER NOT NULL
target_backend_id INTEGER NOT NULL
target_container_name TEXT
target_locator TEXT NOT NULL
status TEXT NOT NULL DEFAULT 'pending'
attempts INTEGER NOT NULL DEFAULT 0
last_error TEXT
queued_at TEXT DEFAULT CURRENT_TIMESTAMP
started_at TEXT
finished_at TEXT
updated_at TEXT DEFAULT CURRENT_TIMESTAMP
UNIQUE(file_asset_id, target_backend_id, target_locator)

Task granularity:

  • one task = one local file asset version uploaded to one target backend/key

This keeps the queue aligned with the “upload all active file versions” requirement.
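The dedupe behavior implied by the UNIQUE constraint can be sketched as follows. This is a minimal illustration, not the project's repository code: the `CREATE TABLE` wrapper and the `enqueue_upload` helper are hypothetical, and only the column list comes from the recommended schema. Requeueing of previously failed tasks is deliberately left out.

```python
import sqlite3

# Hypothetical DDL assembled from the recommended upload_tasks columns.
UPLOAD_TASKS_DDL = """
CREATE TABLE IF NOT EXISTS upload_tasks (
    id INTEGER PRIMARY KEY AUTOINCREMENT,
    file_asset_id INTEGER NOT NULL,
    source_location_id INTEGER NOT NULL,
    target_backend_id INTEGER NOT NULL,
    target_container_name TEXT,
    target_locator TEXT NOT NULL,
    status TEXT NOT NULL DEFAULT 'pending',
    attempts INTEGER NOT NULL DEFAULT 0,
    last_error TEXT,
    queued_at TEXT DEFAULT CURRENT_TIMESTAMP,
    started_at TEXT,
    finished_at TEXT,
    updated_at TEXT DEFAULT CURRENT_TIMESTAMP,
    UNIQUE(file_asset_id, target_backend_id, target_locator)
)
"""


def enqueue_upload(conn, file_asset_id, source_location_id, backend_id, bucket, key):
    """Insert a pending task unless an identical one already exists.

    Returns True when a new row was created, False when the UNIQUE
    constraint matched an existing task and the insert was ignored.
    """
    cur = conn.execute(
        """
        INSERT OR IGNORE INTO upload_tasks
            (file_asset_id, source_location_id, target_backend_id,
             target_container_name, target_locator)
        VALUES (?, ?, ?, ?, ?)
        """,
        (file_asset_id, source_location_id, backend_id, bucket, key),
    )
    return cur.rowcount == 1
```

Calling `enqueue_upload` twice with the same asset, backend, and key leaves a single row in the queue, which is the property the schema tests are meant to check.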

Object Storage Key Rules

Key Shape

The object key should mirror the local relative path beneath the configured backend prefix.

If:

  • local relative path is qq/Singer A/song-c.mp3
  • backend base_prefix is music

Then:

  • remote key becomes music/qq/Singer A/song-c.mp3
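
The key rule above can be expressed as a small helper. This is a sketch under assumptions: `build_object_key` is a hypothetical name, and normalizing Windows separators and rejecting `..` segments are choices made here for illustration rather than requirements stated in this design.

```python
def build_object_key(base_prefix, relative_path):
    """Join the backend prefix and a local relative path into an object key."""
    # Normalise Windows separators so keys always use forward slashes.
    parts = [p for p in relative_path.replace("\\", "/").split("/") if p]
    if ".." in parts:
        # Assumption: a relative path must never climb out of the download root.
        raise ValueError(f"relative path escapes the root: {relative_path!r}")
    # An empty or missing prefix simply yields the relative path itself.
    prefix_parts = [p for p in (base_prefix or "").split("/") if p]
    return "/".join(prefix_parts + parts)
```

With `base_prefix="music"` and the relative path `qq/Singer A/song-c.mp3`, the helper returns `music/qq/Singer A/song-c.mp3`, matching the example above.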

Why Mirror The Relative Path

  • easiest to reconnect local and remote locations
  • preserves the existing local organization
  • keeps future CDN and migration mapping simple
  • reuses the semantics already established in file_locations.locator

Credential Model

Database Versus Secrets

Store only non-secret backend config in SQLite.

Resolve secrets from environment variables using the backend's configured prefix.

Example:

  • backend name: main-s3
  • credential_env_prefix = CATALOGSYNC_MAIN_S3

Runtime lookup:

  • CATALOGSYNC_MAIN_S3_ACCESS_KEY_ID
  • CATALOGSYNC_MAIN_S3_SECRET_ACCESS_KEY
  • CATALOGSYNC_MAIN_S3_SESSION_TOKEN (optional)
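
The runtime lookup could look roughly like this. The `S3Credentials` dataclass and `resolve_credentials` function are hypothetical names used only for illustration; the variable suffixes are the ones listed above. The `env` parameter is an assumption added so the lookup can be tested without touching the real environment.

```python
import os
from dataclasses import dataclass
from typing import Mapping, Optional


@dataclass(frozen=True)
class S3Credentials:
    access_key_id: str
    secret_access_key: str
    session_token: Optional[str] = None


def resolve_credentials(prefix: str, env: Mapping[str, str] = os.environ) -> S3Credentials:
    """Read credentials for one backend from <PREFIX>_* environment variables."""
    missing = [
        f"{prefix}_{suffix}"
        for suffix in ("ACCESS_KEY_ID", "SECRET_ACCESS_KEY")
        if not env.get(f"{prefix}_{suffix}")
    ]
    if missing:
        # Fail fast, naming the variables but never echoing secret values.
        raise RuntimeError("missing credential variables: " + ", ".join(missing))
    return S3Credentials(
        access_key_id=env[f"{prefix}_ACCESS_KEY_ID"],
        secret_access_key=env[f"{prefix}_SECRET_ACCESS_KEY"],
        # The session token is optional; an empty value is treated as absent.
        session_token=env.get(f"{prefix}_SESSION_TOKEN") or None,
    )
```

Raising before any task is claimed fits the "missing environment credentials: fail fast before batch execution" rule in the error handling section.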

Why This Model

  • portable for long-running batch jobs
  • safer than storing keys in SQLite
  • works well across multiple machines and deployment targets

CLI Design

register-object-backend

Purpose:

  • create or update one object storage backend definition

Example:

musicdl-catalogsync register-object-backend \
  --db D:\catalogsync\catalogsync.db \
  --backend main-s3 \
  --endpoint https://s3.example.com \
  --bucket music \
  --base-prefix music \
  --region auto \
  --addressing-style auto \
  --public-base-url https://cdn.example.com/music \
  --credential-env-prefix CATALOGSYNC_MAIN_S3

Behavior:

  • upsert backend by name
  • set backend_type='object_storage'
  • validate required non-secret config before writing
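
A possible shape for the upsert is sketched below. Several things here are assumptions rather than part of this design: the helper name `register_object_backend`, the choice of `endpoint` and `credential_env_prefix` as the required keys, and a UNIQUE constraint on `storage_backends.name` (needed for `ON CONFLICT(name)`; the upsert syntax itself requires SQLite 3.24 or newer).

```python
import json
import sqlite3

# Assumption: which config keys count as "required" is not fixed by the design.
REQUIRED_CONFIG_KEYS = ("endpoint", "credential_env_prefix")


def register_object_backend(conn, name, bucket, config):
    """Create or update one object storage backend, keyed by its name."""
    missing = [k for k in REQUIRED_CONFIG_KEYS if not config.get(k)]
    if not bucket:
        missing.append("bucket")
    if missing:
        # Validate before writing so a bad call never leaves a partial row.
        raise ValueError("missing backend config: " + ", ".join(missing))
    conn.execute(
        """
        INSERT INTO storage_backends (name, backend_type, container_name, config_json)
        VALUES (?, 'object_storage', ?, ?)
        ON CONFLICT(name) DO UPDATE SET
            backend_type = excluded.backend_type,
            container_name = excluded.container_name,
            config_json = excluded.config_json
        """,
        (name, bucket, json.dumps(config, sort_keys=True)),
    )
```

Registering `main-s3` twice with different endpoints leaves one row whose `config_json` reflects the second call.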

upload

Purpose:

  • default: upload all local active file versions that are missing on the target backend
  • optionally filter by source platform, playlist range, and count

Example:

musicdl-catalogsync upload --db D:\catalogsync\catalogsync.db --backend main-s3
musicdl-catalogsync upload --db D:\catalogsync\catalogsync.db --backend main-s3 --sources netease,qq --limit 200
musicdl-catalogsync upload --db D:\catalogsync\catalogsync.db --backend main-s3 --playlist-ids 12,15 --workers 4

Default semantics:

  • trust database state
  • do not do remote HEAD by default
  • enqueue missing uploads
  • consume queue with limited workers

Download CLI Extension

Extend the existing download and run workflows with:

  • --workers

First-version default:

  • download --workers 3
  • upload --workers 4

These defaults should remain conservative and configurable.

Upload Execution Flow

Phase 1: Candidate Selection

For the target backend:

  • find all active local file_locations
  • resolve their file_asset
  • derive target object key from:
    • backend base_prefix
    • local relative path
  • skip assets that already have an active remote location on the same backend/key

Selection must support:

  • all local songs
  • --sources
  • --playlist-ids
  • --limit

Phase 2: Task Enqueue

For each missing remote file:

  • insert or reuse a unique upload_tasks row
  • set status to pending unless it is already uploading or succeeded

Phase 3: Worker Claim

Each worker should:

  • claim one pending task in a transaction
  • move it to uploading
  • set started_at

This must prevent duplicate worker claims.
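
One way to make the claim atomic in SQLite is sketched here. It assumes each worker uses its own connection opened with `isolation_level=None` so that transactions are controlled explicitly; `claim_next_task` is a hypothetical helper, not an existing function in the codebase.

```python
import sqlite3


def claim_next_task(conn):
    """Atomically move one pending task to 'uploading' and return its id.

    Returns None when no pending task is left.
    """
    # BEGIN IMMEDIATE takes the write lock up front, so two workers cannot
    # both read the same pending row before one of them updates it.
    conn.execute("BEGIN IMMEDIATE")
    try:
        row = conn.execute(
            "SELECT id FROM upload_tasks WHERE status = 'pending' ORDER BY id LIMIT 1"
        ).fetchone()
        if row is None:
            conn.execute("COMMIT")
            return None
        conn.execute(
            """
            UPDATE upload_tasks
               SET status = 'uploading',
                   started_at = CURRENT_TIMESTAMP,
                   updated_at = CURRENT_TIMESTAMP
             WHERE id = ? AND status = 'pending'
            """,
            (row[0],),
        )
        conn.execute("COMMIT")
        return row[0]
    except Exception:
        conn.execute("ROLLBACK")
        raise
```

Because the select and the update run inside one write transaction, a second worker calling the same function either waits for the lock or sees the row already in `uploading` and moves on to the next pending task.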

Phase 4: Upload

For each claimed task:

  • resolve source local file from source_location_id
  • validate that the file still exists
  • resolve backend config
  • resolve credentials from environment
  • upload to S3-compatible storage
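
The transfer step can be kept independent of any particular SDK by depending on a narrow client interface, which also makes the fake-client tests described later straightforward. The `ObjectClient` protocol and `transfer_file` helper below are hypothetical; this design does not name a client library, though an object exposing `upload_file(filename, bucket, key)` would fit this shape.

```python
import os
from typing import Optional, Protocol, Tuple


class ObjectClient(Protocol):
    """Minimal interface the uploader needs from an S3-compatible client."""

    def upload_file(self, filename: str, bucket: str, key: str) -> None: ...


def transfer_file(client: ObjectClient, local_path: str, bucket: str, key: str) -> Tuple[bool, Optional[str]]:
    """Upload one local file; return (ok, last_error) for the task row."""
    if not os.path.isfile(local_path):
        # The source may have been moved or deleted since the task was queued.
        return False, f"source file missing: {local_path}"
    try:
        client.upload_file(local_path, bucket, key)
    except Exception as exc:  # transport or provider error
        return False, f"upload error: {exc}"
    return True, None
```

The helper only reports whether the transfer itself worked. Marking the task `succeeded` is left to the writeback phase, since a task should not count as done until the remote location is recorded.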

Phase 5: Writeback

On success:

  • write or upsert the remote file_location
  • set remote status='active'
  • keep remote is_primary=0
  • refresh song_backend_presence
  • mark task succeeded
  • set finished_at

Upload Task State Machine

Use these first-version task states:

  • pending
  • uploading
  • succeeded
  • failed
  • skipped

State transitions:

  • enqueue → pending
  • worker claim → uploading
  • success with DB writeback → succeeded
  • upload error or writeback error → failed
  • no-op due to already-active remote location → skipped

Retry model:

  • store attempts
  • store last_error
  • later upload runs may requeue or retry failed tasks under a bounded retry rule
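
The states and transitions above can be encoded as a small lookup table. This is an illustrative sketch: the names are hypothetical, `MAX_ATTEMPTS = 3` is a placeholder because the design does not fix a retry bound, and allowing `skipped` from both `pending` and `uploading` is an assumption about where the no-op check runs.

```python
from enum import Enum


class TaskStatus(str, Enum):
    PENDING = "pending"
    UPLOADING = "uploading"
    SUCCEEDED = "succeeded"
    FAILED = "failed"
    SKIPPED = "skipped"


# Allowed moves between states; terminal states map to an empty set.
ALLOWED_TRANSITIONS = {
    TaskStatus.PENDING: {TaskStatus.UPLOADING, TaskStatus.SKIPPED},
    TaskStatus.UPLOADING: {TaskStatus.SUCCEEDED, TaskStatus.FAILED, TaskStatus.SKIPPED},
    TaskStatus.FAILED: {TaskStatus.PENDING},  # requeue on a later run
    TaskStatus.SUCCEEDED: set(),
    TaskStatus.SKIPPED: set(),
}

MAX_ATTEMPTS = 3  # placeholder bound, not specified by the design


def can_transition(current: TaskStatus, new: TaskStatus) -> bool:
    """Return True when the state machine permits moving from current to new."""
    return new in ALLOWED_TRANSITIONS[current]


def should_retry(status: TaskStatus, attempts: int, max_attempts: int = MAX_ATTEMPTS) -> bool:
    """A failed task is requeued only while it is under the retry bound."""
    return status is TaskStatus.FAILED and attempts < max_attempts
```

Because `TaskStatus` subclasses `str`, its values compare equal to the plain strings stored in the `status` column.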

Backend Presence Refresh Rules

Whenever a remote location changes on (song_id, backend_id):

  • count active locations for that song/backend
  • update has_active_file
  • update active_file_count
  • set primary_file_location_id to a preferred active location on that backend

First version preference rule:

  • if any active location exists on that backend, pick one deterministic row, for example the smallest active file_locations.id

This table exists for fast lookup and operator queries, not for deciding the actual upload truth.
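
A refresh along these lines would satisfy the rules above. The sketch assumes `file_locations` carries a `file_asset_id` column and `file_assets` carries a `song_id` column; neither column name is confirmed by this document, and `refresh_presence` is a hypothetical helper.

```python
import sqlite3


def refresh_presence(conn, song_id, backend_id):
    """Recompute the summary row for one (song_id, backend_id) pair."""
    # MIN(fl.id) implements the "smallest active file_locations.id" preference
    # and yields NULL when no active location remains.
    count, first_id = conn.execute(
        """
        SELECT COUNT(*), MIN(fl.id)
          FROM file_locations AS fl
          JOIN file_assets AS fa ON fa.id = fl.file_asset_id
         WHERE fa.song_id = ? AND fl.backend_id = ? AND fl.status = 'active'
        """,
        (song_id, backend_id),
    ).fetchone()
    conn.execute(
        """
        INSERT INTO song_backend_presence
            (song_id, backend_id, has_active_file, active_file_count,
             primary_file_location_id, updated_at)
        VALUES (?, ?, ?, ?, ?, CURRENT_TIMESTAMP)
        ON CONFLICT(song_id, backend_id) DO UPDATE SET
            has_active_file = excluded.has_active_file,
            active_file_count = excluded.active_file_count,
            primary_file_location_id = excluded.primary_file_location_id,
            updated_at = excluded.updated_at
        """,
        (song_id, backend_id, 1 if count else 0, count, first_id),
    )
```

Recomputing from `file_locations` on every change, instead of incrementing counters, keeps the summary from drifting away from the source of truth.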

Limited Concurrency Design

Download Concurrency

Add limited worker-based download concurrency.

Key rule:

  • disk-space exhaustion must trigger one global pause, not one prompt per worker

Behavior:

  1. workers process queued download items
  2. if a worker detects insufficient space under the current active root:
    • raise a shared pause request
    • stop dispatching new tasks
  3. prompt the operator once for a new download directory
  4. switch the shared active root
  5. resume remaining not-yet-started tasks under the new root
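
The "one global pause" rule can be sketched with a shared object that every worker consults before starting a task. Everything here is hypothetical: the `SharedDownloadRoot` class, the injectable `prompt` and `free_bytes` callables, and the choice to check space while holding a lock. The sketch also does not re-check whether the replacement directory itself has enough room.

```python
import shutil
import threading


class SharedDownloadRoot:
    """Holds the active download root and serialises directory switches."""

    def __init__(self, root, prompt, free_bytes=None):
        self._root = root
        self._prompt = prompt  # callable(current_root) -> new directory or None
        # Default to real disk usage; tests can inject a fake measurement.
        self._free_bytes = free_bytes or (lambda path: shutil.disk_usage(path).free)
        self._lock = threading.Lock()

    def root_for(self, needed_bytes):
        """Return the root a new task should use, prompting once on exhaustion."""
        with self._lock:
            # Checked under the lock: if another worker already switched the
            # root, this worker sees the new root and does not prompt again.
            if self._free_bytes(self._root) >= needed_bytes:
                return self._root
            new_root = self._prompt(self._root)
            if not new_root:
                raise RuntimeError("no replacement download directory supplied")
            self._root = new_root
            return self._root
```

While the prompt is open the lock is held, so other workers block in `root_for` instead of starting new tasks, which approximates "stop dispatching new tasks" without a separate pause flag.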

Non-goals:

  • per-worker independent root switching
  • automatic multi-root balancing in the first version

Upload Concurrency

Upload workers should process queue rows concurrently but conservatively.

Requirements:

  • claim tasks transactionally
  • prevent duplicate uploads of the same (file_asset_id, target_backend_id, target_locator)
  • keep worker count operator-controlled

Error Handling

Upload Errors

  • missing source file
    • mark task failed
    • set descriptive last_error
  • missing backend config
    • fail fast before batch execution
  • missing environment credentials
    • fail fast before batch execution
  • upload transport error
    • mark task failed
  • upload succeeded but DB writeback failed
    • mark task failed
    • store explicit last_error explaining that remote upload may already exist

Download Errors

  • worker download failure
    • record failure for that item and continue with other tasks
  • insufficient disk space
    • trigger one global directory-switch prompt
  • no replacement directory supplied
    • fail the remaining batch clearly

Testing

Add or update coverage for the following areas.

Schema Tests

  • song_backend_presence exists
  • upload_tasks exists
  • unique constraint on upload task dedupe works

Repository Tests

  • register or upsert object storage backends
  • write remote file_locations
  • refresh song_backend_presence
  • enqueue deduplicated upload tasks
  • select pending upload candidates by backend, source, playlist, and limit

Uploader / Service Tests

Using a fake or stub S3-compatible client:

  • successful upload creates active remote location
  • public URL derivation when configured
  • missing source file marks the task failed
  • missing credentials fail fast
  • multiple local file versions for one song are all enqueued

CLI Tests

  • register-object-backend
  • upload --backend ...
  • upload --sources ...
  • upload --playlist-ids ...
  • upload --limit ...
  • upload --workers ...
  • download --workers ...

Concurrency Tests

  • concurrent upload workers do not claim the same task twice
  • concurrent download workers trigger only one directory switch prompt
  • after directory switch, later downloads use the new root

Documentation Tests

  • docs/catalogsync.md is updated to describe:
    • object storage backend registration
    • upload command usage
    • queue semantics
    • environment variable credential model
    • download/upload worker options

Documentation Requirements

Implementation must update docs/catalogsync.md to include:

  • why object storage uses backend config plus env-based secrets
  • how to register an object storage backend
  • how remote keys mirror local relative paths
  • how upload works by default
  • what song_backend_presence and upload_tasks are for
  • how --workers affects download and upload
  • how the global download directory switch behaves under low disk space

Acceptance Criteria

  • An operator can register an S3-compatible object storage backend without storing secrets in SQLite
  • upload can enqueue and execute uploads for missing remote files on that backend
  • Remote object keys mirror local relative paths beneath the configured backend prefix
  • Successful uploads create active remote file_locations
  • Local files remain active and primary after upload
  • song_backend_presence shows whether a song has active files on a given backend
  • upload_tasks supports resumable queue execution with bounded retries
  • The first version uploads all active local file versions for a song, not just one version
  • upload supports both full backend fill-in mode and filtered mode
  • Download and upload both support limited operator-configurable concurrency
  • Low disk space during download triggers one global prompt and one shared root switch for later tasks
  • docs/catalogsync.md is updated together with the implementation