# Object Storage Upload Automation Design

## Goal

Extend `musicdl.catalogsync` with a first-class object storage upload workflow that:

- uploads downloaded local files to an S3-compatible object storage backend
- preserves local files after upload
- mirrors the local relative path into the remote object key
- records remote locations in the catalog database
- tracks backend presence per song for fast lookup
- supports queue-based upload execution and limited concurrency
- updates `docs/catalogsync.md` alongside the implementation so operator docs stay current

This sub-project also introduces limited concurrent download so very large catalogs do not have to run fully serially.

## Scope

### In Scope

- Add a queue-based upload workflow for object storage backends
- Reuse `storage_backends`, `file_assets`, and `file_locations` as the primary storage model
- Add a song/backend presence summary table
- Add an upload task queue table
- Add CLI commands to register an object storage backend and upload files to it
- Support S3-compatible object storage as the first upload backend type
- Store non-secret backend configuration in the database
- Read secrets from environment variables at runtime
- Mirror local relative paths into remote object keys
- Keep local files after successful upload
- Mark remote object locations as non-primary while local files remain primary
- Support queue-based concurrent upload workers
- Add limited concurrent download workers
- When download space is exhausted, pause the whole download flow once, prompt for a new directory once, then continue later tasks under the new root
- Update `docs/catalogsync.md` to document the upload workflow, object storage backend configuration, and the new commands

### Out of Scope

- 123 cloud implementation
- Baidu Netdisk implementation
- Remote `HEAD` verification before every upload
- Automatic deletion of local files after upload
- Multi-backend upload in a single command
- GUI integration
- CDN upload orchestration beyond deriving an optional public URL
- Background daemon / scheduler service

## Constraints

- Keep the current `musicdl.catalogsync` data model as the source of truth
- Do not duplicate file location truth into `songs`
- Do not store secret access credentials in SQLite
- First upload backend must be generic S3-compatible object storage
- Default behavior must trust database state rather than querying remote object existence every time
- Upload behavior must preserve existing local download behavior
- Download and upload concurrency must remain limited and operator-controllable

## Recommended Architecture

Use the existing storage model as the base:

- `storage_backends`
  - backend definition
- `file_assets`
  - file-version identity
- `file_locations`
  - concrete physical or remote locations

Add two new layers:

- `song_backend_presence`
  - fast summary of whether a song has active files on a given backend
- `upload_tasks`
  - queue of upload work items per file asset and target backend/key

Implement one new uploader component:

- `S3CompatibleUploader`
  - resolves credentials from environment
  - uploads a local file to a configured backend
  - writes the resulting remote file location
  - refreshes backend presence

Keep the user-facing CLI small:

- `register-object-backend`
- `upload`

Internally, `upload` should still be queue-driven:

1. enumerate missing remote uploads
2. enqueue deduplicated tasks
3. consume tasks with limited workers

## Data Model

### Existing Table Reuse

#### `storage_backends`

Object storage backends should reuse the current table with the following conventions:

- `backend_type = 'object_storage'`
- `name`
  - stable operator-facing backend name, for example `main-s3`
- `container_name`
  - object storage bucket name
- `base_path`
  - unused for object storage, may remain `NULL`
- `config_json`
  - non-secret configuration only

Recommended `config_json` keys:

- `endpoint`
- `region`
- `base_prefix`
- `addressing_style`
- `public_base_url`
- `credential_env_prefix`

Secrets must not be stored here.

#### `file_assets`

No semantic changes are required.

The upload unit stays aligned with the current model:

- one `file_asset` represents one concrete file version for a song
- if a song has multiple active local file versions, all of them are eligible for upload

#### `file_locations`

No structural redesign is required.

For object storage locations:

- `backend_id`
  - target object storage backend
- `container_name`
  - bucket
- `locator`
  - object key
- `absolute_path`
  - `NULL`
- `remote_file_id`
  - optional, reserved for future provider-specific remote IDs
- `public_url`
  - derived if backend config provides `public_base_url`
- `download_url`
  - optional, first version may keep this `NULL`
- `status`
  - `active`, `deleted`, or `failed`
- `is_primary`
  - `0` for remote object storage in the first version

The local location remains:

- `status = 'active'`
- `is_primary = 1`

### New Table: `song_backend_presence`

Purpose:

- answer “does this song have active files on backend X?” quickly
- avoid pushing hard-coded backend presence fields into `songs`

Recommended schema:

```text
song_id INTEGER NOT NULL
backend_id INTEGER NOT NULL
has_active_file INTEGER NOT NULL DEFAULT 0
active_file_count INTEGER NOT NULL DEFAULT 0
primary_file_location_id INTEGER
updated_at TEXT DEFAULT CURRENT_TIMESTAMP
PRIMARY KEY(song_id, backend_id)
```

Rules:

- this is a derived summary table, not the source of truth
- truth still comes from `file_locations`
- refresh this row whenever a location on that song/backend becomes active or inactive

### New Table: `upload_tasks`

Purpose:

- queue upload work
- support retries, concurrency, and resumable batch execution

Recommended schema:

```text
id INTEGER PRIMARY KEY AUTOINCREMENT
file_asset_id INTEGER NOT NULL
source_location_id INTEGER NOT NULL
target_backend_id INTEGER NOT NULL
target_container_name TEXT
target_locator TEXT NOT NULL
status TEXT NOT NULL DEFAULT 'pending'
attempts INTEGER NOT NULL DEFAULT 0
last_error TEXT
queued_at TEXT DEFAULT CURRENT_TIMESTAMP
started_at TEXT
finished_at TEXT
updated_at TEXT DEFAULT CURRENT_TIMESTAMP
UNIQUE(file_asset_id, target_backend_id, target_locator)
```

Task granularity:

- one task = one local file asset version uploaded to one target backend/key

This keeps the queue aligned with the “upload all active file versions” requirement.

## Object Storage Key Rules

### Key Shape

The object key should mirror the local relative path beneath the configured backend prefix.

If:

- local relative path is `qq/Singer A/song-c.mp3`
- backend `base_prefix` is `music`

Then:

- remote key becomes `music/qq/Singer A/song-c.mp3`

### Why Mirror The Relative Path

- easiest to reconnect local and remote locations
- preserves the existing local organization
- keeps future CDN and migration mapping simple
- reuses the semantics already established in `file_locations.locator`

## Credential Model

### Database Versus Secrets

Store only non-secret backend config in SQLite.

Resolve secrets from environment variables using the backend’s configured prefix.
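
A minimal sketch of this lookup in Python follows. The helper name `resolve_credentials` and the `S3Credentials` container are hypothetical (the design does not name them), and raising `RuntimeError` is only one possible way to fail fast; the variable suffixes follow the runtime lookup convention described in this section.

```python
import os
from dataclasses import dataclass
from typing import Mapping, Optional


@dataclass(frozen=True)
class S3Credentials:
    # Hypothetical container; field names are illustrative only.
    access_key_id: str
    secret_access_key: str
    session_token: Optional[str] = None


def resolve_credentials(prefix: str, env: Mapping[str, str] = os.environ) -> S3Credentials:
    """Read credentials for one backend from <prefix>_* environment variables."""
    found = {}
    missing = []
    for suffix in ("ACCESS_KEY_ID", "SECRET_ACCESS_KEY"):
        name = f"{prefix}_{suffix}"
        value = env.get(name)
        if value:
            found[suffix] = value
        else:
            missing.append(name)
    if missing:
        # Fail before any upload starts rather than midway through a batch.
        raise RuntimeError("missing environment variables: " + ", ".join(missing))
    return S3Credentials(
        access_key_id=found["ACCESS_KEY_ID"],
        secret_access_key=found["SECRET_ACCESS_KEY"],
        # The session token is optional; an empty value is treated as absent.
        session_token=env.get(f"{prefix}_SESSION_TOKEN") or None,
    )
```

Passing the environment as a mapping keeps the helper testable without mutating the real process environment.
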

Example:

- backend name: `main-s3`
- `credential_env_prefix = CATALOGSYNC_MAIN_S3`

Runtime lookup:

- `CATALOGSYNC_MAIN_S3_ACCESS_KEY_ID`
- `CATALOGSYNC_MAIN_S3_SECRET_ACCESS_KEY`
- `CATALOGSYNC_MAIN_S3_SESSION_TOKEN` (optional)

### Why This Model

- portable for long-running batch jobs
- safer than storing keys in SQLite
- works well across multiple machines and deployment targets

## CLI Design

### `register-object-backend`

Purpose:

- create or update one object storage backend definition

Example:

```bash
musicdl-catalogsync register-object-backend \
  --db 'D:\catalogsync\catalogsync.db' \
  --backend main-s3 \
  --endpoint https://s3.example.com \
  --bucket music \
  --base-prefix music \
  --region auto \
  --addressing-style auto \
  --public-base-url https://cdn.example.com/music \
  --credential-env-prefix CATALOGSYNC_MAIN_S3
```

Behavior:

- upsert backend by `name`
- set `backend_type='object_storage'`
- validate required non-secret config before writing

### `upload`

Purpose:

- default: upload all local active file versions that are missing on the target backend
- optionally filter by source platform, playlist range, and count

Example:

```bash
musicdl-catalogsync upload --db 'D:\catalogsync\catalogsync.db' --backend main-s3
musicdl-catalogsync upload --db 'D:\catalogsync\catalogsync.db' --backend main-s3 --sources netease,qq --limit 200
musicdl-catalogsync upload --db 'D:\catalogsync\catalogsync.db' --backend main-s3 --playlist-ids 12,15 --workers 4
```

Default semantics:

- trust database state
- do not do remote `HEAD` by default
- enqueue missing uploads
- consume queue with limited workers

### Download CLI Extension

Extend the existing `download` and `run` workflows with:

- `--workers`

First-version defaults:

- `download --workers 3`
- `upload --workers 4`

These defaults should remain conservative and configurable.
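
As one illustration of how the `upload` options in this section might be parsed, here is a minimal sketch assuming Python's standard `argparse`. The function name `build_upload_parser` and the converter helpers are hypothetical, the rule that `--limit` and `--workers` must be at least 1 is an assumption, and the real CLI may be wired differently. The `--workers` default of 4 follows the first-version default listed above.

```python
import argparse
from typing import List


def _csv(value: str) -> List[str]:
    # Split "netease,qq" into ["netease", "qq"], ignoring empty items.
    return [item.strip() for item in value.split(",") if item.strip()]


def _csv_ints(value: str) -> List[int]:
    # Split "12,15" into [12, 15]; a non-numeric item makes argparse report an error.
    return [int(item) for item in _csv(value)]


def _positive_int(value: str) -> int:
    # Assumption: limits and worker counts below 1 are rejected.
    number = int(value)
    if number < 1:
        raise argparse.ArgumentTypeError("must be at least 1")
    return number


def build_upload_parser() -> argparse.ArgumentParser:
    parser = argparse.ArgumentParser(prog="musicdl-catalogsync upload")
    parser.add_argument("--db", required=True)
    parser.add_argument("--backend", required=True)
    parser.add_argument("--sources", type=_csv, default=None)
    parser.add_argument("--playlist-ids", type=_csv_ints, default=None)
    parser.add_argument("--limit", type=_positive_int, default=None)
    parser.add_argument("--workers", type=_positive_int, default=4)
    return parser
```

Leaving the filters at `None` when they are not supplied lets the selection code distinguish "no filter" from "empty filter".
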

## Upload Execution Flow

### Phase 1: Candidate Selection

For the target backend:

- find all active local `file_locations`
- resolve their `file_asset`
- derive target object key from:
  - backend `base_prefix`
  - local relative path
- skip assets that already have an active remote location on the same backend/key

Selection must support:

- all local songs
- `--sources`
- `--playlist-ids`
- `--limit`

### Phase 2: Task Enqueue

For each missing remote file:

- insert or reuse a unique `upload_tasks` row
- set status to `pending` unless it is already `uploading` or `succeeded`

### Phase 3: Worker Claim

Each worker should:

- claim one `pending` task in a transaction
- move it to `uploading`
- set `started_at`

This must prevent duplicate worker claims.

### Phase 4: Upload

For each claimed task:

- resolve source local file from `source_location_id`
- validate that the file still exists
- resolve backend config
- resolve credentials from environment
- upload to S3-compatible storage

### Phase 5: Writeback

On success:

- write or upsert the remote `file_location`
- set remote `status='active'`
- keep remote `is_primary=0`
- refresh `song_backend_presence`
- mark task `succeeded`
- set `finished_at`

## Upload Task State Machine

Use these first-version task states:

- `pending`
- `uploading`
- `succeeded`
- `failed`
- `skipped`

State transitions:

- enqueue → `pending`
- worker claim → `uploading`
- success with DB writeback → `succeeded`
- upload error or writeback error → `failed`
- no-op due to already-active remote location → `skipped`

Retry model:

- store `attempts`
- store `last_error`
- later `upload` runs may requeue or retry `failed` tasks under a bounded retry rule

## Backend Presence Refresh Rules

Whenever a remote location changes on `(song_id, backend_id)`:

- count active locations for that song/backend
- update `has_active_file`
- update `active_file_count`
- set `primary_file_location_id` to a preferred active location on that backend

First version preference rule:

- if any active location exists on that backend, pick one deterministic row, for example the smallest active `file_locations.id`

This table exists for fast lookup and operator queries, not for deciding the actual upload truth.

## Limited Concurrency Design

### Download Concurrency

Add limited worker-based download concurrency.

Key rule:

- disk-space exhaustion must trigger one global pause, not one prompt per worker

Behavior:

1. workers process queued download items
2. if a worker detects insufficient space under the current active root:
   - raise a shared pause request
   - stop dispatching new tasks
3. prompt the operator once for a new download directory
4. switch the shared active root
5. resume remaining not-yet-started tasks under the new root

Non-goals:

- per-worker independent root switching
- automatic multi-root balancing in the first version

### Upload Concurrency

Upload workers should process queue rows concurrently but conservatively.

Requirements:

- claim tasks transactionally
- prevent duplicate uploads of the same `(file_asset_id, target_backend_id, target_locator)`
- keep worker count operator-controlled

## Error Handling

### Upload Errors

- missing source file
  - mark task `failed`
  - set descriptive `last_error`
- missing backend config
  - fail fast before batch execution
- missing environment credentials
  - fail fast before batch execution
- upload transport error
  - mark task `failed`
- upload succeeded but DB writeback failed
  - mark task `failed`
  - store explicit `last_error` explaining that remote upload may already exist

### Download Errors

- worker download failure
  - record failure for that item and continue with other tasks
- insufficient disk space
  - trigger one global directory-switch prompt
- no replacement directory supplied
  - fail the remaining batch clearly

## Testing

Add or update coverage for the following areas.

### Schema Tests

- `song_backend_presence` exists
- `upload_tasks` exists
- unique constraint on upload task dedupe works

### Repository Tests

- register or upsert object storage backends
- write remote `file_locations`
- refresh `song_backend_presence`
- enqueue deduplicated upload tasks
- select pending upload candidates by backend, source, playlist, and limit

### Uploader / Service Tests

Using a fake or stub S3-compatible client:

- successful upload creates active remote location
- public URL derivation when configured
- missing source file becomes `failed`
- missing credentials fail fast
- multiple local file versions for one song are all enqueued

### CLI Tests

- `register-object-backend`
- `upload --backend ...`
- `upload --sources ...`
- `upload --playlist-ids ...`
- `upload --limit ...`
- `upload --workers ...`
- `download --workers ...`

### Concurrency Tests

- concurrent upload workers do not claim the same task twice
- concurrent download workers trigger only one directory switch prompt
- after directory switch, later downloads use the new root

### Documentation Tests

- `docs/catalogsync.md` is updated to describe:
  - object storage backend registration
  - upload command usage
  - queue semantics
  - environment variable credential model
  - download/upload worker options

## Documentation Requirements

Implementation must update `docs/catalogsync.md` to include:

- why object storage uses backend config plus env-based secrets
- how to register an object storage backend
- how remote keys mirror local relative paths
- how `upload` works by default
- what `song_backend_presence` and `upload_tasks` are for
- how `--workers` affects download and upload
- how the global download directory switch behaves under low disk space

## Acceptance Criteria

- An operator can register an S3-compatible object storage backend without storing secrets in SQLite
- `upload` can enqueue and execute uploads for missing remote files on that backend
- Remote object keys mirror local relative paths beneath the configured backend prefix
- Successful uploads create active remote `file_locations`
- Local files remain active and primary after upload
- `song_backend_presence` shows whether a song has active files on a given backend
- `upload_tasks` supports resumable queue execution with bounded retries
- The first version uploads all active local file versions for a song, not just one version
- `upload` supports both full backend fill-in mode and filtered mode
- Download and upload both support limited operator-configurable concurrency
- Low disk space during download triggers one global prompt and one shared root switch for later tasks
- `docs/catalogsync.md` is updated together with the implementation
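
To make the queue-related criteria more concrete, the transactional claim from Phase 3 could be sketched as follows. This is an illustration only, assuming Python's standard `sqlite3` module and a connection opened with `isolation_level=None` so that transactions are controlled explicitly; the function name `claim_next_task` is hypothetical, and incrementing `attempts` at claim time is an assumption the design does not specify.

```python
import sqlite3
from typing import Optional


def claim_next_task(conn: sqlite3.Connection) -> Optional[int]:
    """Move one pending upload task to 'uploading' and return its id, or None."""
    # BEGIN IMMEDIATE takes SQLite's write lock up front, so two workers cannot
    # both read the same pending row before either of them updates it.
    conn.execute("BEGIN IMMEDIATE")
    try:
        row = conn.execute(
            "SELECT id FROM upload_tasks WHERE status = 'pending' "
            "ORDER BY id LIMIT 1"
        ).fetchone()
        if row is None:
            conn.execute("COMMIT")
            return None
        conn.execute(
            "UPDATE upload_tasks "
            # Assumption: attempts is counted when a task is claimed.
            "SET status = 'uploading', attempts = attempts + 1, "
            "    started_at = CURRENT_TIMESTAMP, updated_at = CURRENT_TIMESTAMP "
            "WHERE id = ? AND status = 'pending'",
            (row[0],),
        )
        conn.execute("COMMIT")
        return row[0]
    except Exception:
        conn.execute("ROLLBACK")
        raise
```

Each worker would call this in a loop on its own connection until it returns `None`; a task left in `uploading` by a crashed worker would still need a separate recovery rule, which this sketch does not cover.
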