Object Storage Upload Automation Design
Goal
Extend musicdl.catalogsync with a first-class object storage upload workflow that:
- uploads downloaded local files to an S3-compatible object storage backend
- preserves local files after upload
- mirrors the local relative path into the remote object key
- records remote locations in the catalog database
- tracks backend presence per song for fast lookup
- supports queue-based upload execution and limited concurrency
- updates docs/catalogsync.md alongside the implementation so operator docs stay current
This sub-project also introduces limited download concurrency so that very large catalogs do not have to be processed fully serially.
Scope
In Scope
- Add a queue-based upload workflow for object storage backends
- Reuse storage_backends, file_assets, and file_locations as the primary storage model
- Add a song/backend presence summary table
- Add an upload task queue table
- Add CLI commands to register an object storage backend and upload files to it
- Support S3-compatible object storage as the first upload backend type
- Store non-secret backend configuration in the database
- Read secrets from environment variables at runtime
- Mirror local relative paths into remote object keys
- Keep local files after successful upload
- Mark remote object locations as non-primary while local files remain primary
- Support queue-based concurrent upload workers
- Add limited concurrent download workers
- When download space is exhausted, pause the whole download flow once, prompt for a new directory once, then continue later tasks under the new root
- Update docs/catalogsync.md to document the upload workflow, object storage backend configuration, and the new commands
Out of Scope
- 123 cloud implementation
- Baidu Netdisk implementation
- Remote HEAD verification before every upload
- Automatic deletion of local files after upload
- Multi-backend upload in a single command
- GUI integration
- CDN upload orchestration beyond deriving an optional public URL
- Background daemon / scheduler service
Constraints
- Keep the current musicdl.catalogsync data model as the source of truth
- Do not duplicate file location truth into songs
- Do not store secret access credentials in SQLite
- First upload backend must be generic S3-compatible object storage
- Default behavior must trust database state rather than querying remote object existence every time
- Upload behavior must preserve existing local download behavior
- Download and upload concurrency must remain limited and operator-controllable
Recommended Architecture
Use the existing storage model as the base:
- storage_backends: backend definition
- file_assets: file-version identity
- file_locations: concrete physical or remote locations
Add two new layers:
- song_backend_presence: fast summary of whether a song has active files on a given backend
- upload_tasks: queue of upload work items per file asset and target backend/key
Implement one new uploader component:
- S3CompatibleUploader
  - resolves credentials from environment
  - uploads a local file to a configured backend
  - writes the resulting remote file location
  - refreshes backend presence
Keep the user-facing CLI small:
- register-object-backend
- upload
Internally, upload should still be queue-driven:
- enumerate missing remote uploads
- enqueue deduplicated tasks
- consume tasks with limited workers
Data Model
Existing Table Reuse
storage_backends
Object storage backends should reuse the current table with the following conventions:
- backend_type = 'object_storage'
- name: stable operator-facing backend name, for example main-s3
- container_name: object storage bucket name
- base_path: unused for object storage, may remain NULL
- config_json: non-secret configuration only
Recommended config_json keys:
- endpoint
- region
- base_prefix
- addressing_style
- public_base_url
- credential_env_prefix
Secrets must not be stored here.
file_assets
No semantic changes are required.
The upload unit stays aligned with the current model:
- one file_asset represents one concrete file version for a song
- if a song has multiple active local file versions, all of them are eligible for upload
file_locations
No structural redesign is required.
For object storage locations:
- backend_id: target object storage backend
- container_name: bucket
- locator: object key
- absolute_path: NULL
- remote_file_id: optional, reserved for future provider-specific remote IDs
- public_url: derived if backend config provides public_base_url
- download_url: optional, first version may keep this NULL
- status: active, deleted, or failed
- is_primary: 0 for remote object storage in the first version
The local location remains:
- status = 'active'
- is_primary = 1
New Table: song_backend_presence
Purpose:
- answer “does this song have active files on backend X?” quickly
- avoid pushing hard-coded backend presence fields into songs
Recommended schema:
song_id INTEGER NOT NULL
backend_id INTEGER NOT NULL
has_active_file INTEGER NOT NULL DEFAULT 0
active_file_count INTEGER NOT NULL DEFAULT 0
primary_file_location_id INTEGER
updated_at TEXT DEFAULT CURRENT_TIMESTAMP
PRIMARY KEY(song_id, backend_id)
Rules:
- this is a derived summary table, not the source of truth
- truth still comes from file_locations
- refresh this row whenever a location on that song/backend becomes active or inactive
New Table: upload_tasks
Purpose:
- queue upload work
- support retries, concurrency, and resumable batch execution
Recommended schema:
id INTEGER PRIMARY KEY AUTOINCREMENT
file_asset_id INTEGER NOT NULL
source_location_id INTEGER NOT NULL
target_backend_id INTEGER NOT NULL
target_container_name TEXT
target_locator TEXT NOT NULL
status TEXT NOT NULL DEFAULT 'pending'
attempts INTEGER NOT NULL DEFAULT 0
last_error TEXT
queued_at TEXT DEFAULT CURRENT_TIMESTAMP
started_at TEXT
finished_at TEXT
updated_at TEXT DEFAULT CURRENT_TIMESTAMP
UNIQUE(file_asset_id, target_backend_id, target_locator)
Task granularity:
- one task = one local file asset version uploaded to one target backend/key
This keeps the queue aligned with the “upload all active file versions” requirement.
Object Storage Key Rules
Key Shape
The object key should mirror the local relative path beneath the configured backend prefix.
If:
- local relative path is qq/Singer A/song-c.mp3
- backend base_prefix is music
Then:
- remote key becomes music/qq/Singer A/song-c.mp3
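The key rule above can be sketched as a small pure function. This is a minimal illustration, not the project's implementation: the name build_object_key is hypothetical, and the sketch assumes that local relative paths may arrive with Windows separators and should be rejected if they are absolute or contain "..".

```python
from pathlib import PurePosixPath, PureWindowsPath


def build_object_key(base_prefix: str, relative_path: str) -> str:
    """Join the backend prefix and a local relative path into an object key."""
    # PureWindowsPath splits on both "\\" and "/", so either style is accepted.
    local = PureWindowsPath(relative_path)
    parts = local.parts
    # Assumption: keys must come from plain relative paths only.
    if not parts or local.anchor or ".." in parts:
        raise ValueError(f"expected a plain relative path, got {relative_path!r}")
    key = PurePosixPath(*parts)
    prefix = base_prefix.strip("/")
    return str(PurePosixPath(prefix) / key) if prefix else str(key)


if __name__ == "__main__":
    print(build_object_key("music", "qq/Singer A/song-c.mp3"))
```

Because the same function can be applied to any stored relative path, the remote key stays derivable from the local location without a separate mapping table.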
Why Mirror The Relative Path
- easiest to reconnect local and remote locations
- preserves the existing local organization
- keeps future CDN and migration mapping simple
- reuses the semantics already established in file_locations.locator
Credential Model
Database Versus Secrets
Store only non-secret backend config in SQLite.
Resolve secrets from environment variables using the backend’s configured prefix.
Example:
- backend name: main-s3
- credential_env_prefix = CATALOGSYNC_MAIN_S3
Runtime lookup:
- CATALOGSYNC_MAIN_S3_ACCESS_KEY_ID
- CATALOGSYNC_MAIN_S3_SECRET_ACCESS_KEY
- CATALOGSYNC_MAIN_S3_SESSION_TOKEN (optional)
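The runtime lookup could look roughly like the sketch below. The function and type names (resolve_credentials, ObjectStorageCredentials) are hypothetical. Only the three variable suffixes come from the list above, and the fail-fast error on missing values follows the error-handling rules later in this document.

```python
import os
from typing import Mapping, NamedTuple, Optional


class ObjectStorageCredentials(NamedTuple):
    access_key_id: str
    secret_access_key: str
    session_token: Optional[str]


def resolve_credentials(
    prefix: str, environ: Optional[Mapping[str, str]] = None
) -> ObjectStorageCredentials:
    """Read credentials for one backend from <prefix>_* environment variables."""
    env = os.environ if environ is None else environ
    required = ("ACCESS_KEY_ID", "SECRET_ACCESS_KEY")
    missing = [f"{prefix}_{name}" for name in required if not env.get(f"{prefix}_{name}")]
    if missing:
        # Fail fast so a batch never starts without usable credentials.
        raise RuntimeError("missing credential environment variables: " + ", ".join(missing))
    return ObjectStorageCredentials(
        access_key_id=env[f"{prefix}_ACCESS_KEY_ID"],
        secret_access_key=env[f"{prefix}_SECRET_ACCESS_KEY"],
        # The session token is optional; an empty value is treated as absent.
        session_token=env.get(f"{prefix}_SESSION_TOKEN") or None,
    )
```

Passing a plain dict as environ keeps the lookup easy to test without touching the real process environment.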
Why This Model
- portable for long-running batch jobs
- safer than storing keys in SQLite
- works well across multiple machines and deployment targets
CLI Design
register-object-backend
Purpose:
- create or update one object storage backend definition
Example:
musicdl-catalogsync register-object-backend \
  --db D:\catalogsync\catalogsync.db \
  --backend main-s3 \
  --endpoint https://s3.example.com \
  --bucket music \
  --base-prefix music \
  --region auto \
  --addressing-style auto \
  --public-base-url https://cdn.example.com/music \
  --credential-env-prefix CATALOGSYNC_MAIN_S3
Behavior:
- upsert backend by name
- set backend_type='object_storage'
- validate required non-secret config before writing
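The behavior above might be implemented along these lines. This is a sketch under stated assumptions: the storage_backends table is trimmed to the columns this document mentions, name is assumed to carry a UNIQUE constraint, and the choice of which config keys are "required" is a guess, since the design does not list them. The function name register_object_backend is hypothetical.

```python
import json
import sqlite3

# Assumption: the design does not say which config keys are mandatory.
REQUIRED_CONFIG_KEYS = ("endpoint", "credential_env_prefix")

# Assumption: trimmed demo table with a UNIQUE name column.
DEMO_SCHEMA = """
CREATE TABLE IF NOT EXISTS storage_backends (
    id INTEGER PRIMARY KEY AUTOINCREMENT,
    name TEXT NOT NULL UNIQUE,
    backend_type TEXT NOT NULL,
    container_name TEXT,
    base_path TEXT,
    config_json TEXT
)
"""


def register_object_backend(conn: sqlite3.Connection, name: str, bucket: str, config: dict) -> int:
    """Validate non-secret config, then create or update the backend row."""
    missing = [key for key in REQUIRED_CONFIG_KEYS if not config.get(key)]
    if not name or not bucket:
        missing.insert(0, "name/bucket")
    if missing:
        raise ValueError("missing backend config: " + ", ".join(missing))
    with conn:  # commits on success, rolls back on error
        conn.execute(
            """
            INSERT INTO storage_backends
                (name, backend_type, container_name, base_path, config_json)
            VALUES (?, 'object_storage', ?, NULL, ?)
            ON CONFLICT(name) DO UPDATE SET
                backend_type = excluded.backend_type,
                container_name = excluded.container_name,
                config_json = excluded.config_json
            """,
            (name, bucket, json.dumps(config, sort_keys=True)),
        )
    row = conn.execute("SELECT id FROM storage_backends WHERE name = ?", (name,)).fetchone()
    return row[0]
```

Validation runs before the write, so a rejected registration leaves any existing backend row untouched.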
upload
Purpose:
- default: upload all local active file versions that are missing on the target backend
- optionally filter by source platform, playlist range, and count
Example:
musicdl-catalogsync upload --db D:\catalogsync\catalogsync.db --backend main-s3
musicdl-catalogsync upload --db D:\catalogsync\catalogsync.db --backend main-s3 --sources netease,qq --limit 200
musicdl-catalogsync upload --db D:\catalogsync\catalogsync.db --backend main-s3 --playlist-ids 12,15 --workers 4
Default semantics:
- trust database state
- do not do remote HEAD by default
- enqueue missing uploads
- consume queue with limited workers
Download CLI Extension
Extend the existing download and run workflows with:
- --workers
First-version default:
- download --workers 3
- upload --workers 4
These defaults should remain conservative and configurable.
Upload Execution Flow
Phase 1: Candidate Selection
For the target backend:
- find all active local file_locations
- resolve their file_asset
- derive target object key from:
  - backend base_prefix
  - local relative path
- skip assets that already have an active remote location on the same backend/key
Selection must support:
- all local songs
- --sources
- --playlist-ids
- --limit
Phase 2: Task Enqueue
For each missing remote file:
- insert or reuse a unique upload_tasks row
- set status to pending unless it is already uploading or succeeded
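One way to express the enqueue rule is a single SQLite upsert against the unique key of upload_tasks. The sketch below reuses the recommended schema from this document. The function name enqueue_upload is hypothetical, and resetting queued_at on requeue is an assumption.

```python
import sqlite3

UPLOAD_TASKS_DDL = """
CREATE TABLE IF NOT EXISTS upload_tasks (
    id INTEGER PRIMARY KEY AUTOINCREMENT,
    file_asset_id INTEGER NOT NULL,
    source_location_id INTEGER NOT NULL,
    target_backend_id INTEGER NOT NULL,
    target_container_name TEXT,
    target_locator TEXT NOT NULL,
    status TEXT NOT NULL DEFAULT 'pending',
    attempts INTEGER NOT NULL DEFAULT 0,
    last_error TEXT,
    queued_at TEXT DEFAULT CURRENT_TIMESTAMP,
    started_at TEXT,
    finished_at TEXT,
    updated_at TEXT DEFAULT CURRENT_TIMESTAMP,
    UNIQUE(file_asset_id, target_backend_id, target_locator)
)
"""


def enqueue_upload(conn, file_asset_id, source_location_id, backend_id, container, locator):
    """Insert a task or reuse the existing one; returns (task_id, status)."""
    with conn:
        conn.execute(
            """
            INSERT INTO upload_tasks
                (file_asset_id, source_location_id, target_backend_id,
                 target_container_name, target_locator)
            VALUES (?, ?, ?, ?, ?)
            ON CONFLICT(file_asset_id, target_backend_id, target_locator) DO UPDATE SET
                status = 'pending',
                source_location_id = excluded.source_location_id,
                queued_at = CURRENT_TIMESTAMP,
                updated_at = CURRENT_TIMESTAMP
            WHERE upload_tasks.status NOT IN ('uploading', 'succeeded')
            """,
            (file_asset_id, source_location_id, backend_id, container, locator),
        )
        # Rows that are uploading or succeeded are left exactly as they were.
        return conn.execute(
            """
            SELECT id, status FROM upload_tasks
            WHERE file_asset_id = ? AND target_backend_id = ? AND target_locator = ?
            """,
            (file_asset_id, backend_id, locator),
        ).fetchone()
```

Repeated upload runs therefore converge on one row per (file asset, backend, key) instead of piling up duplicates.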
Phase 3: Worker Claim
Each worker should:
- claim one pending task in a transaction
- move it to uploading
- set started_at
This must prevent duplicate worker claims.
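A claim step of this kind can be sketched with SQLite's BEGIN IMMEDIATE, which takes the write lock before the row is read so that two workers cannot select the same pending task. The sketch assumes each worker owns its own connection opened with isolation_level=None (manual transaction control). The table is trimmed to the columns the claim touches, and incrementing attempts at claim time is an assumption, not something this design specifies.

```python
import sqlite3

# Trimmed demo table: only the columns this sketch reads or writes.
CLAIM_DEMO_DDL = """
CREATE TABLE IF NOT EXISTS upload_tasks (
    id INTEGER PRIMARY KEY AUTOINCREMENT,
    target_locator TEXT NOT NULL,
    status TEXT NOT NULL DEFAULT 'pending',
    attempts INTEGER NOT NULL DEFAULT 0,
    started_at TEXT,
    updated_at TEXT DEFAULT CURRENT_TIMESTAMP
)
"""


def claim_next_task(conn: sqlite3.Connection):
    """Atomically move the oldest pending task to 'uploading'; returns its id or None."""
    conn.execute("BEGIN IMMEDIATE")  # take the write lock before reading
    try:
        row = conn.execute(
            "SELECT id FROM upload_tasks WHERE status = 'pending' ORDER BY id LIMIT 1"
        ).fetchone()
        if row is not None:
            conn.execute(
                """
                UPDATE upload_tasks
                SET status = 'uploading',
                    attempts = attempts + 1,  -- assumption: count attempts at claim
                    started_at = CURRENT_TIMESTAMP,
                    updated_at = CURRENT_TIMESTAMP
                WHERE id = ?
                """,
                (row[0],),
            )
        conn.execute("COMMIT")
    except BaseException:
        conn.execute("ROLLBACK")
        raise
    return None if row is None else row[0]
```

A worker loop would call claim_next_task until it returns None, then exit.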
Phase 4: Upload
For each claimed task:
- resolve source local file from source_location_id
- validate that the file still exists
- resolve backend config
- resolve credentials from environment
- upload to S3-compatible storage
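The final two steps can be kept behind a narrow client interface so tests can substitute a fake. The sketch below assumes a client object exposing upload_file(Filename, Bucket, Key), which is the method shape boto3's S3 client provides. This design does not name a client library, so treat that choice, and the names UploadClient and upload_local_file, as assumptions.

```python
import os
from typing import Protocol


class UploadClient(Protocol):
    """Minimal surface the uploader needs from an S3-compatible client."""

    def upload_file(self, Filename: str, Bucket: str, Key: str) -> None: ...


def upload_local_file(client: UploadClient, local_path: str, bucket: str, key: str) -> None:
    """Upload one local file, failing early if the source has disappeared."""
    if not os.path.isfile(local_path):
        # The caller is expected to mark the task failed and record last_error.
        raise FileNotFoundError(f"source file is missing: {local_path}")
    client.upload_file(Filename=local_path, Bucket=bucket, Key=key)
```

The local file is only read, never moved or deleted, which matches the requirement to keep local files after upload.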
Phase 5: Writeback
On success:
- write or upsert the remote file_location
- set remote status='active'
- keep remote is_primary=0
- refresh song_backend_presence
- mark task succeeded
- set finished_at
Upload Task State Machine
Use these first-version task states:
- pending
- uploading
- succeeded
- failed
- skipped
State transitions:
- enqueue → pending
- worker claim → uploading
- success with DB writeback → succeeded
- upload error or writeback error → failed
- no-op due to already-active remote location → skipped
Retry model:
- store attempts
- store last_error
- later upload runs may requeue or retry failed tasks under a bounded retry rule
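The states and retry rule above can be captured in a small transition table. This is one possible reading rather than a specification: the document lists the target states but not every legal source state, and it leaves the retry bound open, so max_attempts=3 below is purely illustrative.

```python
# Assumption: which source states may reach "skipped" is not spelled out above.
ALLOWED_TRANSITIONS = {
    "pending": {"uploading", "skipped"},
    "uploading": {"succeeded", "failed", "skipped"},
    "failed": {"pending"},  # requeue under the bounded retry rule
    "succeeded": set(),
    "skipped": set(),
}


def transition(current: str, new: str) -> str:
    """Return the new state, or raise if the move is not allowed."""
    if new not in ALLOWED_TRANSITIONS.get(current, set()):
        raise ValueError(f"illegal upload task transition: {current} -> {new}")
    return new


def can_retry(status: str, attempts: int, max_attempts: int = 3) -> bool:
    """A failed task may be requeued only while it is under the attempt bound."""
    return status == "failed" and attempts < max_attempts
```

Keeping the table in one place makes it straightforward to reject accidental moves such as succeeded back to pending.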
Backend Presence Refresh Rules
Whenever a remote location changes on (song_id, backend_id):
- count active locations for that song/backend
- update has_active_file
- update active_file_count
- set primary_file_location_id to a preferred active location on that backend
First version preference rule:
- if any active location exists on that backend, pick one deterministic row, for example the smallest active file_locations.id
This table exists for fast lookup and operator queries, not for deciding the actual upload truth.
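A refresh under these rules could be a single aggregate query followed by an upsert. The sketch below is illustrative only: file_assets.song_id and file_locations.file_asset_id are assumed join columns, since this document does not spell out the existing schema, and the demo tables are trimmed to what the query needs. The function name is hypothetical.

```python
import sqlite3

# Assumption: join columns and trimmed tables are for demonstration only.
DEMO_SCHEMA = """
CREATE TABLE file_assets (id INTEGER PRIMARY KEY, song_id INTEGER NOT NULL);
CREATE TABLE file_locations (
    id INTEGER PRIMARY KEY,
    file_asset_id INTEGER NOT NULL,
    backend_id INTEGER NOT NULL,
    status TEXT NOT NULL
);
CREATE TABLE song_backend_presence (
    song_id INTEGER NOT NULL,
    backend_id INTEGER NOT NULL,
    has_active_file INTEGER NOT NULL DEFAULT 0,
    active_file_count INTEGER NOT NULL DEFAULT 0,
    primary_file_location_id INTEGER,
    updated_at TEXT DEFAULT CURRENT_TIMESTAMP,
    PRIMARY KEY(song_id, backend_id)
);
"""


def refresh_song_backend_presence(conn: sqlite3.Connection, song_id: int, backend_id: int) -> int:
    """Recompute the summary row from file_locations; returns the active count."""
    with conn:
        count, first_id = conn.execute(
            """
            SELECT COUNT(*), MIN(fl.id)
            FROM file_locations AS fl
            JOIN file_assets AS fa ON fa.id = fl.file_asset_id
            WHERE fa.song_id = ? AND fl.backend_id = ? AND fl.status = 'active'
            """,
            (song_id, backend_id),
        ).fetchone()
        conn.execute(
            """
            INSERT INTO song_backend_presence
                (song_id, backend_id, has_active_file, active_file_count,
                 primary_file_location_id, updated_at)
            VALUES (?, ?, ?, ?, ?, CURRENT_TIMESTAMP)
            ON CONFLICT(song_id, backend_id) DO UPDATE SET
                has_active_file = excluded.has_active_file,
                active_file_count = excluded.active_file_count,
                primary_file_location_id = excluded.primary_file_location_id,
                updated_at = excluded.updated_at
            """,
            (song_id, backend_id, 1 if count else 0, count, first_id),
        )
    return count
```

MIN(fl.id) implements the "smallest active file_locations.id" preference and yields NULL when no active location remains.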
Limited Concurrency Design
Download Concurrency
Add limited worker-based download concurrency.
Key rule:
- disk-space exhaustion must trigger one global pause, not one prompt per worker
Behavior:
- workers process queued download items
- if a worker detects insufficient space under the current active root:
  - raise a shared pause request
  - stop dispatching new tasks
  - prompt the operator once for a new download directory
  - switch the shared active root
  - resume remaining not-yet-started tasks under the new root
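The "one global pause, one prompt" behavior above can be approximated with a lock-guarded shared root. In this hypothetical sketch, a worker that runs out of space reports the root it was using. Only the first report for that root triggers the prompt, and later reports simply receive the already-switched root. The class name, method names, and the prompt callback are all assumptions.

```python
import threading
from typing import Callable, Optional


class SharedDownloadRoot:
    """Holds the active download root and serializes directory switches."""

    def __init__(self, root: str, prompt: Callable[[str], Optional[str]]):
        self._lock = threading.Lock()
        self._root = root
        self._prompt = prompt
        self._aborted = False
        self.prompt_count = 0

    def current(self) -> str:
        # Blocks while a switch is in progress, which pauses new dispatch.
        with self._lock:
            if self._aborted:
                raise RuntimeError("download batch aborted: no replacement directory")
            return self._root

    def switch_from(self, exhausted_root: str) -> str:
        """Called by a worker that ran out of space under exhausted_root."""
        with self._lock:
            if self._aborted:
                raise RuntimeError("download batch aborted: no replacement directory")
            if self._root != exhausted_root:
                return self._root  # another worker already switched
            self.prompt_count += 1
            new_root = self._prompt(exhausted_root)
            if not new_root:
                self._aborted = True  # fail the remaining batch clearly
                raise RuntimeError("download batch aborted: no replacement directory")
            self._root = new_root
            return new_root
```

Workers would call current() before starting each task, so tasks that have not started yet pick up the new root automatically.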
Non-goals:
- per-worker independent root switching
- automatic multi-root balancing in the first version
Upload Concurrency
Upload workers should process queue rows concurrently but conservatively.
Requirements:
- claim tasks transactionally
- prevent duplicate uploads of the same (file_asset_id, backend_id, locator)
- keep worker count operator-controlled
Error Handling
Upload Errors
- missing source file
  - mark task failed
  - set descriptive last_error
- missing backend config
  - fail fast before batch execution
- missing environment credentials
  - fail fast before batch execution
- upload transport error
  - mark task failed
- upload succeeded but DB writeback failed
  - mark task failed
  - store explicit last_error explaining that remote upload may already exist
Download Errors
- worker download failure
  - record failure for that item and continue with other tasks
- insufficient disk space
  - trigger one global directory-switch prompt
- no replacement directory supplied
  - fail the remaining batch clearly
Testing
Add or update coverage for the following areas.
Schema Tests
- song_backend_presence exists
- upload_tasks exists
- unique constraint on upload task dedupe works
Repository Tests
- register or upsert object storage backends
- write remote file_locations
- refresh song_backend_presence
- enqueue deduplicated upload tasks
- select pending upload candidates by backend, source, playlist, and limit
Uploader / Service Tests
Using a fake or stub S3-compatible client:
- successful upload creates active remote location
- public URL derivation when configured
- missing source file becomes failed
- missing credentials fail fast
- multiple local file versions for one song are all enqueued
CLI Tests
- register-object-backend
- upload --backend ...
- upload --sources ...
- upload --playlist-ids ...
- upload --limit ...
- upload --workers ...
- download --workers ...
Concurrency Tests
- concurrent upload workers do not claim the same task twice
- concurrent download workers trigger only one directory switch prompt
- after directory switch, later downloads use the new root
Documentation Tests
- docs/catalogsync.md is updated to describe:
  - object storage backend registration
  - upload command usage
  - queue semantics
  - environment variable credential model
  - download/upload worker options
Documentation Requirements
Implementation must update docs/catalogsync.md to include:
- why object storage uses backend config plus env-based secrets
- how to register an object storage backend
- how remote keys mirror local relative paths
- how upload works by default
- what song_backend_presence and upload_tasks are for
- how --workers affects download and upload
- how the global download directory switch behaves under low disk space
Acceptance Criteria
- An operator can register an S3-compatible object storage backend without storing secrets in SQLite
- upload can enqueue and execute uploads for missing remote files on that backend
- Remote object keys mirror local relative paths beneath the configured backend prefix
- Successful uploads create active remote file_locations
- Local files remain active and primary after upload
- song_backend_presence shows whether a song has active files on a given backend
- upload_tasks supports resumable queue execution with bounded retries
- The first version uploads all active local file versions for a song, not just one version
- upload supports both full backend fill-in mode and filtered mode
- Download and upload both support limited operator-configurable concurrency
- Low disk space during download triggers one global prompt and one shared root switch for later tasks
- docs/catalogsync.md is updated together with the implementation