# Object Storage Upload Automation Design

## Goal

Extend `musicdl.catalogsync` with a first-class object storage upload workflow that:

- uploads downloaded local files to an S3-compatible object storage backend
- preserves local files after upload
- mirrors the local relative path into the remote object key
- records remote locations in the catalog database
- tracks backend presence per song for fast lookup
- supports queue-based upload execution and limited concurrency
- updates `docs/catalogsync.md` alongside the implementation so operator docs stay current

This sub-project also introduces limited concurrent downloads so that very large catalogs do not have to run fully serially.

## Scope

### In Scope

- Add a queue-based upload workflow for object storage backends
- Reuse `storage_backends`, `file_assets`, and `file_locations` as the primary storage model
- Add a song/backend presence summary table
- Add an upload task queue table
- Add CLI commands to register an object storage backend and upload files to it
- Support S3-compatible object storage as the first upload backend type
- Store non-secret backend configuration in the database
- Read secrets from environment variables at runtime
- Mirror local relative paths into remote object keys
- Keep local files after successful upload
- Mark remote object locations as non-primary while local files remain primary
- Support queue-based concurrent upload workers
- Add limited concurrent download workers
- When download space is exhausted, pause the whole download flow once, prompt for a new directory once, then continue later tasks under the new root
- Update `docs/catalogsync.md` to document the upload workflow, object storage backend configuration, and the new commands

### Out of Scope

- 123 cloud implementation
- Baidu Netdisk implementation
- Remote `HEAD` verification before every upload
- Automatic deletion of local files after upload
- Multi-backend upload in a single command
- GUI integration
- CDN upload orchestration beyond deriving an optional public URL
- Background daemon / scheduler service

## Constraints

- Keep the current `musicdl.catalogsync` data model as the source of truth
- Do not duplicate file location truth into `songs`
- Do not store secret access credentials in SQLite
- First upload backend must be generic S3-compatible object storage
- Default behavior must trust database state rather than querying remote object existence every time
- Upload behavior must preserve existing local download behavior
- Download and upload concurrency must remain limited and operator-controllable

## Recommended Architecture

Use the existing storage model as the base:

- `storage_backends`
  - backend definition
- `file_assets`
  - file-version identity
- `file_locations`
  - concrete physical or remote locations

Add two new layers:

- `song_backend_presence`
  - fast summary of whether a song has active files on a given backend
- `upload_tasks`
  - queue of upload work items per file asset and target backend/key

Implement one new uploader component:

- `S3CompatibleUploader`
  - resolves credentials from environment
  - uploads a local file to a configured backend
  - writes the resulting remote file location
  - refreshes backend presence

Keep the user-facing CLI small:

- `register-object-backend`
- `upload`

Internally, `upload` should still be queue-driven:

1. enumerate missing remote uploads
2. enqueue deduplicated tasks
3. consume tasks with limited workers

## Data Model

### Existing Table Reuse

#### `storage_backends`

Object storage backends should reuse the current table with the following conventions:

- `backend_type = 'object_storage'`
- `name`
  - stable operator-facing backend name, for example `main-s3`
- `container_name`
  - object storage bucket name
- `base_path`
  - unused for object storage; may remain `NULL`
- `config_json`
  - non-secret configuration only

Recommended `config_json` keys:

- `endpoint`
- `region`
- `base_prefix`
- `addressing_style`
- `public_base_url`
- `credential_env_prefix`

Secrets must not be stored here.

#### `file_assets`

No semantic changes are required.

The upload unit stays aligned with the current model:

- one `file_asset` represents one concrete file version for a song
- if a song has multiple active local file versions, all of them are eligible for upload

#### `file_locations`

No structural redesign is required.

For object storage locations:

- `backend_id`
  - target object storage backend
- `container_name`
  - bucket
- `locator`
  - object key
- `absolute_path`
  - `NULL`
- `remote_file_id`
  - optional, reserved for future provider-specific remote IDs
- `public_url`
  - derived if backend config provides `public_base_url`
- `download_url`
  - optional, first version may keep this `NULL`
- `status`
  - `active`, `deleted`, or `failed`
- `is_primary`
  - `0` for remote object storage in the first version

The local location remains:

- `status = 'active'`
- `is_primary = 1`

### New Table: `song_backend_presence`

Purpose:

- answer “does this song have active files on backend X?” quickly
- avoid pushing hard-coded backend presence fields into `songs`

Recommended schema:

```text
song_id INTEGER NOT NULL
backend_id INTEGER NOT NULL
has_active_file INTEGER NOT NULL DEFAULT 0
active_file_count INTEGER NOT NULL DEFAULT 0
primary_file_location_id INTEGER
updated_at TEXT DEFAULT CURRENT_TIMESTAMP
PRIMARY KEY(song_id, backend_id)
```

Rules:

- this is a derived summary table, not the source of truth
- truth still comes from `file_locations`
- refresh this row whenever a location on that song/backend becomes active or inactive

### New Table: `upload_tasks`

Purpose:

- queue upload work
- support retries, concurrency, and resumable batch execution

Recommended schema:

```text
id INTEGER PRIMARY KEY AUTOINCREMENT
file_asset_id INTEGER NOT NULL
source_location_id INTEGER NOT NULL
target_backend_id INTEGER NOT NULL
target_container_name TEXT
target_locator TEXT NOT NULL
status TEXT NOT NULL DEFAULT 'pending'
attempts INTEGER NOT NULL DEFAULT 0
last_error TEXT
queued_at TEXT DEFAULT CURRENT_TIMESTAMP
started_at TEXT
finished_at TEXT
updated_at TEXT DEFAULT CURRENT_TIMESTAMP
UNIQUE(file_asset_id, target_backend_id, target_locator)
```

Task granularity:

- one task = one local file asset version uploaded to one target backend/key

This keeps the queue aligned with the “upload all active file versions” requirement.
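
As a minimal sketch, the recommended schema can be expressed as SQLite DDL and exercised through Python's `sqlite3` module. The DDL mirrors the columns listed above; the `add_task` helper and the sample IDs and keys are hypothetical and exist only to show the `UNIQUE` constraint deduplicating tasks.

```python
import sqlite3

# DDL transcribed from the recommended schema; foreign keys and indexes are omitted.
UPLOAD_TASKS_DDL = """
CREATE TABLE IF NOT EXISTS upload_tasks (
    id INTEGER PRIMARY KEY AUTOINCREMENT,
    file_asset_id INTEGER NOT NULL,
    source_location_id INTEGER NOT NULL,
    target_backend_id INTEGER NOT NULL,
    target_container_name TEXT,
    target_locator TEXT NOT NULL,
    status TEXT NOT NULL DEFAULT 'pending',
    attempts INTEGER NOT NULL DEFAULT 0,
    last_error TEXT,
    queued_at TEXT DEFAULT CURRENT_TIMESTAMP,
    started_at TEXT,
    finished_at TEXT,
    updated_at TEXT DEFAULT CURRENT_TIMESTAMP,
    UNIQUE(file_asset_id, target_backend_id, target_locator)
)
"""


def add_task(conn, file_asset_id, source_location_id, backend_id, bucket, key):
    """Hypothetical helper: insert one task, return True only if a new row was created."""
    with conn:
        cur = conn.execute(
            "INSERT OR IGNORE INTO upload_tasks "
            "(file_asset_id, source_location_id, target_backend_id, "
            " target_container_name, target_locator) VALUES (?, ?, ?, ?, ?)",
            (file_asset_id, source_location_id, backend_id, bucket, key),
        )
    return cur.rowcount == 1


conn = sqlite3.connect(":memory:")
conn.execute(UPLOAD_TASKS_DDL)

# Same asset, backend, and key twice: the second insert is ignored.
first = add_task(conn, 1, 10, 2, "music", "music/qq/Singer A/song-c.mp3")
duplicate = add_task(conn, 1, 10, 2, "music", "music/qq/Singer A/song-c.mp3")
# A second file version of the same song gets its own task.
other_version = add_task(conn, 2, 11, 2, "music", "music/qq/Singer A/song-c.flac")
task_count = conn.execute("SELECT COUNT(*) FROM upload_tasks").fetchone()[0]
```

Because each file version has its own `file_asset_id`, two versions of one song produce two rows, while re-enqueueing the same version to the same key produces none.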

## Object Storage Key Rules

### Key Shape

The object key should mirror the local relative path beneath the configured backend prefix.

If:

- local relative path is `qq/Singer A/song-c.mp3`
- backend `base_prefix` is `music`

Then:

- remote key becomes `music/qq/Singer A/song-c.mp3`
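
The mapping above can be sketched as a small pure function. This is an illustration under assumptions, not the project's implementation: `build_object_key` is a hypothetical name, and the choice to normalize Windows separators and reject absolute or `..` paths is an assumption about how local relative paths are stored.

```python
from pathlib import PureWindowsPath


def build_object_key(relative_path: str, base_prefix: str = "") -> str:
    """Join an optional backend prefix and a local relative path into an object key."""
    # PureWindowsPath splits on both "/" and "\\", so paths recorded on
    # Windows and on POSIX systems normalize to the same key.
    rel = PureWindowsPath(relative_path)
    if rel.drive or rel.root:
        raise ValueError(f"expected a relative path, got {relative_path!r}")
    parts = list(rel.parts)
    if not parts or ".." in parts:
        raise ValueError(f"unsafe or empty relative path: {relative_path!r}")
    # Tolerate prefixes written as "music", "/music", or "music/".
    prefix_parts = [p for p in base_prefix.replace("\\", "/").split("/") if p]
    return "/".join(prefix_parts + parts)


key = build_object_key("qq/Singer A/song-c.mp3", "music")
windows_key = build_object_key("qq\\Singer A\\song-c.mp3", "music")
no_prefix_key = build_object_key("qq/Singer A/song-c.mp3")
```

Object keys always use `/` as the separator, so normalizing before joining keeps keys identical regardless of which operating system produced the local path.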

### Why Mirror The Relative Path

- easiest to reconnect local and remote locations
- preserves the existing local organization
- keeps future CDN and migration mapping simple
- reuses the semantics already established in `file_locations.locator`

## Credential Model

### Database Versus Secrets

Store only non-secret backend config in SQLite.

Resolve secrets from environment variables using the backend’s configured prefix.

Example:

- backend name: `main-s3`
- `credential_env_prefix = CATALOGSYNC_MAIN_S3`

Runtime lookup:

- `CATALOGSYNC_MAIN_S3_ACCESS_KEY_ID`
- `CATALOGSYNC_MAIN_S3_SECRET_ACCESS_KEY`
- `CATALOGSYNC_MAIN_S3_SESSION_TOKEN` optional
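
A minimal sketch of the runtime lookup, assuming the three variable suffixes listed above. The `S3Credentials` dataclass, `MissingCredentialsError`, and `resolve_credentials` are hypothetical names; the `environ` parameter exists only so the lookup can be tested without touching the real process environment.

```python
import os
from dataclasses import dataclass
from typing import Mapping, Optional


@dataclass(frozen=True)
class S3Credentials:
    access_key_id: str
    secret_access_key: str
    session_token: Optional[str] = None


class MissingCredentialsError(RuntimeError):
    """Raised before any upload starts when required variables are absent."""


def resolve_credentials(
    prefix: str, environ: Optional[Mapping[str, str]] = None
) -> S3Credentials:
    env = os.environ if environ is None else environ
    key_var = f"{prefix}_ACCESS_KEY_ID"
    secret_var = f"{prefix}_SECRET_ACCESS_KEY"
    token_var = f"{prefix}_SESSION_TOKEN"
    # Report every missing variable at once so the operator can fix them together.
    missing = [name for name in (key_var, secret_var) if not env.get(name)]
    if missing:
        raise MissingCredentialsError(
            "missing environment variables: " + ", ".join(missing)
        )
    # The session token is optional; an empty value is treated as absent.
    return S3Credentials(env[key_var], env[secret_var], env.get(token_var) or None)


fake_env = {
    "CATALOGSYNC_MAIN_S3_ACCESS_KEY_ID": "example-key-id",
    "CATALOGSYNC_MAIN_S3_SECRET_ACCESS_KEY": "example-secret",
}
creds = resolve_credentials("CATALOGSYNC_MAIN_S3", fake_env)
```

Resolving credentials once, before the batch starts, matches the fail-fast rule for missing credentials described under Error Handling.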

### Why This Model

- portable for long-running batch jobs
- safer than storing keys in SQLite
- works well across multiple machines and deployment targets

## CLI Design

### `register-object-backend`

Purpose:

- create or update one object storage backend definition

Example:

```bash
musicdl-catalogsync register-object-backend \
  --db 'D:\catalogsync\catalogsync.db' \
  --backend main-s3 \
  --endpoint https://s3.example.com \
  --bucket music \
  --base-prefix music \
  --region auto \
  --addressing-style auto \
  --public-base-url https://cdn.example.com/music \
  --credential-env-prefix CATALOGSYNC_MAIN_S3
```

Behavior:

- upsert backend by `name`
- set `backend_type='object_storage'`
- validate required non-secret config before writing
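
The upsert behavior could look roughly like the sketch below. Several things here are assumptions: the `storage_backends` table is reduced to the columns named in this document with a `UNIQUE` constraint on `name`, the set of required keys (`endpoint`, `credential_env_prefix`) is a guess, and `register_object_backend` is a hypothetical function name. The `ON CONFLICT ... DO UPDATE` form requires SQLite 3.24 or newer.

```python
import json
import sqlite3

ALLOWED_CONFIG_KEYS = {
    "endpoint", "region", "base_prefix",
    "addressing_style", "public_base_url", "credential_env_prefix",
}
REQUIRED_CONFIG_KEYS = ("endpoint", "credential_env_prefix")  # assumption

UPSERT_SQL = """
INSERT INTO storage_backends (name, backend_type, container_name, base_path, config_json)
VALUES (?, 'object_storage', ?, NULL, ?)
ON CONFLICT(name) DO UPDATE SET
    backend_type = excluded.backend_type,
    container_name = excluded.container_name,
    config_json = excluded.config_json
"""


def register_object_backend(conn, name, bucket, config):
    """Validate non-secret config, then create or update the backend row by name."""
    unknown = sorted(set(config) - ALLOWED_CONFIG_KEYS)
    if unknown:
        # Anything outside the allow-list is rejected so secrets cannot slip in.
        raise ValueError(f"unsupported config keys: {unknown}")
    missing = [key for key in REQUIRED_CONFIG_KEYS if not config.get(key)]
    if not name:
        missing.append("name")
    if not bucket:
        missing.append("bucket")
    if missing:
        raise ValueError(f"missing required values: {missing}")
    with conn:
        conn.execute(UPSERT_SQL, (name, bucket, json.dumps(config, sort_keys=True)))
    row = conn.execute("SELECT id FROM storage_backends WHERE name = ?", (name,))
    return row.fetchone()[0]


conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE storage_backends (id INTEGER PRIMARY KEY, name TEXT NOT NULL UNIQUE, "
    "backend_type TEXT NOT NULL, container_name TEXT, base_path TEXT, config_json TEXT)"
)
config = {
    "endpoint": "https://s3.example.com",
    "region": "auto",
    "base_prefix": "music",
    "credential_env_prefix": "CATALOGSYNC_MAIN_S3",
}
backend_id = register_object_backend(conn, "main-s3", "music", config)
# Registering the same name again updates the row instead of adding a second one.
same_id = register_object_backend(conn, "main-s3", "music", {**config, "region": "us-east-1"})
```

An allow-list for `config_json` keys is one simple way to enforce the rule that secrets never reach the database.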

### `upload`

Purpose:

- default: upload all local active file versions that are missing on the target backend
- optionally filter by source platform, playlist IDs, and a maximum count

Example:

```bash
musicdl-catalogsync upload --db 'D:\catalogsync\catalogsync.db' --backend main-s3
musicdl-catalogsync upload --db 'D:\catalogsync\catalogsync.db' --backend main-s3 --sources netease,qq --limit 200
musicdl-catalogsync upload --db 'D:\catalogsync\catalogsync.db' --backend main-s3 --playlist-ids 12,15 --workers 4
```

Default semantics:

- trust database state
- do not do remote `HEAD` by default
- enqueue missing uploads
- consume queue with limited workers

### Download CLI Extension

Extend the existing `download` and `run` workflows with:

- `--workers`

First-version defaults:

- `download --workers 3`
- `upload --workers 4`

These defaults should remain conservative and configurable.

## Upload Execution Flow

### Phase 1: Candidate Selection

For the target backend:

- find all active local `file_locations`
- resolve their `file_asset`
- derive target object key from:
  - backend `base_prefix`
  - local relative path
- skip assets that already have an active remote location on the same backend/key

Selection must support:

- all local songs
- `--sources`
- `--playlist-ids`
- `--limit`

### Phase 2: Task Enqueue

For each missing remote file:

- insert or reuse a unique `upload_tasks` row
- set status to `pending` unless it is already `uploading` or `succeeded`
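
One way to express "insert or reuse, but leave `uploading` and `succeeded` rows alone" is a conditional SQLite upsert. The sketch below assumes a trimmed `upload_tasks` table and a hypothetical `enqueue_upload` helper; it does not apply any retry limit, which a real implementation would add for `failed` rows.

```python
import sqlite3

TASKS_DDL = """
CREATE TABLE upload_tasks (
    id INTEGER PRIMARY KEY AUTOINCREMENT,
    file_asset_id INTEGER NOT NULL,
    source_location_id INTEGER NOT NULL,
    target_backend_id INTEGER NOT NULL,
    target_container_name TEXT,
    target_locator TEXT NOT NULL,
    status TEXT NOT NULL DEFAULT 'pending',
    updated_at TEXT DEFAULT CURRENT_TIMESTAMP,
    UNIQUE(file_asset_id, target_backend_id, target_locator)
)
"""

# The WHERE clause on DO UPDATE makes the upsert a no-op for rows that are
# currently being uploaded or have already succeeded.
ENQUEUE_SQL = """
INSERT INTO upload_tasks
    (file_asset_id, source_location_id, target_backend_id,
     target_container_name, target_locator, status)
VALUES (?, ?, ?, ?, ?, 'pending')
ON CONFLICT(file_asset_id, target_backend_id, target_locator) DO UPDATE SET
    status = 'pending',
    source_location_id = excluded.source_location_id,
    updated_at = CURRENT_TIMESTAMP
WHERE upload_tasks.status NOT IN ('uploading', 'succeeded')
"""


def enqueue_upload(conn, file_asset_id, source_location_id, backend_id, bucket, key):
    """Insert or reuse the task row and return its resulting status."""
    with conn:
        conn.execute(
            ENQUEUE_SQL, (file_asset_id, source_location_id, backend_id, bucket, key)
        )
    row = conn.execute(
        "SELECT status FROM upload_tasks "
        "WHERE file_asset_id = ? AND target_backend_id = ? AND target_locator = ?",
        (file_asset_id, backend_id, key),
    ).fetchone()
    return row[0]


conn = sqlite3.connect(":memory:")
conn.execute(TASKS_DDL)
key = "music/qq/Singer A/song-c.mp3"

first = enqueue_upload(conn, 1, 10, 2, "music", key)
conn.execute("UPDATE upload_tasks SET status = 'failed'")
requeued = enqueue_upload(conn, 1, 10, 2, "music", key)
conn.execute("UPDATE upload_tasks SET status = 'succeeded'")
untouched = enqueue_upload(conn, 1, 10, 2, "music", key)
row_count = conn.execute("SELECT COUNT(*) FROM upload_tasks").fetchone()[0]
```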

### Phase 3: Worker Claim

Each worker should:

- claim one `pending` task in a transaction
- move it to `uploading`
- set `started_at`

This must prevent duplicate worker claims.
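
A minimal sketch of a transactional claim on SQLite, assuming each worker holds its own connection opened with `isolation_level=None` so the code controls transactions explicitly. `BEGIN IMMEDIATE` takes the write lock up front, which serializes claims across connections. The `claim_next_task` name, the trimmed table, and the choice to increment `attempts` at claim time are assumptions.

```python
import sqlite3

TASKS_DDL = """
CREATE TABLE IF NOT EXISTS upload_tasks (
    id INTEGER PRIMARY KEY AUTOINCREMENT,
    file_asset_id INTEGER NOT NULL,
    target_backend_id INTEGER NOT NULL,
    target_locator TEXT NOT NULL,
    status TEXT NOT NULL DEFAULT 'pending',
    attempts INTEGER NOT NULL DEFAULT 0,
    started_at TEXT,
    updated_at TEXT DEFAULT CURRENT_TIMESTAMP
)
"""


def claim_next_task(conn):
    """Atomically move the oldest pending task to 'uploading'; return its id or None."""
    # BEGIN IMMEDIATE acquires the write lock before reading, so two workers
    # cannot both select the same pending row.
    conn.execute("BEGIN IMMEDIATE")
    try:
        row = conn.execute(
            "SELECT id FROM upload_tasks WHERE status = 'pending' ORDER BY id LIMIT 1"
        ).fetchone()
        if row is None:
            conn.execute("COMMIT")
            return None
        conn.execute(
            "UPDATE upload_tasks SET status = 'uploading', attempts = attempts + 1, "
            "started_at = CURRENT_TIMESTAMP, updated_at = CURRENT_TIMESTAMP "
            "WHERE id = ? AND status = 'pending'",
            (row[0],),
        )
        conn.execute("COMMIT")
        return row[0]
    except Exception:
        conn.execute("ROLLBACK")
        raise


# Single-connection demonstration; real workers would each open their own connection.
conn = sqlite3.connect(":memory:", isolation_level=None)
conn.execute(TASKS_DDL)
for n in range(2):
    conn.execute(
        "INSERT INTO upload_tasks (file_asset_id, target_backend_id, target_locator) "
        "VALUES (?, 1, ?)",
        (n, f"music/track-{n}.mp3"),
    )

first_claim = claim_next_task(conn)
second_claim = claim_next_task(conn)
third_claim = claim_next_task(conn)  # queue is empty at this point
```

With a file-backed database, passing a generous `timeout` to `sqlite3.connect` lets competing workers wait for the lock instead of failing immediately.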

### Phase 4: Upload

For each claimed task:

- resolve source local file from `source_location_id`
- validate that the file still exists
- resolve backend config
- resolve credentials from environment
- upload to S3-compatible storage

### Phase 5: Writeback

On success:

- write or upsert the remote `file_location`
- set remote `status='active'`
- keep remote `is_primary=0`
- refresh `song_backend_presence`
- mark task `succeeded`
- set `finished_at`

## Upload Task State Machine

Use these first-version task states:

- `pending`
- `uploading`
- `succeeded`
- `failed`
- `skipped`

State transitions:

- enqueue → `pending`
- worker claim → `uploading`
- success with DB writeback → `succeeded`
- upload error or writeback error → `failed`
- no-op due to already-active remote location → `skipped`

Retry model:

- store `attempts`
- store `last_error`
- later `upload` runs may requeue `failed` tasks, subject to a bounded retry limit on `attempts`
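
The transitions and retry rule above can be captured in a small lookup table. This is a sketch under assumptions: the document does not say from which state `skipped` is reached, so both `pending` and `uploading` are allowed here, and `MAX_ATTEMPTS = 3` is a placeholder rather than a specified limit.

```python
# None represents "no task row yet" (the state before enqueue).
ALLOWED_TRANSITIONS = {
    None: {"pending"},                                # enqueue
    "pending": {"uploading", "skipped"},              # worker claim, or no-op (assumption)
    "uploading": {"succeeded", "failed", "skipped"},  # result of the upload attempt
    "failed": {"pending"},                            # requeue on a later run
    "succeeded": set(),                               # terminal
    "skipped": set(),                                 # terminal in this sketch
}

MAX_ATTEMPTS = 3  # hypothetical bound; the design only requires that one exists


def can_transition(current, new):
    """Return True if moving a task from `current` to `new` is allowed."""
    return new in ALLOWED_TRANSITIONS.get(current, set())


def should_requeue(status, attempts, max_attempts=MAX_ATTEMPTS):
    """A failed task is retried only while it is under the attempt limit."""
    return status == "failed" and attempts < max_attempts


claim_ok = can_transition("pending", "uploading")
reupload_blocked = not can_transition("succeeded", "uploading")
retry_second_attempt = should_requeue("failed", attempts=2)
retry_exhausted = should_requeue("failed", attempts=3)
```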

## Backend Presence Refresh Rules

Whenever a remote location changes on `(song_id, backend_id)`:

- count active locations for that song/backend
- update `has_active_file`
- update `active_file_count`
- set `primary_file_location_id` to a preferred active location on that backend

First version preference rule:

- if any active location exists on that backend, pick one deterministic row, for example the smallest active `file_locations.id`

This table exists for fast lookup and operator queries. It does not decide whether a file still needs uploading; that decision comes from `file_locations`.
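
A sketch of the refresh, assuming `file_assets` carries a `song_id` column and `file_locations` references it through `file_asset_id`. Both tables are reduced to the columns the query needs, and `refresh_presence` is a hypothetical name. The summary row is recomputed from `file_locations` every time, so it cannot drift from the source of truth.

```python
import sqlite3

SCHEMA = """
CREATE TABLE file_assets (id INTEGER PRIMARY KEY, song_id INTEGER NOT NULL);
CREATE TABLE file_locations (
    id INTEGER PRIMARY KEY,
    file_asset_id INTEGER NOT NULL,
    backend_id INTEGER NOT NULL,
    status TEXT NOT NULL
);
CREATE TABLE song_backend_presence (
    song_id INTEGER NOT NULL,
    backend_id INTEGER NOT NULL,
    has_active_file INTEGER NOT NULL DEFAULT 0,
    active_file_count INTEGER NOT NULL DEFAULT 0,
    primary_file_location_id INTEGER,
    updated_at TEXT DEFAULT CURRENT_TIMESTAMP,
    PRIMARY KEY(song_id, backend_id)
);
"""

COUNT_SQL = """
SELECT COUNT(*), MIN(fl.id)
FROM file_locations AS fl
JOIN file_assets AS fa ON fa.id = fl.file_asset_id
WHERE fa.song_id = ? AND fl.backend_id = ? AND fl.status = 'active'
"""

UPSERT_SQL = """
INSERT INTO song_backend_presence
    (song_id, backend_id, has_active_file, active_file_count,
     primary_file_location_id, updated_at)
VALUES (?, ?, ?, ?, ?, CURRENT_TIMESTAMP)
ON CONFLICT(song_id, backend_id) DO UPDATE SET
    has_active_file = excluded.has_active_file,
    active_file_count = excluded.active_file_count,
    primary_file_location_id = excluded.primary_file_location_id,
    updated_at = CURRENT_TIMESTAMP
"""


def refresh_presence(conn, song_id, backend_id):
    """Recompute the summary row; the smallest active location id is preferred."""
    count, smallest_id = conn.execute(COUNT_SQL, (song_id, backend_id)).fetchone()
    with conn:
        conn.execute(
            UPSERT_SQL, (song_id, backend_id, 1 if count else 0, count, smallest_id)
        )
    return count, smallest_id


conn = sqlite3.connect(":memory:")
conn.executescript(SCHEMA)
conn.execute("INSERT INTO file_assets (id, song_id) VALUES (1, 100), (2, 100)")
conn.execute(
    "INSERT INTO file_locations (id, file_asset_id, backend_id, status) "
    "VALUES (7, 1, 2, 'active'), (8, 2, 2, 'active')"
)
after_upload = refresh_presence(conn, 100, 2)

conn.execute("UPDATE file_locations SET status = 'deleted' WHERE backend_id = 2")
after_delete = refresh_presence(conn, 100, 2)
```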

## Limited Concurrency Design

### Download Concurrency

Add limited worker-based download concurrency.

Key rule:

- disk-space exhaustion must trigger one global pause, not one prompt per worker

Behavior:

1. workers process queued download items
2. if a worker detects insufficient space under the current active root:
   - raise a shared pause request
   - stop dispatching new tasks
3. prompt the operator once for a new download directory
4. switch the shared active root
5. resume remaining not-yet-started tasks under the new root

Non-goals:

- per-worker independent root switching
- automatic multi-root balancing in the first version
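
The single-prompt rule can be sketched with a lock-protected shared root. `SharedDownloadRoot`, `report_full`, and the example paths are hypothetical. The idea is that a worker reports the root it found full; only the first report for a given root triggers the prompt, and later reports simply receive the root that was already chosen. While the prompt is open the lock is held, so other workers asking for the current root wait instead of starting new tasks.

```python
import threading
from typing import Callable, Optional


class NoReplacementRoot(RuntimeError):
    """Raised when the operator declines to supply a new download directory."""


class SharedDownloadRoot:
    def __init__(self, initial_root: str, prompt: Callable[[str], Optional[str]]):
        self._lock = threading.Lock()
        self._root = initial_root
        self._prompt = prompt
        self._aborted = False

    def current(self) -> str:
        """Root for the next task; blocks while a directory switch is in progress."""
        with self._lock:
            if self._aborted:
                raise NoReplacementRoot("download batch aborted: no replacement directory")
            return self._root

    def report_full(self, full_root: str) -> str:
        """Called by a worker that ran out of space under `full_root`."""
        with self._lock:
            if self._aborted:
                raise NoReplacementRoot("download batch aborted: no replacement directory")
            if self._root != full_root:
                # Another worker already handled this root; do not prompt again.
                return self._root
            new_root = self._prompt(full_root)
            if not new_root:
                self._aborted = True
                raise NoReplacementRoot("no replacement directory supplied")
            self._root = new_root
            return new_root


prompts = []


def fake_prompt(full_root):
    # Stand-in for an interactive prompt; records each call for inspection.
    prompts.append(full_root)
    return "/mnt/disk-b"


roots = SharedDownloadRoot("/mnt/disk-a", fake_prompt)
seen_by_worker_1 = roots.report_full("/mnt/disk-a")  # triggers the prompt
seen_by_worker_2 = roots.report_full("/mnt/disk-a")  # reuses the new root silently
```

A worker could detect low space with the standard library's `shutil.disk_usage(path).free` before starting a download, then call `report_full` with the root it was using.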

### Upload Concurrency

Upload workers should process queue rows concurrently but conservatively.

Requirements:

- claim tasks transactionally
- prevent duplicate uploads of the same `(file_asset_id, target_backend_id, target_locator)`
- keep worker count operator-controlled

## Error Handling

### Upload Errors

- missing source file
  - mark task `failed`
  - set descriptive `last_error`
- missing backend config
  - fail fast before batch execution
- missing environment credentials
  - fail fast before batch execution
- upload transport error
  - mark task `failed`
- upload succeeded but DB writeback failed
  - mark task `failed`
  - store explicit `last_error` explaining that remote upload may already exist
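
The per-task error cases above can be sketched as one function that maps each failure point to a task status and `last_error`. The `upload` and `write_back` callables are hypothetical stand-ins for the S3 client call and the database writeback, and `TaskOutcome` is an illustrative type; backend config and credential checks are assumed to have already passed before the batch started.

```python
import os
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class TaskOutcome:
    status: str
    last_error: Optional[str] = None


def run_upload_task(
    source_path: str,
    bucket: str,
    key: str,
    upload: Callable[[str, str, str], None],
    write_back: Callable[[str, str], None],
) -> TaskOutcome:
    """Run one claimed task and translate each failure point into a task outcome."""
    if not os.path.isfile(source_path):
        return TaskOutcome("failed", f"source file missing: {source_path}")
    try:
        upload(source_path, bucket, key)
    except Exception as exc:
        return TaskOutcome("failed", f"upload transport error: {exc}")
    try:
        write_back(bucket, key)
    except Exception as exc:
        # The object may already exist remotely, so say so explicitly.
        return TaskOutcome(
            "failed",
            f"upload succeeded but DB writeback failed; "
            f"remote object may already exist at {bucket}/{key}: {exc}",
        )
    return TaskOutcome("succeeded")


def noop(*args):
    """Stub that accepts any arguments and does nothing."""


missing_outcome = run_upload_task(
    "no-such-file.mp3", "music", "music/no-such-file.mp3", noop, noop
)
```

Keeping the writeback failure distinct from a transport failure lets a later run decide whether to re-upload or only repair the database row.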

### Download Errors

- worker download failure
  - record failure for that item and continue with other tasks
- insufficient disk space
  - trigger one global directory-switch prompt
- no replacement directory supplied
  - fail the remaining batch clearly

## Testing

Add or update coverage for the following areas.

### Schema Tests

- `song_backend_presence` exists
- `upload_tasks` exists
- unique constraint on upload task dedupe works

### Repository Tests

- register or upsert object storage backends
- write remote `file_locations`
- refresh `song_backend_presence`
- enqueue deduplicated upload tasks
- select pending upload candidates by backend, source, playlist, and limit

### Uploader / Service Tests

Using a fake or stub S3-compatible client:

- successful upload creates active remote location
- public URL derivation when configured
- missing source file becomes `failed`
- missing credentials fail fast
- multiple local file versions for one song are all enqueued

### CLI Tests

- `register-object-backend`
- `upload --backend ...`
- `upload --sources ...`
- `upload --playlist-ids ...`
- `upload --limit ...`
- `upload --workers ...`
- `download --workers ...`

### Concurrency Tests

- concurrent upload workers do not claim the same task twice
- concurrent download workers trigger only one directory switch prompt
- after directory switch, later downloads use the new root

### Documentation Tests

- `docs/catalogsync.md` is updated to describe:
  - object storage backend registration
  - upload command usage
  - queue semantics
  - environment variable credential model
  - download/upload worker options

## Documentation Requirements

Implementation must update `docs/catalogsync.md` to include:

- why object storage uses backend config plus env-based secrets
- how to register an object storage backend
- how remote keys mirror local relative paths
- how `upload` works by default
- what `song_backend_presence` and `upload_tasks` are for
- how `--workers` affects download and upload
- how the global download directory switch behaves under low disk space

## Acceptance Criteria

- An operator can register an S3-compatible object storage backend without storing secrets in SQLite
- `upload` can enqueue and execute uploads for missing remote files on that backend
- Remote object keys mirror local relative paths beneath the configured backend prefix
- Successful uploads create active remote `file_locations`
- Local files remain active and primary after upload
- `song_backend_presence` shows whether a song has active files on a given backend
- `upload_tasks` supports resumable queue execution with bounded retries
- The first version uploads all active local file versions for a song, not just one version
- `upload` supports both full backend fill-in mode and filtered mode
- Download and upload both support limited operator-configurable concurrency
- Low disk space during download triggers one global prompt and one shared root switch for later tasks
- `docs/catalogsync.md` is updated together with the implementation