musicdl-catalog-sync-suite/catalog-sync/docs/superpowers/specs/2026-04-20-download-dual-pool-pipeline-design.md

# Catalogsync Download Dual-Pool Pipeline Design
## Goal
Improve real download concurrency without changing the sync stage or introducing a sync-time download URL cache.
The current bottleneck is not the byte-transfer implementation itself. The real bottleneck is that each download worker performs two very different jobs in sequence:
1. resolve a usable download source and URL
2. transfer audio bytes and record the finished file
In production, source resolution often takes tens of seconds while the final audio transfer may take around one second. As a result, `DOWNLOAD_WORKERS=10` behaves like ten mixed workers waiting on resolve work instead of ten true download workers.
This design splits the download stage into a two-pool in-memory pipeline:
- `resolver pool`
- `download pool`
The sync stage remains unchanged. Songs are still stored as deferred snapshots and download URLs are still resolved at download time.
## Confirmed Decisions
The following points were confirmed during design discussion:
- do not change the sync stage
- do not introduce a sync-time download URL cache as part of this iteration
- focus only on download-stage behavior
- the target outcome is that download workers spend their time on actual downloads instead of long source-resolution work
- UI clarity matters:
  - operators should be able to tell which workers are resolving and which workers are downloading
- existing database schema should be preserved if possible
## Scope
### In Scope
- split the download stage into resolver workers and downloader workers
- keep the existing job, stage, and item lifecycle model
- preserve existing deferred snapshot storage
- preserve current local file recording and quality detection behavior
- surface resolver activity and download activity clearly in worker state
- keep pause, cancel, and recovery semantics compatible with the current runner
### Out Of Scope
- changing playlist sync behavior
- persisting resolved download URLs across runs
- redesigning source ranking logic
- changing upload behavior
- changing the meaning of song uniqueness in the database
- introducing distributed workers or external queues
## Problem Statement
Current download-stage flow:
1. a runner worker claims a download item
2. the worker calls `CatalogDownloader.download_song_row(...)`
3. inside that flow, the same worker:
   - deserializes the deferred snapshot
   - resolves a usable source across multiple providers
   - downloads the final audio file
   - records the local file
This model creates two user-visible problems:
- most workers appear idle from a transfer perspective because they are blocked in source resolution
- byte-transfer concurrency is much lower than the configured worker count
Recent production measurements showed the pattern clearly:
- source resolution commonly takes about `77-83s`
- actual file download commonly takes about `1s`
So the current worker pool is structurally spending most of its time in the wrong phase.
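As a rough back-of-the-envelope check on these measurements, the share of time a mixed worker spends moving bytes can be computed directly. The sketch below assumes an 80 s resolve time (an assumed midpoint of the measured range) and a 1 s transfer; the constant and function names are illustrative and not part of the codebase.

```python
# Illustrative figures only: 80 s is an assumed midpoint of the measured
# 77-83 s resolve range, and 1 s is the typical transfer time quoted above.
RESOLVE_SECONDS = 80.0
TRANSFER_SECONDS = 1.0
DOWNLOAD_WORKERS = 10


def transfer_fraction(resolve_s: float, transfer_s: float) -> float:
    """Share of a mixed worker's per-item time that is spent moving bytes."""
    return transfer_s / (resolve_s + transfer_s)


fraction = transfer_fraction(RESOLVE_SECONDS, TRANSFER_SECONDS)
# Average number of workers transferring at any instant in the mixed model.
avg_transferring = DOWNLOAD_WORKERS * fraction

print(f"transfer share per worker: {fraction:.1%}")
print(f"average workers transferring: {avg_transferring:.2f}")
```

Under these assumed numbers each mixed worker transfers bytes for roughly 1.2% of its time, so on average well under one of the ten workers is transferring at any given moment.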
## Approaches Considered
### Approach A: Keep Single Pool And Only Improve UI
Show resolver activity more clearly, but keep each worker as `resolve + download`.
Pros:
- smallest code change
- no pipeline coordination logic
Cons:
- does not materially improve true download concurrency
- preserves the main performance bottleneck
Decision:
- rejected for this task because it improves observability but not throughput
### Approach B: Implement Dual Pools Inside `CatalogDownloader`
Move queueing and split-pool logic into `CatalogDownloader`.
Pros:
- conceptually local to download code
- useful for non-ops batch paths
Cons:
- mismatches the current ops runner lifecycle
- complicates job item ownership, pause, cancel, and worker naming
- less natural for the NAS task center, which already manages workers at runner level
Decision:
- not preferred for this iteration
### Approach C: Implement Dual Pools At Download Stage Runner Level
Create a download-stage pipeline in the ops runner:
- resolver workers claim items and produce ready-to-download tasks
- downloader workers consume ready tasks and perform final transfer
Pros:
- fits current job/stage/item orchestration naturally
- keeps worker ownership explicit
- lets dashboard show separate resolver and downloader workers
- delivers real transfer concurrency gains without changing sync behavior
Cons:
- more control-flow complexity in the runner
- requires careful queue shutdown and pause/cancel handling
Decision:
- recommended
## Recommended Design
## High-Level Architecture
During a `download` stage, the runner will create a bounded in-memory queue:
- `ready_queue`
The stage will use two thread pools:
- `resolver pool`
- `download pool`
### Resolver Pool Responsibilities
- claim pending download items
- check whether the song is already downloaded
- build the download row
- resolve a usable `SongInfo` with a valid download URL
- publish a `ResolvedDownloadTask` into `ready_queue`
- mark the item failed immediately if resolution cannot produce a usable download target
### Download Pool Responsibilities
- consume `ResolvedDownloadTask` instances from `ready_queue`
- execute actual file download only
- emit transfer progress
- record local file metadata
- mark the item succeeded or failed
This separates long-latency provider resolution from short, bandwidth-heavy transfer work.
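The division of responsibilities above can be sketched with the standard library alone. This is a minimal illustration, not the runner's implementation: `fake_resolve`, `fake_download`, `run_pipeline`, the in-memory `pending` queue standing in for item claiming, and the `results` dict standing in for item state are all hypothetical.

```python
import queue
import threading
from typing import Optional

SENTINEL = object()  # one per downloader worker signals "no more tasks"


def fake_resolve(item_id: int) -> Optional[str]:
    # Hypothetical stand-in for slow provider resolution; every 4th item fails.
    return None if item_id % 4 == 0 else f"https://example.invalid/{item_id}"


def fake_download(url: str) -> bool:
    # Hypothetical stand-in for the byte transfer; always succeeds here.
    return True


def run_pipeline(item_ids, resolvers=2, downloaders=3):
    pending = queue.Queue()  # stands in for claimable download items
    for item_id in item_ids:
        pending.put(item_id)
    ready_queue = queue.Queue(maxsize=downloaders * 2)
    results = {}  # item_id -> "succeeded" | "failed"
    lock = threading.Lock()

    def resolver_worker():
        while True:
            try:
                item_id = pending.get_nowait()  # "claim" an item
            except queue.Empty:
                return
            url = fake_resolve(item_id)
            if url is None:
                with lock:
                    results[item_id] = "failed"  # failed at once, never enqueued
            else:
                ready_queue.put((item_id, url))  # blocks while the queue is full

    def downloader_worker():
        while True:
            task = ready_queue.get()
            if task is SENTINEL:
                return
            item_id, url = task
            ok = fake_download(url)
            with lock:
                results[item_id] = "succeeded" if ok else "failed"

    r_threads = [threading.Thread(target=resolver_worker, name=f"resolve-{i + 1}")
                 for i in range(resolvers)]
    d_threads = [threading.Thread(target=downloader_worker, name=f"download-{i + 1}")
                 for i in range(downloaders)]
    for t in r_threads + d_threads:
        t.start()
    for t in r_threads:
        t.join()
    for _ in d_threads:
        ready_queue.put(SENTINEL)  # queued behind all real tasks (FIFO)
    for t in d_threads:
        t.join()
    return results


print(run_pipeline(range(1, 9)))
```

Because the sentinels are enqueued only after every resolver has exited, each downloader drains the remaining real tasks before it sees its sentinel.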
## New Internal Data Model
Introduce an internal in-memory task object for the stage, for example:
```python
from dataclasses import dataclass
from pathlib import Path
from typing import Any


@dataclass
class ResolvedDownloadTask:
    item_id: int
    row: dict[str, Any]
    resolved_song_info: Any
    display_text: str
    target_library_root: Path
```
This object is not persisted to the database in this iteration.
## Worker Model
The dashboard should show two worker families for a running download stage:
- `resolve-1`, `resolve-2`, ...
- `download-1`, `download-2`, ...
This is intentional. The operator should be able to distinguish:
- workers currently finding a usable source
- workers currently transferring bytes
`transfer_stats` should continue to count only workers with real transfer speed values.
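One way to keep resolver workers out of the aggregate is to filter on the speed field before summing. The sketch below is an assumption about shape only: `WorkerState`, `aggregate_transfer_stats`, and the returned keys are hypothetical, and treating a missing or zero speed as "no real transfer" is an interpretation of the rule above.

```python
from dataclasses import dataclass
from typing import Dict, Iterable, Optional


@dataclass
class WorkerState:
    # Hypothetical, simplified view of a worker row.
    name: str
    speed_bytes_per_sec: Optional[float] = None


def aggregate_transfer_stats(workers: Iterable[WorkerState]) -> Dict[str, float]:
    """Sum transfer speed over workers that report a positive speed."""
    # Assumption: None or 0 means the worker is not transferring bytes.
    active = [w for w in workers if w.speed_bytes_per_sec]
    return {
        "active_transfers": len(active),
        "total_speed_bytes_per_sec": sum(w.speed_bytes_per_sec for w in active),
    }


workers = [
    WorkerState("resolve-1"),              # resolving, no speed value
    WorkerState("download-1", 1000.0),
    WorkerState("download-2", 0.0),        # idle downloader
    WorkerState("download-3", 500.0),
]
stats = aggregate_transfer_stats(workers)
print(stats)
```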
## Download Stage Flow
### Step 1: Stage Startup
When the runner enters a `download` stage:
1. compute total worker budget from existing configuration
2. split it into resolver and downloader counts
3. create a bounded `ready_queue`
4. start resolver pool and downloader pool
### Step 2: Item Resolution
Each resolver worker loops until:
- no more claimable items remain
- pause or cancel is requested
- pipeline shutdown is triggered
For each claimed item:
1. load row data
2. skip immediately if already downloaded
3. emit resolver progress such as `resolving source qq (1/6)`
4. call a new downloader API that resolves but does not download
5. enqueue a `ResolvedDownloadTask` on success
6. mark failed on resolution failure
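The progress emission in step 3 might look like the following sketch, which walks an ordered provider list and reports `(index/total)` as it goes. `resolve_with_progress`, `try_provider`, and `emit` are hypothetical names; the real provider ranking and resolution calls are not shown in this document.

```python
from typing import Callable, List, Optional, Sequence


def resolve_with_progress(
    providers: Sequence[str],
    try_provider: Callable[[str], Optional[str]],
    emit: Callable[[str], None],
) -> Optional[str]:
    """Try each provider in order; return the first usable URL, else None."""
    total = len(providers)
    for index, provider in enumerate(providers, start=1):
        emit(f"resolving source {provider} ({index}/{total})")
        url = try_provider(provider)
        if url:
            emit(f"resolved via {provider}")
            return url
    return None  # caller marks the item failed


messages: List[str] = []
url = resolve_with_progress(
    ["qq", "kuwo"],
    # Hypothetical resolver: only the second provider yields a URL.
    lambda provider: "https://example.invalid/song" if provider == "kuwo" else None,
    messages.append,
)
print(messages)
```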
### Step 3: Pure Download Execution
Each downloader worker loops until:
- a shutdown sentinel is received
- pause or cancel is requested and the queue has drained according to the chosen shutdown policy
For each resolved task:
1. emit `starting download via <platform>`
2. transfer the file while monitoring file growth and emitting transfer stats
3. record the local file on success
4. mark the item succeeded or failed
## CatalogDownloader API Refactor
Keep the current public behavior but split the implementation into two explicit phases.
### New Methods
- `resolve_song_row(...) -> ResolvedDownloadPayload | None`
- `download_resolved_song(...) -> bool`
Where:
- `resolve_song_row(...)` handles snapshot deserialization, source resolution, target directory selection, and worker text for the resolver phase
- `download_resolved_song(...)` performs only final download, monitor setup, file recording, and quality detection
### Compatibility Method
Keep:
- `download_song_row(...)`
But turn it into a compatibility wrapper:
1. resolve
2. download
This preserves existing unit-test entry points and any non-ops call sites.
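A minimal sketch of the wrapper shape, under stated assumptions: the real method signatures are elided in this document, so the single `row` parameter, the dict payload, the `bool` return of `download_song_row`, and the class name `CatalogDownloaderSketch` are all illustrative.

```python
from typing import Any, Dict, Optional


class CatalogDownloaderSketch:
    """Hypothetical stand-in for CatalogDownloader; bodies are placeholders."""

    def resolve_song_row(self, row: Dict[str, Any]) -> Optional[Dict[str, Any]]:
        # Placeholder resolution: rows with a "source" key are treated as resolvable.
        if "source" not in row:
            return None
        return {"row": row, "url": f"https://example.invalid/{row['source']}"}

    def download_resolved_song(self, payload: Dict[str, Any]) -> bool:
        # Placeholder transfer: always succeeds in this sketch.
        return True

    def download_song_row(self, row: Dict[str, Any]) -> bool:
        # Compatibility wrapper: resolve first, then download.
        payload = self.resolve_song_row(row)
        if payload is None:
            return False  # resolution failed, nothing to download
        return self.download_resolved_song(payload)


downloader = CatalogDownloaderSketch()
print(downloader.download_song_row({"source": "qq"}))
print(downloader.download_song_row({}))
```

Existing callers keep a single entry point, while the runner can call the two phases separately from different pools.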
## Worker State Design
Resolver workers should update:
- `current_song_id`
- `current_display_text`
- `last_progress_text`
Example messages:
- `resolving source qq (1/6)`
- `resolving source kuwo (2/6)`
- `resolved via qq`
Downloader workers should update:
- `current_song_id`
- `current_display_text`
- `last_progress_text`
- `downloaded_bytes`
- `total_bytes`
- `speed_bytes_per_sec`
- `progress_percent`
Example messages:
- `starting download via qq`
- `12.00MB/48.00MB`
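The downloader progress text and `progress_percent` could be derived from the byte counters roughly as follows. The helper names are hypothetical, and using binary megabytes (1024 * 1024 bytes) is an assumption; the existing formatter may use a different unit.

```python
def format_transfer_progress(downloaded_bytes: int, total_bytes: int) -> str:
    """Render progress text in the style of the example message above."""
    mb = 1024 * 1024  # assumption: binary megabytes
    return f"{downloaded_bytes / mb:.2f}MB/{total_bytes / mb:.2f}MB"


def compute_progress_percent(downloaded_bytes: int, total_bytes: int) -> float:
    """Return a percentage in [0, 100]; 0.0 when the total is unknown."""
    if total_bytes <= 0:
        return 0.0
    return min(100.0, downloaded_bytes * 100.0 / total_bytes)


MB = 1024 * 1024
print(format_transfer_progress(12 * MB, 48 * MB))
print(compute_progress_percent(12 * MB, 48 * MB))
```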
## Concurrency Split
Do not require a schema change or mandatory new env vars for the first version.
Recommended default behavior:
- if total download worker budget is `1`, a `1 resolver, 0 downloader` split is invalid, so fall back to the single-thread compatibility path
- if total is `2`, use `1 resolver + 1 downloader`
- if total is `>= 3`, use approximately `30% resolver` and `70% downloader`, with the resolver count capped at `3`
Initial recommended rule:
```text
resolver_workers = max(1, min(3, total_workers // 3))
download_workers = max(1, total_workers - resolver_workers)
```
For `DOWNLOAD_WORKERS=10`, this gives:
- `3 resolver`
- `7 downloader`
This is a reasonable first cut and avoids over-investing worker budget in resolution.
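The split rule can be expressed as a small pure function, which also makes it easy to unit-test. The function name and the use of `None` to signal the single-thread compatibility path are assumptions made for this sketch.

```python
from typing import Optional, Tuple


def split_download_workers(total_workers: int) -> Optional[Tuple[int, int]]:
    """Return (resolver_workers, download_workers), or None for the single-thread path."""
    if total_workers < 1:
        raise ValueError("total_workers must be >= 1")
    if total_workers == 1:
        # A 1 + 0 split is invalid; the caller uses the compatibility path.
        return None
    resolver_workers = max(1, min(3, total_workers // 3))
    download_workers = max(1, total_workers - resolver_workers)
    return resolver_workers, download_workers


for total in (1, 2, 3, 6, 10, 30):
    print(total, split_download_workers(total))
```

Because the resolver count is capped at `3`, the resolver share falls below 30% for larger budgets; with a budget of 30 the rule yields 3 resolvers and 27 downloaders.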
## Queue Design
Use a bounded in-memory queue to avoid resolver workers running too far ahead.
Recommended initial capacity:
- `download_workers * 2`
Why bounded:
- prevents unbounded memory growth
- keeps resolution work closer to actual download demand
- simplifies pause and cancel behavior
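The backpressure effect of a bounded queue can be seen with `queue.Queue(maxsize=...)` from the standard library. This is a sketch with no consumer running, so the extra `put` times out; the worker count of 7 is taken from the `DOWNLOAD_WORKERS=10` example and the integer payloads are placeholders.

```python
import queue

download_workers = 7  # from the 3 + 7 split of DOWNLOAD_WORKERS=10
ready_queue = queue.Queue(maxsize=download_workers * 2)

# Fill the queue to capacity; no downloader is consuming in this demo.
for n in range(download_workers * 2):
    ready_queue.put_nowait(n)

try:
    # A resolver calling put() here would block until a downloader frees a slot.
    ready_queue.put(99, timeout=0.05)
    blocked = False
except queue.Full:
    blocked = True

print(ready_queue.qsize(), blocked)
```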
## Pause, Cancel, And Shutdown Behavior
### Pause
When pause is requested:
- resolver workers stop claiming new items
- downloader workers may finish in-flight downloads
- stage reconciliation remains based on existing item states
This matches current expectations better than attempting hard interruption of active downloads.
### Cancel
When cancel is requested:
- resolver workers stop claiming new items immediately
- downloader workers stop after their current task boundary where possible
- no new resolved tasks should be enqueued after cancellation is observed
### Queue Shutdown
After resolver workers finish, the runner should send explicit queue sentinels so downloader workers can exit cleanly once the queue drains.
## Failure Handling
### Resolution Failure
If resolution cannot produce a valid downloadable `SongInfo`:
- mark the item failed immediately
- do not enqueue it for download
### Download Failure
If pure download fails after resolution:
- mark the item failed
- preserve the existing error formatting model
### Resolver Success But Queue/Shutdown Race
If the pipeline is shutting down and a resolver has a resolved task ready:
- prefer not enqueuing new work after pause/cancel has been observed
- let the item remain in a recoverable state according to current reconciliation rules
The first implementation should prefer correctness over aggressive continuation.
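One possible way to honor this rule is to publish with a short timeout and re-check a stop flag between attempts, so a resolver never stays blocked on a full queue after pause or cancel. The helper name, the `threading.Event` stop flag, and the polling interval are assumptions for this sketch.

```python
import queue
import threading


def publish_unless_stopped(ready_queue, task, stop_event, poll_seconds=0.1) -> bool:
    """Enqueue task unless stop_event is set; return True if it was enqueued."""
    while not stop_event.is_set():
        try:
            ready_queue.put(task, timeout=poll_seconds)
            return True
        except queue.Full:
            continue  # re-check the stop flag, then retry
    return False  # caller leaves the item in a recoverable state


q = queue.Queue(maxsize=1)
stop = threading.Event()
print(publish_unless_stopped(q, "task-a", stop))  # queue has room

stop.set()
print(publish_unless_stopped(q, "task-b", stop))  # stop already observed
```

A `False` return means the resolved task was dropped without being enqueued, leaving reconciliation to decide what happens to the item.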
## Why This Improves Throughput
Under the current model, ten workers spend most of their time waiting on provider resolution.
Under the dual-pool model:
- a small resolver pool continues finding usable sources
- a larger downloader pool stays focused on byte transfer
This does not make provider resolution free, but it stops long resolution latency from occupying the same worker budget needed for real downloads.
The expected operator-visible result is:
- multiple downloader workers can show real transfer progress concurrently
- resolver workers remain visible as separate activity instead of appearing as fake download workers
## Testing Strategy
## Unit Tests
Extend downloader tests to cover:
- `resolve_song_row(...)` returning a resolved payload without downloading
- `download_resolved_song(...)` preserving existing progress and file-recording behavior
- compatibility wrapper `download_song_row(...)` still working
## Runner Tests
Add runner tests for:
- worker split calculation
- resolver workers feeding downloader workers through a queue
- successful completion of mixed resolved tasks
- pause and cancel behavior while the queue is non-empty
- clean worker shutdown after resolver completion
## Dashboard-Oriented Tests
Add ops tests to verify:
- resolver workers appear with resolver progress text
- downloader workers expose transfer metrics
- aggregate transfer stats ignore resolver-only workers
## Rollout Plan
1. refactor `CatalogDownloader` into resolve-only and download-only phases
2. add dual-pool execution path for the download stage in the runner
3. keep the old single-call wrapper for compatibility
4. update worker naming and dashboard expectations
5. run targeted NAS verification:
   - confirm simultaneous non-zero transfer speed on more than one downloader worker
   - confirm resolver workers remain visible separately
## Open Questions Resolved In This Design
- Should sync be changed to resolve URLs early?
  - No.
- Should this iteration add persistent URL caching?
  - No.
- Should resolver and downloader state share the same worker names?
  - No. Separate names are clearer and better match reality.
- Should the first version require schema changes?
  - No.
## Summary
The recommended change is to keep deferred snapshots exactly as they are and redesign only the download-stage execution model.
Instead of ten mixed workers doing `resolve + download`, the system should run a two-pool pipeline:
- a small resolver pool that turns deferred snapshots into ready download tasks
- a larger downloader pool that performs real file transfer
This is the smallest architecture change that directly targets the current bottleneck while preserving the existing sync model and database schema.