# Catalogsync Download Dual-Pool Pipeline Design

## Goal

Improve real download concurrency without changing the sync stage or introducing a sync-time download URL cache.

The current bottleneck is not the byte-transfer implementation itself. The real bottleneck is that each download worker performs two very different jobs in sequence:

1. resolve a usable download source and URL
2. transfer audio bytes and record the finished file

In production, source resolution often takes tens of seconds while the final audio transfer may take around one second. As a result, `DOWNLOAD_WORKERS=10` behaves like ten mixed workers waiting on resolve work instead of ten true download workers.

This design splits the download stage into a two-pool in-memory pipeline:

- `resolver pool`
- `download pool`

The sync stage remains unchanged. Songs are still stored as deferred snapshots and download URLs are still resolved at download time.

## Confirmed Decisions

The following points were confirmed during design discussion:

- do not change the sync stage
- do not introduce a sync-time download URL cache as part of this iteration
- focus only on download-stage behavior
- the target outcome is that download workers spend their time on actual downloads instead of long source-resolution work
- UI clarity matters:
  - operators should be able to tell which workers are resolving and which workers are downloading
- existing database schema should be preserved if possible

## Scope

### In Scope

- split the download stage into resolver workers and downloader workers
- keep the existing job, stage, and item lifecycle model
- preserve existing deferred snapshot storage
- preserve current local file recording and quality detection behavior
- surface resolver activity and download activity clearly in worker state
- keep pause, cancel, and recovery semantics compatible with the current runner

### Out Of Scope

- changing playlist sync behavior
- persisting resolved download URLs across runs
- redesigning source ranking logic
- changing upload behavior
- changing the meaning of song uniqueness in the database
- introducing distributed workers or external queues

## Problem Statement

Current download-stage flow:

1. a runner worker claims a download item
2. the worker calls `CatalogDownloader.download_song_row(...)`
3. inside that flow, the same worker:
   - deserializes the deferred snapshot
   - resolves a usable source across multiple providers
   - downloads the final audio file
   - records the local file

This model creates two user-visible problems:

- most workers appear idle from a transfer perspective because they are blocked in source resolution
- byte-transfer concurrency is much lower than the configured worker count

Recent production measurements showed the pattern clearly:

- source resolution commonly takes about `77-83s`
- actual file download commonly takes about `1s`

So the current worker pool is structurally spending most of its time in the wrong phase.

## Approaches Considered

### Approach A: Keep Single Pool And Only Improve UI

Show resolver activity more clearly, but keep each worker as `resolve + download`.

Pros:

- smallest code change
- no pipeline coordination logic

Cons:

- does not materially improve true download concurrency
- preserves the main performance bottleneck

Decision:

- rejected for this task because it improves observability but not throughput

### Approach B: Implement Dual Pools Inside `CatalogDownloader`

Move queueing and split-pool logic into `CatalogDownloader`.

Pros:

- conceptually local to download code
- useful for non-ops batch paths

Cons:

- mismatches the current ops runner lifecycle
- complicates job item ownership, pause, cancel, and worker naming
- less natural for the NAS task center, which already manages workers at runner level

Decision:

- not preferred for this iteration

### Approach C: Implement Dual Pools At Download Stage Runner Level

Create a download-stage pipeline in the ops runner:

- resolver workers claim items and produce ready-to-download tasks
- downloader workers consume ready tasks and perform final transfer

Pros:

- fits current job/stage/item orchestration naturally
- keeps worker ownership explicit
- lets dashboard show separate resolver and downloader workers
- delivers real transfer concurrency gains without changing sync behavior

Cons:

- more control-flow complexity in the runner
- requires careful queue shutdown and pause/cancel handling

Decision:

- recommended

## Recommended Design

## High-Level Architecture

During a `download` stage, the runner will create a bounded in-memory queue:

- `ready_queue`

The stage will use two thread pools:

- `resolver pool`
- `download pool`

### Resolver Pool Responsibilities

- claim pending download items
- check whether the song is already downloaded
- build the download row
- resolve a usable `SongInfo` with a valid download URL
- publish a `ResolvedDownloadTask` into `ready_queue`
- mark the item failed immediately if resolution cannot produce a usable download target

### Download Pool Responsibilities

- consume `ResolvedDownloadTask` instances from `ready_queue`
- execute actual file download only
- emit transfer progress
- record local file metadata
- mark the item succeeded or failed

This separates long-latency provider resolution from short, bandwidth-heavy transfer work.
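
## New Internal Data Model

Introduce an internal in-memory task object for the stage, for example:

```python
from dataclasses import dataclass
from pathlib import Path
from typing import Any


@dataclass
class ResolvedDownloadTask:
    item_id: int  # job item this task belongs to
    row: dict[str, Any]  # download row built during the resolver phase
    resolved_song_info: Any  # resolved `SongInfo` with a valid download URL
    display_text: str  # text shown for the worker handling this task
    target_library_root: Path  # library root selected for the finished file
```

This object is not persisted to the database in this iteration.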

## Worker Model

The dashboard should show two worker families for a running download stage:

- `resolve-1`, `resolve-2`, ...
- `download-1`, `download-2`, ...

This is intentional. The operator should be able to distinguish:

- workers currently finding a usable source
- workers currently transferring bytes

`transfer_stats` should continue to count only workers with real transfer speed values.
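
The counting rule above can be illustrated with a minimal sketch. It assumes worker state is available as plain dictionaries keyed by the field names used in this document; the `aggregate_transfer_stats` helper and its output keys are hypothetical and not part of the existing runner.

```python
from typing import Any, Iterable


def aggregate_transfer_stats(workers: Iterable[dict[str, Any]]) -> dict[str, Any]:
    """Sum transfer speed over workers that report a real speed value.

    Hypothetical helper: resolver workers never set `speed_bytes_per_sec`,
    so they drop out without any check on the worker name.
    """
    active = 0
    total_speed = 0.0
    for worker in workers:
        speed = worker.get("speed_bytes_per_sec")
        if speed is None or speed <= 0:
            continue  # resolving, idle, or not yet transferring
        active += 1
        total_speed += float(speed)
    return {"active_transfers": active, "total_speed_bytes_per_sec": total_speed}


# Illustrative worker snapshots (values are made up for the example).
workers = [
    {"name": "resolve-1", "last_progress_text": "resolving source qq (1/6)"},
    {"name": "download-1", "speed_bytes_per_sec": 2_000_000},
    {"name": "download-2", "speed_bytes_per_sec": 1_500_000},
    {"name": "download-3", "speed_bytes_per_sec": None},
]
stats = aggregate_transfer_stats(workers)
```

Filtering on the speed value rather than on the `resolve-` name prefix keeps the aggregate correct even for a downloader worker that is between tasks.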

## Download Stage Flow

### Step 1: Stage Startup

When the runner enters a `download` stage:

1. compute total worker budget from existing configuration
2. split it into resolver and downloader counts
3. create a bounded `ready_queue`
4. start resolver pool and downloader pool

### Step 2: Item Resolution

Each resolver worker loops until:

- no more claimable items remain
- pause or cancel is requested
- pipeline shutdown is triggered

For each claimed item:

1. load row data
2. skip immediately if already downloaded
3. emit resolver progress such as `resolving source qq (1/6)`
4. call a new downloader API that resolves but does not download
5. enqueue a `ResolvedDownloadTask` on success
6. mark failed on resolution failure

### Step 3: Pure Download Execution

Each downloader worker loops until:

- a shutdown sentinel is received
- pause or cancel is requested and the queue has drained according to the chosen shutdown policy

For each resolved task:

1. emit `starting download via <platform>`
2. monitor file growth and emit transfer stats
3. record the local file on success
4. mark the item succeeded or failed
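
The three steps above can be sketched with the standard library's `queue.Queue` and `threading` modules. This is a simplified illustration, not the runner's actual code: `run_download_stage`, `ReadyTask`, and the `resolve` / `download` callables are hypothetical stand-ins, item claiming is replaced by a plain in-memory queue, and pause/cancel handling is left out.

```python
import queue
import threading
from dataclasses import dataclass

_SENTINEL = None  # one sentinel per downloader worker means "no more tasks"


@dataclass
class ReadyTask:
    """Stand-in for ResolvedDownloadTask with only the fields this sketch needs."""
    item_id: int
    url: str


def run_download_stage(item_ids, resolve, download, resolvers=2, downloaders=3):
    pending = queue.Queue()
    for item_id in item_ids:
        pending.put(item_id)
    ready_queue = queue.Queue(maxsize=downloaders * 2)  # bounded hand-off queue
    results = {}
    lock = threading.Lock()

    def resolver_worker():
        while True:
            try:
                item_id = pending.get_nowait()  # stands in for claiming an item
            except queue.Empty:
                return
            url = resolve(item_id)
            if url is None:
                with lock:
                    results[item_id] = "failed"  # never reaches the download pool
                continue
            ready_queue.put(ReadyTask(item_id, url))  # blocks while the queue is full

    def downloader_worker():
        while True:
            task = ready_queue.get()
            if task is _SENTINEL:
                return
            ok = download(task)
            with lock:
                results[task.item_id] = "succeeded" if ok else "failed"

    resolver_threads = [
        threading.Thread(target=resolver_worker, name=f"resolve-{i + 1}")
        for i in range(resolvers)
    ]
    downloader_threads = [
        threading.Thread(target=downloader_worker, name=f"download-{i + 1}")
        for i in range(downloaders)
    ]
    for thread in resolver_threads + downloader_threads:
        thread.start()
    for thread in resolver_threads:
        thread.join()
    for _ in downloader_threads:
        ready_queue.put(_SENTINEL)  # sent only after every resolver has finished
    for thread in downloader_threads:
        thread.join()
    return results


# Item 4 fails to resolve and item 5 fails to download (placeholder URLs).
results = run_download_stage(
    range(1, 7),
    resolve=lambda item_id: None if item_id == 4 else f"https://example.invalid/{item_id}",
    download=lambda task: task.item_id != 5,
)
```

Sending the sentinels only after all resolver threads have joined guarantees that no resolved task can be queued behind a sentinel and silently dropped.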

## CatalogDownloader API Refactor

Keep the current public behavior but split the implementation into two explicit phases.

### New Methods

- `resolve_song_row(...) -> ResolvedDownloadPayload | None`
- `download_resolved_song(...) -> bool`

Where:

- `resolve_song_row(...)` handles snapshot deserialization, source resolution, target directory selection, and worker text for the resolver phase
- `download_resolved_song(...)` performs only final download, monitor setup, file recording, and quality detection

### Compatibility Method

Keep:

- `download_song_row(...)`

But turn it into a compatibility wrapper:

1. resolve
2. download

This preserves existing unit-test entry points and any non-ops call sites.
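
A rough sketch of the wrapper shape, under stated assumptions: the real method signatures are elided in this document, so the `row` parameter, the payload fields, and the `SketchDownloader` class itself are hypothetical placeholders rather than the actual `CatalogDownloader` API.

```python
from dataclasses import dataclass
from typing import Any, Optional


@dataclass
class ResolvedDownloadPayload:
    """Hypothetical payload shape; the real fields are not specified here."""
    row: dict[str, Any]
    url: str


class SketchDownloader:
    """Stand-in for CatalogDownloader; all signatures are assumptions."""

    def __init__(self) -> None:
        self.calls: list[str] = []  # records which phases ran, for illustration

    def resolve_song_row(self, row: dict[str, Any]) -> Optional[ResolvedDownloadPayload]:
        self.calls.append("resolve")
        url = row.get("url")  # placeholder for real source resolution
        return ResolvedDownloadPayload(row=row, url=url) if url else None

    def download_resolved_song(self, payload: ResolvedDownloadPayload) -> bool:
        self.calls.append("download")
        return True  # placeholder for transfer, file recording, quality detection

    def download_song_row(self, row: dict[str, Any]) -> bool:
        """Compatibility wrapper: resolve first, then download."""
        payload = self.resolve_song_row(row)
        if payload is None:
            return False  # resolution failed, so the download phase never runs
        return self.download_resolved_song(payload)


resolvable = SketchDownloader()
ok = resolvable.download_song_row({"url": "https://example.invalid/song"})

unresolvable = SketchDownloader()
failed = unresolvable.download_song_row({})
```

Because the wrapper only composes the two new methods, existing callers keep a single entry point while the runner is free to call each phase from a different pool.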

## Worker State Design

Resolver workers should update:

- `current_song_id`
- `current_display_text`
- `last_progress_text`

Example messages:

- `resolving source qq (1/6)`
- `resolving source kuwo (2/6)`
- `resolved via qq`

Downloader workers should update:

- `current_song_id`
- `current_display_text`
- `last_progress_text`
- `downloaded_bytes`
- `total_bytes`
- `speed_bytes_per_sec`
- `progress_percent`

Example messages:

- `starting download via qq`
- `12.00MB/48.00MB`
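
The example messages above could be produced by small formatting helpers such as the following. This is only a sketch: the function names are hypothetical, and the assumption that `MB` means 1024 × 1024 bytes is not confirmed by this document.

```python
BYTES_PER_MB = 1024 * 1024  # assumption: binary megabytes


def format_resolver_progress(source: str, index: int, total: int) -> str:
    """Build resolver text such as 'resolving source qq (1/6)'."""
    return f"resolving source {source} ({index}/{total})"


def format_transfer_progress(downloaded_bytes: int, total_bytes: int) -> tuple[str, float]:
    """Return the progress text and the percentage for a downloader worker."""
    text = f"{downloaded_bytes / BYTES_PER_MB:.2f}MB/{total_bytes / BYTES_PER_MB:.2f}MB"
    # Guard against an unknown or zero total size.
    percent = 100.0 * downloaded_bytes / total_bytes if total_bytes > 0 else 0.0
    return text, percent


resolver_text = format_resolver_progress("qq", 1, 6)
transfer_text, progress_percent = format_transfer_progress(
    12 * BYTES_PER_MB, 48 * BYTES_PER_MB
)
```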

## Concurrency Split

Do not require a schema change or mandatory new env vars for the first version.

Recommended default behavior:

- if the total download worker budget is `1`, a `1 resolver + 0 downloader` split is invalid, so fall back to the single-thread compatibility path
- if total is `2`, use `1 resolver + 1 downloader`
- if total is `>= 3`, use approximately `30% resolver` and `70% downloader`, with the resolver count capped at `3`

Initial recommended rule:

```text
resolver_workers = max(1, min(3, total_workers // 3))
download_workers = max(1, total_workers - resolver_workers)
```

For `DOWNLOAD_WORKERS=10`, this gives:

- `3 resolver`
- `7 downloader`

This is a reasonable first cut and avoids over-investing worker budget in resolution.
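
The rule above translates directly into a small function. The name `split_download_workers` and the choice to return `None` for the single-thread compatibility path are assumptions made for this sketch.

```python
from typing import Optional


def split_download_workers(total_workers: int) -> Optional[tuple[int, int]]:
    """Return (resolver_workers, download_workers), or None for the
    single-thread compatibility path when only one worker is budgeted."""
    if total_workers < 1:
        raise ValueError("total_workers must be at least 1")
    if total_workers == 1:
        return None  # a 1 + 0 split is invalid, so the caller falls back
    resolver_workers = max(1, min(3, total_workers // 3))
    download_workers = max(1, total_workers - resolver_workers)
    return resolver_workers, download_workers


splits = {total: split_download_workers(total) for total in (1, 2, 3, 6, 10, 20)}
```

Here `splits` maps `2` to `(1, 1)`, `3` to `(1, 2)`, `6` to `(2, 4)`, `10` to `(3, 7)`, and `20` to `(3, 17)`. Because of the `min(3, ...)` cap, the resolver share drops below 30% once the budget grows past roughly ten workers.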

## Queue Design

Use a bounded in-memory queue to avoid resolver workers running too far ahead.

Recommended initial capacity:

- `download_workers * 2`

Why bounded:

- prevents unbounded memory growth
- keeps resolution work closer to actual download demand
- simplifies pause and cancel behavior
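
The backpressure effect of a bounded queue can be seen with Python's standard `queue.Queue`. The snippet below is a small demonstration, assuming seven downloader workers; the integer payloads stand in for resolved tasks.

```python
import queue

download_workers = 7  # assumed value for the demonstration
ready_queue: queue.Queue = queue.Queue(maxsize=download_workers * 2)

# Fill the queue to capacity, as resolvers would if downloads stalled.
for item_id in range(download_workers * 2):
    ready_queue.put_nowait(item_id)

# A further put cannot complete until a downloader takes something out.
try:
    ready_queue.put(99, timeout=0.05)
    overflow_accepted = True
except queue.Full:
    overflow_accepted = False

# Once one task is consumed, the next put succeeds immediately.
first_task = ready_queue.get()
ready_queue.put_nowait(99)
```

A resolver that calls `put` without a timeout simply blocks at this point, which is what keeps resolution from running far ahead of download demand.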

## Pause, Cancel, And Shutdown Behavior

### Pause

When pause is requested:

- resolver workers stop claiming new items
- downloader workers may finish in-flight downloads
- stage reconciliation remains based on existing item states

This matches current expectations better than attempting hard interruption of active downloads.

### Cancel

When cancel is requested:

- resolver workers stop claiming new items immediately
- downloader workers stop after their current task boundary where possible
- no new resolved tasks should be enqueued after cancellation is observed

### Queue Shutdown

After resolver workers finish, the runner should send explicit queue sentinels so downloader workers can exit cleanly once the queue drains.

## Failure Handling

### Resolution Failure

If resolution cannot produce a valid downloadable `SongInfo`:

- mark the item failed immediately
- do not enqueue it for download

### Download Failure

If pure download fails after resolution:

- mark the item failed
- preserve the existing error formatting model

### Resolver Success But Queue/Shutdown Race

If the pipeline is shutting down and a resolver has a resolved task ready:

- prefer not enqueuing new work after pause/cancel has been observed
- let the item remain in a recoverable state according to current reconciliation rules

The first implementation should prefer correctness over aggressive continuation.
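
One way to honor the "do not enqueue after pause/cancel" preference is to poll a stop flag while waiting for queue space. The sketch below assumes the runner exposes pause/cancel as a `threading.Event`; the `enqueue_unless_stopped` helper is hypothetical.

```python
import queue
import threading


def enqueue_unless_stopped(
    ready_queue: queue.Queue,
    task: object,
    stop_event: threading.Event,
    poll_seconds: float = 0.1,
) -> bool:
    """Try to enqueue `task`, giving up once `stop_event` is set.

    Returns True if the task was enqueued. On False the caller leaves the
    item in whatever recoverable state reconciliation expects.
    """
    while not stop_event.is_set():
        try:
            ready_queue.put(task, timeout=poll_seconds)
            return True
        except queue.Full:
            continue  # re-check the stop flag, then wait for space again
    return False


ready_queue: queue.Queue = queue.Queue(maxsize=1)
stop_event = threading.Event()

accepted_before_stop = enqueue_unless_stopped(ready_queue, "task-a", stop_event)
stop_event.set()  # pause or cancel observed
accepted_after_stop = enqueue_unless_stopped(ready_queue, "task-b", stop_event)
```

A small window remains in which the flag is set just after the check and the task is still enqueued, so downloader workers would also need to consult the same flag before starting a task.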

## Why This Improves Throughput

Under the current model, ten workers spend most of their time waiting on provider resolution.

Under the dual-pool model:

- a small resolver pool continues finding usable sources
- a larger downloader pool stays focused on byte transfer

This does not make provider resolution free, but it stops long resolution latency from occupying the same worker budget needed for real downloads.

The expected operator-visible result is:

- multiple downloader workers can show real transfer progress concurrently
- resolver workers remain visible as separate activity instead of appearing as fake download workers

## Testing Strategy

## Unit Tests

Extend downloader tests to cover:

- `resolve_song_row(...)` returning a resolved payload without downloading
- `download_resolved_song(...)` preserving existing progress and file-recording behavior
- compatibility wrapper `download_song_row(...)` still working

## Runner Tests

Add runner tests for:

- worker split calculation
- resolver workers feeding downloader workers through a queue
- successful completion of mixed resolved tasks
- pause and cancel behavior while the queue is non-empty
- clean worker shutdown after resolver completion

## Dashboard-Oriented Tests

Add ops tests to verify:

- resolver workers appear with resolver progress text
- downloader workers expose transfer metrics
- aggregate transfer stats ignore resolver-only workers

## Rollout Plan

1. refactor `CatalogDownloader` into resolve-only and download-only phases
2. add dual-pool execution path for the download stage in the runner
3. keep the old single-call wrapper for compatibility
4. update worker naming and dashboard expectations
5. run targeted NAS verification:
   - confirm simultaneous non-zero transfer speed on more than one downloader worker
   - confirm resolver workers remain visible separately

## Open Questions Resolved In This Design

- Should sync be changed to resolve URLs early?
  - No.

- Should this iteration add persistent URL caching?
  - No.

- Should resolver and downloader state share the same worker names?
  - No. Separate names are clearer and better match reality.

- Should the first version require schema changes?
  - No.

## Summary

The recommended change is to keep deferred snapshots exactly as they are and redesign only the download-stage execution model.

Instead of ten mixed workers doing `resolve + download`, the system should run a two-pool pipeline:

- a small resolver pool that turns deferred snapshots into ready download tasks
- a larger downloader pool that performs real file transfer

This is the smallest architecture change that directly targets the current bottleneck while preserving the existing sync model and database schema.