musicdl-catalog-sync-suite/catalog-sync/docs/superpowers/specs/2026-04-20-download-dual-pool-pipeline-design.md

Catalogsync Download Dual-Pool Pipeline Design

Goal

Improve real download concurrency without changing the sync stage or introducing a sync-time download URL cache.

The current bottleneck is not the byte-transfer implementation itself. The real bottleneck is that each download worker performs two very different jobs in sequence:

  1. resolve a usable download source and URL
  2. transfer audio bytes and record the finished file

In production, source resolution often takes tens of seconds while the final audio transfer may take around one second. As a result, DOWNLOAD_WORKERS=10 behaves like ten mixed workers waiting on resolve work instead of ten true download workers.

This design splits the download stage into a two-pool in-memory pipeline:

  • resolver pool
  • download pool

The sync stage remains unchanged. Songs are still stored as deferred snapshots and download URLs are still resolved at download time.

Confirmed Decisions

The following points were confirmed during design discussion:

  • do not change the sync stage
  • do not introduce a sync-time download URL cache as part of this iteration
  • focus only on download-stage behavior
  • the target outcome is that download workers spend their time on actual downloads instead of long source-resolution work
  • UI clarity matters:
    • operators should be able to tell which workers are resolving and which workers are downloading
  • existing database schema should be preserved if possible

Scope

In Scope

  • split the download stage into resolver workers and downloader workers
  • keep the existing job, stage, and item lifecycle model
  • preserve existing deferred snapshot storage
  • preserve current local file recording and quality detection behavior
  • surface resolver activity and download activity clearly in worker state
  • keep pause, cancel, and recovery semantics compatible with the current runner

Out Of Scope

  • changing playlist sync behavior
  • persisting resolved download URLs across runs
  • redesigning source ranking logic
  • changing upload behavior
  • changing the meaning of song uniqueness in the database
  • introducing distributed workers or external queues

Problem Statement

Current download-stage flow:

  1. a runner worker claims a download item
  2. the worker calls CatalogDownloader.download_song_row(...)
  3. inside that flow, the same worker:
    • deserializes the deferred snapshot
    • resolves a usable source across multiple providers
    • downloads the final audio file
    • records the local file

This model creates two user-visible problems:

  • most workers appear idle from a transfer perspective because they are blocked in source resolution
  • byte-transfer concurrency is much lower than the configured worker count

Recent production measurements showed the pattern clearly:

  • source resolution commonly takes about 77-83s
  • actual file download commonly takes about 1s

So the current worker pool is structurally spending most of its time in the wrong phase.
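To make the imbalance concrete, the following back-of-the-envelope sketch estimates how many workers are transferring bytes at any instant under the single-pool model. The function name is hypothetical, and the 80s input is only an illustrative midpoint of the measured range, not a new measurement.

```python
def expected_active_transfers(workers: int, resolve_s: float, download_s: float) -> float:
    """Average number of workers in the transfer phase at any instant.

    Assumes every worker cycles resolve -> download back to back, so the
    share of time spent transferring is download_s / (resolve_s + download_s).
    """
    cycle_s = resolve_s + download_s
    return workers * (download_s / cycle_s)


# Illustrative inputs: 80s is roughly the midpoint of the 77-83s range above.
estimate = expected_active_transfers(workers=10, resolve_s=80.0, download_s=1.0)
print(f"{estimate:.2f} workers transferring on average")
```

Under these assumptions, ten configured workers yield about 0.12 concurrent transfers on average, which is why the dashboard rarely shows more than one active transfer.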

Approaches Considered

Approach A: Keep Single Pool And Only Improve UI

Show resolver activity more clearly, but keep each worker as resolve + download.

Pros:

  • smallest code change
  • no pipeline coordination logic

Cons:

  • does not materially improve true download concurrency
  • preserves the main performance bottleneck

Decision:

  • rejected for this task because it improves observability but not throughput

Approach B: Implement Dual Pools Inside CatalogDownloader

Move queueing and split-pool logic into CatalogDownloader.

Pros:

  • conceptually local to download code
  • useful for non-ops batch paths

Cons:

  • mismatches the current ops runner lifecycle
  • complicates job item ownership, pause, cancel, and worker naming
  • less natural for the NAS task center, which already manages workers at runner level

Decision:

  • not preferred for this iteration

Approach C: Implement Dual Pools At Download Stage Runner Level

Create a download-stage pipeline in the ops runner:

  • resolver workers claim items and produce ready-to-download tasks
  • downloader workers consume ready tasks and perform final transfer

Pros:

  • fits current job/stage/item orchestration naturally
  • keeps worker ownership explicit
  • lets dashboard show separate resolver and downloader workers
  • delivers real transfer concurrency gains without changing sync behavior

Cons:

  • more control-flow complexity in the runner
  • requires careful queue shutdown and pause/cancel handling

Decision:

  • recommended

High-Level Architecture

During a download stage, the runner will create a bounded in-memory queue:

  • ready_queue

The stage will use two thread pools:

  • resolver pool
  • download pool

Resolver Pool Responsibilities

  • claim pending download items
  • check whether the song is already downloaded
  • build the download row
  • resolve a usable SongInfo with a valid download URL
  • publish a ResolvedDownloadTask into ready_queue
  • mark the item failed immediately if resolution cannot produce a usable download target

Download Pool Responsibilities

  • consume ResolvedDownloadTask instances from ready_queue
  • execute actual file download only
  • emit transfer progress
  • record local file metadata
  • mark the item succeeded or failed

This separates long-latency provider resolution from short, bandwidth-heavy transfer work.

New Internal Data Model

Introduce an internal in-memory task object for the stage, for example:

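from dataclasses import dataclass
from pathlib import Path
from typing import Any


@dataclass
class ResolvedDownloadTask:
    item_id: int                # job item claimed by the resolver
    row: dict[str, Any]         # download row built for the song
    resolved_song_info: Any     # SongInfo with a valid download URL
    display_text: str           # text shown in worker state
    target_library_root: Path   # library root for the downloaded file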

This object is not persisted to the database in this iteration.

Worker Model

The dashboard should show two worker families for a running download stage:

  • resolve-1, resolve-2, ...
  • download-1, download-2, ...

This is intentional. The operator should be able to distinguish:

  • workers currently finding a usable source
  • workers currently transferring bytes

transfer_stats should continue to count only workers with real transfer speed values.
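A minimal sketch of that counting rule follows. The `WorkerState` class and the `aggregate_transfer_stats` function are hypothetical; only the `speed_bytes_per_sec` field name comes from this design, and the real `transfer_stats` shape may differ.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class WorkerState:
    # Hypothetical shape used for illustration only.
    name: str
    speed_bytes_per_sec: Optional[float] = None


def aggregate_transfer_stats(workers: list[WorkerState]) -> dict[str, float]:
    """Count and sum only workers that report a real, positive transfer speed."""
    active = [w for w in workers if w.speed_bytes_per_sec]
    return {
        "active_transfers": len(active),
        "total_bytes_per_sec": sum(w.speed_bytes_per_sec for w in active),
    }


workers = [
    WorkerState("resolve-1"),                # resolving: no speed value
    WorkerState("download-1", 2_000_000.0),
    WorkerState("download-2", 1_500_000.0),
    WorkerState("download-3", 0.0),          # idle downloader: ignored
]
stats = aggregate_transfer_stats(workers)
```

Here the resolver and the idle downloader are excluded, so the aggregate reflects two active transfers.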

Download Stage Flow

Step 1: Stage Startup

When the runner enters a download stage:

  1. compute total worker budget from existing configuration
  2. split it into resolver and downloader counts
  3. create a bounded ready_queue
  4. start resolver pool and downloader pool

Step 2: Item Resolution

Each resolver worker loops until:

  • no more claimable items remain
  • pause or cancel is requested
  • pipeline shutdown is triggered

For each claimed item:

  1. load row data
  2. skip immediately if already downloaded
  3. emit resolver progress such as "resolving source qq (1/6)"
  4. call a new downloader API that resolves but does not download
  5. enqueue a ResolvedDownloadTask on success
  6. mark failed on resolution failure

Step 3: Pure Download Execution

Each downloader worker loops until:

  • a shutdown sentinel is received
  • pause or cancel is requested and the queue has drained according to the chosen shutdown policy

For each resolved task:

  1. emit "starting download via <platform>"
  2. monitor file growth and emit transfer stats
  3. record the local file on success
  4. mark the item succeeded or failed
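The three steps above can be sketched as a toy pipeline. This is an illustration under stated assumptions, not the runner's implementation: `run_download_stage`, the `resolve` and `download` callables, and the in-memory `pending` queue (a stand-in for claiming items from the database) are all hypothetical, and pause/cancel handling is omitted for brevity.

```python
import queue
import threading

_SENTINEL = object()  # tells one downloader worker to exit


def run_download_stage(item_ids, resolve, download, resolver_workers=1, download_workers=2):
    """Run a toy two-pool stage and return {item_id: "succeeded" | "failed"}."""
    pending = queue.Queue()
    for item_id in item_ids:
        pending.put(item_id)
    ready_queue = queue.Queue(maxsize=download_workers * 2)
    results = {}
    lock = threading.Lock()

    def mark(item_id, status):
        with lock:
            results[item_id] = status

    def resolver_loop():
        while True:
            try:
                item_id = pending.get_nowait()  # stand-in for claiming an item
            except queue.Empty:
                return
            url = resolve(item_id)
            if url is None:
                mark(item_id, "failed")  # resolution failure: never enqueued
            else:
                ready_queue.put((item_id, url))  # blocks while the queue is full

    def downloader_loop():
        while True:
            task = ready_queue.get()
            if task is _SENTINEL:
                return
            item_id, url = task
            mark(item_id, "succeeded" if download(url) else "failed")

    resolvers = [threading.Thread(target=resolver_loop, name=f"resolve-{i + 1}")
                 for i in range(resolver_workers)]
    downloaders = [threading.Thread(target=downloader_loop, name=f"download-{i + 1}")
                   for i in range(download_workers)]
    for t in resolvers + downloaders:
        t.start()
    for t in resolvers:
        t.join()
    for _ in downloaders:
        ready_queue.put(_SENTINEL)  # one sentinel per downloader worker
    for t in downloaders:
        t.join()
    return results


# Toy callables: item 3 cannot be resolved, every transfer succeeds.
outcome = run_download_stage(
    item_ids=[1, 2, 3, 4],
    resolve=lambda item_id: None if item_id == 3 else f"https://example.invalid/{item_id}",
    download=lambda url: True,
)
```

The sentinels are enqueued only after every resolver has exited, so no real task can arrive behind a sentinel.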

CatalogDownloader API Refactor

Keep the current public behavior but split the implementation into two explicit phases.

New Methods

  • resolve_song_row(...) -> ResolvedDownloadPayload | None
  • download_resolved_song(...) -> bool

Where:

  • resolve_song_row(...) handles snapshot deserialization, source resolution, target directory selection, and worker text for the resolver phase
  • download_resolved_song(...) performs only final download, monitor setup, file recording, and quality detection

Compatibility Method

Keep:

  • download_song_row(...)

But turn it into a compatibility wrapper:

  1. resolve
  2. download

This preserves existing unit-test entry points and any non-ops call sites.
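A sketch of how the wrapper could compose the two phases is shown below. `CatalogDownloaderSketch` is an illustrative stand-in, not the real class. The fields of `ResolvedDownloadPayload`, the simplified method signatures, and the `bool` return of `download_song_row` are assumptions made for this example.

```python
from dataclasses import dataclass
from typing import Any, Optional


@dataclass
class ResolvedDownloadPayload:
    # Hypothetical fields; the design only names the type.
    row: dict[str, Any]
    download_url: str


class CatalogDownloaderSketch:
    """Illustrative stand-in, not the real CatalogDownloader."""

    def resolve_song_row(self, row: dict[str, Any]) -> Optional[ResolvedDownloadPayload]:
        # Phase 1: resolve only. A real implementation would query providers here.
        url = row.get("url")
        if not url:
            return None
        return ResolvedDownloadPayload(row=row, download_url=url)

    def download_resolved_song(self, payload: ResolvedDownloadPayload) -> bool:
        # Phase 2: transfer only. A real implementation would fetch and record the file.
        return payload.download_url.startswith("https://")

    def download_song_row(self, row: dict[str, Any]) -> bool:
        # Compatibility wrapper: resolve, then download, in a single call.
        payload = self.resolve_song_row(row)
        if payload is None:
            return False
        return self.download_resolved_song(payload)


downloader = CatalogDownloaderSketch()
ok = downloader.download_song_row({"id": 1, "url": "https://example.invalid/a.flac"})
unresolved = downloader.download_song_row({"id": 2})
```

Callers of `download_song_row(...)` see the same single-call behavior, while the runner can invoke the two phases from different pools.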

Worker State Design

Resolver workers should update:

  • current_song_id
  • current_display_text
  • last_progress_text

Example messages:

  • resolving source qq (1/6)
  • resolving source kuwo (2/6)
  • resolved via qq

Downloader workers should update:

  • current_song_id
  • current_display_text
  • last_progress_text
  • downloaded_bytes
  • total_bytes
  • speed_bytes_per_sec
  • progress_percent

Example messages:

  • starting download via qq
  • 12.00MB/48.00MB
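The example messages could be produced by small formatters like the ones below. Both function names are hypothetical, and the 1 MB = 1024 * 1024 bytes convention is an assumption; the existing formatter may use a different unit.

```python
def format_resolver_progress(source: str, index: int, total: int) -> str:
    """Build last_progress_text for a resolver worker."""
    return f"resolving source {source} ({index}/{total})"


def format_transfer_progress(downloaded_bytes: int, total_bytes: int) -> tuple[str, float]:
    """Return (last_progress_text, progress_percent) for a downloader worker."""
    mb = 1024 * 1024  # assumed unit size
    text = f"{downloaded_bytes / mb:.2f}MB/{total_bytes / mb:.2f}MB"
    # Guard against an unknown total size (reported as 0).
    percent = 100.0 * downloaded_bytes / total_bytes if total_bytes > 0 else 0.0
    return text, percent


resolver_text = format_resolver_progress("qq", 1, 6)
text, percent = format_transfer_progress(12 * 1024 * 1024, 48 * 1024 * 1024)
```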

Concurrency Split

Do not require a schema change or mandatory new env vars for the first version.

Recommended default behavior:

  • if the total download worker budget is 1, a split is impossible (1 resolver + 0 downloaders is invalid), so fall back to the single-thread compatibility path
  • if the total is 2, use 1 resolver + 1 downloader
  • if the total is >= 3, give roughly a third of the budget to resolvers, capped at 3, and the rest to downloaders

Initial recommended rule (for a total budget of 2 or more):

resolver_workers = max(1, min(3, total_workers // 3))
download_workers = max(1, total_workers - resolver_workers)

For DOWNLOAD_WORKERS=10, this gives:

  • 3 resolver
  • 7 downloader

This is a reasonable first cut and avoids over-investing worker budget in resolution.
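The rule can be wrapped in a small helper to check its behavior across budgets. The function name is hypothetical, and returning `(0, 0)` to signal the single-thread compatibility path is an encoding chosen for this sketch only.

```python
def split_download_workers(total_workers: int) -> tuple[int, int]:
    """Return (resolver_workers, download_workers) for a total budget.

    (0, 0) stands for the single-thread compatibility path.
    """
    if total_workers <= 1:
        return (0, 0)
    resolver_workers = max(1, min(3, total_workers // 3))
    download_workers = max(1, total_workers - resolver_workers)
    return (resolver_workers, download_workers)


for total in (1, 2, 3, 6, 10, 20):
    print(total, split_download_workers(total))
```

Because resolvers are capped at 3, their share shrinks as the budget grows: a budget of 20 gives 3 resolvers and 17 downloaders.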

Queue Design

Use a bounded in-memory queue to avoid resolver workers running too far ahead.

Recommended initial capacity:

  • download_workers * 2

Why bounded:

  • prevents unbounded memory growth
  • keeps resolution work closer to actual download demand
  • simplifies pause and cancel behavior
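The backpressure effect of the bound can be seen with the standard library `queue.Queue`. The sketch uses `put_nowait` only to make the limit visible in a single thread; a real resolver would more likely block or wait with a timeout.

```python
import queue

download_workers = 2
ready_queue = queue.Queue(maxsize=download_workers * 2)  # capacity 4

accepted = 0
for task_id in range(10):
    try:
        # Raises queue.Full once the bound is reached and nothing is consuming.
        ready_queue.put_nowait(task_id)
        accepted += 1
    except queue.Full:
        break
```

With two downloaders, at most four resolved tasks wait in memory; further resolution output is held back until a downloader takes one.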

Pause, Cancel, And Shutdown Behavior

Pause

When pause is requested:

  • resolver workers stop claiming new items
  • downloader workers may finish in-flight downloads
  • stage reconciliation remains based on existing item states

This matches current expectations better than attempting hard interruption of active downloads.

Cancel

When cancel is requested:

  • resolver workers stop claiming new items immediately
  • downloader workers stop at the end of their current task where possible
  • no new resolved tasks should be enqueued after cancellation is observed

Queue Shutdown

After resolver workers finish, the runner should send explicit queue sentinels, one per downloader worker, so that every downloader can exit cleanly once the queue drains.

Failure Handling

Resolution Failure

If resolution cannot produce a valid downloadable SongInfo:

  • mark the item failed immediately
  • do not enqueue it for download

Download Failure

If pure download fails after resolution:

  • mark the item failed
  • preserve the existing error formatting model

Resolver Success But Queue/Shutdown Race

If the pipeline is shutting down and a resolver has a resolved task ready:

  • prefer not enqueuing new work after pause/cancel has been observed
  • let the item remain in a recoverable state according to current reconciliation rules

The first implementation should prefer correctness over aggressive continuation.
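One conservative way to handle this race is to poll the stop flag while waiting for queue space. The helper below is a sketch under assumptions: `try_enqueue`, `stop_event`, and the polling interval are hypothetical names and values, and what "recoverable state" means is left to the existing reconciliation rules.

```python
import queue
import threading


def try_enqueue(ready_queue, task, stop_event, poll_s=0.05):
    """Enqueue task unless pause/cancel is observed first.

    Returns True if enqueued, False if the caller should leave the item
    for reconciliation instead.
    """
    while not stop_event.is_set():
        try:
            ready_queue.put(task, timeout=poll_s)
            return True
        except queue.Full:
            continue  # queue is full: re-check the stop flag, then retry
    return False


stop_event = threading.Event()
ready_queue = queue.Queue(maxsize=1)

first = try_enqueue(ready_queue, "task-1", stop_event)   # fits in the queue
stop_event.set()                                          # pause/cancel observed
second = try_enqueue(ready_queue, "task-2", stop_event)  # not enqueued
```

A stop request that arrives between the flag check and the `put` call can still let one task through, so downloader workers would need their own check if that matters.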

Why This Improves Throughput

Under the current model, ten workers spend most of their time waiting on provider resolution.

Under the dual-pool model:

  • a small resolver pool continues finding usable sources
  • a larger downloader pool stays focused on byte transfer

This does not make provider resolution free, but it stops long resolution latency from occupying the same worker budget needed for real downloads.

The expected operator-visible result is:

  • multiple downloader workers can show real transfer progress concurrently
  • resolver workers remain visible as separate activity instead of appearing as fake download workers

Testing Strategy

Unit Tests

Extend downloader tests to cover:

  • resolve_song_row(...) returning a resolved payload without downloading
  • download_resolved_song(...) preserving existing progress and file-recording behavior
  • compatibility wrapper download_song_row(...) still working

Runner Tests

Add runner tests for:

  • worker split calculation
  • resolver workers feeding downloader workers through a queue
  • successful completion of mixed resolved tasks
  • pause and cancel behavior while the queue is non-empty
  • clean worker shutdown after resolver completion

Dashboard-Oriented Tests

Add ops tests to verify:

  • resolver workers appear with resolver progress text
  • downloader workers expose transfer metrics
  • aggregate transfer stats ignore resolver-only workers

Rollout Plan

  1. refactor CatalogDownloader into resolve-only and download-only phases
  2. add dual-pool execution path for the download stage in the runner
  3. keep the old single-call wrapper for compatibility
  4. update worker naming and dashboard expectations
  5. run targeted NAS verification:
    • confirm simultaneous non-zero transfer speed on more than one downloader worker
    • confirm resolver workers remain visible separately

Open Questions Resolved In This Design

  • Should sync be changed to resolve URLs early?
    • No.
  • Should this iteration add persistent URL caching?
    • No.
  • Should resolver and downloader state share the same worker names?
    • No. Separate names are clearer and better match reality.
  • Should the first version require schema changes?
    • No.

Summary

The recommended change is to keep deferred snapshots exactly as they are and redesign only the download-stage execution model.

Instead of ten mixed workers doing resolve + download, the system should run a two-pool pipeline:

  • a small resolver pool that turns deferred snapshots into ready download tasks
  • a larger downloader pool that performs real file transfer

This is the smallest architecture change that directly targets the current bottleneck while preserving the existing sync model and database schema.