xiaoming/musicdl-catalog-sync-suite

xiaoming 069af30dba Initial import: Music_Server, MusicFree, catalog-sync

2026-05-23 16:51:14 +08:00

Catalogsync Download Dual-Pool Pipeline Design

Goal

Improve real download concurrency without changing the sync stage or introducing a sync-time download URL cache.

The current bottleneck is not the byte-transfer implementation itself. The real bottleneck is that each download worker performs two very different jobs in sequence:

resolve a usable download source and URL
transfer audio bytes and record the finished file

In production, source resolution often takes tens of seconds while the final audio transfer may take around one second. As a result, DOWNLOAD_WORKERS=10 behaves like ten mixed workers waiting on resolve work instead of ten true download workers.

This design splits the download stage into a two-pool in-memory pipeline:

resolver pool
download pool

The sync stage remains unchanged. Songs are still stored as deferred snapshots and download URLs are still resolved at download time.

Confirmed Decisions

The following points were confirmed during design discussion:

do not change the sync stage
do not introduce a sync-time download URL cache as part of this iteration
focus only on download-stage behavior
the target outcome is that download workers spend their time on actual downloads instead of long source-resolution work
UI clarity matters:
- operators should be able to tell which workers are resolving and which workers are downloading
existing database schema should be preserved if possible

Scope

In Scope

split the download stage into resolver workers and downloader workers
keep the existing job, stage, and item lifecycle model
preserve existing deferred snapshot storage
preserve current local file recording and quality detection behavior
surface resolver activity and download activity clearly in worker state
keep pause, cancel, and recovery semantics compatible with the current runner

Out Of Scope

changing playlist sync behavior
persisting resolved download URLs across runs
redesigning source ranking logic
changing upload behavior
changing the meaning of song uniqueness in the database
introducing distributed workers or external queues

Problem Statement

Current download-stage flow:

a runner worker claims a download item
the worker calls CatalogDownloader.download_song_row(...)
inside that flow, the same worker:
- deserializes the deferred snapshot
- resolves a usable source across multiple providers
- downloads the final audio file
- records the local file

This model creates two user-visible problems:

most workers appear idle from a transfer perspective because they are blocked in source resolution
byte-transfer concurrency is much lower than the configured worker count

Recent production measurements showed the pattern clearly:

source resolution commonly takes about 77-83s
actual file download commonly takes about 1s

So the current worker pool is structurally spending most of its time in the wrong phase.
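A quick back-of-the-envelope calculation makes the imbalance concrete. The sketch below assumes 80s of resolution and 1s of transfer per song (round values inside the measured ranges above); the function name and constants are illustrative and not part of the codebase.

```python
def transfer_share(resolve_seconds: float, transfer_seconds: float) -> float:
    """Fraction of a mixed worker's busy time spent transferring bytes."""
    return transfer_seconds / (resolve_seconds + transfer_seconds)


# Assumed representative timings, chosen from the measurements above.
RESOLVE_SECONDS = 80.0
TRANSFER_SECONDS = 1.0
WORKERS = 10

share = transfer_share(RESOLVE_SECONDS, TRANSFER_SECONDS)
# Average number of workers transferring at any moment in the single-pool model.
avg_transferring = WORKERS * share

print(f"transfer share per worker: {share:.1%}")
print(f"average concurrent transfers: {avg_transferring:.2f}")
```

Under these assumed timings each mixed worker spends roughly 1% of its time transferring, so on average fewer than one of the ten workers is moving bytes at any moment.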

Approaches Considered

Approach A: Keep Single Pool And Only Improve UI

Show resolver activity more clearly, but keep each worker as resolve + download.

Pros:

smallest code change
no pipeline coordination logic

Cons:

does not materially improve true download concurrency
preserves the main performance bottleneck

Decision:

rejected for this task because it improves observability but not throughput

Approach B: Implement Dual Pools Inside `CatalogDownloader`

Move queueing and split-pool logic into CatalogDownloader.

Pros:

conceptually local to download code
useful for non-ops batch paths

Cons:

does not match the current ops runner lifecycle
complicates job item ownership, pause, cancel, and worker naming
less natural for the NAS task center, which already manages workers at runner level

Decision:

not preferred for this iteration

Approach C: Implement Dual Pools At Download Stage Runner Level

Create a download-stage pipeline in the ops runner:

resolver workers claim items and produce ready-to-download tasks
downloader workers consume ready tasks and perform final transfer

Pros:

fits current job/stage/item orchestration naturally
keeps worker ownership explicit
lets dashboard show separate resolver and downloader workers
delivers real transfer concurrency gains without changing sync behavior

Cons:

more control-flow complexity in the runner
requires careful queue shutdown and pause/cancel handling

Decision:

recommended

Recommended Design

High-Level Architecture

During a download stage, the runner will create a bounded in-memory queue:

ready_queue

The stage will use two thread pools:

resolver pool
download pool

Resolver Pool Responsibilities

claim pending download items
check whether the song is already downloaded
build the download row
resolve a usable SongInfo with a valid download URL
publish a ResolvedDownloadTask into ready_queue
mark the item failed immediately if resolution cannot produce a usable download target

Download Pool Responsibilities

consume ResolvedDownloadTask instances from ready_queue
execute actual file download only
emit transfer progress
record local file metadata
mark the item succeeded or failed

This separates long-latency provider resolution from short, bandwidth-heavy transfer work.
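The division of responsibilities above can be sketched with the standard library alone. This is a minimal illustration, not the runner's actual code: `ReadyTask`, `resolve`, `run_stage`, and the in-memory `pending` queue standing in for item claiming are all hypothetical, and the "download" is a no-op that only records a result.

```python
import queue
import threading
from dataclasses import dataclass
from typing import Optional

_SENTINEL = object()  # tells one downloader worker to exit


@dataclass
class ReadyTask:
    item_id: int
    url: str  # stand-in for the resolved SongInfo


def resolve(item_id: int) -> Optional[str]:
    """Hypothetical resolver: odd ids resolve, even ids fail."""
    return f"https://example.invalid/{item_id}" if item_id % 2 else None


def run_stage(item_ids, resolver_count=2, downloader_count=3):
    pending = queue.Queue()
    for item_id in item_ids:
        pending.put(item_id)
    ready_queue = queue.Queue(maxsize=downloader_count * 2)  # bounded
    results = {}
    lock = threading.Lock()

    def resolver_worker():
        while True:
            try:
                item_id = pending.get_nowait()  # "claim" a pending item
            except queue.Empty:
                return
            url = resolve(item_id)
            if url is None:
                with lock:
                    results[item_id] = "failed"  # fail now, never enqueue
            else:
                ready_queue.put(ReadyTask(item_id, url))

    def downloader_worker():
        while True:
            task = ready_queue.get()
            if task is _SENTINEL:
                return
            with lock:
                results[task.item_id] = "succeeded"  # real transfer goes here

    resolvers = [threading.Thread(target=resolver_worker, name=f"resolve-{i + 1}")
                 for i in range(resolver_count)]
    downloaders = [threading.Thread(target=downloader_worker, name=f"download-{i + 1}")
                   for i in range(downloader_count)]
    for t in resolvers + downloaders:
        t.start()
    for t in resolvers:
        t.join()
    for _ in downloaders:
        ready_queue.put(_SENTINEL)  # one sentinel per downloader
    for t in downloaders:
        t.join()
    return results
```

Calling `run_stage(range(1, 7))` in this sketch marks the odd ids as succeeded and the even ids as failed, with failed items never reaching the downloader threads.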

New Internal Data Model

Introduce an internal in-memory task object for the stage, for example:

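from dataclasses import dataclass
from pathlib import Path
from typing import Any

@dataclass
class ResolvedDownloadTask:
    item_id: int                # job item claimed by the resolver worker
    row: dict[str, Any]         # download row built for the song
    resolved_song_info: Any     # SongInfo carrying a valid download URL
    display_text: str           # text shown in worker state
    target_library_root: Path   # library root selected for the file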

This object is not persisted to the database in this iteration.

Worker Model

The dashboard should show two worker families for a running download stage:

resolve-1, resolve-2, ...
download-1, download-2, ...

This is intentional. The operator should be able to distinguish:

workers currently finding a usable source
workers currently transferring bytes

transfer_stats should continue to count only workers with real transfer speed values.
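One way to keep the aggregate honest is to filter on the speed field before summing. The sketch below is illustrative only: the `WorkerState` class, the shape of the returned dictionary, and the choice to treat both a missing and a zero speed as "not transferring" are assumptions, not the existing `transfer_stats` implementation.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class WorkerState:
    name: str
    speed_bytes_per_sec: Optional[float] = None  # None while resolving


def aggregate_transfer_stats(workers):
    """Sum speed over workers that report a real (positive) transfer speed."""
    active = [w for w in workers if w.speed_bytes_per_sec]
    return {
        "active_transfers": len(active),
        "total_bytes_per_sec": sum(w.speed_bytes_per_sec for w in active),
    }


workers = [
    WorkerState("resolve-1"),                               # resolver, no speed
    WorkerState("download-1", speed_bytes_per_sec=1000.0),  # transferring
    WorkerState("download-2", speed_bytes_per_sec=0.0),     # idle downloader
]
stats = aggregate_transfer_stats(workers)
```

Here only `download-1` contributes to the aggregate, so resolver workers never inflate the transfer count.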

Download Stage Flow

Step 1: Stage Startup

When the runner enters a download stage:

compute total worker budget from existing configuration
split it into resolver and downloader counts
create a bounded ready_queue
start resolver pool and downloader pool

Step 2: Item Resolution

Each resolver worker loops until:

no more claimable items remain
pause or cancel is requested
pipeline shutdown is triggered

For each claimed item:

load row data
skip immediately if already downloaded
emit resolver progress such as resolving source qq (1/6)
call a new downloader API that resolves but does not download
enqueue a ResolvedDownloadTask on success
mark failed on resolution failure

Step 3: Pure Download Execution

Each downloader worker loops until:

a shutdown sentinel is received
pause or cancel is requested and the queue has drained according to the chosen shutdown policy

For each resolved task:

emit starting download via <platform>
monitor file growth and emit transfer stats
record the local file on success
mark the item succeeded or failed

CatalogDownloader API Refactor

Keep the current public behavior but split the implementation into two explicit phases.

New Methods

resolve_song_row(...) -> ResolvedDownloadPayload | None
download_resolved_song(...) -> bool

Where:

resolve_song_row(...) handles snapshot deserialization, source resolution, target directory selection, and worker text for the resolver phase
download_resolved_song(...) performs only final download, monitor setup, file recording, and quality detection

Compatibility Method

Keep:

download_song_row(...)

But turn it into a compatibility wrapper that runs the two phases in order:

resolve
download

This preserves existing unit-test entry points and any non-ops call sites.
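The wrapper can stay very thin. The following sketch assumes simplified single-argument signatures and a boolean return value for `download_song_row`; the real method signatures are elided in this document, so the factory function and its parameters are hypothetical.

```python
from typing import Any, Callable, Optional


def make_download_song_row(
    resolve_song_row: Callable[[dict], Optional[Any]],
    download_resolved_song: Callable[[Any], bool],
) -> Callable[[dict], bool]:
    """Build a single-call wrapper from the two phases (assumed signatures)."""

    def download_song_row(row: dict) -> bool:
        payload = resolve_song_row(row)
        if payload is None:
            return False  # resolution failed; nothing to download
        return download_resolved_song(payload)

    return download_song_row
```

Existing callers keep calling one function, while the runner is free to call the two phases separately from different pools.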

Worker State Design

Resolver workers should update:

current_song_id
current_display_text
last_progress_text

Example messages:

resolving source qq (1/6)
resolving source kuwo (2/6)
resolved via qq

Downloader workers should update:

current_song_id
current_display_text
last_progress_text
downloaded_bytes
total_bytes
speed_bytes_per_sec
progress_percent

Example messages:

starting download via qq
12.00MB/48.00MB

Concurrency Split

Do not require a schema change or mandatory new env vars for the first version.

Recommended default behavior:

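if the total download worker budget is 1, a split of 1 resolver + 0 downloaders is invalid, so fall back to the single-thread compatibility path
if total is 2, use 1 resolver + 1 downloader
if total is >= 3, give roughly one third of the budget to resolvers (capped at 3) and the rest to downloaders, which is approximately 30% resolver and 70% downloader at the default budget of 10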

Initial recommended rule:

resolver_workers = max(1, min(3, total_workers // 3))
download_workers = max(1, total_workers - resolver_workers)

For DOWNLOAD_WORKERS=10, this gives:

3 resolver
7 downloader

This is a reasonable first cut and avoids over-investing worker budget in resolution.
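The rule and its single-worker fallback can be expressed as one small function. The function name and the use of `None` to signal the compatibility path are illustrative choices, not existing code.

```python
from typing import Optional


def split_worker_budget(total_workers: int) -> Optional[tuple[int, int]]:
    """Return (resolver_workers, download_workers).

    Returns None when the budget is too small to split, meaning the caller
    should use the single-thread compatibility path instead.
    """
    if total_workers <= 1:
        return None
    resolver_workers = max(1, min(3, total_workers // 3))
    download_workers = max(1, total_workers - resolver_workers)
    return resolver_workers, download_workers


for total in (1, 2, 3, 6, 10, 20):
    print(total, split_worker_budget(total))
```

Because of the cap of 3, a budget of 20 yields 3 resolvers and 17 downloaders, so the resolver share shrinks as the budget grows.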

Queue Design

Use a bounded in-memory queue to avoid resolver workers running too far ahead.

Recommended initial capacity:

download_workers * 2

Why bounded:

prevents unbounded memory growth
keeps resolution work closer to actual download demand
simplifies pause and cancel behavior

Pause, Cancel, And Shutdown Behavior

Pause

When pause is requested:

resolver workers stop claiming new items
downloader workers may finish in-flight downloads
stage reconciliation remains based on existing item states

This matches current expectations better than attempting hard interruption of active downloads.

Cancel

When cancel is requested:

resolver workers stop claiming new items immediately
downloader workers stop after their current task boundary where possible
no new resolved tasks should be enqueued after cancellation is observed

Queue Shutdown

After resolver workers finish, the runner should send explicit queue sentinels so downloader workers can exit cleanly once the queue drains.

Failure Handling

Resolution Failure

If resolution cannot produce a valid downloadable SongInfo:

mark the item failed immediately
do not enqueue it for download

Download Failure

If pure download fails after resolution:

mark the item failed
preserve the existing error formatting model

Resolver Success But Queue/Shutdown Race

If the pipeline is shutting down and a resolver has a resolved task ready:

prefer not enqueuing new work after pause/cancel has been observed
let the item remain in a recoverable state according to current reconciliation rules

The first implementation should prefer correctness over aggressive continuation.
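With a bounded queue, a resolver can block on a full queue exactly when pause or cancel arrives, so the enqueue itself needs to observe the stop signal. A minimal sketch of that idea follows; the helper name, the `threading.Event` stop flag, and the polling interval are assumptions rather than the runner's actual mechanism.

```python
import queue
import threading


def enqueue_unless_stopped(ready_queue, task, stop_event, poll_seconds=0.1) -> bool:
    """Try to publish a resolved task; give up once pause/cancel is observed."""
    while not stop_event.is_set():
        try:
            ready_queue.put(task, timeout=poll_seconds)
            return True
        except queue.Full:
            continue  # queue at capacity; re-check the stop flag and retry
    return False  # not enqueued; the item is left for reconciliation


ready_queue = queue.Queue(maxsize=1)
stop_event = threading.Event()
first = enqueue_unless_stopped(ready_queue, "task-a", stop_event)
stop_event.set()  # simulate pause or cancel being requested
second = enqueue_unless_stopped(ready_queue, "task-b", stop_event)
```

In this sketch `first` is accepted and `second` is dropped, which leaves the second item in whatever recoverable state the existing reconciliation rules assign it.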

Why This Improves Throughput

Under the current model, ten workers spend most of their time waiting on provider resolution.

Under the dual-pool model:

a small resolver pool continues finding usable sources
a larger downloader pool stays focused on byte transfer

This does not make provider resolution free, but it stops long resolution latency from occupying the same worker budget needed for real downloads.

The expected operator-visible result is:

multiple downloader workers can show real transfer progress concurrently
resolver workers remain visible as separate activity instead of appearing as fake download workers

Testing Strategy

Unit Tests

Extend downloader tests to cover:

resolve_song_row(...) returning a resolved payload without downloading
download_resolved_song(...) preserving existing progress and file-recording behavior
compatibility wrapper download_song_row(...) still working

Runner Tests

Add runner tests for:

worker split calculation
resolver workers feeding downloader workers through a queue
successful completion of mixed resolved tasks
pause and cancel behavior while the queue is non-empty
clean worker shutdown after resolver completion

Dashboard-Oriented Tests

Add ops tests to verify:

resolver workers appear with resolver progress text
downloader workers expose transfer metrics
aggregate transfer stats ignore resolver-only workers

Rollout Plan

refactor CatalogDownloader into resolve-only and download-only phases
add dual-pool execution path for the download stage in the runner
keep the old single-call wrapper for compatibility
update worker naming and dashboard expectations
run targeted NAS verification:
- confirm simultaneous non-zero transfer speed on more than one downloader worker
- confirm resolver workers remain visible separately

Open Questions Resolved In This Design

Should sync be changed to resolve URLs early?
- No.
Should this iteration add persistent URL caching?
- No.
Should resolver and downloader state share the same worker names?
- No. Separate names are clearer and better match reality.
Should the first version require schema changes?
- No.

Summary

The recommended change is to keep deferred snapshots exactly as they are and redesign only the download-stage execution model.

Instead of ten mixed workers doing resolve + download, the system should run a two-pool pipeline:

a small resolver pool that turns deferred snapshots into ready download tasks
a larger downloader pool that performs real file transfer

This is the smallest architecture change that directly targets the current bottleneck while preserving the existing sync model and database schema.
