xiaoming/musicdl-catalog-sync-suite

xiaoming 069af30dba Initial import: Music_Server, MusicFree, catalog-sync

2026-05-23 16:51:14 +08:00

Catalogsync Resolver Source Ranking Design

Goal

Improve resolver throughput by introducing a persistent, isolated source-ranking store, so that what the resolver learns about fallback sources carries over across runs.

The resolver should keep treating the song's original platform as the preferred source, but fallback order should become adaptive instead of fixed. The adaptive order must:

learn over time across jobs and restarts
be grouped by original source instead of using one global ranking
stay isolated from the main catalog business tables
preserve the current "keep trying later sources if earlier ones fail" behavior

Confirmed Decisions

The following points were confirmed during design discussion:

the ranking model is grouped by original source
statistics must persist across tasks and service restarts
the statistics store must be isolated from the main business schema
the original source is still tried first
after the warmup threshold is reached, fallback should try the top two ranked sources first
if the top two fallback sources fail, resolver must continue trying the remaining sources

Scope

In Scope

add a dedicated resolver statistics SQLite side database
record persistent fallback attempt and success statistics by (origin_source, candidate_source)
use fallback statistics to reorder sources after a warmup threshold
keep the original source as the first attempt
preserve existing resolver matching and candidate selection logic within a single source
cover side-database initialization, repository methods, ranking logic, and resolver behavior with tests

Out Of Scope

changing sync-stage behavior
caching download URLs across runs
changing song uniqueness rules
replacing the current matching heuristics inside a source
adding a UI for resolver statistics in this iteration
distributed or external metrics storage

Problem Statement

The current resolver still spends too much time in fallback traversal.

Today, resolver behavior is:

derive the original platform as preferred_source
try the preferred source first
if the preferred source does not produce an immediate high-confidence result, continue through the configured fallback list
within fallback traversal, source order is static and does not learn from production outcomes

This causes two operational problems:

fallback time is longer than necessary because low-yield sources keep being retried early
the system does not accumulate knowledge from prior jobs, so every restart returns to the same static ordering

The result is that resolver throughput remains bursty even after the dual-pool pipeline work, because the ready queue is still fed by a fallback strategy that does not adapt.

Approaches Considered

Approach A: Global Ranking For All Sources

Keep one success-rate table for all candidate sources regardless of original platform.

Pros:

simplest data model
easiest ranking query

Cons:

mixes very different source relationships
large-volume platforms can dominate the ranking
does not reflect that qq -> kuwo and netease -> kuwo may behave differently

Decision:

rejected because the learning model should follow the original source

Approach B: In-Memory Per-Run Learning Only

Track statistics only for the current job and discard them at task end.

Pros:

no schema work
easy to experiment with

Cons:

restarts lose all learning
long warmup every time
directly conflicts with the requirement for cross-run reuse

Decision:

rejected

Approach C: Persistent Side Database Grouped By Original Source

Store statistics in a dedicated SQLite side database keyed by original source and fallback source.

Pros:

matches the confirmed grouping model
survives restarts and future jobs
keeps analytics-style tables isolated from the main business schema
easy to evolve independently from catalog tables

Cons:

requires one more database file and repository
adds coordination between resolver and statistics store

Decision:

recommended

Recommended Design

High-Level Architecture

Add a dedicated resolver statistics store, for example:

resolver_stats.db

This database is initialized separately from catalogsync.db and contains only resolver-learning tables.

The main download flow remains:

build target_song_info
determine preferred_source
try preferred source first
reorder fallback sources using persistent statistics when warmup criteria are met
try ranked top two fallback sources first
if still unresolved, continue with the remaining fallback sources in ranked order

The existing resolver still owns matching, candidate picking, and final candidate selection inside each source.

Statistics Model

The learning key is:

origin_source
candidate_source

Where:

origin_source is the normalized original platform for the song being resolved
candidate_source is a fallback source actually attempted after the original source path failed

Statistics are recorded only for fallback attempts. Preferred-source attempts are not stored in this side database for the first iteration because the ranking problem is specifically about fallback order.

Stored Counters

Each row should persist:

origin_source
candidate_source
attempt_count
resolve_success_count
last_attempt_at
last_success_at
created_at
updated_at

Warmup and Ranking Rules

Warmup Threshold

The warmup threshold is not a global song count. It is the total number of fallback attempts recorded for a specific origin_source.

Example:

qq fallback learning activates only after the sum of all qq -> * fallback attempts reaches 1000
netease fallback learning activates independently after the sum of all netease -> * fallback attempts reaches 1000

Ranking Formula

Use a smoothed success rate:

(resolve_success_count + 1) / (attempt_count + 2)

This avoids unstable rankings when sample counts are still low.
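As a minimal sketch, the formula above can be written as a small Python function. The function name is an assumption made for illustration; only the formula itself comes from this design. Adding 1 to the numerator and 2 to the denominator is add-one (Laplace) smoothing, so a source with no recorded attempts starts at a neutral 0.5 instead of 0 or an undefined value.

```python
def smoothed_success_rate(resolve_success_count: int, attempt_count: int) -> float:
    """Return (resolve_success_count + 1) / (attempt_count + 2)."""
    return (resolve_success_count + 1) / (attempt_count + 2)


# A source with no history gets a neutral score.
print(smoothed_success_rate(0, 0))  # 0.5

# One lucky success out of one attempt does not outrank a long, strong record.
print(smoothed_success_rate(1, 1) < smoothed_success_rate(900, 1000))  # True
```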

Ranked Traversal

For a song with original source origin_source:

always try preferred_source first
if preferred source does not resolve a high-confidence downloadable result, enter fallback
if origin_source warmup threshold is not met:
- keep the configured fallback order
if origin_source warmup threshold is met:
- sort fallback candidates by smoothed success rate, highest first
- preserve configured order as the tie-breaker
- try the top two ranked fallback sources first
- if both fail, continue with the remaining ranked fallback sources

This preserves completeness while improving average-case resolution speed.
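The traversal rules above could be sketched roughly as follows. The function name, the shape of the `stats` argument, and the default threshold of 1000 (taken from the warmup example) are assumptions for illustration, not part of the confirmed design.

```python
from typing import Dict, List, Tuple

WARMUP_THRESHOLD = 1000  # assumed default, borrowed from the warmup example


def rank_fallback_sources(
    configured: List[str],
    stats: Dict[str, Tuple[int, int]],  # candidate_source -> (attempt_count, resolve_success_count)
    warmup_threshold: int = WARMUP_THRESHOLD,
) -> List[str]:
    """Order the fallback sources for one origin_source."""
    # Warmup is judged on the total fallback attempts for this origin_source.
    total_attempts = sum(attempts for attempts, _ in stats.values())
    if total_attempts < warmup_threshold:
        return list(configured)  # not warmed up: keep the configured order

    def smoothed_rate(source: str) -> float:
        attempts, successes = stats.get(source, (0, 0))
        return (successes + 1) / (attempts + 2)

    # sorted() is stable, so sources with equal rates keep their configured order.
    return sorted(configured, key=lambda source: -smoothed_rate(source))


configured = ["kuwo", "netease", "migu", "kugou"]
qq_stats = {"kuwo": (400, 100), "netease": (300, 240), "migu": (300, 240)}
print(rank_fallback_sources(configured, qq_stats))
# ['netease', 'migu', 'kugou', 'kuwo']
```

Because the resolver walks the returned list from the front, the top two ranked sources are tried first and the remaining sources still follow, which matches the "continue if both fail" rule without a separate code path.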

Resolver Flow Changes

Resolver source ordering should become a two-phase plan:

Phase 1: Preferred Source

derive preferred_source from the snapshot or row platform
try preferred-source refresh
try preferred-source search
if a preferred-source high-confidence result is found, return immediately

Phase 2: Ranked Fallback

build fallback candidates from configured download_sources excluding preferred_source
ask the resolver stats repository for the ranked order for this origin_source
attempt fallback sources in that order
after each fallback attempt:
- record one attempt
- if that source resolves a usable candidate, record one success and stop
if a fallback source fails to produce a usable candidate, continue to the next source

The resolver should still stop at the first acceptable fallback success in this iteration rather than exhaustively scanning later sources for a possibly better file.
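A minimal sketch of the Phase 2 loop is shown below. `try_source` and `record_attempt` are hypothetical callables standing in for the existing per-source resolver logic and for the statistics repository; their names and signatures are assumptions for illustration.

```python
from typing import Callable, Iterable, Optional


def resolve_with_fallback(
    ranked_sources: Iterable[str],
    try_source: Callable[[str], Optional[str]],   # returns a usable candidate or None
    record_attempt: Callable[[str, bool], None],  # (candidate_source, success)
) -> Optional[str]:
    """Try fallback sources in order and stop at the first usable candidate."""
    for source in ranked_sources:
        candidate = try_source(source)
        success = candidate is not None
        record_attempt(source, success)  # one attempt, plus one success if it resolved
        if success:
            return candidate             # first acceptable success wins
    return None                          # every fallback source failed


# Usage with fake per-source outcomes: kuwo fails, migu resolves, kugou is never tried.
outcomes = {"kuwo": None, "migu": "migu:123", "kugou": "kugou:9"}
log = []
result = resolve_with_fallback(
    ["kuwo", "migu", "kugou"],
    outcomes.get,
    lambda source, ok: log.append((source, ok)),
)
print(result)  # migu:123
print(log)     # [('kuwo', False), ('migu', True)]
```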

Side Database Schema

The side database should stay minimal for the first version.

Table: `resolver_source_stats`

origin_source TEXT NOT NULL
candidate_source TEXT NOT NULL
attempt_count INTEGER NOT NULL DEFAULT 0
resolve_success_count INTEGER NOT NULL DEFAULT 0
last_attempt_at TEXT
last_success_at TEXT
created_at TEXT DEFAULT CURRENT_TIMESTAMP
updated_at TEXT DEFAULT CURRENT_TIMESTAMP
primary key: (origin_source, candidate_source)

Recommended indexes:

primary key already covers lookup by (origin_source, candidate_source)
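a separate index on (origin_source) is optional: origin_source is the leading column of the composite primary key, so SQLite can already use that key for ranking queries filtered by origin_source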
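The table could be created with DDL along these lines, shown here through Python's standard `sqlite3` module. This is a sketch: the function name and the index name are assumptions, and the column definitions simply mirror the list above.

```python
import sqlite3

SCHEMA = """
CREATE TABLE IF NOT EXISTS resolver_source_stats (
    origin_source         TEXT    NOT NULL,
    candidate_source      TEXT    NOT NULL,
    attempt_count         INTEGER NOT NULL DEFAULT 0,
    resolve_success_count INTEGER NOT NULL DEFAULT 0,
    last_attempt_at       TEXT,
    last_success_at       TEXT,
    created_at            TEXT DEFAULT CURRENT_TIMESTAMP,
    updated_at            TEXT DEFAULT CURRENT_TIMESTAMP,
    PRIMARY KEY (origin_source, candidate_source)
);
-- Hypothetical index name. SQLite can also serve origin_source lookups
-- from the primary key, whose leading column is origin_source.
CREATE INDEX IF NOT EXISTS idx_resolver_source_stats_origin
    ON resolver_source_stats (origin_source);
"""


def init_resolver_stats_db(path: str) -> sqlite3.Connection:
    """Open the side database and create the schema if it is missing."""
    conn = sqlite3.connect(path)
    conn.executescript(SCHEMA)  # idempotent thanks to IF NOT EXISTS
    return conn


conn = init_resolver_stats_db(":memory:")
conn.execute(
    "INSERT INTO resolver_source_stats (origin_source, candidate_source) VALUES (?, ?)",
    ("qq", "kuwo"),
)
print(conn.execute(
    "SELECT attempt_count, resolve_success_count FROM resolver_source_stats"
).fetchone())  # (0, 0)
```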

Repository Boundary

Introduce a dedicated repository for the side database, for example:

ResolverStatsRepository

Responsibilities:

initialize side-database schema
upsert attempt and success counters
report total fallback samples for an origin source
return ranked fallback candidates for an origin source given the configured fallback list

The main CatalogRepository should not absorb this responsibility. Keeping the side database behind a dedicated repository keeps the separation explicit and prevents statistics logic from leaking into core business persistence.
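One possible shape for such a repository is sketched below. The method names and signatures are assumptions for illustration, and ranking is left out for brevity. The counter update uses `INSERT OR IGNORE` followed by `UPDATE` so that it does not depend on a particular SQLite version's upsert support.

```python
import sqlite3
from typing import Dict, Tuple


class ResolverStatsRepository:
    """Sketch of the side-database repository; method names are hypothetical."""

    def __init__(self, path: str = ":memory:") -> None:
        self._conn = sqlite3.connect(path)
        self._conn.execute(
            """CREATE TABLE IF NOT EXISTS resolver_source_stats (
                   origin_source TEXT NOT NULL,
                   candidate_source TEXT NOT NULL,
                   attempt_count INTEGER NOT NULL DEFAULT 0,
                   resolve_success_count INTEGER NOT NULL DEFAULT 0,
                   last_attempt_at TEXT,
                   last_success_at TEXT,
                   created_at TEXT DEFAULT CURRENT_TIMESTAMP,
                   updated_at TEXT DEFAULT CURRENT_TIMESTAMP,
                   PRIMARY KEY (origin_source, candidate_source)
               )"""
        )

    def record_attempt(self, origin: str, candidate: str, success: bool) -> None:
        """Count one fallback attempt, and one success if it resolved."""
        with self._conn:  # commits on success, rolls back on error
            self._conn.execute(
                "INSERT OR IGNORE INTO resolver_source_stats "
                "(origin_source, candidate_source) VALUES (?, ?)",
                (origin, candidate),
            )
            self._conn.execute(
                """UPDATE resolver_source_stats
                   SET attempt_count = attempt_count + 1,
                       resolve_success_count = resolve_success_count + ?,
                       last_attempt_at = CURRENT_TIMESTAMP,
                       last_success_at = CASE WHEN ? THEN CURRENT_TIMESTAMP
                                              ELSE last_success_at END,
                       updated_at = CURRENT_TIMESTAMP
                   WHERE origin_source = ? AND candidate_source = ?""",
                (int(success), int(success), origin, candidate),
            )

    def total_fallback_samples(self, origin: str) -> int:
        """Sum of attempt_count over all candidates for one origin source."""
        row = self._conn.execute(
            "SELECT COALESCE(SUM(attempt_count), 0) FROM resolver_source_stats "
            "WHERE origin_source = ?",
            (origin,),
        ).fetchone()
        return row[0]

    def counters(self, origin: str) -> Dict[str, Tuple[int, int]]:
        """Map candidate_source -> (attempt_count, resolve_success_count)."""
        rows = self._conn.execute(
            "SELECT candidate_source, attempt_count, resolve_success_count "
            "FROM resolver_source_stats WHERE origin_source = ?",
            (origin,),
        )
        return {candidate: (attempts, successes) for candidate, attempts, successes in rows}


repo = ResolverStatsRepository()
repo.record_attempt("qq", "kuwo", success=True)
repo.record_attempt("qq", "kuwo", success=False)
repo.record_attempt("netease", "kuwo", success=False)
print(repo.total_fallback_samples("qq"))  # 2
print(repo.counters("qq"))                # {'kuwo': (2, 1)}
```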

Configuration and File Layout

Add a dedicated resolver statistics database path derived from the application root, for example:

<APP_HOME>/data/resolver_stats.db

This path should be configurable but should default automatically so current operators do not need new setup work.

The service should initialize both:

main catalog database
resolver statistics side database

Service startup should not fail if the side database is missing or empty; the file and its schema should be created on demand.
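A minimal sketch of the path defaulting and on-demand creation described here, assuming a hypothetical optional setting that overrides the default location. The function names are illustrative only.

```python
import sqlite3
from pathlib import Path
from typing import Optional


def resolve_stats_db_path(app_home: str, configured: Optional[str] = None) -> Path:
    """An explicit setting wins; otherwise use <APP_HOME>/data/resolver_stats.db."""
    if configured:
        return Path(configured)
    return Path(app_home) / "data" / "resolver_stats.db"


def open_stats_db(path: Path) -> sqlite3.Connection:
    """Open the side database, creating the directory and file if needed."""
    path.parent.mkdir(parents=True, exist_ok=True)
    return sqlite3.connect(str(path))  # sqlite3 creates the file if it does not exist
```

With this shape, existing operators get the default path without any new configuration, and a fresh install starts with an empty side database.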

Error Handling

Resolver statistics must not become a single point of failure.

If the side database update fails:

do not fail the actual download item
log the statistics error
continue resolver fallback using the best available in-memory ordering for that invocation

If the side database ranking query fails:

fall back to configured source order

This keeps the ranking system opportunistic rather than mission-critical.
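The degradation rules above might look like the following guard functions. This is a sketch: the helper names and the logger name are assumptions, and catching `sqlite3.Error` is one reasonable choice for "the side database failed", not a confirmed decision.

```python
import logging
import sqlite3
from typing import Callable, List

logger = logging.getLogger("resolver_stats")  # hypothetical logger name


def safe_ranked_order(
    configured: List[str],
    rank: Callable[[List[str]], List[str]],
) -> List[str]:
    """Return the ranked order, or the configured order if the query fails."""
    try:
        return rank(configured)
    except sqlite3.Error:
        logger.exception("resolver stats ranking query failed; using configured order")
        return list(configured)


def safe_record(record: Callable[[], None]) -> None:
    """Run a statistics update, logging instead of raising on failure."""
    try:
        record()
    except sqlite3.Error:
        logger.exception("resolver stats update failed; continuing without it")


def broken_rank(_configured: List[str]) -> List[str]:
    raise sqlite3.OperationalError("disk I/O error")


print(safe_ranked_order(["kuwo", "migu"], broken_rank))  # ['kuwo', 'migu']
```

Only statistics-related errors are swallowed here; errors from the resolver itself still propagate, so the download item's own failure handling is unchanged.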

Testing Strategy

Tests should cover:

side-database schema creation
isolated side-database repository queries and updates
warmup not reached:
- configured fallback order is preserved
warmup reached:
- fallback order is re-ranked by per-origin-source statistics
top-two-first behavior:
- top two ranked fallback sources are attempted before the rest
continuation behavior:
- if top two fail, later sources are still attempted
grouping behavior:
- qq ranking does not affect netease ranking
graceful degradation:
- side-database failure falls back to configured order instead of failing the item

Acceptance Criteria

resolver statistics are stored in a dedicated SQLite side database rather than the main business database
fallback statistics persist across jobs and service restarts
ranking is grouped by original source
before the warmup threshold, fallback order matches configured source order
after the warmup threshold, top two fallback candidates for an origin source are tried first according to smoothed success rate
if the top two fallback candidates fail, resolver still attempts the remaining fallback sources
statistics-store failures do not fail the download item outright
automated tests cover ranking, grouping, warmup, and fallback-to-configured-order behavior
