# Catalog Sync Design
## Goal
Build an independent catalog sync and download workflow that:
- extracts playlist-square and toplist sources from NetEase, QQ Music, and Kuwo
- stores `playlist pool -> playlist -> song` and derived `artist pool -> artist -> song`
- skips duplicate downloads by `(platform, remote_song_id)`
- prefers highest available quality and falls back when needed
- supports pausing on low disk space and continuing in a new local directory
- keeps storage metadata compatible with local paths, cloud-drive paths, and bucket/key style object storage
## Scope
### In Scope
- Independent Python CLI entrypoint
- SQLite schema for catalog, file, and task state
- Source collectors for:
  - NetEase playlist square + toplists
  - QQ playlist square + toplists
  - Kuwo playlist square + toplists
- Reuse existing platform `parseplaylist()` and download logic where practical
- Derived artist pool updates during playlist sync
- Lazy artist enrichment metadata and hooks
- Local download dedupe and disk-space prompts
- Storage schema compatible with future uploads
### Out of Scope
- Full cross-platform song canonicalization
- GUI integration
- Production-ready 123 cloud upload implementation
- Streaming upload while downloading
## Constraints
- Prefer reuse of existing source clients under `musicdl.modules.sources`
- Avoid new mandatory dependencies where stdlib is sufficient
- Keep first version recoverable and inspectable from local files and SQLite
- Preserve compatibility with the existing `musicdl` package and console script
## Architecture
The new workflow lives in a dedicated package under `musicdl.catalogsync`. Collectors fetch playlist candidates per source and pool kind, then a sync layer normalizes and persists them. Playlist parsing reuses the existing per-platform clients to resolve tracks into `SongInfo` objects, which are then stored into catalog tables and used to derive artist pool membership. A download planner reads undispatched songs from the database, skips anything already represented by an active local file asset, and otherwise delegates the actual media fetch to existing source download logic.
Storage metadata is modeled with a logical file layer plus a location layer. `file_assets` describes the downloaded media version for a song, while `file_locations` records where that file lives. The first implementation only writes local locations, but the schema supports cloud-drive or bucket/key locations later without changing the song-level model.
## Data Model
### Catalog
- `playlist_pools`
- `playlists`
- `pool_playlists`
- `artist_pools`
- `artists`
- `pool_artists`
- `songs`
- `playlist_songs`
- `artist_songs`
### File and Storage
- `storage_backends`
- `file_assets`
- `file_locations`
- `download_tasks`
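
To make the file and storage layering concrete, here is a minimal sketch of how a subset of these tables might be declared with the stdlib `sqlite3` module. Only the table names, the `(platform, remote_song_id)` key, and the `container_name` / `locator` pair come from this design; every other column name (`kind`, `root`, `quality`, `size_bytes`, `sha256`, `is_active`) is an assumption made for illustration.

```python
import sqlite3

# Hypothetical column layout: only the table names, the song dedupe key,
# and container_name/locator are taken from the design text.
SCHEMA = """
CREATE TABLE IF NOT EXISTS songs (
    id INTEGER PRIMARY KEY,
    platform TEXT NOT NULL,
    remote_song_id TEXT NOT NULL,
    title TEXT,
    UNIQUE (platform, remote_song_id)
);
CREATE TABLE IF NOT EXISTS storage_backends (
    id INTEGER PRIMARY KEY,
    kind TEXT NOT NULL,   -- assumed values: 'local', 'cloud_drive', 'object_store'
    root TEXT NOT NULL,
    UNIQUE (kind, root)
);
CREATE TABLE IF NOT EXISTS file_assets (
    id INTEGER PRIMARY KEY,
    song_id INTEGER NOT NULL REFERENCES songs(id),
    quality TEXT,
    ext TEXT,
    size_bytes INTEGER,
    sha256 TEXT
);
CREATE TABLE IF NOT EXISTS file_locations (
    id INTEGER PRIMARY KEY,
    asset_id INTEGER NOT NULL REFERENCES file_assets(id),
    backend_id INTEGER NOT NULL REFERENCES storage_backends(id),
    container_name TEXT NOT NULL,
    locator TEXT NOT NULL,
    is_active INTEGER NOT NULL DEFAULT 1
);
"""


def create_schema(conn: sqlite3.Connection) -> None:
    """Create the sketched tables if they do not exist yet."""
    conn.executescript(SCHEMA)


def list_tables(conn: sqlite3.Connection) -> list:
    """Return the names of all user tables, sorted alphabetically."""
    rows = conn.execute(
        "SELECT name FROM sqlite_master WHERE type = 'table' ORDER BY name"
    ).fetchall()
    return [row[0] for row in rows]
```

Keeping `file_locations` separate from `file_assets` is what lets a later upload add a row for the same asset instead of touching the song-level tables.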
## Key Behaviors
### Playlist Sync
1. Fetch playlist-square and toplist candidates for selected sources.
2. Upsert pool rows and playlist rows.
3. Link pools to playlists.
4. For selected playlists, call platform `parseplaylist()` to resolve songs.
5. Upsert song rows and `playlist_songs`.
6. Extract artists from raw platform metadata when possible, otherwise from normalized singer strings.
7. Upsert artists and attach them to derived artist pools and `artist_songs`.
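
Steps 5 to 7 above can be sketched roughly as follows. This is an illustration under stated assumptions, not the actual sync layer: the column names, the helper names (`open_demo_db`, `upsert_song`, `link_playlist_song`, `split_singers`), and the singer separators are all hypothetical, and the `ON CONFLICT ... DO UPDATE` form requires SQLite 3.24 or newer.

```python
import re
import sqlite3


def open_demo_db() -> sqlite3.Connection:
    """In-memory database with a simplified, hypothetical schema."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(
        """
        CREATE TABLE songs (
            id INTEGER PRIMARY KEY,
            platform TEXT NOT NULL,
            remote_song_id TEXT NOT NULL,
            title TEXT,
            singers TEXT,
            UNIQUE (platform, remote_song_id)
        );
        CREATE TABLE playlist_songs (
            playlist_id INTEGER NOT NULL,
            song_id INTEGER NOT NULL,
            position INTEGER,
            PRIMARY KEY (playlist_id, song_id)
        );
        """
    )
    return conn


def upsert_song(conn, platform, remote_song_id, title, singers) -> int:
    """Insert or refresh a song keyed by (platform, remote_song_id); return its row id."""
    conn.execute(
        """
        INSERT INTO songs (platform, remote_song_id, title, singers)
        VALUES (?, ?, ?, ?)
        ON CONFLICT (platform, remote_song_id)
        DO UPDATE SET title = excluded.title, singers = excluded.singers
        """,
        (platform, remote_song_id, title, singers),
    )
    row = conn.execute(
        "SELECT id FROM songs WHERE platform = ? AND remote_song_id = ?",
        (platform, remote_song_id),
    ).fetchone()
    return row[0]


def link_playlist_song(conn, playlist_id, song_id, position) -> None:
    """Attach a song to a playlist; repeated calls for the same pair are ignored."""
    conn.execute(
        "INSERT OR IGNORE INTO playlist_songs (playlist_id, song_id, position) "
        "VALUES (?, ?, ?)",
        (playlist_id, song_id, position),
    )


def split_singers(singers: str) -> list:
    """Fallback artist extraction from a normalized singer string.

    The separator set (comma, slash, ideographic comma) is an assumption.
    """
    parts = re.split(r"\s*[,/、]\s*", singers.strip())
    return [part for part in parts if part]
```

Re-running a sync with these helpers updates existing rows in place, so repeated runs should not create duplicate songs or duplicate playlist links.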
### Download Dedupe
- A song is considered already owned when it has an active local `file_locations` row.
- Dedupe key at song level is `(platform, remote_song_id)`.
- The first implementation keeps one preferred file asset per song. Future uploads add locations, not duplicate song rows.
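
The ownership rule above could be expressed as a single query. The sketch below assumes a simplified schema in which `file_locations.is_active` flags active rows and `storage_backends.kind = 'local'` marks local backends; both columns, and the helper names, are hypothetical.

```python
import sqlite3


def open_demo_db() -> sqlite3.Connection:
    """In-memory database with a simplified, hypothetical schema."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(
        """
        CREATE TABLE songs (
            id INTEGER PRIMARY KEY,
            platform TEXT NOT NULL,
            remote_song_id TEXT NOT NULL,
            UNIQUE (platform, remote_song_id)
        );
        CREATE TABLE storage_backends (id INTEGER PRIMARY KEY, kind TEXT NOT NULL);
        CREATE TABLE file_assets (id INTEGER PRIMARY KEY, song_id INTEGER NOT NULL);
        CREATE TABLE file_locations (
            id INTEGER PRIMARY KEY,
            asset_id INTEGER NOT NULL,
            backend_id INTEGER NOT NULL,
            is_active INTEGER NOT NULL DEFAULT 1
        );
        """
    )
    return conn


def is_song_owned(conn, platform: str, remote_song_id: str) -> bool:
    """True when the song has at least one active location on a local backend."""
    row = conn.execute(
        """
        SELECT 1
        FROM songs AS s
        JOIN file_assets AS a ON a.song_id = s.id
        JOIN file_locations AS l ON l.asset_id = a.id
        JOIN storage_backends AS b ON b.id = l.backend_id
        WHERE s.platform = ?
          AND s.remote_song_id = ?
          AND l.is_active = 1
          AND b.kind = 'local'
        LIMIT 1
        """,
        (platform, remote_song_id),
    ).fetchone()
    return row is not None
```

A download planner could call `is_song_owned()` before dispatching each song and skip the fetch when it returns `True`.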
### Quality Selection
- Existing platform clients already attempt higher qualities first.
- The workflow treats the returned file as the chosen asset and persists:
  - quality label
  - extension
  - file size
  - hash when available or computable
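
As an illustration of the persisted fields, the sketch below derives extension, size, and a hash from a downloaded file. The `AssetMetadata` dataclass, the function name, and the choice of SHA-256 are assumptions; the design only says a hash is stored "when available or computable". The quality label is passed in by the caller because it comes from the platform client, not from the file itself.

```python
import hashlib
from dataclasses import dataclass
from pathlib import Path


@dataclass(frozen=True)
class AssetMetadata:
    """Hypothetical record of what would be written to `file_assets`."""
    quality: str
    ext: str
    size_bytes: int
    sha256: str


def describe_downloaded_file(path, quality_label: str, chunk_size: int = 1 << 20) -> AssetMetadata:
    """Collect extension, size, and a SHA-256 digest for a downloaded file."""
    path = Path(path)
    digest = hashlib.sha256()
    # Hash in chunks so large lossless files are not loaded into memory at once.
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return AssetMetadata(
        quality=quality_label,
        ext=path.suffix.lstrip(".").lower(),
        size_bytes=path.stat().st_size,
        sha256=digest.hexdigest(),
    )
```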
### Low Disk Space
- Before each download, check free space for the active local backend.
- If free space is insufficient, pause and prompt the user for a new local directory.
- Upsert a new local backend row and continue subsequent downloads there.
- Already downloaded files remain linked to their original backend.
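
The check-and-prompt loop might look like the sketch below, which relies on the stdlib `shutil.disk_usage`. The function names, the 512 MiB reserve, and the injectable `prompt` and `usage_fn` parameters are assumptions added to keep the example testable; persisting the new backend row is left out.

```python
import shutil
from pathlib import Path

DEFAULT_RESERVE_BYTES = 512 * 1024 * 1024  # assumed safety margin, not from the design


def has_enough_space(directory, required_bytes: int,
                     reserve_bytes: int = DEFAULT_RESERVE_BYTES,
                     usage_fn=shutil.disk_usage) -> bool:
    """True when the filesystem holding `directory` can fit the file plus a reserve."""
    free = usage_fn(directory).free
    return free >= required_bytes + reserve_bytes


def choose_download_dir(current_dir, required_bytes: int,
                        reserve_bytes: int = DEFAULT_RESERVE_BYTES,
                        prompt=input, usage_fn=shutil.disk_usage) -> Path:
    """Return a directory with enough space, prompting for a new one if needed."""
    directory = Path(current_dir)
    while not has_enough_space(directory, required_bytes, reserve_bytes, usage_fn):
        answer = prompt(
            f"Low disk space in {directory}. Enter a new download directory: "
        ).strip()
        if not answer:
            continue  # empty input: ask again
        directory = Path(answer).expanduser()
        directory.mkdir(parents=True, exist_ok=True)
    return directory
```

The caller would compare the returned path with the active backend and, if it changed, upsert a new local backend row before continuing.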
### Future Upload Compatibility
- `storage_backends` represents local FS, cloud-drive roots, or object-storage containers.
- `file_locations.container_name + locator` can represent:
  - local root + relative path
  - cloud root + remote path
  - bucket + key
- Future upload jobs can attach new non-local locations to an existing `file_asset`.
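
The following sketch shows how one `container_name` + `locator` pair could cover all three cases. The `backend_kind` values, the `FileLocation` dataclass, and the display format are hypothetical; a real implementation would likely resolve locations through backend-specific clients.

```python
from dataclasses import dataclass
from pathlib import PurePosixPath


@dataclass(frozen=True)
class FileLocation:
    """Hypothetical in-memory view of a `file_locations` row."""
    backend_kind: str      # assumed values: 'local', 'cloud_drive', 'object_store'
    container_name: str    # local root, cloud root, or bucket
    locator: str           # relative path, remote path, or object key


def describe_location(loc: FileLocation) -> str:
    """Render a location as a human-readable string for logs or inspection."""
    if loc.backend_kind == "local":
        # local root + relative path (POSIX-style joining for this sketch)
        return str(PurePosixPath(loc.container_name) / loc.locator)
    if loc.backend_kind == "cloud_drive":
        # cloud root + remote path
        return f"{loc.container_name.rstrip('/')}/{loc.locator.lstrip('/')}"
    if loc.backend_kind == "object_store":
        # bucket + key
        return f"{loc.container_name}/{loc.locator}"
    raise ValueError(f"unknown backend kind: {loc.backend_kind!r}")
```

Because all three kinds share the same two columns, attaching an uploaded copy should only require inserting another location row for the existing asset.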
## Acceptance Criteria
- Selected source collectors can persist playlist-square and toplist rows into SQLite.
- Playlist sync can populate songs and derived artists for each of the supported sources.
- Download command skips songs already backed by active local file locations.
- Low-space prompt can switch to a new local directory and continue.
- Tests cover schema creation, normalization, derived artist sync, dedupe checks, and collector parsing helpers.