Metadata-Version: 2.4
Name: xc-dl
Version: 0.1.2
Summary: Xeno-Canto bioacoustic corpus downloader for deep learning research
License-File: LICENSE
Requires-Python: >=3.9
Requires-Dist: aiofiles>=24.1
Requires-Dist: aiohttp>=3.9
Requires-Dist: filelock>=3.13
Requires-Dist: prompt-toolkit>=3.0
Requires-Dist: pydantic>=2.0
Requires-Dist: pyyaml>=6.0
Requires-Dist: rich>=13.0
Requires-Dist: structlog>=24.1
Requires-Dist: typer>=0.9
Provides-Extra: audio
Requires-Dist: soundfile>=0.12; extra == 'audio'
Description-Content-Type: text/markdown

![xc-dl banner](assets/banner.png)

# xc-dl

A high-performance, async command-line tool for downloading and curating bioacoustic datasets from [Xeno-canto](https://xeno-canto.org/) for deep learning research.

**xc-dl** handles metadata fetching, concurrent audio downloads, format conversion (via ffmpeg), integrity verification, and HPC-scale distributed workflows -- all with resume support and progress tracking.

## Installation

```bash
pip install xc-dl
```

Or with [uv](https://docs.astral.sh/uv/):

```bash
uv add xc-dl
```

### Requirements

- Python 3.9+
- A [Xeno-canto API key](https://xeno-canto.org/account) (free)
- [ffmpeg](https://ffmpeg.org/) (optional; required only for audio conversion)

## Quick Start

```bash
# Set your API key
export XC_API_KEY="your-key-here"

# Preview a query (shows species, quality, duration stats)
xc-dl search 'grp:birds cnt:"South Africa" q:">C"' --full

# Download all matching recordings
xc-dl download 'grp:birds cnt:"South Africa" q:">C"'

# Download and convert to 16kHz mono WAV
xc-dl download 'grp:birds cnt:"South Africa" q:">C"' --convert wav

# Download a 10% random sample (deterministic)
xc-dl download 'grp:birds cnt:"South Africa"' --portion 10

# Cap total download size
xc-dl download 'grp:birds' --max-storage 500GB

# Verify dataset integrity
xc-dl check --all --report status.json

# Re-download corrupted or missing files
xc-dl check --all --fix
```

## Configuration

xc-dl reads configuration from a YAML file (default: `./xc-dl.yaml`), environment variables, and CLI flags. CLI flags take highest precedence, and environment variables override the config file.

### Config File

```yaml
general:
  api_key: "your-key"
  data_dir: "./xc-dataset"
  concurrency: 8
  metadata_concurrency: 4
  rate_limit: 5.0
  log_level: "INFO"
  log_file: "./download.log"

download:
  convert: "wav"
  convert_sample_rate: 16000
  convert_channels: 1
  convert_bit_depth: 16
  max_retries: 3

queries:
  south_africa_birds:
    query: 'grp:birds cnt:"South Africa" q:">C"'
    description: "South African birds, quality > C"
```

### Environment Variables

| Variable | Description |
|----------|-------------|
| `XC_API_KEY` | Xeno-canto API key (overrides config file) |

### CLI Options Reference

| Option | Default | Description |
|--------|---------|-------------|
| `--config, -c` | `./xc-dl.yaml` | Path to config file |
| `--data-dir, -d` | `./xc-dataset` | Root output directory |
| `--api-key` | | XC API key |
| `--log-level` | `INFO` | Logging verbosity (DEBUG/INFO/WARNING/ERROR) |
| `--log-file` | | JSON-lines log file path |
| `--concurrency, -j` | `8` | Max download parallelism |
| `--metadata-concurrency` | `4` | Max API fetch parallelism |
| `--rate-limit` | `5.0` | Max API requests/sec |
| `--verbose, -v` | off | Show log output on console |
| `--dry-run` | off | Show plan without executing |
| `--version, -V` | | Show version |

## Subcommands

### `search`

Preview a query without downloading. Shows species counts, quality distribution, recording types, estimated duration, and disk usage.

```bash
xc-dl search 'gen:Tyto sp:alba' --full
xc-dl search 'grp:birds cnt:Brazil' --format json
xc-dl search 'en:"Common Ostrich"' --save-query ostrich
```

| Option | Description |
|--------|-------------|
| `--full` | Fetch all pages for exact statistics |
| `--format` | Output format: `table` (default), `json`; `csv` is listed under Planned Features |
| `--save-query NAME` | Save query as a named preset |

### `download`

The main pipeline: fetch metadata, download audio, optionally convert. Supports resume -- interrupted downloads can be continued by re-running the same command.

```bash
# Basic download
xc-dl download 'grp:birds cnt:"South Africa"'

# Metadata only (no audio download)
xc-dl download 'grp:birds cnt:"South Africa"' --metadata-only

# With conversion to 16kHz mono WAV
xc-dl download 'grp:birds' --convert wav --convert-sample-rate 16000

# From a saved preset
xc-dl download --from-config south_africa_birds

# Random 10% sample (deterministic based on query hash)
xc-dl download 'grp:birds' --portion 10

# Stop after 500GB downloaded
xc-dl download 'grp:birds' --max-storage 500GB
```

| Option | Default | Description |
|--------|---------|-------------|
| `--metadata-only` | off | Fetch metadata only |
| `--per-page` | `500` | Records per API page |
| `--convert` | | Convert format: `wav`, `flac`, `ogg` |
| `--convert-sample-rate` | `16000` | Target sample rate (Hz) |
| `--convert-channels` | `1` | Target channels (1=mono) |
| `--convert-bit-depth` | `16` | Bit depth (16 or 32) |
| `--skip-existing/--no-skip-existing` | on | Skip already-verified files |
| `--retry-failed/--no-retry-failed` | on | Retry previously failed files |
| `--max-retries` | `3` | Max download retry attempts |
| `--portion` | | Download a random portion (0-100%) |
| `--max-storage` | | Stop after this much data (e.g. `500GB`, `1TiB`) |
| `--from-config NAME` | | Use a named query preset |
| `--from-file-list PATH` | | Download from HPC file list |
| `--generate-file-lists` | off | Create per-node file lists |
| `--num-nodes` | `1` | Number of HPC nodes |
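
The README states only that `--portion` is deterministic and based on a query hash; the exact selection scheme is not documented here. The sketch below shows one way such a sample could work, assuming the query string seeds a random generator. `sample_portion` is a hypothetical helper, not part of the xc-dl API.

```python
import hashlib
import random
from typing import List, Sequence


def sample_portion(query: str, recording_ids: Sequence[str], portion: float) -> List[str]:
    """Pick `portion` percent of the IDs, reproducibly for a given query."""
    if not 0 <= portion <= 100:
        raise ValueError("portion must be between 0 and 100")
    # Assumption: derive the seed from a SHA-256 hash of the query text.
    seed = int.from_bytes(hashlib.sha256(query.encode("utf-8")).digest()[:8], "big")
    rng = random.Random(seed)
    # Sort first so the result does not depend on the order IDs arrive in.
    ordered = sorted(recording_ids)
    k = round(len(ordered) * portion / 100)
    return sorted(rng.sample(ordered, k))


ids = [f"XC{i:08d}" for i in range(100)]
query = 'grp:birds cnt:"South Africa"'
first = sample_portion(query, ids, 10)
second = sample_portion(query, list(reversed(ids)), 10)
print(len(first), first == second)
```

Because the seed depends only on the query, re-running the same command selects the same recordings, which is what makes resume and reproducibility possible for sampled datasets.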

#### Pipeline Architecture

Downloads, conversions, and verification run concurrently through an `asyncio.Queue`-based pipeline:

```
Metadata Fetch ──> Download Queue ──> Convert Queue
                   (N workers)        (N/2 workers)
```

- Downloads start as metadata becomes available
- Conversions begin as soon as each file finishes downloading
- Ctrl+C triggers graceful shutdown: in-progress items complete, state is saved
- Re-running the same command resumes from where it stopped
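
As an illustration of the queue-based design described above, here is a minimal, self-contained sketch with N download workers feeding N/2 convert workers. It is not xc-dl's actual implementation: the worker functions are hypothetical, and `asyncio.sleep(0)` stands in for the real HTTP download and ffmpeg conversion.

```python
import asyncio
from typing import List


async def download_worker(download_q: asyncio.Queue, convert_q: asyncio.Queue, log: List[str]) -> None:
    while True:
        item = await download_q.get()
        if item is None:  # sentinel: no more work for this worker
            break
        await asyncio.sleep(0)  # placeholder for the real download
        log.append(f"downloaded {item}")
        # Hand the file to the converters as soon as it is complete.
        await convert_q.put(item)


async def convert_worker(convert_q: asyncio.Queue, log: List[str]) -> None:
    while True:
        item = await convert_q.get()
        if item is None:
            break
        await asyncio.sleep(0)  # placeholder for the real conversion
        log.append(f"converted {item}")


async def run_pipeline(recording_ids: List[str], n_workers: int = 4) -> List[str]:
    download_q: asyncio.Queue = asyncio.Queue()
    convert_q: asyncio.Queue = asyncio.Queue()
    log: List[str] = []
    downloaders = [
        asyncio.create_task(download_worker(download_q, convert_q, log))
        for _ in range(n_workers)
    ]
    converters = [
        asyncio.create_task(convert_worker(convert_q, log))
        for _ in range(max(1, n_workers // 2))
    ]
    for rid in recording_ids:
        await download_q.put(rid)
    for _ in downloaders:  # one sentinel per download worker
        await download_q.put(None)
    await asyncio.gather(*downloaders)
    for _ in converters:  # converters stop once all downloads are queued
        await convert_q.put(None)
    await asyncio.gather(*converters)
    return log


log = asyncio.run(run_pipeline(["XC00000001", "XC00000002", "XC00000003"]))
print(log)
```

A real pipeline would also need the graceful-shutdown and state-saving behavior listed above, which this sketch omits.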

#### Dataset Config

Each download writes a `download-config.yaml` to the dataset directory recording the query, parameters, and xc-dl version (never the API key). This makes datasets reproducible.

### `check`

Verify integrity of an existing dataset via SHA-256 checksums and optional deep audio decoding.

```bash
xc-dl check --all
xc-dl check --all --deep         # Full decode (slower, catches bitstream corruption)
xc-dl check --all --fix           # Re-download corrupted/missing files
xc-dl check --all --report status.json
```

| Option | Description |
|--------|-------------|
| `--all` | Check all recordings |
| `--query` | Check only recordings matching query |
| `--deep` | Full audio decode to detect corruption |
| `--fix` | Re-download files that fail verification |
| `--report PATH` | Write JSON report to file |
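
To show what a SHA-256 integrity check involves, the following sketch hashes a file in chunks and compares the digest with a stored value. The function names and the `ok`/`corrupted`/`missing` labels are assumptions made for this example; how xc-dl stores and reports checksums is not specified in this README.

```python
import hashlib
import tempfile
from pathlib import Path


def sha256_of_file(path: Path, chunk_size: int = 1 << 20) -> str:
    """Hash a file in chunks so large recordings are never fully in memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_file(path: Path, expected_sha256: str) -> str:
    """Classify a file as 'ok', 'corrupted', or 'missing' (hypothetical labels)."""
    if not path.is_file():
        return "missing"
    return "ok" if sha256_of_file(path) == expected_sha256.lower() else "corrupted"


with tempfile.TemporaryDirectory() as tmp:
    audio = Path(tmp) / "XC00694038_Tyto_alba.mp3"
    audio.write_bytes(b"fake audio bytes")
    expected = hashlib.sha256(b"fake audio bytes").hexdigest()
    status_ok = verify_file(audio, expected)
    audio.write_bytes(b"truncated")  # simulate corruption
    status_bad = verify_file(audio, expected)
    status_missing = verify_file(Path(tmp) / "absent.mp3", expected)
print(status_ok, status_bad, status_missing)
```

A checksum match confirms the bytes are unchanged but not that the audio decodes, which is the gap the `--deep` option is described as covering.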

## Dataset Structure

```
xc-dataset/
  catalog.jsonl                                     # Central catalog (one JSON per line)
  download-config.yaml                              # Query and parameters used
  dataset_manifest_south-africa-birds_v1.txt        # Selector file (view over catalog)
  metadata/
    Strigidae/Tyto/Tyto_alba/
      XC00694038_Tyto_alba.json                     # Sidecar metadata (raw API + xc-dl state)
  original_recordings/
    Strigidae/Tyto/Tyto_alba/
      XC00694038_Tyto_alba.mp3                      # Original audio
  resampled_16khz/
    Strigidae/Tyto/Tyto_alba/
      XC00694038_Tyto_alba.wav                      # Converted audio
  .progress/
    metadata-fetch.json                             # Resume state for metadata
    download-state.json                             # Resume state for downloads
```

Files are organized by `Family/Genus/Genus_species/` and named with zero-padded XC IDs (`XC00694038`).
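
The layout can be reproduced with a short helper, shown here as a sketch. `recording_path` is a hypothetical function, and the 8-digit padding width is inferred from the `XC00694038` example rather than from a documented rule.

```python
from pathlib import Path


def recording_stem(xc_id: int, genus: str, species: str) -> str:
    # Assumption: IDs are zero-padded to 8 digits, as in XC00694038.
    return f"XC{xc_id:08d}_{genus}_{species}"


def recording_path(
    root: str, subdir: str, family: str, genus: str, species: str, xc_id: int, ext: str
) -> Path:
    """Build <root>/<subdir>/Family/Genus/Genus_species/<stem>.<ext>."""
    stem = recording_stem(xc_id, genus, species)
    return Path(root) / subdir / family / genus / f"{genus}_{species}" / f"{stem}.{ext}"


original = recording_path(
    "xc-dataset", "original_recordings", "Strigidae", "Tyto", "alba", 694038, "mp3"
)
print(original.as_posix())
```

Swapping `subdir` and `ext` (for example `metadata` with `json`) gives the matching sidecar path, since all three trees share the same taxonomy folders.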

## HPC / Distributed Workflows

For large-scale downloads across multiple nodes (e.g. on a Slurm cluster):

```bash
# 1. Fetch metadata on the login node
xc-dl download 'grp:birds' --metadata-only

# 2. Generate per-node file lists
xc-dl download --generate-file-lists --num-nodes 16

# 3. Submit array job (each node downloads its portion)
# In your Slurm script:
xc-dl download --from-file-list file-lists/node-${SLURM_ARRAY_TASK_ID}.txt
```

File lists are written to `xc-dataset/file-lists/`:
- `full-list.txt` -- all recording IDs
- `node-0.txt` through `node-N.txt` -- per-node chunks
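
How recordings are divided between nodes is not described in this README. As one possible approach, the sketch below splits a full ID list into per-node chunks using round-robin assignment, which keeps chunk sizes within one item of each other. `split_for_nodes` is a hypothetical helper.

```python
from typing import List, Sequence


def split_for_nodes(recording_ids: Sequence[str], num_nodes: int) -> List[List[str]]:
    """Assign IDs to nodes round-robin; chunk i would become node-i.txt."""
    if num_nodes < 1:
        raise ValueError("num_nodes must be at least 1")
    return [list(recording_ids[i::num_nodes]) for i in range(num_nodes)]


full_list = [f"XC{i:08d}" for i in range(10)]
chunks = split_for_nodes(full_list, 3)
for node_index, chunk in enumerate(chunks):
    print(f"node-{node_index}.txt: {len(chunk)} recordings")
```

Each ID lands in exactly one chunk, so array tasks never download the same recording twice.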

<details>
<summary><h2>Planned Features</h2></summary>

The following features are planned but not yet implemented:

### Sonogram Download Support
`--include-sonograms` and `--sonogram-size` flags will download sonogram images to a `sonogram_<size>/` parallel directory tree.

### `update-taxonomy` Subcommand
Auto-download the IOC World Bird List CSV to refresh the genus-to-family mapping cache. The mapping could also be updated automatically from metadata pulled from Xeno-canto: when a new species with a genus/family is discovered, update the taxonomy.

### CSV Output for Search
`--format csv` option for machine-readable search output.

### Interactive Fuzzy Search TUI
`xc-dl search --interactive` using prompt_toolkit for live query building with auto-complete and preview counts.

### Dataset or Query Visualisation
`pip install xc-dl[viz]` will install the visualisation packages (e.g. datashader) needed to plot a map of where the recordings in a dataset come from. Other figures may follow, such as a nested pie chart for class composition or a dendrogram for call type visualisation.

</details>

## Disclaimer

- Code authored by Claude Opus 4.6, fully reviewed by a human
- Banner generated using Gemini Imagen 3
- This project is **not** officially associated with Xeno-canto
- The authors are not responsible for misuse of this project
- Users must respect the licenses under which original data contributors provided their recordings
- For large downloads, please inform the [Xeno-canto team](https://xeno-canto.org/about) and respect their rate limits
- Special thanks to the [Xeno-canto project](https://xeno-canto.org/) for hosting the data and to all contributors who further bioacoustic research
