Metadata-Version: 2.4
Name: mlx-audio
Version: 0.2.10
Summary: MLX-Audio is a package for inference of text-to-speech (TTS) and speech-to-speech (STS) models locally on your Mac using MLX
Author-email: Prince Canuma <prince.gdt@gmail.com>
License: MIT
Project-URL: Homepage, https://github.com/Blaizzy/mlx-audio
Project-URL: Repository, https://github.com/Blaizzy/mlx-audio
Classifier: Programming Language :: Python :: 3
Classifier: License :: OSI Approved :: MIT License
Classifier: Operating System :: OS Independent
Requires-Python: >=3.10
Description-Content-Type: text/markdown
License-File: LICENSE
Requires-Dist: mlx>=0.25.2
Requires-Dist: numpy>=1.26.4
Requires-Dist: huggingface_hub>=0.27.0
Requires-Dist: transformers<5.0.0,>=4.49.0
Requires-Dist: hf_transfer
Requires-Dist: mlx-lm<0.30.0,>=0.18.0
Requires-Dist: tqdm>=4.67.1
Requires-Dist: sounddevice>=0.5.1
Requires-Dist: soundfile>=0.13.1
Requires-Dist: pyloudnorm>=0.1.1
Requires-Dist: librosa==0.11.0
Requires-Dist: mlx-audio[all]
Provides-Extra: stt
Requires-Dist: tiktoken>=0.9.0; extra == "stt"
Requires-Dist: mistral-common[audio]; extra == "stt"
Provides-Extra: tts
Requires-Dist: misaki[en]>=0.8.2; extra == "tts"
Requires-Dist: loguru>=0.7.3; extra == "tts"
Requires-Dist: num2words>=0.5.14; extra == "tts"
Requires-Dist: spacy>=3.8.4; extra == "tts"
Requires-Dist: phonemizer-fork>=3.3.2; extra == "tts"
Requires-Dist: espeakng-loader>=0.2.4; extra == "tts"
Requires-Dist: sentencepiece>=0.2.0; extra == "tts"
Requires-Dist: einops>=0.8.1; extra == "tts"
Requires-Dist: omegaconf==2.3.0; extra == "tts"
Requires-Dist: einx==0.3.0; extra == "tts"
Requires-Dist: dacite>=1.9.2; extra == "tts"
Provides-Extra: server
Requires-Dist: fastapi>=0.95.0; extra == "server"
Requires-Dist: uvicorn>=0.22.0; extra == "server"
Provides-Extra: sts
Requires-Dist: mlx-audio[stt,tts]; extra == "sts"
Requires-Dist: mlx-vlm>=0.1.27; extra == "sts"
Requires-Dist: fastrtc[stt,vad]==0.0.34; extra == "sts"
Requires-Dist: webrtcvad>=2.0.10; extra == "sts"
Provides-Extra: all
Requires-Dist: mlx-audio[server,sts,stt,tts]; extra == "all"
Provides-Extra: dev
Requires-Dist: pytest>=7.0.0; extra == "dev"
Requires-Dist: pytest-asyncio>=1.0.0; extra == "dev"
Dynamic: license-file

# MLX-Audio

The best audio processing library built on Apple's MLX framework, providing fast and efficient text-to-speech (TTS), speech-to-text (STT), and speech-to-speech (STS) on Apple Silicon.

## Features

- Fast inference optimized for Apple Silicon (M series chips)
- Multiple model architectures for TTS, STT, and STS
- Multilingual support across models
- Voice customization and cloning capabilities
- Adjustable speech speed control
- Interactive web interface with 3D audio visualization
- OpenAI-compatible REST API
- Quantization support (3-bit, 4-bit, 6-bit, 8-bit, and more) for optimized performance
- Swift package for iOS/macOS integration

## Installation

```bash
pip install mlx-audio
```

For development or web interface:

```bash
git clone https://github.com/Blaizzy/mlx-audio.git
cd mlx-audio
pip install -e ".[dev]"
```

## Quick Start

### Command Line

```bash
# Basic TTS generation
mlx_audio.tts.generate --model mlx-community/Kokoro-82M-bf16 --text "Hello, world!"

# With voice selection and speed adjustment
mlx_audio.tts.generate --model mlx-community/Kokoro-82M-bf16 --text "Hello!" --voice af_heart --speed 1.2

# Play audio immediately
mlx_audio.tts.generate --model mlx-community/Kokoro-82M-bf16 --text "Hello!" --play
```

### Python API

```python
from mlx_audio.tts.utils import load_model

# Load model
model = load_model("mlx-community/Kokoro-82M-bf16")

# Generate speech
for result in model.generate("Hello from MLX-Audio!", voice="af_heart"):
    print(f"Generated {result.audio.shape[0]} samples")
    # result.audio contains the waveform as mx.array
```

## Supported Models

### Text-to-Speech (TTS)

| Model | Description | Languages | Repo |
|-------|-------------|-----------|------|
| **Kokoro** | Fast, high-quality multilingual TTS | EN, JA, ZH, FR, ES, IT, PT, HI | [mlx-community/Kokoro-82M-bf16](https://huggingface.co/mlx-community/Kokoro-82M-bf16) |
| **CSM** | Conversational Speech Model with voice cloning | EN | [mlx-community/csm-1b](https://huggingface.co/mlx-community/csm-1b) |
| **Dia** | Dialogue-focused TTS | EN | [mlx-community/Dia-1.6B-bf16](https://huggingface.co/mlx-community/Dia-1.6B-bf16) |
| **OuteTTS** | Efficient TTS model | EN | [mlx-community/OuteTTS-0.2-500M](https://huggingface.co/mlx-community/OuteTTS-0.2-500M) |
| **Spark** | SparkTTS model | EN, ZH | [mlx-community/SparkTTS-0.5B-bf16](https://huggingface.co/mlx-community/SparkTTS-0.5B-bf16) |
| **Chatterbox** | Expressive multilingual TTS | EN, ES, FR, DE, IT, PT, PL, TR, RU, NL, CS, AR, ZH, JA, HU, KO | [mlx-community/Chatterbox-bf16](https://huggingface.co/mlx-community/Chatterbox-bf16) |
| **Soprano** | High-quality TTS | EN | [mlx-community/Soprano-bf16](https://huggingface.co/mlx-community/Soprano-bf16) |

### Speech-to-Text (STT)

| Model | Description | Languages | Repo |
|-------|-------------|-----------|------|
| **Whisper** | OpenAI's robust STT model | 99+ languages | [mlx-community/whisper-large-v3-mlx](https://huggingface.co/mlx-community/whisper-large-v3-mlx) |
| **Parakeet** | NVIDIA's accurate STT | EN | [mlx-community/parakeet-tdt-0.6b-v2](https://huggingface.co/mlx-community/parakeet-tdt-0.6b-v2) |
| **Voxtral** | Mistral's speech model | Multiple | [mlx-community/Voxtral-Mini-3B-2507-bf16](https://huggingface.co/mlx-community/Voxtral-Mini-3B-2507-bf16) |

### Speech-to-Speech (STS)

| Model | Description | Use Case | Repo |
|-------|-------------|----------|------|
| **SAM-Audio** | Text-guided source separation | Extract specific sounds | [mlx-community/sam-audio-large](https://huggingface.co/mlx-community/sam-audio-large) |
| **MossFormer2 SE** | Speech enhancement | Noise removal | [starkdmi/MossFormer2_SE_48K_MLX](https://huggingface.co/starkdmi/MossFormer2_SE_48K_MLX) |

## Model Examples

### Kokoro TTS

Kokoro is a fast, multilingual TTS model with 54 voice presets.

```python
from mlx_audio.tts.utils import load_model

model = load_model("mlx-community/Kokoro-82M-bf16")

# Generate with different voices
for result in model.generate(
    text="Welcome to MLX-Audio!",
    voice="af_heart",  # American female
    speed=1.0,
    lang_code="a"  # American English
):
    audio = result.audio
```

**Available Voices:**
- American English: `af_heart`, `af_bella`, `af_nova`, `af_sky`, `am_adam`, `am_echo`, etc.
- British English: `bf_alice`, `bf_emma`, `bm_daniel`, `bm_george`, etc.
- Japanese: `jf_alpha`, `jm_kumo`, etc.
- Chinese: `zf_xiaobei`, `zm_yunxi`, etc.

**Language Codes:**
| Code | Language | Note |
|------|----------|------|
| `a` | American English | Default |
| `b` | British English | |
| `j` | Japanese | Requires `pip install misaki[ja]` |
| `z` | Mandarin Chinese | Requires `pip install misaki[zh]` |
| `e` | Spanish | |
| `f` | French | |

### CSM (Voice Cloning)

Clone any voice using a reference audio sample:

```bash
mlx_audio.tts.generate \
    --model mlx-community/csm-1b \
    --text "Hello from Sesame." \
    --ref_audio ./reference_voice.wav \
    --play
```

### Whisper STT

```python
from mlx_audio.stt.utils import load_model, transcribe

model = load_model("mlx-community/whisper-large-v3-mlx")
result = transcribe("audio.wav", model=model)
print(result["text"])
```

### SAM-Audio (Source Separation)

Separate specific sounds from audio using text prompts:

```python
from mlx_audio.sts import SAMAudio, SAMAudioProcessor, save_audio

model = SAMAudio.from_pretrained("mlx-community/sam-audio-large")
processor = SAMAudioProcessor.from_pretrained("mlx-community/sam-audio-large")

batch = processor(
    descriptions=["A person speaking"],
    audios=["mixed_audio.wav"],
)

result = model.separate_long(
    batch.audios,
    descriptions=batch.descriptions,
    anchors=batch.anchor_ids,
    chunk_seconds=10.0,
    overlap_seconds=3.0,
    ode_opt={"method": "midpoint", "step_size": 2/32},
)

save_audio(result.target[0], "voice.wav")
save_audio(result.residual[0], "background.wav")
```

### MossFormer2 (Speech Enhancement)

Remove noise from speech recordings:

```python
from mlx_audio.sts import MossFormer2SEModel, save_audio

model = MossFormer2SEModel.from_pretrained("starkdmi/MossFormer2_SE_48K_MLX")
enhanced = model.enhance("noisy_speech.wav")
save_audio(enhanced, "clean.wav", 48000)
```

## Web Interface & API Server

MLX-Audio includes a modern web interface and OpenAI-compatible API.

### Starting the Server

```bash
# Start API server
mlx_audio.server --host 0.0.0.0 --port 8000

# Start web UI (in another terminal)
cd mlx_audio/ui
npm install && npm run dev
```

### API Endpoints

**Text-to-Speech** (OpenAI-compatible):
```bash
curl -X POST http://localhost:8000/v1/audio/speech \
  -H "Content-Type: application/json" \
  -d '{"model": "mlx-community/Kokoro-82M-bf16", "input": "Hello!", "voice": "af_heart"}' \
  --output speech.wav
```

**Speech-to-Text**:
```bash
curl -X POST http://localhost:8000/v1/audio/transcriptions \
  -F "file=@audio.wav" \
  -F "model=mlx-community/whisper-large-v3-mlx"
```

## Quantization

Reduce model size and improve performance with quantization using the convert script:

```bash
# Convert and quantize to 4-bit
python -m mlx_audio.convert \
    --hf-path prince-canuma/Kokoro-82M \
    --mlx-path ./Kokoro-82M-4bit \
    --quantize \
    --q-bits 4 \
    --upload-repo username/Kokoro-82M-4bit (optional: if you want to upload the model to Hugging Face)

# Convert with specific dtype (bfloat16)
python -m mlx_audio.convert \
    --hf-path prince-canuma/Kokoro-82M \
    --mlx-path ./Kokoro-82M-bf16 \
    --dtype bfloat16 \
    --upload-repo username/Kokoro-82M-bf16 (optional: if you want to upload the model to Hugging Face)
```

**Options:**
| Flag | Description |
|------|-------------|
| `--hf-path` | Source Hugging Face model or local path |
| `--mlx-path` | Output directory for converted model |
| `-q, --quantize` | Enable quantization |
| `--q-bits` | Bits per weight (4, 6, or 8) |
| `--q-group-size` | Group size for quantization (default: 64) |
| `--dtype` | Weight dtype: `float16`, `bfloat16`, `float32` |
| `--upload-repo` | Upload converted model to HF Hub |

**Pre-quantized models available:**
- [mlx-community/Kokoro-82M-4bit](https://huggingface.co/mlx-community/Kokoro-82M-4bit)
- [mlx-community/Kokoro-82M-6bit](https://huggingface.co/mlx-community/Kokoro-82M-6bit)
- [mlx-community/Kokoro-82M-8bit](https://huggingface.co/mlx-community/Kokoro-82M-8bit)
- [mlx-community/Kokoro-82M-bf16](https://huggingface.co/mlx-community/Kokoro-82M-bf16)

## Swift Integration

Native Swift package for iOS and macOS apps.

### Installation

Add to your `Package.swift`:

```swift
dependencies: [
    .package(url: "https://github.com/Blaizzy/mlx-audio.git", from: "0.2.5")
]
```

### Usage

```swift
import MLXAudio

// Create session (auto-downloads model)
let session = try await MarvisSession(voice: .conversationalA)

// Generate speech (auto-plays)
let result = try await session.generate(for: "Hello from Swift!")

// Streaming generation
for try await chunk in session.stream(text: "Streaming audio...") {
    print("Chunk: \(chunk.sampleCount) samples")
}
```

**Supported Platforms:** macOS 14.0+, iOS 16.0+

## Requirements

- Python 3.9+
- Apple Silicon Mac (M1/M2/M3/M4)
- MLX framework

## License

[MIT License](LICENSE)

## Citation

```bibtex
@misc{mlx-audio,
  author = {Canuma, Prince},
  title = {MLX Audio},
  year = {2025},
  howpublished = {\url{https://github.com/Blaizzy/mlx-audio}},
  note = {Audio processing library for Apple Silicon with TTS, STT, and STS capabilities.}
}
```

## Acknowledgements

- [Apple MLX Team](https://github.com/ml-explore/mlx) for the MLX framework
