Metadata-Version: 2.4
Name: mlx-audio-plus
Version: 0.1.1
Summary: MLX Audio Plus is a package for inference of text-to-speech (TTS) and speech-to-speech (STS) models locally on your Mac using MLX
Home-page: https://github.com/DePasqualeOrg/mlx-audio-plus
Author: Anthony DePasquale
License: MIT
Classifier: Programming Language :: Python :: 3
Classifier: License :: OSI Approved :: MIT License
Classifier: Operating System :: OS Independent
Requires-Python: >=3.8
Description-Content-Type: text/markdown
License-File: LICENSE
Requires-Dist: misaki[en]>=0.8.2
Requires-Dist: loguru>=0.7.3
Requires-Dist: num2words>=0.5.14
Requires-Dist: spacy>=3.8.4
Requires-Dist: phonemizer-fork>=3.3.2
Requires-Dist: espeakng-loader>=0.2.4
Requires-Dist: mlx>=0.25.2
Requires-Dist: mlx-vlm>=0.1.27
Requires-Dist: numpy>=1.26.4
Requires-Dist: transformers>=4.49.0
Requires-Dist: sentencepiece>=0.2.0
Requires-Dist: huggingface_hub>=0.27.0
Requires-Dist: sounddevice>=0.5.1
Requires-Dist: soundfile>=0.13.1
Requires-Dist: fastapi>=0.95.0
Requires-Dist: uvicorn>=0.22.0
Requires-Dist: einops>=0.8.1
Requires-Dist: tiktoken>=0.9.0
Requires-Dist: tqdm>=4.67.1
Requires-Dist: pyloudnorm>=0.1.1
Requires-Dist: omegaconf==2.3.0
Requires-Dist: einops==0.8.1
Requires-Dist: einx==0.3.0
Requires-Dist: fastrtc[stt,vad]
Requires-Dist: webrtcvad>=2.0.10
Requires-Dist: dacite>=1.9.2
Requires-Dist: pytest-asyncio>=1.0.0
Requires-Dist: mistral-common[audio]
Provides-Extra: py38
Requires-Dist: importlib_resources; extra == "py38"
Dynamic: author
Dynamic: classifier
Dynamic: description
Dynamic: description-content-type
Dynamic: home-page
Dynamic: license
Dynamic: license-file
Dynamic: provides-extra
Dynamic: requires-dist
Dynamic: requires-python
Dynamic: summary

# MLX Audio Plus

This library is a fork of [Blaizzy/mlx-audio](https://github.com/Blaizzy/mlx-audio) with additional models ported to MLX in Python.

## Features

- Fast inference on Apple Silicon
- Multiple language support
- Voice customization options
- Adjustable speech speed control
- REST API for TTS generation
- Quantization support for optimized performance

## Installation

```bash
pip install mlx-audio-plus
```

### Quick Start

To generate audio from the command line:

```bash
# Basic usage
mlx_audio.tts.generate --text "Hello, world"

# Specify prefix for output file
mlx_audio.tts.generate --text "Hello, world" --file_prefix hello

# Adjust speaking speed (0.5-2.0)
mlx_audio.tts.generate --text "Hello, world" --speed 1.4
```
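
If you drive the CLI from a script, the same invocations can be assembled as an argument list. The sketch below is only an illustration: `build_tts_command` is a hypothetical helper, not part of the package. It uses only the flags shown above and treats the 0.5-2.0 range from the comment as a hard limit, which is an assumption.

```python
import shlex


def build_tts_command(text, file_prefix=None, speed=None):
    """Build an argv list for the mlx_audio.tts.generate CLI (hypothetical helper)."""
    cmd = ["mlx_audio.tts.generate", "--text", text]
    if file_prefix is not None:
        cmd += ["--file_prefix", file_prefix]
    if speed is not None:
        # Assumption: the documented 0.5-2.0 range is enforced here, client-side.
        if not 0.5 <= speed <= 2.0:
            raise ValueError(f"speed must be between 0.5 and 2.0, got {speed}")
        cmd += ["--speed", str(speed)]
    return cmd


cmd = build_tts_command("Hello, world", file_prefix="hello", speed=1.4)
print(shlex.join(cmd))  # shell-quoted form of the command, for logging
# To run it: import subprocess; subprocess.run(cmd, check=True)
```

Passing a list to `subprocess.run` avoids shell quoting problems when the text contains spaces or quotes.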

To generate audio from Python:

```python
from mlx_audio.tts.generate import generate_audio

# Example: Generate an audiobook chapter as WAV audio
generate_audio(
    text=("In the beginning, the universe was created...\n"
        "...or the simulation was booted up."),
    model_path="prince-canuma/Kokoro-82M",
    voice="af_heart",
    speed=1.2,
    lang_code="a", # Kokoro: (a)f_heart, or comment out for auto
    file_prefix="audiobook_chapter1",
    audio_format="wav",
    sample_rate=24000,
    join_audio=True,
    verbose=True  # Set to False to disable print messages
)

print("Audiobook chapter successfully generated!")

```

### FastAPI Server

Start the API server:
```bash
# Using the command-line interface
mlx_audio.server

# With custom host and port
mlx_audio.server --host 0.0.0.0 --port 9000

# With verbose logging
mlx_audio.server --verbose
```

Available command line arguments:
- `--host`: Host address to bind the server to (default: 127.0.0.1)
- `--port`: Port to bind the server to (default: 8000)
- `--verbose`: Enable verbose logging

#### API Endpoints

The server provides the following REST API endpoints:

- `POST /v1/audio/speech`: Generate speech from text following the OpenAI TTS specification.
  - JSON body parameters:
    - `model`: Name or path of the TTS model to use.
    - `input`: Text to convert to speech.
    - `voice`: Optional voice preset.
    - `speed`: Optional speech speed (default `1.0`).
  - Returns the generated audio in WAV format.

- `POST /v1/audio/transcriptions`: Transcribe audio files using an STT model in a format compatible with OpenAI's API.
  - Multipart form parameters:
    - `file`: The audio file to transcribe.
    - `model`: Name or path of the STT model.
  - Returns JSON containing the transcribed `text`.

- `GET /v1/models`: List loaded models.
- `POST /v1/models`: Load a model by name.
- `DELETE /v1/models`: Unload a model.

> Note: Generated audio files are stored in `~/.mlx_audio/outputs` by default, or in a fallback directory if that location is not writable.
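
As a rough illustration of calling the speech endpoint from Python, the sketch below builds the JSON request with the standard library. The helper names `build_speech_request` and `synthesize_to_file` are hypothetical, and the base URL assumes the server is running with the default host and port listed above.

```python
import json
import urllib.request

BASE_URL = "http://127.0.0.1:8000"  # assumes the default --host and --port


def build_speech_request(model, text, voice=None, speed=1.0, base_url=BASE_URL):
    """Build a POST request for /v1/audio/speech (hypothetical helper)."""
    payload = {"model": model, "input": text, "speed": speed}
    if voice is not None:
        payload["voice"] = voice  # optional voice preset
    return urllib.request.Request(
        url=f"{base_url}/v1/audio/speech",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def synthesize_to_file(request, out_path):
    """Send the request and write the returned WAV bytes to out_path."""
    with urllib.request.urlopen(request) as response, open(out_path, "wb") as f:
        f.write(response.read())


request = build_speech_request(
    "prince-canuma/Kokoro-82M", "Hello, world", voice="af_heart"
)
# synthesize_to_file(request, "hello.wav")  # requires a running server
```

Separating request construction from sending makes it possible to inspect or log the payload without a running server.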

## Models

### Kokoro

Kokoro is a multilingual TTS model that supports various languages and voice styles.

#### Example Usage

```python
from mlx_audio.tts.models.kokoro import KokoroPipeline
from mlx_audio.tts.utils import load_model
from IPython.display import Audio, display
import soundfile as sf

# Initialize the model
model_id = 'prince-canuma/Kokoro-82M'
model = load_model(model_id)

# Create a pipeline with American English
pipeline = KokoroPipeline(lang_code='a', model=model, repo_id=model_id)

# Generate audio
text = "The MLX King lives. Let him cook!"
for i, (_, _, audio) in enumerate(pipeline(text, voice='af_heart', speed=1, split_pattern=r'\n+')):
    # Display audio in notebook (if applicable)
    display(Audio(data=audio, rate=24000, autoplay=0))

    # Save each segment to its own file so later segments don't overwrite earlier ones
    sf.write(f'audio_{i}.wav', audio[0], 24000)
```

#### Language Options

- 🇺🇸 `'a'` - American English
- 🇬🇧 `'b'` - British English
- 🇯🇵 `'j'` - Japanese (requires `pip install misaki[ja]`)
- 🇨🇳 `'z'` - Mandarin Chinese (requires `pip install misaki[zh]`)
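
The language options above can be kept in a small lookup table. This is a sketch only: `KOKORO_LANGUAGES`, `lang_code_for_voice`, and `required_extra` are hypothetical names, and deriving the language code from the first letter of the voice name (as in `af_heart`) is an assumption based on the example voice, not a documented guarantee.

```python
# Language code -> (language name, extra pip requirement or None)
KOKORO_LANGUAGES = {
    "a": ("American English", None),
    "b": ("British English", None),
    "j": ("Japanese", "misaki[ja]"),
    "z": ("Mandarin Chinese", "misaki[zh]"),
}


def lang_code_for_voice(voice):
    """Guess the language code from a voice name (assumes the first letter encodes it)."""
    code = voice[:1]
    if code not in KOKORO_LANGUAGES:
        raise ValueError(f"cannot infer a language code from voice {voice!r}")
    return code


def required_extra(lang_code):
    """Return the extra package to install for a language, or None if none is listed."""
    return KOKORO_LANGUAGES[lang_code][1]


code = lang_code_for_voice("af_heart")
print(code, KOKORO_LANGUAGES[code][0], required_extra(code))
```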

### CSM (Conversational Speech Model)

CSM is a model from Sesame that supports text-to-speech and lets you customize voices using reference audio samples.

#### Example Usage

```bash
# Generate speech using CSM-1B model with reference audio
python -m mlx_audio.tts.generate --model mlx-community/csm-1b --text "Hello from Sesame." --play --ref_audio ./conversational_a.wav
```

You can pass any audio file as the reference to clone a voice from, or download a sample audio file from [here](https://huggingface.co/mlx-community/csm-1b/tree/main/prompts).

## Advanced Features

### Quantization

You can quantize a model to reduce its size and memory use, which can also improve inference speed:

```python
from mlx_audio.tts.utils import quantize_model, load_model
import json
import mlx.core as mx

model = load_model('prince-canuma/Kokoro-82M')
config = model.config

# Quantize to 8-bit
group_size = 64
bits = 8
weights, config = quantize_model(model, config, group_size, bits)

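# Save quantized model (create the output directory first so open() doesn't fail)
import os
os.makedirs('./8bit', exist_ok=True)
with open('./8bit/config.json', 'w') as f:
    json.dump(config, f)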

mx.save_safetensors("./8bit/kokoro-v1_0.safetensors", weights, metadata={"format": "mlx"})
```
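
To get a feel for what `group_size` and `bits` mean for model size, the following back-of-the-envelope sketch estimates storage per weight. It assumes group-wise affine quantization in which each group of `group_size` weights shares one scale and one bias stored at 16 bits each; the function names are hypothetical, and the real saved size will differ because not every layer is necessarily quantized.

```python
def effective_bits_per_weight(bits, group_size, scale_bits=16, bias_bits=16):
    """Bits per weight including the per-group scale and bias (assumed 16 bits each)."""
    return bits + (scale_bits + bias_bits) / group_size


def estimated_size_mb(num_params, bits, group_size):
    """Rough size in megabytes if every parameter were quantized."""
    return num_params * effective_bits_per_weight(bits, group_size) / 8 / 1e6


# Settings from the example above: 8-bit weights in groups of 64
print(effective_bits_per_weight(bits=8, group_size=64))
# Illustrative parameter count only, taken from the "82M" in the model name
print(estimated_size_mb(82_000_000, bits=8, group_size=64))
```

A smaller `group_size` adds more per-group overhead but lets the scales track the weights more closely, so the choice is a trade-off between size and accuracy.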

## Requirements

- MLX
- Python 3.8+
- Apple Silicon Mac (for optimal performance)
- For the API:
  - FastAPI
  - Uvicorn

## License

[MIT License](LICENSE)

## Acknowledgements

- Thanks to the Apple MLX team for providing a great framework for building TTS and STS models.
- This project uses the Kokoro model architecture for text-to-speech synthesis.


## Citation

    @misc{mlx-audio-plus,
      author = {DePasquale, Anthony},
      title = {MLX Audio Plus},
      year = {2025},
      howpublished = {\url{https://github.com/DePasqualeOrg/mlx-audio-plus}},
      note = {A text-to-speech (TTS), speech-to-text (STT) and speech-to-speech (STS) library built on Apple's MLX framework, providing efficient speech analysis on Apple Silicon.}
    }
