Metadata-Version: 2.4
Name: mlx-audio-plus
Version: 0.1.7
Summary: MLX-Audio is a package for inference of text-to-speech (TTS) and speech-to-speech (STS) models locally on your Mac using MLX
Author-email: Prince Canuma <prince.gdt@gmail.com>
License: MIT
Project-URL: Homepage, https://github.com/DePasqualeOrg/mlx-audio-plus
Project-URL: Repository, https://github.com/DePasqualeOrg/mlx-audio-plus
Classifier: Programming Language :: Python :: 3
Classifier: License :: OSI Approved :: MIT License
Classifier: Operating System :: OS Independent
Requires-Python: >=3.9
Description-Content-Type: text/markdown
License-File: LICENSE
Requires-Dist: mlx>=0.25.2
Requires-Dist: numpy>=1.26.4
Requires-Dist: huggingface_hub>=0.27.0
Requires-Dist: transformers<5.0.0,>=4.49.0
Requires-Dist: hf_transfer
Provides-Extra: stt
Requires-Dist: tiktoken>=0.9.0; extra == "stt"
Requires-Dist: tqdm>=4.67.1; extra == "stt"
Provides-Extra: tts
Requires-Dist: misaki[en]>=0.8.2; extra == "tts"
Requires-Dist: loguru>=0.7.3; extra == "tts"
Requires-Dist: num2words>=0.5.14; extra == "tts"
Requires-Dist: spacy>=3.8.4; extra == "tts"
Requires-Dist: phonemizer-fork>=3.3.2; extra == "tts"
Requires-Dist: espeakng-loader>=0.2.4; extra == "tts"
Requires-Dist: sentencepiece>=0.2.0; extra == "tts"
Requires-Dist: sounddevice>=0.5.1; extra == "tts"
Requires-Dist: soundfile>=0.13.1; extra == "tts"
Requires-Dist: einops>=0.8.1; extra == "tts"
Requires-Dist: tqdm>=4.67.1; extra == "tts"
Requires-Dist: pyloudnorm>=0.1.1; extra == "tts"
Requires-Dist: omegaconf==2.3.0; extra == "tts"
Requires-Dist: einx==0.3.0; extra == "tts"
Requires-Dist: dacite>=1.9.2; extra == "tts"
Requires-Dist: mistral-common[audio]; extra == "tts"
Provides-Extra: server
Requires-Dist: fastapi>=0.95.0; extra == "server"
Requires-Dist: uvicorn>=0.22.0; extra == "server"
Provides-Extra: sts
Requires-Dist: mlx-audio-plus[stt,tts]; extra == "sts"
Requires-Dist: mlx-vlm>=0.1.27; extra == "sts"
Requires-Dist: fastrtc[stt,vad]; extra == "sts"
Requires-Dist: webrtcvad>=2.0.10; extra == "sts"
Provides-Extra: all
Requires-Dist: mlx-audio-plus[server,sts,stt,tts]; extra == "all"
Provides-Extra: dev
Requires-Dist: pytest>=7.0.0; extra == "dev"
Requires-Dist: pytest-asyncio>=1.0.0; extra == "dev"
Dynamic: license-file

# MLX Audio Plus

## Motivation

This fork of [Blaizzy/mlx-audio](https://github.com/Blaizzy/mlx-audio) removes a large amount of cruft from the upstream repo, namely incompatibly licensed code and data that should not be included there. In addition to the upstream models, it includes improvements and the following new models ported to MLX in Python:

- TTS
  - [Chatterbox](https://github.com/resemble-ai/chatterbox)
  - [CosyVoice 2](https://github.com/FunAudioLLM/CosyVoice)
  - [CosyVoice 3](https://github.com/FunAudioLLM/CosyVoice)
- STT
  - [Fun-ASR](https://github.com/modelscope/FunASR)

Improvements to the upstream repo will continue to be merged here.

This repo also serves as the basis for Swift ports of models in [mlx-swift-audio](https://github.com/DePasqualeOrg/mlx-swift-audio).

## Installation

```bash
pip install mlx-audio-plus
```
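
The package metadata declares optional extras (`stt`, `tts`, `server`, `sts`, `all`, `dev`), which pip installs with the usual bracket syntax, for example `pip install "mlx-audio-plus[tts]"` (the quotes keep shells such as zsh from expanding the brackets). The following is a minimal sketch for checking whether a few of those optional dependencies are importable. The `missing_modules` helper and the `EXTRA_MODULES` table are hypothetical and not part of this package, and the table lists only a sample of each extra's dependencies.

```python
from importlib.util import find_spec

# Import names for a sample of the optional dependencies, grouped by extra.
# Hypothetical table for illustration; it is not exhaustive.
EXTRA_MODULES = {
    "stt": ["tiktoken", "tqdm"],
    "tts": ["soundfile", "sounddevice", "einops"],
    "server": ["fastapi", "uvicorn"],
}


def missing_modules(extra, modules=EXTRA_MODULES):
    """Return the modules listed for `extra` that cannot be located."""
    # find_spec only locates a top-level module; it does not import it.
    return [name for name in modules[extra] if find_spec(name) is None]


for extra in EXTRA_MODULES:
    missing = missing_modules(extra)
    status = "ok" if not missing else "missing: " + ", ".join(missing)
    print(f"{extra}: {status}")
```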

## Usage

### CLI

```bash
# CosyVoice 3: cross-lingual mode (reference audio only)
mlx_audio.tts.generate --model mlx-community/Fun-CosyVoice3-0.5B-2512-4bit \
    --text "Hello, this is a test of text to speech." \
    --ref_audio reference.wav

# CosyVoice 3: zero-shot mode (with transcription for better quality)
mlx_audio.tts.generate --model mlx-community/Fun-CosyVoice3-0.5B-2512-4bit \
    --text "Hello, this is a test of text to speech." \
    --ref_audio reference.wav \
    --ref_text "This is what I said in the reference audio."

# CosyVoice 3: instruct mode with style control
mlx_audio.tts.generate --model mlx-community/Fun-CosyVoice3-0.5B-2512-4bit \
    --text "I have exciting news!" \
    --ref_audio reference.wav \
    --instruct_text "Speak with excitement and enthusiasm"

# CosyVoice 3: voice conversion
mlx_audio.tts.generate --model mlx-community/Fun-CosyVoice3-0.5B-2512-4bit \
    --ref_audio target_speaker.wav \
    --source_audio source_speech.wav

# Play audio directly instead of saving
mlx_audio.tts.generate --model mlx-community/Fun-CosyVoice3-0.5B-2512-4bit \
    --text "Hello world" \
    --ref_audio reference.wav \
    --play

# Chatterbox: generate speech from reference audio
mlx_audio.tts.generate --model mlx-community/Chatterbox-TTS-4bit \
    --text "The quick brown fox jumped over the lazy dog." \
    --ref_audio reference.wav
```
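
When the CLI is driven from a script, it can help to assemble the argument list programmatically. The sketch below uses only the flags shown in the examples above; `build_tts_command` is a hypothetical helper, not something this package provides, and it assumes `mlx_audio.tts.generate` is on the `PATH`.

```python
import shlex


def build_tts_command(model, ref_audio, text=None, ref_text=None,
                      instruct_text=None, source_audio=None, play=False):
    """Assemble an argument list for the mlx_audio.tts.generate CLI."""
    cmd = ["mlx_audio.tts.generate", "--model", model, "--ref_audio", ref_audio]
    # Optional flags are added only when a value is supplied.
    optional = {
        "--text": text,
        "--ref_text": ref_text,
        "--instruct_text": instruct_text,
        "--source_audio": source_audio,
    }
    for flag, value in optional.items():
        if value is not None:
            cmd += [flag, value]
    if play:
        cmd.append("--play")
    return cmd


cmd = build_tts_command(
    "mlx-community/Fun-CosyVoice3-0.5B-2512-4bit",
    "reference.wav",
    text="Hello world",
    play=True,
)
# Print a shell-quoted version of the command for inspection.
print(shlex.join(cmd))
```

The resulting list can be passed to `subprocess.run(cmd, check=True)` to run the command.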

### Python

```python
from mlx_audio.tts.generate import generate_audio

# CosyVoice 3: cross-lingual mode (reference audio only)
generate_audio(
    text="Hello, this is a test of text to speech.",
    model="mlx-community/Fun-CosyVoice3-0.5B-2512-4bit",
    ref_audio="reference.wav",
    file_prefix="output",  # Optional
    audio_format="wav",  # Optional
)

# CosyVoice 3: zero-shot mode (with transcription for better quality)
generate_audio(
    text="Bonjour, comment allez-vous aujourd'hui?",
    model="mlx-community/Fun-CosyVoice3-0.5B-2512-4bit",
    ref_audio="reference.wav",
    ref_text="This is what I said in the reference audio.",
)

# CosyVoice 3: instruct mode with style control
generate_audio(
    text="I have some exciting news to share with you!",
    model="mlx-community/Fun-CosyVoice3-0.5B-2512-4bit",
    ref_audio="reference.wav",
    instruct_text="Speak with excitement and enthusiasm",
)

# CosyVoice 3: voice conversion (convert source audio to target speaker)
generate_audio(
    model="mlx-community/Fun-CosyVoice3-0.5B-2512-4bit",
    ref_audio="target_speaker.wav",  # Target voice
    source_audio="source_speech.wav",
)

# Chatterbox: generate speech from reference audio
generate_audio(
    text="The quick brown fox jumped over the lazy dog.",
    model="mlx-community/Chatterbox-TTS-4bit",
    ref_audio="reference.wav",
)
```
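
To synthesize several texts with the same settings, one option is a small wrapper that calls the generation function once per text with a distinct `file_prefix`. This is a sketch under stated assumptions: `generate_batch` is a hypothetical helper, and it relies only on the `text` and `file_prefix` keyword arguments shown above. The function to call is passed in as `generate`, so the wrapper itself does not need the package installed.

```python
def generate_batch(texts, generate, prefix="output", **common):
    """Call `generate` once per text, giving each output a numbered prefix."""
    prefixes = []
    for index, text in enumerate(texts):
        file_prefix = f"{prefix}_{index:03d}"
        # `common` carries shared keyword arguments such as model and ref_audio.
        generate(text=text, file_prefix=file_prefix, **common)
        prefixes.append(file_prefix)
    return prefixes


# Usage with the real function (requires the package and model weights):
# from mlx_audio.tts.generate import generate_audio
# generate_batch(
#     ["First sentence.", "Second sentence."],
#     generate_audio,
#     model="mlx-community/Chatterbox-TTS-4bit",
#     ref_audio="reference.wav",
# )
```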

### Speech to text

```python
from mlx_audio.stt.models.funasr import Model

# Fun-ASR

# Load the model
model = Model.from_pretrained("mlx-community/Fun-ASR-Nano-2512-4bit")

# Basic transcription
result = model.generate("audio.wav")
print(result.text)

# Translation (speech to English text)
result = model.generate(
    "chinese_speech.wav",
    task="translate",
    target_language="en",
)
print(result.text)

# Custom prompting for domain-specific content
result = model.generate(
    "medical_dictation.wav",
    initial_prompt="Medical consultation discussing cardiac symptoms.",
)
print(result.text)

# Streaming output
for chunk in model.generate("audio.wav", stream=True):
    print(chunk, end="", flush=True)
```

