Metadata-Version: 2.4
Name: vllm-mlx
Version: 0.2.3
Summary: vLLM-like inference for Apple Silicon - GPU-accelerated Text, Image, Video & Audio on Mac
Author: vllm-mlx contributors
License: Apache-2.0
Project-URL: Homepage, https://github.com/vllm-mlx/vllm-mlx
Project-URL: Documentation, https://github.com/vllm-mlx/vllm-mlx#readme
Project-URL: Repository, https://github.com/vllm-mlx/vllm-mlx
Keywords: llm,mlx,apple-silicon,vllm,inference,transformers
Classifier: Development Status :: 3 - Alpha
Classifier: Intended Audience :: Developers
Classifier: Intended Audience :: Science/Research
Classifier: License :: OSI Approved :: Apache Software License
Classifier: Operating System :: MacOS
Classifier: Programming Language :: Python :: 3
Classifier: Programming Language :: Python :: 3.10
Classifier: Programming Language :: Python :: 3.11
Classifier: Programming Language :: Python :: 3.12
Classifier: Programming Language :: Python :: 3.13
Classifier: Topic :: Scientific/Engineering :: Artificial Intelligence
Requires-Python: >=3.10
Description-Content-Type: text/markdown
Requires-Dist: mlx>=0.29.0
Requires-Dist: mlx-lm>=0.30.2
Requires-Dist: mlx-vlm>=0.1.0
Requires-Dist: transformers<5.0.0,>=4.40.0
Requires-Dist: tokenizers>=0.19.0
Requires-Dist: huggingface-hub>=0.23.0
Requires-Dist: numpy>=1.24.0
Requires-Dist: pillow>=10.0.0
Requires-Dist: tqdm>=4.66.0
Requires-Dist: pyyaml>=6.0
Requires-Dist: gradio>=4.0.0
Requires-Dist: requests>=2.28.0
Requires-Dist: tabulate>=0.9.0
Requires-Dist: opencv-python>=4.8.0
Requires-Dist: psutil>=5.9.0
Requires-Dist: fastapi>=0.100.0
Requires-Dist: uvicorn>=0.23.0
Requires-Dist: mcp>=1.0.0
Requires-Dist: jsonschema>=4.0.0
Requires-Dist: mlx-audio>=0.2.9
Provides-Extra: dev
Requires-Dist: pytest>=7.0.0; extra == "dev"
Requires-Dist: pytest-asyncio>=0.21.0; extra == "dev"
Requires-Dist: black>=23.0.0; extra == "dev"
Requires-Dist: ruff>=0.1.0; extra == "dev"
Requires-Dist: mypy>=1.0.0; extra == "dev"
Provides-Extra: vllm
Requires-Dist: vllm>=0.4.0; extra == "vllm"
Provides-Extra: vision
Requires-Dist: torch>=2.3.0; extra == "vision"
Requires-Dist: torchvision>=0.18.0; extra == "vision"
Provides-Extra: audio
Requires-Dist: mlx-audio>=0.2.9; extra == "audio"
Requires-Dist: sounddevice>=0.4.0; extra == "audio"
Requires-Dist: soundfile>=0.12.0; extra == "audio"
Requires-Dist: scipy>=1.10.0; extra == "audio"
Requires-Dist: numba>=0.57.0; extra == "audio"
Requires-Dist: tiktoken>=0.5.0; extra == "audio"
Requires-Dist: misaki[ja,zh]>=0.5.0; extra == "audio"
Requires-Dist: spacy>=3.7.0; extra == "audio"
Requires-Dist: num2words>=0.5.0; extra == "audio"
Requires-Dist: loguru>=0.7.0; extra == "audio"
Requires-Dist: phonemizer>=3.2.0; extra == "audio"
Requires-Dist: ordered_set>=4.1.0; extra == "audio"
Requires-Dist: cn2an>=0.5.0; extra == "audio"
Requires-Dist: fugashi>=1.3.0; extra == "audio"
Requires-Dist: unidic-lite>=1.0.0; extra == "audio"
Requires-Dist: jieba>=0.42.0; extra == "audio"

# vLLM-MLX

**vLLM-like inference for Apple Silicon** - GPU-accelerated Text, Image, Video & Audio on Mac

[![License](https://img.shields.io/badge/License-Apache_2.0-blue.svg)](LICENSE)
[![Python 3.10+](https://img.shields.io/badge/python-3.10+-blue.svg)](https://www.python.org/downloads/)
[![Apple Silicon](https://img.shields.io/badge/Apple-Silicon-black.svg)](https://support.apple.com/en-us/HT211814)
[![GitHub](https://img.shields.io/badge/GitHub-waybarrios%2Fvllm--mlx-blue?logo=github)](https://github.com/waybarrios/vllm-mlx)

## Overview

vllm-mlx brings native Apple Silicon GPU acceleration to vLLM by integrating:

- **[MLX](https://github.com/ml-explore/mlx)**: Apple's ML framework with unified memory and Metal kernels
- **[mlx-lm](https://github.com/ml-explore/mlx-lm)**: Optimized LLM inference with KV cache and quantization
- **[mlx-vlm](https://github.com/Blaizzy/mlx-vlm)**: Vision-language models for multimodal inference
- **[mlx-audio](https://github.com/Blaizzy/mlx-audio)**: Speech-to-Text and Text-to-Speech with native voices

## Features

- **Multimodal** - Text, Image, Video & Audio in one platform
- **Native GPU acceleration** on Apple Silicon (M1, M2, M3, M4)
- **Native TTS voices** - Spanish, French, Chinese, Japanese + 5 more languages
- **OpenAI API compatible** - drop-in replacement for OpenAI client
- **MCP Tool Calling** - integrate external tools via Model Context Protocol
- **Paged KV Cache** - memory-efficient caching with prefix sharing
- **Continuous Batching** - high throughput for multiple concurrent users

## Quick Start

### Installation

```bash
git clone https://github.com/waybarrios/vllm-mlx.git
cd vllm-mlx
pip install -e .
```

### Start Server

```bash
# Simple mode (single user, max throughput)
vllm-mlx serve mlx-community/Llama-3.2-3B-Instruct-4bit --port 8000

# Continuous batching (multiple users)
vllm-mlx serve mlx-community/Llama-3.2-3B-Instruct-4bit --port 8000 --continuous-batching

# With API key authentication
vllm-mlx serve mlx-community/Llama-3.2-3B-Instruct-4bit --port 8000 --api-key your-secret-key
```

### Use with OpenAI SDK

```python
from openai import OpenAI

# Without API key (local development)
client = OpenAI(base_url="http://localhost:8000/v1", api_key="not-needed")

# With API key (production)
client = OpenAI(base_url="http://localhost:8000/v1", api_key="your-secret-key")

response = client.chat.completions.create(
    model="default",
    messages=[{"role": "user", "content": "Hello!"}],
)
print(response.choices[0].message.content)
```

### Multimodal (Images & Video)

```bash
vllm-mlx serve mlx-community/Qwen3-VL-4B-Instruct-3bit --port 8000
```

```python
response = client.chat.completions.create(
    model="default",
    messages=[{
        "role": "user",
        "content": [
            {"type": "text", "text": "What's in this image?"},
            {"type": "image_url", "image_url": {"url": "https://example.com/image.jpg"}}
        ]
    }]
)
```

### Audio (TTS/STT)

```bash
# Install audio dependencies
pip install vllm-mlx[audio]
python -m spacy download en_core_web_sm
brew install espeak-ng  # macOS, for non-English languages
```

```bash
# Text-to-Speech (English)
python examples/tts_example.py "Hello, how are you?" --play

# Text-to-Speech (Spanish)
python examples/tts_multilingual.py "Hola mundo" --lang es --play

# List available models and languages
python examples/tts_multilingual.py --list-models
python examples/tts_multilingual.py --list-languages
```

**Supported TTS Models:**
| Model | Languages | Description |
|-------|-----------|-------------|
| Kokoro | EN, ES, FR, JA, ZH, IT, PT, HI | Fast, 82M params, 11 voices |
| Chatterbox | 15+ languages | Expressive, voice cloning |
| VibeVoice | EN | Realtime, low latency |
| VoxCPM | ZH, EN | High quality Chinese/English |

## Documentation

For full documentation, see the [docs](docs/) directory:

- **Getting Started**
  - [Installation](docs/getting-started/installation.md)
  - [Quick Start](docs/getting-started/quickstart.md)

- **User Guides**
  - [OpenAI-Compatible Server](docs/guides/server.md)
  - [Python API](docs/guides/python-api.md)
  - [Multimodal (Images & Video)](docs/guides/multimodal.md)
  - [Audio (STT/TTS)](docs/guides/audio.md)
  - [MCP & Tool Calling](docs/guides/mcp-tools.md)
  - [Continuous Batching](docs/guides/continuous-batching.md)

- **Reference**
  - [CLI Commands](docs/reference/cli.md)
  - [Supported Models](docs/reference/models.md)
  - [Configuration](docs/reference/configuration.md)

- **Benchmarks**
  - [LLM Benchmarks](docs/benchmarks/llm.md)
  - [Image Benchmarks](docs/benchmarks/image.md)
  - [Video Benchmarks](docs/benchmarks/video.md)
  - [Audio Benchmarks](docs/benchmarks/audio.md)

## Architecture

```
┌─────────────────────────────────────────────────────────┐
│                    vLLM API Layer                       │
│           (OpenAI-compatible interface)                 │
└─────────────────────────────────────────────────────────┘
                           │
                           ▼
┌─────────────────────────────────────────────────────────┐
│                    MLXPlatform                          │
│       (vLLM platform plugin for Apple Silicon)          │
└─────────────────────────────────────────────────────────┘
                           │
          ┌────────────────┼────────────────┐
          ▼                ▼                ▼
┌──────────────────┐ ┌──────────────┐ ┌──────────────────┐
│     mlx-lm       │ │   mlx-vlm    │ │    mlx-audio     │
│  (LLM inference) │ │ (Vision+LLM) │ │   (TTS + STT)    │
└──────────────────┘ └──────────────┘ └──────────────────┘
          │                │                  │
          └────────────────┴──────────────────┘
                           │
                           ▼
┌─────────────────────────────────────────────────────────┐
│                        MLX                              │
│         (Apple ML Framework - Metal kernels)            │
└─────────────────────────────────────────────────────────┘
```

## Performance

**LLM Performance (M4 Max, 128GB):**

| Model | Speed | Memory |
|-------|-------|--------|
| Qwen3-0.6B-8bit | 402 tok/s | 0.7 GB |
| Llama-3.2-1B-4bit | 464 tok/s | 0.7 GB |
| Llama-3.2-3B-4bit | 200 tok/s | 1.8 GB |

**Continuous Batching (5 concurrent requests):**

| Model | Single | Batched | Speedup |
|-------|--------|---------|---------|
| Qwen3-0.6B-8bit | 328 tok/s | 1112 tok/s | **3.4x** |
| Llama-3.2-1B-4bit | 299 tok/s | 613 tok/s | **2.0x** |

**Audio - Speech-to-Text (M4 Max, 128GB):**

| Model | RTF* | Use Case |
|-------|------|----------|
| whisper-tiny | **197x** | Real-time, low latency |
| whisper-large-v3-turbo | **55x** | Best quality/speed balance |
| whisper-large-v3 | **24x** | Highest accuracy |

*RTF = Real-Time Factor. RTF of 100x means 1 minute transcribes in ~0.6 seconds.

See [benchmarks](docs/benchmarks/) for detailed results.

## Contributing

We welcome contributions! See [Contributing Guide](docs/development/contributing.md) for details.

- Bug fixes and improvements
- Performance optimizations
- Documentation improvements
- Benchmarks on different Apple Silicon chips

Submit PRs to: [https://github.com/waybarrios/vllm-mlx](https://github.com/waybarrios/vllm-mlx)

## License

Apache 2.0 - see [LICENSE](LICENSE) for details.

## Citation

If you use vLLM-MLX in your research or project, please cite:

```bibtex
@software{vllm_mlx2025,
  author = {Barrios, Wayner},
  title = {vLLM-MLX: Apple Silicon MLX Backend for vLLM},
  year = {2025},
  url = {https://github.com/waybarrios/vllm-mlx},
  note = {Native GPU-accelerated LLM and vision-language model inference on Apple Silicon}
}
```

## Acknowledgments

- [MLX](https://github.com/ml-explore/mlx) - Apple's ML framework
- [mlx-lm](https://github.com/ml-explore/mlx-lm) - LLM inference library
- [mlx-vlm](https://github.com/Blaizzy/mlx-vlm) - Vision-language models
- [mlx-audio](https://github.com/Blaizzy/mlx-audio) - Text-to-Speech and Speech-to-Text
- [vLLM](https://github.com/vllm-project/vllm) - High-throughput LLM serving
