Metadata-Version: 2.4
Name: mlx-optiq
Version: 0.0.1
Summary: Mixed-precision quantization optimizer for MLX models on Apple Silicon
Author: Thin Signal
License: MIT
Project-URL: Models, https://huggingface.co/collections/mlx-community
Keywords: mlx,quantization,mixed-precision,apple-silicon,llm,kv-cache
Classifier: Development Status :: 4 - Beta
Classifier: Intended Audience :: Developers
Classifier: Intended Audience :: Science/Research
Classifier: License :: OSI Approved :: MIT License
Classifier: Operating System :: MacOS
Classifier: Programming Language :: Python :: 3.11
Classifier: Programming Language :: Python :: 3.12
Classifier: Programming Language :: Python :: 3.13
Classifier: Topic :: Scientific/Engineering :: Artificial Intelligence
Requires-Python: >=3.11
Description-Content-Type: text/markdown
Requires-Dist: mlx>=0.20
Requires-Dist: mlx-lm>=0.20
Requires-Dist: numpy
Requires-Dist: scipy
Provides-Extra: convert
Requires-Dist: torch>=2.0; extra == "convert"
Requires-Dist: transformers>=4.40; extra == "convert"
Requires-Dist: safetensors; extra == "convert"
Requires-Dist: tqdm; extra == "convert"
Requires-Dist: datasets; extra == "convert"
Provides-Extra: vlm
Requires-Dist: mlx-vlm>=0.3; extra == "vlm"
Requires-Dist: pillow; extra == "vlm"
Provides-Extra: audio
Requires-Dist: mlx-whisper>=0.4; extra == "audio"
Provides-Extra: cli
Requires-Dist: click>=8.0; extra == "cli"
Requires-Dist: psutil; extra == "cli"
Provides-Extra: all
Requires-Dist: torch>=2.0; extra == "all"
Requires-Dist: transformers>=4.40; extra == "all"
Requires-Dist: safetensors; extra == "all"
Requires-Dist: tqdm; extra == "all"
Requires-Dist: datasets; extra == "all"
Requires-Dist: mlx-vlm>=0.3; extra == "all"
Requires-Dist: mlx-whisper>=0.4; extra == "all"
Requires-Dist: click>=8.0; extra == "all"
Requires-Dist: psutil; extra == "all"
Requires-Dist: pillow; extra == "all"

# OptiQ

Optimizing compiler that takes PyTorch models and produces hardware-optimized MLX versions via **data-driven mixed-precision quantization**.

## What It Does

OptiQ analyzes each layer's sensitivity to quantization, then intelligently assigns per-layer bit-widths — giving more bits to sensitive layers and fewer bits to robust ones. This achieves better quality than uniform quantization at similar model sizes.

Supports **LLMs** (Qwen3), **Vision-Language Models** (Qwen3-VL), and **Speech Recognition** (Qwen3-ASR), with actual MLX deployment via mlx-lm, mlx-vlm, and mlx-audio.

### How It Differs From Existing Tools

| Tool | Approach | Limitation |
|------|----------|------------|
| mlx-lm uniform | Same bit-width everywhere | Wastes bits on insensitive layers |
| mlx-lm static recipes | Fixed heuristic (first/last blocks get more bits) | Not data-driven; same recipe for every model |
| mlx-lm KV cache | Uniform KV quantization (same bits all layers) | Ignores per-layer sensitivity (layer 0's KV cache is 56x more sensitive at 4-bit than at 8-bit) |
| **OptiQ** | Per-layer sensitivity measurement → optimal bit assignment for weights AND KV cache | — |

## Quick Start

```bash
# Install
cd optiq
python -m venv .venv && source .venv/bin/activate
pip install -r requirements.txt

# Convert an LLM — Quality profile (best accuracy, 11% larger)
python -m optiq.cli convert Qwen/Qwen3-0.6B-base --target-bpw 4.5 --candidate-bits 4,8

# Convert an LLM — Compact profile (17% smaller than uniform 4-bit)
python -m optiq.cli convert Qwen/Qwen3-0.6B-base --target-bpw 3.5 --candidate-bits 3,4

# Evaluate on standard benchmarks
python -m optiq.cli eval ./optiq_output/model --task gsm8k --baseline ./optiq_output/uniform_4bit

# Benchmark speed and perplexity
python -m optiq.cli benchmark ./optiq_output/model --baseline ./optiq_output/uniform_4bit
```

## Demo Results (Qwen3-0.6B-base)

OptiQ offers two profiles: **Quality** (best accuracy) and **Compact** (smallest size).

### Perplexity & Model Size

```
  Model                           Size (MB)    BPW      PPL
  ---------------------------------------------------------------
  Uniform 4-bit (mlx-lm)             319.9   4.00   19.227
  Static mixed_3_6 (mlx-lm)          269.9   3.60   37.318
  OptiQ Compact (3/4-bit)             265.7   3.50   32.732
  OptiQ Quality (4/8-bit)             355.4   4.50   17.429
```

### GSM8K (Math Reasoning, 100-question sample)

```
  Model                            Accuracy
  --------------------------------------------------
  OptiQ Quality (4/8-bit)            51.0%
  Uniform 4-bit                      34.0%
  OptiQ Compact (3/4-bit)             9.0%
```

### Quality profile (`--target-bpw 4.5 --candidate-bits 4,8`)

**+17 percentage points on GSM8K** (51% vs 34%) and **lower perplexity** (17.4 vs 19.2) compared to uniform 4-bit. The KL divergence sensitivity analysis correctly identifies which layers need more bits — notably `lm_head` (8.0x sensitivity ratio), last transformer block layers (up to 6.8x), and early layers with downstream amplification (10.4x). Query/key projections are the most robust (1.0x ratio) and safely quantized to 4-bit.

The tradeoff is 11% larger model size (355 vs 320 MB) due to upgrading ~15 sensitive layers to 8-bit.

### Compact profile (`--target-bpw 3.5 --candidate-bits 3,4`)

**17% smaller** than uniform 4-bit (266 vs 320 MB) and **better perplexity** than mlx-lm's static mixed recipe (32.7 vs 37.3) at similar size. The smart allocation puts 4-bit on the ~77 most sensitive layers and 3-bit on the remaining 120 layers. This is the right choice when model size and memory footprint are the primary constraint.

The tradeoff is reduced reasoning accuracy — 3-bit quantization degrades math reasoning on small models (9% GSM8K). For larger models (7B+) where `lm_head` is a smaller fraction of total parameters, the compact profile preserves more quality.

### Note on throughput

For this small model (0.6B), **83% of decode latency is overhead** (attention, normalization, KV cache, framework), not weight loading. This compresses the throughput range: uniform 3-bit runs at ~95 tok/s vs uniform 8-bit at ~74 tok/s, only a 28% speedup despite 63% fewer bits per weight. OptiQ's calibrated latency model predicts actual throughput within **3.8% mean error** using a single reference measurement (see [Hardware-Aware Latency Model](#hardware-aware-latency-model)).

## Demo Results (Qwen3-VL-2B — Vision-Language Model)

Qwen3-VL-2B-Instruct is a 2.1B-parameter vision-language model with **301 Linear layers** across its vision encoder (104 layers), cross-modal projection (8 layers), and language model (197 layers). Sensitivity analysis uses multimodal calibration (COCO images + text prompts) and AI2D diagram understanding for evaluation.

### AI2D (Diagram Understanding, 100-question sample)

```
  Model                          Size (MB)   AI2D Accuracy
  -----------------------------------------------------------
  FP16 (reference)                 4057.9      54.0%
  OptiQ Quality (4/8-bit)          1270.8      41.0%
  Uniform 4-bit (mlx-vlm)         1699.5      35.0%
```

**OptiQ recovers 32% of the quantization accuracy gap** — uniform 4-bit drops 19 percentage points from FP16 (54% → 35%), while OptiQ drops only 13pp (54% → 41%). OptiQ is also **25% smaller** (1271 vs 1700 MB). This counterintuitive result comes from how mlx-vlm handles VLMs: its default "uniform 4-bit" skips the vision encoder entirely (resulting in 6.7 effective BPW), while OptiQ's sensitivity analysis correctly identifies which vision layers need precision and quantizes the rest. The result is a smarter allocation at 5.0 BPW that preserves visual reasoning better than the bloated default.
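The headline figures in this paragraph can be checked with a few lines of arithmetic. This is a rough sketch that assumes model size scales linearly with bits per weight relative to the FP16 reference; the helper names are illustrative and not part of OptiQ.

```python
def effective_bpw(quantized_mb: float, fp16_mb: float) -> float:
    """Approximate bits per weight, assuming size scales linearly with bit-width."""
    return 16.0 * quantized_mb / fp16_mb


def gap_recovered(reference: float, baseline: float, improved: float) -> float:
    """Fraction of the reference-to-baseline accuracy drop that `improved` wins back."""
    return (improved - baseline) / (reference - baseline)


# Sizes (MB) and AI2D accuracies (%) taken from the table above
uniform_bpw = effective_bpw(1699.5, 4057.9)   # roughly 6.7
optiq_bpw = effective_bpw(1270.8, 4057.9)     # roughly 5.0
recovered = gap_recovered(54.0, 35.0, 41.0)   # roughly 0.32

print(f"{uniform_bpw:.1f} BPW, {optiq_bpw:.1f} BPW, {recovered:.0%} recovered")
```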

The sensitivity analysis reveals that **85 of 104 vision encoder layers** are highly sensitive and assigned 8-bit, while **195 of 197 language model layers** are robust and safely quantized to 4-bit. The most sensitive layer is `model.language_model.layers.2.mlp.down_proj` (KL divergence 893.1 at 4-bit vs 8.6 at 8-bit — a 104x sensitivity ratio).

## Demo Results (Qwen3-ASR-0.6B — Speech Recognition)

Qwen3-ASR-0.6B has **307 Linear layers** (622M params) across its audio encoder (111 layers) and text decoder (196 layers). The audio encoder stays in BF16 (matching mlx-audio's format), while the text decoder is quantized with OptiQ's sensitivity-driven mixed-precision. Sensitivity analysis runs natively in MLX using LibriSpeech audio calibration.

### Output Quality (KL Divergence vs BF16)

```
  Variant                          KL Divergence   Est. Size    BPW
  -------------------------------------------------------------------
  BF16 (reference)                           0.0    1492.4 MB   16.0
  Uniform 4-bit                           2.045     862.4 MB    4.0
  OptiQ Quality (4/8-bit)                 0.993     914.9 MB    5.0
```

**OptiQ reduces output divergence by 51.5%** vs uniform 4-bit at a cost of +6.1% model size. The sensitivity analysis reveals that the text decoder is 2x more sensitive than the audio encoder (mean KL: 0.016 vs 0.008). Layer 0 (first) and layer 27 (last) of the decoder are the most sensitive, with `model.layers.2.mlp.down_proj` showing a 44x sensitivity ratio between 4-bit and 8-bit quantization.

### WER (Word Error Rate, LibriSpeech test-clean, 100 samples)

```
  Model                          Size (MB)   WER      vs BF16
  ---------------------------------------------------------------
  BF16 (reference)                 1492.4   16.23%      ---
  OptiQ mixed (BPW=5.0)            941.2   16.98%    +0.75%
  Uniform 4-bit                     888.7   17.08%    +0.85%
```

**OptiQ improves WER by 0.10 percentage points** vs uniform 4-bit (16.98% vs 17.08%) at only 6% more model size, while being 37% smaller than BF16. The mixed-precision model allocates 59 of 196 decoder layers to 8-bit (the most sensitive layers identified by KL divergence analysis) and 137 layers to 4-bit.

## KV Cache Mixed-Precision Quantization

The KV cache stores key/value projections for all past tokens during autoregressive generation. For long contexts, it dominates memory — often exceeding the model weights themselves. MLX supports uniform KV cache quantization (`kv_bits=4`), but not all layers' KV caches are equally important.

OptiQ measures per-layer KV cache sensitivity and assigns bit-widths accordingly, keeping sensitive layers at 8-bit while aggressively quantizing robust layers to 4-bit.

### KV Cache Memory Impact (Qwen3-0.6B-base)

```
  Seq Len       FP16      4-bit  Mixed 5-bit      Model
  -----------------------------------------------------------
    1024     112.0M      28.0M      42.0M       320M
    4096     448.0M     112.0M     168.0M       320M
   16384    1792.0M     448.0M     672.0M       320M
   32768    3584.0M     896.0M    1344.0M       320M
```

At 4K context, the FP16 KV cache (448 MB) is **140% of the model weights** (320 MB). Quantization is essential for long sequences.
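The FP16 and 4-bit columns follow from the cache shape. The sketch below assumes 28 layers, 8 KV heads, and a head dimension of 128 for Qwen3-0.6B (values not stated in this README) and counts nominal bits only, without per-group scale/bias overhead.

```python
def kv_cache_mib(seq_len, n_layers=28, n_kv_heads=8, head_dim=128, bits=16):
    """Nominal KV cache size in MiB (keys + values, no scale/bias overhead)."""
    elements = 2 * n_layers * n_kv_heads * head_dim * seq_len  # 2 = keys and values
    return elements * bits / 8 / 2**20


for seq_len in (1024, 4096, 16384, 32768):
    fp16 = kv_cache_mib(seq_len)
    four_bit = kv_cache_mib(seq_len, bits=4)
    print(f"{seq_len:>6}  FP16 {fp16:7.1f}M  4-bit {four_bit:6.1f}M")
```

Under these assumptions the cache grows by 112 KiB per token at FP16, which is why it overtakes the 320 MB of weights before 4K tokens.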

### Per-Layer Sensitivity

```
  Most sensitive layers (4-bit KL divergence):
    Layer 0:  KL=2.4401  (55.9x vs 8-bit — must be 8-bit)
    Layer 2:  KL=0.0925  (3.4x vs 8-bit)
    Layer 17: KL=0.0559  (1.1x vs 8-bit)

  Least sensitive layers:
    Layer 5:  KL=0.0000  (safe to quantize to 4-bit)
    Layer 24: KL=0.0000
```

At 4-bit, Layer 0's KV cache diverges **56x more** than at 8-bit. Uniform 4-bit quantization is catastrophic because it quantizes this critical layer alongside all others.
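One simple way to turn such a ranking into a per-layer configuration is to give 8 bits to the most sensitive layers until the average-bit budget is used up. The function below is a minimal sketch of that idea, not necessarily the allocation OptiQ performs; `allocate_kv_bits` is a hypothetical name, and every sensitivity except those for layers 0, 2, 5, 17, and 24 is a placeholder.

```python
def allocate_kv_bits(sensitivity, target_bits=5.0, low=4, high=8):
    """Give `high` bits to the most sensitive layers within the average-bit budget."""
    n_layers = len(sensitivity)
    n_high = int((target_bits - low) * n_layers / (high - low))
    ranked = sorted(range(n_layers), key=lambda i: sensitivity[i], reverse=True)
    bits = [low] * n_layers
    for i in ranked[:n_high]:
        bits[i] = high
    return bits


# 4-bit KL divergence per layer; 0.01 is a placeholder for layers not listed above
kl = [0.01] * 28
kl[0], kl[2], kl[17], kl[5], kl[24] = 2.4401, 0.0925, 0.0559, 0.0, 0.0

bits = allocate_kv_bits(kl, target_bits=5.0)
print(bits.count(8), "layers at 8-bit, average", sum(bits) / len(bits))
```

With 28 layers and a 5.0-bit target, the budget admits seven 8-bit layers, which matches the layer count reported in the quality benchmark.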

### Quality Benchmark (Perplexity)

```
  Config                         PPL    KV Memory    vs FP16
  -----------------------------------------------------------
  FP16 KV (reference)          21.23       28.0M
  Uniform 8-bit KV             21.59       15.8M     +0.36
  OptiQ 6-bit KV               28.33       12.2M     +7.10
  OptiQ 5-bit KV               31.10       10.5M     +9.87
  Uniform 4-bit KV            507.50        8.8M   +486.27
```

**Uniform 4-bit KV is catastrophic** (PPL 507) because it quantizes Layer 0's critical KV cache. OptiQ 5-bit KV keeps 7 sensitive layers at 8-bit and achieves **PPL 31 — 16x better than uniform 4-bit** — at nearly the same memory savings (62% vs 69% reduction from FP16).

![KV Cache Analysis](optiq_output/kv_cache/kv_cache_analysis.png)

```bash
# Run KV cache analysis
python demo/demo_kv_cache.py

# CLI: analyze and optimize KV cache
optiq kv-cache ./my_model --target-bits 5.0
```

## Combined: Mixed Weights + Mixed KV Cache

The full OptiQ stack applies mixed-precision quantization to **both** model weights and KV cache. This is especially impactful for long contexts where KV cache dominates memory.

### Total Memory (Weights + KV Cache)

```
  Config                               Weights    KV@4K   Total@4K
  ------------------------------------------------------------------
  mlx-lm default (U4 wts + FP16 KV)    320M     448M       768M
  U4 wts + uniform 4-bit KV            320M     112M       432M    (PPL=507!)
  Full OptiQ (OQ wts + mixed 5b KV)    355M     140M       495M
```

At 16K context, the full OptiQ stack saves **1196 MB (57%)** vs the mlx-lm default. Uniform 4-bit KV is catastrophic (PPL 507) while OptiQ's mixed KV achieves PPL 26.8.
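The 16K figure can be reproduced from the rounded table values. This sketch scales the FP16 KV size by the nominal bit-width and ignores quantization overhead, so it lands within about a megabyte of the number quoted above.

```python
def total_mb(weights_mb, kv_fp16_mb, kv_bits=16):
    """Weights plus KV cache, scaling the FP16 KV size by nominal bit-width."""
    return weights_mb + kv_fp16_mb * kv_bits / 16


kv_16k_fp16 = 1792.0                                 # FP16 KV cache at 16K tokens
default = total_mb(320, kv_16k_fp16)                 # uniform 4-bit weights + FP16 KV
full_stack = total_mb(355, kv_16k_fp16, kv_bits=5)   # OptiQ weights + mixed 5-bit KV
saved = default - full_stack

print(f"{saved:.0f} MB saved ({saved / default:.0%})")
```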

### All Configurations Compared

```
  Config                         PPL  Weights  KV@256   Total
  ---------------------------------------------------------------
  OQ wts + FP16 KV             18.99   355M    28.0M    383M  (best quality)
  OQ wts + U8 KV               19.10   355M    14.0M    369M
  U4 wts + FP16 KV             21.23   320M    28.0M    348M  (mlx-lm default)
  U4 wts + U8 KV               21.59   320M    14.0M    334M
  OQ wts + OptiQ KV            26.84   355M     8.8M    364M  (full stack)
  U4 wts + OptiQ KV            31.10   320M     8.8M    329M
  OQ wts + U4 KV              434.17   355M     7.0M    362M  (broken)
  U4 wts + U4 KV              507.50   320M     7.0M    327M  (broken)
```

**Key insight**: Weight quantization and KV cache quantization are complementary. The KV sensitivity profile changes based on weight quantization — OptiQ runs sensitivity analysis on the actual quantized model, producing a tailored KV allocation (10 of 28 layers differ between weight variants).

![Combined Analysis](optiq_output/combined/combined_analysis.png)

```bash
python demo/demo_combined.py     # Full stack demo (~10 min)
```

## TurboQuant KV Cache (Rotation-Based Quantization)

OptiQ implements TurboQuant ([arxiv 2504.19874](https://arxiv.org/abs/2504.19874)) for KV cache compression. Instead of standard affine quantization (per-group scale/bias), TurboQuant applies a random orthogonal rotation before scalar quantization. This preserves the inner product structure that attention's Q*K^T computation relies on.
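The property that makes this work is that an orthogonal transform leaves dot products unchanged, so rotating both queries and keys does not alter attention scores before quantization. The sketch below illustrates this in plain Python, building the orthogonal transform from Householder reflections and using a simple symmetric uniform quantizer. Both are stand-ins chosen for brevity, not the rotation or codebook that TurboQuant or OptiQ actually use.

```python
import random


def householder(vec, v):
    """Apply the reflection H = I - 2 v v^T / (v^T v) to `vec`; H is orthogonal."""
    scale = 2.0 * sum(a * b for a, b in zip(vec, v)) / sum(b * b for b in v)
    return [a - scale * b for a, b in zip(vec, v)]


def rotate(vec, reflectors):
    """Compose several reflections into one orthogonal transform."""
    for v in reflectors:
        vec = householder(vec, v)
    return vec


def quantize(vec, bits):
    """Symmetric uniform scalar quantization round-trip (illustrative stand-in)."""
    levels = 2 ** (bits - 1) - 1
    step = max(abs(x) for x in vec) / levels or 1.0
    return [round(x / step) * step for x in vec]


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


rng = random.Random(0)
dim = 64
reflectors = [[rng.gauss(0, 1) for _ in range(dim)] for _ in range(dim)]
q = [rng.gauss(0, 1) for _ in range(dim)]
k = [rng.gauss(0, 1) for _ in range(dim)]

exact = dot(q, k)
rotated = dot(rotate(q, reflectors), rotate(k, reflectors))  # same as `exact` up to rounding
approx = dot(rotate(q, reflectors), quantize(rotate(k, reflectors), bits=4))

print(f"exact={exact:.4f} rotated={rotated:.4f} rotated+4-bit={approx:.4f}")
```

Only the key is quantized here, mirroring a cache that stores compressed keys while the query stays in full precision.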

### Results (Qwen3.5-0.8B, 6 self-attention layers)

```
  Method                          PPL   PPL delta  Needle retrieval
  -------------------------------------------------------------------
  FP16 KV (reference)           22.50      —           73%
  Affine 8-bit KV               22.50    +0.00         73%
  Affine 4-bit KV               22.98    +0.48         80%
  TurboQuant MSE 8-bit KV       22.51    +0.01         73%
  TurboQuant MSE 4-bit KV       22.87    +0.37         93%
  TurboQuant MSE 3-bit KV       23.66    +1.15        100%
```

Key findings:
- **TurboQuant MSE 4-bit beats affine 4-bit** on both PPL (+0.37 vs +0.48) and needle retrieval (93% vs 80%)
- **TurboQuant enables 3-bit KV cache** where affine can't (head_dim=256 packing incompatibility), with 100% needle retrieval
- TurboQuant's rotation acts as regularization, decorrelating dimensions and making keys more robust to quantization

### Usage

```python
import mlx.core as mx
from mlx_lm import load
from optiq.core.turbo_kv_cache import TurboQuantKVCache

model, tokenizer = load("mlx-community/Qwen3.5-0.8B-OptiQ-4bit")

# Replace self-attention KV caches with TurboQuant
cache = model.make_cache()
for i, layer in enumerate(model.layers):
    if hasattr(layer, "self_attn"):
        cache[i] = TurboQuantKVCache(
            head_dim=layer.self_attn.head_dim, bits=4, seed=42 + i
        )

# Tokenize a prompt and add a batch dimension: shape (1, seq_len)
input_ids = mx.array(tokenizer.encode("The capital of France is"))[None]

# Use normally: TurboQuant KV is transparent to mlx-lm
logits = model(input_ids, cache=cache)
```

### Architecture Note (Hybrid Models)

Qwen3.5 uses 18 GatedDeltaNet layers (recurrent state) + 6 standard self-attention layers (KV cache). TurboQuant is applied only to the KV cache layers. The recurrent state uses a read-modify-write pattern that causes error accumulation with quantization — keeping it at FP16 is recommended for generation tasks.

## Pareto Frontier Analysis

OptiQ traces a Pareto frontier across the size-quality tradeoff between uniform 3-bit and uniform 8-bit. In the sweep below, perplexity improves monotonically with size, and every OptiQ configuration beats the nearest smaller uniform baseline.

The analysis sweeps two proven 2-tier profiles across BPW targets:
- **Compact [3,4]-bit**: BPW 3.0–4.0 — smaller models with smart 3/4-bit allocation
- **Quality [4,8]-bit**: BPW 4.0–8.0 — best accuracy with smart 4/8-bit allocation

### Pareto Frontier (Qwen3-0.6B-base)

```
  Model                           Size (MB)      PPL
  -------------------------------------------------------
  Uniform 3-bit (baseline)          248.9       46.807
  OptiQ Compact BPW 3.25            266.6       30.550
  OptiQ Compact BPW 3.75            283.6       24.493
  OptiQ Compact BPW 4.0             301.0       21.879
  Uniform 4-bit (baseline)          319.9       19.004
  OptiQ Quality BPW 4.5             355.4       17.118
  OptiQ Quality BPW 5.0             390.9       16.354
  OptiQ Quality BPW 6.5             422.9       16.001
  OptiQ Quality BPW 8.0             529.9       15.442
  Uniform 8-bit (baseline)          604.1       14.947
```

Key findings:
- **OptiQ Compact at BPW 3.25**: 35% better perplexity than uniform 3-bit (30.6 vs 46.8) at just 7% more size
- **OptiQ Quality at BPW 4.5**: 10% better perplexity than uniform 4-bit (17.1 vs 19.0) at 11% more size
- **OptiQ Quality at BPW 8.0**: Within 0.5 PPL of uniform 8-bit (15.4 vs 14.9) at 12% smaller size

![Pareto Frontier](optiq_output/pareto/pareto_frontier.png)

## Hardware-Aware Latency Model

OptiQ includes an analytical latency model for Apple Silicon that predicts decode throughput from any quantization allocation **without converting or benchmarking** the model.

### How It Works

On Apple Silicon, decode inference (seq_len=1) is **memory-bandwidth-bound**: the GPU must read all model weights from unified memory for each token. This makes latency predictable:

```
per_token_latency = weight_bytes / memory_bandwidth + overhead
```

The model accounts for:
- Per-layer weight bytes (bits * params / 8 + scale/bias overhead)
- Dequantization overhead (3-bit unpacking is 8% slower than 4-bit)
- Per-layer dispatch overhead (~5 µs fixed cost)
- **One-point calibration**: measure one model's actual throughput to learn the constant overhead (attention, normalization, KV cache, framework)
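The one-point calibration can be sketched as solving the latency formula for the overhead term once, then reusing it for other model sizes. This simplified version leaves out the dequantization and per-layer dispatch terms, and the 400 GB/s bandwidth is an assumed figure rather than a value from OptiQ; the function names are illustrative.

```python
def calibrate_overhead(weight_bytes, measured_tok_s, bandwidth_bytes_s):
    """Solve per_token_latency = weight_bytes / bandwidth + overhead for the overhead."""
    return 1.0 / measured_tok_s - weight_bytes / bandwidth_bytes_s


def predict_tok_s(weight_bytes, overhead_s, bandwidth_bytes_s):
    """Predict decode throughput for another allocation using the calibrated overhead."""
    return 1.0 / (weight_bytes / bandwidth_bytes_s + overhead_s)


BANDWIDTH = 400e9  # assumed memory bandwidth in bytes/s; substitute your chip's figure
MB = 1e6

# Calibration point: uniform 4-bit, 319.9 MB, measured at 86.8 tok/s
overhead = calibrate_overhead(319.9 * MB, 86.8, BANDWIDTH)

for name, size_mb in [("uniform 3-bit", 248.9), ("uniform 8-bit", 604.1)]:
    tok_s = predict_tok_s(size_mb * MB, overhead, BANDWIDTH)
    print(f"{name}: {tok_s:.1f} tok/s predicted")
```

Because the overhead is fitted to a single measurement, the calibration point is reproduced exactly and every other prediction inherits whatever error the bandwidth assumption carries.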

### Validation (Qwen3-0.6B-base on M3 Max)

```
  Model                         Size MB   Actual  Predicted    Error
  ------------------------------------------------------------------
  Uniform 4-bit                   319.9    86.8      86.8    +0.0% *
  Uniform 3-bit                   248.9    94.9      88.4    -6.9%
  Uniform 8-bit                   604.1    74.4      80.8    +8.6%
  OptiQ [4,8] BPW 4.5             355.4    83.7      86.0    +2.6%
  OptiQ [3,4] BPW 3.5             265.7    88.8      88.0    -0.9%
  OptiQ [4,8] BPW 6.0             387.4    85.1      85.2    +0.2%
  (* = calibration point; Actual and Predicted in tok/s)    Mean error (excl. cal): 3.8%
```

Key insight: For this small model, **83% of decode latency is non-weight overhead**. This means the throughput range is compressed (75-95 tok/s across all configs). For larger models (7B+), weight loading dominates and bit-width has a larger impact on speed.

### Latency-Budget Optimizer

Given a throughput target, OptiQ finds the allocation that maximizes quality at that speed:

```bash
# What's the best quality I can get at each speed?
python demo/demo_hardware.py --validate
```

```
  Target tok/s    BPW   Size MB  Est tok/s  Layers @ high
  600             4.00    284.2       513              0
  500             4.22    299.7       501             14
  400             6.56    465.9       400            111
  350             7.96    565.4       357            194
```

![Speed-Quality Frontier](optiq_output/hardware_aware/speed_quality_frontier.png)

---

Run the demos yourself:

```bash
python demo/demo_llm.py      # LLM pipeline + GSM8K eval (~40 min)
python demo/demo_vlm.py      # VLM sensitivity analysis + MLX deployment (~20 hours)
python demo/demo_audio.py    # Qwen3-ASR sensitivity + WER eval (~15 min)
python demo/demo_pareto.py   # Pareto frontier sweep (~20 min, uses cached sensitivity)
python demo/demo_hardware.py # Hardware-aware latency model (~2 min, uses cached data)
python demo/demo_kv_cache.py # KV cache mixed-precision analysis (~5 min)
python demo/demo_combined.py # Full stack: mixed weights + mixed KV cache (~10 min)
```

## How It Works

### Pipeline

1. **Load model** in PyTorch (HuggingFace LLMs, VLMs) or MLX (Qwen3-ASR via mlx-audio)
2. **Sensitivity analysis** — per-layer KL divergence measurement to score each layer's sensitivity to quantization
3. **Mixed-precision optimization** — greedy knapsack over candidate bit-widths (e.g. 4,8) to hit target BPW, with automatic protection of critical layers (lm_head, embed_tokens, first/last blocks)
4. **MLX conversion** — calls `mlx_lm.convert()` for LLMs, `mlx_vlm.convert()` for VLMs, `mlx.nn.quantize()` + mlx-audio for speech models
5. **Evaluation** — GSM8K accuracy (LLM), AI2D diagram understanding (VLM), WER on LibriSpeech (audio), output KL divergence, plus perplexity
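Step 3 can be pictured as a greedy knapsack: start every layer at the low bit-width, reserve budget for protected layers, then upgrade the layers with the largest KL reduction per extra bit until the target BPW is reached. The code below is a minimal sketch under that reading, with toy layer names and made-up sensitivities; OptiQ's `optimizer.py` may differ in its details.

```python
def allocate_bits(layers, target_bpw, low=4, high=8, protected=()):
    """Greedy knapsack over two candidate bit-widths.

    `layers` maps name -> (n_params, kl_at_low, kl_at_high).
    """
    total_params = sum(p for p, _, _ in layers.values())
    budget = (target_bpw - low) * total_params  # extra bits available above all-`low`
    bits = {name: low for name in layers}

    def upgrade(name):
        nonlocal budget
        cost = layers[name][0] * (high - low)
        if cost <= budget:
            bits[name] = high
            budget -= cost

    for name in protected:  # critical layers get first claim on the budget
        upgrade(name)

    def gain_per_bit(name):
        n_params, kl_low, kl_high = layers[name]
        return (kl_low - kl_high) / (n_params * (high - low))

    for name in sorted(layers, key=gain_per_bit, reverse=True):
        if bits[name] == low:
            upgrade(name)
    return bits


# Toy example with hypothetical parameter counts and KL values
layers = {
    "lm_head": (100, 0.80, 0.10),
    "layers.0.mlp.down_proj": (100, 1.04, 0.10),
    "layers.0.self_attn.q_proj": (100, 0.10, 0.10),
    "layers.0.self_attn.k_proj": (100, 0.10, 0.10),
}
bits = allocate_bits(layers, target_bpw=6.0, protected=("lm_head",))
print(bits)
```

Ranking by KL reduction per bit, rather than by raw KL, keeps a large but only mildly sensitive layer from consuming the whole budget.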

### Sensitivity Analysis

OptiQ uses **per-layer KL divergence measurement** — the only reliable way to rank layer sensitivity:

1. For each layer and each candidate bit-width: replace weights with simulated quantized version, run full forward pass, measure KL divergence of output logits vs reference
2. Made practical via: very few calibration samples (2), short sequences (128 tokens), batch size 1, and progress checkpointing (resume interrupted runs)
3. Critical layers (`lm_head`, `embed_tokens`, first/last transformer blocks) are always protected with high bits

This measures *actual output degradation*, which captures:
- Layers where small weight errors amplify through downstream layers
- Attention layers where key/query sensitivity differs from value sensitivity
- Layers critical for reasoning vs fluency
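The measurement loop can be illustrated on a toy scale: round-trip the weights through a simulated quantizer, recompute the logits, and compare the output distributions with KL divergence. In this sketch a single linear layer stands in for the full forward pass, and the group-wise affine quantizer is an assumed approximation of MLX-style quantization; none of these helpers are OptiQ's own.

```python
import math
import random


def fake_quantize(weights, bits, group_size=64):
    """Affine round-trip per group: quantize to `bits`, then dequantize."""
    out = []
    for start in range(0, len(weights), group_size):
        group = weights[start:start + group_size]
        lo, hi = min(group), max(group)
        scale = (hi - lo) / (2 ** bits - 1) or 1.0
        out.extend(round((w - lo) / scale) * scale + lo for w in group)
    return out


def log_softmax(logits):
    peak = max(logits)
    log_z = peak + math.log(sum(math.exp(x - peak) for x in logits))
    return [x - log_z for x in logits]


def kl_divergence(ref_logits, test_logits):
    """KL(ref || test) between the two softmax distributions."""
    ref, test = log_softmax(ref_logits), log_softmax(test_logits)
    return sum(math.exp(r) * (r - t) for r, t in zip(ref, test))


def linear(weights, x, out_dim):
    """Toy stand-in for the forward pass: one row-major linear layer."""
    in_dim = len(x)
    return [sum(weights[o * in_dim + i] * x[i] for i in range(in_dim))
            for o in range(out_dim)]


rng = random.Random(0)
in_dim, out_dim = 64, 32
weights = [rng.gauss(0, 0.1) for _ in range(in_dim * out_dim)]
x = [rng.gauss(0, 1) for _ in range(in_dim)]
reference = linear(weights, x, out_dim)

# Sensitivity score per candidate bit-width
scores = {
    bits: kl_divergence(reference, linear(fake_quantize(weights, bits), x, out_dim))
    for bits in (4, 8)
}
print(scores)
```

The ratio `scores[4] / scores[8]` plays the role of the sensitivity ratios quoted throughout this README.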

```bash
# Quality: best accuracy at ~4.5 BPW
python -m optiq.cli convert Qwen/Qwen3-0.6B-base --target-bpw 4.5 --candidate-bits 4,8

# Compact: smallest size at ~3.5 BPW
python -m optiq.cli convert Qwen/Qwen3-0.6B-base --target-bpw 3.5 --candidate-bits 3,4
```

## Calibration and Evaluation Datasets

### LLM Calibration (Sensitivity Analysis)

**Dataset**: [WikiText-2](https://huggingface.co/datasets/wikitext) (`wikitext-2-raw-v1`, validation split)

- ~3,760 text samples from Wikipedia articles
- Filtered to passages >100 characters, tokenized and chunked into fixed-length sequences
- 128-token sequences, 2 samples per layer (short sequences keep per-layer passes fast)
- Shuffled with fixed seed (42) for reproducibility

WikiText-2 is the standard calibration dataset used across quantization literature (GPTQ, AWQ, mlx-lm's own `dynamic_quant.py`). It provides diverse, natural language that exercises the model broadly without overfitting to any specific domain.

### LLM Evaluation

**Perplexity**: WikiText-2 validation split — up to 50 samples × 512 tokens. Cross-entropy loss via log-softmax, exponentiated to perplexity. Good for measuring fluency degradation.
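As a reminder of what the metric computes, perplexity is the exponential of the mean negative log-likelihood over the scored tokens. The helper below is a generic sketch that takes per-token log-probabilities as input and is not OptiQ's benchmark code.

```python
import math


def perplexity(token_log_probs):
    """exp of the mean negative log-likelihood over all scored tokens."""
    nll = -sum(token_log_probs) / len(token_log_probs)
    return math.exp(nll)


# Sanity check: a model that spreads probability evenly over 50 tokens
uniform = [math.log(1 / 50)] * 128
print(perplexity(uniform))
```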

**GSM8K**: [openai/gsm8k](https://huggingface.co/datasets/openai/gsm8k) test split — 1,319 grade school math word problems. 3-shot chain-of-thought prompting, greedy decoding, exact numeric answer matching. Good for measuring whether the quantized model can still *reason*, not just produce fluent text.
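Exact numeric matching is commonly implemented by comparing the last number in the generation with the value after `####` in the GSM8K reference. The sketch below follows that convention; the regular expression and function names are assumptions and may not match `eval/gsm8k.py`.

```python
import re

NUMBER = re.compile(r"-?\d[\d,]*(?:\.\d+)?")


def extract_answer(text):
    """Return the last number in `text` as a float, or None if there is none."""
    matches = NUMBER.findall(text)
    if not matches:
        return None
    return float(matches[-1].replace(",", ""))


def is_correct(generation, reference):
    """Compare the last generated number with the value after '####' in the reference."""
    predicted = extract_answer(generation)
    target = extract_answer(reference.split("####")[-1])
    return predicted is not None and predicted == target


print(is_correct("She earns 9 * 2 = 18 dollars. The answer is 18.", "9 * 2 = 18 #### 18"))
```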

### VLM Calibration

**Dataset**: COCO val2017 images paired with text prompts ("Describe this image", etc.) via model's processor. Creates multimodal inputs exercising both vision encoder and language model paths.

### Audio Calibration

**Dataset**: [LibriSpeech](https://huggingface.co/datasets/librispeech_asr) (clean, validation split) — read English speech samples with transcriptions. Audio is processed via Qwen3-ASR's feature extractor (`_preprocess_audio`), with calibration run natively in MLX for sensitivity analysis.

### Audio Evaluation

**WER**: LibriSpeech test-clean split — 100 random samples, transcribed with Qwen3-ASR's `model.generate()` at temperature=0, word-level edit distance (substitutions + insertions + deletions) divided by total reference words.
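The metric itself is a word-level Levenshtein distance normalized by the reference length. The following is a minimal sketch; lowercasing and whitespace splitting are assumed normalization steps, and the function expects a non-empty reference.

```python
def word_error_rate(reference, hypothesis):
    """Word-level edit distance (substitutions + insertions + deletions) / reference words."""
    ref, hyp = reference.lower().split(), hypothesis.lower().split()
    # prev[j] holds the edit distance between the processed reference prefix and hyp[:j]
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (ref_word != hyp_word)
            curr.append(min(substitution, prev[j] + 1, curr[j - 1] + 1))
        prev = curr
    return prev[-1] / len(ref)


print(word_error_rate("the cat sat on the mat", "the cat sat on mat"))
```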

## Project Structure

```
optiq/
├── pyproject.toml              # Package config + dependencies
├── requirements.txt
├── optiq/
│   ├── cli.py                  # Click CLI: convert, eval, benchmark, kv-cache, latency
│   ├── core/
│   │   ├── importer.py         # Load PyTorch models (HuggingFace)
│   │   ├── sensitivity.py      # Per-layer KL divergence + fast Fisher sensitivity
│   │   ├── optimizer.py        # Mixed-precision bit assignment + latency-aware optimizer
│   │   ├── kv_cache.py         # Per-layer KV cache sensitivity + mixed-precision optimization
│   │   ├── turbo_quant.py      # TurboQuant: rotation + optimal scalar quantization
│   │   ├── turbo_kv_cache.py   # TurboQuant KV cache for self-attention layers
│   │   ├── turbo_state_cache.py # Quantized state cache for GatedDeltaNet
│   │   ├── latency.py          # Apple Silicon latency model (roofline + calibration)
│   │   └── verifier.py         # Output quality verification
│   ├── backends/
│   │   └── mlx_backend.py      # MLX conversion (mlx-lm + mlx-vlm + mlx-audio)
│   ├── models/
│   │   └── llm.py              # LLM pipeline orchestration
│   ├── eval/
│   │   ├── gsm8k.py            # GSM8K math reasoning evaluation
│   │   ├── ai2d.py             # AI2D diagram understanding evaluation
│   │   └── wer.py              # Word Error Rate evaluation (Qwen3-ASR + LibriSpeech)
│   ├── calibration/
│   │   └── datasets.py         # Calibration data loading
│   └── utils/
│       └── benchmark.py        # Perplexity, throughput, comparison tables
└── demo/
    ├── demo_llm.py             # End-to-end LLM demo (Qwen3-0.6B-base)
    ├── demo_vlm.py             # VLM sensitivity + MLX deployment (Qwen3-VL-2B)
    ├── demo_audio.py           # Qwen3-ASR speech recognition + WER eval
    ├── demo_pareto.py          # Pareto frontier sweep (size vs quality)
    ├── demo_hardware.py        # Hardware-aware latency model + speed-quality frontier
    ├── demo_kv_cache.py        # KV cache mixed-precision quantization demo
    └── demo_combined.py        # Full stack: mixed weights + mixed KV cache
```

## CLI Reference

### `optiq convert`

```
optiq convert MODEL [OPTIONS]

Options:
  --target TEXT                Target backend (default: mlx)
  --model-type [auto|llm]
  --target-bpw FLOAT           Target bits per weight (default: 4.5)
  --candidate-bits TEXT         Candidate bit-widths, comma-separated (default: 4,8)
  --group-size INT             Quantization group size (default: 64)
  -o, --output TEXT            Output directory
  --n-calibration INT          Calibration samples (default: 8)
  --skip-baselines             Skip baseline conversions
```

### `optiq eval`

```
optiq eval MODEL_PATH [OPTIONS]

Options:
  --task [gsm8k|ai2d|wer]   Evaluation task
  --baseline TEXT         Baseline model path for comparison
  --n-samples INT         Number of evaluation samples (default: 200)
```

### `optiq benchmark`

```
optiq benchmark MODEL_PATH [OPTIONS]

Options:
  --baseline TEXT            Baseline model path for comparison
  --n-samples INT            Evaluation samples (default: 50)
```

### `optiq kv-cache`

```
optiq kv-cache MODEL_PATH [OPTIONS]

Options:
  --target-bits FLOAT      Target average KV cache bits (default: 5.0)
  --candidate-bits TEXT     Candidate bit-widths, comma-separated (default: 4,8)
  --n-samples INT           Calibration samples (default: 5)
  --seq-len INT             Calibration sequence length (default: 512)
  --group-size INT          Quantization group size (default: 64)
  -o, --output TEXT         Output directory for results
```

Analyzes per-layer KV cache sensitivity and generates a mixed-precision configuration. The output `kv_config.json` can be used with `generate_with_mixed_kv()` for inference with optimized KV cache quantization.

### `optiq latency`

```
optiq latency MODEL_PATH [OPTIONS]

Options:
  --calibrate                Measure actual throughput and calibrate model
```

Predicts decode throughput using the Apple Silicon latency model. With `--calibrate`, measures actual throughput once and learns the overhead constant for accurate predictions on subsequent models.

## Requirements

- Python >= 3.11
- Apple Silicon Mac (for MLX)
- ~2GB RAM for Qwen3-0.6B-base
