Metadata-Version: 2.4
Name: jang
Version: 1.1.0
Summary: JANG — Adaptive Mixed-Precision Quantization for Apple Silicon. The GGUF equivalent for MLX.
Author-email: Jinho Jang <eric@jangq.ai>
License: Apache-2.0
Project-URL: Homepage, https://jangq.ai
Project-URL: Repository, https://github.com/jjang-ai/jangq
Project-URL: Documentation, https://github.com/jjang-ai/jangq#readme
Project-URL: Bug Tracker, https://github.com/jjang-ai/jangq/issues
Project-URL: HuggingFace, https://huggingface.co/JANGQ-AI
Keywords: quantization,llm,apple-silicon,metal,mlx,jang,moe,mixed-precision
Classifier: Development Status :: 4 - Beta
Classifier: Intended Audience :: Science/Research
Classifier: Intended Audience :: Developers
Classifier: Programming Language :: Python :: 3
Classifier: Programming Language :: Python :: 3.11
Classifier: Programming Language :: Python :: 3.12
Classifier: Programming Language :: Python :: 3.13
Classifier: Topic :: Scientific/Engineering :: Artificial Intelligence
Classifier: Operating System :: MacOS
Requires-Python: >=3.11
Description-Content-Type: text/markdown
Requires-Dist: safetensors>=0.4
Requires-Dist: numpy>=1.24
Requires-Dist: tqdm>=4.60
Requires-Dist: huggingface_hub>=0.20
Provides-Extra: mlx
Requires-Dist: mlx>=0.22; extra == "mlx"
Requires-Dist: mlx-lm>=0.20; extra == "mlx"
Provides-Extra: torch
Requires-Dist: torch>=2.0; extra == "torch"
Requires-Dist: transformers>=4.40; extra == "torch"
Provides-Extra: all
Requires-Dist: jang[mlx]; extra == "all"
Requires-Dist: jang[torch]; extra == "all"

<p align="center">
  <img src="assets/jangq-logo-dark.png" alt="JANG" width="400">
</p>

<h3 align="center"><b>J</b>ang <b>A</b>daptive <b>N</b>-bit <b>G</b>rading</h3>
<h4 align="center">Mixed-Precision Quantization for Apple Silicon</h4>

<p align="center">
  The GGUF equivalent for MLX — models stay quantized in GPU memory at full Metal speed.<br>
  Open-source quantization format + tools + inference engine.
</p>

<p align="center">
  <a href="https://jangq.ai">Website</a> •
  <a href="https://huggingface.co/JANGQ-AI">Pre-quantized Models</a> •
  <a href="FORMAT.md">Format Spec</a> •
  <a href="jang-tools/">Quantization Tools</a> •
  <a href="research/">Research & Experiments</a>
</p>

<p align="center">
  <img alt="License" src="https://img.shields.io/badge/license-Apache%202.0-blue">
  <img alt="Python" src="https://img.shields.io/badge/python-3.11+-green">
  <img alt="Platform" src="https://img.shields.io/badge/platform-Apple%20Silicon-black">
  <img alt="Format" src="https://img.shields.io/badge/format-v1.1-purple">
</p>

---

## What is JANG?

**JANG** (**J**ang **A**daptive **N**-bit **G**rading) is an open-source quantization format and toolkit that makes large language models run on Apple Silicon at 2-bit precision while staying coherent.

Unlike uniform quantization, where every weight gets the same bit width, JANG classifies tensors by sensitivity, giving critical layers (attention) more bits while aggressively compressing the bulk (MLP/experts). The result: a 122B model fits in 46 GB of GPU memory and scores 73% on a 200-question MMLU subset, where MLX `mixed_2_6` scores 46% and MLX uniform 2-bit scores 56%.

**Key features:**
- Models stay **quantized in GPU memory** (like GGUF) — no float16 expansion
- Uses MLX **native Metal kernels** (`quantized_matmul`, `gather_qmm`) — full speed
- Supports **every architecture**: MoE, Mamba, MLA, VL, hybrid SSM, dense transformers
- **One command** to quantize any HuggingFace model
- **11 profiles** from extreme 2-bit to near-lossless 6-bit
- Works with **FP8 source models** (MiniMax, DeepSeek)

## Results

### MMLU Benchmark — 122B MoE at 2-bit

200 questions, 10 subjects. Qwen3.5-122B-A10B on M4 Max 128 GB.

| Method | Size | GPU | MMLU |
|--------|------|-----|------|
| **JANG_1L (2.24b)** | **51 GB** | **46 GB** | **73.0%** |
| MLX mixed_2_6 | 44 GB | 45 GB | 46.0% |
| MLX uniform 2-bit | 36 GB | 36 GB | 56.0% |

**JANG scores 27 points above MLX `mixed_2_6`** (73.0% vs 46.0%) and 17 points above MLX uniform 2-bit (56.0%). It wins every subject except one.

### Free-Form Quality — 35B MoE at 2-bit

| Prompt | JANG_2L (15 GB) | MLX mixed_2_6 (13 GB) | MLX uniform (10 GB) |
|--------|----------------|----------------------|---------------------|
| What is 2+2? | **"2+2 equals 4" ✅** | Loops ❌ | Number spam ❌ |
| Photosynthesis | **"convert light energy into chemical energy" ✅** | "I cannot respond" ❌ | Garbage ❌ |
| Three planets | **"Jupiter, Saturn, Uranus" ✅** | "Antina" loops ❌ | Number spam ❌ |
| Capital of France | **"Paris" with details ✅** | Never answers ❌ | Partial ⚠️ |

**Scores over the full six-prompt set (four prompts shown above): JANG 4/6, MLX mixed 0/6, MLX uniform 0/6.**

### Why JANG Wins on MoE

MLX `mixed_2_6` only protects `v_proj` + `down_proj` in select layers — a strategy designed for dense models. JANG protects **all attention** everywhere, including:
- GatedDeltaNet linear attention (Qwen3.5)
- MoE expert routing gates
- MLA latent projections (DeepSeek)

On MoE models, 94-98% of parameters are expert MLP. Protecting the other 2-6% at 8-bit costs almost nothing but makes the difference between 73% and 46% MMLU.

> **Note:** JANG is designed for MoE/hybrid models. For dense models (Llama, Mistral) at 4-bit and above, MLX uniform quantization is recommended.
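
The cost of that protection can be checked with a weighted average. The following is a minimal sketch, assuming only two tiers (protected tensors at 8-bit, expert MLP at 2-bit) and ignoring per-group scale and bias storage. `average_bits` is an illustrative helper, not part of `jang_tools`.

```python
def average_bits(protected_fraction, protected_bits=8, bulk_bits=2):
    """Weighted average bits per weight for a two-tier split (illustrative only)."""
    if not 0.0 <= protected_fraction <= 1.0:
        raise ValueError("protected_fraction must be between 0 and 1")
    return protected_fraction * protected_bits + (1.0 - protected_fraction) * bulk_bits


# Protect 2% to 6% of parameters at 8-bit and keep the rest at 2-bit
for fraction in (0.02, 0.04, 0.06):
    print(f"{fraction:.0%} protected -> {average_bits(fraction):.2f} bits/weight")
```

Under these assumptions the average stays between roughly 2.1 and 2.4 bits per weight, which is the same range as the 2.24b reported for `JANG_1L` in the table above.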

## How It Works

JANG protects the small fraction of weights that control output quality while compressing everything else.

```
CRITICAL  (attention, output head)     →  6-8 bit  →  Controls coherence
IMPORTANT (embeddings, routers)        →  4-8 bit  →  Moderate sensitivity
COMPRESS  (MLP, MoE experts)           →  2-3 bit  →  Bulk of parameters
```

On a 122B MoE model, 98% of parameters are expert MLP. Giving the other 2% more bits costs almost nothing — but makes the difference between working and broken.
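
The tiering above can be sketched as a name-based classifier. This is only an illustration: the substring patterns are hypothetical (real tensor names vary by architecture), `classify_tensor` and `bits_for` are not `jang_tools` functions, and JANG's actual classification rules may differ. The bit widths are the `JANG_2L` row of the Profiles table.

```python
# Hypothetical substring patterns, checked in order; anything unmatched is bulk.
TIER_PATTERNS = [
    ("CRITICAL", ("self_attn", "attention", "lm_head")),
    ("IMPORTANT", ("embed", "router", "mlp.gate.weight")),
]

# Bit widths taken from the JANG_2L row of the Profiles table
JANG_2L_BITS = {"CRITICAL": 8, "IMPORTANT": 6, "COMPRESS": 2}


def classify_tensor(name):
    """Map a tensor name to a sensitivity tier (illustrative rules only)."""
    lowered = name.lower()
    for tier, patterns in TIER_PATTERNS:
        if any(pattern in lowered for pattern in patterns):
            return tier
    return "COMPRESS"


def bits_for(name, profile_bits=JANG_2L_BITS):
    """Look up the bit width a tensor would receive under a profile."""
    return profile_bits[classify_tensor(name)]


for tensor_name in (
    "model.layers.0.self_attn.q_proj.weight",
    "model.layers.0.mlp.gate.weight",
    "model.layers.0.mlp.experts.3.gate_proj.weight",
    "lm_head.weight",
):
    print(f"{tensor_name:50s} {classify_tensor(tensor_name):9s} {bits_for(tensor_name)}-bit")
```

Matching on `mlp.gate.weight` rather than `gate` keeps the router in the IMPORTANT tier without also catching the experts' `gate_proj` weights, which belong to the bulk.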

## Install

```bash
pip install jang
```

For inference on Apple Silicon:
```bash
pip install "jang[mlx]"
```

Or install from source:
```bash
pip install git+https://github.com/jjang-ai/jangq.git#subdirectory=jang-tools
```

## Quick Start

### Convert any model

```bash
# Simple: pick 1-8 for target bits
jang convert path/to/model -p 2

# Specific profile
jang convert path/to/model -p JANG_1L

# From HuggingFace
jang convert Qwen/Qwen3.5-35B-A3B -p 2
```

### Run inference

```python
# Load and generate (requires: pip install mlx mlx-lm)
from jang_tools.loader import load_jang_model
from mlx_lm.sample_utils import make_sampler
from mlx_lm.generate import generate_step
import mlx.core as mx

model, tokenizer = load_jang_model("path/to/jang-model")
sampler = make_sampler(temp=0.7)

tokens = tokenizer.encode("What is photosynthesis?")
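for tok, _ in generate_step(prompt=mx.array(tokens), model=model, max_tokens=200, sampler=sampler):
    # tok may be a Python int or a scalar mx.array, depending on the mlx-lm version
    token_id = tok.item() if hasattr(tok, "item") else tok
    if token_id == tokenizer.eos_token_id:
        break  # stop before printing the end-of-sequence token
    print(tokenizer.decode([token_id]), end="", flush=True)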
```

JANG models also work with any OpenAI-compatible server that supports MLX (e.g., vMLX, MLX Studio).

### Python API

```python
from jang_tools import convert_model, JANG_PROFILES, load_jang_model

# Convert any HuggingFace model
convert_model("Qwen/Qwen3.5-35B-A3B", "output-JANG_2L", profile="JANG_2L")

# Inspect
model, tokenizer = load_jang_model("output-JANG_2L")
print(model.summary())

# Estimate size before converting
from jang_tools import estimate_size_gb
print(estimate_size_gb(122_000_000_000, "JANG_1L"))
# → {'total_gb': 36.9, 'avg_bits_approx': 2.1, ...}
```

## MMLU Benchmark

200-question MMLU (10 subjects, 20 per subject). Apple M4 Max 128 GB. All quantized in GPU memory.

### Qwen3.5-122B MoE — JANG 73% vs MLX 46%

| Method | Size | GPU | MMLU Score |
|--------|------|-----|------------|
| **JANG_1L (2.24b)** | **51 GB** | **46 GB** | **73.0%** |
| MLX mixed_2_6 | 44 GB | 45 GB | 46.0% |
| MLX uniform 2-bit | 36 GB | 36 GB | 56.0% |

**JANG scores 27 points above MLX `mixed_2_6`** and 17 points above MLX uniform 2-bit. JANG wins every subject except one.

### Qwen3.5-35B MoE — JANG 4/6 vs MLX 0/6

| Method | Size | Speed | Free-form Score |
|--------|------|-------|-----------------|
| **JANG_2L (2.28b)** | 15 GB | 100 tok/s | **4/6 correct** |
| MLX mixed_2_6 | 13 GB | 120 tok/s | 0/6 correct |
| MLX uniform 2-bit | 10 GB | 128 tok/s | 0/6 correct |

### Small Dense Models — 65 Wins at the Breaking Point

On dense models (1-7B), JANG wins at the **degradation boundary** — the exact bit level where MLX uniform starts producing garbage:

| Model | JANG | MLX Uniform | Result |
|-------|------|-------------|--------|
| Phi-2 (2.7B) at 2-bit | Correct scientific answer | Empty output | **JANG wins** |
| SmolLM2 (1.7B) at 3-bit | "8 legs" (correct) | Number spam | **JANG wins** |
| Mistral-7B at 3-bit | Correct explanation | Number garbage | **JANG wins** |

65 wins, 0 losses across 7 models. At the breaking point, attention protection prevents catastrophic failure.

### When JANG Helps vs When It Doesn't

| Scenario | Attention % of params | JANG Overhead | Benefit | Verdict |
|----------|----------------------|---------------|---------|---------|
| **MoE at any bit level** | 1-2% | ~2% bigger | Always better attention | **JANG wins** |
| **Dense at breaking point** (2-3 bit) | ~12% | ~12% bigger | Coherent vs garbage | **JANG wins** |
| **Dense at 4-bit+** | ~12% | ~12% bigger | Already works fine | **MLX wins** |

**Why:** On MoE models, expert MLP is 94-98% of parameters, so boosting the other 2-6% costs almost nothing. On dense models at 4-bit, attention already has enough precision, so the 12% overhead for 8-bit attention isn't justified.

**Recommendation:**
- **MoE models** (Qwen3.5 MoE, MiniMax, DeepSeek, Mixtral): Use JANG at any bit level
- **Dense models at extreme compression** (2-3 bit): Use JANG — it's the difference between working and broken
- **Dense models at 4-bit+** (Llama, Mistral, Gemma): Use MLX uniform — JANG overhead isn't worth it
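
The overhead figures in the table can be approximated with a short calculation. This is a rough sketch, assuming the protected share of parameters is stored at 8-bit and everything else at the base bit width, with scale and bias storage ignored. `size_overhead` is an illustrative helper, not a `jang_tools` function.

```python
def size_overhead(protected_fraction, base_bits, protected_bits=8):
    """Relative size increase over uniform quantization at base_bits."""
    mixed_bits = protected_fraction * protected_bits + (1.0 - protected_fraction) * base_bits
    return mixed_bits / base_bits - 1.0


# Attention shares taken from the table above, both at a 4-bit base
print(f"MoE, 2% attention, 4-bit base:    +{size_overhead(0.02, 4):.0%}")
print(f"Dense, 12% attention, 4-bit base: +{size_overhead(0.12, 4):.0%}")
```

With these inputs the sketch gives about +2% for the MoE case and +12% for the dense case, in line with the overhead column above.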

## Profiles

| # | Profile | CRITICAL | IMPORTANT | COMPRESS | Best for |
|---|---------|----------|-----------|----------|----------|
| 1 | `JANG_1L` | 8 | 8 | 2 | Maximum quality ~2-bit |
| 2 | `JANG_2L` | 8 | 6 | 2 | Balanced 2-bit |
| 3 | `JANG_3M` | 8 | 3 | 3 | 3-bit with 8-bit attention |
| 4 | `JANG_4M` | 8 | 4 | 4 | **The standard** — same as MLX 4-bit + 8-bit attention |
| 5 | `JANG_4L` | 8 | 6 | 4 | High quality 4-bit |
| 6 | `JANG_6M` | 8 | 6 | 6 | Near-lossless |

Use `-p 2` as shorthand for `JANG_2L`, `-p 3` for `JANG_3M`, etc.

## Supported Architectures

| Architecture | Examples | Tested |
|-------------|----------|--------|
| Dense Transformer | Llama, Qwen, Gemma, Phi, Mistral | ✅ |
| Mixture of Experts | Mixtral, Qwen3.5 MoE, DeepSeek, MiniMax | ✅ |
| Hybrid SSM + Attention | Jamba, Zamba, Nemotron-H | ✅ |
| Linear Attention | Qwen3.5 GatedDeltaNet | ✅ |
| Multi-head Latent Attention | DeepSeek-V3/R1 | ✅ |
| Vision-Language | Qwen-VL, LLaVA, Pixtral | ✅ |
| Pure SSM | Mamba, Mamba2 | ✅ |
| FP8 Source Models | MiniMax-M2.5, DeepSeek FP8 | ✅ |

## Pre-quantized Models

Available on [HuggingFace](https://huggingface.co/JANGQ-AI):

| Model | Profile | Free-form Score | Download |
|-------|---------|-------|----------|
| Qwen3.5-122B-A10B | JANG_1L | 6/6 | [JANGQ-AI/Qwen3.5-122B-A10B-JANG_1L](https://huggingface.co/JANGQ-AI/Qwen3.5-122B-A10B-JANG_1L) |
| Qwen3.5-35B-A3B | JANG_2L | 4/6 | [JANGQ-AI/Qwen3.5-35B-A3B-JANG_2L](https://huggingface.co/JANGQ-AI/Qwen3.5-35B-A3B-JANG_2L) |
| Qwen3.5-27B | JANG_1L | 4/6 | [JANGQ-AI/Qwen3.5-27B-JANG_1L](https://huggingface.co/JANGQ-AI/Qwen3.5-27B-JANG_1L) |

## Format

JANG v1.1 uses `.jang.safetensors` — standard safetensors with per-tensor quantized weights. See [FORMAT.md](FORMAT.md) for the complete specification.
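
Because the container is standard safetensors, the tensor index can be inspected with the Python standard library alone. The sketch below relies only on the general safetensors layout (an 8-byte little-endian header length followed by a JSON header). It assumes nothing about JANG's tensor naming or how bit widths are recorded, which are defined in FORMAT.md. The helper names and the file path in the comment are hypothetical.

```python
import json
import struct


def read_safetensors_header(path):
    """Return the JSON header of a safetensors file as a dict."""
    with open(path, "rb") as f:
        (header_len,) = struct.unpack("<Q", f.read(8))  # little-endian uint64
        return json.loads(f.read(header_len).decode("utf-8"))


def list_tensors(path):
    """Yield (name, dtype, shape) for every tensor entry in the header."""
    for name, info in read_safetensors_header(path).items():
        if name == "__metadata__":  # optional free-form metadata, not a tensor
            continue
        yield name, info["dtype"], info["shape"]


# Example call (hypothetical path):
# for name, dtype, shape in list_tensors("path/to/model.jang.safetensors"):
#     print(name, dtype, shape)
```

Reading only the header avoids loading any weight data, so this works even on files much larger than available memory.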

## License

Apache 2.0

## Author

Created by Jinho Jang

<p align="center">
  <a href="https://jangq.ai">jangq.ai</a> •
  <a href="https://github.com/jjang-ai">GitHub</a> •
  <a href="https://huggingface.co/JANGQ-AI">HuggingFace</a>
</p>
