Metadata-Version: 2.4
Name: dokodemo-ai
Version: 0.3.0
Summary: Run trillion-parameter models on consumer GPUs via adaptive hierarchical offloading
Author-email: Tuan Aqeel Bohoran <aqeelbohoran@gmail.com>
License: Dokodemo AI Research and Commercial License
        Version 1.0, 2026
        
        Copyright (c) 2026 Tuan Aqeel Bohoran <aqeelbohoran@gmail.com>
        All rights reserved.
        
        GitHub Repository: https://github.com/aqeelbohoran/dokodemo-ai
        
        ================================================================================
        TERMS AND CONDITIONS
        ================================================================================
        
        1. DEFINITIONS
        
        "Software" means the Dokodemo AI source code, documentation, compiled binaries,
        and all associated files in this repository.
        
        "Author" means Tuan Aqeel Bohoran (aqeelbohoran@gmail.com).
        
        "Research Use" means use by an individual or academic institution for the purpose
        of non-commercial scientific research, academic teaching, personal study, or
        publication in peer-reviewed academic venues where no commercial revenue is
        generated from the Software or its outputs.
        
        "Commercial Use" means any use of the Software that is primarily intended for
        commercial advantage, monetary compensation, or business operations. This
        includes, but is not limited to: deploying the Software as part of a product or
        service offered for sale or subscription; using the Software to provide a
        managed service, API, or platform to paying customers; incorporating the
        Software into a commercial product; or using the Software to generate revenue
        or support revenue-generating activities in any form.
        
        "Enterprise Use" means any use by an organization with more than 10 employees,
        or any organization with annual revenue exceeding USD $50,000, regardless of
        whether the specific use of the Software is internal or external.
        
        "Modification" means any alteration, addition, deletion, translation,
        adaptation, or derivative work based on the Software or any portion thereof.
        
        ================================================================================
        
        2. GRANT OF LICENSE — RESEARCH USE (FREE)
        
        Subject to the terms of this License, the Author grants you a non-exclusive,
        non-transferable, royalty-free license to:
        
          (a) Download, install, and use the Software for Research Use only;
          (b) Copy the Software for backup or archival purposes;
          (c) Distribute unmodified copies of the Software to other individuals
              solely for Research Use, provided this License is included in full.
        
        This Research Use license is FREE OF CHARGE.
        
        ================================================================================
        
        3. COMMERCIAL AND ENTERPRISE USE — PAID LICENSE REQUIRED
        
        Commercial Use or Enterprise Use of the Software requires a separate paid
        commercial license agreement with the Author.
        
        To obtain a commercial license, contact: aqeelbohoran@gmail.com
        
        Unauthorized Commercial Use or Enterprise Use of the Software is strictly
        prohibited and constitutes infringement of the Author's intellectual property
        rights.
        
        ================================================================================
        
        4. RESTRICTIONS ON MODIFICATION
        
          (a) You MAY NOT create Modifications of the Software without the Author's
              prior written permission.
        
          (b) To request permission to modify the Software (e.g., for a fork,
              derivative work, or contribution), contact: aqeelbohoran@gmail.com
              or open an issue at https://github.com/aqeelbohoran/dokodemo-ai
        
          (c) Any Modifications made with the Author's permission must:
              - Clearly document all changes made;
              - Include attribution to the original Software and Author;
              - Be governed by the same license terms as this License, unless
                a separate agreement is executed with the Author.
        
          (d) Pull requests and contributions submitted to the official repository
              at https://github.com/aqeelbohoran/dokodemo-ai are considered an
              implicit request for permission and may be accepted or rejected at
              the Author's sole discretion. Accepted contributions become part of
              the Software under this License.
        
        ================================================================================
        
        5. ATTRIBUTION AND CITATION REQUIREMENT
        
          (a) Any academic paper, preprint, blog post, technical report, or other
              published work that uses the Software or is based on methods described
              in the Software MUST include a citation to:
        
                  Tuan Aqeel Bohoran. Dokodemo AI: Model-Agnostic Trillion-Parameter
                  Inference on Consumer GPUs via Adaptive Hierarchical Offloading.
                  2026. https://github.com/aqeelbohoran/dokodemo-ai
        
          (b) The official BibTeX citation is:
        
              @software{dokodemo_ai_2026,
                author  = {Bohoran, Tuan Aqeel},
                title   = {{Dokodemo AI}: Model-Agnostic Trillion-Parameter Inference
                           on Consumer GPUs via Adaptive Hierarchical Offloading},
                year    = {2026},
                url     = {https://github.com/aqeelbohoran/dokodemo-ai},
              }
        
          (c) Any product, service, or deployment based on the Software must include
              visible attribution to the Author and a link to the repository.
        
        ================================================================================
        
        6. REDISTRIBUTION
        
        You may redistribute unmodified copies of the Software, provided that:
        
          (a) This License file is included in full with the distribution;
          (b) The redistribution is for Research Use only (Commercial/Enterprise
              redistribution requires a paid license — see Section 3);
          (c) You do not misrepresent the origin of the Software or claim authorship;
          (d) You clearly indicate that the original Software is available at
              https://github.com/aqeelbohoran/dokodemo-ai
        
        ================================================================================
        
        7. INTELLECTUAL PROPERTY
        
        The Software, including all algorithms, data structures, technical
        contributions, and documentation, is the intellectual property of the Author.
        
        Nothing in this License grants you any rights to patents, trademarks, or
        other intellectual property of the Author beyond the limited license
        expressly set forth herein.
        
        ================================================================================
        
        8. NO WARRANTY
        
        THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
        IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
        FITNESS FOR A PARTICULAR PURPOSE, AND NONINFRINGEMENT. IN NO EVENT SHALL
        THE AUTHOR BE LIABLE FOR ANY CLAIM, DAMAGES, OR OTHER LIABILITY, WHETHER IN
        AN ACTION OF CONTRACT, TORT, OR OTHERWISE, ARISING FROM, OUT OF, OR IN
        CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE.
        
        ================================================================================
        
        9. LIMITATION OF LIABILITY
        
        IN NO EVENT SHALL THE AUTHOR BE LIABLE FOR ANY INDIRECT, INCIDENTAL, SPECIAL,
        EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO, LOSS OF
        DATA, LOSS OF PROFITS, OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY
        THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT, ARISING
        IN ANY WAY OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY
        OF SUCH DAMAGES.
        
        ================================================================================
        
        10. TERMINATION
        
        Your rights under this License terminate automatically if you:
          (a) Engage in Commercial or Enterprise Use without a paid license;
          (b) Create Modifications without the Author's permission;
          (c) Fail to include required attribution in published works;
          (d) Redistribute the Software in violation of Section 6.
        
        Upon termination, you must immediately cease all use and destroy all copies
        of the Software.
        
        ================================================================================
        
        11. GOVERNING LAW
        
        This License shall be governed by and construed in accordance with applicable
        law. Any disputes arising from this License shall be resolved through good-faith
        negotiation with the Author as a first step.
        
        ================================================================================
        
        CONTACT
        
        For commercial licensing inquiries, modification requests, or any questions:
        
          Tuan Aqeel Bohoran
          Email: aqeelbohoran@gmail.com
          GitHub: https://github.com/aqeelbohoran/dokodemo-ai
        
        ================================================================================
        
Project-URL: Homepage, https://github.com/tuanaqeelbohoran/dokodemo_ai
Project-URL: Documentation, https://github.com/tuanaqeelbohoran/dokodemo_ai#readme
Project-URL: Repository, https://github.com/tuanaqeelbohoran/dokodemo_ai
Project-URL: Issues, https://github.com/tuanaqeelbohoran/dokodemo_ai/issues
Project-URL: Changelog, https://github.com/tuanaqeelbohoran/dokodemo_ai/blob/main/CHANGELOG.md
Keywords: llm,inference,offloading,mixture-of-experts,large-language-models,gpu,memory-efficient
Classifier: Development Status :: 3 - Alpha
Classifier: Intended Audience :: Science/Research
Classifier: License :: Other/Proprietary License
Classifier: Programming Language :: Python :: 3
Classifier: Programming Language :: Python :: 3.9
Classifier: Programming Language :: Python :: 3.10
Classifier: Programming Language :: Python :: 3.11
Classifier: Programming Language :: Python :: 3.12
Classifier: Topic :: Scientific/Engineering :: Artificial Intelligence
Requires-Python: >=3.9
Description-Content-Type: text/markdown
License-File: LICENSE
Requires-Dist: torch>=2.0.0
Requires-Dist: transformers>=4.35.0
Requires-Dist: safetensors>=0.4.0
Requires-Dist: accelerate>=0.25.0
Requires-Dist: huggingface-hub>=0.19.0
Requires-Dist: numpy>=1.24.0
Requires-Dist: tiktoken>=0.7.0
Provides-Extra: quantization
Requires-Dist: bitsandbytes>=0.41.0; extra == "quantization"
Provides-Extra: vlm
Requires-Dist: Pillow>=9.0; extra == "vlm"
Requires-Dist: torchvision>=0.15.0; extra == "vlm"
Provides-Extra: all
Requires-Dist: dokodemo-ai[quantization,vlm]; extra == "all"
Provides-Extra: dev
Requires-Dist: pytest>=7.0; extra == "dev"
Requires-Dist: pytest-asyncio>=0.21; extra == "dev"
Requires-Dist: pytest-cov>=4.0; extra == "dev"
Dynamic: license-file

# Dokodemo AI — どこでも AI

> **Run trillion-parameter LLMs on a 4 GB GPU, or on a CPU-only machine with 4 GB of RAM.**
> Model-agnostic. Expert-aware. Anywhere.

[![PyPI](https://img.shields.io/pypi/v/dokodemo-ai)](https://pypi.org/project/dokodemo-ai/)
[![Version](https://img.shields.io/badge/version-0.3.0-blue.svg)](https://github.com/tuanaqeelbohoran/dokodemo_ai)
[![License: Research Free / Commercial Paid](https://img.shields.io/badge/License-Research%20Free%20%7C%20Commercial%20Paid-blue.svg)](LICENSE)
[![Python 3.9+](https://img.shields.io/badge/python-3.9+-blue.svg)](https://www.python.org/)

**Author**: [Tuan Aqeel Bohoran](https://github.com/tuanaqeelbohoran) — aqeelbohoran@gmail.com
**Repository**: https://github.com/tuanaqeelbohoran/dokodemo_ai

Dokodemo AI ("anywhere AI" in Japanese) runs extremely large language
models, including trillion-parameter Mixture-of-Experts models, on consumer
GPUs with as little as **4 GB of VRAM**, or on CPU-only machines with
**4 GB of RAM**, where memory-mapped FP16 weights avoid any accuracy drop.

Unlike [AirLLM](https://github.com/lyogavin/airllm), Dokodemo AI works with
**any HuggingFace causal LM** without architecture-specific code, and
cuts per-token expert I/O on MoE models by up to 64× through
**router-guided sparse expert loading**.

---

## Key Features

| Feature | AirLLM | Dokodemo AI |
|---|---|---|
| Model support | Hardcoded (Llama, Mixtral, ...) | **Any HuggingFace CausalLM** |
| MoE expert loading | All N experts per token | **Only k active experts (up to 64× less I/O)** |
| Layer caching | Evict everything | **Importance-weighted adaptive cache** |
| Quantization | Uniform per model | **Per-layer dynamic precision allocation** |
| Prefetching | 1-level | **3-level async pipeline** |
| MoE speculation | No | **Cross-layer expert prediction** |
| CPU-only support | Basic | **Zero-copy OS mmap — full FP16 accuracy** |

---

## How It Works

### 1. Universal Graph Compilation
Dokodemo analyzes any model's module tree automatically, so no
hardcoded architecture support is needed. It discovers embedding layers,
transformer blocks, MoE routers, and individual experts by traversing
the model skeleton, which is instantiated on the `meta` device and
therefore allocates no weight memory.
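
A minimal sketch of the discovery step, assuming module roles can be inferred from dotted module names alone. The name patterns and the `classify_modules` helper are hypothetical illustrations, not Dokodemo's actual compiler, which would walk the `meta`-device module tree rather than a list of strings.

```python
import re
from collections import defaultdict

# Hypothetical name patterns; real architectures vary, and a robust compiler
# would also inspect module types and shapes, not only names.
PATTERNS = [
    ("expert", re.compile(r"\.experts\.\d+$")),
    ("router", re.compile(r"\.(gate|router)$")),
    ("block", re.compile(r"\.layers\.\d+$")),
    ("embedding", re.compile(r"embed_tokens$")),
    ("lm_head", re.compile(r"lm_head$")),
]

def classify_modules(module_names):
    """Group dotted module names by role using the first matching pattern."""
    roles = defaultdict(list)
    for name in module_names:
        for role, pattern in PATTERNS:
            if pattern.search(name):
                roles[role].append(name)
                break
    return dict(roles)

names = [
    "model.embed_tokens",
    "model.layers.0",
    "model.layers.0.block_sparse_moe.gate",
    "model.layers.0.block_sparse_moe.experts.0",
    "model.layers.0.block_sparse_moe.experts.1",
    "lm_head",
]
roles = classify_modules(names)
print(roles["router"])
```

Ordering the patterns from most to least specific keeps an expert such as `...experts.0` from being mistaken for a transformer block.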

### 2. Router-Guided Sparse Expert Loading
For MoE models (Mixtral, DeepSeek-V2/V3, Kimi-K2.5, etc.):
```
Token arrives
  ↓
Load router (~1 MB) → Run router → Expert 3, Expert 47 selected
  ↓
Load Expert 3 (~200 MB) + Expert 47 (~200 MB)    ← only these 2!
  ↓
Skip Expert 0,1,2,4...127 entirely               ← 126 × 200 MB saved
```

For a 128-expert model (1T parameters) that activates 2 experts per token, this cuts expert I/O per MoE layer by **64×** (128 ÷ 2).
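
The arithmetic behind this figure can be checked with a short sketch. It uses the approximate sizes from the diagram above (about 1 MB for the router, about 200 MB per expert); `select_top_k` and `io_per_layer_mb` are hypothetical helper names, not part of the Dokodemo API.

```python
def select_top_k(router_scores, k):
    """Return the indices of the k highest-scoring experts, in ascending order."""
    ranked = sorted(range(len(router_scores)),
                    key=lambda i: router_scores[i], reverse=True)
    return sorted(ranked[:k])

def io_per_layer_mb(num_experts, k, expert_mb, router_mb=1):
    """Compare loading every expert with loading the router plus k experts."""
    dense = num_experts * expert_mb
    sparse = router_mb + k * expert_mb
    return dense, sparse

scores = [0.0] * 128
scores[3], scores[47] = 0.9, 0.8   # pretend the router prefers experts 3 and 47
active = select_top_k(scores, k=2)
dense_mb, sparse_mb = io_per_layer_mb(num_experts=128, k=2, expert_mb=200)
print(active, dense_mb, sparse_mb, round(dense_mb / sparse_mb, 1))
```

Counting the router itself, the ratio comes out slightly below 64×; the 64× figure is the expert-only ratio.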

### 3. Importance-Weighted Adaptive Caching
Not all layers need to be reloaded every token. Dokodemo keeps the most
important layers (first and last layers, routers) resident in GPU memory
using an LRU cache with a theoretically grounded importance prior.
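
One way to combine recency with an importance prior is sketched below. This is a toy illustration under stated assumptions: the `ImportanceCache` class, its eviction rule (lowest importance first, oldest first on ties), and the importance values are all hypothetical and may differ from the cache in `engine/cache.py`.

```python
class ImportanceCache:
    """Toy layer cache: evicts the lowest-importance entry, oldest first on ties."""

    def __init__(self, budget_mb):
        self.budget_mb = budget_mb
        self.entries = {}   # name -> (size_mb, importance, last_used_tick)
        self.tick = 0

    def used_mb(self):
        return sum(size for size, _, _ in self.entries.values())

    def touch(self, name):
        """Record a hit so the entry counts as recently used."""
        self.tick += 1
        size, importance, _ = self.entries[name]
        self.entries[name] = (size, importance, self.tick)

    def put(self, name, size_mb, importance):
        """Insert an entry, evicting until it fits; returns the evicted names."""
        self.tick += 1
        evicted = []
        while self.entries and self.used_mb() + size_mb > self.budget_mb:
            # Compare (importance, last_used_tick) tuples to pick the victim.
            victim = min(self.entries, key=lambda n: self.entries[n][1:])
            del self.entries[victim]
            evicted.append(victim)
        self.entries[name] = (size_mb, importance, self.tick)
        return evicted

cache = ImportanceCache(budget_mb=300)
cache.put("layers.0", 100, importance=1.0)    # first layer: high prior
cache.put("layers.15", 100, importance=0.2)   # middle layers: low prior
cache.put("layers.16", 100, importance=0.2)
evicted = cache.put("layers.31", 100, importance=1.0)  # last layer arrives
print(evicted, sorted(cache.entries))
```

A plain LRU would have evicted `layers.0` here because it is the oldest entry; the importance prior keeps it resident.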

### 4. Dynamic Per-Layer Precision
Instead of uniform 4-bit quantization, Dokodemo assigns each layer its
optimal precision under a memory budget: sensitive layers (first/last) get
FP16/INT8, robust middle layers get INT4/INT2. This reduces perplexity
degradation by ~25% vs. uniform INT4 at the same storage size.
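
A minimal greedy sketch of budgeted precision allocation follows. The sensitivity scores, the `allocate_precision` helper, and the greedy strategy are assumptions made for illustration; the allocator in `quantization/dynamic.py` may use a different objective.

```python
BITS = [2, 4, 8, 16]  # candidate precisions, lowest to highest

def allocate_precision(layer_params, sensitivity, budget_bytes):
    """Start every layer at 2 bits, then upgrade layers in order of
    decreasing sensitivity for as long as the byte budget allows."""
    bits = {name: BITS[0] for name in layer_params}
    used = sum(n * BITS[0] // 8 for n in layer_params.values())
    if used > budget_bytes:
        raise ValueError("budget too small even at the lowest precision")
    for name in sorted(layer_params, key=sensitivity.get, reverse=True):
        for candidate in BITS[1:]:
            extra = layer_params[name] * (candidate - bits[name]) // 8
            if used + extra > budget_bytes:
                break
            bits[name] = candidate
            used += extra
    return bits, used

params = {"first": 1000, "mid1": 1000, "mid2": 1000, "last": 1000}
scores = {"first": 1.0, "mid1": 0.2, "mid2": 0.1, "last": 0.9}  # made-up values
uniform_int4_bytes = sum(params.values()) * 4 // 8              # 2000 bytes
bits, used = allocate_precision(params, scores, uniform_int4_bytes)
print(bits, used)
```

With the same storage as uniform INT4, the sketch gives the first layer 8 bits and the last layer 4 bits, and leaves the two middle layers at 2 bits.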

### 5. CPU mmap Mode — Zero Accuracy Drop
On CPU-only machines, Dokodemo uses OS memory mapping (zero-copy):
- Model weights stay on disk; the OS page cache loads them on demand
- No quantization needed — full FP16 accuracy preserved
- Any model size works with as little as 4 GB RAM
- Automatic BF16 compute on Intel Sapphire Rapids / AMD Zen4 / Apple Silicon
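
The on-demand loading idea can be illustrated with Python's standard `mmap` module. This is a sketch on a small stand-in file, not Dokodemo's loader; the file layout and the `read_slice` helper are hypothetical.

```python
import mmap
import os
import struct
import tempfile

# Write a small stand-in "weights" file: 1024 little-endian float16 values.
values = [i / 8 for i in range(1024)]
path = os.path.join(tempfile.mkdtemp(), "weights.bin")
with open(path, "wb") as f:
    f.write(struct.pack("<1024e", *values))

def read_slice(path, start, count):
    """Map the file read-only and decode `count` float16 values from `start`.

    Only the pages backing the requested bytes need to be resident.
    Slicing copies those bytes out; a tensor loader could instead wrap
    the mapping directly to avoid the copy.
    """
    with open(path, "rb") as f, \
            mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as mm:
        raw = mm[start * 2:(start + count) * 2]   # 2 bytes per float16
    return list(struct.unpack(f"<{count}e", raw))

chunk = read_slice(path, start=512, count=4)
print(chunk)  # [64.0, 64.125, 64.25, 64.375]
```

Because the values are stored and read back as FP16 without any quantization step, what comes out is bit-identical to what was written.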

---

## Benchmarks

Measured on NVIDIA RTX 3060 (12.48 GB VRAM, Ampere), NVMe SSD, PyTorch 2.10 + CUDA 12.8.

| Model | Type | Params | Tok/s | TTFT | Peak VRAM | I/O/token | Status |
|---|---|---|---|---|---|---|---|
| Qwen2-0.5B-Instruct | Dense | 0.49B | **66.9 tok/s** | 15 ms | 1017 MB | — | ✅ Done |
| Qwen2-1.5B-Instruct | Dense | 1.5B | **53.5 tok/s** | 23 ms | 3117 MB | — | ✅ Done |
| Mistral-7B-Instruct-v0.2 | Dense | 7B | **0.76 tok/s** | 1286 ms | 4462 MB | — | ✅ Done |
| Mixtral-8x7B-Instruct | MoE | 13B active / 47B total | **0.07 tok/s** | 13.1 s | 3930 MB | 2.69 GB | ✅ Done |
| DeepSeek-V2-Lite-Chat | MoE + MLA | 2.4B active | **1.54 tok/s** | 688 ms | 1854 MB | 0.88 GB (11× savings) | ✅ Done |
| LLaVA-1.5-7B | VLM | 7B | **0.73 tok/s** | 3353 ms | 4905 MB | — | ✅ Done |
| Kimi-K2.5 | VLM + MoE (MLA) | 1.04T total / 8 active | **0.14 tok/s** | 6941 ms | 6724 MB | — | ✅ Done |
| DeepSeek-V3.2 | MoE | 37B active / 671B total | pending | — | — | — | ⏳ Downloading |
| MiniMax-M2.5 | MoE | — | pending | — | — | — | ⏳ Queued |
| OpenAI GPT-OSS-120B | MoE | 5.1B active / 117B total | pending | — | — | — | ⏳ Queued |

All completed runs use `compression="dynamic"`, `max_gpu_memory="4GB"`.

Mixtral-8x7B is NVMe I/O-bound at FP16: 2.69 GB/token ÷ NVMe bandwidth ≈ 13 s/token.
Sparse expert loading verified: 2 of 8 experts loaded per MoE layer (4× I/O reduction).
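
The I/O-bound estimate is simple division, and the same relation can be inverted to see what effective read throughput the quoted numbers imply. The helper names below are hypothetical; the inputs are the figures quoted above.

```python
def seconds_per_token(io_gb_per_token, bandwidth_gb_per_s):
    """Lower bound on decode latency when streaming weights is the bottleneck."""
    return io_gb_per_token / bandwidth_gb_per_s

def implied_bandwidth(io_gb_per_token, seconds):
    """Effective read throughput implied by a measured per-token latency."""
    return io_gb_per_token / seconds

bw = implied_bandwidth(2.69, 13.0)
print(round(bw, 3))                               # 0.207 (GB/s)
print(round(seconds_per_token(2.69, bw), 1))      # 13.0 (s/token)
print(round(1 / 0.07, 1))                         # 14.3 (s/token at 0.07 tok/s)
```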

See [BENCHMARKS.md](../BENCHMARKS.md) for full per-run details.

---

## What's New in v0.3.0

- **CPU-offloaded embedding + LM head**: Embedding table and LM head kept on CPU; per-token indexing / adaptive chunked transfer saves 500 MB–4.7 GB VRAM across models
- **Pinned memory for LM head**: Page-locked CPU RAM enables 3–5× faster PCIe DMA for LM head streaming
- **VisionEncoder GPU-free by default**: Vision encoder lives on CPU between encode calls; GPU-restored on demand with cache eviction to stay within budget
- **Budget-aware persistent weight accounting**: `cache.budget` reduced by actual GPU usage of embedding/norm/head after load, preventing cache overcommit
- **VRAM savings**: Mistral-7B −524 MB, LLaVA-1.5-7B −1174 MB, Kimi-K2.5 −3780 MB vs v0.2.0
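
The budget-aware accounting can be pictured as subtracting the resident weights from the configured cap. The sketch below is an assumption-laden illustration: `parse_size`, `cache_budget`, the binary-unit interpretation of `"4GB"`, and the example sizes are all hypothetical.

```python
def parse_size(text):
    """Parse strings such as "4GB" or "512MB" into bytes (binary units assumed)."""
    units = {"GB": 1024 ** 3, "MB": 1024 ** 2}
    for suffix, factor in units.items():
        if text.upper().endswith(suffix):
            return int(float(text[:-len(suffix)]) * factor)
    raise ValueError(f"unrecognised size: {text!r}")

def cache_budget(max_gpu_memory, persistent_bytes):
    """Budget left for the layer cache once resident weights are accounted for."""
    remaining = parse_size(max_gpu_memory) - sum(persistent_bytes.values())
    return max(remaining, 0)

# Made-up sizes for weights that stay on the GPU for the whole run.
resident = {"final_norm": 8 * 1024, "lm_head_chunk": 256 * 1024 ** 2}
budget = cache_budget("4GB", resident)
print(budget)
```

Measuring the resident weights after load, rather than estimating them up front, is what keeps the cache from promising memory that is already taken.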

## What's New in v0.2.0

- **Full MoE support**: Mixtral-8×7B verified end-to-end; router-guided sparse expert loading delivers a 4× I/O reduction
- **MLA attention** (DeepSeek-V2-Lite): full two-stage low-rank KV factorisation implemented — 1.54 tok/s, 688 ms TTFT
- **VLM support** (LLaVA-1.5-7B): CLIP ViT-L/14 vision encoder + MLP projector, multimodal prefill — 0.76 tok/s, 3070 ms TTFT
- **Kimi K2.5** (1.04T MoE VLM): INT4 group-size-32 expert dequantization + `text_config` flattening — 0.33 tok/s, 2479 ms TTFT
- **27 bugs fixed** across dense (3), MoE (5), MLA (2), VLM-OOM (6), VLM-quality (2), and Kimi (6) inference paths
- **NVMe HF cache**: configurable via `cache_dir=` parameter or `HF_HOME` env var
- **CPU memory budget fix**: `_auto_cpu_budget` → `min(available//4, 8 GB)` to prevent OOM on large MoE expert loads
- **Large-tensor pin_memory skip**: tensors ≥ 100 MB are no longer pinned (pinning them doubled transient RAM usage)
- **tiktoken** added to core dependencies (required by Kimi K2.5 and similar custom tokenizers)
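
The two memory rules in this list are easy to state in code. The sketch below restates them as standalone functions; `auto_cpu_budget` and `should_pin` are hypothetical names for illustration and do not reproduce the library's internal `_auto_cpu_budget`.

```python
GB = 1024 ** 3
MB = 1024 ** 2

def auto_cpu_budget(available_bytes):
    """min(available // 4, 8 GB), as described in the changelog entry above."""
    return min(available_bytes // 4, 8 * GB)

def should_pin(tensor_bytes, threshold=100 * MB):
    """Pin only tensors below the threshold; larger ones stay pageable."""
    return tensor_bytes < threshold

print(auto_cpu_budget(16 * GB) // GB)   # 4
print(auto_cpu_budget(64 * GB) // GB)   # 8
print(should_pin(50 * MB), should_pin(200 * MB))   # True False
```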

---

## Installation

```bash
pip install dokodemo-ai

# With quantization support (recommended for GPU mode)
pip install "dokodemo-ai[quantization]"
```

**Requirements:**
- Python 3.9+
- PyTorch 2.0+
- 4+ GB GPU VRAM **or** 4+ GB CPU RAM
- NVMe SSD recommended (HDD works but is slow)
- Model stored in SafeTensors format
- `tiktoken` (included in core dependencies — required for Kimi K2.5 and similar custom tokenizers)

---

## Examples

### 🖥️ GPU Example — Mixtral 8×7B on a 4 GB GPU

```python
from dokodemo_ai import AutoModel

# Works on any GPU with 4 GB+ VRAM.
# BF16 selected automatically on Ampere+ (RTX 30-series / A100 / H100).
model = AutoModel.from_pretrained(
    "mistralai/Mixtral-8x7B-Instruct-v0.1",
    compression="dynamic",    # "4bit" | "8bit" | "dynamic" | None
    max_gpu_memory="4GB",     # hard cap — safe on a 4 GB card
    num_io_workers=2,         # parallel disk → GPU streams
)

prompt = "Explain quantum entanglement in simple terms:"
inputs = model.tokenizer(prompt, return_tensors="pt")

# Generate all tokens at once
output = model.generate(
    inputs["input_ids"],
    max_new_tokens=200,
    temperature=0.7,
    top_p=0.9,
    min_p=0.05,
    repetition_penalty=1.1,
    prefill_chunk_size=512,
    kv_cache_bits=8,
)
print(model.tokenizer.decode(output[0], skip_special_tokens=True))

# Streaming — see tokens as they arrive (useful for slow disks)
for token_id, full_seq in model.stream_generate(
    inputs["input_ids"],
    max_new_tokens=200,
    temperature=0.7,
    min_p=0.05,
    repetition_penalty=1.1,
):
    print(model.tokenizer.decode([token_id], skip_special_tokens=True),
          end="", flush=True)
print()

# MoE router stats — see which experts are used most
stats = model.get_expert_stats()
# { "model.layers.0.block_sparse_moe": {3: 42, 47: 38, ...}, ... }
```

---

### 💻 CPU Example — Llama 3.1 70B with zero accuracy drop

```python
from dokodemo_ai import AutoModel

# No GPU needed. Full FP16 accuracy via OS mmap. Min 4 GB RAM.
model = AutoModel.from_pretrained("meta-llama/Meta-Llama-3.1-70B")

prompt = (
    "You are a helpful assistant.\n\n"
    "User: Write a Python function that checks if a number is prime.\n"
    "Assistant:"
)
inputs = model.tokenizer(prompt, return_tensors="pt")

output = model.generate(
    inputs["input_ids"],
    max_new_tokens=300,
    temperature=0.2,
    top_p=0.95,
    repetition_penalty=1.05,
    prefill_chunk_size=256,
)
print(model.tokenizer.decode(output[0], skip_special_tokens=True))
```

---

### CLI

```bash
# Run inference
dokodemo run mistralai/Mixtral-8x7B-Instruct-v0.1 \
    --prompt "What is the capital of France?" \
    --max-new-tokens 100 \
    --compression 4bit

# Benchmark speed
dokodemo benchmark mistralai/Mixtral-8x7B-Instruct-v0.1 \
    --compression 4bit \
    --tokens 50

# Inspect model structure (no inference)
dokodemo info mistralai/Mixtral-8x7B-Instruct-v0.1

# Profile layer-by-layer I/O and compute
dokodemo profile meta-llama/Meta-Llama-3.1-70B \
    --prompt "Hello" \
    --tokens 5
```

---

## Supported Models

Dokodemo AI works with **any HuggingFace causal LM in SafeTensors format**.

| Model | Type | Parameters | Min VRAM / RAM | Verified |
|---|---|---|---|---|
| Qwen2-0.5B / 1.5B | Dense | 0.49B – 1.5B | 1 GB | ✅ |
| Mistral-7B-Instruct-v0.2 | Dense | 7B | 4 GB | ✅ |
| Llama 3.1 8B | Dense | 8B | 2 GB | — |
| Llama 3.1 70B | Dense | 70B | 4 GB | — |
| Llama 3.1 405B | Dense | 405B | 4 GB* | — |
| Mixtral 8×7B | MoE | ~13B active / ~47B total | 4 GB | ✅ |
| Mixtral 8×22B | MoE | ~39B active / ~141B total | 4 GB | — |
| Qwen 2.5 72B | Dense | 72B | 4 GB | — |
| DeepSeek-V2-Lite | MoE + MLA | 2.4B active / 16B total | 4 GB | ✅ |
| LLaVA-1.5-7B | VLM | 7B | 4 GB | ✅ |
| Kimi-K2.5 | VLM + MoE + MLA | 8B active / 1T total | 10 GB | ✅ |
| DeepSeek-V3.2 | MoE | 37B active / 671B total | 4 GB | ⏳ |
| Any future HF model | Any | Any | 4 GB | — |

*With 4-bit compression (GPU) or mmap mode (CPU)

---

## Advanced Usage

### Custom HF Cache Location
```python
from dokodemo_ai import AutoModel

# Point the HuggingFace cache at a fast NVMe volume
model = AutoModel.from_pretrained(
    "mistralai/Mixtral-8x7B-Instruct-v0.1",
    cache_dir="/mnt/nvme1n1/hf_cache",
)
```

### Profiling
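```python
from dokodemo_ai import AutoModel

model = AutoModel.from_pretrained(
    "mistralai/Mixtral-8x7B-Instruct-v0.1",
    profiling=True,
)
# Tokenize a short prompt so generate() has input to profile
inputs = model.tokenizer("Hello", return_tensors="pt")
output = model.generate(inputs["input_ids"], max_new_tokens=20)
print(model.get_profiling_report())
```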

### MoE Expert Statistics
```python
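# `model` is a MoE model loaded with AutoModel.from_pretrained(), as in the GPU example
stats = model.get_expert_stats()
# Returns per-layer expert usage counts, e.g.
# {"model.layers.0.block_sparse_moe": {3: 42, 47: 38, ...}, ...}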
```
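
A short sketch of how the returned dictionary might be summarized, assuming it has the shape shown in the GPU example (layer name mapped to per-expert counts). The sample counts and the `hottest_experts` helper are made up for illustration.

```python
from collections import Counter

# Sample data in the shape shown in the GPU example; the counts are made up.
stats = {
    "model.layers.0.block_sparse_moe": {3: 42, 47: 38, 5: 20},
    "model.layers.1.block_sparse_moe": {3: 10, 12: 55, 47: 35},
}

def hottest_experts(stats, top_n=2):
    """Return the top_n most-used expert ids per layer, with their hit share."""
    report = {}
    for layer, counts in stats.items():
        total = sum(counts.values())
        report[layer] = [
            (expert, round(hits / total, 2))
            for expert, hits in Counter(counts).most_common(top_n)
        ]
    return report

report = hottest_experts(stats)
for layer, top in report.items():
    print(layer, top)
```

A summary like this shows which experts dominate each layer and are therefore the most likely to benefit from staying cached.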

---

## Architecture

```
dokodemo_ai/
├── auto_model.py          # AutoModel.from_pretrained() entry point
├── graph/
│   ├── compiler.py        # Universal model graph compiler
│   └── partition.py       # Memory-aware execution planning
├── engine/
│   ├── inference.py       # Core forward pass + generation loop
│   ├── cache.py           # Adaptive LRU layer cache
│   ├── streaming.py       # Async tensor streaming (3-level prefetch + mmap)
│   └── scheduler.py       # Expert-aware MoE scheduling
├── quantization/
│   └── dynamic.py         # Per-layer heterogeneous quantization
├── utils/
│   ├── memory.py          # GPU/CPU memory management
│   ├── cpu_optimize.py    # CPU BF16 detection, thread tuning, GC utilities
│   └── profiler.py        # Performance profiling
└── cli.py                 # Command-line interface
```

---

## Citation

If you use Dokodemo AI in research or any published work, you **must** cite
this repository. See [LICENSE](LICENSE) for full attribution requirements.

```bibtex
@software{dokodemo_ai_2026,
  author  = {Bohoran, Tuan Aqeel},
  title   = {{Dokodemo AI}: Model-Agnostic Trillion-Parameter Inference
             on Consumer GPUs via Adaptive Hierarchical Offloading},
  year    = {2026},
  url     = {https://github.com/tuanaqeelbohoran/dokodemo_ai},
}
```

An outline of a paper describing the technical contributions is in `paper/outline.md`.

---

## How Dokodemo Compares to AirLLM

1. **Model-agnostic**: AirLLM requires code for each architecture. Dokodemo compiles any model automatically.
2. **MoE-aware**: AirLLM loads all N experts for every token. Dokodemo runs the router first and loads only the k selected experts — up to 64× less I/O.
3. **Smart caching**: AirLLM evicts every layer after every token. Dokodemo keeps important layers resident.
4. **Better quantization**: Dokodemo assigns per-layer precision instead of uniform quantization, reducing perplexity degradation by ~25% versus uniform INT4 at the same storage size.
5. **Multi-level prefetching**: 3-level async pipeline vs. AirLLM's 1-level.
6. **CPU zero-accuracy-loss mode**: OS mmap loading preserves full FP16 accuracy on CPU-only machines.

---

## Contributing

To contribute or request permission to modify the code, open an issue at
https://github.com/tuanaqeelbohoran/dokodemo_ai or email aqeelbohoran@gmail.com.

See [paper/outline.md](paper/outline.md) for open research questions and planned features.

---

## License

**Dokodemo AI Research and Commercial License v1.0**

| Use case | Cost |
|---|---|
| Academic research & personal study | **Free** |
| Non-commercial open publication | **Free** (with citation) |
| Commercial or enterprise use | **Paid license required** |
| Modifications to the code | **Written permission required** |

- Publications using this software **must cite** the GitHub repository.
- See [LICENSE](LICENSE) for complete terms.
- For commercial licensing: **aqeelbohoran@gmail.com**

---

## Contact

**Tuan Aqeel Bohoran**
- Email: aqeelbohoran@gmail.com
- GitHub: https://github.com/tuanaqeelbohoran
- Repository: https://github.com/tuanaqeelbohoran/dokodemo_ai
