Metadata-Version: 2.4
Name: smol-vllm
Version: 0.1.0
Summary: From-scratch paged-attention inference engine: paged KV cache, continuous batching, preemption
Author: smol-vllm
License: MIT
Keywords: llm,inference,vllm,paged-attention,continuous-batching
Classifier: Development Status :: 4 - Beta
Classifier: License :: OSI Approved :: MIT License
Classifier: Programming Language :: Python :: 3
Classifier: Programming Language :: Python :: 3.10
Classifier: Programming Language :: Python :: 3.11
Classifier: Programming Language :: Python :: 3.12
Requires-Python: >=3.10
Description-Content-Type: text/markdown
Provides-Extra: dev
Requires-Dist: pytest; extra == "dev"

# smol-vllm

A from-scratch paged-attention inference engine with a paged KV cache, continuous batching, and preemption. Pure Python, no external dependencies.

## Install

```bash
pip install smol-vllm
```

Or from source:

```bash
pip install .
```

## Usage

```python
from smol_vllm import LLMEngine

engine = LLMEngine(num_gpu_blocks=64, block_size=16, max_batch_size=8)

# Single request (streaming): the prompt is a list of token IDs
for token in engine.generate([1, 2, 3, 4, 5], max_tokens=20):
    print(token, end=" ")

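# Batched: queue several requests, then call step() until they finish
engine.add_request([10, 20, 30], max_tokens=10)
engine.add_request([40, 50, 60], max_tokens=10)
while True:
    outputs = engine.step()  # advance the engine by one iteration
    for out in outputs:
        print(out.output_tokens)
    # Stop once every output returned by this step is marked finished
    if all(o.finished for o in outputs):
        break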
```
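
To make the `num_gpu_blocks` and `block_size` arguments more concrete, here is a minimal sketch of the block accounting a paged KV cache typically performs. It is an illustration under stated assumptions, not smol-vllm's actual implementation: `BlockAllocator` and all of its methods and attributes are hypothetical names invented for this example.

```python
import math


class BlockAllocator:
    """Hypothetical fixed-pool allocator; NOT smol-vllm's actual class."""

    def __init__(self, num_blocks: int, block_size: int) -> None:
        self.block_size = block_size
        self.free_blocks = list(range(num_blocks))    # physical block ids
        self.block_tables: dict[int, list[int]] = {}  # seq id -> physical blocks
        self.seq_lens: dict[int, int] = {}            # seq id -> tokens stored

    def blocks_needed(self, num_tokens: int) -> int:
        # A sequence occupies whole blocks, so round up.
        return math.ceil(num_tokens / self.block_size)

    def allocate(self, seq_id: int, num_tokens: int) -> list[int]:
        """Reserve blocks for a prompt of num_tokens tokens."""
        n = self.blocks_needed(num_tokens)
        if n > len(self.free_blocks):
            raise MemoryError("not enough free blocks")
        blocks = [self.free_blocks.pop() for _ in range(n)]
        self.block_tables[seq_id] = blocks
        self.seq_lens[seq_id] = num_tokens
        return blocks

    def append_token(self, seq_id: int) -> bool:
        """Reserve room for one more token; return False if the pool is exhausted."""
        if self.seq_lens[seq_id] % self.block_size == 0:  # last block is full
            if not self.free_blocks:
                return False  # a scheduler could preempt a sequence here
            self.block_tables[seq_id].append(self.free_blocks.pop())
        self.seq_lens[seq_id] += 1
        return True

    def free(self, seq_id: int) -> None:
        """Return all of a sequence's blocks to the pool."""
        self.free_blocks.extend(self.block_tables.pop(seq_id))
        del self.seq_lens[seq_id]


alloc = BlockAllocator(num_blocks=64, block_size=16)
alloc.allocate(seq_id=0, num_tokens=5)    # 5 tokens fit in 1 block
alloc.allocate(seq_id=1, num_tokens=33)   # 33 tokens need 3 blocks
print(len(alloc.free_blocks))             # 60
```

Because blocks are handed out one at a time as a sequence grows, memory is reserved only for tokens that exist. When `append_token` returns `False`, the pool is full, which is the kind of situation where an engine with preemption would evict a running sequence and free its blocks.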

## Demo

```bash
pip install smol-vllm
smol-vllm-demo
```

## License

MIT
