Metadata-Version: 2.4
Name: rk-transformers
Version: 0.1.0
Summary: Accelerate Hugging Face Transformers on Rockchip NPUs.
Author-email: Emmanuel Cortes <manny@derifyai.com>
Maintainer-email: Emmanuel Cortes <manny@derifyai.com>
License: Apache 2.0
Project-URL: Repository, https://github.com/emapco/rk-transformers
Project-URL: Homepage, https://github.com/emapco/rk-transformers
Project-URL: Documentation, https://github.com/emapco/rk-transformers#readme
Project-URL: Bug Tracker, https://github.com/emapco/rk-transformers/issues
Project-URL: Changelog, https://github.com/emapco/rk-transformers/releases
Keywords: rknn,npu,transformers,nlp,embeddings,edge-ai
Classifier: Development Status :: 4 - Beta
Classifier: Intended Audience :: Science/Research
Classifier: Intended Audience :: Developers
Classifier: Operating System :: OS Independent
Classifier: License :: OSI Approved :: Apache Software License
Classifier: Programming Language :: Python :: 3
Classifier: Programming Language :: Python :: 3.10
Classifier: Programming Language :: Python :: 3.11
Classifier: Programming Language :: Python :: 3.12
Classifier: Topic :: Scientific/Engineering :: Artificial Intelligence
Requires-Python: >=3.10
Description-Content-Type: text/markdown
License-File: LICENSE
License-File: NOTICE
Requires-Dist: onnx<=1.18.0,>=1.16.1
Requires-Dist: onnxruntime<2.0.0,>=1.23.2
Requires-Dist: optimum[onnx]<3.0.0,>=2.0.0
Requires-Dist: torch<3.0.0,>=2.2.0
Requires-Dist: transformers[torch]<5.0.0,>=4.55.4
Requires-Dist: huggingface-hub>=0.36.0
Requires-Dist: sentence-transformers<6.0.0,>=5.0.0
Provides-Extra: export
Requires-Dist: rknn-toolkit2==2.3.2; extra == "export"
Requires-Dist: datasets; extra == "export"
Requires-Dist: numpy==1.26.4; extra == "export"
Provides-Extra: inference
Requires-Dist: rknn-toolkit-lite2==2.3.2; extra == "inference"
Requires-Dist: numpy<3.0.0,>=1.26.4; extra == "inference"
Provides-Extra: dev
Requires-Dist: pytest; extra == "dev"
Requires-Dist: pytest-cov; extra == "dev"
Requires-Dist: pytest-xdist; extra == "dev"
Requires-Dist: pytest-env; extra == "dev"
Requires-Dist: pre-commit; extra == "dev"
Requires-Dist: build; extra == "dev"
Requires-Dist: ruff; extra == "dev"
Requires-Dist: ninja; extra == "dev"
Dynamic: license-file

# RK-Transformers: Accelerate Hugging Face Transformers on Rockchip NPUs

<div align="center">

[![License](https://img.shields.io/badge/License-Apache_2.0-blue.svg)](https://opensource.org/licenses/Apache-2.0)
[![Python 3.10+](https://img.shields.io/badge/python-3.10+-blue.svg)](https://www.python.org/downloads/)
[![Tests](https://img.shields.io/badge/tests-passing-brightgreen.svg)](https://github.com/emapco/rk-transformers)
[![Star on GitHub](https://img.shields.io/github/stars/emapco/rk-transformers?style=social)](https://github.com/emapco/rk-transformers)

</div>

## 🚀 Overview

**RK-Transformers** is a runtime library that integrates Hugging Face `transformers` and `sentence-transformers` with Rockchip's RKNN Neural Processing Units (NPUs). It enables efficient deployment of transformer models on edge devices powered by Rockchip SoCs (RK3588, RK3576, etc.).

## ✨ Key Features

### 🔄 Model Export & Conversion

- **Automatic ONNX Export**: Converts Hugging Face models to ONNX with intelligent input detection
- **RKNN Optimization**: Exports to RKNN format with configurable optimization levels (O0-O3)
- **Quantization**: Post-training quantization (INT8, FP16) with calibration dataset support
- **Push to Hub**: Direct integration with Hugging Face Hub for model versioning

### ⚡ High-Performance Inference

- **NPU Acceleration**: Leverage Rockchip's hardware NPU for 10-100x speedup
- **Multi-Core Support**: Automatic core selection and load balancing across NPU cores
- **Memory Efficient**: Optimized for edge devices with limited RAM

### 🧩 Framework Integration

- **Sentence Transformers**: Drop-in replacement with `backend="rknn"` parameter
- **Transformers API**: Compatible with standard Hugging Face pipelines
- **Multiple Tasks**: Feature extraction, masked LM, sequence classification

## 📦 Installation

### Prerequisites

- Python 3.10 or later
- Linux-based OS (Ubuntu 24.04+ recommended)
- For export: development machine with x86_64 or arm64 architecture
- For inference: Rockchip device with RKNPU2 support (RK3588, RK3576, etc.)

### Quick Install

`uv` is recommended for faster installation and a smaller environment footprint.

#### For Inference (on Rockchip devices [arm64])

```bash
uv venv
uv pip install rk-transformers[inference]
```

This installs runtime dependencies including:

- `rknn-toolkit-lite2` (2.3.2)
- `sentence-transformers` (5.x)
- `numpy`, `torch`, `transformers`

#### For Model Export (on development machines [x86_64, arm64])

```bash
uv venv
uv pip install rk-transformers[export]
uv pip install torch==2.4.0  # workaround for rknn-toolkit2 dependency
```

This installs export dependencies including:

- `rknn-toolkit2` (2.3.2)
- `sentence-transformers` (5.x)
- `numpy`, `torch`, `transformers`, `optimum[onnx]`, `datasets`
- All inference dependencies (except `rknn-toolkit-lite2`)

#### For Development (on development machines [x86_64, arm64])

```bash
# Clone the repository
git clone https://github.com/emapco/rk-transformers.git
cd rk-transformers

# Install with development tools
uv venv
uv pip install -e .[dev,export]
uv pip install torch==2.4.0  # workaround for rknn-toolkit2 dependency
```

Development dependencies include:

- `pytest`, `pytest-cov`, `pytest-xdist`
- `ruff` (linting and formatting)
- `pre-commit` (git hooks)

## 🎯 Quick Start

### 1. Export a Model to RKNN

```bash
# Export a Sentence Transformer model from Hugging Face Hub (float16)
rk-transformers-cli export \
  --model sentence-transformers/all-MiniLM-L6-v2 \
  --platform rk3588 \
  --optimization-level 3 \
  --opset 19  # Default is 18

# Export with custom dataset for quantization (int8)
rk-transformers-cli export \
  --model sentence-transformers/all-MiniLM-L6-v2 \
  --platform rk3588 \
  --quantize \
  --dtype w8a8 \
  --dataset sentence-transformers/natural-questions \
  --dataset-split test \
  --dataset-size 128 \
  --max-seq-length 128 # Default is 512

# Export a local ONNX model
rk-transformers-cli export \
  --model ./my-model/model.onnx \
  --platform rk3588 \
  --batch-size 4 # Default is 1
```

### 2. Run Inference with Sentence Transformers

```python
from sentence_transformers import SentenceTransformer

from rktransformers import patch_sentence_transformer

# Apply RKNN backend patch
patch_sentence_transformer()

# Load model with RKNN backend
model = SentenceTransformer(
    "eacortes/all-MiniLM-L6-v2",
    backend="rknn",
    model_kwargs={"platform": "rk3588", "core_mask": "all"},
)

# Generate embeddings
sentences = ["This is a test sentence", "Another example"]
embeddings = model.encode(sentences)
print(embeddings.shape)  # (2, 384)

# Load specific quantized model file
model = SentenceTransformer(
    "eacortes/all-MiniLM-L6-v2",
    backend="rknn",
    model_kwargs={"platform": "rk3588", "file_name": "rknn/model_w8a8.rknn"},
)
```

### 3. Use RK-Transformers API Directly

```python
from transformers import AutoTokenizer

from rktransformers import RKRTModelForFeatureExtraction

# Load tokenizer and model
tokenizer = AutoTokenizer.from_pretrained("eacortes/all-MiniLM-L6-v2")
model = RKRTModelForFeatureExtraction.from_pretrained("eacortes/all-MiniLM-L6-v2", platform="rk3588", core_mask="auto")

# Tokenize and run inference
inputs = tokenizer(
    ["Sample text for embedding"],
    padding="max_length",
    truncation=True,
    return_tensors="np",
)

outputs = model(**inputs)
embeddings = outputs.last_hidden_state.mean(axis=1)  # Mean pooling
print(embeddings.shape)  # (1, 384)

# Load specific quantized model file
model = RKRTModelForFeatureExtraction.from_pretrained(
    "eacortes/all-MiniLM-L6-v2", platform="rk3588", file_name="rknn/model_w8a8.rknn"
)
```
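
The plain `.mean(axis=1)` in the example above averages over every position, including the padding added by `padding="max_length"`. A minimal sketch of attention-mask-weighted mean pooling follows, assuming the hidden states and mask are available as NumPy arrays of shape `(batch, seq_len, hidden)` and `(batch, seq_len)`. `masked_mean_pool` is a hypothetical helper for illustration and is not part of `rk-transformers`.

```python
import numpy as np


def masked_mean_pool(last_hidden_state: np.ndarray, attention_mask: np.ndarray) -> np.ndarray:
    """Average token embeddings over real (non-padding) positions only."""
    # Expand the mask to (batch, seq_len, 1) so it broadcasts over the hidden axis.
    mask = attention_mask[..., None].astype(last_hidden_state.dtype)
    summed = (last_hidden_state * mask).sum(axis=1)
    # Guard against division by zero for a row that is entirely padding.
    counts = np.clip(mask.sum(axis=1), 1e-9, None)
    return summed / counts


# Toy input: one sequence of 4 positions, the last two are padding.
hidden = np.array([[[1.0, 2.0], [3.0, 4.0], [9.0, 9.0], [9.0, 9.0]]])
mask = np.array([[1, 1, 0, 0]])
pooled = masked_mean_pool(hidden, mask)
print(pooled)  # [[2. 3.]]
```

With real model outputs this would be called as `masked_mean_pool(outputs.last_hidden_state, inputs["attention_mask"])`, provided both are NumPy arrays.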

### 4. Use Transformers Pipelines

```python
from transformers import pipeline

from rktransformers import RKRTModelForMaskedLM

# Load the RKNN model
model = RKRTModelForMaskedLM.from_pretrained(
    "eacortes/bert-base-uncased", platform="rk3588", file_name="rknn/model_w8a8.rknn"
)

# Create a fill-mask pipeline with the RKNN-accelerated model
fill_mask = pipeline(
    "fill-mask",
    model=model,
    tokenizer="eacortes/bert-base-uncased",
    framework="pt",  # required for RKNN
)

# Run inference
results = fill_mask("Paris is the [MASK] of France.")
print(results)
```

### 5. Export Programmatically

```python
from rktransformers import (
    OptimizationConfig,
    QuantizationConfig,
    RKNNConfig,
)
from rktransformers.exporters.rknn.convert import export_rknn

config = RKNNConfig(
    model_id_or_path="sentence-transformers/all-MiniLM-L6-v2",
    output_path="./my-exported-model",
    target_platform="rk3588",
    batch_size=1,
    max_seq_length=128,
    quantization=QuantizationConfig(
        quantized_dtype="w8a8",
        dataset_name="wikitext",
        dataset_size=100,
    ),
    optimization=OptimizationConfig(optimization_level=3),
)

export_rknn(config)
```

## ⚙️ NPU Core Configuration

Rockchip SoCs with multiple NPU cores (like RK3588 with 3 cores or RK3576 with 2 cores) support flexible core allocation strategies through the `core_mask` parameter. Choosing the right core mask can optimize performance based on your workload and system conditions.

### Available Core Mask Options

> **Note**: `core_mask` is specified at inference time.

| Value         | Description                                     | Use Case                                                                                   |
| ------------- | ----------------------------------------------- | ------------------------------------------------------------------------------------------ |
| **`"auto"`**  | Automatic mode - selects idle cores dynamically | **Recommended**: Best for most scenarios, `RKNN runtime` provides automatic load balancing |
| **`"0"`**     | NPU Core 0 only                                 | Fixed core assignment, useful for testing or when other cores are busy                     |
| **`"1"`**     | NPU Core 1 only                                 | Fixed core assignment                                                                      |
| **`"2"`**     | NPU Core 2 only                                 | Fixed core assignment (RK3588 only)                                                        |
| **`"0_1"`**   | NPU Core 0 and 1 simultaneously                 | Parallel execution across 2 cores for larger models                                        |
| **`"0_1_2"`** | NPU Core 0, 1, and 2 simultaneously             | Maximum parallelism (RK3588 only) for demanding models                                     |
| **`"all"`**   | All available NPU cores                         | Equivalent to `"0_1_2"` on RK3588, `"0_1"` on RK3576                                       |

#### Platform-Specific Notes

| Platform          | Available Cores   | Recommended Default                    |
| ----------------- | ----------------- | -------------------------------------- |
| **RK3588**        | 0, 1, 2 (3 cores) | `"auto"` or `"0_1_2"` for large models |
| **RK3576**        | 0, 1 (2 cores)    | `"auto"` or `"0_1"` for large models   |
| **RK3566/RK3568** | 0 (1 core)        | `"0"` (only option)                    |

> **Note**: Attempting to use unavailable cores (e.g., `"2"` on RK3576) may result in a runtime error.
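
Because an unavailable core may only fail at runtime, it can help to check a `core_mask` before loading the model. The sketch below encodes the two tables above as plain data. `NPU_CORES`, `DOCUMENTED_MASKS`, and `is_documented_core_mask` are hypothetical names for illustration, not part of `rk-transformers`, and the check follows the tables literally (for example, it treats `"0"` as the only option on single-core SoCs).

```python
# Core counts taken from the platform table above.
NPU_CORES = {"rk3588": 3, "rk3576": 2, "rk3566": 1, "rk3568": 1}

# Values listed in the core mask table above.
DOCUMENTED_MASKS = ("auto", "0", "1", "2", "0_1", "0_1_2", "all")


def is_documented_core_mask(platform: str, core_mask: str) -> bool:
    """Return True if core_mask is a documented value usable on the platform."""
    n_cores = NPU_CORES[platform.lower()]
    if core_mask not in DOCUMENTED_MASKS:
        return False
    if core_mask in ("auto", "all"):
        # The platform table lists "0" as the only option on single-core SoCs.
        return n_cores > 1
    # Masks such as "0_1_2" name core indices separated by underscores.
    return all(int(core) < n_cores for core in core_mask.split("_"))


print(is_documented_core_mask("rk3588", "0_1_2"))  # True
print(is_documented_core_mask("rk3576", "2"))      # False
print(is_documented_core_mask("rk3566", "0"))      # True
```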

### Usage Examples

#### Python API - Inference

```python
from rktransformers import RKRTModelForFeatureExtraction

# Auto-select idle cores (recommended for production)
model = RKRTModelForFeatureExtraction.from_pretrained("eacortes/all-MiniLM-L6-v2", platform="rk3588", core_mask="auto")

# Use specific core for dedicated workloads
model = RKRTModelForFeatureExtraction.from_pretrained(
    "eacortes/all-MiniLM-L6-v2",
    platform="rk3588",
    core_mask="1",  # Reserve core 0 for other tasks
)

# Use all cores for maximum performance
model = RKRTModelForFeatureExtraction.from_pretrained("eacortes/all-MiniLM-L6-v2", platform="rk3588", core_mask="all")
```

#### Sentence Transformers Integration

```python
from sentence_transformers import SentenceTransformer

from rktransformers import patch_sentence_transformer

patch_sentence_transformer()

model = SentenceTransformer(
    "eacortes/all-MiniLM-L6-v2",
    backend="rknn",
    model_kwargs={"platform": "rk3588", "core_mask": "auto"}
)
```

## ⚠️ RKNN Limitations

### Dynamic Inputs & Static Shapes

Current RKNN support for dynamic inputs is **experimental and not fully functional**. As a result, all models exported via `rk-transformers` use **static input shapes** defined at export time.

- **Performance Impact**: The NPU allocates memory based on the static shape. If you export with `max_seq_length=512` but only infer on 10 tokens, the NPU still processes the full 512-token padding, leading to inefficient inference.
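- **Usage**: Input tensors must match the exported dimensions. Pad shorter inputs and truncate longer ones to the exported `max_seq_length`, for example with `padding="max_length"` and `truncation=True` in the tokenizer call.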
- **Recommendation**: Export multiple versions of your model optimized for different sequence lengths (e.g., 128, 256, 512) if your workload varies significantly.
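
One way to act on the recommendation above is to route each input to the smallest exported variant that fits it. The sketch below assumes three hypothetical exports at 128, 256, and 512 tokens. `EXPORTED_LENGTHS`, `pick_seq_length`, and `pad_to_length` are illustrative names, not part of `rk-transformers`.

```python
import bisect

# Hypothetical sequence lengths, each exported as a separate .rknn file.
EXPORTED_LENGTHS = [128, 256, 512]


def pick_seq_length(num_tokens: int, exported=EXPORTED_LENGTHS) -> int:
    """Return the smallest exported length that fits num_tokens.

    Inputs longer than the largest export fall back to it and must be truncated.
    """
    lengths = sorted(exported)
    idx = bisect.bisect_left(lengths, num_tokens)
    return lengths[min(idx, len(lengths) - 1)]


def pad_to_length(token_ids: list[int], length: int, pad_id: int = 0) -> list[int]:
    """Truncate or right-pad token_ids so the result has exactly `length` items."""
    clipped = token_ids[:length]
    return clipped + [pad_id] * (length - len(clipped))


print(pick_seq_length(10))    # 128
print(pick_seq_length(129))   # 256
print(pick_seq_length(4096))  # 512
print(len(pad_to_length([101, 2023, 102], pick_seq_length(3))))  # 128
```

In practice the tokenizer would handle padding and truncation; the helper only makes the static-shape requirement explicit.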

### Quantization Support

While the tool supports various quantization data types, many are **experimental**.

- **`w8a8` (Weights 8-bit, Activations 8-bit)**: The only widely supported and tested configuration. Recommended for most use cases.
- Other formats (e.g., `w8a16`, `w16a16i`) may cause conversion failures or runtime errors depending on the specific model operators and RKNN toolkit version.

## Architecture

### Runtime Loading Workflow

1. **Model Discovery**: `RKRTModel.from_pretrained()` searches for `.rknn` files
2. **Config Matching**: Reads `rknn.json` to match platform and constraints
3. **Platform Validation**: Checks compatibility with `RKNNLite.list_support_target_platform()`
4. **Runtime Init**: Loads model to NPU with specified core mask
5. **Inference**: Runs forward pass with automatic input/output handling
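
The discovery step can be pictured with a few lines of `pathlib`. This is only a sketch of the idea for a local model directory. How `RKRTModel.from_pretrained()` actually searches is not shown here, and `find_rknn_files` is a hypothetical helper.

```python
import tempfile
from pathlib import Path


def find_rknn_files(model_dir: Path) -> list[str]:
    """Return .rknn files under model_dir as sorted POSIX-style relative paths."""
    return sorted(p.relative_to(model_dir).as_posix() for p in model_dir.rglob("*.rknn"))


# Build a throwaway directory that mimics a model repo with two variants.
with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    (root / "rknn").mkdir()
    (root / "model.rknn").touch()
    (root / "rknn" / "model_w8a8.rknn").touch()
    (root / "config.json").touch()
    found = find_rknn_files(root)

print(found)  # ['model.rknn', 'rknn/model_w8a8.rknn']
```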

### Cross-Component Communication

```mermaid
graph TB
    subgraph "Export Pipeline"
        HF[Hugging Face Model]
        OPT[Optimum ONNX Export]
        ONNX[ONNX Model]
        RKNN_TK[RKNN Toolkit]
        RKNN_FILE[.rknn File]
        
        HF -->|main_export| OPT
        OPT -->|ONNX graph| ONNX
        ONNX -->|load_onnx| RKNN_TK
        RKNN_TK -->|build/export| RKNN_FILE
    end
    
    subgraph "Inference Pipeline"
        RKNN_FILE -->|load| RKNN_LITE[RKNNLite Runtime]
        RKNN_LITE -->|init_runtime| NPU[RKNPU2 Hardware]
        NPU -->|inference| RESULTS[Model Outputs]
    end
    
    subgraph "Framework Integration"
        ST[Sentence Transformers]
        HFHUB[Hugging Face Hub]
        PATCH[patch_sentence_transformer]
        RKRT[RKRTModel Classes]
        
        ST -->|backend='rknn'| PATCH
        PATCH -->|load_rknn_model| RKRT
        HFHUB -->|from_pretrained| RKRT
        RKRT -->|inference| RKNN_LITE
    end
    
    style NPU fill:#ff9900
    style RKNN_TK fill:#66ccff
    style RKNN_LITE fill:#66ccff
```

### Configuration Files

#### `rknn.json`

Generated during export and stored alongside the model:

```json
{
  "model.rknn": {
    "platform": "rk3588",
    "batch_size": 1,
    "max_seq_length": 128,
    "model_input_names": ["input_ids", "attention_mask"],
    "quantized_dtype": "w8a8",
    "optimization_level": 3,
    ...
  },
  "rknn/optimized.rknn": {
    ...
  }
}
```

The keys are relative paths to `.rknn` files, allowing multiple optimized variants per model.
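
Since each key maps a file to its export settings, choosing a variant amounts to filtering entries. The sketch below assumes a complete `rknn.json` with the fields shown above. The entries, the `null` value for an unquantized model, and the `select_variants` helper are illustrative assumptions rather than the library's actual matching logic.

```python
import json

# Example content shaped like the rknn.json above; values are illustrative.
RKNN_JSON = """
{
  "model.rknn": {"platform": "rk3588", "max_seq_length": 512, "quantized_dtype": null},
  "rknn/model_w8a8.rknn": {"platform": "rk3588", "max_seq_length": 128, "quantized_dtype": "w8a8"},
  "rknn/model_rk3576.rknn": {"platform": "rk3576", "max_seq_length": 128, "quantized_dtype": "w8a8"}
}
"""


def select_variants(config: dict, platform: str, **constraints) -> list[str]:
    """Return relative .rknn paths whose entry matches the platform and all constraints."""
    matches = []
    for path, entry in config.items():
        if entry.get("platform") != platform:
            continue
        if all(entry.get(key) == value for key, value in constraints.items()):
            matches.append(path)
    return matches


config = json.loads(RKNN_JSON)
print(select_variants(config, "rk3588"))
# ['model.rknn', 'rknn/model_w8a8.rknn']
print(select_variants(config, "rk3588", quantized_dtype="w8a8"))
# ['rknn/model_w8a8.rknn']
```

A selected path could then be passed as `file_name` when loading the model.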

## 🤝 Contributing

We welcome contributions! Please see [CONTRIBUTING.md](CONTRIBUTING.md) for guidelines.

<details><summary>Click to show local development details.</summary>

### Development Setup

```bash
git clone https://github.com/emapco/rk-transformers.git
cd rk-transformers
uv venv
uv pip install -e .[dev,export]
uv pip install torch==2.4.0  # workaround for rknn-toolkit2 dependency
pre-commit install
```

### Running Tests

```bash
# Run all tests (excludes manual tests)
make test

# Run with coverage report
make test-cov

# Run specific test categories
pytest -m integration tests -v          # Integration tests only
pytest -m "not slow" tests -v           # Skip slow tests
pytest -m requires_rknpu tests -v       # Tests requiring Rockchip hardware
```

### Linting and Formatting

```bash
# Auto-fix linting issues and format code
make lint

# Run pre-commit hooks manually
pre-commit run --all-files
```

### Environment Diagnostics

Check your Rockchip environment and library versions:

```bash
rk-transformers-cli env
```

Output example:

```text
Copy-and-paste the text below in your GitHub issue:

- Operating system: Linux-5.10.160-rockchip-rk3588
- Rockchip Board: Orange Pi 5 Plus
- Rockchip SoC: rk3588
- RKNPU2 Driver version: 0.9.8
- RKNN Runtime version: 2.3.2
- RKNN Toolkit version: rknn-toolkit-lite2==2.3.2
- Python version: 3.12.9
- PyTorch version: 2.4.0+cpu
- HuggingFace transformers version: 4.55.4
- HuggingFace optimum version: 2.0.0
```

</details>

## 📄 License

This project is licensed under the **Apache License 2.0**.

## 🙏 Acknowledgments

- **Hugging Face** for the `transformers`, `sentence-transformers` and `optimum` libraries
- **Rockchip** for RKNN toolkit and NPU hardware

## 🔗 Links

- **Repository**: [https://github.com/emapco/rk-transformers](https://github.com/emapco/rk-transformers)
- **Issues**: [https://github.com/emapco/rk-transformers/issues](https://github.com/emapco/rk-transformers/issues)
- **Changelog**: [https://github.com/emapco/rk-transformers/releases](https://github.com/emapco/rk-transformers/releases)
- **Rockchip NPU Docs**: [https://github.com/rockchip-linux/rknn-toolkit2](https://github.com/rockchip-linux/rknn-toolkit2/tree/master/doc)
