Metadata-Version: 2.4
Name: cggr
Version: 0.5.0
Summary: Confidence-Gated Gradient Routing for Efficient Transformer Training
Home-page: https://github.com/MinimaML/CGGR
Author: MinimaML
Classifier: Development Status :: 4 - Beta
Classifier: Intended Audience :: Science/Research
Classifier: License :: OSI Approved :: MIT License
Classifier: Programming Language :: Python :: 3
Classifier: Programming Language :: Python :: 3.8
Classifier: Programming Language :: Python :: 3.9
Classifier: Programming Language :: Python :: 3.10
Classifier: Programming Language :: Python :: 3.11
Classifier: Programming Language :: Python :: 3.12
Classifier: Topic :: Scientific/Engineering :: Artificial Intelligence
Classifier: Operating System :: OS Independent
Requires-Python: >=3.8
Description-Content-Type: text/markdown
License-File: LICENSE
Requires-Dist: torch>=2.0.0
Provides-Extra: cuda
Requires-Dist: triton>=2.0.0; extra == "cuda"
Provides-Extra: flash
Requires-Dist: flash-attn>=2.0.0; extra == "flash"
Provides-Extra: dev
Requires-Dist: pytest; extra == "dev"
Requires-Dist: rich; extra == "dev"
Provides-Extra: benchmark
Requires-Dist: transformers; extra == "benchmark"
Requires-Dist: rich; extra == "benchmark"
Provides-Extra: all
Requires-Dist: triton>=2.0.0; extra == "all"
Requires-Dist: flash-attn>=2.0.0; extra == "all"
Requires-Dist: transformers; extra == "all"
Requires-Dist: rich; extra == "all"
Requires-Dist: pytest; extra == "all"
Dynamic: author
Dynamic: classifier
Dynamic: description
Dynamic: description-content-type
Dynamic: home-page
Dynamic: license-file
Dynamic: provides-extra
Dynamic: requires-dist
Dynamic: requires-python
Dynamic: summary

# CGGR - Confidence-Gated Gradient Routing

> [!WARNING]
> This is highly experimental, and prepare for the worst hope for the best.

Selective loss computation for Transformer training. Only hard tokens contribute to loss, providing actual backward pass savings.

## Installation

```bash
pip install cggr
```

For CUDA acceleration with Triton kernels (Linux/Windows):

```bash
pip install cggr[cuda]
```

## Platform Compatibility

| Platform            | Triton Kernels | PyTorch Fallback |    Status    |
| ------------------- | :------------: | :--------------: | :----------: |
| CUDA (Linux)        |       ✓        |        ✓         | Full Support |
| CUDA (Windows)      |       ✓        |        ✓         | Full Support |
| ROCm (AMD)          |       ✗        |        ✓         |  Supported   |
| MPS (Apple Silicon) |       ✗        |        ✓         |  Supported   |
| CPU                 |       ✗        |        ✓         |  Supported   |

## Model Architecture Support

| Architecture                   | Auto-Detect | Notes                     |
| ------------------------------ | :---------: | ------------------------- |
| Llama/Mistral/Qwen/Gemma/Phi-3 |      ✓      | `model.layers` style      |
| GPT-2/GPT-J/Falcon/GPT-NeoX    |      ✓      | `transformer.h` style     |
| BERT/RoBERTa                   |      ✓      | `encoder.layer` style     |
| Mamba/SSM                      |      ✓      | `backbone.layers` style   |
| Other                          | Passthrough | Uses full model as router |

## Flash Attention Support

CGGR supports Flash Attention for memory-efficient attention computation:

```bash
# Install with Flash Attention support
pip install cggr[flash]
```

```python
from cggr_flash import load_model_with_flash_attention, enable_flash_attention

# Option 1: Load model with Flash Attention
model = load_model_with_flash_attention("microsoft/phi-2")

# Option 2: Enable on existing model
from transformers import AutoModelForCausalLM
model = AutoModelForCausalLM.from_pretrained("...")
model = enable_flash_attention(model)  # Auto-selects best backend
```

| Backend             | Requirements       | Speed    |
| ------------------- | ------------------ | -------- |
| `flash_attention_2` | flash-attn library | Fastest  |
| `sdpa`              | PyTorch 2.0+       | Fast     |
| `eager`             | None               | Baseline |

## Why CGGR?

| Metric              | Standard Training      | CGGR (Batch Split)      | Benefit                           |
| :------------------ | :--------------------- | :---------------------- | :-------------------------------- |
| **Backward Pass**   | 100% of tokens         | 25% of tokens           | 4x cheaper backward pass          |
| **Forward Pass**    | 1.0x cost              | ~1.1x cost (Pass 1 + 2) | Negligible overhead (~9ms)        |
| **Total Speed**     | 1.0x (Baseline)        | 1.4x - 2.0x faster      | Significant training acceleration |
| **Data Efficiency** | Learns from all tokens | Prioritizes hard tokens | Learns faster from hard examples  |
| **Memory**          | High (full graph)      | Lower (sparse graph)    | Can increase batch size           |

## Benchmark Race Results (2026)

**Model:** HuggingFaceTB/SmolLM-135M  
**Dataset:** AI-MO/NuminaMath-1.5  
**Evaluation:** AIME 2024 (Math Reasoning)  

> [!IMPORTANT]
> **Key Result:** CGGR achieved 4x higher sample throughput and +1.5% Accuracy by utilizing idle compute cycles.

| Metric                      | Standard (BS=1) | CGGR (BS=4) | Improvement |
| :-------------------------- | :-------------- | :---------- | :---------- |
| **Final Accuracy (AIME)**   | 8.00%           | **9.50%**   | **+1.50%**  |
| **Final Loss**              | 0.3610          | **0.0980**  | **-73%**    |
| **Total Samples Processed** | ~14,368         | ~58,716     | **4.08x**   |
| **Wall Clock Time**         | 6 Hours         | 6 Hours     | Same        |

### The "Free Lunch" Phenomenon
In standard training (Batch Size = 1), high-end GPUs are often latency-bound, spending more time waiting for memory transfers than doing math. 

CGGR exploits this by quadrupling the batch size (Batch Size = 4) without increasing the step time.
*   **Step Latency:** Unchanged (1.02x throughput ratio in steps/sec).
*   **Data Throughput:** 4.08x higher (samples/sec).

By processing 4x more data in the same timeframe, CGGR converges significantly faster and deeper (Loss 0.09 vs 0.36).

## Quick Start

### 1. Batch Splitting (Recommended)

The most efficient way to use CGGR is via `CGGRModel`. It uses a lightweight router to score difficulty and only computes gradients for hard tokens.

```python
from cggr import CGGRModel, create_truncated_router
from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained("...").cuda()

# Create lightweight router (shares weights, 0 extra memory)
router = create_truncated_router(model, num_layers=4)

# Wrap model
cggr_model = CGGRModel(
    model, 
    router=router, 
    min_tokens_ratio=0.25
)

# Train
loss = cggr_model(input_ids, labels=labels)
loss.backward()
```

### 2. Manual Integration (CGGRLoss)

If you cannot use `CGGRModel` (e.g. specialized architectures), you can use `CGGRLoss` manually.

```python
from cggr import CGGRLoss

criterion = CGGRLoss(
    scoring='combined',      # 'entropy', 'margin', 'loss', 'combined'
    selection='stratified',  # 'topk', 'stratified', 'sequence_aware'
    min_tokens_ratio=0.25,
    warmup_steps=1000,
)

for batch in dataloader:
    logits = model(input_ids)
    loss = criterion(logits, targets)  # Only hard tokens
    loss.backward()
    optimizer.step()
    criterion.step()
```

### 3. Native Integration (e.g. MoE/SRDE)

For architectures like SRDE that require static tensor shapes, you can use `CGGRScorer` to generate a routing mask instead of splitting the batch.

```python
from cggr import CGGRScorer

# 1. Initialize Scorer
self.scorer = CGGRScorer(router, min_tokens_ratio=0.5)

# 2. Get Mask
difficulty, mask, info = self.scorer(input_ids)

# 3. Apply Mask (Null Routing)
# mask is Boolean: True=Hard (Route to Expert), False=Easy (Skip/Null)
expert_output = expert_layer(x) * mask.unsqueeze(-1)
```


## Scoring Strategies

| Strategy   | Description               | Best For            |
| ---------- | ------------------------- | ------------------- |
| `entropy`  | High entropy = hard       | General training    |
| `margin`   | Small top-2 margin = hard | Classification      |
| `loss`     | High loss = hard          | Direct optimization |
| `combined` | All signals combined      | Best overall        |

## Selection Strategies

| Strategy         | Description                    | Benefit             |
| ---------------- | ------------------------------ | ------------------- |
| `topk`           | Top-k hardest tokens           | Simple, fast        |
| `stratified`     | Sample from difficulty buckets | Prevents forgetting |
| `sequence_aware` | Ensure coverage per sequence   | Preserves structure |

## Dynamic Thresholding

Automatically adjusts token ratio based on batch confidence:
- Low confidence → more tokens (model is learning)
- High confidence → fewer tokens (model has converged)

```python
CGGRLoss(dynamic_threshold=True, threshold_sensitivity=0.5)
```

## Full API

```python
CGGRLoss(
    # Scoring
    scoring='combined',
    
    # Selection
    selection='topk',
    num_strata=4,                  # For stratified
    min_tokens_per_sequence=1,     # For sequence_aware
    
    # Thresholding
    dynamic_threshold=True,
    threshold_sensitivity=0.5,
    
    # Curriculum
    min_tokens_ratio=0.25,
    warmup_steps=1000,
)
```

## Performance

| Config                | Backward FLOPs | Overhead |
| --------------------- | -------------- | -------- |
| Standard Loss         | 100%           | 0%       |
| **CGGR (25% tokens)** | **~25%**       | **~0%**  |
