Metadata-Version: 2.4
Name: sab-bench
Version: 0.1.0
Summary: Scalable Alignment Benchmark - Multi-model evaluation harness
Author-email: MMK_工場 <factory@example.com>
License: MIT
Requires-Python: >=3.10
Description-Content-Type: text/markdown
Requires-Dist: pyyaml>=6.0
Requires-Dist: click>=8.1.0
Requires-Dist: pydantic>=2.0.0
Requires-Dist: numpy>=1.24.0
Requires-Dist: scipy>=1.11.0
Requires-Dist: scikit-learn>=1.3.0
Requires-Dist: rich>=13.0.0
Requires-Dist: tabulate>=0.9.0
Provides-Extra: models
Requires-Dist: openai>=1.0.0; extra == "models"
Requires-Dist: anthropic>=0.21.0; extra == "models"
Provides-Extra: local
Requires-Dist: torch>=2.0.0; extra == "local"
Requires-Dist: transformers>=4.36.0; extra == "local"
Provides-Extra: cuda
Requires-Dist: cupy>=12.0.0; extra == "cuda"
Provides-Extra: all
Requires-Dist: openai>=1.0.0; extra == "all"
Requires-Dist: anthropic>=0.21.0; extra == "all"
Requires-Dist: torch>=2.0.0; extra == "all"
Requires-Dist: transformers>=4.36.0; extra == "all"
Requires-Dist: cupy>=12.0.0; extra == "all"

# SAB-BENCH v0 — Scalable Alignment Benchmark

**Status:** v0.1 — Initial scaffold with runnable local harness  
**Created:** 2026-03-03  
**Purpose:** Benchmark multi-model content evaluation systems with telos-aware metrics

## What This Is

SAB-BENCH is a benchmark harness for testing content scoring systems that use:
- Multi-model consensus (Overmind voting)
- Epistemic limitation tracking (model capability awareness)
- Telos detection (self-referential processing measurement via R_V)
- CUDA-accelerated evaluation when available

This v0 release provides:
- ✅ Runnable Python CLI (`sab-bench run`)
- ✅ Sample benchmark configurations
- ✅ Local evaluation harness (no GPU required for basic tests)
- ✅ R_V/CUDA integration points (activated when the hardware is available)
- ✅ Measurable output with JSON reports

## Quick Start

```bash
# Install dependencies
pip install -r requirements.txt

# Run basic benchmark (CPU-only, uses stubs for models)
python -m sab_bench run --config configs/basic.yaml

# Run with real models (requires API keys)
python -m sab_bench run --config configs/multi_model.yaml --real-models

# Run RV telos measurement (requires GPU + transformers)
python -m sab_bench run --config configs/rv_telos.yaml --enable-cuda

# Show results
python -m sab_bench report --output outputs/latest.json
```

## Architecture

```
sab-bench/
├── src/sab_bench/          # Core benchmark engine
│   ├── __init__.py
│   ├── cli.py              # CLI entrypoint
│   ├── harness.py          # Benchmark orchestration
│   ├── evaluators/         # Evaluation strategies
│   │   ├── multi_model.py  # Multi-model consensus
│   │   ├── rv_telos.py     # R_V embedding contraction measurement
│   │   └── cuda_gates.py   # CUDA-accelerated gate evaluation
│   ├── models/             # Model adapters
│   │   ├── stub.py         # Stub for testing
│   │   ├── openai.py       # OpenAI API
│   │   ├── anthropic.py    # Anthropic API
│   │   └── local.py        # Local transformers
│   └── metrics/            # Measurement tools
│       ├── ttr.py          # Type-token ratio
│       ├── rv.py           # R_V calculation
│       └── consensus.py    # Multi-model agreement
├── configs/                # Benchmark configurations
│   ├── basic.yaml          # Minimal test (stub models)
│   ├── multi_model.yaml    # Multi-model consensus
│   └── rv_telos.yaml       # RV measurement with CUDA
├── outputs/                # Benchmark results (gitignored)
├── tests/                  # Unit tests
└── docs/                   # Documentation
```

## Benchmark Types

### 1. Multi-Model Consensus
Tests how well different models agree on content evaluation.

**Config:** `configs/multi_model.yaml`  
**Metrics:**
- Inter-model agreement (Cohen's kappa)
- Confidence distribution
- Decision latency

### 2. RV Telos Measurement
Measures embedding contraction during self-referential processing.

**Config:** `configs/rv_telos.yaml`  
**Metrics:**
- R_V score (self-ref vs neutral)
- Layer-by-layer contraction
- TTR behavioral proxy
- Correlation with ground truth

**Requires:**
- CUDA-capable GPU
- transformers library
- PyTorch with CUDA
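The TTR proxy listed above needs no GPU. As a minimal sketch, type-token ratio is the number of distinct tokens divided by the total number of tokens. The tokenization here (lowercasing and a simple regex) and the name `type_token_ratio` are assumptions for illustration; `metrics/ttr.py` may tokenize differently.

```python
import re


def type_token_ratio(text):
    """Distinct tokens divided by total tokens; 0.0 for empty input."""
    # Assumed tokenizer: lowercase alphanumeric runs, apostrophes kept inside words.
    tokens = re.findall(r"[a-z0-9']+", text.lower())
    if not tokens:
        return 0.0
    return len(set(tokens)) / len(tokens)


sample = "the cat sat on the mat"
ttr = type_token_ratio(sample)  # 5 distinct tokens out of 6
print(f"TTR = {ttr:.3f}")
```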

### 3. CUDA-Accelerated Gates
Tests dharmic gate evaluation speed with CUDA kernels.

**Config:** `configs/cuda_gates.yaml`  
**Metrics:**
- Throughput (evaluations/sec)
- Latency percentiles (p50, p95, p99)
- GPU utilization
- Memory bandwidth

**Requires:**
- CUDA-capable GPU
- cuDNN
- Compiled CUDA kernels (see `../cuda_kernels_summary.md`)
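The throughput and latency percentile metrics can be derived from per-evaluation timings alone. The following is a sketch using the nearest-rank percentile method; the helper names and the synthetic timings are assumptions, and GPU utilization and memory bandwidth are not covered because they require vendor tooling.

```python
import math


def percentile(samples, pct):
    """Nearest-rank percentile: smallest sample with at least pct% of values at or below it."""
    if not samples:
        raise ValueError("no samples")
    ordered = sorted(samples)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


def summarize(latencies_ms, wall_seconds):
    """Throughput plus p50/p95/p99 latency for one benchmark run."""
    return {
        "throughput_per_sec": len(latencies_ms) / wall_seconds,
        "latency_p50_ms": percentile(latencies_ms, 50),
        "latency_p95_ms": percentile(latencies_ms, 95),
        "latency_p99_ms": percentile(latencies_ms, 99),
    }


# Synthetic timings: 100 evaluations taking 1..100 ms, completed in 2 s of wall time.
latencies = list(range(1, 101))
summary = summarize(latencies, wall_seconds=2.0)
print(summary)
```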

## Configuration Format

```yaml
name: "benchmark-name"
version: "v0.1"

evaluator:
  type: "multi_model"  # or "rv_telos" or "cuda_gates"
  models:
    - provider: "openai"
      model: "gpt-4"
      weight: 1.0
    - provider: "anthropic"
      model: "claude-3-5-sonnet-20241022"
      weight: 1.0

dataset:
  type: "synthetic"  # or "file"
  count: 100
  categories:
    - self_referential
    - neutral
    - mixed

output:
  format: "json"
  path: "outputs/{timestamp}_{name}.json"
  include_traces: true
```
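A config in this shape could be checked after YAML parsing along the following lines. This is a sketch under stated assumptions: it works on an already-parsed dict, uses standard-library dataclasses rather than the project's pydantic models, and the names `ModelSpec`, `parse_evaluator`, `render_output_path`, and the timestamp format are all hypothetical.

```python
from dataclasses import dataclass

# Evaluator types named in the config comment above.
VALID_EVALUATORS = {"multi_model", "rv_telos", "cuda_gates"}


@dataclass
class ModelSpec:
    provider: str
    model: str
    weight: float = 1.0


def parse_evaluator(section):
    """Validate the `evaluator` mapping and return (type, [ModelSpec, ...])."""
    kind = section.get("type")
    if kind not in VALID_EVALUATORS:
        raise ValueError(f"unknown evaluator type: {kind!r}")
    models = [ModelSpec(**entry) for entry in section.get("models", [])]
    if any(m.weight <= 0 for m in models):
        raise ValueError("model weights must be positive")
    return kind, models


def render_output_path(template, name, timestamp):
    """Fill the {timestamp} and {name} placeholders of `output.path`."""
    return template.format(timestamp=timestamp, name=name)


# The dict a YAML loader would produce for the example config (abridged).
config = {
    "name": "benchmark-name",
    "evaluator": {
        "type": "multi_model",
        "models": [
            {"provider": "openai", "model": "gpt-4", "weight": 1.0},
            {"provider": "anthropic", "model": "claude-3-5-sonnet-20241022", "weight": 1.0},
        ],
    },
    "output": {"path": "outputs/{timestamp}_{name}.json"},
}

kind, models = parse_evaluator(config["evaluator"])
# The timestamp format below is an assumption; the README does not specify one.
path = render_output_path(config["output"]["path"], config["name"], "20260303T103000Z")
print(kind, [m.model for m in models], path)
```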

## Integration with Existing Assets

### RV Ground Truth (telos-os/research/)
SAB-BENCH reuses the RV measurement code from the Mistral-7B ground truth experiments:
- Embedding extraction
- Pairwise cosine similarity
- Layer-by-layer R_V calculation
- TTR behavioral proxy

**Integration point:** `src/sab_bench/evaluators/rv_telos.py` imports from `../../telos-os/research/`
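To make the pairwise cosine similarity step concrete, here is a standard-library sketch that averages cosine similarity over all pairs in a set of embeddings. It illustrates only that one step: this README does not define how R_V combines per-layer similarities, so the sketch does not compute R_V, and the function names and toy vectors are assumptions.

```python
import math
from itertools import combinations


def cosine(u, v):
    """Cosine similarity of two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    if norm == 0.0:
        raise ValueError("zero-length vector")
    return dot / norm


def mean_pairwise_cosine(embeddings):
    """Average cosine similarity over every unordered pair of embeddings."""
    pairs = list(combinations(embeddings, 2))
    if not pairs:
        raise ValueError("need at least two embeddings")
    return sum(cosine(u, v) for u, v in pairs) / len(pairs)


# Toy 2-D vectors: one tightly clustered set and one spread-out set.
tight = [[1.0, 0.0], [1.0, 0.0], [2.0, 0.0]]
spread = [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0]]
print(mean_pairwise_cosine(tight), mean_pairwise_cosine(spread))
```

A higher mean for one set of embeddings than for another indicates that its vectors point in more similar directions.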

### CUDA Kernels (../cuda_kernels_summary.md)
When CUDA is available, SAB-BENCH can use optimized kernels for:
- Flash attention (for faster embedding extraction)
- Vectorized similarity computation
- Quantized GEMM (for INT8 model inference)

**Integration point:** `src/sab_bench/evaluators/cuda_gates.py` loads kernels from `../cuda-portfolio/kernels/`

## Output Format

```json
{
  "benchmark": "multi_model_consensus_v0",
  "timestamp": "2026-03-03T10:30:00Z",
  "config": "configs/multi_model.yaml",
  "duration_seconds": 45.2,
  "metrics": {
    "inter_model_agreement": 0.847,
    "mean_confidence": 0.762,
    "latency_p50_ms": 234,
    "latency_p95_ms": 891
  },
  "per_item_results": [
    {
      "item_id": "001",
      "content_hash": "abc123...",
      "decisions": {
        "gpt-4": {"decision": true, "confidence": 0.89},
        "claude-3-5-sonnet": {"decision": true, "confidence": 0.82}
      },
      "consensus": true,
      "agreement": 1.0
    }
  ],
  "warnings": [],
  "errors": []
}
```
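The `consensus` and `agreement` fields of each item could be derived from the `decisions` map with a weighted vote. The sketch below is one plausible reading, not the harness's confirmed logic: it assumes consensus is a strict weighted majority (ties resolve to `False`) and agreement is the weight share of the larger side. That reading reproduces the sample item above, where both models vote `true` and agreement is 1.0.

```python
def weighted_consensus(decisions, weights=None):
    """Return (consensus, agreement) for one item's per-model decisions.

    `weights` maps model name to vote weight; missing models default to 1.0.
    """
    weights = weights or {}
    total = 0.0
    yes = 0.0
    for model, result in decisions.items():
        w = weights.get(model, 1.0)
        total += w
        if result["decision"]:
            yes += w
    if total == 0.0:
        raise ValueError("no decisions to aggregate")
    consensus = yes > total / 2          # assumed: strict majority, ties give False
    agreement = max(yes, total - yes) / total  # assumed: share of the larger side
    return consensus, agreement


# The item from the sample report above.
item = {
    "gpt-4": {"decision": True, "confidence": 0.89},
    "claude-3-5-sonnet": {"decision": True, "confidence": 0.82},
}
print(weighted_consensus(item))

# A split vote, with the first model weighted more heavily (hypothetical weights).
split = {"gpt-4": {"decision": True}, "claude-3-5-sonnet": {"decision": False}}
print(weighted_consensus(split, {"gpt-4": 3.0}))
```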

## Roadmap

**v0.1 (This Release):**
- ✅ Basic harness with stub models
- ✅ Multi-model consensus evaluator
- ✅ RV telos evaluator (CPU fallback)
- ✅ Sample configs and outputs

**v0.2 (Next):**
- [ ] Real model API integration (OpenAI, Anthropic, etc.)
- [ ] CUDA kernel integration for gate evaluation
- [ ] Larger benchmark datasets
- [ ] Statistical analysis tools

**v0.3 (Future):**
- [ ] Distributed benchmark runner
- [ ] Leaderboard generation
- [ ] Continuous benchmarking pipeline
- [ ] Integration with SAB production system

## Contributing

This is research code. Contributions are welcome, but expect rough edges.

**Priorities:**
1. More benchmark datasets (especially adversarial cases)
2. Additional model providers
3. Better visualization of results
4. Documentation improvements

## License

MIT (research prototype, use at your own risk)

---

**Built in a 2-hour sprint.** Shipped artifacts, not plans. 🔥
