Metadata-Version: 2.4
Name: rewardprobe
Version: 0.1.0
Summary: Pre-training stress-testing for reward functions. Find bugs in minutes on CPU instead of days into a $10K training run.
Project-URL: Homepage, https://github.com/rewardprobe/rewardprobe
Project-URL: Documentation, https://github.com/rewardprobe/rewardprobe
Project-URL: Repository, https://github.com/rewardprobe/rewardprobe
Author: rewardprobe contributors
License-Expression: Apache-2.0
Classifier: Development Status :: 3 - Alpha
Classifier: Intended Audience :: Developers
Classifier: Intended Audience :: Science/Research
Classifier: License :: OSI Approved :: Apache Software License
Classifier: Programming Language :: Python :: 3
Classifier: Programming Language :: Python :: 3.10
Classifier: Programming Language :: Python :: 3.11
Classifier: Programming Language :: Python :: 3.12
Classifier: Programming Language :: Python :: 3.13
Classifier: Topic :: Scientific/Engineering :: Artificial Intelligence
Requires-Python: >=3.10
Requires-Dist: click>=8.0
Requires-Dist: pydantic>=2.0
Requires-Dist: rich>=13.0
Provides-Extra: all
Requires-Dist: trl>=0.15.0; extra == 'all'
Requires-Dist: verifiers>=0.1.0; extra == 'all'
Provides-Extra: dev
Requires-Dist: pytest-cov>=5.0; extra == 'dev'
Requires-Dist: pytest>=8.0; extra == 'dev'
Requires-Dist: ruff>=0.8.0; extra == 'dev'
Provides-Extra: trl
Requires-Dist: trl>=0.15.0; extra == 'trl'
Provides-Extra: verifiers
Requires-Dist: verifiers>=0.1.0; extra == 'verifiers'
Description-Content-Type: text/markdown

# rewardprobe

**Know what your model will learn — before you train.**

[![PyPI](https://img.shields.io/pypi/v/rewardprobe)](https://pypi.org/project/rewardprobe/)
[![License](https://img.shields.io/badge/license-Apache%202.0-blue)](LICENSE)
[![Python](https://img.shields.io/badge/python-3.10%2B-blue)](https://python.org)

---

You write a reward function. You're about to spend $10K on a GRPO training run. rewardprobe tells you what the model will actually learn to do:

```
rewardprobe simulate — production_math_rlvr
  50 completions across 5 tasks

  2 critical found

  1.  critical
     'Shortcut' strategy scores 0.71
     A model using the shortcut strategy earns 103% of what a correct
     answer earns. It will learn to skip computation and take shortcuts
     because that's easier AND scores higher.

  2.  critical
     'Lazy correct' strategy scores only 0.07
     A correct answer without formatting scores near zero. Your reward
     function punishes correct-but-unformatted answers more than it
     punishes wrong-but-formatted ones.

  Strategy scoreboard:
    perfect              ████████████████████ 1.00
    correct_verbose      ████████████████████ 1.00
    shortcut             ██████████████░░░░░░ 0.71  ← problem
    near_miss            █████░░░░░░░░░░░░░░░ 0.29
    format_only          █████░░░░░░░░░░░░░░░ 0.29
    garbage              ███░░░░░░░░░░░░░░░░░ 0.18
    correct_lazy         █░░░░░░░░░░░░░░░░░░░ 0.07  ← problem
```

The **strategy scoreboard** shows exactly how your reward function scores different model behaviors. If a shortcut or wrong strategy scores close to a correct one, the model is likely to learn the shortcut. You see this in 30 seconds instead of discovering it 3 days into training.

---

## The Problem

You write a reward function for RL training. It looks correct. You start training. Days later, the model is gaming the reward — outputting shortcuts, copying format without thinking, or guessing. OpenAI [documented](https://openai.com/index/chain-of-thought-monitoring/) this happening with `exit(0)` and `raise SkipTest`. METR [found](https://metr.org/blog/2025-06-05-recent-reward-hacking/) frontier models monkey-patching their own graders.

The fix is to test reward functions **before** training, the same way you test code before deploying.

---

## Install

```bash
pip install rewardprobe
```

---

## Three Modes

### 1. Quick Check (free, instant, no API key)

30 deterministic probes. Catches parser bugs, edge cases, format tricks. Runs in under a second on CPU.

```bash
rewardprobe test my_reward.py::my_fn --dataset tasks.jsonl
```

```
rewardprobe — my_reward

  1 critical, 1 warning found

  1.  critical
     Correct answer in reasoning section scores 1.0 even when the
     answer field contains a wrong answer.

  2.  warning
     Different scores depending on answer tag order.

  28/30 checks passed.
```

### 2. Deep Analysis (needs API key)

Claude reads your source code, understands what each function does, and generates realistic adversarial completions. Finds bugs that static probes can't.

```bash
export ANTHROPIC_API_KEY=sk-...
rewardprobe test my_reward.py::my_fn --dataset tasks.jsonl --deep
```

This adds:
- **Code analysis** — Claude identifies logic bugs by reading your Python code
- **Adversarial completions** — wrong-but-plausible model outputs tested against your function
- **False positive filtering** — classifies each function (correctness/format/auxiliary) so findings are precise

### 3. Simulate (needs API key)

The flagship feature. Generates diverse completions spanning the full range of what a model might produce during training — from perfect solutions to garbage — and maps the reward landscape.

```bash
rewardprobe simulate my_reward.py::my_fn --dataset tasks.jsonl
```

The strategy scoreboard shows you at a glance:
- **Green strategies** (perfect, correct_lazy, correct_verbose) — what you WANT the model to learn
- **Red strategies** (shortcut, format_only, hedge, garbage) — what you DON'T want

If a red strategy scores close to or higher than a green one, your reward function has a problem.
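
The comparison can be sketched in a few lines of Python. This is an illustrative sketch, not rewardprobe's implementation: the `flag_risky` helper and its 0.9 threshold are assumptions, and the scores are copied from the sample scoreboard at the top of this README. Comparing against the mean of the green strategies is one reading that happens to reproduce the 103% figure in that sample.

```python
from statistics import mean

# Scores copied from the sample scoreboard at the top of this README.
scores = {
    "perfect": 1.00,
    "correct_verbose": 1.00,
    "shortcut": 0.71,
    "near_miss": 0.29,
    "format_only": 0.29,
    "garbage": 0.18,
    "correct_lazy": 0.07,
}

GREEN = {"perfect", "correct_lazy", "correct_verbose"}
RED = {"shortcut", "format_only", "hedge", "garbage"}


def flag_risky(scores, threshold=0.9):
    """Return red strategies scoring at least `threshold` times the mean
    green score, mapped to that ratio (hypothetical rule)."""
    green_mean = mean(s for name, s in scores.items() if name in GREEN)
    return {
        name: s / green_mean
        for name, s in scores.items()
        if name in RED and s / green_mean >= threshold
    }


risky = flag_risky(scores)
for name, ratio in risky.items():
    print(f"{name}: {ratio:.0%} of the mean green score")
```

With the sample scores, only `shortcut` is flagged: the low `correct_lazy` score drags the green mean down to 0.69, so a 0.71 shortcut edges past it.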

---

## What We Found

We ran rewardprobe against reward functions from 4 major RL codebases plus 3 non-math domains. Selected findings:

| Codebase | Domain | Key Finding |
|----------|--------|-------------|
| **verifiers/gsm8k** (Prime Intellect) | Math | Model can skip reasoning — `correct_lazy` scores 1.0 |
| **Open-R1** (HuggingFace) | Math | `first_match` mode lets models hedge with multiple answers |
| **verl** (ByteDance) | Math | `format_score` parameter can reward wrong answers |
| **willccbb GRPO gist** | Math | Returns 2.0 (outside [0,1]); rejects "42.0" for "42" |
| Custom code reward | Code | Off-by-one bugs score 0.83 — substring matching misses logic errors |
| Sentiment classifier | Text | Reasoned answers score 0.0, bare labels score 1.0 |
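
The two issues in the GRPO gist row (a score outside [0,1] and rejecting "42.0" for "42") are easy to reproduce with a toy function. The sketch below is not the gist's code: `exact_match_reward` and `numeric_reward` are hypothetical functions written only to illustrate the failure pattern and one possible fix.

```python
def exact_match_reward(completion: str, answer: str) -> float:
    """Toy reward with both problems: string equality and a 2.0 payout."""
    return 2.0 if completion.strip() == answer.strip() else 0.0


def numeric_reward(completion: str, answer: str) -> float:
    """Compare as numbers when both sides parse, and stay inside [0, 1]."""
    try:
        return 1.0 if float(completion) == float(answer) else 0.0
    except ValueError:
        # Fall back to string comparison for non-numeric answers.
        return 1.0 if completion.strip() == answer.strip() else 0.0


print(exact_match_reward("42.0", "42"))  # string comparison rejects 42.0
print(exact_match_reward("42", "42"))    # payout is outside [0, 1]
print(numeric_reward("42.0", "42"))      # numeric comparison accepts 42.0
```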

---

## Works With Any Framework

Auto-detects your reward function's signature. No configuration.

```python
# Any of these just work:
def my_reward(completion, answer): ...                     # Raw Python
def accuracy_reward(completions, solution, **kwargs): ...  # TRL / GRPO
def correctness(prompts, completions, answer, **kwargs): ... # TRL with prompts
async def correct_answer(completion, answer): ...          # verifiers
def compute_score(solution_str, ground_truth): ...         # ByteDance verl
```

```bash
rewardprobe test file.py::fn --dataset tasks.jsonl    # Just works
rewardprobe test environments/gsm8k.py                # verifiers environments too
```
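
For intuition, signature detection of this kind can be built on the standard library's `inspect` module. The sketch below is a guess at the general approach, not rewardprobe's adapter code: `classify_signature` and the labels it returns are hypothetical, and the heuristic only looks at parameter names from the examples above.

```python
import inspect


def classify_signature(fn):
    """Guess a calling convention from parameter names (hypothetical heuristic)."""
    params = list(inspect.signature(fn).parameters)
    if inspect.iscoroutinefunction(fn):
        return "async"
    if "solution_str" in params and "ground_truth" in params:
        return "verl"
    if "completions" in params:
        return "trl_with_prompts" if "prompts" in params else "trl"
    if "completion" in params:
        return "raw"
    return "unknown"


# The example signatures listed above.
def my_reward(completion, answer): ...
def accuracy_reward(completions, solution, **kwargs): ...
def correctness(prompts, completions, answer, **kwargs): ...
async def correct_answer(completion, answer): ...
def compute_score(solution_str, ground_truth): ...


for fn in (my_reward, accuracy_reward, correctness, correct_answer, compute_score):
    print(f"{fn.__name__} -> {classify_signature(fn)}")
```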

---

## GitHub Action

```yaml
- run: pip install rewardprobe
- run: rewardprobe test my_reward.py::my_fn --dataset tasks.jsonl --ci
```

The command exits with code 1 when it reports a critical finding, which fails the CI step. To run the AI analysis in CI, add `--deep` and provide `ANTHROPIC_API_KEY` as a secret.

---

## Python API

```python
from rewardprobe import Probe
from rewardprobe.adapters.auto import auto_adapt
from rewardprobe.simulator import print_simulation, simulate
from rewardprobe.tier2.client import get_client

# `my_reward` is your reward function; `tasks` is your loaded task dataset.

# Quick check
report = Probe().test_fn(my_reward, tasks)
print(report.passed)  # True / False

# Deep analysis (needs API key)
report = Probe(deep=True).test_fn(my_reward, tasks)

# Simulate (needs API key)
env = auto_adapt(my_reward, tasks)
result = simulate(env, get_client("sonnet"), n_tasks=5)
print_simulation(result)
```

---

## How It Works

**Quick Check** generates adversarial inputs (empty strings, format tricks, parser exploits, wrong-but-formatted answers) and tests your reward function against them. 30 probes across 6 families, all deterministic, all on CPU.
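
A deterministic probe of this kind can be sketched as a fixed list of inputs, each with the highest score a sound reward function should give it. The example is a minimal sketch under stated assumptions, not one of the 30 built-in probes: `PROBES`, `run_probes`, the two toy reward functions, and the `<answer>` tag format are all hypothetical.

```python
import re


def toy_reward(completion: str, answer: str) -> float:
    """Deliberately buggy: searches the whole completion, not just the answer tag."""
    return 1.0 if answer in completion else 0.0


def strict_reward(completion: str, answer: str) -> float:
    """Only accepts the answer inside <answer>...</answer> (assumed format)."""
    match = re.search(r"<answer>(.*?)</answer>", completion, re.DOTALL)
    return 1.0 if match and match.group(1).strip() == answer else 0.0


# (probe name, adversarial completion, highest score a sound reward should give)
PROBES = [
    ("empty_string", "", 0.0),
    ("wrong_but_formatted", "<answer>41</answer>", 0.0),
    ("answer_only_in_reasoning", "I think it is 42. <answer>7</answer>", 0.0),
]


def run_probes(reward_fn, answer="42"):
    """Return the names of probes whose score exceeds the allowed maximum."""
    return [
        name
        for name, completion, max_score in PROBES
        if reward_fn(completion, answer) > max_score
    ]


print("toy_reward failures:", run_probes(toy_reward))
print("strict_reward failures:", run_probes(strict_reward))
```

Because the inputs are fixed and no model is called, the same reward function always produces the same findings.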

**Deep Analysis** uses Claude to read your reward function's Python source code. It understands what the function checks, identifies logic bugs, and generates realistic wrong completions that a model might produce during training. Each completion is actually run against your function — only real exploits are reported.

**Simulate** uses Claude to generate 10 diverse completions per task, each representing a different strategy a model might learn (perfect, lazy, shortcut, hedging, garbage, etc.), and scores them all against your reward function. The strategy scoreboard shows which behaviors your reward function actually incentivizes.
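
The aggregation step can be pictured as averaging scores per strategy and rendering each average as a bar. This is a rough sketch rather than the simulator's code: `scoreboard`, `bar`, and the per-task scores are hypothetical, and the truncating 20-character bar is only inferred from the sample output at the top of this README.

```python
from collections import defaultdict
from statistics import mean


def scoreboard(results):
    """Average (strategy, score) pairs per strategy, highest first."""
    by_strategy = defaultdict(list)
    for strategy, score in results:
        by_strategy[strategy].append(score)
    averaged = {name: mean(vals) for name, vals in by_strategy.items()}
    return sorted(averaged.items(), key=lambda item: item[1], reverse=True)


def bar(score, width=20):
    """Render a score as a fixed-width bar, clamping it to [0, 1] first."""
    filled = int(max(0.0, min(1.0, score)) * width)
    return "█" * filled + "░" * (width - filled)


# Hypothetical scores for three strategies on two tasks.
results = [
    ("perfect", 1.0), ("perfect", 1.0),
    ("shortcut", 0.75), ("shortcut", 0.5),
    ("garbage", 0.0), ("garbage", 0.25),
]

for name, score in scoreboard(results):
    print(f"{name:<10} {bar(score)} {score:.2f}")
```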

---

## What rewardprobe Is NOT

- **Not a training monitor.** We run *before* training starts.
- **Not a formal prover.** We find bugs empirically with concrete inputs.
- **Not a guarantee.** A clean report means "we tested these patterns and found nothing." The nastiest reward hacks are novel and environment-specific.

---

## Contributing

See [CLAUDE.md](CLAUDE.md) for architecture, how to add attacks, and how the simulator works.

```bash
git clone https://github.com/rewardprobe/rewardprobe && cd rewardprobe
uv sync --extra dev && pytest tests/
```

Licensed under Apache 2.0.
