Metadata-Version: 2.2
Name: llm-rewards
Version: 0.0.1
Summary: Lean, modular reward functions for RL training with LLMs
Home-page: https://github.com/dotpyu/llm-rewards
Author: Peilin Yu
Author-email: peilin_yu@brown.edu
Classifier: Development Status :: 3 - Alpha
Classifier: Intended Audience :: Science/Research
Classifier: License :: OSI Approved :: MIT License
Classifier: Operating System :: OS Independent
Classifier: Programming Language :: Python :: 3
Classifier: Programming Language :: Python :: 3.8
Classifier: Programming Language :: Python :: 3.9
Classifier: Programming Language :: Python :: 3.10
Classifier: Topic :: Scientific/Engineering :: Artificial Intelligence
Requires-Python: >=3.8
Description-Content-Type: text/markdown
Requires-Dist: torch>=2.0.0
Requires-Dist: transformers>=4.36.0
Requires-Dist: tqdm>=4.65.0
Requires-Dist: numpy>=1.24.0
Requires-Dist: pandas>=2.0.0
Requires-Dist: accelerate>=0.24.0
Requires-Dist: bitsandbytes>=0.41.0
Requires-Dist: scipy>=1.11.0
Requires-Dist: datasets>=2.14.0
Requires-Dist: sentencepiece>=0.1.99
Requires-Dist: protobuf>=4.24.0
Requires-Dist: einops>=0.6.1
Requires-Dist: typing-extensions>=4.5.0
Provides-Extra: dev
Requires-Dist: pytest>=7.4.0; extra == "dev"
Requires-Dist: black>=23.7.0; extra == "dev"
Requires-Dist: isort>=5.12.0; extra == "dev"
Requires-Dist: flake8>=6.1.0; extra == "dev"
Requires-Dist: mypy>=1.5.1; extra == "dev"
Requires-Dist: pytest-cov>=4.1.0; extra == "dev"
Requires-Dist: pre-commit>=3.3.3; extra == "dev"
Dynamic: author
Dynamic: author-email
Dynamic: classifier
Dynamic: description
Dynamic: description-content-type
Dynamic: home-page
Dynamic: provides-extra
Dynamic: requires-dist
Dynamic: requires-python
Dynamic: summary

# LLM Rewards

Lean, modular reward functions for RLHF training with LLMs. The design is framework-agnostic, with built-in support for trlx, trl, and custom training loops.

## Install

```bash
pip install -e .
```

## Quick Start

```python
from llm_rewards import RewardModel, SimpleThinkReward, LengthReward, XMLReward, create_reward_fn

# Create reward stack
rewards = [
    LengthReward(target_length=1024, weight=0.1),
    XMLReward(weight=0.5, partial_credit=True),
    RewardModel("your/reward/model", weight=1.0),
    SimpleThinkReward(weight=0.5),
]

# Get framework-agnostic reward function
reward_fn = create_reward_fn(rewards, normalize=True)

# Use with trlx
from trlx import Trainer
trainer = Trainer(reward_fn=reward_fn)
trainer.train(...)
```
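
This README does not spell out how `create_reward_fn` merges the reward stack. One plausible reading, sketched below in plain Python with no dependency on `llm_rewards`, is a weighted sum of per-component scores followed by batch-level z-score normalization when `normalize=True`. The names `combine_rewards`, `zscore`, and the two toy components are hypothetical and are not part of the package.

```python
from statistics import mean, pstdev
from typing import Callable, List, Sequence, Tuple

# Hypothetical stand-in for a reward component: maps a batch of texts to scores.
ScoreFn = Callable[[Sequence[str]], List[float]]


def zscore(values: List[float], eps: float = 1e-8) -> List[float]:
    """Shift to zero mean and scale to (roughly) unit variance across the batch."""
    mu = mean(values)
    sigma = pstdev(values)
    return [(v - mu) / (sigma + eps) for v in values]


def combine_rewards(
    components: Sequence[Tuple[ScoreFn, float]],
    texts: Sequence[str],
    normalize: bool = True,
) -> List[float]:
    """Weighted sum of component scores, optionally z-scored over the batch."""
    totals = [0.0] * len(texts)
    for score_fn, weight in components:
        for i, s in enumerate(score_fn(texts)):
            totals[i] += weight * s
    return zscore(totals) if normalize else totals


# Two toy components (assumptions for illustration only).
def length_score(batch: Sequence[str]) -> List[float]:
    return [min(len(t) / 20.0, 1.0) for t in batch]


def has_answer(batch: Sequence[str]) -> List[float]:
    return [1.0 if "answer" in t else 0.0 for t in batch]


rewards = combine_rewards(
    [(length_score, 0.1), (has_answer, 1.0)],
    ["the answer is 4", "no idea"],
    normalize=False,
)
```

Normalizing over the batch keeps the combined signal on a comparable scale regardless of how many components are stacked, at the cost of making each reward depend on the other samples in the batch.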

## Key Features

- Transformer reward models
- Reasoning validation (`SimpleThinkReward`)
- Length, format, XML validation
- Reference similarity
- Prompt relevance
- Framework adapters
- Batched inference
- Reward normalization
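
To make the XML validation item concrete, here is a minimal sketch of what a format check with partial credit could look like. It is not the package's implementation: the function name `xml_format_score` and the tag set (`think`, `answer`) are assumptions chosen for illustration, since the tags that `XMLReward` actually checks are not documented here.

```python
import re
from typing import Sequence

# Hypothetical tag set; the package's real tags are not documented in this README.
REQUIRED_TAGS = ("think", "answer")


def xml_format_score(
    text: str,
    tags: Sequence[str] = REQUIRED_TAGS,
    partial_credit: bool = True,
) -> float:
    """Score how many required <tag>...</tag> pairs appear in the text."""
    found = sum(
        1
        for tag in tags
        if re.search(rf"<{tag}>.*?</{tag}>", text, flags=re.DOTALL)
    )
    if partial_credit:
        # Fraction of required tag pairs that are present.
        return found / len(tags)
    # All-or-nothing: every required pair must be present.
    return 1.0 if found == len(tags) else 0.0
```

With partial credit enabled, a response that contains only one of the two tag pairs scores 0.5 instead of 0, which gives the policy a gradient toward the full format early in training.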

## Example Training Script

See `example/train_example.py` for a full training example using Qwen2.5-0.5B.

## Custom Rewards

```python
import torch

from llm_rewards import RewardFunction, RewardOutput

class MyReward(RewardFunction):
    def compute(self, texts, **kwargs) -> RewardOutput:
        # `score` is a placeholder for your own per-text scoring function.
        rewards = [score(text) for text in texts]
        return RewardOutput(values=torch.tensor(rewards))
```
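
The `score` helper in the snippet above is left to the reader. As one hypothetical choice, the sketch below scores a text by how close its length is to a target, measured in characters rather than tokens. The default of 1024 simply mirrors the `target_length` used in the Quick Start and is not taken from the package's `LengthReward`.

```python
def score(text: str, target_length: int = 1024) -> float:
    """Return 1.0 at the target length, decaying linearly to 0.0."""
    if target_length <= 0:
        raise ValueError("target_length must be positive")
    # Relative deviation from the target, in characters.
    deviation = abs(len(text) - target_length) / target_length
    return max(0.0, 1.0 - deviation)
```

Any function that maps a string to a float works here, as long as it returns one value per input text so the resulting tensor lines up with the batch.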

## License

MIT
