Metadata-Version: 2.4
Name: textpolicy
Version: 0.1.1
Summary: Reinforcement learning for text generation on MLX (Apple Silicon): GRPO/GSPO, environments, rollout, rewards, LoRA/QLoRA
Project-URL: Homepage, https://github.com/teilomillet/textpolicy
Project-URL: Repository, https://github.com/teilomillet/textpolicy
Project-URL: Documentation, https://github.com/teilomillet/textpolicy#readme
Project-URL: Changelog, https://github.com/teilomillet/textpolicy/blob/main/CHANGELOG.md
Keywords: reinforcement-learning,text-generation,mlx,apple-silicon,lora,qlora,grpo,gspo,rlhf
Classifier: Programming Language :: Python :: 3
Classifier: Operating System :: MacOS
Classifier: Intended Audience :: Developers
Classifier: Topic :: Scientific/Engineering :: Artificial Intelligence
Classifier: License :: OSI Approved :: MIT License
Requires-Python: >=3.12
Description-Content-Type: text/markdown
License-File: LICENSE
Requires-Dist: numpy>=2.3.2
Requires-Dist: mlx>=0.21.0
Requires-Dist: mlx-lm>=0.21.0
Requires-Dist: gymnasium>=0.29.0
Requires-Dist: psutil>=7.0.0
Requires-Dist: wandb>=0.21.1
Requires-Dist: aiohttp>=3.12.15
Requires-Dist: pytest>=8.4.1
Provides-Extra: external
Requires-Dist: aiohttp>=3.8.0; extra == "external"
Requires-Dist: pydantic>=2.0.0; extra == "external"
Provides-Extra: dev
Requires-Dist: pytest>=7.0.0; extra == "dev"
Requires-Dist: black>=22.0.0; extra == "dev"
Requires-Dist: ruff>=0.1.0; extra == "dev"
Dynamic: license-file

# TextPolicy

Reinforcement learning toolkit for text generation on MLX (Apple Silicon).
TextPolicy provides algorithms (GRPO/GSPO), text-generation environments, a rollout runner,
reward functions with a decorator registry, and LoRA/QLoRA utilities.

## Install (uv)

```bash
uv add textpolicy
```

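Model integration uses `mlx` and `mlx-lm`. The package metadata lists both as dependencies, so `uv add textpolicy` should already pull them in. To add them explicitly: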

```bash
uv add mlx mlx-lm
```

## Quickstart

Working example using a real model and tokenizer (mlx-lm required):

```python
import textpolicy as tp
from textpolicy import load_model, create_policy
from textpolicy.environment.text_generation import TextGenerationEnv
from textpolicy.rollout import RolloutRunner, create_strategy

# 1) Load model and tokenizer (mlx-lm)
model, tokenizer = load_model("Qwen/Qwen3-0.6B")

# 2) Create a policy (controls generation)
generation_params = {"max_tokens": 25, "temperature": 0.7}
policy_fn = create_policy(model, tokenizer, generation_params)

# 3) Define a reward function (env uses this to score responses)
@tp.reward
def length_reward(prompt: str, completion: str, example: dict, **kwargs) -> float:
    return float(len(completion.split()))

# 4) Create an environment (requires a tokenizer)
env = TextGenerationEnv(["What is AI?"], length_reward, tokenizer=tokenizer)

# 5) Collect one rollout step
strategy = create_strategy('grpo')
runner = RolloutRunner(env, policy=policy_fn, strategy=strategy, max_steps=1)
buffer = runner.collect()
print(len(buffer.episodes))
```
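
The reward in step 3 is a plain function of the prompt and completion, so it can be tried out without loading a model. The sketch below is a hypothetical variant that scores closeness to a target word count instead of raw length. The name `target_length_reward` and the `target_words` field are assumptions for illustration, not part of textpolicy, and the `@tp.reward` decorator is left off so the snippet runs without the package installed.

```python
def target_length_reward(prompt: str, completion: str, example: dict, **kwargs) -> float:
    """Score a completion by how close its word count is to a target.

    Hypothetical example: `target_words` is read from `example` and is not
    a field that textpolicy defines.
    """
    target = int(example.get("target_words", 10))
    n_words = len(completion.split())
    # 1.0 at the target length, falling linearly to 0.0 as the gap grows
    return max(0.0, 1.0 - abs(n_words - target) / max(target, 1))


# Seven words against a target of seven
print(target_length_reward(
    "What is AI?",
    "AI is the study of intelligent agents.",
    {"target_words": 7},
))  # prints 1.0
```

Keeping the signature identical to `length_reward` above means the same function could be decorated and passed to the environment in its place, assuming the decorator accepts any function with this shape.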

Docs:
- Quickstart: `docs/QUICKSTART_UV.md`
- LoRA/QLoRA: `docs/10_lora_qlora.md`
- Full index: `docs/index.md`

FAQ:
- Do I need a model?
    - Yes, for generation with `create_policy`. Use `load_model()` (mlx-lm) to get `(model, tokenizer)`. Reward-only code (no generation) does not need a model.
- Do I need a tokenizer?
    - Yes. Both `TextGenerationEnv` and `TextGenerationEnvironment` require a tokenizer. `load_model()` returns one for mlx-lm models.
- How do I control generation?
    - Pass `generation_params` to `create_policy` (for example, `max_tokens`, `temperature`, `top_p`, `repetition_penalty`).
- What does `step()` return?
    - A dict with the keys `observation`, `reward`, `terminated`, `truncated`, and `info`. The runner enforces this.
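
To make the `step()` contract from the last FAQ entry concrete, here is a minimal sketch of a check for the five keys. `check_step_result` is a hypothetical helper, not a textpolicy function, and the example values are hand-written stand-ins rather than real environment output. The runner performs its own enforcement, which may differ from this.

```python
REQUIRED_STEP_KEYS = ("observation", "reward", "terminated", "truncated", "info")


def check_step_result(result: dict) -> dict:
    """Raise if `result` lacks a key named in the FAQ; otherwise return it unchanged."""
    if not isinstance(result, dict):
        raise TypeError(f"step() should return a dict, got {type(result).__name__}")
    missing = [key for key in REQUIRED_STEP_KEYS if key not in result]
    if missing:
        raise KeyError(f"step() result is missing keys: {missing}")
    return result


# Hand-written stand-in for what an environment's step() might return
example_result = {
    "observation": "What is AI?",
    "reward": 7.0,
    "terminated": True,
    "truncated": False,
    "info": {},
}
checked = check_step_result(example_result)
print(checked["reward"], checked["terminated"])
```

A check like this can be useful when writing a custom environment, since a missing key is reported at the point where `step()` returns rather than later during rollout collection.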

Examples:
- 01–06: reward functions, batch processing, minimal training
- 08: GRPO training with rollout + buffer
- 09–10: length reduction (GRPO/GSPO)
- 11: LoRA/QLoRA configuration
