Metadata-Version: 2.4
Name: autorubric
Version: 0.3.2
Summary: A Python library encapsulating best practices for rubric-based evaluation of LLM/VLM outputs using LLM-as-a-judge.
Project-URL: Homepage, https://github.com/delip/autorubric
Project-URL: Repository, https://github.com/delip/autorubric
Project-URL: Issues, https://github.com/delip/autorubric/issues
Author: Delip Rao
License-Expression: MIT
License-File: LICENSE
Keywords: autorubric,eval,evaluation,grading,llm,llm as a judge,llm judge,rubrics
Classifier: Development Status :: 3 - Alpha
Classifier: Intended Audience :: Developers
Classifier: Intended Audience :: Education
Classifier: License :: OSI Approved :: MIT License
Classifier: Programming Language :: Python :: 3
Classifier: Programming Language :: Python :: 3.11
Classifier: Programming Language :: Python :: 3.12
Classifier: Programming Language :: Python :: 3.13
Classifier: Topic :: Education
Classifier: Topic :: Software Development :: Libraries :: Python Modules
Requires-Python: >=3.11
Requires-Dist: coolname>=2.0.0
Requires-Dist: diskcache>=5.6.0
Requires-Dist: litellm>=1.50.0
Requires-Dist: numpy>=1.24.0
Requires-Dist: pydantic>=2.0.0
Requires-Dist: python-dotenv>=1.0.0
Requires-Dist: pyyaml>=6.0.0
Requires-Dist: rich>=13.0.0
Requires-Dist: scikit-learn>=1.3.0
Requires-Dist: scipy>=1.10.0
Requires-Dist: tenacity>=8.2.0
Description-Content-Type: text/markdown


<p align="center">
  <a href="https://badge.fury.io/py/autorubric">
    <img src="https://badge.fury.io/py/autorubric.svg" alt="PyPI version" />
  </a>
  <a href="https://pypi.python.org/pypi/autorubric">
    <img src="https://img.shields.io/pypi/pyversions/autorubric.svg" alt="Python versions" />
  </a>
  <a href="https://autorubric.org">
    <img src="https://img.shields.io/badge/site-autorubric.org-blue.svg" alt="Website" />
  </a>
  <a href="https://arxiv.org/abs/2603.00077">
    <img src="https://img.shields.io/badge/arXiv-2603.00077-b31b1b.svg" alt="arXiv" />
  </a>
</p>

# AutoRubric

A Python library for evaluating text outputs against weighted criteria using LLM-as-a-judge.

```bibtex
@misc{rao2026autorubric,
  title={Autorubric: A Unified Framework for Rubric-Based LLM Evaluation},
  author={Delip Rao and Chris Callison-Burch},
  year={2026},
  eprint={2603.00077},
  archivePrefix={arXiv},
  primaryClass={cs.CL},
  url={https://arxiv.org/abs/2603.00077},
}
```

---


## Installation

```bash
pip install autorubric
```

## Quick Example

```python
import asyncio
from autorubric import Rubric, LLMConfig
from autorubric.graders import CriterionGrader

async def main():
    grader = CriterionGrader(llm_config=LLMConfig(model="openai/gpt-5.1-mini"))

    rubric = Rubric.from_dict([
        {"weight": 10.0, "requirement": "States NMC cell-level energy density in the 250-300 Wh/kg range"},
        {"weight": 8.0, "requirement": "Identifies LFP thermal runaway threshold (~270°C) as higher than NMC (~210°C)"},
        {"weight": 6.0, "requirement": "States LFP cycle life advantage (2000-5000 cycles vs 1000-2000 for NMC)"},
        {"weight": -15.0, "requirement": "Incorrectly claims LFP has higher gravimetric energy density than NMC"}
    ])

    result = await rubric.grade(
        to_grade="""NMC cathodes (LiNixMnyCozO2) achieve 250-280 Wh/kg at the cell level,
        while LFP (LiFePO4) typically reaches 150-205 Wh/kg. However, LFP offers superior
        thermal stability with decomposition onset at ~270°C compared to ~210°C for NMC,
        and delivers 2000-5000 charge cycles versus 1000-2000 for NMC.""",
        grader=grader,
        query="Compare NMC and LFP cathode materials for EV battery applications.",
    )

    print(f"Score: {result.score:.2f}")
    for criterion in result.report:
        print(f"  [{criterion.final_verdict}] {criterion.criterion.requirement}")

asyncio.run(main())
```
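
To build intuition for how positive and negative weights can combine into a single score, here is a minimal standalone sketch. It does not use AutoRubric and is not necessarily the formula the library applies. It assumes one common convention: sum the weights of all met criteria, divide by the total positive weight, and clamp the result to the range 0 to 1. The names `CriterionResult` and `weighted_score` are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CriterionResult:
    """Hypothetical container: a criterion's weight and whether the judge marked it met."""
    weight: float
    met: bool


def weighted_score(results: list[CriterionResult]) -> float:
    """Aggregate verdicts into a score in [0, 1].

    Assumed convention (not confirmed as AutoRubric's own formula):
    earned weight / total positive weight, clamped to [0, 1].
    """
    positive_total = sum(r.weight for r in results if r.weight > 0)
    if positive_total == 0:
        return 0.0  # no positive criteria, so nothing can be earned
    # Met negative-weight criteria subtract from the earned total.
    earned = sum(r.weight for r in results if r.met)
    return max(0.0, min(1.0, earned / positive_total))


# Same weights as the rubric above: all positive criteria met, the error criterion not met.
clean = [
    CriterionResult(10.0, True),
    CriterionResult(8.0, True),
    CriterionResult(6.0, True),
    CriterionResult(-15.0, False),
]
# Same response, but the judge also finds the penalized error.
with_error = clean[:3] + [CriterionResult(-15.0, True)]

print(f"{weighted_score(clean):.3f}")       # 1.000
print(f"{weighted_score(with_error):.3f}")  # 0.375
```

Under this convention, a single met negative criterion with weight -15.0 drops an otherwise perfect response from 1.000 to 0.375, which is why large negative weights suit factual errors that should dominate the grade.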

## Documentation

Full documentation, API reference, and a cookbook with several dozen recipes are available at **[autorubric.org](https://autorubric.org/)**.

| Resource      | Link                                                                  |
| ------------- | --------------------------------------------------------------------- |
| Project site  | [autorubric.org](https://autorubric.org)                              |
| API reference | [autorubric.org/docs/api](https://autorubric.org/docs/api/)           |
| Cookbook      | [autorubric.org/docs/cookbook](https://autorubric.org/docs/cookbook/) |

## Features

| Feature                    | Description                                                              |
| -------------------------- | ------------------------------------------------------------------------ |
| Weighted criteria          | Positive and negative weights with explicit requirements                 |
| Per-criterion explanations | Every verdict includes the judge's reasoning                             |
| 100+ LLM providers         | OpenAI, Anthropic, Google, Azure, Groq, Ollama, and more via LiteLLM     |
| Ensemble judging           | Combine multiple LLM judges with configurable aggregation strategies     |
| Few-shot calibration       | Provide labeled examples to improve grading consistency                  |
| Multi-choice criteria      | Ordinal and nominal scales beyond binary met/unmet verdicts              |
| Batch evaluation           | High-throughput `EvalRunner` with checkpointing and resumption           |
| Metrics & validation       | Agreement metrics, bootstrap confidence intervals, distribution analysis |
| Length penalty             | Configurable penalty for overly long responses                           |
| Thinking/reasoning support | Budget-controlled extended thinking for supported models                 |
| Response caching           | Disk-based caching to avoid redundant LLM calls                          |
| Dataset support            | Structured datasets with per-item rubrics, prompts, and ground truth     |
| YAML configuration         | Define rubrics, LLM configs, and datasets in YAML                        |
| Meta-rubric evaluation     | Evaluate and automatically improve rubric quality                        |
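
As an illustration of what ensemble judging with an aggregation strategy can look like, the sketch below implements a plain majority vote over per-judge verdicts. It is a standalone example, not AutoRubric's API: the function name `majority_vote`, the `"MET"`/`"UNMET"` labels, and the tie-breaking rule (fall back to `"UNMET"`) are all assumptions made for this example.

```python
from collections import Counter


def majority_vote(verdicts: list[str], tie_breaker: str = "UNMET") -> str:
    """Return the most common verdict across judges.

    Assumption for this sketch: when the top labels are tied,
    fall back to a conservative default instead of picking arbitrarily.
    """
    if not verdicts:
        raise ValueError("at least one verdict is required")
    counts = Counter(verdicts).most_common()
    top_label, top_count = counts[0]
    # Collect every label that shares the highest count to detect ties.
    tied = [label for label, count in counts if count == top_count]
    return top_label if len(tied) == 1 else tie_breaker


# Three hypothetical judges grading the same criterion.
print(majority_vote(["MET", "MET", "UNMET"]))  # MET
# An even split falls back to the tie breaker.
print(majority_vote(["MET", "UNMET"]))         # UNMET
```

Using an odd number of judges avoids ties on binary criteria. For multi-choice criteria, other strategies such as averaging ordinal values may be a better fit than a vote.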

## License

MIT License. See the LICENSE file for details.

## Acknowledgments

This research was developed with funding from the Defense Advanced Research Projects Agency’s (DARPA) SciFy program (Agreement No. HR00112520300). The views expressed are those of the author and do not reflect the official policy or position of the Department of Defense or the U.S. Government.