Metadata-Version: 2.4
Name: py-openjudge
Version: 0.1.7
Summary: OpenJudge: A Next-Generation Evaluation System for AI Model Assessment
Author-email: Haoran Chen <chenhaoran.chr@alibaba-inc.com>, Yuhao Cui <cyh262498@alibaba-inc.com>, Jiaji Deng <dengjiaji.djj@alibaba-inc.com>, Qingxu Fu <fuqingxu.fqx@alibaba-inc.com>, Yuan Gao <yunze.gy@alibaba-inc.com>, Sen Huang <huangsen.huang@alibaba-inc.com>, Li Yu <jinli.yl@alibaba-inc.com>, Boyin Liu <liuboyin.lby@alibaba-inc.com>, Zhaoyang Liu <jingmu.lzy@alibaba-inc.com>, Yunzhou Shi <yunzhou.syz@alibaba-inc.com>, Lipeng Xie <xielipeng.xlp@alibaba-inc.com>, Yunpeng Zhai <zhaiyunpeng.zyp@alibaba-inc.com>, Zhuo Zhang <zz297429@alibaba-inc.com>, Anni Zou <zouanni.zan@alibaba-inc.com>
License-Expression: Apache-2.0
Project-URL: Homepage, https://github.com/modelscope/OpenJudge
Project-URL: Repository, https://github.com/modelscope/OpenJudge
Project-URL: Documentation, https://modelscope.github.io/OpenJudge/
Keywords: deep-learning,evaluation,ai-model,llm
Classifier: Development Status :: 4 - Beta
Classifier: Programming Language :: Python :: 3
Classifier: Programming Language :: Python :: 3.10
Classifier: Programming Language :: Python :: 3.11
Classifier: Programming Language :: Python :: 3.12
Classifier: Operating System :: OS Independent
Classifier: Intended Audience :: Developers
Classifier: Intended Audience :: Science/Research
Classifier: Topic :: Scientific/Engineering :: Artificial Intelligence
Requires-Python: <3.13,>=3.10
Description-Content-Type: text/markdown
License-File: LICENSE
Requires-Dist: pandas<3.0.0,>=2.2.3
Requires-Dist: loguru<0.8.0,>=0.7.3
Requires-Dist: json_repair<1.0.0,>=0.54.0
Requires-Dist: pydantic<3.0.0,>=2.11.5
Requires-Dist: openai<2.0.0,>=1.85.0
Requires-Dist: tenacity<10.0.0,>=9.1.0
Requires-Dist: math-verify<0.8.0,>=0.7.0
Requires-Dist: tqdm<5.0.0,>=4.66.0
Requires-Dist: fire
Requires-Dist: numpy<2.0.0,>=1.22.0
Requires-Dist: dashscope>=1.19.0
Requires-Dist: tiktoken>=0.7.0
Requires-Dist: nltk>=3.8.1
Requires-Dist: jieba>=0.42.1
Requires-Dist: sacrebleu>=2.0.0
Requires-Dist: rouge-score>=0.1.2
Requires-Dist: python-Levenshtein>=0.20.0
Provides-Extra: dev
Requires-Dist: pre-commit; extra == "dev"
Requires-Dist: pytest<9.0.0,>=8.3.5; extra == "dev"
Requires-Dist: sphinx-gallery; extra == "dev"
Requires-Dist: furo; extra == "dev"
Requires-Dist: myst_parser; extra == "dev"
Requires-Dist: anyio; extra == "dev"
Requires-Dist: pytest-asyncio; extra == "dev"
Requires-Dist: pytest-tornasync; extra == "dev"
Requires-Dist: pytest-trio; extra == "dev"
Requires-Dist: pytest-twisted; extra == "dev"
Requires-Dist: twisted; extra == "dev"
Requires-Dist: python-dotenv; extra == "dev"
Provides-Extra: verl
Requires-Dist: transformers<5.0.0,>=4.52.4; extra == "verl"
Requires-Dist: verl; extra == "verl"
Dynamic: license-file

<div align="center">

<p align="center">
  <img src="./docs/images/logo.png" alt="Open-Judge Logo" width="500">
</p>

**Holistic Evaluation, Quality Rewards: Driving Application Excellence**


[![Python 3.10+](https://img.shields.io/badge/python-3.10+-blue)](https://pypi.org/project/open_judge/)
[![PyPI](https://img.shields.io/badge/pypi-v0.2.0-blue?logo=pypi)](https://pypi.org/project/open_judge/)
[![Documentation](https://img.shields.io/badge/docs-online-blue?logo=markdown)](https://modelscope.github.io/OpenJudge/)

[Documentation](https://modelscope.github.io/OpenJudge/) | [Contributing](./docs/community/contributing.md) | [中文](./README_zh.md)

</div>


## News

- **2025-10-20** - [Auto-Rubric: Learning to Extract Generalizable Criteria for Reward Modeling](https://arxiv.org/abs/2510.17314) - We released a new paper on learning generalizable reward criteria for robust modeling.
- **2025-10-17** - [Taming the Judge: Deconflicting AI Feedback for Stable Reinforcement Learning](https://arxiv.org/abs/2510.15514) - We introduced techniques to align judge feedback and improve RL stability.
- **2025-07-09** - Released OpenJudge v0.1.0 on [PyPI](https://pypi.org/project/open_judge/)


Evaluation and reward signals are the cornerstones of application excellence. **Holistic evaluation** enables systematic analysis of shortcomings to drive rapid iteration, while **high-quality rewards** provide the foundation for advanced optimization and fine-tuning.
Open-Judge unifies reward signals and evaluation metrics into one **Grader** interface, with pre-built graders, flexible customization, and seamless framework integration.

## Key Features

<div class="key-features" markdown>

+ **Systematic & Quality-Assured Grader Library**: Access N+ production-ready graders featuring a comprehensive taxonomy, rigorously validated for reliable performance.
    - **Multi-Scenario Coverage:** Extensive support for diverse domains including Agent, text, code, math, and multimodal tasks via specialized graders.
    - **Holistic Agent Evaluation:** Beyond final outcomes, we assess the entire lifecycle—including trajectories and specific components (Memory, Reflection, Tool Use).
    - **Quality Assurance:** Built for reliability. Every grader comes with benchmark datasets and pytest integration for immediate quality validation.

+ **Flexible Grader Building Methods**: Choose the build method that fits your requirements:
    - **Customization:** Easily extend or modify pre-defined graders to fit your specific needs.
    - **Data-Driven Rubrics:** Have a few examples but no clear rules? Use our tools to automatically generate white-box evaluation criteria (Rubrics) based on your data.
    - **Trainable Judge Models:** For high-scale scenarios, train dedicated Judge models as Graders. We support SFT, Bradley-Terry models, and Reinforcement Learning workflows.

+ **Easy Integration**: Seamlessly connect with mainstream evaluation platforms (e.g., LangSmith, LangFuse) and training frameworks (e.g., VERL) using our comprehensive tutorials and flexible APIs.
</div>
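
For readers unfamiliar with the Bradley-Terry objective mentioned above, here is a minimal, library-independent sketch of the standard pairwise formulation: the probability that a chosen response beats a rejected one is the sigmoid of their reward difference, and training minimizes the negative log of that probability. The function names are illustrative assumptions and are not Open-Judge APIs; the project's own training workflows may differ.

```python
import math


def bt_preference_probability(reward_chosen: float, reward_rejected: float) -> float:
    """P(chosen beats rejected) under Bradley-Terry: sigmoid of the reward gap."""
    return 1.0 / (1.0 + math.exp(-(reward_chosen - reward_rejected)))


def bt_loss(reward_chosen: float, reward_rejected: float) -> float:
    """Negative log-likelihood of the observed preference."""
    return -math.log(bt_preference_probability(reward_chosen, reward_rejected))


# Equal rewards: the model is indifferent, so the probability is 0.5.
print(bt_preference_probability(1.0, 1.0))
# A larger margin in favor of the chosen response lowers the loss.
print(bt_loss(2.0, 0.0) < bt_loss(0.5, 0.0))
```

A judge model trained this way only needs preference pairs (chosen vs. rejected), not absolute scores, which is why the objective is commonly used for reward modeling.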


## Installation
```bash
pip install open_judge
```
More installation methods can be found [here](https://modelscope.cn/docs/open_judge/installation).

## Quickstart
```python
import asyncio

from open_judge.models import OpenAIChatModel
from open_judge.graders.common.relevance import RelevanceGrader


async def main() -> None:
    # Step 1: create the model client
    model = OpenAIChatModel(model="qwen3-32b")

    # Step 2: choose and initialize a suitable grader
    grader = RelevanceGrader(model=model)

    # Step 3: prepare the data
    data = {
        "query": "What is machine learning?",
        "response": "Machine learning is a subset of AI that enables computers to learn from data.",
    }

    # Step 4: evaluate the data (aevaluate is a coroutine, so it must be awaited)
    result = await grader.aevaluate(**data)

    print(f"Score: {result.score}")  # e.g. Score: 5
    print(f"Reason: {result.reason}")


# `await` is only valid inside a coroutine, so run main() on an event loop
asyncio.run(main())
```
The complete Quickstart can be found [here](https://modelscope.cn/docs/open_judge/quickstart).
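
Because the grader in the Quickstart exposes an async `aevaluate` method, several samples can in principle be scored concurrently. The sketch below is illustrative only: `KeywordOverlapGrader` and `GradeResult` are hypothetical stand-ins defined in the snippet (they are not part of the library), so it runs without an API key. It assumes that a real grader accepting `aevaluate(query=..., response=...)` could be passed to `grade_batch` in the same way.

```python
import asyncio
from dataclasses import dataclass


@dataclass
class GradeResult:
    """Hypothetical result type mirroring the `score` and `reason` fields used above."""
    score: float
    reason: str


class KeywordOverlapGrader:
    """Hypothetical stand-in grader: scores by word overlap between query and response."""

    async def aevaluate(self, query: str, response: str) -> GradeResult:
        await asyncio.sleep(0)  # yield control, as a real network call would
        query_words = set(query.lower().rstrip("?").split())
        response_words = set(response.lower().rstrip(".").split())
        shared = query_words & response_words
        score = len(shared) / len(query_words) if query_words else 0.0
        return GradeResult(score=score, reason=f"shared words: {sorted(shared)}")


async def grade_batch(grader, samples):
    """Run aevaluate on every sample concurrently; results keep the input order."""
    return await asyncio.gather(*(grader.aevaluate(**s) for s in samples))


samples = [
    {"query": "What is machine learning?", "response": "Machine learning is a subset of AI."},
    {"query": "What is machine learning?", "response": "Bananas are yellow."},
]

results = asyncio.run(grade_batch(KeywordOverlapGrader(), samples))
for sample, result in zip(samples, results):
    print(f"{result.score:.2f}  {sample['response']}")
```

`asyncio.gather` returns results in the order the coroutines were passed, so each result can be paired with its input sample by position. With a model-backed grader you would likely also want to cap concurrency (for example with `asyncio.Semaphore`) to respect API rate limits.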

## Integrations

| Integration | Documentation |
|-------------|---------------|
| LangSmith   | [LangSmith](https://modelscope.cn/docs/open_judge/integrations/langsmith) |
| LangFuse    | [LangFuse](https://modelscope.cn/docs/open_judge/integrations/langfuse) |
| Arize Phoenix | [Arize Phoenix](https://modelscope.cn/docs/open_judge/integrations/langfuse) |

## Contributing
We welcome contributions from the community!
1. Raise and comment on [Issues](https://github.com/modelscope/OpenJudge/issues).
2. Open a PR. Whether you're fixing bugs, adding new features, improving documentation, or sharing ideas, your contributions help make Open-Judge better for everyone. See [Contributing](https://github.com/modelscope/OpenJudge/blob/main/CONTRIBUTING.md) for more details.

## Citation

If you use Open-Judge in your research, please cite:

```
@software{
title = {OpenJudge: XXXX},
author = {The Open-Judge Team},
url = {https://github.com/modelscope/OpenJudge},
month = {07},
year = {2025}
}
```
