Metadata-Version: 2.4
Name: RadEval
Version: 0.1.7
Summary: All-in-one metrics for evaluating AI-generated radiology text
Home-page: https://github.com/jbdel/RadEval
Author: Jean-Benoit Delbrouck, Justin Xu, Xi Zhang
Maintainer: Xi Zhang, JB Delbrouck
License: MIT
Project-URL: Bug Reports, https://github.com/jbdel/RadEval/issues
Project-URL: Source, https://github.com/jbdel/RadEval
Project-URL: Documentation, https://github.com/jbdel/RadEval/blob/main/README.md
Keywords: radiology,evaluation,natural language processing,radiology report,medical NLP,clinical text generation,LLM,bioNLP,chexbert,radgraph,medical AI
Classifier: Intended Audience :: Science/Research
Classifier: Topic :: Scientific/Engineering
Classifier: License :: OSI Approved :: MIT License
Classifier: Programming Language :: Python :: 3 :: Only
Requires-Python: >=3.11
Description-Content-Type: text/markdown
License-File: LICENSE
Requires-Dist: torch>=2.0
Requires-Dist: torchvision
Requires-Dist: transformers<5,>=4.40
Requires-Dist: radgraph
Requires-Dist: rouge_score
Requires-Dist: bert-score>=0.3.13
Requires-Dist: scikit-learn>=1.8.0
Requires-Dist: numpy<2
Requires-Dist: medspacy
Requires-Dist: stanza
Requires-Dist: pillow
Requires-Dist: sentencepiece
Requires-Dist: datasets>=2.19
Requires-Dist: accelerate>=0.30
Requires-Dist: pandas
Requires-Dist: rich
Requires-Dist: pyyaml
Requires-Dist: appdirs
Requires-Dist: huggingface_hub
Provides-Extra: api
Requires-Dist: google-genai; extra == "api"
Requires-Dist: openai; extra == "api"
Requires-Dist: tenacity; extra == "api"
Dynamic: author
Dynamic: classifier
Dynamic: description
Dynamic: description-content-type
Dynamic: home-page
Dynamic: keywords
Dynamic: license
Dynamic: license-file
Dynamic: maintainer
Dynamic: project-url
Dynamic: provides-extra
Dynamic: requires-dist
Dynamic: requires-python
Dynamic: summary

# RadEval

<div align="center">

**All-in-one metrics for evaluating AI-generated radiology text**

</div>

<!--- BADGES: START --->
[![PyPI](https://img.shields.io/badge/RadEval-v0.1.7-00B7EB?logo=python&logoColor=00B7EB)](https://pypi.org/project/RadEval/)
[![Python version](https://img.shields.io/badge/python-3.11+-important?logo=python&logoColor=important)]()
[![Expert Dataset](https://img.shields.io/badge/Expert-%20Dataset-4CAF50?logo=googlecloudstorage&logoColor=9BF0E1)](https://huggingface.co/datasets/IAMJB/RadEvalExpertDataset)
[![Model](https://img.shields.io/badge/Model-RadEvalModernBERT-0066CC?logo=huggingface&labelColor=grey)](https://huggingface.co/IAMJB/RadEvalModernBERT)
[![Video](https://img.shields.io/badge/Talk-Video-9C27B0?logo=youtubeshorts&labelColor=grey)](https://justin13601.github.io/files/radeval.mp4)
[![Gradio Demo](https://img.shields.io/badge/Gradio-Demo-FFD21E.svg?logo=gradio&logoColor=gold)](https://huggingface.co/spaces/X-iZhang/RadEval)
[![EMNLP](https://img.shields.io/badge/paper-EMNLP-red)](https://aclanthology.org/2025.emnlp-demos.40/)
[![License](https://img.shields.io/badge/License-MIT-blue.svg?)](https://github.com/jbdel/RadEval/blob/main/LICENSE)
<!--- BADGES: END --->


### TL;DR
```bash
pip install RadEval
```
```python
from RadEval import RadEval
import json

refs = [
    "Mild cardiomegaly with small bilateral pleural effusions and basilar atelectasis.",
    "No pleural effusions or pneumothoraces.",
]
hyps = [
    "Mildly enlarged cardiac silhouette with small pleural effusions and dependent bibasilar atelectasis.",
    "No pleural effusions or pneumothoraces.",
]

evaluator = RadEval(
    do_radgraph=True,
    do_bleu=True
)

results = evaluator(refs=refs, hyps=hyps)
print(json.dumps(results, indent=2))
```
```json
{
  "radgraph_simple": 0.72,
  "radgraph_partial": 0.61,
  "radgraph_complete": 0.61,
  "bleu": 0.36
}
```

## Installation

```bash
pip install RadEval              # from PyPI
pip install 'RadEval[api]'       # include OpenAI/Gemini for MammoGREEN
```

Or install from source:
```bash
git clone https://github.com/jbdel/RadEval.git && cd RadEval
conda create -n radeval python=3.11 -y && conda activate radeval
pip install -e '.[api]'
```

## Supported Metrics

| Category | Metric | Flag | Modality | Best For | Usage |
|----------|--------|------|----------|----------|-------|
| **Lexical** | [BLEU](https://aclanthology.org/P02-1040.pdf) | `do_bleu` | -- | Surface-level n-gram overlap | [docs](docs/metrics.md#bleu-do_bleu) |
| | [ROUGE](https://aclanthology.org/W04-1013.pdf) | `do_rouge` | -- | Content coverage | [docs](docs/metrics.md#rouge-do_rouge) |
| **Semantic** | [BERTScore](https://openreview.net/forum?id=SkeHuCVFDr) | `do_bertscore` | -- | Semantic similarity | [docs](docs/metrics.md#bertscore-do_bertscore) |
| | [RadEval BERTScore](https://aclanthology.org/2025.emnlp-demos.40.pdf) | `do_radeval_bertscore` | -- | Domain-adapted radiology semantics | [docs](docs/metrics.md#radeval-bertscore-do_radeval_bertscore) |
| **Clinical** | [F1CheXbert](https://aclanthology.org/2020.emnlp-main.117.pdf) | `do_f1chexbert` | CXR | CheXpert finding classification | [docs](docs/metrics.md#f1chexbert-do_f1chexbert) |
| | [F1RadBERT-CT](https://www.nature.com/articles/s41551-025-01599-y) | `do_f1radbert_ct` | CT | CT finding classification | [docs](docs/metrics.md#f1radbert-ct-do_f1radbert_ct) |
| | [F1RadGraph](https://aclanthology.org/2022.findings-emnlp.319.pdf) | `do_radgraph` | CXR | Clinical entity/relation accuracy | [docs](docs/metrics.md#f1radgraph-do_radgraph) |
| | [RaTEScore](https://aclanthology.org/2024.emnlp-main.836.pdf) | `do_ratescore` | CXR | Entity-level synonym-aware scoring | [docs](docs/metrics.md#ratescore-do_ratescore) |
| **Specialized** | [RadGraph-RadCliQ](https://www.cell.com/patterns/pdfExtended/S2666-3899(23)00157-5) | `do_radgraph_radcliq` | CXR | Per-pair entity+relation F1 (RadCliQ variant) | [docs](docs/metrics.md#radgraph-radcliq-do_radgraph_radcliq) |
| | [RadCliQ-v1](https://www.cell.com/patterns/pdfExtended/S2666-3899(23)00157-5) | `do_radcliq` | CXR | Composite clinical relevance | [docs](docs/metrics.md#radcliq-v1-do_radcliq) |
| | [SRRBert](https://aclanthology.org/2025.acl-long.1301.pdf) | `do_srrbert` | CXR | Structured report evaluation | [docs](docs/metrics.md#srrbert-do_srrbert) |
| | [Temporal F1](https://aclanthology.org/2025.findings-acl.888.pdf) | `do_temporal` | CXR | Temporal consistency | [docs](docs/metrics.md#temporal-f1-do_temporal) |
| | [GREEN](https://aclanthology.org/2024.findings-emnlp.21.pdf) | `do_green` | CXR | LLM-based overall quality (7B model) | [docs](docs/metrics.md#green-do_green) |
| | MammoGREEN | `do_mammo_green` | Mammo | Mammography-specific LLM scoring | [docs](docs/metrics.md#mammogreen-do_mammo_green) |
| | [CRIMSON](https://arxiv.org/pdf/2603.06183) | `do_crimson` | CXR | LLM-based clinical significance scoring | [docs](docs/metrics.md#crimson-do_crimson) |
| | [RadFact-CT](https://arxiv.org/pdf/2510.15042) | `do_radfact_ct` | CT | LLM-based factual precision/recall | [docs](docs/metrics.md#radfact-ct-do_radfact_ct) |

> **Modality:** CXR = Chest X-Ray, CT = Computed Tomography, Mammo = Mammography, -- = modality-agnostic.

Enable only the metrics you need -- each one is loaded lazily.
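
When the set of metrics changes between experiments, it can be convenient to build the keyword arguments from a list of names. The sketch below is only an illustration: `metric_flags` and `KNOWN_FLAGS` are hypothetical helpers that are not part of RadEval, and the flag names are copied from the table above.

```python
# Flag names copied from the "Flag" column of the Supported Metrics table.
KNOWN_FLAGS = {
    "do_bleu", "do_rouge", "do_bertscore", "do_radeval_bertscore",
    "do_f1chexbert", "do_f1radbert_ct", "do_radgraph", "do_ratescore",
    "do_radgraph_radcliq", "do_radcliq", "do_srrbert", "do_temporal",
    "do_green", "do_mammo_green", "do_crimson", "do_radfact_ct",
}


def metric_flags(*names):
    """Turn metric names such as "bleu" or "do_bleu" into {flag: True} kwargs.

    Hypothetical helper (not provided by RadEval); raises on unknown names
    so that typos fail early instead of being passed to the evaluator.
    """
    flags = {}
    for name in names:
        flag = name if name.startswith("do_") else f"do_{name}"
        if flag not in KNOWN_FLAGS:
            raise ValueError(f"Unknown metric flag: {flag}")
        flags[flag] = True
    return flags


flags = metric_flags("bleu", "radgraph")
print(flags)

# With RadEval installed, the flags can then be unpacked into the constructor:
# from RadEval import RadEval
# evaluator = RadEval(**flags)
```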

## Per-Sample Output

Pass `do_per_sample=True` to get per-sample scores for every enabled metric. The output uses the **same flat keys** as the default mode, but each value is a `list[float]` of length `n_samples` instead of a single aggregate.

```python
evaluator = RadEval(do_bleu=True, do_bertscore=True, do_per_sample=True)
results = evaluator(refs=refs, hyps=hyps)
# results["bleu"]      → [0.85, 0.40, ...]   (one per sample)
# results["bertscore"] → [0.95, 0.89, ...]
```

See [docs/metrics.md](docs/metrics.md) for the full list of per-sample output keys for each metric.
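
Per-sample lists make it easy to locate the reports a system handles worst. The following is a minimal sketch using only the standard library; `worst_samples` is a hypothetical helper, and the `results` dict is filled with illustrative values rather than real RadEval output. Note that the mean of per-sample scores need not equal the default aggregate, depending on how a given metric aggregates.

```python
from statistics import mean


def worst_samples(per_sample_scores, k=3):
    """Return (index, score) pairs for the k lowest-scoring samples."""
    ranked = sorted(enumerate(per_sample_scores), key=lambda pair: pair[1])
    return ranked[:k]


# Illustrative values only. In practice this would be the dict returned by
# RadEval(..., do_per_sample=True)(refs=refs, hyps=hyps).
results = {
    "bleu": [0.85, 0.40, 0.62],
    "bertscore": [0.95, 0.89, 0.91],
}

for metric, scores in results.items():
    # Report the average and the single weakest sample for each metric.
    print(metric, "mean:", round(mean(scores), 3), "worst:", worst_samples(scores, k=1))
```

The returned indices line up with the positions in `refs` and `hyps`, so they can be used to pull out the corresponding report pairs for manual review.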

## Detailed Output

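Pass `do_details=True` to get additional aggregate scores beyond the defaults: per-label F1 breakdowns for classifiers, BLEU-1/2/3, and standard deviations for LLM-based metrics. The default keys are unchanged, and the extra scores are added alongside them as top-level keys. Per-label breakdowns are returned as dictionaries, as in the last line of the example.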

```python
evaluator = RadEval(do_bleu=True, do_f1chexbert=True, do_crimson=True, do_details=True)
results = evaluator(refs=refs, hyps=hyps)
# results["bleu"]       → 0.36     (same as default)
# results["bleu_1"]     → 0.55     (extra: BLEU-1)
# results["bleu_2"]     → 0.42     (extra: BLEU-2)
# results["crimson_std"] → 0.15    (extra: std)
# results["f1chexbert_label_scores_f1"] → {"f1chexbert_5": {"Cardiomegaly": 0.59, ...}, ...}
```

See [docs/metrics.md](docs/metrics.md) for the full output schema of each metric.
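
Because a detailed result mixes scalar scores with per-label dictionaries, it can help to separate the two before logging or tabulating. The sketch below assumes a result shaped like the comments in the example above; `split_details` is a hypothetical helper and the values are the illustrative ones from that example, not real output.

```python
def split_details(results):
    """Separate scalar scores from dictionary-valued breakdowns."""
    scalars, breakdowns = {}, {}
    for key, value in results.items():
        if isinstance(value, dict):
            breakdowns[key] = value
        else:
            scalars[key] = value
    return scalars, breakdowns


# Illustrative values copied from the comments in the example above.
results = {
    "bleu": 0.36,
    "bleu_1": 0.55,
    "bleu_2": 0.42,
    "crimson_std": 0.15,
    "f1chexbert_label_scores_f1": {"f1chexbert_5": {"Cardiomegaly": 0.59}},
}

scalars, breakdowns = split_details(results)

# Print the scalar scores as a small aligned table.
for key in sorted(scalars):
    print(f"{key:<12} {scalars[key]:.2f}")

# List which keys carry per-label breakdowns.
print("breakdowns:", list(breakdowns))
```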

## Comparing Systems

Use `compare_systems` to run paired approximate randomization tests between any number of systems:

```python
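from RadEval import RadEval, compare_systems

# baseline_reports, improved_reports, and reference_reports are placeholders
# for your own lists of report strings, aligned by index.
evaluator = RadEval(do_bleu=True)
signatures, scores = compare_systems(
    systems={
        'baseline': baseline_reports,
        'improved': improved_reports,
    },
    # Each metric is a callable taking (hyps, refs) and returning a score.
    metrics={'bleu': lambda hyps, refs: evaluator(refs=refs, hyps=hyps)['bleu']},
    references=reference_reports,
    n_samples=10000,
)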
```

See [docs/hypothesis_testing.md](docs/hypothesis_testing.md) for a full walkthrough and interpretation guide.
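
For intuition about what a paired approximate randomization test does, here is a minimal pure-Python sketch. It is not the `compare_systems` implementation: the function `paired_approx_randomization` and its parameters are hypothetical, and it assumes the test statistic is the mean of per-sample scores computed for two systems on the same references.

```python
import random


def paired_approx_randomization(scores_a, scores_b, n_samples=10000, seed=0):
    """Estimate a two-sided p-value for the difference in mean score.

    On each trial, the two systems' scores are swapped independently per
    sample with probability 0.5, and the shuffled mean difference is
    compared against the observed one.
    """
    if len(scores_a) != len(scores_b):
        raise ValueError("Both systems need one score per sample.")
    n = len(scores_a)
    if n == 0:
        raise ValueError("At least one sample is required.")

    rng = random.Random(seed)
    observed = abs(sum(a - b for a, b in zip(scores_a, scores_b))) / n

    at_least_as_extreme = 0
    for _ in range(n_samples):
        diff = 0.0
        for a, b in zip(scores_a, scores_b):
            if rng.random() < 0.5:
                a, b = b, a  # swap the pair for this trial
            diff += a - b
        if abs(diff) / n >= observed:
            at_least_as_extreme += 1

    # Add-one smoothing keeps the estimate strictly above zero.
    return (at_least_as_extreme + 1) / (n_samples + 1)


# Illustrative per-sample scores, not real RadEval output.
baseline = [0.1] * 20
improved = [0.9] * 20
print(paired_approx_randomization(improved, baseline, n_samples=2000))
```

A small p-value suggests the observed difference is unlikely to arise from randomly relabelling which system produced which output. When the two score lists are identical, every trial ties with the observed difference and the estimate is 1.0.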

## Documentation

| Page | Contents |
|------|----------|
| [docs/metrics.md](docs/metrics.md) | What each metric measures, `do_per_sample` / `do_details` output schemas |
| [docs/hypothesis_testing.md](docs/hypothesis_testing.md) | Statistical background, full example, performance notes |
| [docs/file_formats.md](docs/file_formats.md) | Loading data from .tok, .json, and Python lists |

## RadEval Expert Dataset

A curated evaluation set annotated by board-certified radiologists for validating automatic metrics. Available on [HuggingFace](https://huggingface.co/datasets/IAMJB/RadEvalExpertDataset).

## Citation

```BibTeX
@inproceedings{xu-etal-2025-radeval,
    title = "{R}ad{E}val: A framework for radiology text evaluation",
    author = "Xu, Justin  and
      Zhang, Xi  and
      Abderezaei, Javid  and
      Bauml, Julie  and
      Boodoo, Roger  and
      Haghighi, Fatemeh  and
      Ganjizadeh, Ali  and
      Brattain, Eric  and
      Van Veen, Dave  and
      Meng, Zaiqiao  and
      Eyre, David W  and
      Delbrouck, Jean-Benoit",
    booktitle = "Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing: System Demonstrations",
    month = nov,
    year = "2025",
    address = "Suzhou, China",
    publisher = "Association for Computational Linguistics",
    url = "https://aclanthology.org/2025.emnlp-demos.40/",
    doi = "10.18653/v1/2025.emnlp-demos.40",
    pages = "546--557",
}
```

### Contributors
<table>
  <tbody>
    <tr>
      <td align="center">
        <a href="https://jbdel.github.io/">
          <img src="https://aimi.stanford.edu/sites/g/files/sbiybj20451/files/styles/medium_square/public/media/image/image5_0.png?h=f4e62a0a&itok=euaj9VoF"
               width="100" height="100"
               style="object-fit: cover; border-radius: 20%;" alt="Jean-Benoit Delbrouck"/>
          <br />
          <sub><b>Jean-Benoit Delbrouck</b></sub>
        </a>
      </td>
      <td align="center">
        <a href="https://justin13601.github.io/">
          <img src="https://justin13601.github.io/images/pfp2.JPG"
               width="100" height="100"
               style="object-fit: cover; border-radius: 20%;" alt="Justin Xu"/>
          <br />
          <sub><b>Justin Xu</b></sub>
        </a>
      </td>
      <td align="center">
        <a href="https://x-izhang.github.io/">
          <img src="https://x-izhang.github.io/author/xi-zhang/avatar_hu13660783057866068725.jpg"
               width="100" height="100"
               style="object-fit: cover; border-radius: 20%;" alt="Xi Zhang"/>
          <br />
          <sub><b>Xi Zhang</b></sub>
        </a>
      </td>
    </tr>
  </tbody>
</table>

## Acknowledgments

Built on the work of the radiology AI community: [CheXbert](https://github.com/stanfordmlgroup/CheXbert), [RadGraph](https://github.com/jbdel/RadGraph), [BERTScore](https://github.com/Tiiiger/bert_score), [RaTEScore](https://github.com/MAGIC-AI4Med/RaTEScore), [SRR-BERT](https://github.com/StanfordAIMI/SRR-BERT), [GREEN](https://github.com/Stanford-AIMI/GREEN), and datasets like [MIMIC-CXR](https://physionet.org/content/mimic-cxr/2.0.0/).

---
<div align="center">
  <p>If you find RadEval useful, please give us a star!</p>
</div>
