Metadata-Version: 2.1
Name: grouse
Version: 0.2.0
Summary: Evaluate Grounded Question Answering models and Grounded Question Answering evaluator models.
Author-email: Sacha Muller <sacha.muller@illuin.tech>, Antonio Loison <antonio.loison@illuin.tech>, Bilel Omrani <bilel.omrani@illuin.tech>
Project-URL: homepage, https://github.com/illuin-tech/grouse
Project-URL: repository, https://github.com/illuin-tech/grouse
Classifier: Programming Language :: Python :: 3
Classifier: Programming Language :: Python :: 3.10
Classifier: Programming Language :: Python :: 3.11
Classifier: Programming Language :: Python :: 3.12
Requires-Python: <3.13,>=3.10
Description-Content-Type: text/markdown
License-File: LICENSE.txt
Requires-Dist: click<8.1.7,>=8.1.0
Requires-Dist: litellm<2.0.0,>=1.41.0
Requires-Dist: instructor<2.0.0,>=1.3.5
Requires-Dist: diskcache<6.0.0,>=5.6.0
Requires-Dist: numpy<3.0.0,>=1.21.2
Requires-Dist: jsonlines<5.0.0,>=4.0.0
Requires-Dist: datasets==2.20.0
Requires-Dist: Jinja2<4.0.0,>=3.1.0
Requires-Dist: aiohttp<4.0.0,>=3.9.0
Requires-Dist: tqdm<5.0.0,>=4.66.0
Requires-Dist: pydantic<3.0.0,>=2.5.0
Requires-Dist: matplotlib<4.0.0,>=3.9.0
Requires-Dist: importlib-resources<7.0.0,>=6.4.0
Provides-Extra: dev
Requires-Dist: ruff==0.5.4; extra == "dev"
Requires-Dist: deptry==0.17.0; extra == "dev"
Requires-Dist: mypy==1.11.0; extra == "dev"
Requires-Dist: pytest==8.3.1; extra == "dev"
Requires-Dist: coverage==7.6.0; extra == "dev"
Requires-Dist: types-tqdm==4.66.0; extra == "dev"
Requires-Dist: mock==5.1.0; extra == "dev"

# GroUSE

Evaluate Grounded Question Answering (GQA) models and GQA evaluator models. This package implements the evaluation methods described in the paper *GroUSE: A Benchmark to Evaluate Evaluators in Grounded Question Answering*.

- [Install](#install)
- [Command Line Usage](#command-line-usage)
  - [Evaluation of the Grounded Question Answering task](#evaluation-of-the-grounded-question-answering-task)
  - [Unit Testing of Evaluators with GroUSE](#unit-testing-of-evaluators-with-grouse)
  - [Plot Matrices of unit tests success](#plot-matrices-of-unit-tests-success)
- [Python Usage](#python-usage)
- [Links](#links)
- [Citation](#citation)

## Install

```bash
pip install grouse
```

Then set up your OpenAI credentials: copy the `.env.dist` file to `.env`, fill in your OpenAI API key and organization ID, and export the environment variables with `export $(cat .env | xargs)`.
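
If you would rather load the variables from Python than from the shell, the sketch below does roughly what the `export` one-liner does, using only the standard library. The helper name `load_env_file` is hypothetical and not part of GroUSE, and the parser only handles simple `KEY=VALUE` lines.

```python
import os
from pathlib import Path


def load_env_file(path: str = ".env") -> dict[str, str]:
    """Read simple KEY=VALUE lines from a file and copy them into os.environ.

    Hypothetical helper, not part of GroUSE. Blank lines, comment lines
    starting with '#', and lines without '=' are skipped. Surrounding
    quotes around a value are stripped.
    """
    loaded: dict[str, str] = {}
    for raw_line in Path(path).read_text(encoding="utf-8").splitlines():
        line = raw_line.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        loaded[key.strip()] = value.strip().strip("'\"")
    os.environ.update(loaded)
    return loaded
```

Calling `load_env_file(".env")` before running an evaluation makes the variables visible to the current Python process only, unlike `export`, which affects the whole shell session.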

## Command Line Usage

### Evaluation of the Grounded Question Answering task

Build a dataset as a `jsonl` file in which each line is a JSON object with the following fields (shown here across several lines, with comments, for readability):

```txt
{
    "references": [...], # List of references
    "input": "", # Query
    "actual_output": "", # Predicted answer generated by the model we want to evaluate
    "expected_output": "" # Ground truth answer to the input
}
```

See `example_data/grounded_qa.jsonl` for an example.
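
As a minimal sketch of producing such a file, the snippet below writes one JSON object per line with the four fields described above, using only the standard library. The helper `write_jsonl` and the file name `my_dataset.jsonl` are illustrative assumptions, not part of GroUSE.

```python
import json


def write_jsonl(rows: list[dict], path: str) -> None:
    """Write one JSON object per line (hypothetical helper, not part of GroUSE)."""
    with open(path, "w", encoding="utf-8") as f:
        for row in rows:
            # ensure_ascii=False keeps accented characters readable in the file
            f.write(json.dumps(row, ensure_ascii=False) + "\n")


rows = [
    {
        "references": ["Paris is the capital of France."],
        "input": "What is the capital of France?",
        "actual_output": "The capital of France is Marseille.",
        "expected_output": "The capital of France is Paris.",
    }
]
write_jsonl(rows, "my_dataset.jsonl")
```

Each line of the resulting file is a complete JSON object with no comments, which is what distinguishes the on-disk format from the annotated layout shown above.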

Then, run this command:

```bash
grouse evaluate {PATH_TO_DATASET_WITH_GENERATIONS} outputs/gpt-4o
```

We recommend using GPT-4 as the evaluator model because the prompts were optimised for it, but you can change the model and the prompts with the `--model-name` and `--prompts-path` options.

### Unit Testing of Evaluators with GroUSE

Meta-evaluation consists of evaluating GQA evaluators with the GroUSE unit tests:

```bash
grouse meta-evaluate gpt-4o meta-outputs/gpt-4o
```

### Plot Matrices of unit tests success

You can plot the unit test results as matrices:

```bash
grouse plot meta-outputs/gpt-4o
```

The resulting matrices look like this:

![result_matrices_plot](assets/result_matrices_plot.png)

## Python Usage

```python
from grouse import EvaluationSample, GroundedQAEvaluator

sample = EvaluationSample(
    input="What is the capital of France?",
    # Replace this with the actual output from your LLM application
    actual_output="The capital of France is Marseille.",
    expected_output="The capital of France is Paris.",
    references=["Paris is the capital of France."]
)
evaluator = GroundedQAEvaluator()
evaluator.evaluate([sample])
```
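
To evaluate a whole file instead of a single hand-written sample, one option is to read the `jsonl` rows first and check that each carries the four fields used above. The sketch below uses only the standard library. `read_rows` and `EXPECTED_KEYS` are hypothetical names, and whether GroUSE itself requires all four fields is an assumption.

```python
import json

# The four fields shown in the dataset format; treating all of them as
# mandatory is an assumption of this sketch.
EXPECTED_KEYS = {"references", "input", "actual_output", "expected_output"}


def read_rows(path: str) -> list[dict]:
    """Load a jsonl file and fail early on rows that lack an expected field."""
    rows = []
    with open(path, encoding="utf-8") as f:
        for line_number, line in enumerate(f, start=1):
            if not line.strip():
                continue  # tolerate blank lines
            row = json.loads(line)
            missing = EXPECTED_KEYS - row.keys()
            if missing:
                raise ValueError(
                    f"line {line_number}: missing keys {sorted(missing)}"
                )
            rows.append(row)
    return rows
```

Assuming each row holds exactly these four keys, the rows can then be turned into samples with `[EvaluationSample(**row) for row in read_rows(path)]` and passed to `evaluator.evaluate(...)` as in the example above.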

## Links

- [Unit Tests](https://huggingface.co/datasets/illuin/grouse)
<!-- TODO Add link to the model: - [Llama 3 8B Model]() -->

## Citation

```latex
@misc{muller2024grouse,
      title={GroUSE: A Benchmark to Evaluate Evaluators in Grounded Question Answering}, 
      author={Sacha Muller and António Loison and Bilel Omrani and Gautier Viaud},
      year={2024},
      archivePrefix={arXiv},
      primaryClass={cs.IR},
}
```
