Metadata-Version: 2.4
Name: arize-phoenix-evals
Version: 2.0.0
Summary: LLM Evaluations
Project-URL: Documentation, https://arize.com/docs/phoenix/
Project-URL: Issues, https://github.com/Arize-ai/phoenix/issues
Project-URL: Source, https://github.com/Arize-ai/phoenix
Author-email: Arize AI <phoenix-devs@arize.com>
License: Elastic-2.0
License-File: IP_NOTICE
License-File: LICENSE
Keywords: Explainability,Monitoring,Observability
Classifier: Programming Language :: Python
Classifier: Programming Language :: Python :: 3.8
Classifier: Programming Language :: Python :: 3.9
Classifier: Programming Language :: Python :: 3.10
Classifier: Programming Language :: Python :: 3.11
Classifier: Programming Language :: Python :: 3.12
Classifier: Programming Language :: Python :: 3.13
Requires-Python: <3.14,>=3.8
Requires-Dist: jsonpath-ng
Requires-Dist: openinference-semantic-conventions>=0.1.19
Requires-Dist: opentelemetry-api
Requires-Dist: pandas
Requires-Dist: pydantic>=2.0.0
Requires-Dist: pystache
Requires-Dist: tqdm
Requires-Dist: typing-extensions<5,>=4.5
Provides-Extra: dev
Requires-Dist: anthropic>0.18.0; extra == 'dev'
Requires-Dist: boto3; extra == 'dev'
Requires-Dist: litellm>=1.28.9; extra == 'dev'
Requires-Dist: mistralai>=1.0.0; extra == 'dev'
Requires-Dist: openai>=1.0.0; extra == 'dev'
Requires-Dist: vertexai; extra == 'dev'
Provides-Extra: test
Requires-Dist: anthropic>=0.18.0; extra == 'test'
Requires-Dist: boto3; extra == 'test'
Requires-Dist: lameenc; extra == 'test'
Requires-Dist: litellm>=1.28.9; extra == 'test'
Requires-Dist: mistralai>=1.0.0; extra == 'test'
Requires-Dist: nest-asyncio; extra == 'test'
Requires-Dist: openai>=1.0.0; extra == 'test'
Requires-Dist: openinference-semantic-conventions; extra == 'test'
Requires-Dist: pandas; extra == 'test'
Requires-Dist: pandas-stubs<=2.0.2.230605; extra == 'test'
Requires-Dist: respx; extra == 'test'
Requires-Dist: tqdm; extra == 'test'
Requires-Dist: types-tqdm; extra == 'test'
Requires-Dist: typing-extensions<5,>=4.5; extra == 'test'
Requires-Dist: vertexai; extra == 'test'
Description-Content-Type: text/markdown

# arize-phoenix-evals

<p align="center">
    <a href="https://pypi.org/project/arize-phoenix-evals/">
        <img src="https://img.shields.io/pypi/v/arize-phoenix-evals" alt="PyPI Version">
    </a>
    <a href="https://arize-phoenix.readthedocs.io/projects/evals/en/latest/index.html">
        <img src="https://img.shields.io/badge/docs-blue?logo=readthedocs&logoColor=white" alt="Documentation">
    </a>
</p>

Phoenix provides tooling to evaluate LLM applications, including tools to determine whether documents retrieved by a retrieval-augmented generation (RAG) application are relevant, whether a response is toxic, and much more.

Phoenix's approach to LLM evals is notable for the following reasons:

- Includes pre-tested templates and convenience functions for a set of common Eval "tasks"
- Data science rigor applied to the testing of model and template combinations
- Designed to run as fast as possible on batches of data
- Includes benchmark datasets and tests for each eval function

## Installation

Install the `arize-phoenix-evals` sub-package via `pip`:

```shell
pip install arize-phoenix-evals
```

Note that you will also need to install the SDK of the LLM vendor you want to use with LLM Evals. For example, to use OpenAI's GPT-4, install the OpenAI Python SDK:

```shell
pip install 'openai>=1.0.0'
```
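If you are unsure which vendor SDKs are already present in your environment, a quick check along these lines may help. This is a minimal sketch, not part of the package: `installed_sdks` is a hypothetical helper, and the module names are simply the SDKs listed in this package's `dev` extra.

```python
from importlib.util import find_spec
from typing import Dict, Iterable

# Top-level module names of the SDKs listed in this package's `dev` extra.
VENDOR_SDKS = ["anthropic", "boto3", "litellm", "mistralai", "openai", "vertexai"]


def installed_sdks(candidates: Iterable[str]) -> Dict[str, bool]:
    """Hypothetical helper: report which top-level modules can be found."""
    # find_spec returns None when a top-level module is not installed.
    return {name: find_spec(name) is not None for name in candidates}


if __name__ == "__main__":
    for name, available in installed_sdks(VENDOR_SDKS).items():
        print(f"{name}: {'installed' if available else 'missing'}")
```

Only the SDK for the vendor you plan to call needs to be installed; the others can stay missing.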

## Usage

Here is an example of running the RAG relevance eval on a dataset of Wikipedia questions and answers. The example uses scikit-learn to compute classification metrics, so install it via `pip` first:

```shell
pip install scikit-learn
```

```python
import os
from phoenix.evals import (
    RAG_RELEVANCY_PROMPT_TEMPLATE,
    RAG_RELEVANCY_PROMPT_RAILS_MAP,
    OpenAIModel,
    download_benchmark_dataset,
    llm_classify,
)
from sklearn.metrics import precision_recall_fscore_support

os.environ["OPENAI_API_KEY"] = "<your-openai-key>"

# Choose a model to evaluate on question-answering relevancy classification
model = OpenAIModel(
    model="o3-mini",
    temperature=0.0,
)

# Choose 100 examples from a small dataset of question-answer pairs
df = download_benchmark_dataset(
    task="binary-relevance-classification", dataset_name="wiki_qa-train"
)
df = df.sample(100)
df = df.rename(
    columns={
        "query_text": "input",
        "document_text": "reference",
    },
)

# Use the language model to classify each example in the dataset
rails_map = RAG_RELEVANCY_PROMPT_RAILS_MAP
class_names = list(rails_map.values())
result_df = llm_classify(df, model, RAG_RELEVANCY_PROMPT_TEMPLATE, class_names)

# Map the true labels to the class names for comparison
y_true = df["relevant"].map(rails_map)
# Get the labels generated by the model being evaluated
y_pred = result_df["label"]

# Evaluate the classification results of the model
precision, recall, f1, support = precision_recall_fscore_support(y_true, y_pred, labels=class_names)
print("Classification Results:")
for idx, label in enumerate(class_names):
    print(f"Class: {label} (count: {support[idx]})")
    print(f"  Precision: {precision[idx]:.2f}")
    print(f"  Recall:    {recall[idx]:.2f}")
    print(f"  F1 Score:  {f1[idx]:.2f}\n")
```
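The example above passes `class_names` to `llm_classify` so that every prediction ends up as one of a fixed set of labels. The idea of constraining free-form model output to such "rails" can be sketched as follows. This is a simplified illustration and not the library's implementation: `snap_to_label`, the `UNPARSABLE` sentinel, and the rail values shown are assumptions made for the example.

```python
from typing import List

# Hypothetical sentinel for outputs that match no label unambiguously.
UNPARSABLE = "UNPARSABLE"


def snap_to_label(raw_output: str, class_names: List[str]) -> str:
    """Map free-form model output onto exactly one allowed label."""
    text = raw_output.strip().lower()
    # Collect every allowed label that appears in the normalized output.
    matches = [name for name in class_names if name.lower() in text]
    # Accept the output only when exactly one label matches.
    return matches[0] if len(matches) == 1 else UNPARSABLE


# Illustrative rails map; the real values live in RAG_RELEVANCY_PROMPT_RAILS_MAP.
rails_map = {True: "relevant", False: "unrelated"}
class_names = list(rails_map.values())

raw_outputs = ["Relevant", " unrelated.\n", "I am not sure", "relevant or unrelated"]
labels = [snap_to_label(output, class_names) for output in raw_outputs]
print(labels)
```

Substring matching is deliberately naive here: a label such as "irrelevant" would also match "relevant", so a real implementation has to handle overlapping labels more carefully.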

To learn more about LLM Evals, see the [LLM Evals documentation](https://arize.com/docs/phoenix/concepts/llm-evals/).

## Documentation

- **[Full Documentation](https://arize-phoenix.readthedocs.io/projects/evals/en/latest/index.html)** - Complete API reference and guides
- **[Phoenix Docs](https://arize.com/docs/phoenix)** - Detailed use cases and examples
- **[OpenInference](https://github.com/Arize-ai/openinference)** - Auto-instrumentation libraries for frameworks

## Community

Join our community to connect with thousands of AI builders:

- 🌍 Join our [Slack community](https://arize-ai.slack.com/join/shared_invite/zt-11t1vbu4x-xkBIHmOREQnYnYDH1GDfCg).
- 📚 Read the [Phoenix documentation](https://arize.com/docs/phoenix).
- 💡 Ask questions and provide feedback in the _#phoenix-support_ channel.
- 🌟 Leave a star on our [GitHub](https://github.com/Arize-ai/phoenix).
- 🐞 Report bugs with [GitHub Issues](https://github.com/Arize-ai/phoenix/issues).
- 𝕏 Follow us on [𝕏](https://twitter.com/ArizePhoenix).
- 🗺️ Check out our [roadmap](https://github.com/orgs/Arize-ai/projects/45) to see where we're heading next.
