Metadata-Version: 2.4
Name: sik-llm-eval
Version: 0.0.2
Summary: sik-llm-eval is a simple, yet flexible, framework primarily designed for evaluating Language Model Models (LLMs) on custom use cases.
License-File: LICENSE
Requires-Python: >=3.11
Requires-Dist: anthropic>=0.54.0
Requires-Dist: openai>=1.86.0
Requires-Dist: pandas>=2.3.0
Requires-Dist: pydantic>=2.11.6
Requires-Dist: python-dotenv>=1.1.0
Requires-Dist: pyyaml>=6.0.2
Requires-Dist: requests>=2.32.4
Requires-Dist: ruamel-yaml>=0.18.14
Requires-Dist: sik-llms>=0.3.19
Requires-Dist: tenacity>=9.1.2
Requires-Dist: tiktoken>=0.9.0
Description-Content-Type: text/markdown

[![test](https://github.com/anaconda/sik-llm-eval/actions/workflows/tests.yaml/badge.svg)](https://github.com/anaconda/sik-llm-eval/actions/workflows/tests.yaml)

# sik-llm-eval

`sik-llm-eval` is a simple, yet flexible, framework primarily designed for evaluating Language Model Models (LLMs) on custom use cases.

This framework allows you to easily create tests cases, ranging from simple tests based on matching/regex, to tests that extract and execute Python code blocks (generated from the responses) to determine the percent of code blocks that successfully execute. It also allows you to create your own custom tests.

Get started with examples found in the [examples](https://github.com/anaconda/sik-llm-eval/tree/main/examples) folder.


> `sik-llm-eval` is a fork of [`anaconda/llm-eval`](https://github.com/anaconda/llm-eval). I was the original author and principal contributor to the initial codebase while it was developed at Anaconda (last commit on June 12, 2025).

---

## Using sik-llm-eval

In this framework, there are two fundamental concepts:

- **Eval**: An Eval represents a single test scenario. Each Eval defines an `input` to an LLM or agent, and "checks" which evaluate the response of the agent against the criteria specified in the check. Users can also create and custom checks.
- **Candidate**: A Candidate is a lightweight wrapper around an LLM or agent that used to standardize the inputs and outputs of the agent with the inputs and outputs associated with the Eval. In other words, different models might expect inputs to be formatted in different ways (and might return responses formatted in different ways) and a Candidate is an adaptor for those models so that the Evals can be defined in one format, regardless of the various formats expected by various models.

### Examples

#### Running Evals/Candidates from YAML files

You can define Candidates and Evals using YAML files. Here's an example YAML file for a ChatGPT Candidate:

```yaml
model: gpt-4o-mini
candidate_type: OPENAI
metadata:
  name: OpenAI GPT-4o-mini
parameters:
  temperature: 0.01
  max_tokens: 4096
  seed: 42
```

Here's an example of a YAML file that defines an Eval, focusing on generating a Fibonacci sequence function and corresponding assertion statements:

```yaml
metadata:
  name: Fibonacci Sequence
input:
  - role: user
    content: Create a Python function called `fib` that takes an integer `n` and returns the `n`th number in the Fibonacci sequence. Use type hints and docstrings.
checks:
  - check_type: REGEX
    pattern: "def fib\\([a-zA-Z_]+\\: int\\) -> int\\:"
  - check_type: PYTHON_CODE_BLOCK_TESTS
    code_setup: import re
    code_tests:
    - |
      def verify_mask_emails_with_no_email_returns_original_string(code_blocks: list[str]) -> bool:
          value = 'This is a string with no email addresses'
          return mask_emails(value) == value
```

The Eval above defines various types of checks, including a `PYTHON_CODE_BLOCK_TESTS` check, which executes the code blocks generated by the LLM in an isolated environment and tracks the number of code blocks that successfully execute. It also allows you to define custom tests to directly test the variables or functions created by the code blocks (in the same isolated environment).

The following code loads the Eval and Candidate from above (with the addition of a ChatGPT 4.0 Candidate and an Eval testing the creation of a "mask_emails" function). You can then run these Evals against Candidates with an `EvalHarness`:

```python
from sik_llm_eval.eval import EvalHarness

eval_harness = EvalHarness()
eval_harness.add_eval_from_yaml('examples/evals/simple_example.yaml')
eval_harness.add_eval_from_yaml('examples/evals/mask_emails.yaml')
eval_harness.add_candidate_from_yaml('examples/candidates/openai_4o-mini.yaml')
eval_harness.add_candidate_from_yaml('examples/candidates/openai_4.0.yaml')
results = eval_harness()

print(f"Num Checks: {result.num_checks}")
print(f"Num Passed: {result.num_successful_checks}")
print(f"Percent Passed: {result.perc_successful_checks:.1%}")
print(result.response)
```

`results` contains a list of lists of EvalResults. Each item in the outer list corresponds to a single Candidate and contains a list of EvalResults for all Evals run against the Candidate. In our example, `results` is [[EvalResult, EvalResult], [EvalResult, EvalResult]] where the first list corresponds to results of the Evals associated with the first Candidate (ChatGPT 4o-mini) and the second list corresponds to results of the Evals associated with the second Candidate (ChatGPT 4.0).

Note that you can load multiple YAML files in a directory using `add_evals_from_yamls` and `add_candidates_from_yamls`:

```python
...
eval_harness.add_evals_from_yamls('examples/evals/*.yaml')
eval_harness.add_candidate_from_yamls('examples/candidates/*.yaml')
...
```

## Installing

`uv install sik-llm-eval` or `pip install sik-llm-eval`

### Environment Variables

The following environment variables are required for using the built-in OpenAI and Anthropic Candidates:

- `OPENAI_API_KEY`: This environment variable and API key are required for using OpenAIChat and OpenAICandidate.
- `ANTHROPIC_API_KEY`: This environment variable and API key are required for using AnthropicChat and AnthropicCandidate.

## Contributing

If you would like to contribute to `sik-llm-eval`, please fork the repository and submit a pull request.

See `Makefile` for building environment and running tests.
