Metadata-Version: 2.4
Name: finagent-evals
Version: 0.1.1
Summary: 111 labeled eval cases, deterministic checks, and scoring for financial trading agents
Author: Ghostfolio Trading Agent Contributors
License-Expression: Apache-2.0
Project-URL: Homepage, https://github.com/ghostfolio/trading-agent
Project-URL: Issues, https://github.com/ghostfolio/trading-agent/issues
Keywords: eval,llm,trading,agent,finance,testing
Classifier: Development Status :: 4 - Beta
Classifier: Intended Audience :: Developers
Classifier: Programming Language :: Python :: 3
Classifier: Programming Language :: Python :: 3.10
Classifier: Programming Language :: Python :: 3.11
Classifier: Programming Language :: Python :: 3.12
Classifier: Topic :: Software Development :: Testing
Classifier: Topic :: Office/Business :: Financial
Requires-Python: >=3.10
Description-Content-Type: text/markdown
License-File: LICENSE
Requires-Dist: pandas>=1.5
Requires-Dist: numpy>=1.23
Provides-Extra: dev
Requires-Dist: pytest>=7.0; extra == "dev"
Requires-Dist: build; extra == "dev"
Dynamic: license-file

# finagent-evals

111 labeled eval cases, deterministic check functions, and weighted scoring for financial trading agents.

Built from the [Ghostfolio Trading Agent](https://github.com/ghostfolio/trading-agent) eval suite and published as a standalone package so anyone building financial AI agents can benchmark against a curated, production-tested dataset.

## Features

- **111 eval cases** across 3 layers: golden (34), scenarios (47), dataset (30)
- **Deterministic checks** -- no LLM calls, pure Python assertions
- **7 check dimensions**: tool selection, tool execution, source citation, content validation, negative validation, ground truth, structural
- **Weighted scoring** with 6 dimensions: intent, tools, content, safety, confidence, verification
- **Mock infrastructure**: Ghostfolio client, market data (OHLCV), seed portfolio
- **Authoritative sources**: 9 bundled IRC/IRS tax references for compliance checks

## Install

```bash
pip install finagent-evals
```

## Quick Start

```python
from finagent_evals import GOLDEN_CASES, get_all_cases, run_golden_checks

# Browse cases
print(len(GOLDEN_CASES))    # 34
print(len(get_all_cases()))  # 111

# Run checks against your agent's output
case = GOLDEN_CASES[0]
result = {
    "response": {"summary": "Your portfolio has AAPL and GOOG..."},
    "tools_called": ["get_portfolio_snapshot"],
    "tool_errors": [],
    "react_step": 1,
    "latency_seconds": 2.0,
}
checks = run_golden_checks(case, result)
print(checks["passed"])  # True/False
```

## Eval Case Format

Each case returned by `get_all_cases()` is a dict with the following keys:

```python
{
    "id": "gs-001",
    "input": "Show me my portfolio",
    "expected_tools": ["get_portfolio_snapshot"],
    "expected_output_contains": ["portfolio"],
    "expected_output_contains_any": ["position", "holding", "value"],
    "should_not_contain": ["I don't know", "unable"],
    "ground_truth_contains": ["AAPL", "GOOG"],
    "max_react_steps": 2,
    "max_latency_seconds": 10,
    "category": "portfolio_overview",
    "case_type": "happy_path",
    "layer": "golden",   # "golden" | "scenario" | "dataset"
    "golden": True,      # True for the 34 golden cases, False otherwise
}
```

Filter golden cases with `[c for c in get_all_cases() if c["golden"]]`, or equivalently by testing `c["layer"] == "golden"`.
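
The same filtering idea extends to other keys. The following is a minimal, standalone sketch that assumes only the `layer`, `golden`, and `category` keys from the case format above. The sample cases (including the `sc-001` and `ds-001` ids and the `tax` category) and the `filter_cases` helper are hypothetical and not part of `finagent_evals`:

```python
from collections import Counter

# Hypothetical stand-ins for the dicts returned by get_all_cases().
# Only keys documented in the case format are used; ids other than
# "gs-001" and the "tax" category are made up for illustration.
sample_cases = [
    {"id": "gs-001", "layer": "golden", "golden": True, "category": "portfolio_overview"},
    {"id": "sc-001", "layer": "scenario", "golden": False, "category": "tax"},
    {"id": "ds-001", "layer": "dataset", "golden": False, "category": "portfolio_overview"},
]


def filter_cases(cases, *, layer=None, category=None):
    """Return the cases matching every filter that is not None."""
    return [
        c for c in cases
        if (layer is None or c["layer"] == layer)
        and (category is None or c["category"] == category)
    ]


golden = filter_cases(sample_cases, layer="golden")
per_layer = Counter(c["layer"] for c in sample_cases)

print([c["id"] for c in golden])
print(dict(per_layer))
```

With the real package, `sample_cases` would presumably be replaced by `get_all_cases()`.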

## Check Functions

```python
from finagent_evals import (
    check_tools,          # required tools present
    check_tools_any,      # at least one of list
    check_must_contain,   # all terms in response
    check_contains_any,   # any term in response
    check_must_not_contain,  # no forbidden terms
    check_ground_truth,   # mock-data values present
    check_structural,     # steps + latency in budget
    check_authoritative_sources,  # IRC/IRS citations
)
```
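
The exact signatures and matching rules of these functions are not documented in this README. To give a feel for what "deterministic, pure Python" checks look like, here is a rough standalone sketch. The `sketch_*` helpers are hypothetical, and both the case-insensitive substring matching and the `{"passed": ...}` return shape are assumptions, not the package's confirmed behavior:

```python
def _text(result):
    """Lower-cased response summary from a Quick Start-shaped result dict."""
    return str(result["response"].get("summary", "")).lower()


def sketch_must_contain(result, terms):
    """Pass only if every term appears in the response (assumed case-insensitive)."""
    text = _text(result)
    missing = [t for t in terms if t.lower() not in text]
    return {"passed": not missing, "missing": missing}


def sketch_must_not_contain(result, forbidden):
    """Pass only if no forbidden term appears in the response."""
    text = _text(result)
    found = [t for t in forbidden if t.lower() in text]
    return {"passed": not found, "found": found}


def sketch_structural(result, max_steps, max_latency):
    """Pass only if step count and latency are both within budget."""
    ok = (result["react_step"] <= max_steps
          and result["latency_seconds"] <= max_latency)
    return {"passed": ok}


# Same result shape as the Quick Start example.
result = {
    "response": {"summary": "Your portfolio has AAPL and GOOG..."},
    "tools_called": ["get_portfolio_snapshot"],
    "tool_errors": [],
    "react_step": 1,
    "latency_seconds": 2.0,
}

contain = sketch_must_contain(result, ["portfolio"])
forbid = sketch_must_not_contain(result, ["I don't know", "unable"])
budget = sketch_structural(result, max_steps=2, max_latency=10)
print(contain["passed"], forbid["passed"], budget["passed"])
```

Because no model call is involved, the same input always produces the same verdict, which is what makes checks of this kind suitable for CI.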

## Scoring

```python
from finagent_evals import (
    score_intent, score_tools, score_content,
    score_safety, score_ground_truth,
    aggregate_results,
    WEIGHT_INTENT,  # 0.20
    WEIGHT_TOOLS,   # 0.25
    WEIGHT_CONTENT, # 0.15
    WEIGHT_SAFETY,  # 0.15
)
```
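
As an illustration of how weights like these can combine per-dimension scores, here is a minimal sketch. It uses only the four weights listed in the comments above. The weights for the remaining dimensions are not shown in this README, so the hypothetical `weighted_overall` helper normalizes over whichever dimensions are supplied. That normalization is an assumption and may differ from the package's own formula:

```python
# Weights copied from the import comments above. Other dimensions'
# weights are not listed in this README, so they are omitted here.
WEIGHTS = {"intent": 0.20, "tools": 0.25, "content": 0.15, "safety": 0.15}


def weighted_overall(scores, weights=WEIGHTS):
    """Weighted mean over dimensions present in both dicts (hypothetical helper)."""
    used = {k: w for k, w in weights.items() if k in scores}
    total = sum(used.values())
    if total == 0:
        raise ValueError("no scored dimension has a weight")
    # Dividing by the total weight keeps the result in the same 0..1
    # range as the inputs even though the four weights sum to 0.75.
    return sum(scores[k] * w for k, w in used.items()) / total


example_scores = {"intent": 1.0, "tools": 1.0, "content": 0.5, "safety": 1.0}
overall = weighted_overall(example_scores)
print(round(overall, 3))
```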

For `aggregate_results(results)`, each result dict should include `passed`, `category`, and an overall score via **one of**: `overall_score`, `score`, or `scores["overall"]`. The aggregate summary includes `avg_overall_score`, `pass_rate_pct`, `tool_success_rate_pct`, and `by_category`.
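
The three accepted score locations can be pictured with a small standalone sketch. The `overall_of` helper is hypothetical, and the lookup order it uses (`overall_score`, then `score`, then `scores["overall"]`) is an assumption. The summary computed here only mirrors the `avg_overall_score`, `pass_rate_pct`, and `by_category` names described above, not the package's actual output, and the `tax` category is made up:

```python
from collections import defaultdict


def overall_of(result):
    """Read the overall score from whichever accepted key is present."""
    if "overall_score" in result:
        return result["overall_score"]
    if "score" in result:
        return result["score"]
    return result["scores"]["overall"]


# Hypothetical result dicts, one per accepted score location.
results = [
    {"passed": True, "category": "portfolio_overview", "overall_score": 1.0},
    {"passed": False, "category": "portfolio_overview", "score": 0.5},
    {"passed": True, "category": "tax", "scores": {"overall": 0.75}},
    {"passed": True, "category": "tax", "score": 0.75},
]

avg_overall_score = sum(overall_of(r) for r in results) / len(results)
pass_rate_pct = 100.0 * sum(r["passed"] for r in results) / len(results)

# Count passed/total per category.
by_category = defaultdict(lambda: {"passed": 0, "total": 0})
for r in results:
    by_category[r["category"]]["total"] += 1
    by_category[r["category"]]["passed"] += int(r["passed"])

print(avg_overall_score, pass_rate_pct, dict(by_category))
```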

## Mock Infrastructure

```python
from finagent_evals import MockGhostfolioClient, MOCK_LAST_CLOSE, mock_fetch_with_retry

client = MockGhostfolioClient()
holdings = client.get_holdings()  # 2 positions: AAPL, GOOG
prices = MOCK_LAST_CLOSE          # {"AAPL": 187.50, "TSLA": 248.00, ...}
```

## License

Apache-2.0

## Development

```bash
pip install -e ".[dev]"
pytest tests/ -v
```
