Metadata-Version: 2.4
Name: probe-llm
Version: 0.1.0
Summary: Behavioral testing framework for LLMs — test properties, not benchmarks.
Author-email: Abhinay Singh <abhinaysingh6324@gmail.com>
License: MIT
Project-URL: Homepage, https://github.com/AbhinaySingh6324/probe-llm
Project-URL: Repository, https://github.com/AbhinaySingh6324/probe-llm
Project-URL: Issues, https://github.com/AbhinaySingh6324/probe-llm/issues
Keywords: llm,testing,metamorphic,behavioral,evaluation,ai,nlp
Classifier: Development Status :: 3 - Alpha
Classifier: Intended Audience :: Developers
Classifier: Intended Audience :: Science/Research
Classifier: Topic :: Software Development :: Testing
Classifier: Topic :: Scientific/Engineering :: Artificial Intelligence
Classifier: License :: OSI Approved :: MIT License
Classifier: Programming Language :: Python :: 3
Classifier: Programming Language :: Python :: 3.9
Classifier: Programming Language :: Python :: 3.10
Classifier: Programming Language :: Python :: 3.11
Requires-Python: >=3.9
Description-Content-Type: text/markdown
Requires-Dist: click>=8.0
Requires-Dist: rich>=13.0
Requires-Dist: httpx>=0.25
Requires-Dist: pyyaml>=6.0
Requires-Dist: sentence-transformers>=2.2
Requires-Dist: numpy>=1.24
Provides-Extra: openai
Requires-Dist: openai>=1.0; extra == "openai"
Provides-Extra: anthropic
Requires-Dist: anthropic>=0.30; extra == "anthropic"
Provides-Extra: all
Requires-Dist: openai>=1.0; extra == "all"
Requires-Dist: anthropic>=0.30; extra == "all"
Provides-Extra: dev
Requires-Dist: pytest>=7.0; extra == "dev"
Requires-Dist: ruff>=0.1; extra == "dev"
Requires-Dist: mypy>=1.0; extra == "dev"

# 🔬 probe

**Behavioral testing framework for LLMs — test properties, not benchmarks.**

---

Benchmarks tell you a model scores 85% on MMLU.
`probe` tells you whether the model **actually behaves consistently** on *your* tasks.

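Install from a local clone of the repository (run this from the repository root). The `openai` extra can be swapped for `anthropic`, `all`, or `dev`:

```bash
pip install -e ".[openai]"
```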

## Quick Start

### Python

```python
from probe import quick_test, print_summary

results = quick_test(
    model="openai:gpt-4o-mini",
    inputs=[
        "What is the capital of France?",
        "Is 17 a prime number?",
        "Explain photosynthesis in 2 sentences.",
    ],
    properties=["consistency", "robustness"],
)

print_summary(results)
```

### CLI

```bash
probe run --model openai:gpt-4o-mini --input "What is the capital of France?" -p consistency
probe run --model openai:gpt-4o-mini --inputs test_cases.txt -p consistency,invariance,robustness
probe compare --model-a openai:gpt-4o --model-b ollama:llama3 --inputs test_cases.txt
probe list-properties
```

## Behavioral Properties

| Property        | What It Tests                              |
|-----------------|--------------------------------------------|
| **consistency** | Rephrase question N ways → same answer?    |
| **invariance**  | Change irrelevant details → answer holds?  |
| **negation**    | Negate the question → answer flips?        |
| **robustness**  | Add typos → model still works?             |
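
To make the idea behind these properties concrete, here is a minimal sketch of a metamorphic check: transform the input, query the model again, and compare the new answer with the baseline. It is not `probe`'s implementation. The names `similarity`, `add_typos`, `pass_rate`, and `toy_model` are hypothetical, and character-level matching from `difflib` stands in for whatever similarity measure the library actually uses.

```python
import random
from difflib import SequenceMatcher
from typing import Callable, List


def similarity(a: str, b: str) -> float:
    """Character-level similarity in [0, 1]; a stand-in for semantic similarity."""
    return SequenceMatcher(None, a.lower().strip(), b.lower().strip()).ratio()


def add_typos(text: str, rate: float = 0.1, seed: int = 0) -> str:
    """Swap adjacent letters at roughly `rate` of positions (seeded for repeatability)."""
    rng = random.Random(seed)
    chars = list(text)
    i = 0
    while i < len(chars) - 1:
        if chars[i].isalpha() and chars[i + 1].isalpha() and rng.random() < rate:
            chars[i], chars[i + 1] = chars[i + 1], chars[i]
            i += 2  # skip past the pair that was just swapped
        else:
            i += 1
    return "".join(chars)


def pass_rate(
    model: Callable[[str], str],
    original: str,
    variants: List[str],
    threshold: float = 0.8,
) -> float:
    """Fraction of variants whose answer stays close to the baseline answer."""
    if not variants:
        return 1.0
    baseline = model(original)
    passed = sum(similarity(baseline, model(v)) >= threshold for v in variants)
    return passed / len(variants)


def toy_model(prompt: str) -> str:
    """Hypothetical stand-in for a real LLM call."""
    return "Paris" if "france" in prompt.lower() else "I am not sure."


question = "What is the capital of France?"

# Consistency: rephrasings should produce the same answer.
rephrasings = [
    "Which city is the capital of France?",
    "France's capital is which city?",
]
print("consistency:", pass_rate(toy_model, question, rephrasings))

# Robustness: typo-laden variants should still produce the same answer.
noisy = [add_typos(question, rate=0.2, seed=s) for s in range(3)]
print("robustness:", pass_rate(toy_model, question, noisy))
```

Negation would use the same loop with the comparison inverted: a negated question passes when the answer moves away from the baseline rather than staying close to it.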

## Providers

Models are specified as `provider:model-name`:

```
openai:gpt-4o-mini                  # OpenAI
anthropic:claude-sonnet-4-20250514  # Anthropic
ollama:llama3                       # Local via Ollama
```
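
The provider strings above all share a `provider:model` shape. As a rough illustration of how such a string could be split and validated, here is a small sketch. `ModelSpec`, `KNOWN_PROVIDERS`, and `parse_model_spec` are hypothetical names made up for this example and are not part of `probe`'s API.

```python
from typing import NamedTuple


class ModelSpec(NamedTuple):
    provider: str
    model: str


# Assumption: only the three providers listed in this README are accepted.
KNOWN_PROVIDERS = {"openai", "anthropic", "ollama"}


def parse_model_spec(spec: str) -> ModelSpec:
    """Split 'provider:model' on the first colon and validate both parts."""
    provider, sep, model = spec.partition(":")
    if not sep or not provider or not model:
        raise ValueError(f"expected 'provider:model', got {spec!r}")
    if provider not in KNOWN_PROVIDERS:
        raise ValueError(f"unknown provider {provider!r}")
    return ModelSpec(provider, model)


print(parse_model_spec("ollama:llama3"))
```

Splitting on the first colon only means a model name that itself contains a colon stays intact in the `model` field.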

## License

MIT
