Metadata-Version: 2.4
Name: nexuseval
Version: 0.5.0
Summary: A production-ready LLM evaluation framework with intelligent caching and dataset management.
Project-URL: Homepage, https://github.com/ShubhamSalokhe/nexuseval
Project-URL: Issues, https://github.com/ShubhamSalokhe/nexuseval/issues
Author-email: Shubham Salokhe <shubhamsalokhe@ymail.com>
License: MIT
License-File: LICENSE
Keywords: ai,evaluation,hallucination,llm,openai,rag,testing
Classifier: Development Status :: 3 - Alpha
Classifier: Intended Audience :: Developers
Classifier: License :: OSI Approved :: MIT License
Classifier: Operating System :: OS Independent
Classifier: Programming Language :: Python :: 3
Classifier: Topic :: Scientific/Engineering :: Artificial Intelligence
Requires-Python: >=3.9
Requires-Dist: numpy>=1.20.0
Requires-Dist: openai>=1.0.0
Requires-Dist: pydantic>=2.0.0
Requires-Dist: tqdm>=4.0.0
Provides-Extra: all
Requires-Dist: aiohttp>=3.9.0; extra == 'all'
Requires-Dist: anthropic>=0.39.0; extra == 'all'
Requires-Dist: google-generativeai>=0.8.0; extra == 'all'
Requires-Dist: groq>=0.12.0; extra == 'all'
Requires-Dist: sentence-transformers>=2.2.0; extra == 'all'
Provides-Extra: embeddings
Requires-Dist: sentence-transformers>=2.2.0; extra == 'embeddings'
Provides-Extra: providers
Requires-Dist: aiohttp>=3.9.0; extra == 'providers'
Requires-Dist: anthropic>=0.39.0; extra == 'providers'
Requires-Dist: google-generativeai>=0.8.0; extra == 'providers'
Requires-Dist: groq>=0.12.0; extra == 'providers'
Description-Content-Type: text/markdown

# NexusEval 🧠

[![PyPI version](https://badge.fury.io/py/nexuseval.svg)](https://badge.fury.io/py/nexuseval)
[![License: MIT](https://img.shields.io/badge/License-MIT-yellow.svg)](https://opensource.org/licenses/MIT)
[![Python 3.9+](https://img.shields.io/badge/python-3.9+-blue.svg)](https://www.python.org/downloads/)
[![Tests](https://img.shields.io/badge/tests-67%20passed-brightgreen)](./tests)

**The production-ready framework for evaluating RAG pipelines and LLM reliability.**

NexusEval helps developers measure the quality of their Large Language Model applications using the "Golden Triad" of evaluation: **Faithfulness**, **Answer Relevance**, and **Completeness**. It is built for speed, with native asynchronous support, intelligent caching, and comprehensive dataset management.

---

## ✨ Why NexusEval?

| Feature | Benefit |
|---------|---------|
| 🚀 **60-80% Cost Reduction** | Smart caching eliminates redundant API calls |
| ⚡ **3-5x Faster** | Async processing + cache = blazing speed |
| 🔌 **Multi-Provider Support** | 5 LLM providers: OpenAI, Anthropic, Google, Groq, Ollama |
| 📊 **Dataset Management** | Load from JSON/CSV/JSONL - ready in seconds |
| 🎯 **Advanced Metrics** | 8 metrics including bias/toxicity detection |
| 💰 **Cost Tracking** | Monitor every dollar spent on evaluation |
| **Production Ready** | Preset configs for development, production, and fast runs |
| ✅ **Comprehensive Tests** | 67 tests ensure reliability |

---

## 🚀 Quick Start (30 seconds)

```bash
# Install
pip install nexuseval

# Set your API key
export OPENAI_API_KEY="sk-..."
```

```python
from nexuseval import TestCase, Evaluator, Faithfulness, AnswerRelevance

# Create a test case
case = TestCase(
    input_text="What is the capital of France?",
    actual_output="Paris is the capital of France.",
    retrieval_context=["France is a country in Europe.", "Paris is a major city."]
)

# Evaluate with automatic caching
evaluator = Evaluator(metrics=[Faithfulness(), AnswerRelevance()])
results = evaluator.evaluate([case])

print(results)
```

**That's it!** Caching is enabled by default, so re-running the same evaluation is typically 3-5x faster and 60-80% cheaper. 💸

---

## 📦 Installation

### Standard Installation (OpenAI only)

```bash
pip install nexuseval
```

### With All Providers

```bash
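# Install every optional extra (provider SDKs and embeddings)
pip install "nexuseval[all]"

# Or install a single extra (quotes keep shells like zsh from expanding the brackets)
pip install "nexuseval[providers]"   # anthropic, google-generativeai, groq, aiohttp
pip install "nexuseval[embeddings]"  # sentence-transformers, for the SemanticSimilarity metric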
```

### Development Installation

```bash
git clone https://github.com/ShubhamSalokhe/nexuseval.git
cd nexuseval
pip install -e ".[all]"  # Install with all extras
pip install pytest pytest-asyncio  # For running tests
```

### Requirements

- Python 3.9+
- OpenAI API key (set as `OPENAI_API_KEY` environment variable) - **Required**
- Optional API keys for other providers:
  - `ANTHROPIC_API_KEY` for Claude
  - `GOOGLE_API_KEY` for Gemini
  - `GROQ_API_KEY` for Groq
  - Ollama running locally for local models

---

## 🆕 What's New in v0.5.0

### 🔌 Multi-Provider LLM Support

Switch between 5 LLM providers seamlessly:

```python
from nexuseval import LLMClient

# OpenAI (default)
client = LLMClient()  # Uses gpt-4-turbo

# Anthropic Claude
client = LLMClient(provider="anthropic", model="claude-3-5-sonnet-20241022")

# Google Gemini (cheapest!)
client = LLMClient(provider="google", model="gemini-1.5-flash")

# Groq (ultra-fast)
client = LLMClient(provider="groq", model="llama-3.3-70b-versatile")

# Ollama (local, free!)
client = LLMClient(provider="ollama", model="llama3")
```

**Install optional providers:**
```bash
# All providers
pip install "nexuseval[all]"

# Or individually
pip install anthropic  # Claude
pip install google-generativeai  # Gemini
pip install groq  # Groq
pip install aiohttp  # Ollama
```

**Cost Comparison (1000 evaluations):**
| Provider | Model | Cost |
|----------|-------|------|
| OpenAI | gpt-4-turbo | ~$15.00 |
| OpenAI | gpt-4o-mini | ~$0.50 |
| Anthropic | claude-3-5-sonnet | ~$10.00 |
| Google | gemini-1.5-flash | **~$0.20** ⭐ |
| Groq | llama-3.3-70b | ~$0.70 |
| Ollama | llama3 (local) | **FREE** 🎉 |

### 🎯 Advanced Metrics (5 New)

Beyond the standard triad, now includes:

```python
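import asyncio

from nexuseval import (
    Evaluator,
    Faithfulness,
    TestCase,
    ContextRelevance,      # Measures retrieval precision
    SemanticSimilarity,    # Embedding-based comparison
    BiasDetection,         # Detects 6 bias types
    ToxicityDetection,     # Flags harmful content
    FactualConsistency     # Verifies claims against context
)

test_case = TestCase(
    input_text="What is the capital of France?",
    actual_output="Paris is the capital of France.",
    retrieval_context=["Paris is the capital of France."]
)

# Context Relevance - checks retrieval quality
metric = ContextRelevance(threshold=0.7)
# measure() is a coroutine: use asyncio.run() here, or `await` inside async code
result = asyncio.run(metric.measure(test_case))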

# Semantic Similarity - compare with expected output
metric = SemanticSimilarity(embedding_provider="openai")

# Bias Detection - automatic safety checks
metric = BiasDetection()

# Use with evaluator
evaluator = Evaluator(metrics=[
    Faithfulness(),
    ContextRelevance(),
    BiasDetection()
])
```

**Install embeddings support:**
```bash
pip install "nexuseval[embeddings]"  # For SemanticSimilarity
```
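
This README does not say how `SemanticSimilarity` turns two embeddings into a score. Embedding-based metrics commonly use cosine similarity, so the sketch below assumes that. It uses toy vectors and a hypothetical `cosine_similarity` helper, not the library's internals, and compares the score against the 0.8 threshold listed for this metric.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # convention chosen for this sketch when a vector is all zeros
    return dot / (norm_a * norm_b)

# Toy 3-dimensional "embeddings"; real embeddings have hundreds of dimensions.
expected = [1.0, 2.0, 2.0]
actual = [2.0, 4.0, 4.0]       # same direction as expected
unrelated = [2.0, -1.0, 0.0]   # orthogonal to expected

THRESHOLD = 0.8  # value taken from the metrics table; pass/fail rule is assumed
score = cosine_similarity(expected, actual)
passed = score >= THRESHOLD
```

Vectors that point in the same direction score close to 1.0 regardless of length, while orthogonal vectors score 0.0.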

### 📊 Dataset Management

Load evaluation datasets from multiple formats:

```python
from nexuseval import DatasetLoader

# From JSON
dataset = DatasetLoader.from_json("evals.json")

# From CSV with column mapping
dataset = DatasetLoader.from_csv(
    "data.csv",
    column_mapping={
        "question": "input_text",
        "response": "actual_output"
    }
)

# Generate samples for testing
from nexuseval import SampleDataGenerator
dataset = SampleDataGenerator.generate_rag_samples(n=10)

# Split for train/test
train, test = dataset.split(train_ratio=0.8, shuffle=True)
```
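
To make the split semantics concrete, here is a minimal standalone sketch of a shuffled train/test split over a plain list. `split_cases` and its `seed` parameter are hypothetical names used for illustration; the real `dataset.split` may behave differently, for example in how it rounds the cut point.

```python
import random

def split_cases(cases, train_ratio=0.8, shuffle=True, seed=None):
    """Return (train, test) lists; the input sequence is left unmodified."""
    if not 0.0 <= train_ratio <= 1.0:
        raise ValueError("train_ratio must be between 0 and 1")
    items = list(cases)
    if shuffle:
        # A dedicated Random instance keeps the split reproducible for a given seed.
        random.Random(seed).shuffle(items)
    cut = int(len(items) * train_ratio)
    return items[:cut], items[cut:]

cases = [f"case-{i}" for i in range(10)]
train, test = split_cases(cases, train_ratio=0.8, seed=42)
```

Fixing the seed is useful when you want the same held-out cases across runs, so score changes reflect the model and not the split.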

### 💾 Smart Caching

Automatically cache LLM responses to reduce costs:

```python
# CacheConfig is assumed to be exported at the top level, like LLMConfig
from nexuseval import NexusConfig, CacheConfig

# Use preset with optimal caching
config = NexusConfig.preset_development()

# Or configure manually
config = NexusConfig(
    cache=CacheConfig(
        enabled=True,
        backend="file",  # "memory", "file", or "redis"
        max_size=5000
    )
)
```

**Performance:**
- 🚀 **60-80% cost reduction** via intelligent caching
- ⚡ **3-5x faster** for repeated evaluations
- 📊 Built-in hit rate tracking

### 💰 Cost Tracking

Monitor API costs in real-time:

```python
from nexuseval import DatasetLoader, Evaluator, Faithfulness, NexusConfig

config = NexusConfig.preset_development()  # Cost tracking enabled

# Run an evaluation
dataset = DatasetLoader.from_json("evals.json")
evaluator = Evaluator(metrics=[Faithfulness()])
results = evaluator.evaluate(dataset.test_cases)

# Check costs
cost_stats = evaluator.metrics[0].llm.get_cost_stats()
print(f"Total cost: ${cost_stats['total_cost_usd']:.4f}")
print(f"Total tokens: {cost_stats['total_tokens']:,}")
```

---

## 📊 Evaluation Metrics

### Standard Metrics (RAG Triad)

| Metric | What it Measures | Use Case |
| --- | --- | --- |
| **Faithfulness** | Hallucination detection | Ensures answers are grounded in retrieved context |
| **Answer Relevance** | Response quality | Checks if the answer addresses the question |
| **Completeness** | Coverage | Verifies all parts of the query were answered |

### Advanced Metrics (NEW in v0.5.0)

| Metric | What it Measures | Threshold |
| --- | --- | --- |
| **Context Relevance** | Retrieval precision | 0.7 |
| **Semantic Similarity** | Answer similarity to expected | 0.8 |
| **Bias Detection** | Gender, race, religion, age, disability, nationality | 0.0 (no bias) |
| **Toxicity Detection** | Profanity, threats, hate speech, harassment | 0.0 (no toxicity) |
| **Factual Consistency** | Claims verified against context | 0.8 |

### Example: Detecting Incomplete Answers

```python
from nexuseval import Completeness, TestCase, Evaluator

# User asked for TWO things, model gave ONE
case = TestCase(
    input_text="Who is the CEO of Tesla and SpaceX?",
    actual_output="The CEO of Tesla is Elon Musk."  # ❌ Missed SpaceX
)

evaluator = Evaluator(metrics=[Completeness()])
results = evaluator.evaluate([case])

# Result: Low score (~0.5) with reason: "Failed to mention SpaceX"
```

---

## 🛠️ Advanced Features

### Preset Configurations

Choose the right mode for your environment:

```python
from nexuseval import NexusConfig

# 🔧 Development: Fast iteration, file cache
config = NexusConfig.preset_development()
# Uses: gpt-4o-mini, file cache, verbose output

# 🚀 Production: Best quality, distributed cache
config = NexusConfig.preset_production()
# Uses: gpt-4-turbo, Redis cache, high concurrency

# ⚡ Fast: Maximum speed
config = NexusConfig.preset_fast()
# Uses: gpt-3.5-turbo, in-memory cache, 30 concurrent requests
```

### Bulk Evaluation with Progress Bar

```python
from nexuseval import AnswerRelevance, DatasetLoader, Evaluator, Faithfulness

# Load large dataset
dataset = DatasetLoader.from_json("1000_evals.json")

# Automatic async processing with progress bar
evaluator = Evaluator(metrics=[Faithfulness(), AnswerRelevance()])
results = evaluator.evaluate(dataset.test_cases)

# 🚀 NexusEval: Evaluating 1000 cases with 2 metrics...
# 100%|████████████| 1000/1000 [00:45<00:00, 22.1it/s]
```
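
For readers curious how bounded async evaluation works in general, here is a minimal sketch using `asyncio.Semaphore` and `asyncio.gather`. The `evaluate_all` helper and the `fake_score` coroutine are hypothetical stand-ins. How NexusEval schedules its own requests is not documented here.

```python
import asyncio

async def evaluate_all(cases, score_case, max_concurrency=5):
    """Score every case concurrently, at most max_concurrency at a time."""
    semaphore = asyncio.Semaphore(max_concurrency)

    async def bounded(case):
        async with semaphore:
            return await score_case(case)

    # gather returns results in the same order as the input cases
    return await asyncio.gather(*(bounded(c) for c in cases))

stats = {"in_flight": 0, "peak": 0}

async def fake_score(case):
    """Stand-in for an LLM-backed metric call; tracks peak concurrency."""
    stats["in_flight"] += 1
    stats["peak"] = max(stats["peak"], stats["in_flight"])
    await asyncio.sleep(0.01)  # simulate network latency
    stats["in_flight"] -= 1
    return len(case) / 10

results = asyncio.run(
    evaluate_all(["a", "bb", "ccc"] * 4, fake_score, max_concurrency=3)
)
```

The semaphore caps in-flight requests so large datasets do not trip provider rate limits, while `gather` keeps results aligned with the input order.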

### Dataset Validation

```python
from nexuseval import DatasetLoader, DatasetValidator

# Load dataset
dataset = DatasetLoader.from_json("evals.json")

# Validate
validator = DatasetValidator()
issues = validator.validate_schema(dataset, require_context=True)

if issues:
    for issue in issues:
        print(f"⚠️ {issue}")

# Check duplicates
duplicates = validator.check_duplicates(dataset)
print(f"Found {len(duplicates)} duplicate test cases")
```
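
As an illustration of what a duplicate check might look for, the sketch below groups test cases whose input text matches after normalizing case and whitespace. `find_duplicates` is a hypothetical helper that operates on plain dicts. The matching rule used by `DatasetValidator.check_duplicates` is not specified in this README and may differ.

```python
from collections import defaultdict

def find_duplicates(cases):
    """Group indices of cases whose normalized input text is identical."""
    groups = defaultdict(list)
    for index, case in enumerate(cases):
        # Lowercase and collapse whitespace so trivially different copies match.
        key = " ".join(case["input_text"].lower().split())
        groups[key].append(index)
    return [indices for indices in groups.values() if len(indices) > 1]

cases = [
    {"input_text": "What is Python?"},
    {"input_text": "what is  python?"},
    {"input_text": "What is AI?"},
]
duplicates = find_duplicates(cases)
```

Each returned group lists the positions of cases that collide, which makes it easy to keep the first and drop the rest.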

### Custom Model Configuration

```python
from nexuseval import NexusConfig, LLMConfig

config = NexusConfig(
    llm=LLMConfig(
        model="gpt-4o-mini",      # Cheaper model
        temperature=0.0,           # Deterministic
        max_tokens=500             # Shorter responses
    )
)
```

---

## 📁 Dataset Formats

### JSON Format

```json
{
  "name": "my_evaluations",
  "test_cases": [
    {
      "input_text": "What is Python?",
      "actual_output": "Python is a programming language.",
      "retrieval_context": ["Python is used for AI and web development."],
      "expected_output": "A high-level programming language."
    }
  ]
}
```
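
JSONL is listed as a supported format but is not shown above. Assuming each line holds one test case object with the same field names as the JSON format, a file could be parsed as in this sketch. The `parse_jsonl` helper and the choice of required fields are assumptions, not the library's loader.

```python
import json

# Assumed minimum: the two fields every example in this README provides.
REQUIRED_FIELDS = ("input_text", "actual_output")

def parse_jsonl(text):
    """Parse one JSON object per non-blank line and check required fields."""
    records = []
    for line_number, line in enumerate(text.splitlines(), start=1):
        if not line.strip():
            continue  # tolerate blank lines between records
        record = json.loads(line)
        missing = [f for f in REQUIRED_FIELDS if f not in record]
        if missing:
            raise ValueError(f"line {line_number}: missing {missing}")
        records.append(record)
    return records

sample = (
    '{"input_text": "What is Python?", "actual_output": "A programming language."}\n'
    '\n'
    '{"input_text": "What is AI?", "actual_output": "Artificial Intelligence", '
    '"retrieval_context": ["AI simulates human intelligence"]}\n'
)
records = parse_jsonl(sample)
```

Reporting the line number on failure makes it much easier to fix a single bad record in a large file.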

### CSV Format

```csv
question,answer,context
What is AI?,Artificial Intelligence,AI simulates human intelligence
What is ML?,Machine Learning,ML is a subset of AI
```

Load with column mapping:

```python
from nexuseval import DatasetLoader

dataset = DatasetLoader.from_csv(
    "data.csv",
    column_mapping={
        "question": "input_text",
        "answer": "actual_output",
        "context": "retrieval_context"
    }
)
```
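
Conceptually, the column mapping renames each CSV column to a test-case field. The standalone sketch below shows that transformation with the standard `csv` module. `rows_to_cases` is a hypothetical helper, and wrapping the single context cell in a list is an assumption made to match the JSON format above.

```python
import csv
import io

def rows_to_cases(csv_text, column_mapping):
    """Rename CSV columns to test-case field names; unmapped columns are dropped."""
    cases = []
    for row in csv.DictReader(io.StringIO(csv_text)):
        case = {field: row[column] for column, field in column_mapping.items()}
        # The JSON format stores retrieval_context as a list, so wrap the cell.
        if "retrieval_context" in case:
            case["retrieval_context"] = [case["retrieval_context"]]
        cases.append(case)
    return cases

csv_text = (
    "question,answer,context\n"
    "What is AI?,Artificial Intelligence,AI simulates human intelligence\n"
    "What is ML?,Machine Learning,ML is a subset of AI\n"
)
mapping = {
    "question": "input_text",
    "answer": "actual_output",
    "context": "retrieval_context",
}
cases = rows_to_cases(csv_text, mapping)
```

The mapping keys are the CSV headers and the values are the target field names, the same direction used in the `from_csv` call above.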

---

## 🔍 Examples

Check out the [examples/](examples/) directory for complete working examples:

- **[basic_evaluation.py](examples/basic_evaluation.py)** - Dataset loading, caching, cost tracking
- **[dataset_management.py](examples/dataset_management.py)** - Creating, validating, and loading datasets

---

## 🧪 Running Tests

```bash
# Install test dependencies
pip install pytest pytest-asyncio

# Run all tests
pytest tests/ -v

# Run specific test file
pytest tests/test_dataset.py -v

# Run with coverage
pytest tests/ --cov=nexuseval --cov-report=html
```

**Current Status:** ✅ 67 tests, 100% pass rate

---

## 🗺️ Roadmap

### ✅ v0.4.0 - Core Infrastructure (Released)
- Dataset management (JSON, CSV, JSONL)
- Intelligent caching system
- Cost tracking
- Configuration presets

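### ✅ v0.5.0 - Advanced Metrics & Multi-Provider Support (Released)
- Context Relevance (retrieval precision)
- Bias Detection
- Toxicity Detection
- Semantic Similarity
- Factual Consistency
- Multi-provider LLM support (OpenAI, Anthropic, Google, Groq, Ollama)
- Custom metric framework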

### 📋 v0.6.0 - Reporting & Analytics
- HTML/PDF report generation
- Statistical analysis tools
- Visualization charts
- Comparative analysis

### 🤖 v0.7.0 - Multi-Model Support
- Anthropic (Claude), Google (Gemini), and Ollama support shipped early, in v0.5.0
- Additional local model backends (vLLM)
- Unified multi-provider interface

---

## ❓ FAQ

### How much does evaluation cost?

With caching enabled (default), costs are typically:
- **First run:** $0.01-0.05 per test case (depending on model)
- **Cached runs:** $0 (uses cache)
- **Average savings:** 60-80% cost reduction
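
As a rough worked example of these figures: if cached cases cost nothing, expected spend scales with the fraction of cache misses. The numbers below ($0.02 per case, a 70% hit rate) are illustrative assumptions, not measurements, and `expected_cost` is a hypothetical helper.

```python
def expected_cost(num_cases, cost_per_case, cache_hit_rate):
    """Expected spend in USD when cache hits cost nothing."""
    if not 0.0 <= cache_hit_rate <= 1.0:
        raise ValueError("cache_hit_rate must be between 0 and 1")
    return num_cases * cost_per_case * (1.0 - cache_hit_rate)

full = expected_cost(1000, 0.02, 0.0)    # cold cache: every case calls the API
cached = expected_cost(1000, 0.02, 0.7)  # 70% of cases served from the cache
savings = 1.0 - cached / full

print(f"full=${full:.2f} cached=${cached:.2f} savings={savings:.0%}")
```

Under this simple model the savings equal the hit rate, so a 60-80% reduction corresponds to 60-80% of requests being served from the cache.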

### Can I use my own LLM models?

Yes. As of v0.5.0, NexusEval supports OpenAI, Anthropic, Google, Groq, and local Ollama models through `LLMClient(provider=...)`. You can also use any OpenAI-compatible endpoint by setting a custom `base_url`.

### How do I disable caching?

```python
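# CacheConfig is assumed to be exported at the top level, like LLMConfig
from nexuseval import NexusConfig, CacheConfig

config = NexusConfig(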
    cache=CacheConfig(enabled=False)
)
```

### Where is the cache stored?

- **Memory cache:** RAM (lost on restart)
- **File cache:** `.nexuseval_cache/` directory (persistent)
- **Redis cache:** Your Redis server (distributed)

### Is this compatible with existing code?

Yes! All new features are opt-in. Your existing NexusEval code will continue to work without changes.

---

## 🤝 Contributing

We welcome contributions! Here's how to get started:

1. Fork the repository
2. Create a feature branch (`git checkout -b feature/amazing-feature`)
3. Make your changes
4. Run tests (`pytest tests/ -v`)
5. Commit your changes (`git commit -m 'Add amazing feature'`)
6. Push to the branch (`git push origin feature/amazing-feature`)
7. Open a Pull Request

Please ensure:
- ✅ All tests pass
- ✅ Code follows existing style
- ✅ Add tests for new features
- ✅ Update documentation

---

## 📝 License

MIT License - see [LICENSE](LICENSE) file for details.

---

## 🙏 Acknowledgments

Built with ❤️ for the AI evaluation community. Special thanks to:
- The RAG evaluation community for inspiration
- All contributors and users providing feedback
- OpenAI for the evaluation LLM infrastructure

---

## 💬 Support & Links

- 📖 **Documentation:** [examples/](examples/) and [tests/](tests/)
- 🐛 **Bug Reports:** [GitHub Issues](https://github.com/ShubhamSalokhe/nexuseval/issues)
- 💡 **Feature Requests:** [GitHub Issues](https://github.com/ShubhamSalokhe/nexuseval/issues)
- 📧 **Email:** shubhamsalokhe@ymail.com
- 🌐 **GitHub:** [ShubhamSalokhe/nexuseval](https://github.com/ShubhamSalokhe/nexuseval)

---

## 📈 Stats

- **Version:** 0.5.0
- **Python:** 3.9+
- **License:** MIT
- **Tests:** 67 (100% passing)
- **Code Quality:** Type-safe with Pydantic
- **Performance:** 60-80% cost reduction, 3-5x speed improvement

---

**Star ⭐ the repo if you find NexusEval useful!**
