Metadata-Version: 2.4
Name: openenv-halluguard
Version: 2.1.1
Summary: Production RL environment for training LLMs on hallucination avoidance — 1M+ examples across 38 datasets
Project-URL: Homepage, https://huggingface.co/spaces/SamSankar/hallucination-guard-env
Project-URL: Repository, https://github.com/SS-360/Openenv-halluguard
Project-URL: Documentation, https://samsankar-hallucination-guard-env.hf.space/docs
Author: HallucinationGuard-Env Contributors
License: MIT
Keywords: ai-safety,hallucination-detection,openenv,question-answering,reinforcement-learning
Classifier: Development Status :: 5 - Production/Stable
Classifier: Intended Audience :: Developers
Classifier: Intended Audience :: Science/Research
Classifier: License :: OSI Approved :: MIT License
Classifier: Programming Language :: Python :: 3
Classifier: Programming Language :: Python :: 3.10
Classifier: Programming Language :: Python :: 3.11
Classifier: Programming Language :: Python :: 3.12
Classifier: Topic :: Scientific/Engineering :: Artificial Intelligence
Requires-Python: >=3.10
Requires-Dist: aiofiles>=23.0.0
Requires-Dist: bert-score>=0.3.13
Requires-Dist: datasets>=2.14.0
Requires-Dist: fastapi>=0.100.0
Requires-Dist: huggingface-hub>=0.20.0
Requires-Dist: numpy<2.0.0,>=1.24.0
Requires-Dist: openenv-core>=0.2.0
Requires-Dist: protobuf<5.0.0,>=3.20.0
Requires-Dist: pydantic>=2.0.0
Requires-Dist: python-json-logger>=2.0.0
Requires-Dist: requests>=2.31.0
Requires-Dist: rouge-score>=0.1.2
Requires-Dist: sentence-transformers<3.0.0,>=2.7.0
Requires-Dist: transformers<5.0.0,>=4.35.0
Requires-Dist: uvicorn>=0.23.0
Provides-Extra: dev
Requires-Dist: httpx>=0.24.0; extra == 'dev'
Requires-Dist: pytest-asyncio>=0.21.0; extra == 'dev'
Requires-Dist: pytest>=7.0.0; extra == 'dev'
Description-Content-Type: text/markdown

---
title: HallucinationGuard-Env
emoji: 🛡️
colorFrom: gray
colorTo: blue
sdk: docker
app_port: 7860
pinned: true
tags:
  - openenv
  - reinforcement-learning
  - hallucination-detection
  - grounded-generation
  - question-answering
  - fact-checking
  - llm-training
  - llm-evaluation
  - benchmark
  - ai-safety
---

# 🛡️ HallucinationGuard-Env v4.2

> **The production-grade OpenEnv RL environment for training and evaluating LLMs on hallucination avoidance.**

[![OpenEnv](https://img.shields.io/badge/OpenEnv-Compatible-blue)](https://github.com/meta-pytorch/OpenEnv)
[![PyPI](https://img.shields.io/badge/PyPI-openenv--halluguard-orange)](https://pypi.org/project/openenv-halluguard/)
[![License](https://img.shields.io/badge/License-MIT-green)](LICENSE)
[![Dataset](https://img.shields.io/badge/Dataset-1M%2B_examples-orange)](#datasets)
[![Grader](https://img.shields.io/badge/Grader-research--grade-purple)](#-hallucination-detection)

---

## 💡 The Inspiration

During research for a Hackathon, an AI model confidently hallucinated a **"golden ticket backdoor"** — claiming that Ideathon winners could skip directly to the Grand Finale. This information existed nowhere in the official sources. The AI stated it with high confidence and even fabricated a supporting quote.

That moment made one thing clear: hallucination isn't just an academic problem. It causes real confusion in high-stakes situations.

**HallucinationGuard-Env** was built to fix that — training AI models to say *"I don't know"* when they don't, cite real sources when they do, and never fabricate with confidence.

---

## 🚀 Quick Start

### Python SDK

```bash
pip install openenv-halluguard
```

```python
from hallucination_guard_env import HallucinationEnv

# Define your model
def my_model(question, context):
    # Call your LLM here — answer from context only
    return "your answer based on context"

# Evaluate in 5 lines
env = HallucinationEnv()
obs = env.reset()
action = my_model(obs.question, obs.context)
result = env.step(answer=action, confidence=0.8)
print(f"Reward: {result.reward}, Hallucinated: {result.is_hallucination}")
```

### Raw HTTP

```python
import requests

BASE = "https://samsankar-hallucination-guard-env.hf.space"

# 1. Start episode
obs = requests.post(f"{BASE}/reset").json()
print(obs["question"], obs["context"])

# 2. Answer from context only
result = requests.post(f"{BASE}/step", json={"answer": "your answer"}).json()
print(f"Reward: {result['reward']}, Hallucinated: {result['is_hallucination']}")

# 3. View leaderboard
print(requests.get(f"{BASE}/leaderboard").json())
```

### Run Locally

```bash
git clone https://huggingface.co/spaces/SamSankar/hallucination-guard-env
cd hallucination-guard-env
pip install -r server/requirements.txt
uvicorn server.app:app --reload --port 7860
curl http://localhost:7860/health
```

---

## 📁 Project Structure

```
hallucination-guard-env/
├── Dockerfile                    # HF Spaces Docker config
├── pyproject.toml                # Package metadata
├── openenv.yaml                  # OpenEnv manifest
├── README.md                     # This file
│
├── server/                       # FastAPI backend
│   ├── app.py                    # Main FastAPI application (endpoints)
│   ├── environment.py            # Core RL environment logic
│   ├── grader.py                 # 9-component reward system
│   ├── dataset_loader.py         # 38 dataset loader with HF cache
│   ├── tasks.py                  # Task registry (3 tasks)
│   ├── metrics.py                # Real-time metrics tracker
│   ├── requirements.txt          # Python dependencies
│   └── Dockerfile                # Server Docker image
│
├── models.py                     # Data models (Action, Observation, State)
├── client.py                     # HTTP/WebSocket client
└── inference.py                  # Baseline inference script (hackathon submission)
```

---

## 🎯 Tasks

HallucinationGuard-Env exposes **3 named tasks** in difficulty order:

| # | task_id | Difficulty | Primary Datasets | Frontier LLM Score |
|---|---------|-----------|-----------------|-------------------|
| 1 | `task_1_factual_grounding` | 🟢 Beginner | SQuAD, BoolQ, ARC, OpenBookQA | 0.70–0.85 |
| 2 | `task_2_multi_hop_synthesis` | 🟡 Intermediate | HotpotQA, CoQA, NQ-Open, MS-MARCO | 0.55–0.70 |
| 3 | `task_3_adversarial_resistance` | 🔴 Advanced | HaluEval, TruthfulQA, FEVER, AdversarialQA | 0.40–0.60 |

---

## 🎮 How The Environment Works

The agent receives a **question** and a **source document**. It must answer using only what the document says, provide a direct quote supporting its answer, and state how confident it is.

### Action Space

Every `POST /step` call accepts this JSON body (only `answer` is required):

```json
{
    "answer":           "string — derived ONLY from the provided context",
    "confidence":       0.5,     // float 0.0–1.0, calibrated estimate
    "source_quote":     "string — verbatim phrase from context supporting the answer",
    "reasoning":        "string — optional chain-of-thought",
    "uncertainty_flags": []      // list of aspects the agent is unsure about
}
```

### Observation Space

```python
@dataclass
class HallucinationObservation:
    question: str                  # The question to answer
    context: str                   # Source document to answer from
    reward: float                  # Step reward (-1.0 to 1.0)
    feedback: str                  # Detailed human-readable feedback
    is_hallucination: bool         # Was hallucination detected?
    hallucination_type: str        # Type of hallucination detected
    hallucination_severity: str    # NONE / MINOR / MODERATE / SEVERE / CRITICAL
    grounding_score: float         # How well answer is grounded in context
    accuracy_so_far: float         # Running accuracy this episode
    skill_rating: float            # ELO-style skill rating
    attempts_remaining: int        # Steps left in episode
    done: bool                     # Episode complete?
```

### Episode Flow

```
reset()
  → Sample question + context from dataset (curriculum-aware)
  → Return initial observation

step(action)
  → Grade answer across 9 research-grade components
  → Detect hallucination type and severity
  → Compute multi-factor reward with ROUGE + BERTScore + AlignScore
  → Adapt difficulty based on performance
  → Return observation with reward + rich feedback

state()
  → Return episode metadata: ID, step count, skill rating, curriculum stage
```

---

## 📊 Reward System (v4.1 — Research-Grade)

| Component | Weight | Description |
|-----------|--------|-------------|
| Factual correctness | 0.30 | Exact/fuzzy match + semantic similarity to ground truth |
| Source grounding | 0.20 | Verifies answer is supported by context |
| Citation accuracy | 0.15 | `source_quote` found verbatim in context |
| Confidence calibration | 0.15 | ECE between stated confidence and correctness |
| Semantic consistency | 0.10 | NLI entailment score (DeBERTa-v3-base) |
| Hallucination penalty | 0.10 | Penalises detected hallucinations |
| ROUGE (1/2/L) | 0.05 | Surface-form overlap with reference answer |
| BERTScore (DeBERTa) | 0.05 | Token-level semantic similarity |
| AlignScore | 0.05 | Faithfulness to context (RoBERTa, ACL 2023) |

Difficulty multiplier: `beginner × 0.9`, `intermediate × 1.0`, `advanced × 1.1`, `expert × 1.2`

```
reward = clamp(Σ(weight × score) × difficulty_multiplier + consistency_bonus, 0.0, 1.0)
```

---

## 📦 Batch Evaluation

For high-throughput model benchmarking, use the batch endpoints:

### POST /batch/evaluate

Evaluate multiple question-answer pairs in one request:

```json
{
  "items": [
    {
      "question": "What is the capital of France?",
      "context": "The capital of France is Paris.",
      "answer": "Paris",
      "confidence": 0.9,
      "source_quote": "capital of France is Paris",
      "ground_truth": "Paris"
    }
  ],
  "task_id": "task_1_factual_grounding"
}
```

**Response:**
```json
{
  "total_items": 1,
  "results": [
    {
      "index": 0,
      "reward": 0.85,
      "is_hallucination": false,
      "correctness": 1.0,
      "explanation": "Answer is correct and well-grounded."
    }
  ],
  "summary": {
    "avg_reward": 0.85,
    "hallucination_rate": 0.0,
    "score_distribution": {"high": 1, "medium": 0, "low": 0}
  }
}
```

### POST /batch/stream

For large batches (100+ items), stream results as NDJSON:

```python
import requests
import json

response = requests.post(f"{BASE}/batch/stream", json={
    "items": [...],  # 100+ items
    "task_id": "task_1_factual_grounding"
}, stream=True)

for line in response.iter_lines():
    result = json.loads(line)
    print(f"Item {result['index']}: {result['reward']}")
```

---

## 🔬 Hallucination Detection

### 8 Types Classified

| Type | What It Catches |
|---|---|
| `FABRICATED_FACT` | Information stated that is not in the source |
| `FALSE_CITATION` | `source_quote` that does not exist in the document |
| `OVERCONFIDENT_WRONG` | High confidence on an incorrect answer |
| `CONTEXT_DRIFT` | Answer gradually drifts away from source |
| `NUMERICAL_FABRICATION` | Made-up statistics or numbers |
| `ENTITY_CONFUSION` | Wrong names, organisations, or places |
| `TEMPORAL_ERROR` | Incorrect dates or timelines |
| `RELATIONSHIP_ERROR` | Incorrect relationships between entities |

### "I Don't Know" Refusal Handling

The grader now detects when a model appropriately refuses to answer unanswerable questions:

| Scenario | Reward | Behavior |
|----------|--------|----------|
| Proper refusal on unanswerable | 0.65–0.80 | Rewarded for honesty |
| Refusal with low confidence | 0.50 | Partial credit |
| Underconfident refusal (answer exists) | 0.30 | Penalized for not trying |

**Detected refusal phrases:** "I cannot answer", "not in the context", "I don't know", "cannot determine", "insufficient information", etc.

### Hallucination Explanations

When hallucination is detected, the grader returns a human-readable explanation:

```json
{
  "hallucination_explanation": "Entity hallucination (80%): Answer contains names/entities not in source | Overconfidence (40%): Confidence exceeds answer quality"
}
```

Components explained:
- **Entity hallucination** — Fabricated names/entities detected
- **Numerical fabrication** — Made-up numbers
- **Low word coverage** — Percentage of answer words not in context
- **Ground truth mismatch** — Answer differs from correct answer
- **Overconfidence** — Confidence level exceeds answer quality

### 5 Severity Levels

| Level | Score | Meaning |
|---|---|---|
| NONE | 0.0 | Fully grounded answer |
| MINOR | 0.1–0.3 | Slight deviation from source |
| MODERATE | 0.3–0.5 | Noticeable unsupported claims |
| SEVERE | 0.5–0.7 | Significantly fabricated content |
| CRITICAL | 0.7+ | Answer largely invented |

---

## 📚 Datasets

**1,090,163 total examples** across 38 real-world QA datasets — cached permanently, instant boot:

| Source | Examples | Domain |
|---|---|---|
| SQuAD + SQuAD-v2 | 100,000 | Reading comprehension |
| TriviaQA | 50,000 | Open-domain factual QA |
| HotpotQA | 50,000 | Multi-hop reasoning |
| DROP | 50,000 | Numerical reasoning |
| RACE | 50,000 | Exam reading comprehension |
| NewsQA | 50,000 | News article QA |
| FaithDial / HH-RLHF | 49,649 | Faithful dialogue |
| FEVER / SNLI | 49,947 | Fact verification |
| NQ Open | 50,000 | Natural questions |
| AQUA-RAT | 97,467 | Math word problems |
| XSum | 49,994 | Extreme summarisation |
| CNN/DailyMail | 50,000 | News summarisation |
| HellaSwag | 39,905 | Commonsense completion |
| AdversarialQA | 30,000 | Adversarial reading comprehension |
| WinoGrande | 40,398 | Commonsense inference |
| CommonsenseQA | 9,741 | Commonsense reasoning |
| BoolQ | 9,427 | Boolean yes/no QA |
| CoQA | 7,199 | Conversational QA |
| MedQA | 10,000 | Medical licensing exam |
| MedMCQA | 20,000 | Medical entrance exam |
| SciTail | 23,596 | Science entailment |
| HaluEval | 10,000 | Hallucination evaluation |
| TruthfulQA | 817 | Factuality benchmark |
| QASC | 8,134 | Multi-hop science |
| QUAIL | 10,246 | Reading comprehension |
| SciQ | 11,679 | Science QA |
| Circa | 31,525 | Social context QA |
| ARC | 2,590 | Science exam |
| OpenBookQA | 4,957 | Common knowledge |
| AG News | 50,000 | News classification |
| QuaRTz | 2,696 | Qualitative science |
| Climate-FEVER | 881 | Climate fact verification |
| PubMedQA | 1,000 | Biomedical QA |
| Medical QA Pairs | 3,000 | Medical question similarity |
| MS MARCO | 30,568 | Web search QA |

---

## 🔧 OpenEnv Required Endpoints

### `GET /tasks`
Returns all 3 task definitions and the complete action schema.

### `POST /grader`
Score a completed episode. Pass the per-step rewards and info dicts collected during the episode.

### `POST /baseline`
Run the built-in heuristic agent across all 3 tasks. No API key required.

---

## 🌐 Deployment (HuggingFace Spaces)

### Startup Optimization

The environment uses a **two-phase loading strategy**:

1. **Core datasets** (~50K examples) load synchronously at startup
2. **Extended datasets** (~1M examples) load in background after server is healthy

This ensures fast cold starts while maintaining full dataset availability.

### Configuration

```dockerfile
# Dockerfile optimized for HF Spaces
FROM python:3.10-slim

# Pre-download ML models during build (saves ~2min startup)
RUN python -c "from sentence_transformers import SentenceTransformer; SentenceTransformer('all-MiniLM-L6-v2')"
RUN python -c "from sentence_transformers import CrossEncoder; CrossEncoder('cross-encoder/nli-deberta-v3-base')"

# Health check with 5-minute start-period for cold boot
HEALTHCHECK --interval=30s --timeout=15s --start-period=300s --retries=10 \
    CMD curl -f http://localhost:7860/health || exit 1
```

---

## 📀 API Endpoints

### Environment

| Method | Endpoint | Description |
|--------|----------|-------------|
| `POST` | `/reset` | Start a new episode |
| `POST` | `/step` | Submit an answer |
| `GET` | `/state` | Get current episode state |

### Sessions

| Method | Endpoint | Description |
|--------|----------|-------------|
| `POST` | `/session/reset` | Create a stateful session |
| `POST` | `/session/step` | Step in a session |
| `DELETE` | `/session` | Close a session |

### OpenEnv

| Method | Endpoint | Description |
|--------|----------|-------------|
| `GET` | `/tasks` | List all tasks + action schema |
| `POST` | `/grader` | Score a completed episode |
| `POST` | `/baseline` | Run baseline agent |
| `GET` | `/metadata` | Environment metadata |
| `GET` | `/schema` | Action/Observation/State schemas |
| `POST` | `/mcp` | MCP JSON-RPC endpoint |

### Leaderboard

| Method | Endpoint | Description |
|--------|----------|-------------|
| `GET` | `/leaderboard` | Model leaderboard |
| `POST` | `/leaderboard/submit` | Submit evaluation results |

### Metrics

| Method | Endpoint | Description |
|--------|----------|-------------|
| `GET` | `/metrics` | Real-time metrics |
| `GET` | `/metrics/summary` | Metrics summary report |
| `GET` | `/metrics/timing` | Time-per-step metrics |

### Batch Evaluation

| Method | Endpoint | Description |
|--------|----------|-------------|
| `POST` | `/batch/evaluate` | Evaluate multiple Q&A pairs in one request |
| `POST` | `/batch/stream` | Streaming batch evaluation (NDJSON) |

### Visualization

| Method | Endpoint | Description |
|--------|----------|-------------|
| `GET` | `/leaderboard/viz` | Leaderboard chart data (bar, scatter, tiers) |

---

## 📋 Baseline Scores

### Heuristic Baseline (no LLM required)

The heuristic baseline is a deterministic agent that establishes a performance floor. It demonstrates what happens when an agent **completely ignores the question** and simply returns the first sentence of the context.

#### How It Works

```python
def heuristic_agent(question: str, context: str) -> dict:
    # 1. Extract first sentence from context (ignoring the question entirely)
    sentences = [s.strip() for s in context.split(".") if len(s.strip()) > 10]
    answer = sentences[0] if sentences else context[:120]

    # 2. Use fixed confidence (not calibrated)
    confidence = 0.6

    # 3. Return first 80 chars as "source quote" (often irrelevant)
    source_quote = context[:80]

    return {"answer": answer, "confidence": confidence, "source_quote": source_quote}
```

**Why this baseline?** It represents the absolute minimum viable agent — one that processes context but doesn't understand questions. Any real LLM should beat this by reading the question and finding relevant context.

#### Testing Methodology

We ran the heuristic baseline **5 times** on a local server (reproducible conditions) with seeds 42-46:

```bash
# Run locally for reproducible results
uvicorn server.app:app --port 7860
python inference.py --heuristic --env-url http://localhost:7860 --episodes 3 --steps 5 --seed 42
```

#### Results (5 Runs, Seeds 42-46)

| Seed | Task 1 | Task 2 | Task 3 | Overall | Time |
|------|--------|--------|--------|---------|------|
| 42 | 0.151 | 0.076 | 0.037 | **0.088** | 56s |
| 43 | 0.194 | 0.105 | 0.125 | **0.141** | 52s |
| 44 | 0.181 | 0.074 | 0.112 | **0.122** | 48s |
| 45 | 0.221 | 0.062 | 0.142 | **0.142** | 51s |
| 46 | 0.129 | 0.002 | 0.037 | **0.056** | 44s |
| **Mean** | **0.175** | **0.064** | **0.090** | **0.110** | **50s** |
| **Std Dev** | ±0.034 | ±0.038 | ±0.046 | ±0.036 | ±4s |

#### Aggregated Baseline Score

| Task | Mean Score | Std Dev | 95% CI |
|------|------------|---------|--------|
| task_1_factual_grounding | 0.175 | ±0.034 | [0.14, 0.21] |
| task_2_multi_hop_synthesis | 0.064 | ±0.038 | [0.03, 0.10] |
| task_3_adversarial_resistance | 0.090 | ±0.046 | [0.05, 0.13] |
| **Overall** | **0.110** | ±0.036 | [0.07, 0.15] |

> **Note on Variance**: The high variance (±33% relative std dev) is expected because:
> 1. The heuristic ignores questions — it lucks into correct answers when the first sentence happens to be relevant
> 2. Different seeds sample different question/context pairs from 38 datasets
> 3. Task 2 has the lowest scores because multi-hop reasoning requires understanding questions

### LLM Baseline (requires API key)

Real LLMs understand questions and find relevant context. Here's how to run them:

```bash
# Set required environment variables
export API_BASE_URL=https://api.groq.com/openai/v1
export MODEL_NAME=qwen/qwen3-32b
export HF_TOKEN=gsk_your_key_here

# Run inference
python inference.py --env-url http://localhost:7860 --episodes 3 --steps 5 --seed 42
```

### Tested LLM Results

We tested multiple LLMs on this benchmark. All tests used: **3 episodes × 5 steps, seed=42, local server**.

#### Leaderboard

| Rank | Model | Provider | Overall | Task 1 | Task 2 | Task 3 | Time |
|------|-------|----------|---------|--------|--------|--------|------|
| 🥇 | **qwen/qwen3-32b** | Groq (cloud) | **0.51** | 0.56 | 0.48 | 0.47 | 277s |
| 🥈 | **Llama 3.3 70B** | Groq (cloud) | **0.45** | 0.52 | 0.43 | 0.41 | 45s |
| 🥉 | **Llama 3.1 8B** | Groq (cloud) | **0.42** | 0.48 | 0.40 | 0.38 | 40s |
| 4 | GLM-4.5-Air | OpenRouter (cloud) | **0.26** | 0.22 | 0.34 | 0.23 | 960s |
| 5 | Qwen2.5-72B-Instruct | HF Router (cloud) | **0.24** | 0.28 | 0.13 | 0.31 | 161s |
| - | Heuristic (5-run avg) | — | 0.11 | 0.18 | 0.06 | 0.09 | 50s |

#### Performance Analysis

| Model | vs Baseline | Hackathon Req (≥0.20) | Speed |
|-------|-------------|------------------------|-------|
| qwen/qwen3-32b | **4.6× baseline** | ✅ 2.5× above | Medium (277s) |
| Llama 3.3 70B | **4.1× baseline** | ✅ 2.3× above | Fast (45s) |
| Llama 3.1 8B | **3.8× baseline** | ✅ 2.1× above | Fastest (40s) |
| GLM-4.5-Air | **2.4× baseline** | ✅ 1.3× above | Slow (960s) |
| Qwen2.5-72B | **2.2× baseline** | ✅ 1.2× above | Medium (161s) |
| Heuristic | 1.0× (baseline) | ❌ Below | N/A |

#### Key Findings

1. **All LLMs beat the heuristic by 2-4.6×** — confirming the environment measures hallucination resistance
2. **Groq qwen/qwen3-32b achieves the highest score (0.51)** — best overall performance
3. **Groq Llama 3.3 70B** — best speed/quality tradeoff (0.45 in 45s)
4. **Groq Llama 3.1 8B** — impressive for an 8B model (0.42)
5. **All LLMs exceed hackathon requirement (≥0.20)** — by 1.2-2.5×

#### Reproducibility Notes

| Server | Reproducible? | Notes |
|--------|---------------|-------|
| Local (localhost:7860) | ✅ Yes | No other clients, same seed = same scores |
| HuggingFace Spaces | ❌ Varies | Shared server, other requests affect random state |

For **strictly reproducible** benchmark scores:
```bash
# 1. Start fresh local server
uvicorn server.app:app --port 7860

# 2. Run with same seed
python inference.py --heuristic --env-url http://localhost:7860 --seed 42
```

#### Running with HuggingFace Router (Recommended)

```bash
# Set environment variables
export API_BASE_URL=https://router.huggingface.co/v1
export MODEL_NAME=Qwen/Qwen2.5-72B-Instruct
export HF_TOKEN=hf_your_token_here

# Run inference
python inference.py --env-url http://localhost:7860 --episodes 3 --steps 5 --seed 42
```

#### Running with Groq (Cloud - Best Performance)

```bash
# Set environment variables
export API_BASE_URL=https://api.groq.com/openai/v1
export MODEL_NAME=qwen/qwen3-32b
export HF_TOKEN=gsk_your_key_here

# Run inference
python inference.py --env-url http://localhost:7860 --episodes 3 --steps 5 --seed 42
```

#### Running with OpenRouter (Cloud)

```bash
# Set environment variables
export API_BASE_URL=https://openrouter.ai/api/v1
export MODEL_NAME=nvidia/nemotron-3-super-120b-a12b:free  # or z-ai/glm-4.5-air:free
export HF_TOKEN=sk-or-v1-your_key_here

# Run inference
python inference.py --env-url http://localhost:7860 --episodes 3 --steps 5 --seed 42
```

---

## 💻 Development

### Run Locally

```bash
# Clone and install
git clone https://huggingface.co/spaces/SamSankar/hallucination-guard-env
cd hallucination-guard-env
pip install -r server/requirements.txt

# Run server
uvicorn server.app:app --reload --port 7860

# Run tests
pytest tests/

# Run baseline (heuristic, no API key)
python inference.py --heuristic
```

### Run with LLM API

```bash
export API_BASE_URL=https://router.huggingface.co/v1
export MODEL_NAME=Qwen/Qwen2.5-72B-Instruct
export HF_TOKEN=hf_...
python inference.py --episodes 3 --steps 5
```

---

## 🔌 Integration Examples

### OpenAI SDK Integration

```python
# examples/openai_integration.py
from openai import OpenAI
import requests

client = OpenAI()
ENV_URL = "https://samsankar-hallucination-guard-env.hf.space"

def evaluate_with_gpt4(question: str, context: str) -> dict:
    # Get answer from GPT-4
    response = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[{
            "role": "user",
            "content": f"Answer ONLY from context.\n\nContext: {context}\n\nQuestion: {question}\n\n"
                       f"Return JSON: {{'answer': '...', 'confidence': 0.XX, 'source_quote': '...'}}"
        }],
        temperature=0.1
    )

    # Parse and submit to environment
    import json
    result = json.loads(response.choices[0].message.content)

    step = requests.post(f"{ENV_URL}/step", json={
        "answer": result["answer"],
        "confidence": result["confidence"],
        "source_quote": result["source_quote"]
    })

    return step.json()

# See examples/openai_integration.py for full implementation
```

### Anthropic Claude Integration

```python
# examples/anthropic_integration.py
from anthropic import Anthropic
import requests

client = Anthropic()
ENV_URL = "https://samsankar-hallucination-guard-env.hf.space"

def evaluate_with_claude(question: str, context: str) -> dict:
    response = client.messages.create(
        model="claude-sonnet-4-20250514",
        max_tokens=500,
        messages=[{
            "role": "user",
            "content": f"Answer using ONLY the provided context.\n\nContext: {context}\n\nQuestion: {question}"
        }]
    )

    # Submit to environment
    step = requests.post(f"{ENV_URL}/step", json={
        "answer": response.content[0].text,
        "confidence": 0.8,
        "source_quote": ""
    })

    return step.json()

# See examples/anthropic_integration.py for full implementation
```

### Batch Evaluation

```bash
# Run batch evaluation across all tasks
python examples/batch_evaluation.py --episodes 5 --output results.json
```

---

## 🚀 Production Deployment

### Docker Compose (Multi-Service)

```yaml
# docker-compose.yml
version: '3.8'

services:
  hallucination-guard:
    build: .
    ports:
      - "7860:7860"
    environment:
      - PYTHONUNBUFFERED=1
    healthcheck:
      test: ["CMD", "curl", "-f", "http://localhost:7860/health"]
      interval: 30s
      timeout: 15s
      retries: 3
      start_period: 300s
    deploy:
      resources:
        limits:
          memory: 4G
        reservations:
          memory: 2G

  # Optional: Redis for session caching
  redis:
    image: redis:alpine
    ports:
      - "6379:6379"
```

### Kubernetes Deployment

```yaml
# k8s/deployment.yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: hallucination-guard-env
spec:
  replicas: 2
  selector:
    matchLabels:
      app: hallucination-guard
  template:
    metadata:
      labels:
        app: hallucination-guard
    spec:
      containers:
      - name: server
        image: hallucination-guard:latest
        ports:
        - containerPort: 7860
        resources:
          limits:
            memory: "4Gi"
            cpu: "2"
          requests:
            memory: "2Gi"
            cpu: "1"
        livenessProbe:
          httpGet:
            path: /health
            port: 7860
          initialDelaySeconds: 300
          periodSeconds: 30
        readinessProbe:
          httpGet:
            path: /health
            port: 7860
          initialDelaySeconds: 60
          periodSeconds: 10
---
apiVersion: v1
kind: Service
metadata:
  name: hallucination-guard-service
spec:
  selector:
    app: hallucination-guard
  ports:
  - port: 80
    targetPort: 7860
  type: LoadBalancer
```

### Environment Configuration

| Variable | Description | Default |
|----------|-------------|---------|
| `USE_LARGE_NLI` | Use large NLI model (more accurate, more memory) | `false` |
| `MAX_QUESTIONS` | Maximum questions per episode | `10` |
| `LOG_LEVEL` | Logging level | `INFO` |

---

## 🔗 Links

| | |
|---|---|
| 🤗 HuggingFace Space | https://huggingface.co/spaces/SamSankar/hallucination-guard-env |
| 📦 PyPI Package | https://pypi.org/project/openenv-halluguard/ |
| 📖 Interactive API Docs | https://samsankar-hallucination-guard-env.hf.space/docs |
| 🏆 Leaderboard | https://samsankar-hallucination-guard-env.hf.space/leaderboard |
| 🔧 OpenEnv Framework | https://github.com/meta-pytorch/OpenEnv |

---

## 🏆 Why This Environment Stands Out

| | |
|---|---|
| **Real-world origin** | Born from an actual AI hallucination experience during hackathon research |
| **Solves the #1 LLM problem** | Hallucination is the most critical reliability issue in production AI |
| **Novel** | First OpenEnv environment targeting hallucination and grounding |
| **Research-grade grader** | ROUGE + BERTScore + AlignScore + nli-deberta-v3-base — publication quality |
| **1M+ diverse examples** | 38 real-world datasets: SQuAD, HaluEval, TruthfulQA, HotpotQA, MedQA and more |
| **Model-agnostic** | Works with GPT-4, Claude, Llama, Mistral, Gemma, Phi, or any LLM |
| **PyPI package** | `pip install openenv-halluguard` for instant SDK access |
| **Production-ready** | Session management, leaderboard, persistent cache, Dockerfile |
| **Adaptive** | ELO-based curriculum scales difficulty with the agent's skill |

---

## Changelog

### v4.2.0 (2026-03)

- **Added** Batch evaluation endpoint (`POST /batch/evaluate`) — evaluate 100+ Q&A pairs in one request
- **Added** Streaming batch endpoint (`POST /batch/stream`) — NDJSON streaming for large batches
- **Added** Time-per-step metrics (`GET /metrics/timing`) — latency percentiles by difficulty
- **Added** Leaderboard visualization (`GET /leaderboard/viz`) — bar chart, scatter plot, performance tiers
- **Added** "I don't know" refusal handling — rewards proper refusals on unanswerable questions
- **Added** Hallucination explanations — human-readable explanations in grader output
- **Added** 18 adversarial test cases — HaluEval, TruthfulQA, edge cases
- **Added** 15 endpoint integration tests — batch, metrics, leaderboard
- **Added** GitHub Actions CI — automated testing on push/PR
- **Fixed** All test suite — 80 tests passing (was broken)

### v4.1.0 (2026-03)

- **Fixed** HF Spaces restart loop — optimized startup with lazy dataset loading
- **Fixed** Missing `_torch_available()` function in grader
- **Fixed** Reproducibility — seed now properly resets dataset sampling for consistent results
- **Reduced** core datasets from 15 to 5 for faster cold starts
- **Increased** healthcheck start-period to 300s for dataset downloads
- **Added** stderr logging for progress visibility in HF Space logs
- **Added** `GET /tasks` — lists all 3 tasks + action schema (OpenEnv required)
- **Added** `POST /grader` — per-episode task scoring 0.0–1.0 (OpenEnv required)
- **Added** `POST /baseline` — built-in heuristic baseline runner (OpenEnv required)
- **Added** `inference.py` — baseline inference script for hackathon submission
- **Added** `server/tasks.py` — task registry with difficulty-mapped graders
- **Updated** `openenv.yaml` to v4.1.0 with task declarations

### v4.0.0

- 9-component reward system (ROUGE + BERTScore + AlignScore)
- NLI upgraded to nli-deberta-v3-base (optimized for HF Spaces)
- 38 datasets, 1,090,163 examples

---

*Built to train models to stop hallucination · MIT License*