Metadata-Version: 2.4
Name: inceptbench
Version: 2.0.0
Summary: Comprehensive benchmark and evaluation framework for educational AI question generation
License: Proprietary - Copyright Trilogy Education Services
Keywords: education,evaluation,ai,questions,assessment,benchmark,edubench,scaffolding
Author: Trilogy Team
Author-email: stanislav.huseletov@trilogy.com
Requires-Python: >=3.11,<3.14
Classifier: Development Status :: 4 - Beta
Classifier: Intended Audience :: Developers
Classifier: Intended Audience :: Education
Classifier: License :: Other/Proprietary License
Classifier: Programming Language :: Python :: 3
Classifier: Programming Language :: Python :: 3.11
Classifier: Programming Language :: Python :: 3.12
Classifier: Programming Language :: Python :: 3.13
Classifier: Topic :: Education
Requires-Dist: anthropic (>=0.70.0)
Requires-Dist: click (>=8.1.7,<9.0.0)
Requires-Dist: langchain-openai (>=0.1.0)
Requires-Dist: openai (>=1.100.2)
Requires-Dist: pandas (>=2.0.0)
Requires-Dist: pydantic (>=2.0.0)
Requires-Dist: python-dotenv (>=1.0.0)
Requires-Dist: requests (>=2.31.0,<3.0.0)
Requires-Dist: supabase (>=2.0.0)
Requires-Dist: tiktoken (>=0.11.0)
Requires-Dist: torch (>=2.8.0)
Requires-Dist: tqdm (>=4.66.0)
Requires-Dist: transformers (>=4.55.3)
Project-URL: Homepage, https://github.com/incept-ai/inceptbench
Project-URL: Repository, https://github.com/incept-ai/inceptbench
Description-Content-Type: text/markdown

# InceptBench

[![PyPI version](https://badge.fury.io/py/inceptbench.svg)](https://badge.fury.io/py/inceptbench)
[![Python Version](https://img.shields.io/pypi/pyversions/inceptbench.svg)](https://pypi.org/project/inceptbench/)
[![License: Proprietary](https://img.shields.io/badge/License-Proprietary-red.svg)](LICENSE)

Educational content evaluation framework with multiple AI-powered assessment modules.

> ### ⚠️ DEPRECATION NOTICE - Action Required
> 
> **The legacy evaluator (v1.5.x) is deprecated and will be removed on December 6, 2025.**
> 
> After this date, the `--new` flag behavior will become the default.
> 
> **Migration:** Update your integrations now by adding `--new` to your commands:
> ```bash
> # Before (legacy - deprecated)
> inceptbench evaluate qs.json
> 
> # After (v2.0 - recommended)
> inceptbench evaluate qs.json --new
> ```
> 
> See [Migration Guide](./docs/USAGE.md#migration-guide) for details.

## 📖 Documentation

### Official Sites
[Website](https://bench.inceptapi.com/) • [Benchmarks](https://bench.inceptapi.com/benchmarks/) • [Glossary](https://bench.inceptapi.com/glossary/) • [Docs](https://bench.inceptapi.com/inceptbench-docs/) • [API Endpoint](https://uae-poc.inceptapi.com/evaluate) • [API Docs](https://uae-poc.inceptapi.com/docs)

### User Guides
- **[USAGE.md](./docs/USAGE.md)** - Installation, configuration, CLI & Python API
- **[INPUT_OUTPUT.md](./docs/INPUT_OUTPUT.md)** - Input schemas and output formats
- **[EVALUATORS.md](./docs/EVALUATORS.md)** - Complete evaluator reference

### Developer Guides
- **[WIKI.md](./docs/WIKI.md)** - Documentation hub and workflows
- **[MAINTAINERS.md](./docs/MAINTAINERS.md)** - Submodule maintainer guide
- **[PUBLISHING.md](./docs/PUBLISHING.md)** - Package publishing workflow
- **[VERSION_LOCATIONS.md](./docs/VERSION_LOCATIONS.md)** - Version file reference

### Resources
- **[Google Drive](https://drive.google.com/drive/folders/1dFdMj70HgYZCtrMG3W1_3lVyi8Kmyz_V)** - Test data and examples
- **[GitHub Repo](https://github.com/trilogy-group/inceptbench)** - Source code

## 🚀 Quick Start

```bash
# Install from PyPI (latest published release)
pip install inceptbench

# Or install from source (current repo snapshot)
git clone https://github.com/incept-ai/inceptbench.git
cd inceptbench
python3 -m venv venv && source venv/bin/activate
pip install -e .

# Create .env file (required for evaluation)
echo "OPENAI_API_KEY=your_key" >> .env
echo "ANTHROPIC_API_KEY=your_key" >> .env

# Generate example
inceptbench example

# Run evaluation - Legacy system (v1.5.5)
inceptbench evaluate qs.json --full

# Run evaluation - NEW system (v2.0.0) - RECOMMENDED
inceptbench evaluate qs.json --new

# Advanced mode - Evaluate raw files directly
inceptbench evaluate article.md --new --advanced

# Or call the CLI module directly (no install needed)
PYTHONPATH="$(pwd)/src:$PYTHONPATH" python -m inceptbench evaluate qs.json --new
```

## 🆕 Two Evaluation Systems

InceptBench offers two evaluation systems:

### Legacy System (v1.5.5) - **DEPRECATED**
⚠️ The legacy evaluator will be removed on December 6, 2025 (see the deprecation notice at the top of this README).

```bash
# Legacy evaluation (default, no flags)
inceptbench evaluate qs.json
```

### New System (v2.0.0) - **RECOMMENDED**
🚀 Enhanced hierarchical evaluation with detailed reasoning.

```bash
# Standard mode: Structured JSON input
inceptbench evaluate qs.json --new

# Advanced mode: Raw file/folder input
inceptbench evaluate article.md --new --advanced
inceptbench evaluate ./lessons/ --new --advanced
```

**Benefits of v2.0.0:**
- ✅ **Hierarchical Evaluation** - Questions, quizzes, articles with nested content
- ✅ **Detailed Reasoning** - See why content received each score
- ✅ **Actionable Suggestions** - Specific improvements for each metric
- ✅ **Better Error Handling** - Individual failures don't crash entire batch
- ✅ **Advanced Mode** - Evaluate raw files without JSON structuring

**Migration Guide:**
Simply add `--new` to your existing commands:
```bash
# Before (legacy)
inceptbench evaluate qs.json

# After (new system)
inceptbench evaluate qs.json --new
```

## ✨ Features

- **6 Specialized Evaluators** - Quality assessment across multiple dimensions, plus the opt-in `external_edubench` benchmark
- **Automatic Image Evaluation** - Context-aware DI rubric scoring
- **Parallel Processing** - 47+ tasks running concurrently
- **Multi-language Support** - Evaluate content in any language
- **Hierarchical Content** - Evaluate nested structures (articles with quizzes/questions)
- **Raw File Support** - Advanced mode for direct file/folder evaluation
- **Production-Ready** - Full demo in `qs.json` (~3-4 minutes)

## 📊 Evaluators

| Evaluator | What it evaluates | Auto |
|-----------|-------------------|------|
| ti_question_qa | Question quality (10 dimensions) | Yes |
| answer_verification | Answer correctness | Yes |
| reading_question_qc | MCQ distractor analysis | Yes |
| math_content_evaluator | Content quality (9 criteria) | Yes |
| text_content_evaluator | Pedagogical text assessment | Yes |
| image_quality_di_evaluator | DI rubric image quality | Yes (added when `image_url` is present) |
| external_edubench | Educational benchmark (6 tasks) | No |

See [EVALUATORS.md](./docs/EVALUATORS.md) for details.

## 📦 Architecture

```
inceptbench/
├── src/inceptbench/          # Unified package (src/ layout)
│   ├── orchestrator.py        # Main evaluation orchestrator
│   ├── cli.py                 # Command-line interface
│   ├── core/                  # Core evaluators and utilities
│   ├── agents/                # Agent-based evaluators
│   ├── qc/                    # Quality control modules
│   ├── evaluation/            # Evaluation templates
│   └── image/                 # Image quality evaluation
├── submodules/                # External dependencies
│   ├── reading-question-qc/
│   ├── EduBench/
│   ├── agentic-incept-reasoning/
│   └── image_generation_package/
└── pyproject.toml             # Package configuration
```

## 🎯 Demo

The `qs.json` file demonstrates all capabilities:
- 8 questions (MCQ/fill-in, Arabic/English)
- 4 text content items
- 7 images (auto-evaluated)
- All 6 evaluators active
- ~3-4 minute runtime

## ✅ Local Smoke Test

Use the bundled demo file to validate your environment before making changes:

```bash
# Using new evaluator (v2.0.0) - RECOMMENDED
inceptbench evaluate qs.json --new

# Using legacy evaluator (v1.5.5)
inceptbench evaluate qs.json --full

# Or run locally without installing the package
PYTHONPATH="$(pwd)/src:$PYTHONPATH" python -m inceptbench evaluate qs.json --new

# Or using Python API (legacy)
python -c "from inceptbench import universal_unified_benchmark, UniversalEvaluationRequest; import json; data = json.load(open('qs.json')); request = UniversalEvaluationRequest(**data); result = universal_unified_benchmark(request); print(result.model_dump_json(indent=2))"
```

These commands exercise the evaluator and report per-item scores plus the `inceptbench_version` (1.5.5 for legacy, 2.0.0 for new). The sample data leaves some `image_url` fields set to `null`, so the DI image checker logs `FileNotFoundError: 'null'` entries. These are expected for the placeholders and can be ignored during the smoke test.

## 🌐 Locale-Aware Localization

`UniversalEvaluationRequest` now accepts a `locale` such as `ar-AE`, `en-AE`, or `en-IN`. The format is:

- **First segment** (`ar`, `en`, etc.): language of the text
- **Second segment** (`AE`, `IN`, etc.): cultural/regional guardrails to apply

When `locale` is provided, all localization checks use the corresponding language + cultural context. If it is omitted, we fall back to the legacy `language` field and heuristics (auto-detecting non-ASCII text when necessary).
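
The two-segment format can be split in a few lines of Python. This is a minimal sketch, not InceptBench's own parser: `parse_locale` and the `Locale` tuple are hypothetical names, and returning `None` for a missing region is an assumption.

```python
from typing import NamedTuple, Optional


class Locale(NamedTuple):
    language: str            # first segment, e.g. "ar" or "en"
    region: Optional[str]    # second segment, e.g. "AE" or "IN"


def parse_locale(locale: str) -> Locale:
    """Split a locale tag such as "ar-AE" into language and region.

    Hypothetical helper for illustration only; the real parsing logic
    is not shown in this README.
    """
    language, _, region = locale.strip().partition("-")
    if not language:
        raise ValueError(f"empty language segment in locale {locale!r}")
    # An absent region becomes None rather than an empty string.
    return Locale(language.lower(), region.upper() or None)


print(parse_locale("ar-AE"))
print(parse_locale("en-IN"))
```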

Localization now runs for **every** item (including English) so cultural guardrails are always enforced; locale/language metadata simply control which prompts fire. Localized prompts run through a dedicated `localization_evaluator`, making cultural QA a first-class signal rather than a side-effect of other evaluators. Technical checks (schema fidelity, grammar, etc.) live in other modules—this evaluator focuses only on cultural neutrality and regional appropriateness.

**Rule-based regionalization checks (ITD guidance):**
- Familiarity & relevance: keep contexts understandable for the target region/grade (no “filing taxes” for Grade 3, no hyper-local fruit for remote regions).
- Regional reference limit: at most one explicit local prop—multiple props often create caricatures.
- Instruction-aligned language: only switch languages when the student’s classroom instruction uses that language (respect bilingual/international settings).
- Respectful tone & content: references must not mock, stereotype, or oversimplify cultures; neutral fallbacks beat risky flair.
- Rule-first transparency: every failure cites the violated rule, favoring deterministic guardrails over fuzzy similarity scores.

All localization guardrails live in `src/inceptbench/agents/localization_guidelines.json`, so future tweaks are data-only: add new cultural rules or prompts in JSON and the evaluator picks them up without code changes.

Each rule is scored via its own compact prompt that returns `0` (fail) or `1` (pass); section and overall scores are simply the percentage of guardrail rules satisfied, so localization quality is now a transparent, deterministic checklist.
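
The checklist arithmetic described above can be sketched as follows. The function name and the rule keys are hypothetical, and returning a 0-1 fraction (rather than a 0-100 percentage) is an assumption made to match the other scores in this README.

```python
from typing import Dict


def localization_score(rule_results: Dict[str, int]) -> float:
    """Return the fraction of guardrail rules that passed (each 0 or 1)."""
    if not rule_results:
        raise ValueError("no rule results to score")
    if any(result not in (0, 1) for result in rule_results.values()):
        raise ValueError("each rule result must be 0 (fail) or 1 (pass)")
    return sum(rule_results.values()) / len(rule_results)


# Hypothetical rule names, loosely based on the checks listed above.
results = {
    "familiarity_relevance": 1,
    "regional_reference_limit": 0,
    "instruction_aligned_language": 1,
    "respectful_tone": 1,
}
print(localization_score(results))  # 0.75
```

Because each rule contributes exactly 0 or 1, the same inputs always yield the same score, which is what makes the checklist deterministic once the per-rule verdicts are known.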

## 📝 Example Usage

### CLI - Standard Mode
```bash
# New evaluator (v2.0.0) - RECOMMENDED
inceptbench evaluate qs.json --new
inceptbench evaluate qs.json --new -o results.json
inceptbench evaluate qs.json --new --max-threads 20

# Legacy evaluator (v1.5.5) - DEPRECATED
inceptbench evaluate qs.json --full
inceptbench evaluate qs.json -o results.json
```

### CLI - Advanced Mode (Raw Files)
```bash
# Evaluate a single file
inceptbench evaluate article.md --new --advanced
inceptbench evaluate lesson.txt --new --advanced -o result.json

# Evaluate all files in a folder
inceptbench evaluate ./lessons/ --new --advanced
inceptbench evaluate ./content/ --new --advanced --max-threads 5 -o batch.json
```

**Advanced Mode Features:**
- No JSON structuring required - just pass raw text files
- Supports markdown, text, HTML, or any text-based format
- Automatic content type detection
- Batch processing for folders
- Output keyed by filename

**Example Output (Advanced Mode):**
```json
{
  "request_id": "abc123...",
  "evaluations": {
    "article.md": {
      "inceptbench_new_evaluation": {
        "content_type": "article",
        "overall": {
          "score": 0.85,
          "reasoning": "Well-structured with clear explanations...",
          "suggested_improvements": "Add more practice problems..."
        },
        "factual_accuracy": { ... },
        // ... all metrics
      },
      "score": 0.85
    }
  },
  "evaluation_time_seconds": 45.3,
  "inceptbench_version": "2.0.0"
}
```
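
Since advanced-mode output is keyed by filename, per-file scores can be pulled out with a short loop. The sketch below works on a trimmed, hand-written sample (the `lesson.txt` entry and its score are made up); `scores_by_file` is a hypothetical helper, not part of the package.

```python
import json

# Trimmed-down sample shaped like the advanced-mode output above.
sample = json.loads("""
{
  "request_id": "abc123",
  "evaluations": {
    "article.md": {"score": 0.85},
    "lesson.txt": {"score": 0.72}
  },
  "inceptbench_version": "2.0.0"
}
""")


def scores_by_file(result: dict) -> dict:
    """Map each evaluated filename to its top-level score."""
    return {name: entry["score"] for name, entry in result["evaluations"].items()}


for name, score in sorted(scores_by_file(sample).items()):
    print(f"{name}: {score:.2f}")
```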

### Python API
```python
from inceptbench import universal_unified_benchmark, UniversalEvaluationRequest

request = UniversalEvaluationRequest(
    submodules_to_run=["ti_question_qa", "answer_verification"],
    generated_questions=[{
        "id": "q1",
        "type": "mcq",
        "question": "What is 2+2?",
        "answer": "4",
        "answer_options": {"A": "3", "B": "4", "C": "5"},
        "answer_explanation": "2+2 equals 4",
        "skill": {
            "title": "Basic Addition",
            "grade": "1",
            "subject": "mathematics",
            "difficulty": "easy"
        }
    }]
)

response = universal_unified_benchmark(request)
print(response.evaluations["q1"].score)
```

See [USAGE.md](./docs/USAGE.md) for complete examples.

## 🖼️ Image Evaluation

Add `image_url` to any question or content:
```json
{
  "id": "q1",
  "question": "How many apples?",
  "image_url": "https://example.com/apples.png"
}
```

The `image_quality_di_evaluator` runs automatically with:
- Context-aware evaluation (accompaniment vs standalone)
- DI rubric scoring (0-100, normalized to 0-1)
- Hard-fail gates (answer leakage, wrong representations)
- Canonical DI representation checks
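
As a rough illustration of the 0-100 to 0-1 normalization, consider the sketch below. `normalize_di_score` is a hypothetical function, and treating a tripped hard-fail gate as a score of 0.0 is an assumption; this README does not say how gates affect the final number.

```python
def normalize_di_score(raw_score: float, hard_fail: bool = False) -> float:
    """Map a 0-100 rubric score onto 0-1.

    Assumption: a tripped hard-fail gate (e.g. answer leakage) forces
    the score to 0.0. The actual evaluator may behave differently.
    """
    if not 0 <= raw_score <= 100:
        raise ValueError("raw_score must be between 0 and 100")
    if hard_fail:
        return 0.0
    return raw_score / 100


print(normalize_di_score(85))                  # 0.85
print(normalize_di_score(85, hard_fail=True))  # 0.0
```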

## 📥 Input Format

**Questions**:
```json
{
  "submodules_to_run": ["ti_question_qa"],
  "generated_questions": [{
    "id": "q1",
    "type": "mcq",
    "question": "...",
    "answer": "...",
    "image_url": "..."  // Optional
  }]
}
```

**Text Content**:
```json
{
  "submodules_to_run": ["text_content_evaluator"],
  "generated_content": [{
    "id": "text1",
    "type": "text",
    "content": "...",
    "image_url": "..."  // Optional
  }]
}
```

See [INPUT_OUTPUT.md](./docs/INPUT_OUTPUT.md) for complete schema.

## 📤 Output Format

### Legacy System (v1.5.5)

**Simplified** (default):
```json
{
  "evaluations": {
    "q1": {"score": 0.89}
  },
  "inceptbench_version": "1.5.5"
}
```

**Full** (`verbose=True`; `--full` on the CLI):
```json
{
  "evaluations": {
    "q1": {
      "ti_question_qa": {
        "overall": 0.95,
        "scores": {...},
        "issues": [...],
        "strengths": [...]
      },
      "score": 0.89
    }
  },
  "inceptbench_version": "1.5.5"
}
```

### New System (v2.0.0)

**Response Structure:**

Every evaluation in v2.0.0 follows this consistent structure:

1. **Universal Metrics** (all content types):
   - `overall` - Holistic quality assessment
   - `factual_accuracy` - Correctness of all facts and information
   - `educational_accuracy` - Alignment with learning objectives

2. **Content-Specific Metrics** (varies by type):
   - **Questions**: `clarity_precision`, `difficulty_appropriateness`, `distractor_quality`, `answer_explanation_quality`, `curriculum_alignment`, `stimulus_quality`, `mastery_learning_alignment`
   - **Quizzes**: Same as questions (evaluated as a collection)
   - **Articles**: `curriculum_alignment`, `teaching_quality`, `worked_examples`, `practice_problems`, `follows_direct_instruction`, `stimulus_quality`, `diction_and_sentence_structure`
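   - **Readings**: `clarity_precision`, `difficulty_appropriateness`, `engagement_quality`, `comprehension_support`, `stimulus_quality`, `diction_and_sentence_structure`, `length_appropriateness`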

3. **Hierarchical Evaluation**:
   - `subcontent_evaluations` - Array of evaluations for nested content (e.g., questions within quizzes/articles)
   - `null` if no nested content exists

4. **Metric Format** (all metrics follow this pattern):
   ```json
   {
     "score": 0.85,  // 0.0 to 1.0 (binary metrics: 0.0 or 1.0)
     "reasoning": "Clear explanation of why this score was given...",
     "suggested_improvements": "Specific actionable suggestions..."  // null if score is 1.0
   }
   ```
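
A consumer of the output may want to sanity-check each metric against this pattern. The following is a minimal sketch under stated assumptions: `Metric` and `check_metric` are hypothetical names, and rejecting an empty `reasoning` string is an assumption drawn from the statement that every score includes reasoning.

```python
from typing import Optional, TypedDict


class Metric(TypedDict):
    score: float
    reasoning: str
    suggested_improvements: Optional[str]


def check_metric(metric: Metric) -> None:
    """Raise ValueError if `metric` breaks the documented format."""
    score = metric["score"]
    if not 0.0 <= score <= 1.0:
        raise ValueError(f"score out of range: {score}")
    if not metric["reasoning"]:
        raise ValueError("reasoning must not be empty")
    # Suggestions are documented as null when the score is perfect.
    if score == 1.0 and metric["suggested_improvements"] is not None:
        raise ValueError("suggested_improvements should be null at score 1.0")


check_metric({
    "score": 0.85,
    "reasoning": "Clear explanation of why this score was given...",
    "suggested_improvements": "Specific actionable suggestions...",
})
```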

**Standard Mode Example:**
```json
{
  "evaluations": {
    "q1": {
      "inceptbench_new_evaluation": {
        "content_type": "question",
        "overall": {
          "score": 0.92,
          "reasoning": "High-quality MCQ with clear stem...",
          "suggested_improvements": "Consider adding..."
        },
        "factual_accuracy": {
          "score": 1.0,
          "reasoning": "All facts are correct...",
          "suggested_improvements": null
        },
        "educational_accuracy": {
          "score": 1.0,
          "reasoning": "Aligns perfectly with grade-level objectives...",
          "suggested_improvements": null
        },
        "clarity_precision": {
          "score": 0.9,
          "reasoning": "Question is clear but could be more concise...",
          "suggested_improvements": "Remove redundant phrase in stem..."
        },
        // ... 6 more content-specific metrics
        "subcontent_evaluations": null
      },
      "score": 0.92
    }
  },
  "inceptbench_version": "2.0.0"
}
```

**Advanced Mode Example (Hierarchical Content):**
```json
{
  "request_id": "def456...",
  "evaluations": {
    "article.md": {
      "inceptbench_new_evaluation": {
        "content_type": "article",
        "overall": {
          "score": 0.85,
          "reasoning": "Well-structured article with good pedagogical flow...",
          "suggested_improvements": "Add more worked examples before practice problems..."
        },
        "factual_accuracy": {
          "score": 1.0,
          "reasoning": "All mathematical concepts are accurate...",
          "suggested_improvements": null
        },
        "educational_accuracy": {
          "score": 0.9,
          "reasoning": "Aligns well with grade 6 standards...",
          "suggested_improvements": "Add explicit connection to 6.RP.A.2..."
        },
        "curriculum_alignment": { "score": 1.0, ... },
        "teaching_quality": { "score": 0.8, ... },
        "worked_examples": { "score": 0.7, ... },
        // ... 4 more article-specific metrics
        "subcontent_evaluations": [
          {
            "content_type": "question",
            "overall": {
              "score": 0.88,
              "reasoning": "Strong practice question...",
              "suggested_improvements": "Add one more distractor..."
            },
            "factual_accuracy": { "score": 1.0, ... },
            "educational_accuracy": { "score": 1.0, ... },
            // ... 7 more question-specific metrics
            "subcontent_evaluations": null
          },
          // ... more embedded questions
        ]
      },
      "score": 0.85
    }
  },
  "evaluation_time_seconds": 67.8,
  "inceptbench_version": "2.0.0"
}
```

**Key Points:**
- **Consistency**: All content types use the same metric structure (score, reasoning, suggestions)
- **Transparency**: Every score includes detailed reasoning
- **Actionable**: `suggested_improvements` is populated only when score < 1.0; otherwise it is `null`
- **Hierarchical**: Nested content (questions in quizzes/articles) fully evaluated
- **Comprehensive**: 10 metrics per content type (3 universal + 7 content-specific)

**Metrics by Content Type:**

| Content Type | Universal Metrics | Content-Specific Metrics (7) |
|--------------|-------------------|------------------------------|
| **Question** | overall, factual_accuracy, educational_accuracy | clarity_precision, difficulty_appropriateness, distractor_quality, answer_explanation_quality, curriculum_alignment, stimulus_quality, mastery_learning_alignment |
| **Quiz** | overall, factual_accuracy, educational_accuracy | Same as Question (evaluated as collection) |
| **Article** | overall, factual_accuracy, educational_accuracy | curriculum_alignment, teaching_quality, worked_examples, practice_problems, follows_direct_instruction, stimulus_quality, diction_and_sentence_structure |
| **Reading** (Fiction/Nonfiction) | overall, factual_accuracy, educational_accuracy | clarity_precision, difficulty_appropriateness, engagement_quality, comprehension_support, stimulus_quality, diction_and_sentence_structure, length_appropriateness |

**Score Types:**
- **Binary** (0.0 or 1.0): curriculum_alignment, follows_direct_instruction, and others where pass/fail is appropriate
- **Continuous** (0.0-1.0): Most metrics that assess quality on a spectrum

**Hierarchical Evaluation Structure:**
```
Article (with embedded quiz)
├── overall: {score, reasoning, suggestions}
├── factual_accuracy: {score, reasoning, suggestions}
├── educational_accuracy: {score, reasoning, suggestions}
├── [7 article-specific metrics]
└── subcontent_evaluations:
    └── Quiz
        ├── overall: {score, reasoning, suggestions}
        ├── factual_accuracy: {score, reasoning, suggestions}
        ├── educational_accuracy: {score, reasoning, suggestions}
        ├── [7 quiz-specific metrics]
        └── subcontent_evaluations:
            ├── Question 1
            │   ├── overall: {score, reasoning, suggestions}
            │   ├── factual_accuracy: {score, reasoning, suggestions}
            │   ├── educational_accuracy: {score, reasoning, suggestions}
            │   ├── [7 question-specific metrics]
            │   └── subcontent_evaluations: null
            └── Question 2
                ├── [same structure]
                └── subcontent_evaluations: null
```
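
The nested structure above lends itself to a recursive walk. The sketch below is illustrative only: `walk_evaluations` is a hypothetical helper, the path format is invented for readability, and the sample scores are made up.

```python
from typing import Iterator, Tuple


def walk_evaluations(node: dict, path: str = "root") -> Iterator[Tuple[str, float]]:
    """Yield (path, overall score) for a node and all nested subcontent."""
    yield path, node["overall"]["score"]
    # `subcontent_evaluations` is null (None) when there is no nested content.
    for index, child in enumerate(node.get("subcontent_evaluations") or []):
        yield from walk_evaluations(child, f"{path}/{child['content_type']}[{index}]")


# Hand-written sample with made-up scores, shaped like the tree above.
article = {
    "content_type": "article",
    "overall": {"score": 0.85},
    "subcontent_evaluations": [
        {
            "content_type": "quiz",
            "overall": {"score": 0.9},
            "subcontent_evaluations": [
                {
                    "content_type": "question",
                    "overall": {"score": 0.88},
                    "subcontent_evaluations": None,
                },
            ],
        }
    ],
}

for path, score in walk_evaluations(article, "article"):
    print(path, score)
```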

## 🔄 Module Selection

**Automatic** (if `submodules_to_run` not specified):
- Questions → `ti_question_qa`, `answer_verification`, `math_content_evaluator`, `reading_question_qc`
- Text → `text_content_evaluator`, `math_content_evaluator`
- Images → `image_quality_di_evaluator` (auto-added)
- Localization → `localization_evaluator` (auto for all languages; uses locale/language metadata to pick prompts)
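
The automatic selection rules can be restated as a small lookup. This is a sketch of the documented mapping, not the orchestrator's actual code: `default_modules` and its boolean parameters are hypothetical, and de-duplicating `math_content_evaluator` when both questions and text are present is an assumption.

```python
from typing import List

QUESTION_MODULES = [
    "ti_question_qa",
    "answer_verification",
    "math_content_evaluator",
    "reading_question_qc",
]
TEXT_MODULES = ["text_content_evaluator", "math_content_evaluator"]


def default_modules(has_questions: bool, has_text: bool, has_images: bool) -> List[str]:
    """Return the evaluators implied by the kinds of content in a request."""
    selected: List[str] = []
    if has_questions:
        selected += QUESTION_MODULES
    if has_text:
        selected += TEXT_MODULES
    if has_images:
        selected.append("image_quality_di_evaluator")
    selected.append("localization_evaluator")  # runs for every item
    # Drop duplicates while keeping first-seen order.
    return list(dict.fromkeys(selected))


print(default_modules(has_questions=True, has_text=True, has_images=False))
```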

**Manual**:
```python
request = UniversalEvaluationRequest(
    submodules_to_run=["ti_question_qa", "answer_verification"],  # Only these
    generated_questions=[...]
)
```

## 🎛️ CLI Flags Reference

### Core Flags
- `--new` - Use new evaluator (v2.0.0) instead of legacy (v1.5.5)
- `--advanced` - Advanced mode for raw file/folder input (requires `--new`)
- `--max-threads N` - Maximum parallel evaluation threads (default: 10)
- `-o, --output FILE` - Save results to file
- `-v, --verbose` - Show progress messages
- `--full` - Return full detailed results (legacy system only)

### Legacy System Only
- `--subject TEXT` - Subject area for routing (math, ela, science, etc.)
- `--grade TEXT` - Grade level (K, 3, 6-8, 9-12, etc.)
- `--type TEXT` - Content type (mcq, fill-in, passage, article, etc.)

### Examples
```bash
# New evaluator - standard mode
inceptbench evaluate qs.json --new

# New evaluator - advanced mode (raw file)
inceptbench evaluate article.md --new --advanced

# New evaluator - batch processing
inceptbench evaluate ./lessons/ --new --advanced --max-threads 20

# Legacy evaluator
inceptbench evaluate qs.json --subject math --grade 6
```

### Version Detection
The `inceptbench_version` field in the output indicates which system was used:
- `"1.5.5"` - Legacy evaluator
- `"2.0.0"` - New evaluator
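
Downstream code that handles both output shapes can branch on this field. A minimal sketch follows; `evaluator_system` is a hypothetical helper, and matching on the major version rather than the exact strings above is an assumption.

```python
def evaluator_system(result: dict) -> str:
    """Return "legacy" or "new" based on `inceptbench_version`."""
    version = result.get("inceptbench_version", "")
    major = version.split(".", 1)[0]
    if major == "1":
        return "legacy"
    if major == "2":
        return "new"
    raise ValueError(f"unrecognized inceptbench_version: {version!r}")


print(evaluator_system({"inceptbench_version": "1.5.5"}))  # legacy
print(evaluator_system({"inceptbench_version": "2.0.0"}))  # new
```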

## 📚 Additional Documentation

- **[ADVANCED_MODE.md](./ADVANCED_MODE.md)** - Complete guide to raw file evaluation
- **[INTEGRATION_SUMMARY.md](./INTEGRATION_SUMMARY.md)** - Technical integration details
- **[NEW_EVALUATOR_OUTPUT_FORMAT.md](./NEW_EVALUATOR_OUTPUT_FORMAT.md)** - Output format reference

## 📜 License

Proprietary - Copyright Trilogy Education Services

