Metadata-Version: 2.4
Name: realbench
Version: 0.1.0
Summary: Real-world benchmark for Generative AI evaluation
Home-page: https://github.com/Ratnaditya-J/RealBench
Author: RealBench Team
Author-email: contact@realbench.org
Classifier: Development Status :: 3 - Alpha
Classifier: Intended Audience :: Developers
Classifier: Intended Audience :: Science/Research
Classifier: License :: OSI Approved :: MIT License
Classifier: Operating System :: OS Independent
Classifier: Programming Language :: Python :: 3
Classifier: Programming Language :: Python :: 3.8
Classifier: Programming Language :: Python :: 3.9
Classifier: Programming Language :: Python :: 3.10
Classifier: Programming Language :: Python :: 3.11
Classifier: Topic :: Scientific/Engineering :: Artificial Intelligence
Requires-Python: >=3.8
Description-Content-Type: text/markdown
License-File: LICENSE
Requires-Dist: numpy>=1.20.0
Requires-Dist: pandas>=1.3.0
Requires-Dist: scikit-learn>=1.0.0
Requires-Dist: tqdm>=4.62.0
Requires-Dist: click>=8.0.0
Requires-Dist: pydantic>=1.9.0
Requires-Dist: python-dotenv>=0.19.0
Requires-Dist: jsonschema>=4.0.0
Requires-Dist: tabulate>=0.8.9
Requires-Dist: matplotlib>=3.5.0
Requires-Dist: seaborn>=0.11.2
Requires-Dist: sentence-transformers>=2.2.0
Requires-Dist: nltk>=3.8
Requires-Dist: spacy>=3.4.0
Requires-Dist: transformers>=4.20.0
Requires-Dist: torch>=1.12.0
Provides-Extra: dev
Requires-Dist: pytest>=6.0; extra == "dev"
Requires-Dist: black>=22.0; extra == "dev"
Requires-Dist: flake8>=4.0; extra == "dev"
Requires-Dist: mypy>=0.950; extra == "dev"
Provides-Extra: api
Requires-Dist: fastapi>=0.68.0; extra == "api"
Requires-Dist: uvicorn>=0.15.0; extra == "api"
Requires-Dist: requests>=2.25.0; extra == "api"
Provides-Extra: web
Requires-Dist: streamlit>=1.10.0; extra == "web"
Requires-Dist: plotly>=5.0.0; extra == "web"
Dynamic: author
Dynamic: author-email
Dynamic: classifier
Dynamic: description
Dynamic: description-content-type
Dynamic: home-page
Dynamic: license-file
Dynamic: provides-extra
Dynamic: requires-dist
Dynamic: requires-python
Dynamic: summary

# RealBench: Real-world Benchmark for Generative AI

## 🎯 Mission
RealBench addresses a gap in AI evaluation by testing models on the practical, real-world tasks that people actually use AI for. It emphasizes consistency, practical utility, and robust handling of edge cases.

## 🔍 Problem Statement
Many current AI benchmarks do not capture real-world usage patterns. A model can solve Math Olympiad problems yet fail at basic high school math, or excel at specialized tasks yet struggle with everyday practical applications. RealBench aims to bridge this gap.

## 📊 Benchmark Categories

### 1. **RealBench-Professional**
*Workplace and business-oriented tasks*
- Email composition with context awareness
- Report analysis and summarization
- Meeting notes to action items
- Code review and documentation
- Project planning and estimation
- Customer support responses
- Technical troubleshooting

### 2. **RealBench-Daily**
*Everyday life and personal tasks*
- Recipe adaptation with dietary restrictions
- Travel planning with constraints
- Personal finance advice
- Home improvement guidance
- Health and wellness questions
- Shopping comparisons
- Schedule optimization

### 3. **RealBench-Creative**
*Content generation and artistic tasks*
- Story continuation with consistency
- Marketing copy variations
- Social media content adaptation
- Educational content creation
- Creative writing prompts
- Image description generation
- Brand voice matching

### 4. **RealBench-Technical**
*Engineering and scientific tasks*
- Debug code with incomplete context
- System design from requirements
- Data analysis interpretation
- Algorithm optimization
- Security vulnerability assessment
- Performance troubleshooting
- API documentation generation

### 5. **RealBench-Academic**
*Educational and research tasks*
- Homework help with learning focus
- Research paper summarization
- Concept explanation at different levels
- Study guide creation
- Citation formatting
- Literature review assistance
- Exam preparation strategies

### 6. **RealBench-Safety**
*Safety-critical and edge cases*
- Harmful request rejection
- Misinformation detection
- Bias recognition
- Privacy-preserving responses
- Emergency situation guidance
- Medical disclaimer awareness
- Legal limitation acknowledgment

## 🎪 Key Features

### Consistency Testing
- Same concept tested across multiple difficulty levels
- Cross-domain knowledge integration
- Multi-turn conversation coherence
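
One way to quantify stability across similar queries is to ask the same question in several phrasings and measure how much the responses agree. The sketch below is an illustration only: `consistency_score` is a hypothetical helper, not part of the `realbench` API, and it uses character-level similarity from the standard library where a real evaluator would more likely compare embeddings.

```python
from difflib import SequenceMatcher
from itertools import combinations
from typing import Sequence


def consistency_score(responses: Sequence[str]) -> float:
    """Mean pairwise similarity of responses, in [0, 1].

    Hypothetical helper (an assumption, not RealBench's metric).
    Similarity is character-level via difflib; swap in an
    embedding-based similarity for semantic comparison.
    """
    # Lowercase and collapse whitespace so trivial formatting
    # differences do not count as inconsistency.
    normalized = [" ".join(r.lower().split()) for r in responses]
    pairs = list(combinations(normalized, 2))
    if not pairs:
        # Zero or one response: there is nothing to disagree with.
        return 1.0
    total = sum(SequenceMatcher(None, a, b).ratio() for a, b in pairs)
    return total / len(pairs)


# Made-up responses to three paraphrases of the same question.
answers = ["The total is 42.", "the total is  42.", "I think it is about 40."]
print(round(consistency_score(answers), 3))
```

A score near 1.0 means the responses are nearly identical after normalization; lower values flag prompts where the model's answer depends on phrasing.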

### Practical Metrics
- Task completion rate
- Consistency score
- Uncertainty calibration
- Hallucination detection
- Response appropriateness
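
Two of these metrics have simple, widely used formulations. The sketch below computes a task completion rate and an expected calibration error (ECE) with equal-width confidence bins. Both function names and the input format (a confidence in [0, 1] plus a correctness flag per response) are assumptions made for illustration; the README does not specify how RealBench computes these values.

```python
from typing import Sequence


def task_completion_rate(completed: Sequence[bool]) -> float:
    """Fraction of tasks marked as completed (0.0 for an empty input)."""
    return sum(completed) / len(completed) if completed else 0.0


def expected_calibration_error(
    confidences: Sequence[float],
    correct: Sequence[bool],
    n_bins: int = 10,
) -> float:
    """ECE: weighted mean of |accuracy - mean confidence| over bins.

    Hypothetical helper (an assumption, not RealBench's implementation).
    """
    if len(confidences) != len(correct):
        raise ValueError("confidences and correct must have the same length")
    total = len(confidences)
    if total == 0:
        return 0.0

    bins = [[] for _ in range(n_bins)]
    for conf, ok in zip(confidences, correct):
        if not 0.0 <= conf <= 1.0:
            raise ValueError(f"confidence out of range: {conf}")
        # A confidence of exactly 1.0 goes into the last bin.
        index = min(int(conf * n_bins), n_bins - 1)
        bins[index].append((conf, ok))

    ece = 0.0
    for members in bins:
        if not members:
            continue
        mean_conf = sum(c for c, _ in members) / len(members)
        accuracy = sum(1 for _, ok in members if ok) / len(members)
        ece += (len(members) / total) * abs(accuracy - mean_conf)
    return ece


# Made-up example: four responses with stated confidence and correctness.
print(task_completion_rate([True, False, True, True]))
print(expected_calibration_error([0.9, 0.9, 0.6, 0.6], [True, True, True, False]))
```

An ECE of 0 means stated confidence matches observed accuracy in every bin; larger values indicate over- or under-confidence.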

### Real-world Alignment
- Based on actual user queries
- Includes ambiguous scenarios
- Tests for "I don't know" responses
- Measures practical helpfulness
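
Testing for "I don't know" responses requires a way to tell whether a model abstained. A minimal sketch, assuming a simple phrase match is acceptable: `is_abstention` and `abstention_accuracy` are hypothetical names, and the phrase list is illustrative rather than RealBench's own. A production check would need broader coverage or a classifier.

```python
import re
from typing import Iterable, Tuple

# Illustrative phrase list (an assumption, not taken from RealBench).
_ABSTAIN = re.compile(
    r"\b(i don'?t know|i do not know|i'?m not sure|i am not sure"
    r"|cannot determine|not enough information)\b",
    re.IGNORECASE,
)


def is_abstention(response: str) -> bool:
    """Return True if the response contains an abstention phrase."""
    # Normalize typographic apostrophes so "don’t" matches "don't".
    normalized = response.replace("\u2019", "'")
    return _ABSTAIN.search(normalized) is not None


def abstention_accuracy(cases: Iterable[Tuple[str, bool]]) -> float:
    """Share of cases where abstaining matched what was expected.

    Each case is (response, should_abstain): unanswerable or ambiguous
    prompts expect True, answerable prompts expect False.
    """
    case_list = list(cases)
    if not case_list:
        return 0.0
    hits = sum(is_abstention(resp) == should for resp, should in case_list)
    return hits / len(case_list)


# Made-up cases: the third one answers confidently where it should abstain.
cases = [
    ("I don't know; the question lacks a date.", True),
    ("The capital of France is Paris.", False),
    ("The answer is definitely 7.", True),
]
print(abstention_accuracy(cases))
```

Scoring both directions matters: a model that always abstains would pass the unanswerable cases but fail every answerable one.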

## 📁 Project Structure
```
RealBench/
├── README.md
├── categories/
│   ├── professional/
│   ├── daily/
│   ├── creative/
│   ├── technical/
│   ├── academic/
│   └── safety/
├── src/
│   ├── __init__.py
│   ├── benchmark_runner.py
│   ├── evaluators/
│   ├── generators/
│   └── metrics/
├── data/
│   ├── tasks/
│   ├── prompts/
│   └── responses/
├── tests/
├── scripts/
└── results/
```

## 🚀 Getting Started

### Installation
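```bash
pip install realbench
```

RealBench requires Python 3.8 or later. The package metadata also declares the optional extras `dev`, `api`, and `web`, which can be installed with, for example, `pip install "realbench[api]"`.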

### Quick Start
```python
from realbench import RealBenchmark

# Initialize benchmark
benchmark = RealBenchmark()

# Run specific category
results = benchmark.run(
    model="gpt-4",
    categories=["professional", "daily"]
)

# View detailed metrics
benchmark.analyze(results)
```

## 📈 Evaluation Metrics

1. **Accuracy**: Correctness of responses
2. **Consistency**: Stability across similar queries
3. **Completeness**: Task completion rate
4. **Appropriateness**: Context-aware responses
5. **Safety**: Harmful content avoidance
6. **Calibration**: Uncertainty expression
7. **Efficiency**: Token usage optimization
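
The seven metrics can be held in a single record per model run. The sketch below is hypothetical: `MetricScores`, the assumption that each score is normalized to [0, 1], and the weighted mean are illustrative choices, since the README does not define how (or whether) RealBench combines metrics into one number.

```python
from dataclasses import dataclass, fields
from typing import Dict, Optional


@dataclass(frozen=True)
class MetricScores:
    """Hypothetical container for the seven metrics, each in [0, 1]."""

    accuracy: float
    consistency: float
    completeness: float
    appropriateness: float
    safety: float
    calibration: float
    efficiency: float

    def __post_init__(self) -> None:
        # Reject values outside the assumed [0, 1] range early.
        for f in fields(self):
            value = getattr(self, f.name)
            if not 0.0 <= value <= 1.0:
                raise ValueError(f"{f.name} must be in [0, 1], got {value}")

    def weighted_mean(self, weights: Optional[Dict[str, float]] = None) -> float:
        """Weighted mean of all metrics; equal weights by default.

        Metrics missing from `weights` get weight 0.
        """
        names = [f.name for f in fields(self)]
        if weights is None:
            weights = {name: 1.0 for name in names}
        total = sum(weights.get(name, 0.0) for name in names)
        if total <= 0:
            raise ValueError("weights must sum to a positive value")
        weighted = sum(getattr(self, n) * weights.get(n, 0.0) for n in names)
        return weighted / total


# Made-up scores for illustration only.
scores = MetricScores(0.8, 0.7, 0.9, 0.85, 1.0, 0.6, 0.5)
print(round(scores.weighted_mean(), 3))
print(round(scores.weighted_mean({"safety": 2.0, "accuracy": 1.0}), 3))
```

Keeping the per-metric values alongside any aggregate avoids hiding a weak safety or calibration score behind a strong average.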

## 🤝 Contributing
We welcome contributions! See CONTRIBUTING.md for guidelines.

## 📄 License
MIT License. See the LICENSE file for details.

## 🌟 Citation
```bibtex
@misc{realbench2024,
  title={RealBench: A Practical Real-world Benchmark for Generative AI},
  author={RealBench Team},
  year={2024},
  url={https://github.com/Ratnaditya-J/RealBench}
}
```
