Metadata-Version: 2.1
Name: dbsp
Version: 0.1.0
Summary: Distributed LLM Evaluation Benchmark Server
Author: TJUNLP
License: MIT
Classifier: Development Status :: 3 - Alpha
Classifier: Intended Audience :: Developers
Classifier: Intended Audience :: Science/Research
Classifier: License :: OSI Approved :: MIT License
Classifier: Operating System :: OS Independent
Classifier: Programming Language :: Python :: 3
Classifier: Programming Language :: Python :: 3.9
Classifier: Programming Language :: Python :: 3.10
Classifier: Programming Language :: Python :: 3.11
Classifier: Programming Language :: Python :: 3.12
Classifier: Topic :: Scientific/Engineering
Classifier: Topic :: Scientific/Engineering :: Artificial Intelligence
Classifier: Topic :: Software Development :: Libraries :: Python Modules
Requires-Python: >=3.9
Description-Content-Type: text/markdown
License-File: LICENSE
Requires-Dist: tqdm
Requires-Dist: pandas
Requires-Dist: jsonlines
Requires-Dist: requests

# dbsp - Distributed LLM Evaluation Benchmark Server

[![Python](https://img.shields.io/badge/python-3.9+-blue.svg)](https://www.python.org/)
[![License](https://img.shields.io/badge/license-MIT-green.svg)](LICENSE)

A distributed benchmark server for evaluating Large Language Models (LLMs) across multiple benchmarks.

## Features

- **Distributed Evaluation**: Supports concurrent worker processes
- **Multiple Benchmarks**: ARC, ETHICS, MMLU-Pro, BiPaR, and more
- **Rate Limiting**: Token-bucket-based rate limiting for API calls
- **Batch Processing**: Process multiple model-benchmark pairs in batch
- **Resume Support**: Continue interrupted tasks with `--run_id`

## Installation

```bash
pip install dbsp
```

Or install from source:

```bash
pip install -e .
```

## Quick Start

### Single Task

```bash
# Generate answers and evaluate
dbsp gen_eval -b ARC -m qwen-plus -w 10
```

### Batch Processing

Create a task configuration file `tasks.json`:

```json
[
    {"model": "qwen-plus", "benchmark": "ARC_parquet"},
    {"model": "deepseek", "benchmark": "ethics"}
]
```

Then run:

```bash
dbsp batch --config tasks.json -w 10
```
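
For larger sweeps, the config file can be generated rather than written by hand. The following is a minimal sketch that assumes `tasks.json` only needs the `model` and `benchmark` keys shown above; `build_tasks` and `write_tasks` are illustrative helper names, not part of dbsp.

```python
import json
from itertools import product
from pathlib import Path


def build_tasks(models, benchmarks):
    """Return one {"model", "benchmark"} entry per model-benchmark pair."""
    return [{"model": m, "benchmark": b} for m, b in product(models, benchmarks)]


def write_tasks(path, tasks):
    """Write the task list as JSON in the same shape as the example above."""
    Path(path).write_text(json.dumps(tasks, indent=4) + "\n", encoding="utf-8")


if __name__ == "__main__":
    # Model and benchmark names are taken from the examples in this README.
    tasks = build_tasks(["qwen-plus", "deepseek"], ["ARC_parquet", "ethics"])
    write_tasks("tasks.json", tasks)
    print(f"Wrote {len(tasks)} tasks to tasks.json")
```

The generated file can then be passed to `dbsp batch --config tasks.json`.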

### Resume Interrupted Task

```bash
dbsp gen_eval --run_id 20260210-164048-322281
```

## Commands

### gen_eval

Generate model answers and optionally evaluate.

```bash
dbsp gen_eval [options]

Options:
  -b, --benchmark     Benchmark name (e.g., ARC, ethics)
  -m, --model         Model name (e.g., qwen-plus, deepseek)
  -w, --worker_nums   Number of concurrent workers (default: 6)
  --batch_size        Batch size (default: 1)
  --rate_limit        Token bucket fill rate (default: 5.0)
  --bucket_capacity   Token bucket capacity (default: 20.0)
  --limited_test      Quick test with first 50 questions
  --evaluate          Run evaluation after generation
  --run_id            Resume from run_id
```
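
To make `--rate_limit` and `--bucket_capacity` concrete, here is a generic token bucket sketch. It is not dbsp's actual implementation: the `TokenBucket` class, the assumption that the fill rate is measured in tokens per second, and the choice to start with a full bucket are all illustrative.

```python
import time


class TokenBucket:
    """Generic token bucket: refills at `rate` tokens/second up to `capacity`."""

    def __init__(self, rate=5.0, capacity=20.0, clock=time.monotonic):
        self.rate = rate            # assumed meaning of --rate_limit
        self.capacity = capacity    # assumed meaning of --bucket_capacity
        self.tokens = capacity      # assumption: the bucket starts full
        self._clock = clock
        self._last = clock()

    def _refill(self):
        # Add tokens for the elapsed time, never exceeding the capacity.
        now = self._clock()
        self.tokens = min(self.capacity, self.tokens + (now - self._last) * self.rate)
        self._last = now

    def try_acquire(self, n=1.0):
        """Take `n` tokens if available; return False without blocking otherwise."""
        self._refill()
        if self.tokens >= n:
            self.tokens -= n
            return True
        return False
```

Under this model, the capacity bounds the size of a burst (20 back-to-back calls with the defaults), while the fill rate bounds the sustained request rate (5 calls per second with the defaults).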

### evaluate

Run evaluation on existing results.

```bash
dbsp evaluate -b ARC -r 20260210-164048-322281_qwen-plus
```
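
The `-r` value in the example appears to combine a run ID with a model name, and the run ID itself looks like a timestamp with microseconds. Assuming that layout holds (it is inferred from the examples in this README, not confirmed), a script could split and parse such identifiers as sketched below; `split_result_id` and `parse_run_id` are hypothetical helpers.

```python
from datetime import datetime


def split_result_id(result_id):
    """Split '<run_id>_<model>' at the first underscore (assumed layout)."""
    run_id, _, model = result_id.partition("_")
    return run_id, model


def parse_run_id(run_id):
    """Parse a run ID assumed to be formatted as YYYYMMDD-HHMMSS-microseconds."""
    return datetime.strptime(run_id, "%Y%m%d-%H%M%S-%f")


if __name__ == "__main__":
    run_id, model = split_result_id("20260210-164048-322281_qwen-plus")
    print(run_id, model, parse_run_id(run_id).isoformat())
```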

### list

List available benchmarks or models.

```bash
dbsp list benchmark
dbsp list model
```

### batch

Batch-process tasks from a config file.

```bash
dbsp batch --config tasks.json
```

## Project Structure

```
dbsp/
├── llm/                  # LLM model implementations
│   ├── qwen_plus/
│   ├── deepseek/
│   └── ...
├── mep_client/           # Client for generating answers
│   ├── run/
│   └── API/
├── method_server/        # Benchmark implementations
│   ├── ARC_parquet/
│   ├── ethics/
│   ├── BiPaR/
│   └── ...
└── dbsp_cli.py           # CLI entry point
```

## Adding New Models

1. Create folder under `llm/<your_model>/`
2. Implement `model.py` inheriting from `BaseModel`
3. Create `model_card.json`
4. Register in `llm/config.py`
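
As a rough illustration of step 2, the sketch below shows the general shape of a model class. The real `BaseModel` interface is not documented here, so the stand-in base class, the `generate` method, and the `EchoModel` name are all assumptions; check the existing models under `llm/` for the actual method names and signatures.

```python
from abc import ABC, abstractmethod


class BaseModel(ABC):
    """Stand-in for dbsp's BaseModel; the real interface may differ."""

    @abstractmethod
    def generate(self, prompt: str) -> str:
        """Return the model's answer for a single prompt (assumed method)."""


class EchoModel(BaseModel):
    """Toy model that echoes the prompt, useful for wiring tests only."""

    def __init__(self, prefix: str = "echo: "):
        self.prefix = prefix

    def generate(self, prompt: str) -> str:
        # A real model would call its API or local inference backend here.
        return self.prefix + prompt
```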

## Adding New Benchmarks

1. Create folder under `method_server/<benchmark_name>/`
2. Implement `prepare_ques.py` - Prepare questions and golden answers
3. Implement `eval_script.py` - Run evaluation
4. Create `dataset_card.json`
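
The core of an evaluation step for a multiple-choice benchmark can be as small as the sketch below. It assumes predictions and golden answers are available as dictionaries keyed by question ID; the `accuracy` function and this data layout are hypothetical and do not reflect the interface dbsp expects from `eval_script.py`.

```python
def accuracy(predictions, golden):
    """Fraction of golden answers matched, ignoring case and surrounding spaces.

    Questions missing from `predictions` count as wrong.
    """
    if not golden:
        raise ValueError("golden answers are empty")
    correct = sum(
        1
        for qid, answer in golden.items()
        if predictions.get(qid, "").strip().upper() == answer.strip().upper()
    )
    return correct / len(golden)


if __name__ == "__main__":
    golden = {"q1": "A", "q2": "B", "q3": "C", "q4": "D"}
    predictions = {"q1": "a", "q2": "B ", "q3": "D"}  # q4 was never answered
    print(f"accuracy = {accuracy(predictions, golden):.2f}")
```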

## License

MIT License
