Metadata-Version: 2.1
Name: oai_dataset_processor
Version: 0.1.1
Summary: A sample dataset processor for evaluating datasets.
Author-email: Ben Gitter <gitterbd@gmail.com>
License: Apache-2.0
Project-URL: Homepage, https://github.com/gittb/oai_dataset_processor
Project-URL: Bug Tracker, https://github.com/gittb/oai_dataset_processor/issues
Classifier: Development Status :: 3 - Alpha
Classifier: Intended Audience :: Developers
Classifier: License :: OSI Approved :: Apache Software License
Classifier: Programming Language :: Python :: 3.12
Requires-Python: >=3.12
Description-Content-Type: text/markdown
License-File: LICENSE
Requires-Dist: openai>=1.57.0
Requires-Dist: tqdm>=4.65.0
Requires-Dist: pandas>=2.0.0
Requires-Dist: sqlalchemy>=2.0.0
Requires-Dist: pydantic>=2.10.0
Provides-Extra: dev
Requires-Dist: mypy; extra == "dev"
Requires-Dist: ruff; extra == "dev"

# OAI Dataset Processor

**OAI Dataset Processor** is a modular framework for processing large datasets using OpenAI-compatible endpoints. It provides SQL-based job persistence, worker-limited task distribution, and JSON schema validation.

## Installation

```bash
pip install oai-dataset-processor
```

## Key Features
- **Job Persistence**: Uses SQLite by default, configurable to any SQLAlchemy database
- **Bulk Processing**: Process multiple samples through OpenAI-compatible endpoints
- **Async Execution**: A semaphore caps the number of concurrent requests at the configured worker count
- **JSON Schema Validation**: Enforce structured outputs using JSON schemas
- **Progress Monitoring**: Live progress bar for async tasks
- **Extensibility**: Easy to extend for custom storage or processing logic
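
The semaphore-based worker limit listed above can be illustrated with a minimal, standalone sketch. It uses only the standard library and is not taken from this package's internals: `process_sample` and `run_with_limit` are hypothetical names, and the `asyncio.sleep` call stands in for a request to an OpenAI-compatible endpoint.

```python
import asyncio


async def process_sample(sample: str) -> str:
    # Hypothetical stand-in for a network call to an OpenAI-compatible endpoint.
    await asyncio.sleep(0.01)
    return sample.upper()


async def run_with_limit(samples: list[str], workers: int) -> tuple[list[str], int]:
    # The semaphore lets at most `workers` coroutines run the guarded section at once.
    semaphore = asyncio.Semaphore(workers)
    active = 0  # coroutines currently inside the guarded section
    peak = 0    # highest value `active` has reached

    async def bounded(sample: str) -> str:
        nonlocal active, peak
        async with semaphore:
            active += 1
            peak = max(peak, active)
            try:
                return await process_sample(sample)
            finally:
                active -= 1

    # gather() returns results in the same order as the inputs.
    results = await asyncio.gather(*(bounded(s) for s in samples))
    return list(results), peak


samples = ["a", "b", "c", "d"]
results, peak = asyncio.run(run_with_limit(samples, workers=2))
print(results, peak)
```

All four tasks are scheduled immediately, but only `workers` of them hold the semaphore at any moment, so the endpoint never sees more than that many requests in flight.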

## Quick Start

```python
from dataset_processor import OpenAIDatasetProcessor, create_runner_sample
from pydantic import BaseModel

# Define output schema
class SampleResponse(BaseModel):
    grade: int
    coherence: int

# Prepare samples
samples = [
    "The quick brown fox jumps over the lazy dog.",
    "What day today?",
    "The illusion of knowledge is the barrier to discovery.",
    "gpus go burrr"
]

job_samples = [
    create_runner_sample(
        job_id="job_123",
        model_name="gpt-4",
        instructions="Grade the sentence for grammar and coherence (1-10 each)",
        input_data=sample,
        output_json_schema=SampleResponse.model_json_schema(),
        sample_id=idx
    ) for idx, sample in enumerate(samples)
]

# Process samples
processor = OpenAIDatasetProcessor(
    base_url="YOUR_BASE_URL_HERE",
    api_key="YOUR_API_KEY_HERE",
    workers=20
)

processor.ingest_samples(job_samples)
results = processor.run_job("job_123")

# Export results
results.to_jsonl("output_results.jsonl")
print(processor.get_job_status("job_123"))
```

## Configuration

- **Database**: Defaults to `sqlite:///datasetrunner.sqlite`. Pass `db_url` to `OpenAIDatasetProcessor` to use a different SQLAlchemy database
- **Parallelism**: Set concurrent workers via the `workers` parameter
- **Schema Validation**: Define output schemas using Pydantic models
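
As a rough illustration of the schema validation option, the sketch below shows how a Pydantic model yields a JSON schema and how a raw JSON response can be checked against it. It reuses the `SampleResponse` model from the Quick Start; the JSON strings are made-up inputs, and this is plain Pydantic v2 usage, not necessarily how the processor validates responses internally.

```python
from pydantic import BaseModel, ValidationError


class SampleResponse(BaseModel):
    grade: int
    coherence: int


# JSON schema dict, suitable for passing as `output_json_schema`.
schema = SampleResponse.model_json_schema()

# A well-formed response parses into a typed model instance.
good = SampleResponse.model_validate_json('{"grade": 8, "coherence": 9}')

# A malformed response (wrong type, missing field) raises ValidationError.
try:
    SampleResponse.model_validate_json('{"grade": "high"}')
    rejected = False
except ValidationError:
    rejected = True

print(sorted(schema["required"]), good.grade, good.coherence, rejected)
```

Defining the schema as a model keeps the expected fields in one place: the same class produces the schema sent with each sample and can parse the returned JSON afterwards.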

## Dependencies
- `openai` (>=1.57.0)
- `tqdm` (>=4.65.0)
- `pandas` (>=2.0.0)
- `sqlalchemy` (>=2.0.0)
- `pydantic` (>=2.10.0)

## Contributing
Contributions are welcome! Please submit pull requests for features, optimizations, or documentation.
