Metadata-Version: 2.4
Name: dataset-foundry
Version: 0.2.1
Summary: Build, refine and validate AI safety datasets using LLM pipelines
Project-URL: GitHub, https://github.com/fastfedora/dataset-foundry
Author-email: Trevor Lohrbeer <trevor@fastfedora.com>
License-Expression: MIT
License-File: LICENSE
Classifier: Environment :: Console
Classifier: Intended Audience :: Developers
Classifier: Intended Audience :: Science/Research
Classifier: License :: OSI Approved :: MIT License
Classifier: Natural Language :: English
Classifier: Operating System :: OS Independent
Classifier: Programming Language :: Python :: 3
Classifier: Topic :: Scientific/Engineering :: Artificial Intelligence
Classifier: Typing :: Typed
Requires-Python: >=3.12
Requires-Dist: anyio>=4.8.0
Requires-Dist: datason>=0.13.0
Requires-Dist: docker>=7.1.0
Requires-Dist: gitmatch>=0.2.1
Requires-Dist: langchain-anthropic==0.3.10
Requires-Dist: langchain-core==0.3.45
Requires-Dist: langchain-openai==0.3.9
Requires-Dist: langchain==0.3.21
Requires-Dist: mergedeep>=1.3.4
Requires-Dist: pydantic>=2.10.6
Requires-Dist: pytest-asyncio>=1.0.0
Requires-Dist: pytest>=8.3.4
Requires-Dist: python-dotenv==1.0.1
Requires-Dist: pyyaml==6.0.2
Requires-Dist: textual>=6.2.0
Requires-Dist: toolz>=1.0.0
Provides-Extra: dist
Requires-Dist: build; extra == 'dist'
Requires-Dist: twine; extra == 'dist'
Description-Content-Type: text/markdown

# Dataset Foundry

A toolkit for building validated datasets.

Dataset Foundry uses the concept of data pipelines to load, generate or validate datasets. A
pipeline is a sequence of actions executed either against the dataset itself or the individual items
in the dataset.

For details on which actions are supported, see the [actions](docs/actions.md) documentation.

## Setup

1. Clone the repository
2. Create a virtual environment:
   ```bash
   python -m venv venv
   source venv/bin/activate  # On Windows: venv\Scripts\activate
   ```
3. Install the package:
   ```bash
   pip install -e .
   ```
4. Create a `.env` file in the project root with your OpenAI API key:
   ```
   OPENAI_API_KEY=your_api_key_here
   ```

### Default Settings

- Default Model: gpt-4o-mini
- Default Temperature: 0.7
- Default Number of Samples: 10
- Dataset Directory: ./datasets
- Logs Directory: ./logs

## Project Structure

```
dataset-foundry/
├── src/
│   └── dataset_foundry/
│       ├── actions/       # Actions for processing datasets and items within those datasets
│       ├── cli/           # Command-line interface tools
│       ├── core/          # Core functionality
│       └── utils/         # Utility functions
├── datasets/              # Generated datasets
├── examples/              # Example pipelines
│   └── refactorable_code/ # Example pipelines to build a dataset of code requiring refactoring
└── logs/                  # Operation logs
```

## Running Pipelines

Pipelines can be run from the command line using the `dataset-foundry` command.

```bash
dataset-foundry <pipeline_module> <dataset_name>
```

For example, to run the `generate_spec` pipeline to create specs for a dataset saved to
`datasets/dataset1`, you would use:

```bash
dataset-foundry examples/refactorable_code/generate_spec/pipeline.py dataset1
```

Use `dataset-foundry --help` to see available arguments.

### Running Examples

To generate a set of specs for a dataset named `o3v5`, you would use:

```bash
dataset-foundry examples/refactorable_code/generate_spec/pipeline.py samples --num-samples=2
```

To generate a set of functions and unit testsfrom the specs, you would use:

```bash
dataset-foundry examples/refactorable_code/generate_all_from_spec/pipeline.py samples
```

To run the unit tests for the generated functions, you would use:

```bash
dataset-foundry examples/refactorable_code/regenerate_unit_tests/pipeline.py samples
```

If some of the unit tests fail, you can regenerate them by running:

```bash
dataset-foundry examples/refactorable_code/regenerate_unit_tests/pipeline.py samples
```

## Variable Substitutions

Variable substitutions allows you to use variables in your prompts and in certain parameters passed
into pipeline actions.

Prompt templates and certain parameters are parsed as f-strings, with the following enhancements:

- Dotted references are supported and resolve both dictionary keys or object attributes. For
  instance, `{spec.name}` will return the value of `spec['name']` if `spec` is a dictionary, or
  the value of `spec.name` if `spec` is an object.
- Formatters can be specified after a colon. For example, `{spec:yaml}` will return the `spec`
  object formatted as a YAML string. Supported formatters include: `yaml`, `json`, `upper`, `lower`.

For instance, if an item is being processed with an id of `123` and a `spec` dictionary with a
`name` key of `my_function`, the following will save the `code` property of the item as a file named
`item_123_my_function.py`:

```python
   ...
   save_item(contents=Key("code"), filename="item_{id}_{spec.name}.py"),
   ...
```

## License

This project is licensed under the MIT License - see the [LICENSE](LICENSE) file for details.