Metadata-Version: 2.4
Name: finesse-benchmark-database
Version: 0.1.14
Summary: Data generation factory for atomic probes in Finesse benchmark. Generates probes_atomic.jsonl from Wikimedia Wikipedia.
License-File: licence
Author: winter.sci.dev
Author-email: enzoescipy@gmail.com
Requires-Python: >=3.10,<4.0
Classifier: Programming Language :: Python :: 3
Classifier: Programming Language :: Python :: 3.10
Classifier: Programming Language :: Python :: 3.11
Classifier: Programming Language :: Python :: 3.12
Classifier: Programming Language :: Python :: 3.13
Classifier: Programming Language :: Python :: 3.14
Requires-Dist: datasets (>=4.3.0,<5.0.0)
Requires-Dist: torch (>=2.1.0,<3.0.0)
Requires-Dist: transformers (>=4.35.0,<5.0.0)
Description-Content-Type: text/markdown

---
license: apache-2.0
---

[![GitHub](https://img.shields.io/badge/GitHub-Repository-blue?logo=github)](https://github.com/enzoescipy/finesse-benchmark)
[![huggingface](https://img.shields.io/badge/%F0%9F%A4%97%20Hugging%20Face-Model-blue)](https://huggingface.co/datasets/enzoescipy/finesse-benchmark-database)
[![PyPI](https://img.shields.io/badge/PyPI-Package-green?logo=pypi)](https://pypi.org/project/finesse-benchmark/)
[![Blog](https://img.shields.io/badge/Blog-Article-orange?logo=medium)](https://www.winter-sci-dev.com/posts/embed-sequence-merger-vbert-ppe-article/)


# Finesse Benchmark Database

## Overview

`finesse-benchmark-database` is a data generation factory for atomic probes in the Finesse benchmark. It generates `probes_atomic.jsonl` files from Wikimedia Wikipedia datasets using Hugging Face's `datasets` library, tokenizers from `transformers`, and PyTorch (`torch`), all of which are declared as package dependencies.

This tool is designed to create high-quality, language-specific probe datasets for benchmarking fine-grained understanding in NLP tasks.

## Installation

Install the package from PyPI:

```bash
pip install finesse-benchmark-database
```

The package requires Python 3.10 or later (and below 4.0).

## Usage

Here's a complete example of how to configure and generate a dataset:

```python
from finesse_benchmark_database.config import ProbeConfig
from finesse_benchmark_database.main import generate_dataset

# Define the configuration
my_config = ProbeConfig(
    languages=['en', 'ko'],  # Languages to generate probes for
    samples_per_language=10,  # Number of samples per language (reduce for testing)
    output_file='my_first_probes.jsonl',  # Output file path
    seed=123  # Random seed for reproducibility
)

# Generate the dataset
print(f"Generating '{my_config.output_file}'...")
generate_dataset(my_config)
print("Dataset generation completed!")
```

### Configuration Options

- `languages`: List of language codes (e.g., `['en', 'ko', 'fr']`).
- `samples_per_language`: Number of probe samples to generate per language.
- `output_file`: Path to the output JSONL file.
- `seed`: Optional seed for deterministic results.

## Output Format

The output file (the path set by `output_file`, e.g., `probes_atomic.jsonl`) is a JSON Lines file in which each line is a JSON object representing one probe sample. Each object has the following structure:

- **source**: An object containing metadata about the origin of the probe.
  - `dataset`: The source dataset, e.g., "wikimedia/wikipedia".
  - `article_id`: The unique identifier of the article, e.g., "5438".
  - `source_article_hash`: A hash identifying the source article (elided in the example below).
  - `lang`: The language code, e.g., "en".
- **beads**: An array of strings, where each string is a chunk ("bead") of the article text, processed into atomic units for probing.

Example:
```json
{
  "source": {
    "dataset": "wikimedia/wikipedia",
    "article_id": "5438",
    "source_article_hash": "...article_hash...",
    "lang": "en"
  },
  "beads": [
    "Capricorn ( pl. capricorns )...",
    "...more chunks..."
  ]
}
```
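
To illustrate how a generated file might be consumed, the following is a minimal sketch that reads a probes file line by line and counts probes and beads per language. It relies only on the `source`, `lang`, and `beads` fields documented above; the helper names `iter_probes` and `summarize` are hypothetical and are not part of the package.

```python
import json
from collections import Counter
from typing import Iterator


def iter_probes(path: str) -> Iterator[dict]:
    """Yield one probe record per non-empty line of a JSON Lines file."""
    with open(path, encoding="utf-8") as fh:
        for line_number, line in enumerate(fh, start=1):
            line = line.strip()
            if not line:
                continue  # tolerate blank lines between records
            record = json.loads(line)
            # Check only the two top-level fields documented in this README.
            if not isinstance(record.get("source"), dict):
                raise ValueError(f"line {line_number}: 'source' must be an object")
            if not isinstance(record.get("beads"), list):
                raise ValueError(f"line {line_number}: 'beads' must be an array")
            yield record


def summarize(path: str) -> dict:
    """Return {lang: {"probes": n, "beads": m}} for the given file."""
    probes = Counter()
    beads = Counter()
    for record in iter_probes(path):
        # Fall back to "unknown" if a record lacks a language code.
        lang = record["source"].get("lang", "unknown")
        probes[lang] += 1
        beads[lang] += len(record["beads"])
    return {lang: {"probes": probes[lang], "beads": beads[lang]} for lang in probes}
```

Calling `summarize('my_first_probes.jsonl')` on the file produced in the Usage section would return one entry per language code found in the file. Reading line by line keeps memory use flat even for large outputs.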

## Requirements

- Python (>=3.10, <4.0)
- `datasets` (>=4.3.0, <5.0.0)
- `transformers` (>=4.35.0, <5.0.0)
- `torch` (>=2.1.0, <3.0.0; for tokenization)

## License

This project is licensed under the Apache-2.0 license.

## Contributing

Contributions are welcome! Please open issues or pull requests on the [GitHub repository](https://github.com/enzoescipy/finesse-benchmark).
