Metadata-Version: 2.4
Name: datatunner
Version: 0.1.2
Summary: Tool for determining the optimal proportion of synthetic data for ML models
Home-page: https://github.com/leandrocrx/datatunner
Author: Leandro Costa Rocha
Author-email: Leandro Costa Rocha <leandro.rocha@example.com>
License: MIT
Project-URL: Homepage, https://github.com/leandrocrx/datatunner
Project-URL: Repository, https://github.com/leandrocrx/datatunner
Project-URL: Issues, https://github.com/leandrocrx/datatunner/issues
Keywords: machine learning,deep learning,synthetic data,data augmentation,neural networks
Requires-Python: >=3.8
Description-Content-Type: text/markdown
License-File: LICENSE
Dynamic: author
Dynamic: home-page
Dynamic: license-file
Dynamic: requires-python

# 🎯 DataTunner

**Automatic Optimization of Synthetic Data Proportions for Deep Learning Models**

[![License: MIT](https://img.shields.io/badge/License-MIT-yellow.svg)](https://opensource.org/licenses/MIT)
[![Python 3.8+](https://img.shields.io/badge/python-3.8+-blue.svg)](https://www.python.org/downloads/)
[![PyPI version](https://badge.fury.io/py/datatunner.svg)](https://badge.fury.io/py/datatunner)

[🇧🇷 Versão em Português](https://github.com/leandrocrx/datatunner/blob/main/LEIAME.md)

## 📖 About

DataTunner is an open-source Python tool that automatically determines the proportion of synthetic data that maximizes neural network model performance.

The growing demand for large training datasets in Deep Learning has driven the use of synthetic data to address data scarcity and dataset imbalance. DataTunner systematizes the choice of how much synthetic data to use, a choice that has traditionally been made empirically and without structure.

## ✨ Key Features

- 🔄 **Automatic Hybrid Dataset Generation**: Combines real and synthetic data in different proportions
- 🖼️ **Image Support**: CNNs (ResNet, VGG, MobileNet) with data augmentation
- 📊 **Tabular Data Support**: MLPs, Random Forest, XGBoost with SMOTE and CTGAN
- 📈 **Comprehensive Metrics**: Accuracy, Precision, Recall, F1-Score, ROC-AUC
- 🎨 **Rich Visualizations**: Interactive charts and detailed reports
- 🔬 **Guaranteed Reproducibility**: Fixed seeds and controlled hyperparameters
- ⚡ **GPU Support**: CUDA acceleration
- 💾 **Checkpointing**: Resume interrupted experiments
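
To illustrate the checkpointing idea, the sketch below records each finished proportion in a JSON file so that an interrupted sweep can pick up where it stopped. It is not DataTunner's internal implementation: `run_sweep`, `toy_score`, and the checkpoint file layout are hypothetical names chosen for this example.

```python
import json
import tempfile
from pathlib import Path


def run_sweep(proportions, evaluate, checkpoint_path):
    """Evaluate each proportion once, skipping those already in the checkpoint."""
    path = Path(checkpoint_path)
    # Load earlier results if a checkpoint exists (hypothetical file layout).
    done = json.loads(path.read_text()) if path.exists() else {}
    for p in proportions:
        key = str(p)  # JSON object keys must be strings
        if key in done:
            continue  # already finished in a previous run
        done[key] = evaluate(p)
        # Persist after every step so a crash loses at most one evaluation.
        path.write_text(json.dumps(done))
    return done


def toy_score(p):
    # Stand-in for training and evaluating a model at proportion p.
    return round(1.0 - (p - 0.3) ** 2, 4)


with tempfile.TemporaryDirectory() as tmp:
    ckpt = Path(tmp) / "sweep.json"
    first = run_sweep([0.0, 0.1], toy_score, ckpt)          # "interrupted" run
    resumed = run_sweep([0.0, 0.1, 0.3], toy_score, ckpt)   # only 0.3 is new
```

Writing the file after every proportion trades a little I/O for the guarantee that at most one evaluation is lost on a crash.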

## 🚀 Installation

### Basic Installation

```bash
pip install datatunner
```
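
A quick way to confirm the install worked is to ask Python for the package version. The sketch below uses the standard-library `importlib.metadata` module (available from Python 3.8); the helper name `installed_version` is only illustrative.

```python
from importlib.metadata import PackageNotFoundError, version


def installed_version(package):
    """Return the installed version string, or None if the package is absent."""
    try:
        return version(package)
    except PackageNotFoundError:
        return None


# Prints the version if datatunner is installed, otherwise None.
print(installed_version("datatunner"))
```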

### Development Installation

```bash
git clone https://github.com/leandrocrx/datatunner.git
cd datatunner
python -m venv venv
source venv/bin/activate  # Windows: venv\Scripts\activate
pip install -e ".[dev]"
```

## 📚 Quick Start

### Image Example (CIFAR-10)

```python
from datatunner import DataTunner
from datatunner.models.cnn import ResNetClassifier

# Configure optimizer
tunner = DataTunner(
    data_type='image',
    real_data_path='data/cifar10/real',
    synthetic_data_path='data/cifar10/synthetic',
    output_dir='results/cifar10'
)

# Define model
model = ResNetClassifier(num_classes=10, architecture='resnet18')

# Run optimization
results = tunner.optimize(
    model=model,
    proportions=[0.0, 0.1, 0.2, 0.3, 0.5, 0.7, 1.0],
    epochs=50,
    batch_size=64,
    n_trials=3  # Repetitions for robustness
)

# Visualize results
tunner.plot_results()
print(f"Best proportion: {results['best_proportion']}")
```
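
With `n_trials` greater than 1, each proportion yields several scores that have to be aggregated before one can be called best. The sketch below shows one plausible way to do that, by comparing mean scores. The structure of DataTunner's `results` object is not documented here, so `summarize`, `best_proportion`, and the `toy_scores` values are hypothetical and purely illustrative.

```python
from statistics import mean, stdev


def summarize(trial_scores):
    """Map each proportion to (mean, standard deviation) of its trial scores."""
    return {
        p: (mean(s), stdev(s) if len(s) > 1 else 0.0)
        for p, s in trial_scores.items()
    }


def best_proportion(trial_scores):
    """Return the proportion with the highest mean score."""
    summary = summarize(trial_scores)
    return max(summary, key=lambda p: summary[p][0])


# Made-up scores for illustration only: three trials per proportion.
toy_scores = {
    0.0: [0.80, 0.82, 0.81],
    0.3: [0.86, 0.84, 0.85],
    1.0: [0.70, 0.74, 0.72],
}
print(best_proportion(toy_scores))
```

Reporting the standard deviation alongside the mean helps judge whether the gap between two proportions is larger than the trial-to-trial noise.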

### Tabular Data Example

```python
from datatunner import DataTunner
from datatunner.models.mlp import MLPClassifier
from datatunner.generators.smote import SMOTEGenerator

# Configure optimizer
tunner = DataTunner(
    data_type='tabular',
    real_data_path='data/adult/train.csv',
    test_data_path='data/adult/test.csv',
    output_dir='results/adult'
)

# Generate synthetic data with SMOTE
generator = SMOTEGenerator(k_neighbors=5)
synthetic_data = generator.generate(data=tunner.real_data, n_samples=5000)

# Define model
model = MLPClassifier(hidden_layers=[128, 64, 32], dropout=0.3)

# Run optimization
results = tunner.optimize(
    model=model,
    synthetic_data=synthetic_data,
    proportions=[0.0, 0.1, 0.2, 0.3, 0.5, 0.7, 1.0],
    epochs=100,
    batch_size=128
)

# Detailed report
tunner.generate_report(format='html')
```
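
For readers unfamiliar with SMOTE, the core idea is to create a new minority-class point somewhere on the line segment between an existing minority sample and one of its nearest minority neighbours. The sketch below shows only that interpolation step in plain Python. It is not the implementation behind `SMOTEGenerator`, and the function name `smote_like` and its parameters are assumptions made for this example.

```python
import math
import random


def smote_like(minority, n_samples, k_neighbors=5, seed=0):
    """Create synthetic points by interpolating between minority neighbours."""
    minority = list(minority)
    k = min(k_neighbors, len(minority) - 1)
    if k < 1:
        raise ValueError("need at least two minority samples")
    rng = random.Random(seed)  # fixed seed keeps the output reproducible
    synthetic = []
    for _ in range(n_samples):
        i = rng.randrange(len(minority))
        base = minority[i]
        others = minority[:i] + minority[i + 1:]
        # k nearest neighbours of `base` by Euclidean distance
        neighbours = sorted(others, key=lambda p: math.dist(base, p))[:k]
        neighbour = rng.choice(neighbours)
        gap = rng.random()  # position along the segment base -> neighbour
        synthetic.append(
            tuple(b + gap * (n - b) for b, n in zip(base, neighbour))
        )
    return synthetic


minority_points = [(1.0, 1.0), (1.2, 0.9), (0.9, 1.1), (1.1, 1.3)]
new_points = smote_like(minority_points, n_samples=3, k_neighbors=2)
```

Because every synthetic point lies between two real minority points, the new samples stay inside the region the minority class already occupies.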

## 📊 Methodology

### For Image Datasets

1. **Input**: Real images + synthetic images (augmentation, GANs, diffusion)
2. **Mixing**: Real and synthetic images combined at each tested proportion, with class balancing
3. **Training**: CNNs (ResNet, VGG, MobileNet) with fixed hyperparameters
4. **Evaluation**: Metrics on independent test set
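
The mixing step can be sketched as follows. This is a minimal illustration under stated assumptions, not DataTunner's own code: it assumes a proportion means the fraction of synthetic samples in the mixed training set and that class balancing means drawing the same number of samples per class. The function `mix_balanced`, its `per_class` parameter, and the toy file names are hypothetical.

```python
import random
from collections import defaultdict


def mix_balanced(real, synthetic, proportion, per_class, seed=0):
    """Return `per_class` samples per label, a `proportion` of them synthetic.

    `real` and `synthetic` are lists of (item, label) pairs.
    """
    if not 0.0 <= proportion <= 1.0:
        raise ValueError("proportion must be in [0, 1]")
    rng = random.Random(seed)
    real_by, syn_by = defaultdict(list), defaultdict(list)
    for item, label in real:
        real_by[label].append((item, label))
    for item, label in synthetic:
        syn_by[label].append((item, label))
    n_syn = round(proportion * per_class)
    n_real = per_class - n_syn
    mixed = []
    for label in sorted(set(real_by) | set(syn_by)):
        # rng.sample raises ValueError if a class has too few samples.
        mixed += rng.sample(real_by[label], n_real)
        mixed += rng.sample(syn_by[label], n_syn)
    rng.shuffle(mixed)
    return mixed


classes = ("cat", "dog")
real_set = [(f"real_{c}_{i}.png", c) for c in classes for i in range(10)]
syn_set = [(f"syn_{c}_{i}.png", c) for c in classes for i in range(10)]
mixed_set = mix_balanced(real_set, syn_set, proportion=0.3, per_class=10)
```

Holding `per_class` constant across proportions keeps the training-set size fixed, so differences in the metrics reflect the data mix and not the amount of data.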

### For Tabular Datasets

1. **Input**: Real tabular data + synthetic (SMOTE, CTGAN)
2. **Mixing**: Real and synthetic rows combined while preserving feature distributions and correlations
3. **Training**: MLPs, Random Forest, XGBoost, LightGBM
4. **Evaluation**: Metrics suited to imbalanced classes
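
Accuracy alone can look good on imbalanced data even when the minority class is mostly missed, which is why precision, recall, and F1 on the minority class matter. The sketch below computes them from scratch. The README does not say exactly which metrics the tabular pipeline reports, so this only illustrates the standard definitions; `precision_recall_f1` and the toy labels are not part of DataTunner.

```python
def precision_recall_f1(y_true, y_pred, positive=1):
    """Compute precision, recall and F1 for the `positive` class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == positive and p == positive for t, p in pairs)
    fp = sum(t != positive and p == positive for t, p in pairs)
    fn = sum(t == positive and p != positive for t, p in pairs)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1


# Toy labels: 8 majority samples (0) and 2 minority samples (1).
y_true = [0] * 8 + [1] * 2
y_pred = [0] * 8 + [1, 0]  # one of the two minority samples is missed
accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
precision, recall, f1 = precision_recall_f1(y_true, y_pred)
print(accuracy, precision, recall, round(f1, 3))
```

Here accuracy is 0.9 while recall on the minority class is only 0.5, the kind of gap that accuracy by itself hides.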

## 🏗️ Architecture

```
datatunner/
├── core/           # Optimization and evaluation engine
├── generators/     # Synthetic data generators
├── models/         # ML/DL models
├── utils/          # Utilities and visualization
└── config/         # Configuration
```

## 🧪 Testing

```bash
pytest tests/ -v --cov=datatunner
```

## 📖 Documentation

See complete examples in the [`examples/`](https://github.com/leandrocrx/datatunner/tree/main/examples) folder and documentation in the repository's Markdown files.

## 🤝 Contributing

Contributions are welcome! Please read our [contribution guide](https://github.com/leandrocrx/datatunner/blob/main/CONTRIBUTING.md).

## 📄 License

This project is licensed under the MIT License - see the [LICENSE](https://github.com/leandrocrx/datatunner/blob/main/LICENSE) file for details.

## 📚 Citation

If you use DataTunner in your research, please cite:

```bibtex
@software{datatunner2026,
  author = {Rocha, Leandro Costa},
  title = {DataTunner: Synthetic Data Proportion Optimization},
  year = {2026},
  url = {https://github.com/leandrocrx/datatunner}
}
```

## 👥 Authors

- **Leandro Costa Rocha** - *Lead Developer*
  - Instagram: [@leandrocr.adv](https://instagram.com/leandrocr.adv)
  - LinkedIn: [leandro-costa-rocha](https://www.linkedin.com/in/leandro-costa-rocha-b40189b0)

## 🙏 Acknowledgments

Based on research on synthetic data optimization for Deep Learning models.

---

**Developed with ❤️ for the Data Science community**
