Metadata-Version: 2.1
Name: melinda
Version: 0.1.1
Summary: Synthetic data generation.
Home-page: https://github.com/HSE-LAMBDA/LINDA
License: MIT
Keywords: synthetic data,augmentation,generative models,tabular data
Author: Mikhail Hushchyn
Author-email: hushchyn.mikhail@gmail.com
Requires-Python: >=3.8,<4.0
Classifier: License :: OSI Approved :: MIT License
Classifier: Programming Language :: Python :: 3
Classifier: Programming Language :: Python :: 3.8
Classifier: Programming Language :: Python :: 3.9
Classifier: Programming Language :: Python :: 3.10
Classifier: Programming Language :: Python :: 3.11
Classifier: Programming Language :: Python :: 3.12
Requires-Dist: jupyterlab (==4.0.5)
Requires-Dist: kaleido (==0.2.1)
Requires-Dist: plotly (==5.16.1)
Requires-Dist: probaforms (==0.2.0)
Requires-Dist: scikit-learn (==1.3.0)
Requires-Dist: sdmetrics (==0.11.0)
Project-URL: Repository, https://github.com/HSE-LAMBDA/LINDA
Description-Content-Type: text/markdown

# Welcome to LINDA

``MELINDA`` is a Python library for generating synthetic tabular data.
It fits generative models to your real data to learn its statistical
properties, then samples new synthetic rows from the fitted model.

## Installation
```bash
git clone https://github.com/hse-cs/LINDA.git
cd LINDA
pip install -e .
```
or
```bash
poetry install
```

## Basic usage
The following snippet builds a small example dataset that stands in for real data, fits a generative model to it, and samples synthetic data.
```python
import numpy as np
import pandas as pd
from melinda.models import ProbaformsSynthesizer
from probaforms.models import CVAE

# generate an example of real data
n = 100
data_real = pd.DataFrame()
data_real['col_1'] = np.random.rand(n)
data_real['col_2'] = np.random.rand(n)
data_real['col_3'] = [str(i) for i in np.random.randint(0, 10, n)]
data_real['col_4'] = [str(i) for i in np.random.randint(0, 5, n)]

num_cols = ['col_1', 'col_2']
cat_cols = ['col_3', 'col_4']
lab_cols = None

# fit a generative model
model = CVAE(latent_dim=10, hidden=(10,), lr=0.001, n_epochs=10)
gen = ProbaformsSynthesizer(model, num_cols, cat_cols, lab_cols, cat_transform='OneHotEncoder')
gen.fit(data_real)

# sample synthetic data
data_synthetic = gen.sample(n_samples=10)
data_synthetic.head()
```
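
A quick way to sanity-check the result is to compare simple per-column statistics of the real and synthetic tables. The sketch below is not part of the ``MELINDA`` API: `compare_columns` is a hypothetical helper written with plain `pandas`, and the second randomly drawn table is only a stand-in for the output of `gen.sample`, so that the snippet runs on its own.

```python
import numpy as np
import pandas as pd


def compare_columns(real, synthetic, num_cols, cat_cols):
    """Hypothetical helper: summarize how far synthetic columns are from real ones."""
    rows = []
    for col in num_cols:
        rows.append({
            "column": col,
            # absolute differences of the column mean and standard deviation
            "mean_diff": abs(real[col].mean() - synthetic[col].mean()),
            "std_diff": abs(real[col].std() - synthetic[col].std()),
        })
    for col in cat_cols:
        p = real[col].value_counts(normalize=True)
        q = synthetic[col].value_counts(normalize=True)
        # align on the union of categories; a category missing on one side gets frequency 0
        p, q = p.align(q, fill_value=0.0)
        # total variation distance between the two category distributions (0 = identical)
        rows.append({"column": col, "tv_distance": 0.5 * (p - q).abs().sum()})
    # statistics that do not apply to a column type are left as NaN
    return pd.DataFrame(rows).set_index("column")


def make_table(rng, n):
    """Build a toy table with the same layout as in the usage example."""
    return pd.DataFrame({
        "col_1": rng.random(n),
        "col_2": rng.random(n),
        "col_3": rng.integers(0, 10, n).astype(str),
        "col_4": rng.integers(0, 5, n).astype(str),
    })


rng = np.random.default_rng(0)
num_cols = ["col_1", "col_2"]
cat_cols = ["col_3", "col_4"]

data_real = make_table(rng, 100)
# stand-in for `gen.sample(...)`: an independent draw from the same distribution
data_synthetic = make_table(rng, 100)

report = compare_columns(data_real, data_synthetic, num_cols, cat_cols)
print(report)
```

Smaller values mean the marginal distributions are closer. This only looks at one column at a time, so it says nothing about whether relationships between columns are preserved.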
