Metadata-Version: 2.4
Name: clover-synth
Version: 0.1.0
Summary: Clover: Synthetic Health Data Generation and Validation Library
Author: Katleen Blanchet
Author-email: Yue Qi <yue.qi.chum@ssss.gouv.qc.ca>, Lorrie Herbault <lorrie.herbault.chum@ssss.gouv.qc.ca>
Maintainer-email: Yue Qi <yue.qi.chum@ssss.gouv.qc.ca>, Lorrie Herbault <lorrie.herbault.chum@ssss.gouv.qc.ca>
License: Apache License (2.0)
Keywords: data privacy,healthcare data,synthetic data
Requires-Python: <3.11,>=3.8
Requires-Dist: category-encoders<3.0,>=2.6.1
Requires-Dist: datasynthesizer<0.2.0,>=0.1.11
Requires-Dist: diffprivlib<0.7,>=0.6.4
Requires-Dist: disjoint-set<1.0,>=0.7.4
Requires-Dist: imbalanced-learn<0.13.0,>=0.11.0
Requires-Dist: jinja2<=4.0
Requires-Dist: matplotlib<4.0,>=3.7.1
Requires-Dist: networkx<4.0,>=3.1
Requires-Dist: numpy<2.0,==1.23.5
Requires-Dist: opacus<1.5.0,==1.4.0
Requires-Dist: optuna<4.1.1,>=3.2.0
Requires-Dist: pandas<2.2.3,>=1.5.3
Requires-Dist: pymoo<0.7,==0.6.0.1
Requires-Dist: pytest<8.0
Requires-Dist: ray[tune]<3.0,>=2.5.1
Requires-Dist: scikit-learn<1.5.3,>=1.4
Requires-Dist: scipy<4.66.6,>=1.10.1
Requires-Dist: sdv<1.3.0,>=1.2.1
Requires-Dist: seaborn<1.0,==0.13.2
Requires-Dist: tqdm<5.0,==4.65.0
Requires-Dist: xgboost<2.1.3,>=1.7.5
Provides-Extra: dev
Requires-Dist: black==25.1.0; extra == 'dev'
Requires-Dist: black[jupyter]; extra == 'dev'
Requires-Dist: furo==2023.9.10; extra == 'dev'
Requires-Dist: icecream==2.1.3; extra == 'dev'
Requires-Dist: ipython==8.12.3; extra == 'dev'
Requires-Dist: nbsphinx-link==1.3.0; extra == 'dev'
Requires-Dist: nbsphinx==0.9.1; extra == 'dev'
Requires-Dist: pandoc==2.3; extra == 'dev'
Requires-Dist: pytest==7.3.1; extra == 'dev'
Requires-Dist: sphinx-autodoc-typehints==1.23.0; extra == 'dev'
Requires-Dist: sphinx==6.2.1; extra == 'dev'
Requires-Dist: tomli-w==1.0.0; extra == 'dev'
Requires-Dist: tomli==2.0.1; extra == 'dev'
Description-Content-Type: text/markdown

# Clover: Synthetic Health Data Generation and Validation Library

<div align="center">

[![License](https://img.shields.io/badge/License-Apache%202.0-blue.svg)](https://opensource.org/licenses/Apache-2.0)
[![Python 3.8+](https://img.shields.io/badge/Python-3.8%2B-blue)](https://www.python.org/downloads/)
[![Tests with Pytest](https://img.shields.io/badge/Tests-Pytest-green)](https://pytest.org)
[![Coverage](https://codecov.io/gh/CRCHUM-CITADEL/clover/branch/main/graph/badge.svg)](https://codecov.io/gh/CRCHUM-CITADEL/clover)
[![CI: Pytest](https://github.com/CRCHUM-CITADEL/clover/actions/workflows/pytest.yml/badge.svg)](https://github.com/CRCHUM-CITADEL/clover/actions/workflows/pytest.yml)
[![Code Style: Black](https://img.shields.io/badge/Code%20Style-Black-black)](https://black.readthedocs.io)
[![CI: Black](https://github.com/CRCHUM-CITADEL/clover/actions/workflows/black.yml/badge.svg)](https://github.com/CRCHUM-CITADEL/clover/actions/workflows/black.yml)
[![Docs: Sphinx](https://img.shields.io/badge/Docs-Sphinx-blue)](https://www.sphinx-doc.org)

</div>

Clover is a library for generating tabular synthetic data and critically assessing its quality. It provides eight synthetic data generators and a unified evaluation framework for the generated data. Evaluation measures how much information from the original data is preserved and how much privacy protection is achieved.

Acknowledging the inherent trade-off between data utility and privacy, Clover is designed to support the creation of synthetic datasets that strike an effective balance between real-world usefulness and the imperative of safeguarding patient privacy. For each generator included in the library, a differentially private version is also available.

## Table of Contents

* [Useful Links](#useful-links)
* [Current Features](#current-features)
* [Usage](#usage)
  - [Requirements](#requirements)
  - [Installation](#installation)
* [Quickstart](#quickstart)
* [Join Our Community](#join-our-community)
* [Ongoing Work - Next Steps](#ongoing-work---next-steps)

## Useful Links

* [Documentation](#documentation)
* [GitHub repository](https://github.com/CRCHUM-CITADEL/clover)

## Documentation

Documentation is available at https://crchum-citadel.github.io/clover/.

## Current Features

* Synthetic data generators with integrated differential privacy, supporting continuous and categorical variables (unique identifiers are not handled):
   - [DataSynthesizer](https://github.com/DataResponsibly/DataSynthesizer)
   - [Synthpop](https://github.com/hazy/synthpop)
   - [SMOTE](https://imbalanced-learn.org/stable/over_sampling.html#from-random-over-sampling-to-smote-and-adasyn)
   - [MST (Maximum Spanning Tree)](https://github.com/ryan112358/private-pgm/tree/master)
   - [CTGAN](https://github.com/sdv-dev)
   - [TVAE](https://github.com/sdv-dev)
   - [CTAB-GAN+](https://github.com/Team-TUD/CTAB-GAN-Plus)
   - [FinDiff](https://github.com/sattarov/FinDiff)
* Utility and privacy reports to assess the fidelity and privacy of the synthetic data:
   - Summary table
   - Detailed report with figures
* The following utility metrics are implemented:
   - Univariate metrics 
     - Continuous & categorical consistency 
     - Continuous & categorical statistics 
     - Hellinger distance 
     - Kullback-Leibler divergence
   - Bivariate metrics 
     - Pairwise Pearson and Spearman correlation difference 
     - Pairwise Chi-square correlation difference
   - Population metrics 
     - Distinguishability 
     - Cross learning (regression & classification)
   - Application metrics 
     - Prediction (regression & classification)
     - F-Score for binary classification with continuous variables only
     - Feature importance
* The following privacy metrics are implemented:
   - Reidentification metrics: Assess the risk of linking records in the synthetic data back to specific individuals in the original real dataset. 
     - Distance to Closest Record: Measures how similar each synthetic record is to its nearest neighbor in the real data, indicating potential for identifying near-duplicates.  
     - Nearest Neighbor Distance Ratio: Compares the distance to the nearest neighbor within the synthetic data to the distance to the nearest neighbor in the real data for synthetic points, highlighting if synthetic points are too close to real ones.
   - Membership inference attack (MIA): Evaluates how well an adversary can determine if a particular record was part of the original training dataset used to generate the synthetic data. 
     - GAN-Leaks: Specifically assesses the leakage of information from the training data in synthetic data generated by Generative Adversarial Networks (GANs).
     - Monte Carlo membership inference attack: A specific type of membership inference attack that uses Monte Carlo simulation to estimate the probability of a record being in the training data. 
     - Logan: Assesses the risk of membership inference by training a model to distinguish between the first and second generations of synthetic data. 
     - TableGan: Evaluates the vulnerability to membership inference by training both a discriminator (to distinguish between real and synthetic data) and a classifier (likely to predict whether a record was part of the training set).
     - Detector: Measures the susceptibility to membership inference by training a model to classify between the first generation of synthetic data and real data that was not used to generate the synthetic data.
     - Collision: Measures the frequency of identical or very similar records appearing in the synthetic dataset, which could indicate a privacy risk if unique real records are being replicated.
* Metareport to compare several synthetic datasets with respect to the metrics
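
Several of the metrics listed above reduce to a few lines of NumPy. The following is a minimal sketch of the Hellinger distance, the Kullback-Leibler divergence, and the Distance to Closest Record. It is not Clover's own implementation: the function names, the `eps` smoothing, and the choice of Euclidean distance are assumptions made for illustration, and the library's API may differ.

```python
import numpy as np


def hellinger_distance(p, q):
    """Hellinger distance between two discrete distributions.

    Inputs are bin counts or probabilities; 0 means identical, 1 means disjoint.
    """
    p = np.asarray(p, dtype=float)
    q = np.asarray(q, dtype=float)
    p, q = p / p.sum(), q / q.sum()  # normalize counts to probabilities
    return float(np.sqrt(0.5 * np.sum((np.sqrt(p) - np.sqrt(q)) ** 2)))


def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) in nats; eps avoids log(0) on empty bins (illustrative choice)."""
    p = np.asarray(p, dtype=float) + eps
    q = np.asarray(q, dtype=float) + eps
    p, q = p / p.sum(), q / q.sum()
    return float(np.sum(p * np.log(p / q)))


def distance_to_closest_record(synthetic, real):
    """Euclidean distance from each synthetic row to its nearest real row."""
    synthetic = np.asarray(synthetic, dtype=float)
    real = np.asarray(real, dtype=float)
    # Pairwise differences, shape (n_synthetic, n_real, n_features).
    # Memory grows with n_synthetic * n_real, so this suits small tables only.
    diffs = synthetic[:, None, :] - real[None, :, :]
    return np.sqrt((diffs ** 2).sum(axis=2)).min(axis=1)


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    real = rng.normal(0.0, 1.0, size=(200, 3))       # stand-in for real data
    synthetic = rng.normal(0.1, 1.0, size=(200, 3))  # stand-in for synthetic data

    # Compare the first column through histograms on shared bins.
    bins = np.linspace(-4, 4, 21)
    p, _ = np.histogram(real[:, 0], bins=bins)
    q, _ = np.histogram(synthetic[:, 0], bins=bins)
    print("Hellinger:", hellinger_distance(p, q))
    print("KL:", kl_divergence(p, q))
    print("Smallest DCR:", distance_to_closest_record(synthetic, real).min())
```

Because Euclidean distance is scale-sensitive, continuous columns would typically be brought to a common scale before computing the Distance to Closest Record. A very small distance suggests a synthetic record that nearly duplicates a real one.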

## Usage

### Requirements
All the required packages are listed in the [requirements file](requirements.txt).
Clover has been tested on a Linux system running Python 3.8.10 and Python 3.10.
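
The supported interpreter range can be checked before installing. This is a minimal sketch, assuming the bounds `>=3.8,<3.11` declared in the package metadata; the helper name `python_is_supported` is hypothetical and not part of Clover.

```python
import sys

# Bounds assumed from the package metadata (Requires-Python: >=3.8,<3.11).
MIN_VERSION = (3, 8)
MAX_VERSION_EXCLUSIVE = (3, 11)


def python_is_supported(version_info=sys.version_info):
    """Return True if the (major, minor) version falls in the supported range."""
    return MIN_VERSION <= tuple(version_info[:2]) < MAX_VERSION_EXCLUSIVE


if __name__ == "__main__":
    if python_is_supported():
        print("This interpreter is within the supported range.")
    else:
        print("This interpreter is outside the supported range.")
```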

### Installation
The package is available on PyPI. You can install it in a conda environment with:
```bash
pip install -i https://test.pypi.org/simple/ clover-synth
```
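
After installation, one way to confirm that the distribution is visible to the current interpreter is to query its metadata. This is a small sketch using the standard library's `importlib.metadata` (available from Python 3.8); the helper name `installed_version` is hypothetical, and the distribution name `clover-synth` is taken from the package metadata.

```python
from importlib import metadata


def installed_version(distribution="clover-synth"):
    """Return the installed version string, or None if the distribution is absent."""
    try:
        return metadata.version(distribution)
    except metadata.PackageNotFoundError:
        return None


if __name__ == "__main__":
    version = installed_version()
    if version is None:
        print("clover-synth is not installed in this environment.")
    else:
        print(f"clover-synth {version} is installed.")
```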

## Quickstart
To get started, we created four notebooks that guide you through synthetic data generation,
the associated utility and privacy reports, and hyperparameter tuning:
* [Synthetic data generation](notebooks/synthetic_data_generation.ipynb)
* [Utility report](notebooks/utility_report.ipynb)
* [Privacy report](notebooks/privacy_report.ipynb)
* [Tune hyperparameters](notebooks/tune_hyperparameters.ipynb)

To get the averaged summary metrics for both utility and privacy at once, see the
[combined report](notebooks/combined_report.ipynb) notebook. To compare several synthetic datasets
with respect to a list of metrics, see the [metareport](notebooks/metareport.ipynb) notebook.

The notebooks are based on the
[Wisconsin Breast Cancer Dataset (WBCD)](https://archive.ics.uci.edu/ml/datasets/Breast+Cancer+Wisconsin+%28Original%29).

## Join Our Community
If you have a question or a feature request, or if you have encountered a problem, please open an issue on GitHub.

We also welcome any contribution to the project. 
The required packages for development can be found in the [dev-requirements file](dev-requirements.txt).
The documentation was generated with Sphinx.
