Metadata-Version: 2.4
Name: mostlyai
Version: 4.4.5
Summary: Synthetic Data SDK
Project-URL: homepage, https://app.mostly.ai/
Project-URL: repository, https://github.com/mostly-ai/mostlyai
Project-URL: documentation, https://mostly-ai.github.io/mostlyai/
Author-email: MOSTLY AI <dev@mostly.ai>
License-Expression: Apache-2.0
License-File: LICENSE
Classifier: Development Status :: 5 - Production/Stable
Classifier: Intended Audience :: Developers
Classifier: Intended Audience :: Financial and Insurance Industry
Classifier: Intended Audience :: Healthcare Industry
Classifier: Intended Audience :: Information Technology
Classifier: Intended Audience :: Science/Research
Classifier: Intended Audience :: Telecommunications Industry
Classifier: License :: OSI Approved :: Apache Software License
Classifier: Operating System :: OS Independent
Classifier: Programming Language :: Python :: 3.10
Classifier: Programming Language :: Python :: 3.11
Classifier: Programming Language :: Python :: 3.12
Classifier: Programming Language :: Python :: 3.13
Classifier: Topic :: Software Development :: Libraries
Classifier: Typing :: Typed
Requires-Python: <3.14,>=3.10
Requires-Dist: environs>=9.5.0
Requires-Dist: greenlet<4,>=3.1.1
Requires-Dist: gunicorn>=23.0.0
Requires-Dist: httpx>=0.25.0
Requires-Dist: ipywidgets>=8.1.0
Requires-Dist: pandas>=2.0.0
Requires-Dist: psutil>=5.9.5
Requires-Dist: pyarrow>=16.0.0
Requires-Dist: pycryptodomex<4,>=3.10.0
Requires-Dist: pydantic<3,>=2.4.2
Requires-Dist: requests>=2.31.0
Requires-Dist: rich>=13.7.0
Requires-Dist: schema>=0.7.5
Requires-Dist: semantic-version>=2.10.0
Requires-Dist: smart-open>=6.0.0
Requires-Dist: typer>=0.9.0
Requires-Dist: xxhash>=3.2.0
Provides-Extra: databricks
Requires-Dist: databricks-sql-connector<4,>=3.2.0; extra == 'databricks'
Provides-Extra: googlebigquery
Requires-Dist: sqlalchemy-bigquery<2,>=1.6.1; extra == 'googlebigquery'
Provides-Extra: hive
Requires-Dist: impyla<0.20,>=0.19.0; extra == 'hive'
Requires-Dist: kerberos<2,>=1.3.1; extra == 'hive'
Requires-Dist: pyhive[hive-pure-sasl]<0.8,>=0.7.0; extra == 'hive'
Provides-Extra: local
Requires-Dist: adlfs>=2023.4.0; extra == 'local'
Requires-Dist: azure-storage-blob>=12.16.0; extra == 'local'
Requires-Dist: cloudpathlib[azure,gs,s3]>=0.17.0; extra == 'local'
Requires-Dist: fastapi<0.116,>=0.115.6; extra == 'local'
Requires-Dist: filelock>=3.16.1; extra == 'local'
Requires-Dist: gcsfs>=2023.1.0; extra == 'local'
Requires-Dist: joblib>=1.2.0; extra == 'local'
Requires-Dist: mostlyai-engine==1.1.10; extra == 'local'
Requires-Dist: mostlyai-qa==1.5.11; extra == 'local'
Requires-Dist: networkx<4,>=3.0; extra == 'local'
Requires-Dist: openpyxl>=3.1.5; extra == 'local'
Requires-Dist: python-multipart>=0.0.20; extra == 'local'
Requires-Dist: s3fs>=2023.1.0; extra == 'local'
Requires-Dist: smart-open[azure,gcs,s3]>=6.3.0; extra == 'local'
Requires-Dist: sqlalchemy>=2.0.0; extra == 'local'
Requires-Dist: sshtunnel<0.5,>=0.4.0; extra == 'local'
Requires-Dist: torch<2.7.0,>=2.6.0; (sys_platform != 'linux') and extra == 'local'
Requires-Dist: torch==2.6.0; (sys_platform == 'linux') and extra == 'local'
Requires-Dist: torchaudio==2.6.0; (sys_platform == 'linux') and extra == 'local'
Requires-Dist: torchvision==0.21.0; (sys_platform == 'linux') and extra == 'local'
Requires-Dist: uvicorn<0.35,>=0.34.0; extra == 'local'
Requires-Dist: xlsxwriter<4,>=3.1.9; extra == 'local'
Provides-Extra: local-cpu
Requires-Dist: adlfs>=2023.4.0; extra == 'local-cpu'
Requires-Dist: azure-storage-blob>=12.16.0; extra == 'local-cpu'
Requires-Dist: cloudpathlib[azure,gs,s3]>=0.17.0; extra == 'local-cpu'
Requires-Dist: fastapi<0.116,>=0.115.6; extra == 'local-cpu'
Requires-Dist: filelock>=3.16.1; extra == 'local-cpu'
Requires-Dist: gcsfs>=2023.1.0; extra == 'local-cpu'
Requires-Dist: joblib>=1.2.0; extra == 'local-cpu'
Requires-Dist: mostlyai-engine[cpu]==1.1.10; extra == 'local-cpu'
Requires-Dist: mostlyai-qa[cpu]==1.5.11; extra == 'local-cpu'
Requires-Dist: networkx<4,>=3.0; extra == 'local-cpu'
Requires-Dist: openpyxl>=3.1.5; extra == 'local-cpu'
Requires-Dist: python-multipart>=0.0.20; extra == 'local-cpu'
Requires-Dist: s3fs>=2023.1.0; extra == 'local-cpu'
Requires-Dist: smart-open[azure,gcs,s3]>=6.3.0; extra == 'local-cpu'
Requires-Dist: sqlalchemy>=2.0.0; extra == 'local-cpu'
Requires-Dist: sshtunnel<0.5,>=0.4.0; extra == 'local-cpu'
Requires-Dist: torch<2.7.0,>=2.6.0; (sys_platform != 'linux') and extra == 'local-cpu'
Requires-Dist: torch==2.6.0+cpu; (sys_platform == 'linux') and extra == 'local-cpu'
Requires-Dist: torchaudio==2.6.0+cpu; (sys_platform == 'linux') and extra == 'local-cpu'
Requires-Dist: torchvision==0.21.0+cpu; (sys_platform == 'linux') and extra == 'local-cpu'
Requires-Dist: uvicorn<0.35,>=0.34.0; extra == 'local-cpu'
Requires-Dist: xlsxwriter<4,>=3.1.9; extra == 'local-cpu'
Provides-Extra: local-gpu
Requires-Dist: adlfs>=2023.4.0; extra == 'local-gpu'
Requires-Dist: azure-storage-blob>=12.16.0; extra == 'local-gpu'
Requires-Dist: cloudpathlib[azure,gs,s3]>=0.17.0; extra == 'local-gpu'
Requires-Dist: fastapi<0.116,>=0.115.6; extra == 'local-gpu'
Requires-Dist: filelock>=3.16.1; extra == 'local-gpu'
Requires-Dist: gcsfs>=2023.1.0; extra == 'local-gpu'
Requires-Dist: joblib>=1.2.0; extra == 'local-gpu'
Requires-Dist: mostlyai-engine[gpu]==1.1.10; extra == 'local-gpu'
Requires-Dist: mostlyai-qa[gpu]==1.5.11; extra == 'local-gpu'
Requires-Dist: networkx<4,>=3.0; extra == 'local-gpu'
Requires-Dist: openpyxl>=3.1.5; extra == 'local-gpu'
Requires-Dist: python-multipart>=0.0.20; extra == 'local-gpu'
Requires-Dist: s3fs>=2023.1.0; extra == 'local-gpu'
Requires-Dist: smart-open[azure,gcs,s3]>=6.3.0; extra == 'local-gpu'
Requires-Dist: sqlalchemy>=2.0.0; extra == 'local-gpu'
Requires-Dist: sshtunnel<0.5,>=0.4.0; extra == 'local-gpu'
Requires-Dist: torch<2.7.0,>=2.6.0; (sys_platform != 'linux') and extra == 'local-gpu'
Requires-Dist: torch==2.6.0; (sys_platform == 'linux') and extra == 'local-gpu'
Requires-Dist: torchaudio==2.6.0; (sys_platform == 'linux') and extra == 'local-gpu'
Requires-Dist: torchvision==0.21.0; (sys_platform == 'linux') and extra == 'local-gpu'
Requires-Dist: uvicorn<0.35,>=0.34.0; extra == 'local-gpu'
Requires-Dist: xlsxwriter<4,>=3.1.9; extra == 'local-gpu'
Provides-Extra: mssql
Requires-Dist: pyodbc<6,>=5.1.0; extra == 'mssql'
Provides-Extra: mysql
Requires-Dist: mysql-connector-python<10,>=9.1.0; extra == 'mysql'
Provides-Extra: oracle
Requires-Dist: oracledb<3,>=2.2.1; extra == 'oracle'
Provides-Extra: postgres
Requires-Dist: psycopg2<3,>=2.9.4; extra == 'postgres'
Provides-Extra: snowflake
Requires-Dist: snowflake-sqlalchemy<2,>=1.6.1; extra == 'snowflake'
Description-Content-Type: text/markdown

# Synthetic Data SDK ✨

[![GitHub Release](https://img.shields.io/github/v/release/mostly-ai/mostlyai)](https://github.com/mostly-ai/mostlyai/releases)
[![Documentation](https://img.shields.io/badge/docs-latest-green)](https://mostly-ai.github.io/mostlyai/)
[![PyPI Downloads](https://static.pepy.tech/badge/mostlyai)](https://pepy.tech/projects/mostlyai)
[![License](https://img.shields.io/github/license/mostly-ai/mostlyai)](https://github.com/mostly-ai/mostlyai/blob/main/LICENSE)
[![PyPI - Python Version](https://img.shields.io/pypi/pyversions/mostlyai)](https://pypi.org/project/mostlyai/)
[![GitHub stars](https://img.shields.io/github/stars/mostly-ai/mostlyai?style=social)](https://github.com/mostly-ai/mostlyai/stargazers)

[Documentation](https://mostly-ai.github.io/mostlyai/) | [Usage Examples](https://mostly-ai.github.io/mostlyai/usage/) | [Free Cloud Service](https://app.mostly.ai/)

The **Synthetic Data SDK** is a Python toolkit for high-fidelity, privacy-safe **Synthetic Data**.

- **LOCAL** mode trains and generates synthetic data locally on your own compute resources.
- **CLIENT** mode connects to a remote MOSTLY AI platform for training & generating synthetic data there.
- Generators trained locally can easily be imported into a platform for further sharing.

## Overview

The SDK allows you to programmatically create, browse, and manage three key resources:

1. **Generators** - Train a synthetic data generator on your existing tabular or language data assets
2. **Synthetic Datasets** - Use a generator to create any number of synthetic samples to suit your needs
3. **Connectors** - Connect to any data source within your organization, for reading and writing data

| Intent                                        | Primitive                         | API Reference                                                                                                 |
|-----------------------------------------------|-----------------------------------|---------------------------------------------------------------------------------------------------------------|
| Train a Generator on tabular or language data | `g = mostly.train(config)`        | [mostly.train](https://mostly-ai.github.io/mostlyai/api_client/#mostlyai.sdk.client.api.MostlyAI.train)       |
| Generate any number of synthetic data records | `sd = mostly.generate(g, config)` | [mostly.generate](https://mostly-ai.github.io/mostlyai/api_client/#mostlyai.sdk.client.api.MostlyAI.generate) |
| Live probe the generator on demand            | `df = mostly.probe(g, config)`    | [mostly.probe](https://mostly-ai.github.io/mostlyai/api_client/#mostlyai.sdk.client.api.MostlyAI.probe)       |
| Connect to any data source within your org    | `c = mostly.connect(config)`      | [mostly.connect](https://mostly-ai.github.io/mostlyai/api_client/#mostlyai.sdk.client.api.MostlyAI.connect)   |

<https://github.com/user-attachments/assets/d1613636-06e4-4147-bef7-25bb4699e8fc>

## Key Features

- **Broad Data Support**
  - Mixed-type data (categorical, numerical, geospatial, text, etc.)
  - Single-table, multi-table, and time-series
- **Multiple Model Types**
  - TabularARGN for SOTA tabular performance
  - Fine-tune HuggingFace-based language models
  - Efficient LSTM for text synthesis from scratch
- **Advanced Training Options**
  - GPU/CPU support
  - Differential Privacy
  - Progress Monitoring
- **Automated Quality Assurance**
  - Quality metrics for fidelity and privacy
  - In-depth HTML reports for visual analysis
- **Flexible Sampling**
  - Up-sample to any data volume
  - Conditional generation by any columns
  - Re-balance underrepresented segments
  - Context-aware data imputation
  - Statistical fairness controls
  - Rule-adherence via temperature
- **Seamless Integration**
  - Connect to external data sources (DBs, cloud storage)
  - Fully permissive open-source license

## Quick Start <a href="https://colab.research.google.com/github/mostly-ai/mostlyai/blob/main/docs/tutorials/getting-started/getting-started.ipynb" target="_blank"><img src="https://img.shields.io/badge/Open%20in-Colab-blue?logo=google-colab" alt="Run on Colab"></a>

Install the SDK via pip:

```shell
pip install mostlyai
```

Train your first generator:

```python
import pandas as pd
from mostlyai.sdk import MostlyAI

# load original data
repo_url = "https://github.com/mostly-ai/public-demo-data/raw/refs/heads/dev"
df_original = pd.read_csv(f"{repo_url}/census/census.csv.gz")
df_original = df_original.sample(n=10_000)  # sub-sample to speed up demo

# initialize the SDK
mostly = MostlyAI()

# train a synthetic data generator, with default configs
g = mostly.train(name="Quick Start Demo", data=df_original)

# display the quality assurance report
g.reports(display=True)
```
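
The call above relies on default configs. The overview table also lists a config-based form, `mostly.train(config)`. The helper below is a minimal sketch of assembling such a config as a plain dict. The helper name `build_train_config` is hypothetical, and the keys `name`, `tables`, and `data` are assumptions about the config schema that should be checked against the API reference.

```python
from typing import Any, Dict, Mapping


def build_train_config(name: str, tables: Mapping[str, Any]) -> Dict[str, Any]:
    """Assemble a generator config from {table_name: data} pairs.

    The key names used here ("name", "tables", "data") are assumptions
    about the config schema, not confirmed by this README.
    """
    if not tables:
        raise ValueError("at least one table is required")
    return {
        "name": name,
        # one entry per table, in the order the tables were given
        "tables": [
            {"name": table_name, "data": data}
            for table_name, data in tables.items()
        ],
    }
```

Under those assumptions, a call might look like `g = mostly.train(config=build_train_config("Quick Start Demo", {"census": df_original}))`.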

Once the generator has been trained, generate synthetic data samples, either via probing:

```python
# probe for some representative synthetic samples
df_samples = mostly.probe(g, size=100)
df_samples
```

or by creating a synthetic dataset entity for larger data volumes:

```python
# generate a large representative synthetic dataset
sd = mostly.generate(g, size=100_000)
df_synthetic = sd.data()
df_synthetic
```

or by conditionally probing / generating synthetic data:

```python
# create 100 seed records of 24-year-olds with Mexico as native country
df_seed = pd.DataFrame({
    'age': [24] * 100,
    'native_country': ['Mexico'] * 100,
})
# conditionally probe, based on provided seed
df_samples = mostly.probe(g, seed=df_seed)
df_samples
```
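
Assuming a seed may hold different values per row, the rows can be built in plain Python before they are turned into a DataFrame. The sketch below creates a fixed number of rows for every combination of the given column values. `seed_records` is a hypothetical helper and not part of the SDK.

```python
from itertools import product
from typing import Any, Dict, List, Sequence


def seed_records(repeats: int, **column_values: Sequence[Any]) -> List[Dict[str, Any]]:
    """Return `repeats` rows for every combination of the given column values."""
    columns = list(column_values)
    combos = product(*(column_values[c] for c in columns))
    # outer loop: value combinations, inner loop: repetitions of each one
    return [dict(zip(columns, combo)) for combo in combos for _ in range(repeats)]


# 50 rows with age 24 followed by 50 rows with age 65, all with Mexico
records = seed_records(50, age=[24, 65], native_country=["Mexico"])
```

Passing the result through `pd.DataFrame(records)` should give a frame that can be used as `seed=` in the same way as `df_seed` above.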

## Installation

Use `pip` (or, better, `uv pip`) to install the official `mostlyai` package from PyPI. Python 3.10 through 3.13 is required.

It is highly recommended to install the package within a dedicated virtual environment, such as **venv**, **uv**, or **conda**. For example:

```shell
conda create -n mostlyai python=3.12
conda activate mostlyai
```
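
The package metadata declares `Requires-Python: <3.14,>=3.10`. As a small sketch, the check below tests an interpreter version against that range before installing. The function name `is_supported` is hypothetical.

```python
import sys
from typing import Optional, Sequence

MIN_VERSION = (3, 10)  # inclusive lower bound, from Requires-Python >=3.10
MAX_VERSION = (3, 14)  # exclusive upper bound, from Requires-Python <3.14


def is_supported(version: Optional[Sequence[int]] = None) -> bool:
    """Return True if the (major, minor) version lies in the supported range."""
    if version is None:
        version = sys.version_info
    major_minor = tuple(version[:2])
    return MIN_VERSION <= major_minor < MAX_VERSION
```

Calling `is_supported()` without arguments checks the running interpreter.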

### CLIENT mode

This is a lightweight installation for using the SDK in CLIENT mode only. It communicates with a MOSTLY AI platform to perform the requested tasks. See, for example, [app.mostly.ai](https://app.mostly.ai/) for a free-to-use hosted version.

```shell
pip install -U mostlyai
```

### CLIENT + LOCAL mode

This is a full installation for using the SDK in both CLIENT and LOCAL mode. It includes all dependencies, including PyTorch, for training and generating synthetic data locally.

```shell
# for CPU on macOS
pip install -U 'mostlyai[local]'
# for CPU on Linux
pip install -U 'mostlyai[local-cpu]' --extra-index-url https://download.pytorch.org/whl/cpu
# for GPU on Linux
pip install -U 'mostlyai[local-gpu]'
```
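
With a full installation, the mode is chosen when the SDK is initialized. The sketch below shows one way to keep that choice in a single place. It assumes the constructor accepts `local`, `base_url`, and `api_key` keyword arguments, which should be verified against the API reference. The helper names and the environment variable names read here are hypothetical choices for this sketch.

```python
import os
from typing import Any, Dict, Mapping, Optional


def client_kwargs(local: bool, env: Optional[Mapping[str, str]] = None) -> Dict[str, Any]:
    """Build constructor keyword arguments (assumed parameter names)."""
    env = os.environ if env is None else env
    if local:
        # LOCAL mode: train and generate on this machine
        return {"local": True}
    if "MOSTLY_API_KEY" not in env:
        raise KeyError("MOSTLY_API_KEY must be set for CLIENT mode")
    # CLIENT mode: talk to a remote platform, credentials come from the environment
    return {
        "base_url": env.get("MOSTLY_BASE_URL", "https://app.mostly.ai"),
        "api_key": env["MOSTLY_API_KEY"],
    }


def connect(local: bool):
    # imported lazily so this sketch loads even without the SDK installed
    from mostlyai.sdk import MostlyAI

    return MostlyAI(**client_kwargs(local))
```

`connect(local=True)` would then require one of the local extras, while `connect(local=False)` only needs the lightweight install.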

> **Note for Google Colab users**: Installing any of the local extras (`mostlyai[local]`, `mostlyai[local-cpu]`, or `mostlyai[local-gpu]`) will downgrade PyTorch from 2.6.0 to 2.5.1. You'll need to restart the runtime after installation for the changes to take effect.

Add any of the following extras for additional data connector support in LOCAL mode: `databricks`, `googlebigquery`, `hive`, `mssql`, `mysql`, `oracle`, `postgres`, `snowflake`. For example:

```shell
pip install -U 'mostlyai[local, databricks, snowflake]'
```
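
The choice among the local extras depends on the platform and on whether a GPU is used. The sketch below encodes the commands listed above as a small function. `install_command` is a hypothetical helper, and treating every non-Linux platform like macOS (the plain `local` extra) is an assumption of this sketch.

```python
from typing import List, Sequence

CPU_INDEX_URL = "https://download.pytorch.org/whl/cpu"


def install_command(system: str, gpu: bool = False, connectors: Sequence[str] = ()) -> List[str]:
    """Return the pip command for a CLIENT + LOCAL install.

    `system` is a value such as `sys.platform` ("linux", "darwin", ...).
    """
    if system == "linux":
        base = "local-gpu" if gpu else "local-cpu"
    else:
        # assumption: non-Linux platforms use the plain `local` extra
        base = "local"
    extras = ",".join([base, *connectors])
    command = ["pip", "install", "-U", f"mostlyai[{extras}]"]
    if base == "local-cpu":
        # CPU-only PyTorch wheels on Linux come from the PyTorch index
        command += ["--extra-index-url", CPU_INDEX_URL]
    return command
```

The returned list can be passed to `subprocess.run`, or joined with spaces and printed for copy and paste.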

## Citation

Please consider citing our project if you find it useful:

```bibtex
@software{mostlyai,
    author = {{MOSTLY AI}},
    title = {{MOSTLY AI SDK}},
    url = {https://github.com/mostly-ai/mostlyai},
    year = {2025}
}
```
