Metadata-Version: 2.4
Name: medsynth
Version: 1.0.0
Summary: Medical Synthetic Data Generator with Privacy-Preserving Synthesis
Author-email: Ankur Lohachab <ankur.lohachab@maastrichtuniversity.nl>
License: MIT
Project-URL: Homepage, https://github.com/ankurlohachab/medsynth
Project-URL: Documentation, https://github.com/ankurlohachab/medsynth#readme
Project-URL: Repository, https://github.com/ankurlohachab/medsynth
Project-URL: Issues, https://github.com/ankurlohachab/medsynth/issues
Keywords: medical-imaging,synthetic-data,privacy-synthesis,privacy-preserving,ct-scan,dicom,nrrd,omop,healthcare-ai,ms-sts
Classifier: Development Status :: 4 - Beta
Classifier: Intended Audience :: Healthcare Industry
Classifier: Intended Audience :: Science/Research
Classifier: License :: OSI Approved :: MIT License
Classifier: Programming Language :: Python :: 3
Classifier: Programming Language :: Python :: 3.8
Classifier: Programming Language :: Python :: 3.9
Classifier: Programming Language :: Python :: 3.10
Classifier: Programming Language :: Python :: 3.11
Classifier: Topic :: Scientific/Engineering :: Medical Science Apps.
Classifier: Topic :: Scientific/Engineering :: Artificial Intelligence
Classifier: Topic :: Scientific/Engineering :: Image Processing
Requires-Python: >=3.8
Description-Content-Type: text/markdown
License-File: LICENSE
Requires-Dist: numpy>=1.21.0
Requires-Dist: scipy>=1.7.0
Requires-Dist: SimpleITK>=2.1.0
Requires-Dist: pydicom>=2.3.0
Requires-Dist: scikit-image>=0.19.0
Requires-Dist: pandas>=1.3.0
Requires-Dist: pydantic>=2.0.0
Requires-Dist: pynrrd>=1.0.0
Provides-Extra: dev
Requires-Dist: pytest>=7.0.0; extra == "dev"
Requires-Dist: pytest-cov>=3.0.0; extra == "dev"
Requires-Dist: black>=22.0.0; extra == "dev"
Requires-Dist: flake8>=4.0.0; extra == "dev"
Requires-Dist: mypy>=0.950; extra == "dev"
Provides-Extra: docs
Requires-Dist: sphinx>=4.5.0; extra == "docs"
Requires-Dist: sphinx-rtd-theme>=1.0.0; extra == "docs"
Dynamic: license-file

# MedSynth: Medical Synthetic Data Generator with Privacy-Preserving Synthesis

[![Python 3.8+](https://img.shields.io/badge/python-3.8+-blue.svg)](https://www.python.org/downloads/)
[![License: MIT](https://img.shields.io/badge/License-MIT-yellow.svg)](https://opensource.org/licenses/MIT)
[![Code style: black](https://img.shields.io/badge/code%20style-black-000000.svg)](https://github.com/psf/black)

**MedSynth** is a Python package for generating synthetic medical images with configurable privacy protection. It creates realistic CT scans using statistical methods without requiring machine learning or GPU resources.

## Key Features

✨ **Three Generation Modes:**
1. **Pure Synthetic** - Generate CT scans from scratch using procedural methods
2. **Augmentation** - Standard augmentation of real CTs (rotation, scaling, noise)
3. **Privacy-Preserving Synthesis** - Template-based synthesis with statistical privacy protection using Multi-Scale Statistical Texture Synthesis (MS-STS)

🔒 **Privacy Protection (Mode 3 only):**
- Low mutual information with the source scan (target < 1.8 bits; 1.08 bits measured on the NSCLC-Radiomics test case)
- Synthetic intensity remapping via gradient-preserving transformations
- Designed to reduce re-identification risk

📊 **Multiple Output Formats:**
- DICOM (medical imaging standard)
- NRRD (3D visualization)
- OMOP CDM (healthcare data standard)

⚡ **Computational Efficiency:**
- No GPU or deep learning required
- Runs on standard workstations
- Approximately 67 seconds per 178-slice CT volume (hardware-dependent)

---

## Performance Benchmarks

Evaluated on a single test case from the NSCLC-Radiomics dataset (November 26, 2025):

### Privacy-Preserving Synthesis Mode (Optimized Parameters)

| Metric | Value | Interpretation |
|--------|-------|----------------|
| **SSIM (body region)** | 0.7880 | Moderate structural similarity |
| **SSIM (lung region)** | 0.9527 | High structural similarity |
| **Mutual Information** | 1.08 bits | Low information leakage |
| **Generation Speed** | 67 sec/volume | Single-threaded on M-series Mac |
| **Voxel Remapping** | 100% | All intensities transformed |

**Optimized Parameters** (15-iteration empirical search):
- Frequency domain cutoff: 0.5456
- Point spread function blur sigma: 0.40
- Texture noise standard deviation: 5.5449
- Edge enhancement strength: 0.25

**Note:** Performance may vary across datasets, scanners, and protocols. These benchmarks come from optimization on a single case and should be validated on your own data.

---

## Installation

### From PyPI (when published)
```bash
pip install medsynth
```

### From Source
```bash
git clone https://github.com/ankurlohachab/medsynth.git
cd medsynth
pip install -e .
```

### Development Installation
```bash
pip install -e ".[dev]"
pytest tests/
```

---

## Quick Start

### Command Line Interface

#### 1. Pure Synthetic Generation
```bash
medsynth \
  --num-subjects 10 \
  --output-dir ./output/pure_synthetic/
```

#### 2. Augmentation Mode
```bash
medsynth \
  --augment ./path/to/real/ct/dicom_folder/ \
  --num-subjects 5 \
  --output-dir ./output/augmented/
```

**⚠️ Note:** Augmentation mode does NOT provide privacy protection - use for data augmentation only.

#### 3. Privacy-Preserving Synthesis Mode
```bash
medsynth \
  --privacy-synthesis \
  --augment ./path/to/real/ct/dicom_folder/ \
  --num-subjects 10 \
  --output-dir ./output/privacy_synthesis/
```

### Python API

```python
from medsynth.config import Config
from medsynth.pipeline import SyntheticCTPipeline

# Configure privacy-synthesis
config = Config(
    num_subjects=10,
    privacy_synth_mode=True,  # Privacy-synthesis using MS-STS
    augmentation_input="./path/to/real/ct",
    output_root="./output/privacy_synthesis"
)

# Generate dataset
pipeline = SyntheticCTPipeline(config)
pipeline.generate_dataset()
```

---

## Generation Modes Comparison

### 1. Pure Synthetic
**Purpose:** Generate CT scans from procedural noise without real data input.

**Characteristics:**
- No real patient data required
- Full control over anatomical features and pathology
- Quality suitable for algorithm development and testing
- No privacy concerns (no patient data involved)

**Limitations:**
- May lack some real-world anatomical variations
- Texture patterns are synthetic

---

### 2. Augmentation
**Purpose:** Standard data augmentation for machine learning training.

**Characteristics:**
- Applies geometric transformations (rotation, scaling)
- Adds controlled noise
- High fidelity to original (SSIM typically > 0.95)
- Fast processing (~10 seconds per volume)

**⚠️ Privacy Warning:**
- Does NOT provide privacy protection
- Retains original voxel intensity patterns
- Should only be used when privacy is not a concern
- Not suitable for data sharing under HIPAA/GDPR

---

### 3. Privacy-Preserving Synthesis (Recommended for Sharing)
**Purpose:** Generate privacy-protected synthetic versions of real CT scans.

**Method:** Multi-Scale Statistical Texture Synthesis (MS-STS)

**Process:**
1. Separates low-frequency anatomy from high-frequency texture
2. Applies gradient-preserving intensity remapping to all tissue types
3. Replaces texture with synthetic scanner-realistic noise
4. Preserves anatomical structure while transforming intensities
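
The steps above can be sketched with NumPy alone. This is a minimal illustration under stated assumptions, not MedSynth's actual implementation: the function name `ms_sts_sketch`, the Gaussian low-pass filter, the gamma-style monotonic remapping, and the reading of the cutoff as a fraction of the Nyquist frequency are all hypothetical choices made for the example.

```python
import numpy as np


def ms_sts_sketch(volume, freq_cutoff=0.5, noise_std=5.0, gamma=0.9, seed=0):
    """Illustrative only: split, remap, and re-texture a CT volume (HU values)."""
    rng = np.random.default_rng(seed)
    volume = np.asarray(volume, dtype=np.float64)

    # Step 1: isolate low-frequency anatomy with a Gaussian low-pass filter
    # applied in the Fourier domain (assumed form; cutoff relative to Nyquist).
    freqs = np.meshgrid(*[np.fft.fftfreq(n) for n in volume.shape], indexing="ij")
    radius = np.sqrt(sum(f ** 2 for f in freqs)) / 0.5  # 1.0 == Nyquist on an axis
    lowpass = np.exp(-0.5 * (radius / freq_cutoff) ** 2)
    anatomy = np.fft.ifftn(np.fft.fftn(volume) * lowpass).real
    # The residual (volume - anatomy) is the original texture; it is discarded.

    # Step 2: monotonic intensity remapping (assumed gamma curve). A strictly
    # increasing map keeps the ordering of intensities, and so edge directions.
    lo, hi = anatomy.min(), anatomy.max()
    span = hi - lo if hi > lo else 1.0
    remapped = lo + span * ((anatomy - lo) / span) ** gamma

    # Step 3: replace the discarded texture with synthetic Gaussian noise.
    return remapped + rng.normal(0.0, noise_std, size=volume.shape)


# Toy phantom: air background with a soft-tissue block.
phantom = np.full((8, 32, 32), -1000.0)
phantom[:, 8:24, 8:24] = 40.0
synthetic = ms_sts_sketch(phantom)
```

In this toy case the block stays brighter than the background (structure is kept) while every voxel value differs from the input. Real scanner noise is spatially correlated, so white Gaussian noise is a simplification.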

**Measured Performance (NSCLC-Radiomics test case):**
- Body region SSIM: 0.7880 (moderate similarity)
- Lung region SSIM: 0.9527 (high similarity in diagnostic regions)
- Mutual Information: 1.08 bits (below the 1.8-bit target)

**Privacy Considerations:**
- Aims to reduce mutual information below 1.8 bits
- All voxel intensities undergo statistical transformation
- Preserves anatomical topology
- Trade-off between privacy protection and diagnostic utility
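
For reference, mutual information between an original and a synthetic volume can be estimated from a joint intensity histogram. The sketch below shows one common estimator, assuming 64 bins and base-2 logarithms. The function name `mutual_information_bits` and the bin count are illustrative assumptions; MedSynth's own metric may be computed differently, so absolute values are not directly comparable to the figures quoted here.

```python
import numpy as np


def mutual_information_bits(a, b, bins=64):
    """Histogram-based estimate of I(A; B) in bits for two same-shaped arrays."""
    joint, _, _ = np.histogram2d(np.ravel(a), np.ravel(b), bins=bins)
    p_ab = joint / joint.sum()              # joint probability table
    p_a = p_ab.sum(axis=1, keepdims=True)   # marginal of a, shape (bins, 1)
    p_b = p_ab.sum(axis=0, keepdims=True)   # marginal of b, shape (1, bins)
    nonzero = p_ab > 0                      # skip empty cells to avoid log(0)
    ratio = p_ab[nonzero] / (p_a @ p_b)[nonzero]
    return float(np.sum(p_ab[nonzero] * np.log2(ratio)))


rng = np.random.default_rng(0)
original = rng.normal(0.0, 100.0, size=(16, 64, 64))
independent = rng.normal(0.0, 100.0, size=original.shape)

mi_self = mutual_information_bits(original, original)       # high: identical data
mi_indep = mutual_information_bits(original, independent)   # close to zero
```

The estimate depends on the bin count and on the number of voxels, so comparisons are only meaningful when both are held fixed.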

**Limitations:**
- Privacy protection is empirical, not cryptographic
- May not prevent all re-identification attacks
- Should be combined with other de-identification methods
- Requires validation for your specific use case

---

## Output Formats

### DICOM
Standard medical imaging format compatible with PACS systems.

```bash
medsynth --generate-dicom --output-dir ./output/
```

### NRRD
3D volumetric format for visualization tools (3D Slicer, ITK-SNAP).

```bash
medsynth --generate-nrrd --output-dir ./output/
```

### OMOP CDM
Healthcare data standard for multi-institutional studies.

```bash
medsynth --generate-omop --output-dir ./output/
```

---

## Configuration

### Custom Parameters
```python
from medsynth.config import Config, VolumeConfig

config = Config(
    num_subjects=50,
    random_seed=42,
    privacy_synth_mode=True,
    augmentation_input="./input/ct/",
    volume=VolumeConfig(
        volume_shape=(178, 512, 512),
        spacing=(5.0, 0.976, 0.976),
        hu_range=(-1024, 3071),
        # MS-STS parameters (optimized values)
        privacy_synth_freq_cutoff=0.5456,
        privacy_synth_psf_blur_sigma=0.40,
        privacy_synth_texture_noise_std=5.5449,
        privacy_synth_edge_enhancement_strength=0.25,
    )
)
```
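
As a sanity check on custom geometry, the physical extent of a volume is the voxel count times the spacing along each axis. The snippet below works this out for the values above. It assumes that `volume_shape` and `spacing` share the same (slice, row, column) axis order and that spacing is given in millimetres; the source states neither, so treat both as assumptions.

```python
volume_shape = (178, 512, 512)   # voxels per axis (assumed slice, row, column)
spacing = (5.0, 0.976, 0.976)    # assumed millimetres per voxel, same axis order

# Physical extent per axis = number of voxels * voxel spacing.
extent_mm = tuple(n * s for n, s in zip(volume_shape, spacing))
voxel_count = volume_shape[0] * volume_shape[1] * volume_shape[2]

print(extent_mm)     # roughly (890.0, 499.7, 499.7)
print(voxel_count)   # 46661632
```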

---

## Quality Metrics

```python
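from medsynth.metrics import evaluate_synthetic_ct

# Assumes real_ct_volume, synthetic_ct_volume, body_mask, and lung_mask
# have already been loaded as same-shaped 3D arrays.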
results = evaluate_synthetic_ct(
    original=real_ct_volume,
    synthetic=synthetic_ct_volume,
    body_mask=body_mask,
    lung_mask=lung_mask,
    spacing=(5.0, 0.976, 0.976)
)

print(f"SSIM (body): {results['image_quality']['ssim_body']:.4f}")
print(f"PSNR: {results['image_quality']['psnr_body']:.2f} dB")
print(f"Mutual Information: {results['privacy']['mutual_information_body']:.3f} bits")
```

**Available Metrics:**
- **Image Quality:** SSIM, PSNR, MS-SSIM, NMSE
- **Clinical Utility:** SNR, CNR, Edge Sharpness
- **Privacy Analysis:** Mutual Information, Histogram Distance, Texture Analysis
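
To make the image-quality numbers concrete, here is a minimal sketch of PSNR restricted to a mask, assuming the standard definition PSNR = 10 log10(R^2 / MSE), where R is the intensity range. The function name `masked_psnr` and the default range of 4095 (the width of the -1024 to 3071 `hu_range` shown under Configuration) are assumptions for illustration; `evaluate_synthetic_ct` may use a different convention.

```python
import numpy as np


def masked_psnr(original, synthetic, mask, data_range=4095.0):
    """PSNR in dB computed only over the voxels where mask is True."""
    diff = original[mask].astype(np.float64) - synthetic[mask].astype(np.float64)
    mse = np.mean(diff ** 2)
    if mse == 0:
        return float("inf")   # volumes are identical inside the mask
    return float(10.0 * np.log10(data_range ** 2 / mse))


# Toy data: a random volume plus Gaussian noise with a standard deviation of 10.
rng = np.random.default_rng(0)
original = rng.uniform(-1024, 3071, size=(4, 32, 32))
synthetic = original + rng.normal(0.0, 10.0, size=original.shape)
body_mask = np.zeros(original.shape, dtype=bool)
body_mask[:, 8:24, 8:24] = True

psnr_body = masked_psnr(original, synthetic, body_mask)
```

With a noise standard deviation of 10, the MSE is close to 100, so the result lands near 20 log10(4095 / 10), or about 52 dB.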

---

## Examples

See `examples/` directory:

- `example_pure_synthetic.py` - Generate from scratch
- `example_augmentation.py` - Augment existing CTs
- `example_privacy_synthesis.py` - Privacy-preserving synthesis

---

## Testing

```bash
pytest tests/ -v
pytest tests/ --cov=medsynth --cov-report=html
```

---

## Citations & Data Usage

### Dataset Citation (Required)

This package was developed and tested using the NSCLC-Radiomics dataset. Users of this data must abide by the TCIA Data Usage Policy and include the following citation:

**Aerts, H. J. W. L., Wee, L., Rios Velazquez, E., Leijenaar, R. T. H., Parmar, C., Grossmann, P., Carvalho, S., Bussink, J., Monshouwer, R., Haibe-Kains, B., Rietveld, D., Hoebers, F., Rietbergen, M. M., Leemans, C. R., Dekker, A., Quackenbush, J., Gillies, R. J., Lambin, P. (2014). Data From NSCLC-Radiomics (version 4) [Data set]. The Cancer Imaging Archive. https://doi.org/10.7937/K9/TCIA.2015.PF0M9REI**

### Software Citation

If you use MedSynth in your research, please cite:

```bibtex
@software{medsynth2025,
  title = {MedSynth: Medical Synthetic Data Generator with Privacy-Preserving Synthesis},
  author = {Lohachab, Ankur},
  year = {2025},
  month = {11},
  version = {1.0.0},
  url = {https://github.com/ankurlohachab/medsynth}
}
```

---

## Privacy & Security Disclaimer

**Important:** This software provides empirical privacy protection through statistical methods, NOT cryptographic guarantees.

- Privacy-preserving synthesis aims to reduce mutual information and re-identification risk
- Protection level depends on data characteristics, parameters, and adversary capabilities
- Should be combined with other de-identification methods (e.g., metadata removal, expert review)
- Not a substitute for proper de-identification workflows
- Users are responsible for compliance with applicable regulations (HIPAA, GDPR, etc.)
- Validation recommended for each specific use case and dataset

**Not approved for clinical use.** For research purposes only.

---

## Requirements

- Python ≥ 3.8
- NumPy ≥ 1.21
- SciPy ≥ 1.7
- SimpleITK ≥ 2.1
- PyDICOM ≥ 2.3
- scikit-image ≥ 0.19
- Pandas ≥ 1.3
- Pydantic ≥ 2.0
- pynrrd ≥ 1.0

---

## Roadmap

- [ ] Web interface
- [ ] Cloud deployment

---

## Contributing

Contributions welcome! Please:

1. Fork the repository
2. Create a feature branch
3. Add tests for new functionality
4. Ensure all tests pass
5. Submit a pull request

---

## Development Transparency

**AI-Assisted Development Disclosure:** AI-assisted development tools were used exclusively for routine tasks such as syntactic error correction, formatting, and generating descriptive comments. 

---

## License

This project is licensed under the MIT License - see the [LICENSE](LICENSE) file for details.

---

## Support

- **Issues:** [GitHub Issues](https://github.com/ankurlohachab/medsynth/issues)
- **Discussions:** [GitHub Discussions](https://github.com/ankurlohachab/medsynth/discussions)

---

## Acknowledgments

- **Optimization Dataset:** NSCLC-Radiomics Collection from The Cancer Imaging Archive (TCIA)
- **Method:** Multi-Scale Statistical Texture Synthesis (MS-STS) with gradient-preserving remapping
- **Optimization:** 15-iteration parameter search conducted November 2025

---

**Author:** Ankur Lohachab

**Affiliation:** Department of Advanced Computing Sciences, Maastricht University

**Contact:** ankur.lohachab@maastrichtuniversity.nl

**Date:** November 26, 2025

**Version:** 1.0.0
