Metadata-Version: 2.4
Name: featureleak
Version: 0.1.0
Summary: Automatic data leakage detection for tabular and time-series ML workflows.
Author: Christian McBride
License: MIT License
        
        Copyright (c) 2026 Christian McBride
        
        Permission is hereby granted, free of charge, to any person obtaining a copy
        of this software and associated documentation files (the "Software"), to deal
        in the Software without restriction, including without limitation the rights
        to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
        copies of the Software, and to permit persons to whom the Software is
        furnished to do so, subject to the following conditions:
        
        The above copyright notice and this permission notice shall be included in all
        copies or substantial portions of the Software.
        
        THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
        IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
        FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
        AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
        LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
        OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
        SOFTWARE.
        
Project-URL: Homepage, https://github.com/christianmcb/feature_leak
Project-URL: Documentation, https://github.com/christianmcb/feature_leak
Project-URL: Repository, https://github.com/christianmcb/feature_leak
Classifier: Programming Language :: Python :: 3
Classifier: License :: OSI Approved :: MIT License
Classifier: Operating System :: OS Independent
Classifier: Intended Audience :: Developers
Classifier: Intended Audience :: Science/Research
Classifier: Topic :: Software Development :: Libraries :: Python Modules
Classifier: Topic :: Scientific/Engineering :: Artificial Intelligence
Requires-Python: >=3.10
Description-Content-Type: text/markdown
License-File: LICENSE
Requires-Dist: pandas>=2.0
Requires-Dist: numpy>=1.24
Requires-Dist: scikit-learn>=1.3
Requires-Dist: tqdm
Dynamic: license-file



<p align="center">
    <img src="featureleak_banner.png" alt="FeatureLeak Banner" width="600">
</p>

# 🚨 FeatureLeak: Stop Data Leakage Before It Stops You

**A Python tool for catching data leakage in machine learning before you train a model.**

[![MIT License](https://img.shields.io/badge/license-MIT-blue.svg)](LICENSE)

---

## Why FeatureLeak?

**Data leakage is the silent killer of machine learning projects.** It inflates your metrics, sabotages your models in production, and can cost you months of wasted effort. FeatureLeak is your automated guardrail—scanning your data for the most insidious forms of leakage before you ever train a model.

- **Instantly scan your data for 10+ leakage types**
- **Zero configuration needed**—works out of the box
- **Actionable, human-readable reports**
- **CLI & Python API** for seamless integration

---

## 🚀 Quick Demo

### Python API
```python
from featureleak import LeakScanner
import pandas as pd

df = pd.read_csv('data.csv')
scanner = LeakScanner()
report = scanner.scan(df, target='target')

print(report.summary())  # Human-readable summary
print(report.issues)     # List of detected issues
```

### Command Line
```bash
featureleak scan data.csv --target target --output report.json
```

---

## What Can FeatureLeak Catch?

- **Target Leakage**: Features that "cheat" by revealing the answer
- **Temporal Leakage**: Using future info in past predictions
- **Train-Test Contamination**: Overlap between train/test sets
- **Entity Leakage**: Same entity in both train and test
- **Aggregation Leakage**: Pre-aggregated stats leaking test info
- **Identifier Leakage**: Unique IDs that act as shortcuts
- **Missingness Leakage**: Patterns of missing data that reveal the target
- **Duplicate Leakage**: Duplicated rows across splits
- **Preprocessing Inconsistencies**: Different transforms for train/test
- **Distribution Shift**: Major differences between train and test

**...and more!**
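
To make two of the categories above concrete, here is a minimal pandas-only sketch of what duplicate leakage and entity leakage look like in a train/test split. It is an illustration, not FeatureLeak's implementation; the helper names and the `customer_id` and `spend` columns are hypothetical.

```python
import pandas as pd


def rows_shared_across_splits(train: pd.DataFrame, test: pd.DataFrame) -> pd.DataFrame:
    """Return rows that appear, value for value, in both train and test."""
    shared_cols = [c for c in train.columns if c in test.columns]
    # An inner merge on every shared column keeps only rows present in both frames.
    return train[shared_cols].drop_duplicates().merge(
        test[shared_cols].drop_duplicates(), how="inner", on=shared_cols
    )


def entities_in_both_splits(train: pd.DataFrame, test: pd.DataFrame, entity_col: str) -> set:
    """Return entity identifiers that occur in both train and test."""
    return set(train[entity_col].dropna()) & set(test[entity_col].dropna())


# Hypothetical toy data: customer 3 appears in both splits with identical values.
train = pd.DataFrame({"customer_id": [1, 2, 3], "spend": [10.0, 20.0, 30.0]})
test = pd.DataFrame({"customer_id": [3, 4], "spend": [30.0, 40.0]})

duplicates = rows_shared_across_splits(train, test)
shared_entities = entities_in_both_splits(train, test, "customer_id")
print(len(duplicates), shared_entities)
```

Any non-empty result means the test set is not independent of the training set, so evaluation metrics will likely be optimistic.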

---

## 📊 Example Output

```text
FeatureLeak Report
──────────────────────────────
Risk score: 75/100 (High)
Total issues: 3
High risk features: 1
Medium risk features: 2

1. [HIGH] target_leakage
     Feature 'previous_target' is 0.98 correlated with target
     Suggested fix: Remove or investigate this feature
```
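
For intuition about the `target_leakage` finding, the sketch below flags numeric features whose absolute correlation with the target meets a threshold. It assumes Pearson correlation and uses plain pandas; the function name and toy columns are hypothetical, and FeatureLeak's own check may work differently.

```python
import pandas as pd


def highly_correlated_features(df: pd.DataFrame, target: str, threshold: float = 0.98) -> pd.Series:
    """Return numeric features whose |Pearson correlation| with the target is >= threshold."""
    numeric = df.select_dtypes(include="number")
    if target not in numeric.columns:
        raise ValueError(f"target column {target!r} must be numeric")
    # Correlate every other numeric column with the target column.
    corr = numeric.drop(columns=[target]).corrwith(numeric[target]).abs()
    return corr[corr >= threshold].sort_values(ascending=False)


# Hypothetical toy data: 'previous_target' is a copy of the target, 'noise' is not.
df = pd.DataFrame(
    {
        "target": [0, 1, 0, 1, 1, 0],
        "previous_target": [0, 1, 0, 1, 1, 0],
        "noise": [5, 3, 6, 2, 4, 1],
    }
)

flagged = highly_correlated_features(df, target="target")
print(flagged)
```

A near-perfect correlation is not proof of leakage, but it is a strong signal that the feature may have been derived from the target or recorded after the outcome was known.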

---

## 🔧 Configuration (Optional)

```python
from featureleak import LeakScanner

scanner = LeakScanner(
    target_corr_threshold=0.98,  # Correlation threshold for target leakage
    overlap_threshold=0.0,       # Allowable train-test overlap
    sample_size=10000,           # Sample size used for large datasets
)
```

---

## 💡 Why Data Scientists Love FeatureLeak

- **Saves you from embarrassing mistakes** before deployment
- **Works with tabular data** from CSV files or pandas DataFrames
- **Handles time series, entity-based, and large datasets**
- **No black box**: Every issue comes with a clear explanation and fix
- **Open source, MIT licensed**

---

## 📦 Installation

```bash
pip install featureleak
```

**Requirements:** Python 3.10+, pandas 2.0+, numpy 1.24+, scikit-learn 1.3+, tqdm

---

## 🛠️ Integrate Anywhere

- **Python API**: Use in notebooks, scripts, or pipelines
- **CLI**: Scan datasets from the terminal or CI/CD
- **JSON Reports**: Easy to parse and automate
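
As one way to use a JSON report in CI, the sketch below reads a report file and checks a numeric score against a limit. The report schema is not documented in this README, so the `risk_score` key is an assumption; inspect a real `report.json` and adjust `score_key` before relying on this.

```python
import json
import tempfile
from pathlib import Path


def report_exceeds(report_path, max_score: float, score_key: str = "risk_score") -> bool:
    """Return True if the report's score is above max_score.

    score_key is a hypothetical field name; check it against a real report.
    """
    report = json.loads(Path(report_path).read_text(encoding="utf-8"))
    return report[score_key] > max_score


# Stand-in report written here for demonstration only. In practice the file
# would come from `featureleak scan ... --output report.json`.
with tempfile.TemporaryDirectory() as tmp:
    path = Path(tmp) / "report.json"
    path.write_text(json.dumps({"risk_score": 75}), encoding="utf-8")
    too_risky = report_exceeds(path, max_score=50)

print(too_risky)
```

In a pipeline, a script like this could call `sys.exit(1)` when the check returns `True` so that the job fails.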

---

## 📝 Documentation & Help

- Run `featureleak --help` for CLI options
- See [examples in the docs](https://github.com/christianmcb/feature_leak)
- Open an issue or PR—contributions welcome!

---

## 📄 License

MIT License. See [LICENSE](LICENSE).

---

## 📣 Citing FeatureLeak

If you use FeatureLeak in research, please cite:

```bibtex
@software{featureleak2026,
    author = {McBride, Christian},
    title = {FeatureLeak: Automated Data Leakage Detection},
    year = {2026},
    url = {https://github.com/christianmcb/feature_leak}
}
```

---

## 🙏 Acknowledgments

Built with [pandas](https://pandas.pydata.org/), [scikit-learn](https://scikit-learn.org/), and [numpy](https://numpy.org/).
