Metadata-Version: 2.4
Name: mleakdetect
Version: 0.1.0
Summary: Building a Python Library for Detecting and Preventing Data Leakage
Author-email: Songmin Lee <violet211172@gmail.com>
License: MIT
Keywords: machine-learning,data-leakage,time-series,failure-prediction,data-splitting,temporal-split,duplicate-detection
Requires-Python: >=3.12
Description-Content-Type: text/markdown
Requires-Dist: numpy>=1.26
Requires-Dist: pandas>=2.2
Requires-Dist: scikit-learn>=1.4
Provides-Extra: dev
Requires-Dist: pytest>=9.0.2; extra == "dev"

# mleakdetect: Data Leakage Detection Toolkit

A Python toolkit for detecting and measuring data leakage in machine learning failure prediction tasks, implementing methodologies from the ICPE 2025 paper ["Quantifying Data Leakage in Failure Prediction Tasks"](https://dl.acm.org/doi/abs/10.1145/3676151.3719368).

---

## Table of Contents

- [Overview](#overview)
- [Features](#features)
- [Installation](#installation)
- [Quick Start](#quick-start)
- [Core Modules](#core-modules)
- [Usage Examples](#usage-examples)
- [Testing](#testing)
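- [Project Structure](#project-structure)
- [API Reference](#api-reference)
- [License](#license)
- [Acknowledgments](#acknowledgments)
- [Contact](#contact)
- [Changelog](#changelog)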

---

## Overview

Data leakage is a critical issue in machine learning that occurs when information from the test set inadvertently influences the training process. This toolkit provides:

- **Quantitative measurement** of data leakage using the temporal leakage metric (L̃) from the ICPE 2025 paper
- **Temporal and group-based splitting** strategies to prevent leakage
- **Duplicate detection** for exact and near-duplicate rows, which can contaminate the test set
- **Identifier overlap analysis** between train and test sets
- **Automated analysis pipeline** with comprehensive reporting

The implementation is designed to be modular, extensible, and easy to integrate into existing ML pipelines.
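
As a small illustration of the problem, the following sketch compares a random split with a chronological one on a daily time series. It uses plain Python on synthetic dates and is independent of this toolkit; the helper name `count_test_before_train_end` is made up for this example.

```python
import random
from datetime import date, timedelta

# 100 consecutive daily observations (synthetic data for illustration).
days = [date(2024, 1, 1) + timedelta(days=i) for i in range(100)]


def count_test_before_train_end(train, test):
    """Count test days that fall on or before the latest training day."""
    latest_train = max(train)
    return sum(day <= latest_train for day in test)


# Random 70/30 split: training days are scattered across the whole range.
shuffled = days[:]
random.Random(0).shuffle(shuffled)
random_train, random_test = shuffled[:70], shuffled[70:]

# Chronological 70/30 split: every training day precedes every test day.
chrono_train, chrono_test = days[:70], days[70:]

random_overlap = count_test_before_train_end(random_train, random_test)
chrono_overlap = count_test_before_train_end(chrono_train, chrono_test)
print(f"random split: {random_overlap} of {len(random_test)} test days precede the end of training")
print(f"chronological split: {chrono_overlap} of {len(chrono_test)} test days precede the end of training")
```

The random split places test days inside the period already covered by training data, whereas the chronological split leaves none there.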

---

## Features

### Core Functionality

- **Leakage Measurement**: Compute normalized leakage scores using exponential decay-based temporal similarity
- **Temporal Data Splitting**: Split time-series data chronologically with configurable time gaps
- **Group-based Splitting**: Prevent group contamination by keeping entities (e.g., devices, users) separate
- **Duplicate Detection**: Find and remove exact and near-duplicate rows
- **Overlap Analysis**: Identify shared identifiers between training and test sets
- **Automated Reporting**: Generate comprehensive analysis reports

### Design Principles

- **Modular Architecture**: Each component can be used independently
- **NumPy-style Documentation**: Complete docstrings following scientific Python standards
- **Comprehensive Testing**: 48 unit tests covering all core functionality
- **pip-installable**: Standard Python packaging with pyproject.toml

---

## Installation

### Prerequisites

Python 3.12 or later is required. We recommend using a virtual environment (e.g., with conda) to avoid conflicts with other Python projects.

### Option 1: Install via pip (recommended)

If you just want to use the toolkit, install the latest released version:

```bash
pip install mleakdetect
```

### Option 2: Install from source (for development)

If you want to run the latest code or modify the package locally, clone the repository and install from the `source/` directory, which is the package root (it contains `pyproject.toml`):

```bash
git clone https://github.com/song_min_0111/mleakdetect.git
cd mleakdetect/source
pip install .
```

---

## Quick Start

### Basic Workflow

```python
import pandas as pd
import mleakdetect as mld


def main() -> None:
    df = pd.read_csv("examples/Bitcoin 2024.csv")

    result = mld.analyze_dataset(
        df=df,
        split_mode="temporal",
        split_type="train_test",
        time_column="Date",
        id_columns=None,
        alpha=1.0,
        similarity_threshold=0.99,
        remove_exact_dups=True,
        remove_near_dups=False,
    )

    mld.print_leakage_report(result)
    mld.save_report_to_file(result, "leakage_report.md")
    print("Saved: leakage_report.md")


if __name__ == "__main__":
    main()
```

---

## Core Modules

### 1. Leakage Measurement (`mleakdetect.core.measure`)

Implements the temporal leakage metric L̃ from the ICPE 2025 paper:

```python
from mleakdetect.core.measure import compute_data_leakage

leakage = compute_data_leakage(
    train_df=train,
    test_df=test,
    time_column='date',
    alpha=1.0,  # Weight for backward temporal leakage
    group_column=None  # Optional: for group-aware measurement
)
```

**Interpretation:**
- `L̃ = 0.0`: No leakage (clean split)
- `L̃ = 1.0`: Maximum leakage (train == test)
- `L̃ ∈ (0, 1)`: Partial leakage
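
To give a feel for how a score bounded by 0 and 1 can arise from timestamps, the sketch below implements a generic exponential-decay similarity. It is not the paper's definition of L̃ and not the code inside `compute_data_leakage`; the function name `temporal_similarity` and the decay scale `tau_days` are hypothetical.

```python
import math
from datetime import datetime


def temporal_similarity(t_train: datetime, t_test: datetime, tau_days: float = 7.0) -> float:
    """Exponential-decay similarity in (0, 1]; tau_days is a hypothetical decay scale."""
    delta_days = abs((t_train - t_test).total_seconds()) / 86400.0
    return math.exp(-delta_days / tau_days)


test_time = datetime(2024, 6, 1)

# Identical timestamps give the maximum similarity of 1.0.
same_time = temporal_similarity(datetime(2024, 6, 1), test_time)

# One decay scale (7 days) apart gives exp(-1).
week_apart = temporal_similarity(datetime(2024, 5, 25), test_time)

# A year apart is effectively zero.
year_apart = temporal_similarity(datetime(2023, 6, 1), test_time)

print(same_time, round(week_apart, 4), year_apart < 1e-6)
```

Under this kind of kernel, training samples close in time to a test sample contribute strongly, while distant samples contribute almost nothing.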

### 2. Data Splitting (`mleakdetect.core.split`)

#### Temporal Split
Preserves chronological order to prevent temporal leakage:

```python
from mleakdetect.core.split import temporal_split

train, test = temporal_split(
    df,
    time_column='timestamp',
    train_ratio=0.7,
    gap='7D'  # Time gap between train and test
)
```
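
For intuition, a chronological split with a gap can be sketched in a few lines of pandas. This is an assumption about the semantics (earliest rows go to train, test starts `gap` after the last training timestamp), not the library's implementation; `chronological_split` is a hypothetical name.

```python
import pandas as pd


def chronological_split(df, time_column, train_ratio=0.7, gap="0D"):
    """Sketch: earliest rows go to train; test starts `gap` after the last train row."""
    ordered = df.sort_values(time_column).reset_index(drop=True)
    n_train = int(round(len(ordered) * train_ratio))
    train = ordered.iloc[:n_train]
    # Rows that fall inside the gap window are dropped from both sets.
    test_start = train[time_column].max() + pd.Timedelta(gap)
    remainder = ordered.iloc[n_train:]
    test = remainder[remainder[time_column] >= test_start]
    return train, test


df = pd.DataFrame({
    "timestamp": pd.date_range("2024-01-01", periods=100, freq="D"),
    "value": range(100),
})
train, test = chronological_split(df, "timestamp", train_ratio=0.7, gap="7D")
print(len(train), len(test))  # 70 24
```

Note that a non-zero gap shrinks the test set: here the six days that fall inside the gap window are discarded.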

#### Group-based Split
Keeps groups (e.g., devices, users) separate across splits:

```python
from mleakdetect.core.split import group_split

train, test = group_split(
    df,
    group_column='device_id',
    train_ratio=0.7,
    random_state=42
)
```
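
The idea behind a group-based split can be sketched as follows: shuffle the unique group labels, then assign every row of a group to the same side. This is a minimal illustration under the assumption that `train_ratio` applies to groups rather than rows; `split_by_group` is a hypothetical name, not the library function.

```python
import numpy as np
import pandas as pd


def split_by_group(df, group_column, train_ratio=0.7, random_state=None):
    """Sketch: assign whole groups to either train or test, never both."""
    groups = df[group_column].unique()
    rng = np.random.default_rng(random_state)
    shuffled = rng.permutation(groups)
    # The ratio is applied to the number of groups, not the number of rows.
    n_train = int(round(len(shuffled) * train_ratio))
    in_train = df[group_column].isin(shuffled[:n_train])
    return df[in_train], df[~in_train]


df = pd.DataFrame({
    "device_id": [f"dev{i % 10}" for i in range(50)],
    "reading": range(50),
})
train, test = split_by_group(df, "device_id", train_ratio=0.7, random_state=42)
shared = set(train["device_id"]) & set(test["device_id"])
print(train["device_id"].nunique(), test["device_id"].nunique(), len(shared))  # 7 3 0
```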

### 3. Duplicate Detection (`mleakdetect.core.duplicates`)

#### Exact Duplicates
```python
from mleakdetect.core.duplicates import find_exact_duplicates, remove_exact_duplicates

# Find duplicates
duplicates, summary = find_exact_duplicates(df)
print(f"Found {summary.exact_duplicate_rows} duplicates")

# Remove duplicates
clean_df = remove_exact_duplicates(df, keep='first')
```
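
For comparison, the same kind of check can be done with plain pandas. The snippet below uses only `DataFrame.duplicated` and `DataFrame.drop_duplicates` on a small synthetic frame; whether the library uses these calls internally is not stated here.

```python
import pandas as pd

df = pd.DataFrame({
    "sensor": ["a", "b", "a", "c", "b"],
    "value": [1.0, 2.0, 1.0, 3.0, 2.0],
})

# keep=False marks every member of a duplicate set, including the first occurrence.
duplicate_rows = df[df.duplicated(keep=False)]

# keep="first" retains one representative per duplicate set.
deduplicated = df.drop_duplicates(keep="first").reset_index(drop=True)

print(len(duplicate_rows), len(deduplicated))  # 4 3
```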

#### Near-duplicates
```python
from mleakdetect.core.duplicates import find_near_duplicates

# Find near-duplicates using cosine similarity
pairs, summary = find_near_duplicates(
    df,
    threshold=0.95,  # Similarity threshold
    subset=['feature1', 'feature2']  # Columns to compare
)
```
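
A cosine-similarity check of this kind can be sketched with NumPy alone. The function below is a hypothetical stand-in (`near_duplicate_pairs` is not part of the package) that compares all row pairs, so it is only suitable for small inputs.

```python
import numpy as np


def near_duplicate_pairs(features, threshold=0.95):
    """Return (i, j, similarity) for row pairs whose cosine similarity >= threshold."""
    X = np.asarray(features, dtype=float)
    norms = np.linalg.norm(X, axis=1, keepdims=True)
    unit = X / np.where(norms == 0, 1.0, norms)  # all-zero rows stay zero
    sim = unit @ unit.T
    i_idx, j_idx = np.triu_indices(len(X), k=1)  # visit each unordered pair once
    keep = sim[i_idx, j_idx] >= threshold
    return [(int(i), int(j), float(sim[i, j])) for i, j in zip(i_idx[keep], j_idx[keep])]


rows = [
    [1.0, 2.0, 3.0],
    [1.0, 2.0, 3.1],   # almost identical to row 0
    [-3.0, 0.5, 0.0],  # points in a different direction
]
pairs = near_duplicate_pairs(rows, threshold=0.99)
print([(i, j) for i, j, _ in pairs])  # [(0, 1)]
```

Cosine similarity ignores magnitude, so rows such as `[1, 2, 3]` and `[2, 4, 6]` would also be flagged; scaling features beforehand or choosing a different metric may be appropriate depending on the data.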

### 4. Identifier Overlap Analysis (`mleakdetect.core.overlap`)

```python
from mleakdetect.core.overlap import compute_id_overlap

overlap_df = compute_id_overlap(
    train_df=train,
    test_df=test,
    id_columns=['device_id', 'user_id']
)

print(overlap_df)
#       column  unique_train  unique_test  overlap_count  overlap_percentage_test
# 0  device_id            70           30             15                     50.0
```
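
The columns in that output can be reproduced with set arithmetic. The sketch below assumes that `overlap_percentage_test` is the share of unique test identifiers that also occur in train, which matches the sample row above (15 of 30 is 50%); `id_overlap` is a hypothetical helper, not the library function.

```python
import pandas as pd


def id_overlap(train_df, test_df, id_columns):
    """Sketch: per-column count of test identifiers that also occur in train."""
    records = []
    for column in id_columns:
        train_ids = set(train_df[column].dropna())
        test_ids = set(test_df[column].dropna())
        shared = train_ids & test_ids
        records.append({
            "column": column,
            "unique_train": len(train_ids),
            "unique_test": len(test_ids),
            "overlap_count": len(shared),
            # Guard against an empty test set to avoid division by zero.
            "overlap_percentage_test": 100.0 * len(shared) / len(test_ids) if test_ids else 0.0,
        })
    return pd.DataFrame(records)


train = pd.DataFrame({"device_id": ["d1", "d2", "d3", "d4"]})
test = pd.DataFrame({"device_id": ["d3", "d4", "d5", "d6"]})
overlap = id_overlap(train, test, ["device_id"])
print(overlap)
```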

---

## Usage Examples

### Example 1: Bitcoin Price Dataset (Single Time Series)

```python
import pandas as pd
from mleakdetect import analyze_dataset, print_leakage_report, save_report_to_file

# Load Bitcoin price data
df = pd.read_csv('Bitcoin 2024.csv')
df['Date'] = pd.to_datetime(df['Date'])

# Generate the report
result = analyze_dataset(
    df,
    split_mode="temporal",
    time_column="Date",
    id_columns=None,
    alpha=1.0,
    similarity_threshold=0.99,
)

# Print the report and save it as Markdown
print_leakage_report(result)
save_report_to_file(result, "bitcoin_leakage_report.md")
```

### Example 2: HDD Failure Prediction (Multi-group)

```python
import pandas as pd
from mleakdetect import analyze_dataset, print_leakage_report, save_report_to_file

# Load HDD failure data (Backblaze dataset)
df = pd.read_parquet('hdd_data.parquet')
df['date'] = pd.to_datetime(df['date'])

# Column names are assumptions for this example; adjust them to your data
date_col = "date"
group_col = "serial_number"

# Generate the report
result = analyze_dataset(
    df,
    split_mode="group",
    time_column=date_col,
    group_column=group_col,
    id_columns=["serial_number"],
    alpha=1.0,
    similarity_threshold=0.999,
)
print_leakage_report(result)
save_report_to_file(result, "hdd_leakage_report.md")
```
---

## Testing

The package includes comprehensive unit tests covering all core functionality.

### Run all tests

```bash
# Run the full suite from the source/ directory
pytest
```

### Test Coverage

The suite currently contains **48 tests**, all of which pass:

- Core leakage measurement (6 tests)
- Temporal and group splitting (8 tests)
- Duplicate detection (8 tests)
- Identifier overlap analysis (8 tests)
- Input validation (18 tests)

---

## Project Structure

```
source/
├── mleakdetect/              # Main package
│   ├── __init__.py           # Package initialization & exports
│   ├── core/                 # Core functionality modules
│   │   ├── __init__.py
│   │   ├── measure.py        # Leakage measurement
│   │   ├── split.py          # Data splitting utilities
│   │   ├── duplicates.py     # Duplicate detection
│   │   ├── overlap.py        # ID overlap analysis
│   │   ├── analysis.py       # High-level orchestrator
│   │   └── report.py         # Report generation
│   └── utils/                # Utility modules
│       ├── __init__.py
│       └── validation.py     # Input validation
├── tests/                    # Unit tests
│   ├── __init__.py
│   ├── test_measure.py
│   ├── test_split.py
│   ├── test_duplicates.py
│   ├── test_overlap.py
│   └── test_validation.py
├── examples/                 # Usage examples
│   ├── example_notebook_bitcoin.ipynb
│   └── example_notebook_hdd.ipynb
├── pyproject.toml            # Package configuration
├── README.md                 # This file
└── LICENSE                   # MIT License 
```

---

## API Reference

### Main Functions

| Function                    | Module       | Description                      |
|-----------------------------|--------------|----------------------------------|
| `compute_data_leakage()`    | `measure`    | Compute normalized leakage score |
| `temporal_split()`          | `split`      | Split data chronologically       |
| `group_split()`             | `split`      | Split data by groups             |
| `find_exact_duplicates()`   | `duplicates` | Find exact duplicate rows        |
| `find_near_duplicates()`    | `duplicates` | Find near-duplicate rows         |
| `remove_exact_duplicates()` | `duplicates` | Remove exact-duplicate rows      |
| `remove_near_duplicates()`  | `duplicates` | Remove near-duplicate rows       |
| `compute_id_overlap()`      | `overlap`    | Analyze ID overlap               |
| `analyze_dataset()`         | `analysis`   | Full analysis pipeline           |
| `print_leakage_report()`    | `report`     | Print analysis result            |

### Parameters Guide

#### Alpha Parameter (α)

Controls the weight of backward temporal leakage:

- **α = 0**: Only future leakage counts (strict temporal-only task)
- **α = 1**: Past and future equally weighted (general failure prediction)
- **α > 1**: Emphasize group contamination over temporal leakage

#### Time Gap

Specifies the time buffer between train and test:

```python
gap='0D'    # No gap
gap='1D'    # 1 day
gap='7D'    # 1 week
gap='1W'    # 1 week (alternative)
gap='12h'   # 12 hours
gap='30min' # 30 minutes
```
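
These strings look like pandas offset aliases. Assuming the `gap` argument is parsed the way `pandas.Timedelta` parses strings (an assumption, not confirmed here), the values can be checked before passing them to a split:

```python
import pandas as pd

# Assumption: gap strings follow pandas' Timedelta string syntax.
for gap in ["0D", "1D", "7D", "1W", "12h", "30min"]:
    print(f"{gap:>6} -> {pd.Timedelta(gap)}")

# "7D" and "1W" describe the same duration.
same_week = pd.Timedelta("7D") == pd.Timedelta("1W")
half_day_seconds = pd.Timedelta("12h").total_seconds()
```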
---

## License

This project is licensed under the MIT License - see the [LICENSE](LICENSE) file for details.

---

## Acknowledgments

- Based on methodologies from the ICPE 2025 paper "Quantifying Data Leakage in Failure Prediction Tasks"
- Developed at the University of Würzburg, Germany
- Supervisor: Daniel Grillmeyer

---

## Contact

For questions or issues, please open an issue on [GitHub](https://github.com/song_min_0111/mleakdetect/issues).

---

## Changelog

### Version 0.1.0 (2025-02-16)

- Initial release
- Core leakage measurement implementation
- Temporal and group-based splitting
- Duplicate detection (exact and near)
- ID overlap analysis
- Automated analysis pipeline
- Comprehensive test suite (48 tests)
- Complete documentation
