Metadata-Version: 2.4
Name: imperfekt
Version: 0.2.5
Summary: A framework to analyze imperfections (missingness, noise) in time-series datasets.
Project-URL: Homepage, https://github.com/krafftta/imperfekt
Project-URL: Repository, https://github.com/krafftta/imperfekt
Project-URL: Issues, https://github.com/krafftta/imperfekt/issues
Author: Tamara Krafft
License-Expression: MIT
License-File: LICENSE
Keywords: data-quality,healthcare,missing-data,missingness-analysis,polars,time-series
Classifier: Development Status :: 4 - Beta
Classifier: Intended Audience :: Healthcare Industry
Classifier: Intended Audience :: Science/Research
Classifier: License :: OSI Approved :: MIT License
Classifier: Operating System :: OS Independent
Classifier: Programming Language :: Python :: 3
Classifier: Programming Language :: Python :: 3.10
Classifier: Programming Language :: Python :: 3.11
Classifier: Programming Language :: Python :: 3.12
Classifier: Topic :: Scientific/Engineering :: Information Analysis
Classifier: Typing :: Typed
Requires-Python: >=3.10
Requires-Dist: kaleido>=0.2.1
Requires-Dist: matplotlib>=3.8.0
Requires-Dist: missingno>=0.5.2
Requires-Dist: numpy>=1.24.0
Requires-Dist: pandas>=2.0.0
Requires-Dist: pingouin>=0.5.0
Requires-Dist: plotly>=5.18.0
Requires-Dist: polars>=0.20.0
Requires-Dist: rich>=13.0.0
Requires-Dist: scikit-learn>=1.3.0
Requires-Dist: scikit-posthocs>=0.9.0
Requires-Dist: scipy>=1.13.0
Requires-Dist: statsmodels>=0.14.0
Requires-Dist: upsetty>=0.1.0
Provides-Extra: dev
Requires-Dist: mypy>=1.7.0; extra == 'dev'
Requires-Dist: pre-commit>=3.5.0; extra == 'dev'
Requires-Dist: pytest-cov>=4.1.0; extra == 'dev'
Requires-Dist: pytest>=7.4.0; extra == 'dev'
Requires-Dist: ruff>=0.1.0; extra == 'dev'
Provides-Extra: docs
Requires-Dist: mkdocs-material>=9.5.0; extra == 'docs'
Requires-Dist: mkdocs>=1.5.0; extra == 'docs'
Requires-Dist: mkdocstrings[python]>=0.24.0; extra == 'docs'
Description-Content-Type: text/markdown

# Imperfekt - Understanding Data Imperfections in Time-Series

[![PyPI version](https://img.shields.io/pypi/v/imperfekt.svg)](https://pypi.org/project/imperfekt/)
[![License: MIT](https://img.shields.io/badge/License-MIT-yellow.svg)](https://opensource.org/licenses/MIT)
[![Python 3.10+](https://img.shields.io/badge/python-3.10+-blue.svg)](https://www.python.org/downloads/)

A comprehensive analysis toolkit for studying "imperfect" data patterns in time-series datasets.
Here, imperfection refers to missingness, noise, and any other data quality issue that can be encoded as a binary mask.

## Overview

This library provides tools to analyze data quality issues in time-series data, including:
- **Intravariable analysis** of imperfection patterns for individual variables
- **Intervariable analysis** of co-occurring imperfections across multiple parameters
- **Feature generation** based on missingness patterns for downstream ML tasks

## Installation

Install the library using `pip`:

```bash
pip install imperfekt
```

## Quick Start

```python
import polars as pl
from imperfekt import Imperfekt, FeatureGenerator

# Load your time-series data
df = pl.read_parquet("your_data.parquet")

# Configure the analyzer
analyzer = Imperfekt(
    df=df,
    id_col="id",           # Unique identifier column
    clock_col="clock",     # Timestamp column
    cols=["var1", "var2"], # Variables to analyze
    save_path="./results"
)

# Simple intravariable missingness stats
analyzer.intravariable.column_statistics(save_results=True)
print(analyzer.intravariable.results.cs_overall_statistics)
print(analyzer.intravariable.results.cs_case_level_statistics)

# Run full imperfection analysis (preliminary correlations, intra- and intervariable analyses)
results = analyzer.run()

# Or generate missingness-aware features for ML
fg = FeatureGenerator(
    df=df,
    id_col="id",
    clock_col="clock",
    variable_cols=["var1", "var2"]
)
features_df = fg.add_binary_masks().add_temporal_features().df
```

## Library Structure

```
imperfekt/
├── analysis/
│   ├── preliminary/     # Basic data exploration
│   ├── intravariable/   # Single variable analysis
│   ├── intervariable/   # Multi-variable patterns
│   └── utils/           # Shared utilities
├── features/            # Feature engineering
│   ├── core.py          # FeatureGenerator class
│   ├── temporal.py      # Time-based features
│   └── interaction.py   # Variable interactions
└── config/              # Default settings
```

## Data Format

The library expects time-series data with the following structure:

| Column | Description |
|--------|-------------|
| `id` | Unique identifier for each time-series (e.g., patient, sensor) |
| `clock` | Timestamp for each observation |
| `var1`, `var2`, ... | Variables to analyze |

## Key Dependencies

- **polars**: High-performance data processing
- **plotly**: Interactive visualizations
- **scipy**: Statistical computations

## Contributing

Contributions are welcome! Please feel free to submit a Pull Request.

## License

This project is licensed under the MIT License - see the [LICENSE](LICENSE) file for details.

