Metadata-Version: 2.1
Name: nullsweep
Version: 0.6.0
Summary: A comprehensive Python package for managing and analyzing missing data in pandas DataFrames, starting with detection and expanding to complete handling.
Home-page: https://github.com/okanyenigun/nullsweep
Author: Okan Yenigün
Author-email: okanyenigun@gmail.com
License: MIT
Classifier: Development Status :: 3 - Alpha
Classifier: Intended Audience :: Developers
Classifier: Intended Audience :: Science/Research
Classifier: Topic :: Software Development :: Libraries :: Python Modules
Classifier: Topic :: Scientific/Engineering :: Mathematics
Classifier: License :: OSI Approved :: MIT License
Classifier: Programming Language :: Python :: 3
Classifier: Programming Language :: Python :: 3.7
Classifier: Programming Language :: Python :: 3.8
Classifier: Programming Language :: Python :: 3.9
Classifier: Programming Language :: Python :: 3.10
Classifier: Programming Language :: Python :: 3.11
Classifier: Operating System :: OS Independent
Requires-Python: >=3.7
Description-Content-Type: text/markdown
License-File: LICENCE.md
Requires-Dist: pandas==2.2.2
Requires-Dist: scipy==1.13.1
Requires-Dist: statsmodels==0.14.2
Requires-Dist: scikit-learn==1.6.1
Requires-Dist: seaborn==0.13.2
Requires-Dist: missingno==0.5.2
Requires-Dist: upsetplot==0.9.0
Requires-Dist: wordcloud==1.9.4
Requires-Dist: polars==1.23.0
Requires-Dist: pyarrow==19.0.1
Requires-Dist: dask==2025.4.1
Requires-Dist: fsspec==2025.3.2
Requires-Dist: locket==1.0.0
Requires-Dist: partd==1.4.2
Requires-Dist: pyyaml==6.0.2
Requires-Dist: toolz==1.0.0
Requires-Dist: jinja2==3.1.6
Requires-Dist: dask-ml==2025.1.0
Provides-Extra: dev
Requires-Dist: twine==5.1.1; extra == "dev"
Provides-Extra: test
Requires-Dist: pytest==8.2.2; extra == "test"

# NullSweep

NullSweep is a Python library for detecting and handling missing data in DataFrames. It provides a simple API to identify missing data patterns across an entire dataset, to identify patterns tied to specific features, and to impute missing values using a variety of strategies.

It supports **pandas**, **polars**, **dask**, and **pyspark** DataFrames.

## Features

- Detect global patterns of missing data in a DataFrame.
- Detect missing data patterns in specific features/columns of a DataFrame.
- Impute missing data in specific features or across the entire DataFrame using a variety of strategies.
- Use a modular approach with different pattern detection and imputation strategies.
- Visualize missing data patterns for better understanding and analysis.

## Installation

Install NullSweep using pip:

```bash
pip install nullsweep
```

## Usage

### Detect Global Patterns

To detect global missing data patterns in a pandas DataFrame:

```python
import pandas as pd
from nullsweep import detect_global_pattern

# Sample DataFrame
data = {'A': [1, 2, None], 'B': [None, 2, 3]}
df = pd.DataFrame(data)

# Detect global missing data pattern
pattern, details = detect_global_pattern(df)
print("Detected Pattern:", pattern)
print("Details:", details)
```
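
For intuition about what dataset-wide missingness can look like, the sketch below uses plain pandas only. It is not NullSweep's detection logic, which this README does not describe, and the helper name `missingness_overview` is hypothetical.

```python
import pandas as pd


def missingness_overview(df: pd.DataFrame) -> dict:
    """Summarize missingness across a whole DataFrame (illustrative helper)."""
    mask = df.isna()  # True wherever a value is missing
    return {
        # Fraction of missing values per column
        "missing_fraction": mask.mean().to_dict(),
        # Number of rows with at least one missing value
        "rows_with_missing": int(mask.any(axis=1).sum()),
        # Number of rows where every column is missing
        "rows_all_missing": int(mask.all(axis=1).sum()),
    }


df = pd.DataFrame({'A': [1, 2, None], 'B': [None, 2, 3]})
overview = missingness_overview(df)
print(overview)
```

A summary like this is a useful sanity check to read alongside the pattern and details returned by `detect_global_pattern`.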

### Detect Feature-Specific Patterns

To detect missing data patterns in a specific feature of a pandas DataFrame:

```python
import pandas as pd
from nullsweep import detect_feature_pattern

# Sample DataFrame
data = {'A': [1, None, 3], 'B': [4, 5, 6]}
df = pd.DataFrame(data)

# Detect feature-specific missing data pattern
feature_name = 'A'
pattern, details = detect_feature_pattern(df, feature_name)
print("Detected Pattern:", pattern)
print("Details:", details)
```
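
As a rough, pandas-only illustration of the kind of per-column information that can describe missingness in one feature, the sketch below counts missing values and finds the longest run of consecutive gaps. The helper `column_missing_summary` is hypothetical and is not part of NullSweep; the library's own criteria may differ.

```python
import pandas as pd


def column_missing_summary(series: pd.Series) -> dict:
    """Count missing values and the longest consecutive gap (illustrative helper)."""
    mask = series.isna()
    # Start a new group id whenever the missing/present state changes
    run_ids = (mask != mask.shift()).cumsum()
    # Summing the boolean mask per group gives the length of each missing run
    run_lengths = mask.groupby(run_ids).sum()
    return {
        "missing_count": int(mask.sum()),
        "longest_missing_run": int(run_lengths.max()),
    }


column = pd.Series([1, None, None, 4, None], name='A')
summary = column_missing_summary(column)
print(summary)
```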

### Impute Missing Values

To handle missing values in a pandas DataFrame, the `impute_nulls` function offers a unified interface for various imputation strategies. Below is a comprehensive list of supported strategies, grouped by their functionality, and details about the parameters this function accepts.

#### **Imputation Strategies**

- **Deletion-Based Strategies**:

  - **`delete_column`**: Removes columns that meet certain criteria for missing values (e.g., columns with any missing values, or with missing values above a threshold).
  - **`listwise`**: Deletes rows with missing values based on specified thresholds.

- **Flagging Strategy**:

  - **`flag`**: Creates binary indicator columns to flag the presence of missing values.

- **Nearest Neighbors Strategies**:

  - **`knn`**: Uses K-Nearest Neighbors imputation to estimate missing values based on similarity to other data points.

- **Multivariate Strategies**:

  - **`mice`**: Performs multiple imputation using chained equations (MICE) to estimate missing values.
  - **`regression`**: Uses regression-based imputation where missing values are predicted using regression models fitted on non-missing data.

- **Continuous Features**:

  - **`mean`**: Replaces missing values with the mean of the column.
  - **`median`**: Replaces missing values with the median of the column.
  - **`most_frequent`**: Replaces missing values with the most frequent value in the column.
  - **`constant`**: Replaces missing values with a user-provided constant value.
  - **`interpolate`**: Uses interpolation (linear or polynomial) to estimate missing values.
  - **`forwardfill`**: Fills missing values with the last non-missing value in a forward direction.
  - **`backfill`**: Fills missing values with the next non-missing value in a backward direction.

- **Categorical Features**:

  - **`most_frequent`**: Replaces missing values with the most frequent value.
  - **`least_frequent`**: Replaces missing values with the least frequent value.
  - **`constant`**: Replaces missing values with a user-provided constant value.
  - **`forwardfill`**: Fills missing values with the last non-missing value in a forward direction.
  - **`backfill`**: Fills missing values with the next non-missing value in a backward direction.

- **Date Features**:

  - **`interpolate`**: Uses time-based interpolation to estimate missing values.
  - **`forwardfill`**: Fills missing values with the last non-missing value in a forward direction.
  - **`backfill`**: Fills missing values with the next non-missing value in a backward direction.

- **Automatic Strategy Detection**:
  - **`auto`**: Automatically determines the best strategy for each column based on its data type and characteristics.
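
To make the single-column strategies above concrete, the sketch below applies what are assumed to be their plain pandas equivalents to one small Series. This mapping is an assumption for illustration only; NullSweep may implement each strategy differently.

```python
import pandas as pd

s = pd.Series([1.0, None, 3.0, 3.0, None, 8.0], name="x")

# One column per strategy name; each is built with pandas directly
filled = pd.DataFrame({
    "mean": s.fillna(s.mean()),                    # fill with the column mean
    "median": s.fillna(s.median()),                # fill with the column median
    "most_frequent": s.fillna(s.mode().iloc[0]),   # fill with the most common value
    "constant": s.fillna(0.0),                     # fill with a user-chosen constant
    "interpolate": s.interpolate(method="linear"), # linear interpolation between neighbors
    "forwardfill": s.ffill(),                      # carry the last seen value forward
    "backfill": s.bfill(),                         # pull the next seen value backward
    "flag": s.isna().astype(int),                  # 1 where the original value was missing
})
print(filled)
```

Comparing the columns side by side shows how much the choice of strategy changes the filled values, even on a six-element column.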

---

#### **Parameters**

- **`df`** _(pd.DataFrame)_:  
  The input pandas DataFrame containing the data to process. Must not be empty.

- **`column`** _(Optional[Union[Iterable, str]])_:  
  The target column(s) for imputation. Can be a single column name (str), a list of column names (Iterable), or `None`. If `None`, all columns with missing values will be considered for imputation.

- **`strategy`** _(str)_:  
  The imputation strategy to use. Refer to the above list for supported strategies. Defaults to `"auto"`.

- **`fill_value`** _(Optional[Any])_:  
  A constant value to use for imputation when the strategy is `"constant"`.

- **`strategy_params`** _(Optional[Dict[str, Any]])_:  
  Additional parameters to configure the imputation strategy. Examples include:

  - For `interpolate`: `{"method": "polynomial", "order": 2}` to specify a second-order polynomial interpolation.
  - For `constant`: `{"fill_value": 0}` for numeric columns or `{"fill_value": "missing"}` for categorical columns.

- **`in_place`** _(bool)_:  
  Whether to modify the input DataFrame in place. If `True`, the DataFrame is updated directly. If `False`, a copy of the DataFrame is returned. Defaults to `True`.

- **`**kwargs`** _(Any)_:  
  Additional arguments for handler-specific configurations or compatibility.

```python
import pandas as pd
import nullsweep as ns

# Sample DataFrame
data = {
    'Age': [25, 30, None, 35, 40],
    'Gender': ['Male', 'Female', None, 'Female', 'Male']
}
df = pd.DataFrame(data)

# Impute missing values in 'Age' using mean
df = ns.impute_nulls(df, column='Age', strategy='mean')

# Impute missing values in 'Gender' using the most frequent value
df = ns.impute_nulls(df, column='Gender', strategy='most_frequent')

# Impute missing values in 'Age' using linear interpolation
df = ns.impute_nulls(df, column='Age', strategy='interpolate')

# Impute missing values for multiple columns
df = ns.impute_nulls(df, column=['Age', 'Gender'], strategy='most_frequent')

# Impute all features with missing values using automatic strategy detection
df = ns.impute_nulls(df)

# Drop rows with missing values
df = ns.impute_nulls(df, strategy="listwise")

# Create missing flags for multiple columns in new columns
df = ns.impute_nulls(df, column=['Age', 'Gender'], strategy="flag")

```
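
The `in_place` semantics described in the parameter list can be illustrated with a small pandas-only sketch. The helper `fill_with_mean` is hypothetical and only mirrors the documented behavior (mutate the input when `True`, return a copy when `False`); it is not NullSweep's implementation.

```python
import pandas as pd


def fill_with_mean(df: pd.DataFrame, column: str, in_place: bool = True) -> pd.DataFrame:
    """Hypothetical helper mirroring the documented `in_place` behavior."""
    # Work on the caller's DataFrame, or on a copy when in_place is False
    target = df if in_place else df.copy()
    target[column] = target[column].fillna(target[column].mean())
    return target


original = pd.DataFrame({'Age': [25, 30, None, 35, 40]})

# in_place=False: the result is a new DataFrame and the input is untouched
copied = fill_with_mean(original, 'Age', in_place=False)
missing_after_copy = int(original['Age'].isna().sum())

# in_place=True: the input itself is modified and returned
mutated = fill_with_mean(original, 'Age', in_place=True)
missing_after_mutation = int(original['Age'].isna().sum())

print(missing_after_copy, missing_after_mutation)
```

Because `impute_nulls` defaults to `in_place=True`, passing `in_place=False` is the way to keep the original DataFrame around for comparison.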

### Visualize Missing Values

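Use `plot_missing_values` to visualize missing data. Pass one of the following plot types as the second argument: `'heatmap'`, `'correlation'`, `'percentage'`, `'matrix'`, `'dendogram'`, `'upset_plot'`, `'pair'`, `'wordcloud'`, `'histogram'`.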

```python
import nullsweep as ns

# df is a DataFrame with missing values, such as the ones built above
figure = ns.plot_missing_values(df, "heatmap")
```

## Contributing

Contributions are welcome! Please feel free to submit pull requests, open issues, or suggest improvements.
