Metadata-Version: 2.4
Name: proxiflow
Version: 0.1.11
Summary: Data Preprocessing flow tool in python
Author-email: Martin Tomes <its@martintomes.net>
License: MIT
Project-URL: Repository, https://gitlab.com/martintomes/proxiflow
Project-URL: Documentation, http://proxiflow.readthedocs.io
Project-URL: Bug Tracker, https://gitlab.com/martintomes/proxiflow/-/issues
Classifier: Development Status :: 1 - Planning
Classifier: Environment :: Console
Classifier: Intended Audience :: Developers
Classifier: Operating System :: POSIX :: Linux
Classifier: Programming Language :: Python
Classifier: Programming Language :: Python :: Implementation :: CPython
Classifier: Programming Language :: Python :: 3.13
Classifier: Topic :: Scientific/Engineering
Requires-Python: >=3.13
Description-Content-Type: text/markdown
License-File: LICENSE
Requires-Dist: polars
Requires-Dist: numpy
Requires-Dist: click
Requires-Dist: pyaml
Requires-Dist: scipy
Requires-Dist: scikit-learn
Provides-Extra: dev
Requires-Dist: black; extra == "dev"
Requires-Dist: pytype; extra == "dev"
Requires-Dist: ruff; extra == "dev"
Requires-Dist: pytest; extra == "dev"
Requires-Dist: mypy; extra == "dev"
Requires-Dist: build; extra == "dev"
Provides-Extra: docs
Requires-Dist: sphinx-rtd-theme>=1.2.0; extra == "docs"
Requires-Dist: sphinx; extra == "docs"
Dynamic: license-file

[![PyPi version](https://badgen.net/pypi/v/proxiflow/)](https://pypi.org/project/proxiflow)
[![Documentation Status](https://readthedocs.org/projects/proxiflow/badge/?version=latest)](https://proxiflow.readthedocs.io/en/latest/?badge=latest)
[![PyPI download month](https://img.shields.io/pypi/dm/proxiflow.svg)](https://pypi.python.org/pypi/proxiflow/)
[![Maintenance](https://img.shields.io/badge/Maintained%3F-yes-green.svg)](https://github.com/itsmartintomes/proxiflow/graphs/commit-activity)
[![PyPI license](https://img.shields.io/pypi/l/proxiflow.svg)](https://pypi.python.org/pypi/proxiflow/)
[![tests](https://github.com/itsmartintomes/proxiflow/actions/workflows/tests.yml/badge.svg)](https://github.com/itsmartintomes/proxiflow/actions/workflows/tests.yml)


# ProxiFlow

ProxiFlow is a feature engineering and  data preprocessig tool for any 
kind of data processing (eg. machine learning) that performs
data cleaning, normalization, and feature engineering.

The biggest advantage of this library (which is basically a wrapper over [polars](https://github.com/pola-rs/polars) data frame) is that it is configurable via YAML configuration file which makes it suitable for MLOps pipelines or for building API requests over it.

## Documentation
Read the full documentation [here](http://proxiflow.readthedocs.io/).

## Usage

To use ProxiFlow, install it via pip:

```bash
pip install proxiflow
```

You can then call it from the command line:

```bash
proxiflow --config-file myconfig.yaml --input-file mydata.csv --output-file cleaned_data.csv
```

Here\'s an example of a YAML configuration file:

```yaml
input_format: csv
output_format: csv

data_cleaning: #mandatory
  # NOTE: Not handling missing values can cause errors during data normalization
  handle_missing_values:
    drop: false
    mean: true # Only Int and Float columns are handled 
    # mode: true # Turned off for now. 
    knn: true

  handle_outliers: true # Only Float columns are handled
  remove_duplicates: true

data_normalization: # mandatory
  min_max: #mandatory but values are not mandatory. It can be left empty
    # Specify columns:
    - Age # not mandatory
  z_score: 
    - Price 
  log:
    - Floors

feature_engineering:
  one_hot_encoding: # mandatory
    - Bedrooms      # not mandatory

  feature_scaling:  # mandatory
    degree: 2       # not mandatory. It specifies the polynominal degree
    columns:        # not mandatory
      - Floors      # not mandatory
```

The above configuration specifies that duplicate rows should be removed
and missing values should be dropped.

## API

ProxiFlow can also be used as a Python library. Here\'s an example:

```python
import polars as pl
from proxiflow.config import Config
from proxiflow.core import Cleaner

# Load the data
data = pl.read_csv("mydata.csv")

# Load the configuration
config = Config("myconfig.yaml")

# Clean the data
cleaner = Cleaner(config)
cleaned_data = cleaner.clean_data(data)

# Perform data normalization
normalizer = Normalizer(config)
normalized_data = normalizer.normalize(cleaned_data)

# Perform feature engineering
engineer = Engineer(config)
engineered_data = engineer.execute(normalized_data)

# Write the output data
engineered_data.write_csv("cleaned_data.csv")
```

## Log

-   \[x\] Data cleaning
    - \[x\] Missing values handling
        - \[x\] Mean
        - \[x\] Drop
        - \[x\] KNN Imputer 
        - \[x\] Median
-   \[x\] Data normalization
    - \[x\] Min Max normalization
    - \[x\] Z-Score normalization
    - \[x\] Logarithmic normalization
-   \[ ] Feature engineering
    - \[x\] One Hot Encoding
    - \[x\] Feature Scaling
    - \[ ] Recursive Feature Elimination
    - \[ ] SelectKBest
    - \[ ] LASSO regularization
-   \[ ] Text Preprocessing
    - \[ ] Tokenization
    - \[ ] Stemming
    - \[ ] Stopword removal
    - \[ ] Text Vectorization
        - \[ ] Bag of Words
        - \[ ] TF-IDF
    - \[ ] Word embeddings
        -  \[ ] Word2Vec
        -  \[ ] GloVe
        -  \[ ] BERT
- \[ ] Categorical Encoding
    - \[ ] Target encoding
    - \[ ] Count encoding
    - \[ ] Binary encoding
- \[ ] Dimensionality reduction
    - \[ ] Principal Component Analysis (PCA)
    - \[ ] t-Distributed Stochastic Neighbor Embedding (t-SNE)
    - \[ ] Uniform Manifold Approximation and Projection (UMAP) 
