Metadata-Version: 2.1
Name: datawig
Version: 0.0.1
Summary: Imputation for tables with missing values
Home-page: https://github.com/awslabs/datawig
Author: datawig-dev
Author-email: datawig-dev@amazon.com
Maintainer-email: datawig-dev@amazon.com
License: UNKNOWN
Platform: UNKNOWN
Classifier: License :: OSI Approved :: Apache Software License
Classifier: Programming Language :: Python :: 3 :: Only
Description-Content-Type: text/markdown
Requires-Dist: numpy (==1.15.0)
Requires-Dist: scikit-learn[alldeps] (==0.19.0)
Requires-Dist: pytest (==3.2.1)
Requires-Dist: pandas (==0.22.0)
Requires-Dist: mxnet (==1.3.0b20180820)

DataWig - Imputation for Tables
================================

DataWig learns models to impute missing values in tables. 

For each to-be-imputed column, DataWig trains a supervised machine learning model
to predict the observed values in that column from the values in other columns.
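
The idea can be illustrated with a deliberately tiny sketch in plain Python. This is not DataWig's model or API: the lookup "model", the helper names `fit_lookup_imputer` and `impute`, and the toy rows are all assumptions made for illustration. The sketch only shows the general pattern of training on rows where the target column is observed and predicting for rows where it is missing.

```python
from collections import Counter, defaultdict


def fit_lookup_imputer(rows, input_column, output_column):
    """Learn the most frequent output value for each observed input value.

    Hypothetical stand-in for a learned model; not part of DataWig.
    """
    counts = defaultdict(Counter)
    for row in rows:
        target = row.get(output_column)
        if target is not None:  # train only on rows where the target is observed
            counts[row[input_column]][target] += 1
    return {key: counter.most_common(1)[0][0] for key, counter in counts.items()}


def impute(rows, model, input_column, output_column):
    """Return copies of the rows with missing output values filled in."""
    imputed = []
    for row in rows:
        filled = dict(row)  # copy so the original rows stay untouched
        if filled.get(output_column) is None:
            # Stays None if the input value was never seen during training.
            filled[output_column] = model.get(filled[input_column])
        imputed.append(filled)
    return imputed


# Toy data (made up): "brand" is the to-be-imputed column.
rows = [
    {"category": "phone", "brand": "Acme"},
    {"category": "phone", "brand": "Acme"},
    {"category": "phone", "brand": "Globex"},
    {"category": "laptop", "brand": "Globex"},
    {"category": "phone", "brand": None},   # to be imputed
    {"category": "laptop", "brand": None},  # to be imputed
]

model = fit_lookup_imputer(rows, "category", "brand")
result = impute(rows, model, "category", "brand")
```

A real imputer would replace the lookup table with a trained model that can generalize to unseen input values.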


# Installation

The easiest way to install the package is with a virtualenv and pip.

Set up a virtualenv in the root directory of the package:

```
python3.6 -m venv venv
```

Install the package:

```
./venv/bin/pip install -e .
```

Install the development requirements and run the tests:

```
./venv/bin/pip install -r requirements/requirements.dev.txt
./venv/bin/python -m pytest
```

# Usage 

The imputation API expects your data as a [pandas DataFrame](https://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.html).

For most use cases, the `SimpleImputer` class is the best starting point.
Provide the name of the column you would like to impute (called `output_column` below) and the names of the columns
whose values you consider useful for imputing it (called `input_columns` below).


```python
from datawig import SimpleImputer
import pandas as pd

# some test data stored in the test/resources folder
df_train = pd.read_csv("training_data.csv")
df_test = pd.read_csv("testing_data_files.csv")

# this is where the model artifacts and metrics will be stored
output_path = "imputer_model"

# Initialize and train Imputer
imputer = SimpleImputer(
    input_columns=["item_name", "bullet_point"],  # columns containing information about the column we want to impute
    output_column="brand"  # the column we'd like to impute values for
).fit(train_df=df_train)

# Impute missing values on test data
imputed = imputer.predict(df_test)
```
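
If all of your data lives in a single table, one possible way to prepare it is to split the rows by whether the output column is observed. The sketch below uses only pandas. The toy values are made up, and the column names simply mirror the example above.

```python
import pandas as pd

# Hypothetical toy data; column names mirror the example above.
df = pd.DataFrame(
    {
        "item_name": ["Acme phone X", "Globex laptop 15", "Acme phone Y", "Globex tablet"],
        "bullet_point": ["5 inch screen", "15 inch screen", "6 inch screen", "10 inch screen"],
        "brand": ["Acme", "Globex", None, None],
    }
)

output_column = "brand"

# Rows with an observed value can serve as training data ...
df_train = df[df[output_column].notnull()]
# ... and rows with a missing value are the ones to impute.
df_to_impute = df[df[output_column].isnull()]
```

The resulting frames could then be passed to `fit(train_df=...)` and `predict(...)` in the same way as `df_train` and `df_test` above.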


