Metadata-Version: 2.1
Name: data-cleaning
Version: 1.0.0
Summary: A utility that cleans your data and returns the cleaned data
Home-page: https://github.com/DataPreprocessing/DataCleaning
Author: Murali krishna mopi devi, Induraj P.Ramamurthy
Author-email: mopidevi@gmail.com, induraj.gandhian@yahoo.com
License: UNKNOWN
Platform: UNKNOWN
Classifier: Development Status :: 3 - Alpha
Classifier: Intended Audience :: Science/Research
Classifier: License :: OSI Approved :: MIT License
Classifier: Natural Language :: English
Classifier: Operating System :: OS Independent
Classifier: Programming Language :: Python :: 3.6
Classifier: Programming Language :: Python :: 3.7
Classifier: Programming Language :: Python :: 3.8
Classifier: Topic :: Software Development :: Libraries :: Python Modules
Classifier: Topic :: Scientific/Engineering :: Artificial Intelligence
Requires-Python: >=3.6
Description-Content-Type: text/markdown


<img src="https://github.com/DataPreprocessing/DataCleaning/blob/mopidevimu/img/datacleaning.png" width="100%"/>

![PyPI - Python Version](https://img.shields.io/pypi/pyversions/data-cleaning)
![PyPI - License](https://img.shields.io/pypi/l/data-cleaning)
![PyPI](https://img.shields.io/pypi/v/data-cleaning)
![GitHub repo size](https://img.shields.io/github/repo-size/DataPreprocessing/DataCleaning)

<h1 align="center">DATA CLEANING</h1>

## Description
<p>In any machine learning process, data preprocessing is the first step, in which raw, unclean data are transformed
into cleaned data so that machine learning algorithms can be applied at a later stage. This Python package makes
data preprocessing easy, in just two lines of code. All you have to do is provide raw data (a CSV file); the library
cleans your data and returns the cleaned dataframe, on which you can then apply feature engineering,
feature selection, and modeling.</p>

- What does this do?
    * Cleans special characters
    * Removes duplicates
    * Fixes abnormalities in column names
    * Imputes the data (categorical & numerical)
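As a rough illustration of two of these steps (removing duplicates and fixing column names), here is a minimal pandas sketch. It is not the package's own implementation: the sample data, the `normalize_column` helper, and the naming rule (lower case, underscores) are assumptions made for this example.

```python
import re

import pandas as pd

# Hypothetical raw data: messy column names and one duplicated row.
raw = pd.DataFrame(
    {
        " First Name ": ["Ann", "Bob", "Ann"],
        "Total-Amount($)": [10, 20, 10],
    }
)


def normalize_column(name: str) -> str:
    """Lower-case a column name and collapse non-alphanumeric runs into '_'."""
    return re.sub(r"[^0-9a-zA-Z]+", "_", name.strip()).strip("_").lower()


# Rename every column, then drop rows that are exact duplicates.
cleaned = raw.rename(columns=normalize_column).drop_duplicates().reset_index(drop=True)

print(list(cleaned.columns))  # ['first_name', 'total_amount']
print(len(cleaned))           # 2
```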

## Data Cleaning

<p>Data-cleaning is a Python package for data preprocessing. It cleans the CSV file and returns the <b>cleaned data frame</b>.
It handles imputation, duplicate removal, special-character replacement, and more.</p>

## How to use:

Step 1:
  Install the library.

````shell
pip install data-cleaning
````
Step 2:

  Import the library and specify the path to the CSV file.
````python
from datacleaning import DataCleaning

dp = DataCleaning(file_uploaded='filename.csv')
cleaned_df = dp.start_cleaning()
````

You can also specify the optional parameters listed below.

## Usage:

````python
from datacleaning import DataCleaning

DataCleaning(file_uploaded='filename.csv', separator=",", row_threshold=None, col_threshold=None,
         special_character=None, action=None, ignore_columns=None, imputation_type="RDF")
````

## Parameters

------

| Parameter | Default Value | Allowed Values | Example |
| ------ | ------ | ------ | ------ |
| file_uploaded | ***None*** | Provide a CSV file. | filename.csv |
| separator | ***,*** | Separator used in the CSV file. | ; |
| row_threshold | ***None*** | 0 to 100 | 80 |
| col_threshold | ***None*** | 0 to 100 | 80 |
| special_character | See default_list below | Characters to add to or remove from <br> default_list (see below). | [ '$' , '?' ] |
| action | ***None*** | ***add*** or ***remove*** | add |
| ignore_columns | ***None*** | List of column names to skip <br> in the special-character operation. | [ 'column1', 'column2' ] |
| imputation_type | ***RDF*** | Preferred imputation method: <br> ***RDF***, ***KNN***, ***mean***, ***median***, ***most_frequent***, ***constant***. | KNN |


## Examples of using parameters

### - Appending extra special characters to the existing default_list

The default special characters included in the package are shown below:

````python
default_list = ["!", '"', "#", "%", "&", "'", "(", ")",
                "*", "+", ",", "-", ".", "/", ":", ";", "<",
                "=", ">", "?", "@", "[", "\\", "]", "^", "_",
                "`", "{", "|", "}", "~", "–", "//", "%*", ":/", ".;", "Ø", "§", "$", "£"]
````
To remove a special character, for example "?" and "%":

<i>Note: Do not forget to set <b>action='remove'</b>.</i>

````python
from datacleaning import DataCleaning

dp = DataCleaning(file_uploaded='filename.csv', special_character=['?', '%'], action='remove')
cleaned_df = dp.start_cleaning()
````
To add a special character that is not in the default_list given above, for example "é":

<i>Note: Do not forget to set <b>action='add'</b>.</i>

````python
from datacleaning import DataCleaning

dp = DataCleaning(file_uploaded='filename.csv', special_character=['é'], action='add')
cleaned_df = dp.start_cleaning()
````
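The sketch below shows one plausible reading of how `action` could combine `special_character` with the default list. It is an assumption for illustration, not the package's confirmed behavior: `DEFAULT_LIST` is a shortened stand-in, and `resolve_special_characters` and `strip_characters` are hypothetical helpers. It assumes that `'add'` extends the list, that `'remove'` takes entries out of it, and that listed characters are deleted from the text.

```python
# Shortened stand-in for the package's default_list (assumption for this sketch).
DEFAULT_LIST = ["!", "#", "%", "?", "@"]


def resolve_special_characters(special_character=None, action=None, default=DEFAULT_LIST):
    """Return the characters to strip after applying the add/remove action."""
    chars = list(default)
    if not special_character or action is None:
        return chars
    if action == "add":
        chars += [c for c in special_character if c not in chars]
    elif action == "remove":
        chars = [c for c in chars if c not in special_character]
    else:
        raise ValueError("action must be 'add' or 'remove'")
    return chars


def strip_characters(text, chars):
    """Delete every occurrence of each listed character sequence from text."""
    # Longest entries first, so multi-character entries are handled before single ones.
    for c in sorted(chars, key=len, reverse=True):
        text = text.replace(c, "")
    return text


kept = resolve_special_characters(["?", "%"], action="remove")
print(strip_characters("50% off?!", kept))  # 50% off?

added = resolve_special_characters(["é"], action="add")
print(strip_characters("café #1", added))   # caf 1
```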

### - Ignoring particular columns and adding a special character
For example, to ignore the columns named "timestamp" and "date" and add the special character "é":

````python
from datacleaning import DataCleaning

dp = DataCleaning(file_uploaded='filename.csv', special_character=['é'],
                  action='add', ignore_columns=['timestamp', 'date'])
cleaned_df = dp.start_cleaning()
````
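To see why ignoring columns matters, consider a timestamp column whose "-" and ":" characters would otherwise be stripped. The following pandas sketch illustrates the idea only; the sample data and the loop are assumptions and do not reproduce the package's internals.

```python
import re

import pandas as pd

# Hypothetical data: the timestamp column contains '-' and ':' that should survive.
df = pd.DataFrame(
    {
        "timestamp": ["2021-01-05 10:30", "2021-01-06 11:45"],
        "city": ["Par!is", "Ber#lin"],
    }
)

special_characters = ["!", "#", ":", "-"]
ignore_columns = ["timestamp"]

# Build one regex that matches any listed character literally.
pattern = "|".join(re.escape(c) for c in special_characters)

for column in df.columns:
    if column in ignore_columns:
        continue  # leave ignored columns untouched
    # Only text columns can contain special characters.
    if pd.api.types.is_string_dtype(df[column]):
        df[column] = df[column].str.replace(pattern, "", regex=True)

print(df["city"].tolist())       # ['Paris', 'Berlin']
print(df["timestamp"].tolist())  # ['2021-01-05 10:30', '2021-01-06 11:45']
```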

### - Setting thresholds to remove rows/columns whose null values exceed the given threshold

````python
from datacleaning import DataCleaning

dp = DataCleaning(file_uploaded='filename.csv', row_threshold=50, col_threshold=90)
cleaned_df = dp.start_cleaning()
````    
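A minimal pandas sketch of threshold-based removal follows. It assumes each threshold is the maximum percentage of null values allowed in a row or column, and that columns are filtered before rows; both points, and the `drop_sparse` helper itself, are assumptions for illustration rather than documented package behavior.

```python
import pandas as pd

# Hypothetical data with missing values.
df = pd.DataFrame(
    {
        "a": [1, None, 3, None],     # 50% null
        "b": [None, None, None, 4],  # 75% null
        "c": [1, 2, 3, 4],           # 0% null
    }
)


def drop_sparse(frame, row_threshold=None, col_threshold=None):
    """Drop columns, then rows, whose share of nulls (in percent) exceeds the threshold."""
    if col_threshold is not None:
        col_null_pct = frame.isna().mean(axis=0) * 100
        frame = frame.loc[:, col_null_pct <= col_threshold]
    if row_threshold is not None:
        row_null_pct = frame.isna().mean(axis=1) * 100
        frame = frame.loc[row_null_pct <= row_threshold]
    return frame


result = drop_sparse(df, row_threshold=40, col_threshold=50)
print(list(result.columns))  # ['a', 'c']
print(list(result.index))    # [0, 2]
```

Here column "b" is dropped because 75% of its values are null, and the two rows that are 50% null in the remaining columns are dropped because they exceed the 40% row threshold.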

### - Imputation methods available

  - RDF (random forest), the default
  - KNN (k-nearest neighbors)
  - mean
  - median
  - most_frequent
  - constant

````python
# Example for KNN imputation.
from datacleaning import DataCleaning

dp = DataCleaning(file_uploaded='filename.csv', imputation_type='KNN')
cleaned_df = dp.start_cleaning()
````
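For intuition, the four simple strategies can be sketched with plain pandas as below. This is only an illustration with a hypothetical `simple_impute` helper and sample data; it does not cover RDF or KNN, and it is not the package's implementation. The strategy names happen to match those of scikit-learn's `SimpleImputer`, but whether the package relies on it is not stated here.

```python
import pandas as pd

# Hypothetical data with one missing value per column.
df = pd.DataFrame(
    {
        "age": [20.0, None, 40.0, 60.0],
        "city": ["Paris", "Paris", None, "Berlin"],
    }
)


def simple_impute(series, strategy, fill_value=None):
    """Fill missing values in one column using a simple statistic or a constant."""
    if strategy == "mean":
        value = series.mean()
    elif strategy == "median":
        value = series.median()
    elif strategy == "most_frequent":
        value = series.mode().iloc[0]
    elif strategy == "constant":
        if fill_value is None:
            raise ValueError("fill_value is required for the 'constant' strategy")
        value = fill_value
    else:
        raise ValueError(f"unsupported strategy: {strategy}")
    return series.fillna(value)


print(simple_impute(df["age"], "mean").tolist())
# [20.0, 40.0, 40.0, 60.0]
print(simple_impute(df["city"], "most_frequent").tolist())
# ['Paris', 'Paris', 'Paris', 'Berlin']
print(simple_impute(df["city"], "constant", fill_value="unknown").tolist())
# ['Paris', 'Paris', 'unknown', 'Berlin']
```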

<h2 align="center"> >> THANK YOU << </h2>


