Metadata-Version: 2.1
Name: dsx
Version: 0.9.3.0
Summary: A utilities pack for data science and analytics tasks. The core module ds_utils (Data Science Utilities) is designed to work with Pandas to simplify common tasks, such as generating metadata for a DataFrame, validating merged DataFrames, and visualizing DataFrames.
Home-page: https://nicdatalab.ml/data-analytics/dsx
Author: NicTsyen
Author-email: support@nicdatalab.com
License: GNU GENERAL PUBLIC LICENSE
Download-URL: https://github.com/NicTsyen/dsx
Platform: UNKNOWN
Classifier: Development Status :: 4 - Beta
Classifier: Intended Audience :: Developers
Classifier: Programming Language :: Python :: 3
Classifier: Programming Language :: Python :: 3.6
Classifier: Programming Language :: Python :: 3.7
Description-Content-Type: text/markdown
Requires-Dist: joblib
Requires-Dist: seaborn
Requires-Dist: pandas
Requires-Dist: numpy
Requires-Dist: scipy
Requires-Dist: matplotlib
Requires-Dist: openpyxl
Requires-Dist: regex
Requires-Dist: typing

# Data Science Utilities (DSX)
The **dsx** package contains a collection of wrapper functions that simplify common operations in data analytics tasks.
The core module ds_utils (data science utilities) is designed to work with Pandas DataFrames
to simplify common tasks.

The package can be used in the following environments:
- Jupyter Notebook
- Jupyter Lab
- PyCharm's Python Console
- IPython Console
- Python Script

![xqrid](https://i.imgur.com/yi2kZf6.png)

## Installation
- Install using pip:


    pip install dsx


## Documentation
Full Documentation Site: [https://nicdatalab.com/static/pages/docs_dsx/index.html](https://nicdatalab.com/static/pages/docs_dsx/index.html)


## 1. Core Module: "ds_utils"
The core module is "ds_utils".
The module contains functions that accomplish
common data analytics tasks with less code.
These functions are wrappers for commonly used Pandas methods, particularly
methods of the DataFrame object.

Some of the key features of the DataFrame utility functions are as follows:

- Generate metadata of columns in a DataFrame
  - Number & percentage of missing values
  - Number & percentage of unique values
  - Data Type
- Generate accumulated percentage of values in a column
- Quick Rename of a single column
- Reorder columns of a DataFrame
- Standardize column names into iPython-friendly names
- Retrieve column name(s) by a partial keyword
- Expand concatenated strings in a column into a child table
- Visualize DataFrame object
  - DataGrid Viewer
  - Pivot Table Viewer
  - Quick Analyzer (Pivot table and visualizations)
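
To make the first two features concrete, here is a minimal sketch in plain Pandas of what a column metadata report and an accumulated percentage report compute. It does not use or reproduce the dsx API: the helper names `column_metadata` and `cumulative_percentage`, the output column names, and the sample data are all hypothetical.

```python
import numpy as np
import pandas as pd


def column_metadata(df: pd.DataFrame) -> pd.DataFrame:
    """Summarize each column: dtype, missing values, unique values.

    Hypothetical helper for illustration; not part of dsx.
    """
    # Avoid division by zero when the DataFrame has no rows.
    n_rows = max(len(df), 1)
    meta = pd.DataFrame({
        "dtype": df.dtypes.astype(str),
        "n_missing": df.isna().sum(),
        "n_unique": df.nunique(dropna=True),
    })
    meta["pct_missing"] = meta["n_missing"] / n_rows * 100
    meta["pct_unique"] = meta["n_unique"] / n_rows * 100
    return meta


def cumulative_percentage(df: pd.DataFrame, column: str) -> pd.DataFrame:
    """Count each value in a column and accumulate its share of rows.

    Hypothetical helper for illustration; not part of dsx.
    """
    # value_counts sorts by frequency, most common value first.
    counts = df[column].value_counts(dropna=False)
    out = counts.rename("count").to_frame()
    out["pct"] = out["count"] / out["count"].sum() * 100
    out["cum_pct"] = out["pct"].cumsum()
    return out


# Made-up sample data, one missing value per column.
sample = pd.DataFrame({
    "city": ["KL", "KL", "Penang", None],
    "sales": [10.0, np.nan, 30.0, 40.0],
})

meta = column_metadata(sample)
cum = cumulative_percentage(sample, "city")
print(meta)
print(cum)
```

The wrappers in dsx presumably bundle this kind of boilerplate behind a single call; refer to the documentation site for the actual function names and output columns.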


### 1.1 Usage

Below is example code for importing the module:

    from dsx.ds_utils import *


There are 2 categories of methods in **dsx's** classes, which are called in different ways:
- **Methods:** Instance methods, bound to a specific DataFrame
  - Invoke through the extended domain (**'ds'**) of the native DataFrame object


    import os
    import pandas as pd
    df = pd.read_excel(os.path.join(os.getcwd(), "data.xlsx"))
    df.ds.isnull("Column_Name")


- **Static functions:** Functions called on the class itself rather than on an instance
  - Invoke as a static function of the pd_utils class, passing the DataFrame as the first argument


    df = pd.read_excel(os.path.join(os.getcwd(), "data.xlsx"))
    dsx.isnull(df, "Column_Name")

![xpvt](https://i.imgur.com/0NAN16i.png)
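
The two calling styles above resemble the accessor pattern that Pandas provides for extending DataFrame objects. The sketch below shows how such a pattern can be built with `pd.api.extensions.register_dataframe_accessor`. It is an assumption about the general mechanism, not the dsx source code: the class `DemoUtils`, the namespace `demo`, and both method names are hypothetical.

```python
import pandas as pd


# Register the class under a hypothetical "demo" namespace, so that
# every DataFrame gains a `df.demo` attribute.
@pd.api.extensions.register_dataframe_accessor("demo")
class DemoUtils:
    """Hypothetical utility class; not the dsx implementation."""

    def __init__(self, df: pd.DataFrame):
        # Pandas passes the DataFrame in when `df.demo` is accessed.
        self._df = df

    @staticmethod
    def count_missing(df: pd.DataFrame, column: str) -> int:
        """Static style: the DataFrame is passed explicitly."""
        return int(df[column].isna().sum())

    def missing(self, column: str) -> int:
        """Accessor style: the DataFrame is taken from the instance."""
        return DemoUtils.count_missing(self._df, column)


df = pd.DataFrame({"a": [1.0, None, 3.0]})

via_accessor = df.demo.missing("a")            # accessor style
via_static = DemoUtils.count_missing(df, "a")  # static style
print(via_accessor, via_static)
```

In this sketch the accessor method simply delegates to the static function, so both styles share one implementation. The accessor style reads well in method chains, while the static style makes the DataFrame argument explicit.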


## 2. Data Science Workflow "ds_workflow" (Active Development / Work-In-Progress)
The **"ml_utils"** module contains the methods for simplifying common tasks in
a data science workflow. The methods are built on top of the functions in 
the core module **"pd_utils"**.


Some of the key features of the module are as follows:
- Get the column names of the categorical features
- Get the column names of the numerical features
- Create or merge the dummy variables created from categorical features,
  with an option to use k-1 dummification
- Data Exploration
  - Generate barplot and accumulated percentage report for all the categorical features
  - Generate distribution plot for all the numerical features
  - Generate heatmap of the correlation matrix
- Preprocessing
  - Create a dataframe with all standardized features merged with other features
  - Generate features list 
- Model Assessment
  - Generate Recall-Precision-Threshold Curve
  - Generate truepositive_falsepositive Curve


### 2.1 Usage
The methods in the module are callable only through the extended domain **'ml'** of the native Pandas DataFrame object.


Calling a method in **"ml_workflow"**:

    import os
    import pandas as pd

    df = pd.read_excel(os.path.join(os.getcwd(), "data.xlsx"))

    cols_categorical = df.ml.get_features_categorical()




