Metadata-Version: 2.1
Name: pyspark_eda
Version: 1.3.2
Summary: A Python package for univariate and bivariate data analysis using PySpark
Author: Tanya Irani
Author-email: tanyairani22@gmail.com.com
Keywords: data analysis pyspark univariate bivariate statistics
Requires-Python: >=3.6
Description-Content-Type: text/markdown
Requires-Dist: pyspark>=3.0.0
Requires-Dist: matplotlib>=3.0.0
Requires-Dist: numpy>=1.17.0
Requires-Dist: scipy>=1.4.0
Requires-Dist: seaborn>=0.11.0

# pyspark_eda

`pyspark_eda` is a Python library for performing exploratory data analysis (EDA) using PySpark. It offers functionalities for both univariate and bivariate analysis, handling missing values, outliers, and visualizing data distributions.

## Features

- **Univariate analysis:** Analyze numerical and categorical columns individually. Displays histogram and frequency distribution table if required.
- **Bivariate analysis:** Includes correlation, Cramer's V, and ANOVA. Displays scatter plot if required. 
- **Automatic handling:** Deals with missing values and outliers seamlessly.
- **Visualization:** Provides graphical representation of data distributions and relationships.

## Installation
You can install `pyspark_eda` via pip:

```bash
pip install pyspark_eda
```
## Function
### Univariate Analysis 
### Parameters
- **df** (*DataFrame*): The input PySpark DataFrame.
- **table_name** (*str*): The base table name to save the results
- **numerical_columns** (*list*): The numerical columns of the table on which you want the analysis to be performed.
- **categorical_columns** (*list*): The categorical columns of the table on which you want the analysis to be performed.
- **id_list** (*list*, optional): List of columns to exclude from analysis.
- **print_graphs** (*int*, optional): Whether to print graphs (1 for yes, 0 for no),default value is 0.

### Description
Performs univariate analysis on the DataFrame and prints summary statistics and visualizations.
It returns a table with the following columns : column , total_count, min, max, mean , mode, null_percentage, skewness , kurtosis, stddev ( which is the standard deviation), q1,q2 q3 (quartiles), mean_plus_3std, mean_minus_3std, outlier_percentage and frequency_distribution.
You can display the table to view the results.

### Example Usage
## get_univariate_analysis
```python
from pyspark.sql import SparkSession
from pyspark_eda import get_univariate_analysis

# Initialize Spark session
spark = SparkSession.builder.appName('DataAnalysis').getOrCreate()

# Load your data into a PySpark DataFrame
df = spark.read.csv('your_data.csv', header=True, inferSchema=True)

# Identify numerical and categorical columns
numerical_columns = ['col1', 'col2', 'col3']
categorical_columns = ['col4', 'col5', 'col6']

# Perform univariate analysis
get_univariate_analysis(df, table_name="your_table_name", numerical_columns=numerical_columns, categorical_columns=categorical_columns, id_list=['id_column'], print_graphs=1)
```

## Function
### Bivariate Analysis
### Parameters
- **df** (*DataFrame*): The input PySpark DataFrame.
- **table_name** (*str*): The base table name to save the results
- **numerical_columns** (*list*): The numerical columns of the table on which you want the analysis to be performed.
- **categorical_columns** (*list*): The categorical columns of the table on which you want the analysis to be performed.
- **id_columns** (*list, optional*): List of columns to exclude from analysis.
- **correlation_analysis** (*int, optional*): Whether to perform correlation analysis (1 for yes, 0 for no),default value is 1.
- **cramer_analysis** (*int, optional*): Whether to perform Cramer's V analysis (1 for yes, 0 for no), default value is 1.
- **anova_analysis** (*int, optional*): Whether to perform ANOVA analysis (1 for yes, 0 for no),default value is 1.
- **print_graphs** (*int, optional*): Whether to print graphs (1 for yes, 0 for no),default value is 0.

### Description
Performs bivariate analysis on the DataFrame, including Pearsons Correlation, Cramer's V, and ANOVA.
It returns a table with the following columns: Column_1, Column_2, Correlation_Coefficient, Cramers_V, Anova_F_Value,Anova_P_Value.
You can display the table to view the results.

### Example Usage 
### get_bivariate_analysis
```python
from pyspark.sql import SparkSession
from pyspark_eda import get_bivariate_analysis

# Initialize Spark session
spark = SparkSession.builder.appName('DataAnalysis').getOrCreate()

# Load your data into a PySpark DataFrame
df = spark.read.csv('your_data.csv', header=True, inferSchema=True)

# Identify numerical and categorical columns
numerical_columns = ['col1', 'col2', 'col3']
categorical_columns = ['col4', 'col5', 'col6']

# Perform bivariate analysis
get_bivariate_analysis(df, table_name="bivariate_analysis_results", numerical_columns=numerical_columns, categorical_columns=categorical_columns, id_columns=['id_column'], correlation_analysis=1, cramer_analysis=1, anova_analysis=1, print_graphs=1)
```

## Function
### Multivariate Analysis
### Parameters
- **df** (*DataFrame*): The input PySpark DataFrame.
- **table_name** (*str*): The base table name to save the results
- **numerical_columns** (*list*): The numerical columns of the table on which you want the analysis to be performed.
- **id_columns** (*list, optional*): List of columns to exclude from analysis.
- **vif_analysis** (*int, optional*): Whether to get VIF values (1 for yes, 0 for no),default value is 1.
- **decision_tree_analysis** (*int, optional*): Whether to perform decision tree (1 for yes, 0 for no), default value is 1.
- **target_column** (*str, optional*): What the target column for the decision tree is,default value is None.It is compulsary if you are perfomring decision tree.
- **depth** (*int, optional*): The depth of the decision tree ,default value is 3.

### Description
Performs multivariate analysis on the DataFrame, including Variance Inflation Factor (VIF), and Decision Tree.
It returns a table with the following columns: Feature, VIF. You can display the table to view the results. 
You get a png file with the decision tree. 

### Example Usage 
### get_multivariate_analysis
```python
from pyspark.sql import SparkSession
from pyspark_eda import get_bivariate_analysis

# Initialize Spark session
spark = SparkSession.builder.appName('DataAnalysis').getOrCreate()

# Load your data into a PySpark DataFrame
df = spark.read.csv('your_data.csv', header=True, inferSchema=True)

# Identify numerical columns
numerical_columns = ['col1', 'col2', 'col3']

# Perform bivariate analysis
get_bivariate_analysis(df, table_name="multivariate_analysis_results", numerical_columns=numerical_columns, id_columns=['id_column'], vif_analysis=1, decision_tree_analysis=1, target_column="target_column_name", depth=3)
```

## Contact
- **Author:** Tanya Irani
- **Email:** tanyairani22@gmail.com
