Metadata-Version: 2.1
Name: data-auto-profiler
Version: 0.1.1
Summary: Automated data profiling and visualization tool
Home-page: https://github.com/CharlesMaponya/autoprofile
Author: Charles Maponya
Author-email: maponyacl@gmail.com
Classifier: Programming Language :: Python :: 3
Classifier: License :: OSI Approved :: MIT License
Classifier: Operating System :: OS Independent
Requires-Python: >=3.8
Description-Content-Type: text/markdown
License-File: LICENSE
Requires-Dist: pandas>=2.2.2
Requires-Dist: numpy>=1.26.4
Requires-Dist: plotly>=5.22.0
Requires-Dist: scipy>=1.13.1
Requires-Dist: nbformat>=4.2.0

# Data-Auto-Profiler

Data-Auto-Profiler is a powerful Python package designed to streamline the data analysis process by automatically generating comprehensive insights about your datasets. Whether you're a data scientist looking to quickly understand a new dataset or an analyst preparing a detailed data quality report, Data-Auto-Profiler provides the tools you need to uncover meaningful patterns and potential issues in your data.

## Understanding Data-Auto-Profiler's Core Features

Data-Auto-Profiler excels at four key areas of data analysis:

### Data Quality Assessment

At its heart, Data-Auto-Profiler helps you understand the reliability and completeness of your data. The package automatically evaluates your dataset for common quality issues by:

- Calculating an overall completeness score that tells you at a glance how much of your data is actually usable
- Identifying missing values and their patterns across different features
- Detecting numerical outliers that might skew your analysis
- Finding duplicate records that could affect your model's performance
- Determining the most appropriate data type for each column

### Statistical Analysis

Data-Auto-Profiler performs a thorough statistical examination of your data, helping you understand the underlying distributions and characteristics of each feature. This includes:

- Computing essential descriptive statistics like mean, median, and standard deviation
- Analyzing the shape of your data distributions through skewness and kurtosis measurements
- Calculating variance to understand the spread of your numerical features
- Generating visualizations that make these statistics intuitive and actionable

### Feature Relationships

Understanding how different features relate to each other is crucial for any data analysis project. Data-Auto-Profiler provides several methods to explore these relationships:

- Calculating Pearson correlations between numerical features to identify linear relationships
- Using Cramér's V analysis to understand associations between categorical variables
- Creating interactive pairplot visualizations that let you explore relationships visually
- Analyzing how each feature relates to your target variable

### Predictive Power Assessment

For machine learning projects, Data-Auto-Profiler helps you understand which features might be most useful through:

- Information Value (IV) calculations that measure each feature's predictive strength
- Feature importance rankings that help you prioritize which variables to focus on
- Detailed analysis of how each feature relates to your target variable

## Getting Started with Data-Auto-Profiler

### Installation

First, install Data-Auto-Profiler using pip:

```bash
pip install data-auto-profiler
```

The package requires several common data science libraries:

```bash
pip install pandas numpy plotly scipy
```

### Basic Usage

Here's a simple example to get you started:

```python
import pandas as pd
from data_auto_profiler import AutoProfile

# Load your dataset
data = pd.read_csv('your_dataset.csv')

# Create an Data-Auto-Profiler instance
# The target_column parameter tells Data-Auto-Profiler which variable you're trying to predict
profiler = AutoProfile(data=data, target_column='target')

```

### Detailed Analysis Methods

Let's explore each analysis method in detail:

#### Completeness Analysis
```python
profiler.summary(completeness=True)
```

This generates an intuitive gauge chart showing your dataset's overall completeness on a 0-100% scale. The visualization uses color coding to quickly communicate data quality:
- Green (≥80%): Excellent completeness
- Orange (50-80%): Moderate completeness
- Red (≤50%): Poor completeness

#### Missing Value Analysis
```python
profiler.summary(missing=True)
```

This creates a detailed bar chart showing the percentage of missing values in each column. This visualization helps you:
- Identify which features have the most missing data
- Understand patterns in data collection issues
- Make informed decisions about imputation strategies

#### Outlier Detection
```python
profiler.summary(outliers=True)
```

Using the Interquartile Range (IQR) method, this analysis identifies statistical outliers in your numerical features. The resulting visualization shows:
- The number of outliers per feature
- Their distribution within the data
- Potential data quality issues that need investigation

#### Feature Importance Analysis with Information Value/gain
```python
profiler.summary(iv_analysis=True)
```

This analysis calculates the Information Value (IV) for each feature, helping you understand their predictive power:
- Very strong predictors (IV > 0.3)
- Strong predictors (0.1 ≤ IV < 0.3)
- Medium predictors (0.02 ≤ IV < 0.1)
- Weak predictors (IV < 0.02)

#### Cramér's V Association Analysis

```python
profiler.summary(cramers_v_analysis=True)
```

This analysis calculates the Information Value (IV) for each feature, helping you understand their predictive power:
- Very strong predictors (IV > 0.3)
- Strong predictors (0.1 ≤ IV < 0.3)
- Medium predictors (0.02 ≤ IV < 0.1)
- Weak predictors (IV < 0.02)

### Correlation Analysis

```python
profiler.summary(correlation=True)
```
This analysis calculates the correlation between all pairs of numerical features, providing insights into:
- Pearson correlation for numeric features
- Cramér's V for categorical features

### Autodistribution plots for a given column


```python
profiler.summary(distribution='column_name')
```

This analysis generates an autodistribution plot for the specified column, providing insights into its distribution and potential data skewedness. it accounts for binary columns, categorical columns and numeric columns.


## Important Considerations

When using Data-Auto-Profiler, keep these points in mind:

1. Memory Usage: The pairplot analysis can be memory-intensive for large datasets with many features. Consider using it selectively on smaller feature sets.

2. Performance: Some analyses, particularly Cramér's V calculations for categorical variables, may take longer with large datasets or features with many unique values.

3. Target Variable Requirements: For certain analyses like Information Value calculations, your target variable must be numeric.

4. Missing Value Handling: While Data-Auto-Profiler handles missing values automatically, their presence may affect certain statistical calculations.


## 📋 Requirements

- Python 3.8+
- pandas
- numpy
- plotly
- scipy



## Contributing to Data-Auto-Profiler

We welcome contributions from the community! If you'd like to improve Data-Auto-Profiler:

1. Fork the repository
2. Create a new branch for your feature
3. Submit a pull request with your changes

For significant changes, please open an issue first to discuss your proposed modifications.

### Contribution Guidelines

- Follow PEP 8 Style Guide
- Write Comprehensive Tests
- Document New Features
- Maintain Code Quality


## License

Data-Auto-Profiler is available under the MIT License, allowing for both personal and commercial use with proper attribution.

## 📞 Support

- Open GitHub Issues
- Email: \[maponyacl@gmail.com\]

## 🌟 Acknowledgements

- Inspired by data science community
- Built with ❤️ for data explorers

## 🚀 Future Roadmap

-  Machine Learning Model Integration
-  Advanced Anomaly Detection
-  Enhanced Visualization Themes
