Metadata-Version: 2.4
Name: datasetsense
Version: 0.1.0
Summary: Automated EDA Narrator + Data Quality Scoring Tool
Project-URL: Homepage, https://github.com/Alan-Turing18this-Year/Automated-EDA-Narrator-Data-Quality-Scoring-Tool
Author: Mark Oraño, Jomar Ligas, Lex Lumantas, Philip Tupas, Josh Ganhinhin
License: MIT
License-File: LICENSE
Classifier: License :: OSI Approved :: MIT License
Classifier: Operating System :: OS Independent
Classifier: Programming Language :: Python :: 3
Classifier: Topic :: Scientific/Engineering :: Information Analysis
Requires-Python: >=3.8
Requires-Dist: numpy
Requires-Dist: pandas>=1.5
Requires-Dist: python-dateutil
Requires-Dist: scipy
Requires-Dist: tabulate
Description-Content-Type: text/markdown

<img width="1920" height="1080" alt="1" src="https://github.com/user-attachments/assets/24f98fd5-f0c4-46ab-b243-3c65dcf2b622" />
<img width="1920" height="1080" alt="2" src="https://github.com/user-attachments/assets/bec9141f-2ea4-4c8c-9a79-cbfa335a58d7" />
<img width="1920" height="1080" alt="3" src="https://github.com/user-attachments/assets/2fc72025-7b48-4ef0-8899-1cd6b572b058" />
<img width="1920" height="1080" alt="4" src="https://github.com/user-attachments/assets/8aaa4d32-a7d3-4528-92e3-27dde9fa24fe" />
<img width="1920" height="1080" alt="5" src="https://github.com/user-attachments/assets/3cbe67a9-aaf3-4b51-aeec-e00280c16113" />
<img width="1920" height="1080" alt="6" src="https://github.com/user-attachments/assets/ee43d952-aff8-4c6a-90ad-3e72345b019d" />
<img width="1920" height="1080" alt="7" src="https://github.com/user-attachments/assets/4665c6f1-db98-4d84-a024-114e06437b84" />
<img width="1920" height="1080" alt="8" src="https://github.com/user-attachments/assets/4ccc88ac-0636-4544-a321-193f5826f671" />
<img width="1920" height="1080" alt="9" src="https://github.com/user-attachments/assets/ea671bbd-6a6e-4eda-af94-33f35daf2b2b" />
<img width="1920" height="1080" alt="10" src="https://github.com/user-attachments/assets/839655ed-b3bf-4cec-97ff-1f11f00787e7" />


# DatasetSense: Automated EDA Narrator + Data Quality Scoring Tool
## Project Overview
DatasetSense is a Python tool that performs **automated exploratory data analysis (EDA)** and computes a **dataset quality score (0–100)**. It generates **human-readable insights** and produces a **markdown report** summarizing dataset characteristics and quality.  

The project demonstrates **object-oriented programming (OOP)** concepts including **encapsulation, inheritance, polymorphism, composition, and dunder methods**.

---

## Features

### Automated EDA
- Statistical profiling (mean, std, quartiles)
- Categorical profiling (frequency distribution, unique ratio)
- Outlier detection summary
- Missing value analysis per feature
- Duplicate row detection
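
The checks listed above can be sketched with the standard library alone. This is an illustration only: the function names are hypothetical, the 1.5 × IQR fence is an assumed outlier rule (the README does not say which rule DatasetSense uses), and the tool itself works on pandas DataFrames rather than plain lists.

```python
import statistics
from collections import Counter


def missing_ratio(values):
    """Fraction of entries that are None (0.0 for an empty column)."""
    return sum(v is None for v in values) / len(values) if values else 0.0


def iqr_outliers(values, k=1.5):
    """Return values outside [Q1 - k*IQR, Q3 + k*IQR], skipping None entries.

    Assumption: an IQR fence is used here purely for illustration.
    """
    data = [v for v in values if v is not None]
    q1, _, q3 = statistics.quantiles(data, n=4)
    spread = q3 - q1
    low, high = q1 - k * spread, q3 + k * spread
    return [v for v in data if v < low or v > high]


def duplicate_row_count(rows):
    """Number of rows that repeat an earlier, identical row."""
    return sum(count - 1 for count in Counter(rows).values())


column = [10, 12, 11, 13, 12, 11, 100, None]
rows = [(1, "a"), (2, "b"), (1, "a")]

print(missing_ratio(column))
print(iqr_outliers(column))
print(duplicate_row_count(rows))
```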

### Data Quality Intelligence
| Metric          | Basis                    | Weight |
| --------------- | ------------------------ | ------ |
| Missing Score   | % missing values         | 35%    |
| Duplicate Score | duplicate row %          | 15%    |
| Outlier Score   | detected outliers vs N   | 25%    |
| Balance Score   | categorical distribution | 25%    |

- Sub-scores for missing values, duplicates, outliers, and categorical balance
- Final weighted score (0–100)
- Quality verdict: Excellent / Good / Fair / Poor
- Supports custom weights for flexible scoring strategies
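
As a rough sketch of how the weights in the table could combine four sub-scores into one number, the snippet below computes a weighted sum. It assumes each sub-score is already on a 0–100 scale. The function names and the verdict cutoffs (90 / 75 / 50) are hypothetical; the README does not state the thresholds the tool actually uses.

```python
# Default weights taken from the table above.
DEFAULT_WEIGHTS = {"missing": 0.35, "duplicates": 0.15, "outliers": 0.25, "balance": 0.25}


def overall_score(sub_scores, weights=DEFAULT_WEIGHTS):
    """Weighted sum of sub-scores, each assumed to be on a 0-100 scale."""
    return sum(weights[name] * sub_scores[name] for name in weights)


def verdict(score, cutoffs=((90, "Excellent"), (75, "Good"), (50, "Fair"))):
    """Map a score to a label. The cutoffs are illustrative assumptions."""
    for minimum, label in cutoffs:
        if score >= minimum:
            return label
    return "Poor"


sub_scores = {"missing": 90.0, "duplicates": 100.0, "outliers": 80.0, "balance": 70.0}
total = overall_score(sub_scores)
print(round(total, 2), verdict(total))
```

Because the weights sum to 1, the result stays on the same 0–100 scale as the sub-scores.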

### Natural-Language Narration
- Generates an explanation of dataset shape, variability, missing values, outliers, and the quality verdict
- Converts analysis metrics into human-readable insights
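
One simple way to turn metrics into sentences is template-based narration, sketched below. The `narrate` function and its parameters are hypothetical and are not the `Narrator` class's actual interface.

```python
def narrate(n_rows, n_cols, missing_pct, n_outliers, verdict):
    """Turn a handful of metrics into short report sentences."""
    sentences = [f"The dataset has {n_rows} rows and {n_cols} columns."]
    if missing_pct == 0:
        sentences.append("No values are missing.")
    else:
        sentences.append(f"About {missing_pct:.1f}% of values are missing.")
    if n_outliers:
        sentences.append(f"{n_outliers} potential outliers were flagged.")
    sentences.append(f"Overall data quality is rated {verdict}.")
    return " ".join(sentences)


print(narrate(100, 5, 2.5, 3, "Good"))
```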

### Automated Report Generation
- Markdown export (.md)
- CLI-configurable output path
- Integrates narratives + scores + stats into a clean report
  
---
## Installation

Clone the Repository
```bash
git clone https://github.com/LexusMaximus/Automated-EDA-Narrator-Data-Quality-Scoring-Tool.git
cd Automated-EDA-Narrator-Data-Quality-Scoring-Tool
```
Install Dependencies
```bash
pip install -r requirements.txt
```
If installing manually:

```bash
pip install "pandas>=1.5" numpy scipy tabulate python-dateutil
```
---

## System Architecture (UML)

![Dataset UML](pics/dataset_uml.png)

The UML diagram shows how the classes collaborate through composition:

DatasetPipeline → DataLoader → Preprocessor → EDAAnalyzer → QualityScorer → Narrator → ReportBuilder
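
The chain above can be pictured as one object that owns its collaborators and calls them in order. The sketch below uses hypothetical `Stage` and `Pipeline` classes to show the composition pattern only; it does not reproduce the real `DatasetPipeline`.

```python
class Stage:
    """Hypothetical stand-in for one collaborator (loader, preprocessor, ...)."""

    def __init__(self, name):
        self.name = name

    def process(self, payload):
        # Record that this stage ran; a real stage would transform the data.
        return payload + [self.name]


class Pipeline:
    """Owns its stages (composition) and runs them in a fixed order."""

    def __init__(self):
        names = ("load", "preprocess", "analyze", "score", "narrate", "report")
        self._stages = [Stage(name) for name in names]

    def run(self):
        payload = []
        for stage in self._stages:
            payload = stage.process(payload)
        return payload

    def __repr__(self):
        return f"Pipeline(stages={len(self._stages)})"


pipeline = Pipeline()
print(pipeline)
print(pipeline.run())
```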


---

## Object-Oriented Design

| OOP Concept        | How it is applied in this project |
| ------------------ | --------------------------------- |
| **Classes**        | The core classes are `DataLoader`, `Preprocessor`, `EDAAnalyzer` (base) with its children `NumericAnalyzer` and `CategoricalAnalyzer`, `QualityScorer`, `Narrator`, `ReportBuilder`, and the orchestrating `DatasetPipeline`. |
| **Encapsulation**  | Protected attributes (e.g., `_df`, `_eda`, `_scores`) are used in classes. Getters like `get_df()` in `Preprocessor` and `DataLoader` provide controlled access.                                            |
| **Inheritance**    | `NumericAnalyzer` and `CategoricalAnalyzer` **inherit** from `EDAAnalyzer`.                                                                                                                                 |
| **Polymorphism**   | `run_all()` is **overridden** in `NumericAnalyzer` and `CategoricalAnalyzer` to handle numeric vs categorical data differently.                                                                             |
| **Dunder Methods** | `DataLoader` has `__repr__`, `__eq__`, `__len__`; `DatasetPipeline` has `__repr__`.                                                                                                                         |
| **Composition**    | `DatasetPipeline` **contains/uses** instances of `DataLoader`, `Preprocessor`, `EDAAnalyzer`, `QualityScorer`, `Narrator`, `ReportBuilder`.                                                                 |
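
To make the inheritance and polymorphism rows concrete, here is a minimal sketch of a base analyzer with two children that override `run_all()`. The class names `BaseAnalyzer`, `NumericSketch`, and `CategoricalSketch` are hypothetical, and the returned dictionaries are an assumed shape rather than the project's real output.

```python
import statistics
from collections import Counter


class BaseAnalyzer:
    def __init__(self, values):
        self._values = list(values)  # protected by convention (encapsulation)

    def run_all(self):
        raise NotImplementedError("subclasses decide how to profile the data")

    def __len__(self):  # dunder method: lets len(analyzer) work
        return len(self._values)


class NumericSketch(BaseAnalyzer):
    def run_all(self):
        # Numeric profiling: central tendency and spread.
        return {
            "mean": statistics.fmean(self._values),
            "std": statistics.stdev(self._values),
        }


class CategoricalSketch(BaseAnalyzer):
    def run_all(self):
        # Categorical profiling: frequencies and share of distinct values.
        counts = Counter(self._values)
        return {
            "frequencies": dict(counts),
            "unique_ratio": len(counts) / len(self._values),
        }


# Polymorphism: the same call works on either subclass.
for analyzer in (NumericSketch([1, 2, 3, 4]), CategoricalSketch(["a", "b", "a", "a"])):
    print(type(analyzer).__name__, len(analyzer), analyzer.run_all())
```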


---

## Project Structure

```
data-narrator/
├─ data/                    # CSV files and sample datasets
│  └─ sample.csv
├─ src/                     # Main modules (importable and reusable)
│  ├─ __init__.py
│  ├─ loader.py             # Loads CSV files
│  ├─ preprocessor.py       # Cleans and preprocesses data
│  ├─ eda_analyzer.py       # Numeric and categorical EDA analysis
│  ├─ quality_scorer.py     # Computes data quality scores
│  ├─ narrator.py           # Generates human-readable insights
│  ├─ report_builder.py     # Builds markdown reports
│  └─ orchestrator.py       # DatasetPipeline: orchestrates all classes
├─ demo.py                  # Ready-to-run mini demo for practical example
├─ tests/                   # Unit tests (optional)
├─ notebooks/               # Jupyter notebooks for exploration (optional)
├─ README.md                # Project documentation
└─ requirements.txt         # Python dependencies
```

---

| Requirement                                  | Project Implementation                                                                                                                                                                                                                                                                                      |
| -------------------------------------------- | ----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- |
| **At least 5 useful methods across modules** | Example methods: <br>1. `DataLoader.load()` – loads CSV <br>2. `Preprocessor.trim_strings()` – trims text columns <br>3. `NumericAnalyzer.run_all()` – numeric summary <br>4. `QualityScorer.overall_score()` – calculates weighted quality <br>5. `Narrator.generate()` – returns human-readable narrative |
| **Must be importable and reusable**          | All modules are in `src/` with proper `__init__.py`, allowing imports like: <br>`from src.loader import DataLoader`                                                                                                                                                                                         |

---

## Usage & Testing

Run on Any CSV (Python Script)

```python
from src.orchestrator import DatasetPipeline

pipeline = DatasetPipeline("data/sample.csv")
report = pipeline.run()
print(report)  # Prints markdown report to console
```
Run Pipeline with Custom Weights

```python
custom_weights = {
    'missing': 0.50,   # prioritize missing values
    'duplicates': 0.10,
    'outliers': 0.20,
    'balance': 0.20
}

pipeline_custom = DatasetPipeline("data/sample.csv", custom_weights=custom_weights)
report_custom = pipeline_custom.run()
print(report_custom)
```

Run via CLI

```bash
python src/cli.py data/sample.csv --out reports/sample_report.md
python src/cli.py data/sample.csv --weights '{"missing":0.5,"duplicates":0.1,"outliers":0.2,"balance":0.2}'
```

Terminal confirmation:

```text
Wrote report to reports/sample_report.md
```
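
For readers curious how a command line like the one above might be parsed, the following is a minimal `argparse` sketch. It is an assumption about the interface based only on the two invocations shown (`--out` and `--weights`); the actual `src/cli.py` may be organized differently.

```python
import argparse
import json


def build_parser():
    """Build a parser mirroring the flags used in the commands above."""
    parser = argparse.ArgumentParser(
        description="Generate a data quality report for a CSV file."
    )
    parser.add_argument("csv_path", help="path to the input CSV")
    parser.add_argument("--out", default=None, help="where to write the markdown report")
    # json.loads turns the quoted JSON string into a dict; invalid JSON makes
    # argparse exit with a usage error.
    parser.add_argument("--weights", type=json.loads, default=None,
                        help="JSON object of custom weights")
    return parser


# An explicit argument list is passed so the sketch runs outside a shell.
args = build_parser().parse_args(
    ["data/sample.csv", "--weights",
     '{"missing":0.5,"duplicates":0.1,"outliers":0.2,"balance":0.2}']
)
print(args.csv_path, args.out, args.weights)
```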

Run in Google Colab / Jupyter

```bash
!git clone https://github.com/LexusMaximus/Automated-EDA-Narrator-Data-Quality-Scoring-Tool.git
```

```python
import sys
sys.path.insert(0, '/content/Automated-EDA-Narrator-Data-Quality-Scoring-Tool/src')

from orchestrator import DatasetPipeline

pipeline = DatasetPipeline("Automated-EDA-Narrator-Data-Quality-Scoring-Tool/data/sample.csv")
report = pipeline.run()
print(report)
```
Compare Multiple Weight Configurations

```python
import pandas as pd

weight_configs = {
    'Default': {'missing':0.35, 'duplicates':0.15, 'outliers':0.25, 'balance':0.25},
    'Missing Focus': {'missing':0.50, 'duplicates':0.10, 'outliers':0.20, 'balance':0.20},
    'Outlier Focus': {'missing':0.20, 'duplicates':0.30, 'outliers':0.40, 'balance':0.10},
    'Equal Weights': {'missing':0.25, 'duplicates':0.25, 'outliers':0.25, 'balance':0.25},
    'Balance Focus': {'missing':0.20, 'duplicates':0.20, 'outliers':0.20, 'balance':0.40}
}

results = []
for name, weights in weight_configs.items():
    pipeline = DatasetPipeline("data/sample.csv", custom_weights=weights)
    pipeline.run()
    results.append({
        'Configuration': name,
        'Overall Score': round(pipeline.scores['overall'], 2),
        'Missing Weight': weights['missing'],
        'Duplicates Weight': weights['duplicates'],
        'Outliers Weight': weights['outliers'],
        'Balance Weight': weights['balance']
    })

comparison_df = pd.DataFrame(results)
print(comparison_df.to_string(index=False))
```
Error Handling - Invalid Weights

```python
try:
    # These weights sum to 1.2 rather than 1.0, so they should be rejected.
    invalid_weights = {'missing':0.5,'duplicates':0.3,'outliers':0.3,'balance':0.1}
    pipeline = DatasetPipeline("data/sample.csv", custom_weights=invalid_weights)
    pipeline.run()
except ValueError as e:
    print(f"✓ Error correctly caught: {e}")
```
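A validation step that would raise the `ValueError` caught above might look like the sketch below. The `validate_weights` name, the required-key check, and the exact rules are assumptions; the README only shows that weights summing to 1.2 are expected to be rejected.

```python
import math

REQUIRED_KEYS = {"missing", "duplicates", "outliers", "balance"}


def validate_weights(weights):
    """Return the weights unchanged if they look valid, else raise ValueError."""
    if set(weights) != REQUIRED_KEYS:
        raise ValueError(f"weights must have exactly these keys: {sorted(REQUIRED_KEYS)}")
    if any(value < 0 for value in weights.values()):
        raise ValueError("weights must be non-negative")
    total = sum(weights.values())
    # Compare with a tolerance, since float sums are rarely exactly 1.0.
    if not math.isclose(total, 1.0, abs_tol=1e-9):
        raise ValueError(f"weights must sum to 1.0, got {total:.2f}")
    return weights


validate_weights({"missing": 0.35, "duplicates": 0.15, "outliers": 0.25, "balance": 0.25})

try:
    validate_weights({"missing": 0.5, "duplicates": 0.3, "outliers": 0.3, "balance": 0.1})
except ValueError as error:
    print(f"Rejected: {error}")
```
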
Run the Entire Test Suite

```bash
pytest
```
