Metadata-Version: 2.4
Name: edasuite
Version: 0.0.5
Summary: A Python library for exploratory data analysis with advanced statistical features
Author-email: LattIQ Development Team <dev@lattiq.com>
License: MIT
Project-URL: Homepage, https://github.com/lattiq/edasuite
Project-URL: Repository, https://github.com/lattiq/edasuite
Keywords: data analysis,exploratory data analysis,eda
Classifier: Development Status :: 2 - Pre-Alpha
Classifier: Intended Audience :: Developers
Classifier: License :: OSI Approved :: MIT License
Classifier: Operating System :: OS Independent
Classifier: Programming Language :: Python :: 3
Classifier: Programming Language :: Python :: 3.9
Classifier: Programming Language :: Python :: 3.10
Classifier: Programming Language :: Python :: 3.11
Classifier: Programming Language :: Python :: 3.12
Classifier: Topic :: Scientific/Engineering
Classifier: Topic :: Software Development :: Libraries :: Python Modules
Requires-Python: >=3.9
Description-Content-Type: text/markdown
License-File: LICENSE
Requires-Dist: pandas<3.0.0,>=2.0.0
Requires-Dist: numpy<2.0.0,>=1.24.0
Requires-Dist: scipy<2.0.0,>=1.10.0
Requires-Dist: pyarrow>=10.0.0
Provides-Extra: dev
Requires-Dist: pytest>=7.0.0; extra == "dev"
Requires-Dist: pytest-cov>=4.0.0; extra == "dev"
Requires-Dist: pytest-xdist>=3.0.0; extra == "dev"
Requires-Dist: pytest-benchmark>=4.0.0; extra == "dev"
Requires-Dist: black>=23.0.0; extra == "dev"
Requires-Dist: ruff>=0.1.0; extra == "dev"
Requires-Dist: mypy>=1.5.0; extra == "dev"
Provides-Extra: test
Requires-Dist: pytest>=7.0.0; extra == "test"
Requires-Dist: pytest-cov>=4.0.0; extra == "test"
Requires-Dist: pytest-xdist>=3.0.0; extra == "test"
Dynamic: license-file

# EDASuite

A comprehensive Python library for exploratory data analysis with advanced features for data profiling, quality assessment, and stability monitoring.

[![Python 3.9+](https://img.shields.io/badge/python-3.9+-blue.svg)](https://www.python.org/downloads/)
[![License: MIT](https://img.shields.io/badge/License-MIT-yellow.svg)](https://opensource.org/licenses/MIT)

## Interactive Viewer

EDASuite includes a built-in interactive dashboard to explore your analysis results in the browser.

```python
from edasuite.viewer.server import serve_results

# From a saved JSON file
serve_results("eda_results.json")

# Or directly from EDA results
# (assumes `runner`, `df`, and `schema` are already set up as in Quick Start)
results = runner.run(data=df, schema=schema, target_variable="target")
serve_results(results)
```

**Summary** — Dataset overview, insights, top features by IV, data quality score, and provider match rates.

![Summary](docs/images/viewer-summary.png)

**Catalog** — Sortable feature table with type, provider, target correlation, IV, and PSI at a glance.

![Catalog](docs/images/viewer-catalog.png)

**Deep Dive** — Per-feature detail view with statistics, box plots, distribution charts, target associations, and correlations.

![Deep Dive](docs/images/viewer-deepdive.png)

**Associations** — Mixed-method heatmap (Pearson, Theil's U, Eta) showing relationships across all features.

![Associations](docs/images/viewer-associations.png)

## Features

### Core Analysis
- **Automated Feature Analysis**
  - Continuous features: mean, median, std, quartiles, skewness, kurtosis, outliers
  - Categorical features: mode, value counts, cardinality, entropy
  - Automatic type inference with schema override support
  - Missing value detection and sentinel value replacement

### Advanced Statistics
- **Target Relationship Analysis**
  - Information Value (IV) and Weight of Evidence (WoE)
  - Optimal binning for continuous features
  - Predictive power classification
  - Statistical significance testing

- **Correlation & Association Analysis**
  - Pearson and Spearman correlations with p-values
  - Theil's U (asymmetric categorical associations)
  - Eta / Correlation Ratio (categorical-continuous)
  - Unified association matrix across all feature types
  - Top-N correlation tracking per feature
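
To make the categorical-continuous case concrete, the correlation ratio (Eta) can be sketched as the square root of the between-group sum of squares divided by the total sum of squares. This is a minimal sketch of the textbook definition, not EDASuite's internal code; the `correlation_ratio` helper and the demo columns are hypothetical.

```python
import numpy as np
import pandas as pd


def correlation_ratio(categories: pd.Series, values: pd.Series) -> float:
    """Eta: sqrt(between-group sum of squares / total sum of squares)."""
    frame = pd.DataFrame({"cat": categories, "val": values}).dropna()
    grand_mean = frame["val"].mean()
    ss_total = ((frame["val"] - grand_mean) ** 2).sum()
    if ss_total == 0:
        return 0.0  # a constant column carries no association
    groups = frame.groupby("cat")["val"]
    ss_between = (groups.count() * (groups.mean() - grand_mean) ** 2).sum()
    return float(np.sqrt(ss_between / ss_total))


plan = pd.Series(["a", "a", "b", "b"])

# The category fully determines the value, so Eta is 1.
eta_perfect = correlation_ratio(plan, pd.Series([1.0, 1.0, 5.0, 5.0]))

# Both groups share the same mean, so Eta is 0.
eta_none = correlation_ratio(plan, pd.Series([1.0, 5.0, 1.0, 5.0]))

print(eta_perfect, eta_none)
```

Eta ranges from 0 (group means are identical) to 1 (the category explains all of the variance), which is what allows it to sit in the same matrix as absolute Pearson values.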

### Data Quality
- **Quality Assessment System**
  - Automated quality scoring (0-10 scale)
  - Per-feature quality flags (high_missing, low_variance, constant, outliers)
  - Overall dataset quality metrics
  - Actionable recommendations

- **Sentinel Value Handling**
  - Automatic detection and replacement of no-hit values
  - Provider-specific default value handling
  - Preserves integer dtypes using pandas nullable types (e.g. Int64) to avoid silent float upcasting
  - Configurable via DatasetSchema

### Stability Monitoring
- **Cohort-Based Stability**
  - PSI (Population Stability Index) for categorical features
  - KS (Kolmogorov-Smirnov) test for continuous features
  - Train/test drift detection
  - Feature-level stability metrics

- **Time-Based Stability**
  - Multiple time window strategies (monthly, weekly, quartile, custom)
  - Temporal trend analysis (increasing, decreasing, volatile)
  - Auto-detection of optimal time periods
  - Minimum sample size enforcement

### Provider Analytics
- **Provider Match Rates**
  - Automatic detection via `<provider>_record_not_found` columns
  - Data coverage statistics by provider (% of records with data)
  - Feature-level availability tracking
  - Not-found record counts per provider
  - Supports both column-based and schema-based detection

### Performance
- **Large Dataset Support**
  - Multiple file format support (CSV, Parquet)
  - Chunked CSV reading for files >100MB
  - Configurable sampling for faster analysis
  - Memory-efficient correlation computation
  - Tested with 100K+ rows, 400+ features
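
The idea behind chunked CSV reading can be sketched with plain pandas: read a fixed number of rows at a time and accumulate only the aggregates you need. This is an illustration of the general technique and makes no claim about how `DataLoader` chunks files internally; the `missing_counts_chunked` helper is hypothetical.

```python
import io

import pandas as pd


def missing_counts_chunked(source, chunk_rows):
    """Count rows and per-column missing values without loading the whole file."""
    total_rows = 0
    missing = None
    # With chunksize set, read_csv yields DataFrames of at most chunk_rows rows.
    for chunk in pd.read_csv(source, chunksize=chunk_rows):
        total_rows += len(chunk)
        chunk_missing = chunk.isna().sum()
        missing = chunk_missing if missing is None else missing + chunk_missing
    return total_rows, missing


# A tiny in-memory CSV stands in for a large file on disk.
csv_text = "a,b\n1,\n2,3\n,4\n5,6\n7,\n"
total_rows, missing = missing_counts_chunked(io.StringIO(csv_text), chunk_rows=2)

print(total_rows)
print(missing)
```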

## Installation

```bash
pip install edasuite
```

## Quick Start

### Basic Usage

```python
from edasuite import EDARunner, DataLoader
import pandas as pd

# Option 1: Load from file using DataLoader
df = DataLoader.load_csv("data.csv")

# Option 2: Use existing DataFrame
df = pd.read_csv("data.csv")  # or from database, etc.

# Initialize runner
runner = EDARunner(
    max_categories=50,
    top_correlations=10
)

# Run analysis
results = runner.run(
    data=df,
    output_path="eda_results.json"
)
```

### Loading Data

EDASuite provides `DataLoader` utilities for loading data:

```python
from edasuite import DataLoader

# Load CSV
df = DataLoader.load_csv("data.csv")

# Load Parquet (faster for large files)
df = DataLoader.load_parquet("data.parquet")

# Load with sampling
df = DataLoader.load_csv("large_file.csv", sample_size=10000)
```

### With DatasetSchema

```python
from edasuite import (
    EDARunner, DataLoader,
    ColumnConfig, ColumnType, ColumnRole, Sentinels, DatasetSchema,
)

# Load data and schema
df = DataLoader.load_csv("data.csv")
schema = DataLoader.load_schema("schema.json")

# Or create schema programmatically
schema = DatasetSchema([
    ColumnConfig('age', ColumnType.CONTINUOUS, ColumnRole.FEATURE,
                 provider='demographics', description='User age',
                 sentinels=Sentinels(not_found='-1')),
    ColumnConfig('zip_code', ColumnType.CATEGORICAL, ColumnRole.FEATURE,
                 provider='address', description='ZIP code',
                 sentinels=Sentinels(not_found='', missing='00000')),
    ColumnConfig('target', ColumnType.BINARY, ColumnRole.TARGET),
])

# Run with schema
runner = EDARunner()
results = runner.run(
    data=df,
    schema=schema,
    target_variable="target",
    output_path="eda_results.json"
)
```

**Schema JSON format** (`schema.json`):
```json
{
  "columns": [
    {
      "name": "age",
      "type": "continuous",
      "role": "feature",
      "provider": "demographics",
      "description": "User age",
      "sentinels": {
        "not_found": "-1",
        "missing": null
      }
    }
  ]
}
```

### Working with DataFrames

EDARunner works with pandas DataFrames, making it easy to integrate into existing data pipelines:

```python
import pandas as pd
import requests
from edasuite import EDARunner

# From database (`connection` is an existing DB-API or SQLAlchemy connection)
df = pd.read_sql("SELECT * FROM users", connection)

# From API
data = requests.get("https://api.example.com/data").json()
df = pd.DataFrame(data)

# In-memory transformations
df['age_group'] = pd.cut(df['age'], bins=[0, 30, 50, 100])

# Run EDA
runner = EDARunner()
results = runner.run(data=df, target_variable='target')
```

This is particularly useful for:
- Working in Jupyter notebooks
- Data loaded from databases (via `pd.read_sql()`)
- In-memory transformations without saving to disk
- Integration with existing data pipelines

See [examples/example_12_dataframe_input.py](examples/example_12_dataframe_input.py) for more examples.

### Stability Analysis

#### Cohort-Based (Train/Test)

```python
from edasuite import EDARunner, DataLoader

# Load data and schema
df = DataLoader.load_parquet("data.parquet")
schema = DataLoader.load_schema("schema.json")

# Configure for stability analysis
runner = EDARunner(
    calculate_stability=True,
    cohort_column='dataTag',
    baseline_cohort='training',
    comparison_cohort='test'
)

results = runner.run(
    data=df,
    schema=schema
)
```
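
For continuous features, the Features section lists a Kolmogorov-Smirnov test as the cohort stability check. The sketch below shows what such a comparison looks like with `scipy.stats.ks_2samp` on synthetic data. The column names `stable` and `drifted` are made up, and EDASuite's own computation and output layout may differ.

```python
import numpy as np
import pandas as pd
from scipy.stats import ks_2samp

rng = np.random.default_rng(0)

# Synthetic data: `stable` has the same distribution in both cohorts,
# `drifted` is shifted by one standard deviation in the test cohort.
df = pd.DataFrame({
    "dataTag": ["training"] * 2000 + ["test"] * 2000,
    "stable": rng.normal(0.0, 1.0, 4000),
    "drifted": np.concatenate([
        rng.normal(0.0, 1.0, 2000),
        rng.normal(1.0, 1.0, 2000),
    ]),
})

baseline = df[df["dataTag"] == "training"]
comparison = df[df["dataTag"] == "test"]

# Two-sample KS test per feature: baseline cohort vs comparison cohort.
ks_results = {
    column: ks_2samp(baseline[column], comparison[column])
    for column in ["stable", "drifted"]
}

for column, result in ks_results.items():
    print(f"{column}: KS statistic={result.statistic:.3f}, p-value={result.pvalue:.3g}")
```

A larger KS statistic means the two empirical distributions are further apart; the shifted column should stand out clearly against the stable one.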

#### Time-Based

```python
from edasuite import EDARunner, DataLoader

# Load data and schema
df = DataLoader.load_parquet("data.parquet")
schema = DataLoader.load_schema("schema.json")

# Configure for time-based stability
runner = EDARunner(
    time_based_stability=True,
    time_column='onboarding_time',
    time_window_strategy='monthly',  # or 'weekly', 'quartiles', 'custom'
    baseline_period='first',
    comparison_periods='all',
    min_samples_per_period=100
)

results = runner.run(
    data=df,
    schema=schema
)
```
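
To illustrate what a monthly window strategy with a minimum sample size implies, here is a rough pandas sketch: bucket rows by calendar month, drop months with too few rows, and treat the first remaining month as the baseline. The `monthly_periods` helper is hypothetical and only mirrors the configuration names above; it is not EDASuite's implementation.

```python
import pandas as pd


def monthly_periods(df, time_column, min_samples):
    """Return row counts per calendar month, keeping months with enough rows."""
    months = pd.to_datetime(df[time_column]).dt.to_period("M")
    sizes = months.value_counts().sort_index()
    return sizes[sizes >= min_samples]


events = pd.DataFrame({
    "onboarding_time": [
        "2024-01-03", "2024-01-17", "2024-01-29",  # 3 rows in January
        "2024-02-10",                              # 1 row in February
        "2024-03-05", "2024-03-21",                # 2 rows in March
    ]
})

kept = monthly_periods(events, "onboarding_time", min_samples=2)

# Assumption: 'first' baseline means the earliest period that survives the filter.
baseline_period = kept.index[0]
comparison_periods = list(kept.index[1:])

print(kept)
```

February is dropped because it has fewer rows than `min_samples`, so drift would be measured from January to March only.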

## DatasetSchema

`DatasetSchema` declares each column's type, role, provider, and sentinel values. These declarations drive type overrides, sentinel replacement, and provider analytics:

### Variable Type Override
Override automatic type inference:
```json
{
  "name": "customer_id",
  "type": "categorical",
  "role": "feature"
}
```

### Sentinel Values
Define values that should be treated as missing:
```json
{
  "name": "income",
  "type": "continuous",
  "role": "feature",
  "sentinels": {
    "not_found": "-1",
    "missing": "0"
  }
}
```
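
The effect of treating sentinels as missing while keeping an integer dtype can be sketched with pandas nullable integers. This is a minimal illustration of the behaviour described under Sentinel Value Handling, not EDASuite's code; the `replace_sentinels` helper is hypothetical, and comparing values as strings is an assumption based on the string sentinels in the schema.

```python
import pandas as pd


def replace_sentinels(series, sentinels):
    """Replace sentinel values with <NA> while keeping an integer dtype."""
    # Schema sentinels are strings, so compare on the string form of each value.
    is_sentinel = series.astype(str).isin(sentinels)
    cleaned = series.astype("Int64")  # nullable integer dtype
    cleaned[is_sentinel] = pd.NA
    return cleaned


income = pd.Series([52000, -1, 0, 61000], dtype="int64")
cleaned = replace_sentinels(income, sentinels=["-1", "0"])

# For contrast: masking a plain int64 column falls back to float64 with NaN.
upcast = income.where(~income.astype(str).isin(["-1", "0"]))

print(cleaned.dtype, upcast.dtype)
```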

### Provider Tracking
Track data sources:
```json
{
  "name": "credit_score",
  "type": "continuous",
  "role": "feature",
  "provider": "bureau_provider",
  "description": "FICO credit score"
}
```

## Output Format

EDASuite produces structured JSON output with three top-level sections:

### metadata
- Timestamp, execution time, version
- Configuration (target variable, sampling, correlations)
- Schema availability indicator

### summary
- Feature type distribution and counts
- Data quality score with recommendations
- Dataset info (rows, columns, memory, missing, duplicates)
- Provider match rates (if schema with providers is used)
- Feature counts across 16+ categories (see [Feature Counts](#feature-counts))
- Association matrix (Pearson, Theil's U, Eta merged into a single N×N structure)
- Top features by statistical score

### features
List of per-feature analysis, each including:
- Statistics (mean, median, mode, quartiles, etc.)
- Distribution (histogram or value counts)
- Missing values
- Quality assessment
- Correlations (with target and other features)
- Target relationship (IV, WoE if target specified)
- Stability (PSI/KS if enabled)

## Provider Match Rates / Hit Rates

EDASuite automatically computes provider match rates (also called "hit rates") to help you understand data coverage from different third-party data providers.

### Automatic Detection

Provider match rates are computed automatically during EDA using one of two methods:

#### Method 1: Using `<provider>_record_not_found` columns (Preferred)

If your dataset includes columns like `payu_record_not_found`, `truecaller_record_not_found`, etc., EDASuite will automatically detect and use them:

```python
from edasuite import EDARunner, DataLoader

runner = EDARunner()
df = DataLoader.load_csv("data.csv")
results = runner.run(data=df)

# Access provider stats
provider_stats = results['summary']['provider_match_rates']
```
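
Conceptually, a match rate derived from a `<provider>_record_not_found` column is the share of rows where the flag is not set. The sketch below assumes the flag is 1 (or True) when the provider returned no record; the `provider_match_rates` helper and its output keys are hypothetical and do not describe the actual structure of `results['summary']['provider_match_rates']`.

```python
import pandas as pd

SUFFIX = "_record_not_found"


def provider_match_rates(df):
    """Derive per-provider match rates from `<provider>_record_not_found` flags."""
    rates = {}
    for column in df.columns:
        if column.endswith(SUFFIX):
            provider = column[: -len(SUFFIX)]
            not_found = df[column].astype(bool)
            rates[provider] = {
                "not_found": int(not_found.sum()),
                "match_rate": float(1 - not_found.mean()),
            }
    return rates


demo = pd.DataFrame({
    "user_id": [1, 2, 3, 4],
    "payu_record_not_found": [0, 1, 0, 0],
    "truecaller_record_not_found": [1, 1, 0, 0],
})

rates = provider_match_rates(demo)
print(rates)
```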

#### Method 2: Using DatasetSchema (Fallback)

If no `<provider>_record_not_found` columns exist, you can use a schema to group features by provider:

```python
from edasuite import EDARunner, DataLoader

df = DataLoader.load_csv("data.csv")
schema = DataLoader.load_schema("schema.json")

runner = EDARunner()
results = runner.run(data=df, schema=schema)

# Provider stats show match rates based on feature null analysis
provider_stats = results['summary']['provider_match_rates']
```

### Example

See [examples/example_10_provider_match_rates.py](examples/example_10_provider_match_rates.py) for a complete working example.

## Feature Counts

EDASuite automatically computes feature counts across 16+ categories — useful for dashboards, feature selection, and data quality monitoring.

### Automatic Computation

Feature counts are computed automatically during EDA and included in the results:

```python
from edasuite import EDARunner, DataLoader

runner = EDARunner()
df = DataLoader.load_csv("data.csv")
results = runner.run(
    data=df,
    target_variable="target"  # Required for correlation and IV
)

# Access feature counts
feature_counts = results['summary']['feature_counts']

print(f"High Correlation: {feature_counts['high_correlation']['count']}")
print(f"Redundant Features: {feature_counts['redundant_features']['count']}")
print(f"High IV: {feature_counts['high_iv']['count']}")
print(f"High Stability: {feature_counts['high_stability']['count']}")
```

### Categories

**Target Relationship**

| Category | Threshold | Description |
|----------|-----------|-------------|
| **High Correlation** | best association > 0.1 | Features associated with target (Pearson, Eta, or Theil's U) |
| **High IV** | IV > 0.1 | Features with strong predictive power |
| **Significant Correlations** | p-value < 0.05 | Statistically significant target correlations |
| **Suspected Leakage** | IV > 0.5 | Features with suspiciously high predictive power |
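
For readers unfamiliar with the IV thresholds above, the following is a minimal sketch of the standard Weight of Evidence and Information Value formulas for an already-binned feature and a binary target. The `woe_iv` helper, the sign convention (events over non-events), and the 0.5 smoothing constant are assumptions made for illustration; EDASuite's binning and smoothing may differ.

```python
import numpy as np
import pandas as pd


def woe_iv(feature_bins, target, smoothing=0.5):
    """Return per-bin WoE and the total IV for a binned feature and a 0/1 target."""
    counts = pd.crosstab(feature_bins, target).reindex(columns=[0, 1], fill_value=0)
    # Smoothing (an assumption) avoids log(0) when a bin has no events or non-events.
    events = counts[1] + smoothing
    non_events = counts[0] + smoothing
    dist_event = events / events.sum()
    dist_non_event = non_events / non_events.sum()
    woe = np.log(dist_event / dist_non_event)
    iv = float(((dist_event - dist_non_event) * woe).sum())
    return woe, iv


bins = pd.Series(["low"] * 50 + ["high"] * 50)

# Event rate differs sharply between bins: 20% in "low" vs 80% in "high".
target_strong = pd.Series([0] * 40 + [1] * 10 + [1] * 40 + [0] * 10)
# Event rate is 50% in both bins, so the feature carries no information.
target_flat = pd.Series(([0] * 25 + [1] * 25) * 2)

woe_strong, iv_strong = woe_iv(bins, target_strong)
woe_flat, iv_flat = woe_iv(bins, target_flat)

print(f"IV (strong) = {iv_strong:.3f}, IV (flat) = {iv_flat:.3f}")
```

In this toy data the strong feature lands above the 0.5 threshold, which is the range the table flags as suspected leakage, while the flat feature has an IV of zero.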

**Feature Quality**

| Category | Threshold | Description |
|----------|-----------|-------------|
| **Redundant Features** | correlation > 0.7 | Highly correlated with another feature |
| **High Missing** | > 30% | Features with substantial missing values |
| **Constant Features** | 1 unique value | Zero-variance features |
| **Low Variance** | low CV | Features with very low coefficient of variation |
| **Not Recommended** | composite | Features flagged as unsuitable for modeling |
| **Highly Skewed** | \|skewness\| > 1.0 | Features with heavy distributional skew |
| **High Kurtosis** | kurtosis > 3.0 | Outlier-prone features |
| **High Cardinality** | — | Categoricals with high unique-value ratio |
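
Two of the simpler checks in this table can be reproduced directly with pandas. The sketch below flags columns with more than 30% missing values and columns with a single unique value, reusing the flag names `high_missing` and `constant` from the Data Quality section. The `quality_flags` helper is hypothetical, and EDASuite's exact rules (for example, how all-null columns are treated) are not shown here.

```python
import pandas as pd


def quality_flags(df, missing_threshold=0.30):
    """Flag columns that are mostly missing or have a single unique value."""
    flags = {}
    for name in df.columns:
        column = df[name]
        column_flags = []
        if column.isna().mean() > missing_threshold:
            column_flags.append("high_missing")
        if column.nunique(dropna=True) == 1:
            column_flags.append("constant")
        flags[name] = column_flags
    return flags


demo = pd.DataFrame({
    "age": [34, 51, 29, 42],
    "fax_number": [None, None, "555-0100", "555-0199"],  # 50% missing
    "country": ["IN", "IN", "IN", "IN"],                 # one unique value
})

flags = quality_flags(demo)
print(flags)
```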

**Predictive Power Breakdown**

| Category | Description |
|----------|-------------|
| **Predictive Power** | Count of features by IV class: unpredictive, weak, medium, strong, very strong |

**Stability**

| Category | Threshold | Description |
|----------|-----------|-------------|
| **High Stability** | PSI < 0.1 | Stable distribution across cohorts/time |
| **Minor Shift** | 0.1 ≤ PSI < 0.2 | Minor distribution drift |
| **Major Shift** | PSI ≥ 0.2 | Major distribution drift |
| **Increasing Drift** | — | Worsening distribution drift over time |
| **Volatile Stability** | — | Inconsistent stability across periods |
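
The PSI thresholds above can be illustrated with the standard formula, PSI = Σ (cᵢ − bᵢ) · ln(cᵢ / bᵢ), where bᵢ and cᵢ are the baseline and comparison shares of category i. This is a sketch for a categorical feature only; the `categorical_psi` and `classify_psi` helpers, the label strings, and the `eps` floor for unseen categories are assumptions, not EDASuite's API.

```python
import numpy as np
import pandas as pd


def categorical_psi(baseline, comparison, eps=1e-6):
    """Population Stability Index between two categorical samples."""
    base = baseline.value_counts(normalize=True)
    comp = comparison.value_counts(normalize=True)
    categories = base.index.union(comp.index)
    # Floor shares at eps so categories missing from one sample do not cause log(0).
    base = base.reindex(categories, fill_value=0).clip(lower=eps)
    comp = comp.reindex(categories, fill_value=0).clip(lower=eps)
    return float(((comp - base) * np.log(comp / base)).sum())


def classify_psi(psi):
    """Map a PSI value to the bands used in the table above."""
    if psi < 0.1:
        return "high_stability"
    if psi < 0.2:
        return "minor_shift"
    return "major_shift"


train = pd.Series(["A"] * 50 + ["B"] * 30 + ["C"] * 20)
test_same = pd.Series(["A"] * 50 + ["B"] * 30 + ["C"] * 20)
test_shifted = pd.Series(["A"] * 20 + ["B"] * 30 + ["C"] * 50)

psi_same = categorical_psi(train, test_same)
psi_shifted = categorical_psi(train, test_shifted)

print(classify_psi(psi_same), classify_psi(psi_shifted))
```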

### Example

See [examples/example_11_feature_counts.py](examples/example_11_feature_counts.py) for a complete working example with UI formatting.

## Advanced Configuration

### Correlation Settings

```python
from edasuite import EDARunner

runner = EDARunner(
    top_correlations=10,           # Top N correlations per feature
    max_correlation_features=500   # Limit features in correlation matrix
)
```

### Sampling for Large Datasets

```python
from edasuite import EDARunner

runner = EDARunner(
    sample_size=10000  # Analyze sample of 10K rows
)
```

### Custom Column Selection

```python
from edasuite import EDARunner, DataLoader

runner = EDARunner()
df = DataLoader.load_csv("data.csv")
results = runner.run(
    data=df,
    columns=['age', 'income', 'zip_code']  # Analyze specific columns
)
```

### Compact JSON Output

```python
from edasuite import EDARunner, DataLoader

runner = EDARunner()
df = DataLoader.load_csv("data.csv")
results = runner.run(
    data=df,
    output_path="results.json",
    compact_json=True  # Minimize JSON size
)
```

### Parquet File Benefits

Parquet offers several advantages over CSV:
- **Faster loading**: Columnar format with efficient compression
- **Smaller file size**: Typically 50-80% smaller than CSV
- **Type preservation**: Maintains data types (no type inference needed)
- **Column selection**: Read only needed columns (reduces memory usage)

```python
import pandas as pd
from edasuite import EDARunner, DataLoader

# Convert CSV to Parquet (one-time operation)
df = pd.read_csv("data.csv")
df.to_parquet("data.parquet", index=False)

# Then use Parquet for faster analysis
df = DataLoader.load_parquet("data.parquet")
runner = EDARunner()
results = runner.run(data=df)
```

## Development

```bash
pip install -e ".[dev]"    # Install in editable mode with the dev extras (pytest, black, ruff, mypy)
python -m build            # Build package (requires the `build` package)
python -m pytest tests/    # Run tests
```

## Documentation

- [Architecture](docs/ARCHITECTURE.md) — internals, module structure, data flow
- [Decision Records](docs/decisions/) — key design decisions and rationale
- [Examples](examples/) — usage examples and demos

## Requirements

- Python 3.9+
- pandas >= 2.0.0, < 3.0.0
- numpy >= 1.24.0, < 2.0.0
- scipy >= 1.10.0, < 2.0.0
- pyarrow >= 10.0.0 (for Parquet support)

## License

MIT License - see [LICENSE](LICENSE) file for details.

## Contact

For questions or suggestions:
- Email: dev@lattiq.com
- GitHub: [https://github.com/lattiq/edasuite](https://github.com/lattiq/edasuite)

## Contributing

Contributions are welcome! Please feel free to submit a Pull Request.
