Metadata-Version: 2.4
Name: octolearn
Version: 0.7.7
Summary: Structured AutoML Pipeline with Intelligent Dataset Profiling
Home-page: https://github.com/ghulam-nabeel/octolearn
Author: Ghulam_Muhammad_Nabeel
Author-email: Ghulam Muhammad Nabeel <ghulam.nabeel@example.com>
License: MIT
Project-URL: Homepage, https://github.com/GhulamMuhammadNabeel/Octolearn
Project-URL: Bug Tracker, https://github.com/GhulamMuhammadNabeel/Octolearn
Project-URL: Documentation, https://github.com/GhulamMuhammadNabeel/Octolearn
Keywords: automl,machine-learning,data-science,profiling,automation
Classifier: Programming Language :: Python :: 3
Classifier: Programming Language :: Python :: 3.8
Classifier: Programming Language :: Python :: 3.9
Classifier: Programming Language :: Python :: 3.10
Classifier: Programming Language :: Python :: 3.11
Classifier: License :: OSI Approved :: MIT License
Classifier: Operating System :: OS Independent
Classifier: Development Status :: 3 - Alpha
Classifier: Intended Audience :: Developers
Classifier: Intended Audience :: Science/Research
Classifier: Topic :: Scientific/Engineering :: Artificial Intelligence
Requires-Python: >=3.8
Description-Content-Type: text/markdown
License-File: LICENSE
Requires-Dist: pandas>=1.0.0
Requires-Dist: numpy>=1.19.0
Requires-Dist: scikit-learn>=0.24.0
Requires-Dist: optuna>=2.0.0
Requires-Dist: reportlab>=3.6.0
Requires-Dist: matplotlib>=3.3.0
Requires-Dist: seaborn>=0.11.0
Requires-Dist: shap>=0.40.0
Provides-Extra: distributed
Requires-Dist: dask[complete]>=2021.9.0; extra == "distributed"
Requires-Dist: ray[default]>=2.0.0; extra == "distributed"
Dynamic: author
Dynamic: home-page
Dynamic: license-file
Dynamic: requires-python

<p align="center">
  <img src="octolearn/logo/octolearn_logo.png" alt="OctoLearn Logo" width="200"/>
</p>

<h1 align="center">🐙 OctoLearn</h1>

<p align="center">
  <strong>Enterprise-Grade AutoML for Python</strong><br>
  Profile → Clean → Engineer → Train → Report — in one line of code.
</p>

<p align="center">
  <a href="#-quick-start">Quick Start</a> •
  <a href="#-features">Features</a> •
  <a href="#-installation">Installation</a> •
  <a href="#-advanced-usage">Advanced Usage</a> •
  <a href="#-api-reference">API Reference</a> •
  <a href="#-architecture">Architecture</a>
</p>

---

## ✨ Features

| Feature | Description |
|---------|-------------|
| **🔍 Smart Profiling** | Auto-detects column types, task type, leakage suspects, class imbalance |
| **🧹 Auto Cleaning** | Imputation, encoding, scaling — all learned on train, applied to test |
| **⚠️ Risk Scoring** | 0–100 data quality risk score with detailed factor breakdown |
| **🔧 Feature Engineering** | Outlier detection (IQR, Z-score, Isolation Forest) + interaction analysis |
| **🤖 Model Training** | Trains 5+ models with Optuna hyperparameter optimization |
| **📊 PDF Reports** | Professional cyberpunk-themed reports with charts & SHAP analysis |
| **💾 Model Registry** | Version-controlled model storage with metadata tracking |
| **⚡ Parallel Processing** | Multi-core support for faster training and optimization |

---

## 📦 Installation

### From Source (Development)

```bash
git clone https://github.com/GhulamMuhammadNabeel/OctoLearn.git
cd OctoLearn
python -m venv .venv

# Windows
.\.venv\Scripts\activate

# macOS/Linux
source .venv/bin/activate

pip install -e .
```

### Dependencies

OctoLearn requires Python 3.8+ and installs the following:

| Package | Purpose |
|---------|---------|
| `pandas`, `numpy` | Data manipulation |
| `scikit-learn` | ML models & preprocessing |
| `optuna` | Hyperparameter optimization |
| `reportlab` | PDF report generation |
| `matplotlib`, `seaborn` | Visualization |
| `shap` | Model explainability |
| `joblib` | Model serialization (not declared directly; installed as a scikit-learn dependency) |

---

## 🚀 Quick Start

### For Beginners — One Line Pipeline

```python
from octolearn import AutoML
import pandas as pd

# Load your data
data = pd.read_csv("your_data.csv")
X = data.drop("target_column", axis=1)
y = data["target_column"]

# Run the entire pipeline
automl = AutoML()
automl.fit(X, y)

# Get results
print(automl.raw_profile_)          # Dataset profiling results
print(automl.get_risk_score())      # Data quality risk score
print(automl.get_recommendations()) # ML recommendations
```

### Profile Only (No Training)

```python
automl = AutoML(train_models=False)
automl.fit(X, y)

# Access insights
profile = automl.raw_profile_
print(f"Rows: {profile.n_rows}, Columns: {profile.n_columns}")
print(f"Task type: {profile.task_type}")
print(f"Missing values: {profile.missing_ratio}")

# Risk assessment
risk = automl.get_risk_score()
print(f"Risk: {risk['score']}/100 ({risk['category']})")
```

### Generate a PDF Report

```python
automl = AutoML()
automl.fit(X, y)
automl.generate_report()  # Creates a professional PDF report
```

---

## 🔧 Advanced Usage

### Full Configuration Control

OctoLearn is configured through seven dataclass objects, one per pipeline concern:

```python
from octolearn import (
    AutoML,
    DataConfig,
    ProfilingConfig,
    PreprocessingConfig,
    ModelingConfig,
    OptimizationConfig,
    ReportingConfig,
    ParallelConfig,
)

automl = AutoML(
    # Data handling
    data_config=DataConfig(
        use_full_data=False,     # Sample large datasets
        sample_size=1000,        # Rows to sample
        test_size=0.2,           # Train/test split ratio
        random_state=42,         # Reproducibility
    ),

    # Profiling behavior
    profiling_config=ProfilingConfig(
        detect_outliers=True,
        analyze_interactions=True,   # Enable interaction analysis
        generate_risk_score=True,
        calculate_feature_importance=True,
    ),

    # Preprocessing strategy
    preprocessing_config=PreprocessingConfig(
        auto_clean=True,
        imputer_strategy={"numeric": "median", "categorical": "mode"},
        scaler="standard",       # "standard", "minmax", "robust", or None
        id_columns=["user_id"],  # Columns to remove
    ),

    # Model training
    modeling_config=ModelingConfig(
        train_models=True,
        n_models=5,
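        # Note: "xgboost" is not among the declared dependencies, so it presumably needs a separate install
        models_to_train=["random_forest", "xgboost", "logistic_regression"],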
    ),

    # Hyperparameter tuning
    optimization_config=OptimizationConfig(
        use_optuna=True,
        optuna_trials_per_model=30,
        optuna_timeout_seconds=600,
    ),

    # Report settings
    reporting_config=ReportingConfig(
        generate_report=True,
        report_detail="detailed",   # "brief" or "detailed"
        include_shap=True,
        plot_mode="simple",         # "simple" or "dashboard"
    ),

    # Parallel processing
    parallel_config=ParallelConfig(
        parallel_processing=True,
        n_jobs=-1,               # -1 = all cores
        backend="threading",
    ),
)

automl.fit(X, y)
```

### Using Individual Components

OctoLearn's components can be used independently:

#### Data Profiling

```python
from octolearn.profiling import DataProfiler

profiler = DataProfiler()
profile = profiler.profile(X, y)

print(f"Shape: {profile.shape}")
print(f"Numeric columns: {profile.numeric_columns}")
print(f"Categorical columns: {profile.categorical_columns}")
print(f"ID-like columns: {profile.id_like_columns}")
print(f"Leakage suspects: {profile.leakage_suspects}")
print(f"Class imbalance ratio: {profile.imbalance_ratio}")
```
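
To make two of the profile fields above more concrete, here is a minimal standard-library sketch of how ID-like columns and class imbalance could be detected. The function names, the 0.95 uniqueness threshold, and the max/min definition of the ratio are assumptions for illustration; they are not taken from OctoLearn's `DataProfiler`.

```python
from collections import Counter


def is_id_like(values, threshold=0.95):
    """Flag a column whose non-null values are (almost) all distinct.

    The 0.95 threshold is an illustrative assumption.
    """
    non_null = [v for v in values if v is not None]
    if not non_null:
        return False
    return len(set(non_null)) / len(non_null) >= threshold


def imbalance_ratio(labels):
    """Ratio of the largest class count to the smallest class count."""
    counts = Counter(labels)
    return max(counts.values()) / min(counts.values())


user_id = [101, 102, 103, 104, 105, 106]
city = ["A", "B", "A", "A", "B", "A"]
target = [0, 0, 0, 0, 1, 1]

print(is_id_like(user_id))      # True: every value is unique
print(is_id_like(city))         # False: only 2 distinct values in 6 rows
print(imbalance_ratio(target))  # 2.0: four negatives for two positives
```

A column flagged this way carries no generalizable signal, which is why such columns are typically dropped before training.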

#### Auto Cleaning

```python
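from sklearn.model_selection import train_test_split
from octolearn.preprocessing.auto_cleaner import AutoCleaner

# Split first so the cleaner learns its statistics from training rows only
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

cleaner = AutoCleaner(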
    imputer_strategy={"numeric": "median"},
    scaler="robust"
)
X_clean, y_clean, cleaning_log = cleaner.fit_transform(X_train, y_train)

# Apply same cleaning to test data
X_test_clean = cleaner.transform(X_test)
```
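
The fit-on-train, apply-to-test pattern used above can be sketched without any dependencies. The class below is a toy stand-in, not `AutoCleaner` itself: it learns a median and scaling statistics from the training column and reuses them unchanged on the test column, so no test information leaks into preprocessing.

```python
import statistics


class MedianImputerScaler:
    """Toy cleaner (hypothetical): median imputation, then standard scaling."""

    def fit(self, column):
        observed = [v for v in column if v is not None]
        self.median_ = statistics.median(observed)
        filled = [self.median_ if v is None else v for v in column]
        self.mean_ = statistics.fmean(filled)
        # Fall back to 1.0 so a constant column does not divide by zero
        self.std_ = statistics.pstdev(filled) or 1.0
        return self

    def transform(self, column):
        # Only statistics learned in fit() are used here
        filled = [self.median_ if v is None else v for v in column]
        return [(v - self.mean_) / self.std_ for v in filled]


train = [1.0, 2.0, None, 3.0]
test = [None, 10.0]

cleaner = MedianImputerScaler().fit(train)
train_clean = cleaner.transform(train)
test_clean = cleaner.transform(test)

print(cleaner.median_)  # 2.0, learned from the training column only
print(test_clean[0])    # 0.0: the missing test value is filled with the train median
```

Calling `fit` again on the test column would change the median and the scale, which is exactly the leakage the separate `transform` call avoids.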

#### Model Registry

```python
from octolearn.models.registry import ModelRegistry

registry = ModelRegistry(base_dir="./models")
# `model` is any fitted estimator, for example automl.best_model_
registry.register(model, name="xgboost_v1", metrics={"accuracy": 0.95})

# Load best model later
best = registry.get_best_model(metric="accuracy")
```
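
For readers curious what a registry of this kind involves, the following is a minimal sketch using only the standard library. `TinyRegistry`, its file layout (`model.pkl` plus `meta.json` per model), and the use of `pickle` are all assumptions for illustration; the real `ModelRegistry` may store models differently.

```python
import json
import pickle
import tempfile
from pathlib import Path


class TinyRegistry:
    """Toy registry (hypothetical): one directory per model name plus metadata."""

    def __init__(self, base_dir):
        self.base_dir = Path(base_dir)
        self.base_dir.mkdir(parents=True, exist_ok=True)

    def register(self, model, name, metrics):
        model_dir = self.base_dir / name
        model_dir.mkdir(exist_ok=True)
        (model_dir / "model.pkl").write_bytes(pickle.dumps(model))
        meta = {"name": name, "metrics": metrics}
        (model_dir / "meta.json").write_text(json.dumps(meta))

    def get_best_model(self, metric):
        # Compare stored metadata first, then load only the winning model
        metas = [json.loads(p.read_text()) for p in self.base_dir.glob("*/meta.json")]
        best = max(metas, key=lambda m: m["metrics"][metric])
        model_path = self.base_dir / best["name"] / "model.pkl"
        return pickle.loads(model_path.read_bytes()), best


with tempfile.TemporaryDirectory() as tmp:
    registry = TinyRegistry(tmp)
    # Plain dicts stand in for fitted estimators
    registry.register({"kind": "baseline"}, name="baseline_v1", metrics={"accuracy": 0.81})
    registry.register({"kind": "tuned"}, name="tuned_v1", metrics={"accuracy": 0.95})
    best_model, best_meta = registry.get_best_model(metric="accuracy")

print(best_meta["name"])  # tuned_v1
```

Keeping metrics in a small JSON file next to the serialized model lets the registry rank candidates without deserializing each one. Pickled files should only be loaded from trusted locations.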

### Backward Compatibility

Legacy flat parameter names are still accepted via `**kwargs` and behave the same as setting the corresponding config field:

```python
# Both of these work identically:
AutoML(train_models=False)
AutoML(modeling_config=ModelingConfig(train_models=False))
```
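
One way such a mapping could work is sketched below. The two dataclasses are trimmed stand-ins carrying a few fields from the configuration tables in the API reference, and `resolve_configs` is a hypothetical helper, not OctoLearn's actual constructor logic.

```python
from dataclasses import dataclass, fields, replace


@dataclass
class ModelingConfig:  # stand-in, not the real class
    train_models: bool = True
    n_models: int = 5


@dataclass
class DataConfig:  # stand-in, not the real class
    sample_size: int = 500
    random_state: int = 42


def resolve_configs(modeling_config=None, data_config=None, **legacy):
    """Route flat legacy keyword arguments into the matching config object."""
    configs = {
        "modeling_config": modeling_config or ModelingConfig(),
        "data_config": data_config or DataConfig(),
    }
    for key, value in legacy.items():
        for name, cfg in configs.items():
            if key in {f.name for f in fields(cfg)}:
                # replace() returns a copy, so caller-owned configs stay untouched
                configs[name] = replace(cfg, **{key: value})
                break
        else:
            raise TypeError(f"unexpected keyword argument: {key!r}")
    return configs


flat = resolve_configs(train_models=False)
explicit = resolve_configs(modeling_config=ModelingConfig(train_models=False))
print(flat == explicit)  # True
```

Rejecting unknown keys with a `TypeError` mirrors how Python treats unexpected keyword arguments, so a typo in a legacy name fails loudly instead of being ignored.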

---

## 📖 API Reference

### `AutoML` — Main Orchestrator

| Method | Description |
|--------|-------------|
| `fit(X, y)` | Run the complete pipeline |
| `predict(X_new)` | Make predictions using best model |
| `generate_report()` | Generate PDF report |
| `get_risk_score()` | Get data quality risk score (0–100) |
| `get_recommendations()` | Get ML recommendations |
| `get_feature_importance()` | Get feature importance scores |
| `get_preprocessing_suggestions()` | Get preprocessing advice |
| `get_model_benchmarks()` | Get all model metrics |

| Attribute | Description |
|-----------|-------------|
| `raw_profile_` | `DatasetProfile` of raw data |
| `clean_profile_` | `DatasetProfile` of cleaned data |
| `X_`, `y_` | Cleaned feature matrix and target |
| `X_train_`, `X_test_` | Train/test splits |
| `cleaning_log_` | Dictionary of cleaning operations |
| `outlier_results_` | Outlier detection results |
| `trained_models_` | Dictionary of trained models |
| `best_model_` | Best performing model |

### Configuration Dataclasses

<details>
<summary><strong>DataConfig</strong></summary>

| Field | Type | Default | Description |
|-------|------|---------|-------------|
| `use_full_data` | `bool` | `False` | Use entire dataset (no sampling) |
| `sample_size` | `int` | `500` | Rows to sample if not using full data |
| `test_size` | `float` | `0.2` | Fraction for test split |
| `random_state` | `int` | `42` | Random seed for reproducibility |
| `stratify_target` | `bool` | `True` | Stratify split on target |
</details>

<details>
<summary><strong>ProfilingConfig</strong></summary>

| Field | Type | Default | Description |
|-------|------|---------|-------------|
| `detect_outliers` | `bool` | `True` | Run outlier detection |
| `analyze_interactions` | `bool` | `False` | Analyze feature interactions |
| `generate_risk_score` | `bool` | `True` | Calculate risk score |
| `calculate_feature_importance` | `bool` | `True` | Compute importance |
| `generate_recommendations` | `bool` | `True` | Generate ML recommendations |
| `include_duplicates_analysis` | `bool` | `True` | Analyze duplicates |
</details>
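
Two of the outlier methods that `detect_outliers` refers to, IQR and Z-score, can be sketched with the standard library. The fence multiplier `k=1.5` and the Z-score threshold of `3.0` are common conventions assumed here; OctoLearn's own defaults are not documented in this README.

```python
import statistics


def iqr_outliers(values, k=1.5):
    """Values outside [Q1 - k*IQR, Q3 + k*IQR]."""
    q1, _, q3 = statistics.quantiles(values, n=4)
    iqr = q3 - q1
    low, high = q1 - k * iqr, q3 + k * iqr
    return [v for v in values if v < low or v > high]


def zscore_outliers(values, threshold=3.0):
    """Values more than `threshold` population standard deviations from the mean."""
    mean = statistics.fmean(values)
    std = statistics.pstdev(values)
    if std == 0:
        return []
    return [v for v in values if abs(v - mean) / std > threshold]


data = [10, 12, 11, 13, 12, 11, 10, 12, 95]

print(iqr_outliers(data))                    # [95]
print(zscore_outliers(data))                 # []
print(zscore_outliers(data, threshold=2.5))  # [95]
```

The empty Z-score result is expected: in a small sample a single extreme value inflates the standard deviation enough to keep its own score below 3. This is one reason to combine several methods rather than rely on one.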

<details>
<summary><strong>PreprocessingConfig</strong></summary>

| Field | Type | Default | Description |
|-------|------|---------|-------------|
| `auto_clean` | `bool` | `True` | Enable auto cleaning |
| `imputer_strategy` | `Dict` | `None` | Imputation methods per type |
| `encoder_strategy` | `Dict` | `None` | Encoding strategy |
| `scaler` | `str` | `"standard"` | Scaling method |
| `id_columns` | `List[str]` | `None` | Columns to remove |
</details>

<details>
<summary><strong>ModelingConfig</strong></summary>

| Field | Type | Default | Description |
|-------|------|---------|-------------|
| `train_models` | `bool` | `True` | Whether to train models |
| `models_to_train` | `List[str]` | `None` | Specific models to train |
| `evaluation_metric` | `str` | `None` | Primary evaluation metric |
| `n_models` | `int` | `5` | Number of models to train |
| `test_size` | `float` | `0.2` | Test split ratio |
</details>

<details>
<summary><strong>OptimizationConfig</strong></summary>

| Field | Type | Default | Description |
|-------|------|---------|-------------|
| `use_optuna` | `bool` | `True` | Enable Optuna tuning |
| `optuna_trials_per_model` | `int` | `20` | Trials per model |
| `optuna_timeout_seconds` | `int` | `300` | Timeout per model |
| `optuna_parallel_jobs` | `int` | `-1` | Parallel Optuna workers |
| `use_registry` | `bool` | `True` | Save models to registry |
</details>
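
The trial count and the timeout act as two independent budgets: tuning stops at whichever is exhausted first. Optuna's `study.optimize` accepts both `n_trials` and `timeout` for this purpose. The sketch below illustrates the same stopping rule with plain random search swapped in for Optuna's sampler; `budgeted_search` and the toy objective are hypothetical.

```python
import random
import time


def budgeted_search(objective, sample_params, n_trials=20, timeout_seconds=300.0, seed=42):
    """Random search that stops after n_trials or timeout_seconds, whichever comes first."""
    rng = random.Random(seed)
    deadline = time.monotonic() + timeout_seconds
    best_score, best_params, trials_run = float("-inf"), None, 0
    for _ in range(n_trials):
        if time.monotonic() >= deadline:
            break  # time budget exhausted before the trial budget
        params = sample_params(rng)
        score = objective(params)
        trials_run += 1
        if score > best_score:
            best_score, best_params = score, params
    return best_params, best_score, trials_run


def objective(params):
    # Toy objective with its maximum at x = 3
    return -(params["x"] - 3.0) ** 2


def sample_params(rng):
    return {"x": rng.uniform(0.0, 6.0)}


best_params, best_score, trials_run = budgeted_search(objective, sample_params)
print(trials_run)  # 20: the toy objective is fast, so the trial budget ends the search
```

With an expensive objective the deadline triggers first, so the number of completed trials can be lower than the configured count.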

<details>
<summary><strong>ReportingConfig</strong></summary>

| Field | Type | Default | Description |
|-------|------|---------|-------------|
| `generate_report` | `bool` | `True` | Generate PDF report |
| `report_detail` | `str` | `"detailed"` | `"brief"` or `"detailed"` |
| `include_shap` | `bool` | `True` | Include SHAP analysis |
| `plot_mode` | `str` | `"simple"` | `"simple"` or `"dashboard"` |
| `visuals_limit` | `int` | `10` | Max plots in report |
</details>

<details>
<summary><strong>ParallelConfig</strong></summary>

| Field | Type | Default | Description |
|-------|------|---------|-------------|
| `parallel_processing` | `bool` | `True` | Enable parallelism |
| `n_jobs` | `int` | `-1` | Number of cores (-1 = all) |
| `backend` | `str` | `"threading"` | Joblib backend |
| `verbose` | `int` | `0` | Verbosity level |
</details>
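
The `n_jobs=-1` convention comes from joblib, where a negative value means `n_cpus + 1 + n_jobs` workers (so `-2` leaves one core free). The sketch below resolves that value and runs a thread pool with it. `resolve_n_jobs` is a hypothetical helper, and `ThreadPoolExecutor` is used here as a stand-in for joblib's threading backend.

```python
import os
from concurrent.futures import ThreadPoolExecutor


def resolve_n_jobs(n_jobs, cpu_count=None):
    """Translate a joblib-style n_jobs value into a concrete worker count."""
    cpus = cpu_count or os.cpu_count() or 1
    if n_jobs == 0:
        raise ValueError("n_jobs must not be 0")
    if n_jobs < 0:
        # -1 -> all cores, -2 -> all but one, and so on (never below 1)
        return max(1, cpus + 1 + n_jobs)
    return n_jobs


def square(x):
    return x * x


with ThreadPoolExecutor(max_workers=resolve_n_jobs(-1)) as pool:
    results = list(pool.map(square, range(5)))

print(resolve_n_jobs(-1, cpu_count=8))  # 8
print(resolve_n_jobs(-2, cpu_count=8))  # 7
print(results)                          # [0, 1, 4, 9, 16]
```

Threads suit workloads that release the GIL, such as NumPy-heavy model fitting; pure-Python loops generally gain little from a threading backend.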

---

## 🏗️ Architecture

```
octolearn/
├── __init__.py              # Public API exports
├── config.py                # Centralized configuration constants
├── core.py                  # AutoML orchestrator (main entry point)
│
├── profiling/
│   └── data_profiler.py     # DataProfiler + DatasetProfile
│
├── preprocessing/
│   ├── auto_cleaner.py      # AutoCleaner (impute/encode/scale)
│   └── pipeline_builder.py  # sklearn Pipeline export
│
├── models/
│   ├── model_trainer.py     # ModelTrainer + Optuna integration
│   └── registry.py          # ModelRegistry (versioned storage)
│
├── evaluation/
│   └── metrics.py           # ModelEvaluator (classification/regression)
│
├── experiments/
│   ├── report_generator.py  # PDF report generation
│   ├── plot_generator.py    # Visualization engine
│   ├── recommendation_engine.py  # ML recommendations
│   ├── risk_scorer.py       # Data quality risk scoring
│   ├── outlier_detector.py  # Multi-method outlier detection
│   ├── baseline_importance.py    # Feature importance
│   └── preprocessing_suggester.py # Preprocessing advice
│
├── feature/
│   └── interaction_analyzer.py   # Feature interaction analysis
│
└── utils/
    └── helpers.py           # Logging, decorators, validation
```

### Pipeline Flow

```
Raw Data ──► Profiling ──► Train/Test Split ──► Auto Cleaning ──► Clean Profiling
                                                      │
                                                      ▼
                    PDF Report ◄── Model Training ◄── Feature Engineering
                                       │
                                       ▼
                              Optuna Optimization ──► Model Registry
```

The pipeline executes 6 phases:

1. **Profiling** — Infer types, detect quality issues, estimate task type
2. **Splitting** — Stratified train/test split
3. **Cleaning** — Impute missing values, encode categoricals, scale numerics
4. **Clean Profiling** — Re-profile the cleaned dataset
5. **Feature Engineering** — Outlier detection + interaction analysis
6. **Model Training** — Train multiple models with optional Optuna HPO
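
As an illustration of phase 2, a stratified split keeps each class's share roughly equal in the train and test sets. The function below is a dependency-free sketch under that assumption. It is not OctoLearn's implementation, which most likely delegates to scikit-learn.

```python
import random
from collections import defaultdict


def stratified_split(labels, test_size=0.2, random_state=42):
    """Return (train_indices, test_indices) with per-class proportions preserved."""
    rng = random.Random(random_state)
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)

    train_idx, test_idx = [], []
    for indices in by_class.values():
        rng.shuffle(indices)
        # Hold out at least one row per class so no class is missing from the test set
        n_test = max(1, round(len(indices) * test_size))
        test_idx.extend(indices[:n_test])
        train_idx.extend(indices[n_test:])
    return sorted(train_idx), sorted(test_idx)


labels = [0] * 80 + [1] * 20
train_idx, test_idx = stratified_split(labels)

print(len(train_idx), len(test_idx))     # 80 20
print(sum(labels[i] for i in test_idx))  # 4 positives, the same 20% share as the full data
```

Splitting before cleaning (phase 3) matters for the same reason: imputation and scaling statistics are then learned from training rows only.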

---

## 🧪 Running Tests

```bash
# Activate virtual environment first
python test_complete_pipeline.py
```

This exercises all pipeline phases with the Titanic dataset.

---

## 📝 License

MIT License — see [LICENSE](LICENSE) for details.

---

## 👤 Author

**Ghulam Muhammad Nabeel**

---

<p align="center">
  Built with ❤️ by the OctoLearn team
</p>
