Metadata-Version: 2.4
Name: riskreg
Version: 0.2.0
Summary: RiskReg: imbalanced regression toolkit (φ relevance, SERT & SERA metrics, NestedCV, oversampling).
Author-email: Bhavneet Singh <bsing048@uottawa.ca>
License: MIT License
        
        Copyright (c) 2024 Bhavneet Singh
        
        Permission is hereby granted, free of charge, to any person obtaining a copy
        of this software and associated documentation files (the "Software"), to deal
        in the Software without restriction, including without limitation the rights
        to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
        copies of the Software, and to permit persons to whom the Software is
        furnished to do so, subject to the following conditions:
        
        The above copyright notice and this permission notice shall be included in all
        copies or substantial portions of the Software.
        
        THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
        IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
        FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
        AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
        LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
        OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
        SOFTWARE.
        
Project-URL: Homepage, https://github.com/Bhavneet345/RiskReg
Project-URL: Documentation, https://github.com/Bhavneet345/RiskReg#readme
Project-URL: Source, https://github.com/Bhavneet345/RiskReg
Project-URL: Issues, https://github.com/Bhavneet345/RiskReg/issues
Project-URL: Changelog, https://github.com/Bhavneet345/RiskReg/releases
Requires-Python: >=3.8
Description-Content-Type: text/markdown
License-File: LICENSE
Requires-Dist: numpy<2.6,>=2.1
Requires-Dist: pandas==2.2.3
Requires-Dist: scikit-learn>=1.6.1
Requires-Dist: scipy==1.15.1
Requires-Dist: matplotlib==3.10.0
Requires-Dist: xgboost==2.1.1
Requires-Dist: tqdm>=4.0
Requires-Dist: rich==14.0.0
Provides-Extra: dl
Requires-Dist: tensorflow==2.15.0; python_version < "3.13" and extra == "dl"
Requires-Dist: keras==3.11.0; python_version < "3.13" and extra == "dl"
Requires-Dist: scikeras==0.13.0; extra == "dl"
Requires-Dist: tensorboard==2.15.0; python_version < "3.13" and extra == "dl"
Requires-Dist: h5py>=3.10; extra == "dl"
Requires-Dist: ml_dtypes<1.0.0,>=0.5.1; extra == "dl"
Requires-Dist: protobuf==5.28.3; extra == "dl"
Dynamic: license-file

# RiskReg: Rare-Event Regression Toolkit for Safety-Critical AI

[![PyPI version](https://badge.fury.io/py/riskreg.svg)](https://badge.fury.io/py/riskreg)
[![Python 3.8+](https://img.shields.io/badge/python-3.8+-blue.svg)](https://www.python.org/downloads/)
[![License: MIT](https://img.shields.io/badge/License-MIT-yellow.svg)](https://opensource.org/licenses/MIT)
[![GitHub stars](https://img.shields.io/github/stars/Bhavneet345/RiskReg.svg)](https://github.com/Bhavneet345/RiskReg)
[![GitHub license](https://img.shields.io/github/license/Bhavneet345/RiskReg.svg)](https://github.com/Bhavneet345/RiskReg/blob/main/LICENSE)

| | |
|---|---|
| **PyPI** | **[https://pypi.org/project/riskreg/](https://pypi.org/project/riskreg/)** |
| **Install** | `pip install riskreg` · optional DL extras: `pip install "riskreg[dl]"` |

**RiskReg** is an experimental framework for **imbalanced regression**: regression tasks in which rare target values matter most. Rare high or low targets are underrepresented in the training data, so standard models fit the common range well and perform poorly on the extreme values that are often the most critical to predict.

## **What is Imbalanced Regression?**

Standard regression losses weight every target value equally. In many real-world settings, however, the rare values matter most:
- **Housing prices**: Rare luxury properties are more important than common mid-range homes
- **Medical costs**: Extreme treatment costs are more critical than routine expenses  
- **Risk assessment**: High-risk events are more significant than normal occurrences
- **Financial modeling**: Market crashes and booms are more important than stable periods

RiskReg addresses this with **relevance-weighted evaluation** and **oversampling of rare target regions**.

## **Key Features**

### **Relevance Function (φ) System**
Identify rare and important regions in your target distribution using multiple methods:
- **Control Points**: Manual or auto-generated relevance thresholds
- **Density Estimation**: GMM, KDE, Histogram-based relevance
- **Spectral Analysis**: Frequency-domain relevance detection
- **Visualization**: Interactive plots showing relevance curves
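
To make the control-point idea concrete, here is a minimal sketch, assuming a box-plot rule: relevance is 0 at the median and rises to 1 at the box-plot fences. The name `boxplot_relevance` and its `coef` parameter are hypothetical and are not part of the `riskreg` API. The linear interpolation is a simplification, so the library's `phi` may produce different values.

```python
import numpy as np

def boxplot_relevance(y, coef=1.5):
    """Piecewise-linear relevance: 0 at the median, 1 at and beyond the fences."""
    y = np.asarray(y, dtype=float)
    q1, med, q3 = np.percentile(y, [25, 50, 75])
    iqr = q3 - q1
    if iqr == 0:
        # Degenerate distribution: no spread, so nothing is marked as rare.
        return np.zeros_like(y)
    lo, hi = q1 - coef * iqr, q3 + coef * iqr
    # Control points (value, relevance): (lo, 1), (median, 0), (hi, 1).
    # np.interp clamps outside [lo, hi], so extreme values keep relevance 1.
    return np.interp(y, [lo, med, hi], [1.0, 0.0, 1.0])

y = np.array([1, 2, 3, 4, 5, 6, 7, 8, 9, 100], dtype=float)
phi_values = boxplot_relevance(y)
print(phi_values.round(3))
```

The outlier `100` lies beyond the upper fence and receives relevance 1, while values near the median score close to 0.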

### **Regression oversampling**
Synthetic minority oversampling for regression with Gaussian noise (Branco et al., 2017):
- **Smart Sampling**: Over-sample rare regions, under-sample common regions
- **Gaussian Noise**: Add realistic variation to synthetic samples
- **Feature Handling**: Support for both numerical and categorical features
- **Quality Control**: Maintains data distribution characteristics
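
The Gaussian-noise step can be sketched as follows. This is an illustrative simplification, not the `smoter` implementation: it handles numeric features only, skips neighbour-based interpolation and under-sampling, and the function name `gaussian_oversample` and the `n_copies` parameter are assumptions made for the example.

```python
import numpy as np

def gaussian_oversample(X, y, phi, rel_thres=0.5, pert=0.02, n_copies=2, seed=0):
    """Append noisy copies of rare rows (phi >= rel_thres) to the data."""
    X, y, phi = (np.asarray(a, dtype=float) for a in (X, y, phi))
    rng = np.random.default_rng(seed)
    rare = phi >= rel_thres
    if not rare.any():
        return X, y
    # Repeat every rare row n_copies times, then jitter each copy.
    X_rep = np.repeat(X[rare], n_copies, axis=0)
    y_rep = np.repeat(y[rare], n_copies)
    # Noise scale: pert times the standard deviation of each column.
    X_new = X_rep + rng.standard_normal(X_rep.shape) * pert * X.std(axis=0)
    y_new = y_rep + rng.standard_normal(y_rep.shape) * pert * y.std()
    return np.vstack([X, X_new]), np.concatenate([y, y_new])

X = np.arange(20, dtype=float).reshape(10, 2)
y = np.arange(10, dtype=float)
phi_values = np.array([0.0] * 8 + [1.0, 1.0])  # last two rows are "rare"
X_bal, y_bal = gaussian_oversample(X, y, phi_values)
print(X_bal.shape, y_bal.shape)
```

The original rows are kept unchanged and the synthetic rows are appended after them.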

### **Advanced Evaluation Metrics**
Beyond traditional MAE/RMSE, RiskReg provides relevance-aware metrics:
- **SERA**: Squared Error-Relevance Area (φ-weighted error)
- **SERT**: Squared Error-Relevance at Threshold (cumulative error analysis)
- **Normalization**: Multiple strategies for cross-dataset comparison
- **Bias Analysis**: Quantify prediction bias across target regions
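
As a rough illustration of the φ-weighted idea, the sketch below follows the commonly used definition of SERA: for each threshold t in [0, 1], sum the squared errors of the cases with relevance at least t, then integrate that curve over t. The function name `sera` and the `steps` parameter are assumptions for this example. It does not reproduce `compute_sert_sera` or the normalizations listed above.

```python
import numpy as np

def sera(y_true, y_pred, phi, steps=1000):
    """Return (area, thresholds, ser), where ser[i] is the squared error at thresholds[i]."""
    y_true, y_pred, phi = (np.asarray(a, dtype=float) for a in (y_true, y_pred, phi))
    sq_err = (y_pred - y_true) ** 2
    thresholds = np.linspace(0.0, 1.0, steps + 1)
    # SER_t: total squared error over cases whose relevance is at least t.
    ser = np.array([sq_err[phi >= t].sum() for t in thresholds])
    # Trapezoidal rule over t in [0, 1].
    area = float(np.sum((ser[1:] + ser[:-1]) / 2.0 * np.diff(thresholds)))
    return area, thresholds, ser

area, thresholds, ser = sera(y_true=[0.0, 0.0], y_pred=[1.0, 2.0], phi=[0.0, 1.0])
print(area)
```

An error on a case with relevance 1 counts at every threshold, while an error on a case with relevance 0 counts only at t = 0, so the area is dominated by mistakes on rare targets.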

### **Robust Model Evaluation**
- **Nested Cross-Validation**: 5×5 CV with hyperparameter tuning
- **Multiple Models**: Linear Regression, Decision Trees, Random Forest, XGBoost, Neural Networks
- **Fair Comparison**: Relevance-weighted scoring prevents scale bias
- **Statistical Analysis**: Bootstrap confidence intervals and significance testing
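
For readers unfamiliar with nested cross-validation, here is a generic scikit-learn sketch, assuming 5 outer and 5 inner folds. It is not the code in `riskreg/nestedCV.py`: the synthetic data, the `Ridge` model, and the `alpha` grid are placeholders chosen to keep the example small.

```python
from sklearn.datasets import make_regression
from sklearn.linear_model import Ridge
from sklearn.model_selection import GridSearchCV, KFold, cross_val_score

# Placeholder data standing in for a real dataset.
X, y = make_regression(n_samples=200, n_features=10, noise=10.0, random_state=0)

inner_cv = KFold(n_splits=5, shuffle=True, random_state=0)  # tunes hyperparameters
outer_cv = KFold(n_splits=5, shuffle=True, random_state=1)  # estimates generalization

# The inner loop selects alpha on each outer training split only.
search = GridSearchCV(
    Ridge(),
    param_grid={"alpha": [0.1, 1.0, 10.0]},
    cv=inner_cv,
    scoring="neg_mean_squared_error",
)

# The outer loop scores the tuned model on data the search never saw.
scores = cross_val_score(search, X, y, cv=outer_cv, scoring="neg_mean_squared_error")
print("Mean outer-fold MSE:", -scores.mean())
```

Keeping tuning inside the outer training folds prevents the reported score from being biased by hyperparameter selection.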

### **Multi-Agent Pipeline (NEW)**
- **CrewAI Orchestration**: 5 specialized agents handle the full pipeline end-to-end
- **LangChain Tools**: Each pipeline stage exposed as a callable tool
- **Local LLM**: Runs on-device via llama-cpp-python (GGUF models) — no API keys needed
- **CLI Interface**: Single command to analyze, oversample, train, and evaluate

## **Quick Start**

### Installation

Published on PyPI: **[pypi.org/project/riskreg](https://pypi.org/project/riskreg/)**

```bash
# Install from PyPI
pip install riskreg

# Or with deep learning support (quoted so shells such as zsh do not expand the brackets)
pip install "riskreg[dl]"
```

### Basic Usage

```python
import pandas as pd
from riskreg import phi, smoter
from riskreg.nestedCV import computeNestedCV

# Load your dataset
df = pd.read_csv('your_data.csv')
target = 'price'  # Your target column

# 1. Compute relevance function
y_phi = phi(df[target], method="default")

# 2. Apply oversampling (smoter)
df_balanced = smoter(
    data=df,
    y=target,
    k=5,                    # Number of neighbors
    pert=0.02,             # Perturbation level
    samp_method="balance", # Sampling strategy
    rel_thres=0.5         # Relevance threshold
)

# 3. Run nested cross-validation
results = computeNestedCV('your_data.csv')
```

### Advanced Usage

```python
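import importlib

import pandas as pd
from riskreg import phi

# Load your dataset (same setup as in Basic Usage)
df = pd.read_csv('your_data.csv')
target = 'price'  # Your target column

# Custom relevance function with manual control points
# (each row is assumed to be [target value, relevance, slope])
ctrl_pts = [[100000, 1, 0], [200000, 0, 0], [400000, 1, 0]]
y_phi = phi(df[target], ctrl_pts=ctrl_pts, method="manual")

# Density-based relevance using GMM
y_phi = phi(df[target], method="gmm")

# Evaluate with SERT/SERA metrics. The module filename contains "&",
# so a plain import statement cannot name it; use importlib instead.
_sert = importlib.import_module("riskreg.SERT&SERA")
sert_sera_results = _sert.compute_sert_sera("housing", dt=0.05, t0=0.5)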
```

## **Multi-Agent Pipeline (LangChain + CrewAI)**

RiskReg includes a multi-agent pipeline that runs the entire imbalanced regression workflow with a local LLM. Five specialized agents run in sequence, each backed by LangChain tools that wrap the core library (the `riskreg/` directory in this repository, published on PyPI as **`riskreg`**).

### Architecture

| Agent | Role | Tools |
|-------|------|-------|
| **Data Analyst** | Profile the dataset, detect imbalance | `list_datasets`, `analyze_dataset` |
| **Phi Configurator** | Compute relevance function | `list_phi_methods`, `compute_phi_values` |
| **Oversampling Specialist** | Balance rare target regions | `run_oversampling` |
| **Model Trainer** | Nested CV with RF, XGB, SVR, DNN | `run_nested_cv` |
| **Evaluation Analyst** | SERT/SERA metrics + bias analysis | `compute_sert_sera`, `run_bias_analysis` |

### Setup

```bash
# Create a virtual environment (Python 3.13 recommended)
python3.13 -m venv .venv
source .venv/bin/activate

# Install core + agent dependencies
pip install -r requirements.txt
pip install "llama-cpp-python[server]" litellm
```

### Running the Pipeline

**Step 1** — Start the local LLM server (keep this running):

```bash
python -m llama_cpp.server \
  --model /path/to/your-model.gguf \
  --n_ctx 4096 --n_gpu_layers -1 \
  --host 127.0.0.1 --port 8787
```

**Step 2** — Launch the agent crew:

```bash
python -m agents.run --dataset housing --target SalePrice --phi-method default
```

### CLI Options

| Flag | Description | Default |
|------|-------------|---------|
| `--dataset` | Dataset name (file stem in `data/`) | `housing` |
| `--target` | Target column name | last column |
| `--phi-method` | `default`, `gmm`, `kde`, `hist`, `spectral` | `default` |
| `--rel-coef` | Box-plot coefficient for phi control points | `1.5` |
| `--list-datasets` | Print available datasets and exit | — |
| `--quiet` | Suppress verbose agent reasoning | — |
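
The flags in this table map naturally onto an `argparse` parser. The sketch below is a hypothetical reconstruction from the table alone, not the contents of `agents/run.py`. In particular, representing the "last column" default as `None` is an assumption.

```python
import argparse

def build_parser():
    parser = argparse.ArgumentParser(prog="agents.run")
    parser.add_argument("--dataset", default="housing",
                        help="Dataset name (file stem in data/)")
    parser.add_argument("--target", default=None,
                        help="Target column name (None here stands for: use the last column)")
    parser.add_argument("--phi-method", default="default",
                        choices=["default", "gmm", "kde", "hist", "spectral"])
    parser.add_argument("--rel-coef", type=float, default=1.5,
                        help="Box-plot coefficient for phi control points")
    parser.add_argument("--list-datasets", action="store_true",
                        help="Print available datasets and exit")
    parser.add_argument("--quiet", action="store_true",
                        help="Suppress verbose agent reasoning")
    return parser

args = build_parser().parse_args(
    ["--dataset", "housing", "--target", "SalePrice", "--phi-method", "gmm"]
)
print(args.dataset, args.target, args.phi_method, args.rel_coef)
```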

### Using Tools Standalone

Each tool can be used independently without the full crew:

```python
from agents.tools.data_tools import list_datasets, analyze_dataset
from agents.tools.phi_tools import compute_phi_values
from agents.tools.oversampling_tools import run_oversampling

# List all benchmark datasets
print(list_datasets.invoke({}))

# Analyze a dataset
print(analyze_dataset.invoke({"dataset_name": "housing"}))

# Compute phi relevance
print(compute_phi_values.invoke({"dataset_name": "housing", "method": "default"}))

# Run oversampling
print(run_oversampling.invoke({"dataset_name": "housing", "target_col": "SalePrice"}))
```

## **Research Results**

Our evaluation across **34 datasets** shows:

### **Model Performance Insights**
- **XGBoost & Random Forest**: Best on scale-normalized metrics (NRMSE/Target normalization)
- **Decision Trees**: Excel on relevance-focused metrics (SERT normalization)  
- **Linear Regression**: Most consistent across different normalizations
- **Neural Networks**: Competitive but require careful hyperparameter tuning

### **Normalization Impact**
- **Adjusted SERA**: 30% average improvement over raw metrics
- **NRMSE/Target normalization**: Best for cross-dataset comparison
- **SERT normalization**: Better for relevance-sensitive applications
- **Combined approaches**: Optimal for comprehensive evaluation

### **Relevance Method Sensitivity**
- **Default (boxplot)**: Most robust across datasets
- **GMM/KDE**: Better for multi-modal distributions
- **Histogram**: Good for discrete target spaces
- **Spectral**: Effective for time-series-like patterns
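
To show how a density-based method differs from the box-plot default, here is a minimal histogram sketch, assuming relevance is defined as one minus the normalized bin count. The function `histogram_relevance` and the `bins` parameter are hypothetical, and the library's histogram method may smooth or scale differently.

```python
import numpy as np

def histogram_relevance(y, bins=20):
    """Relevance = 1 - (bin count / largest bin count), so sparse bins score near 1."""
    y = np.asarray(y, dtype=float)
    counts, edges = np.histogram(y, bins=bins)
    # Interior edges map each value to a bin index in 0..bins-1.
    idx = np.digitize(y, edges[1:-1])
    return 1.0 - counts[idx] / counts.max()

# 90 common values and 10 rare ones, split into two bins.
y = np.array([0.0] * 90 + [10.0] * 10)
phi_values = histogram_relevance(y, bins=2)
print(phi_values[0], phi_values[-1])
```

Unlike the box-plot rule, this assigns high relevance to any sparsely populated region, including gaps between modes in a multi-modal target.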

## **Project Structure**

```
RiskReg/
├── riskreg/                       # Core package (also published on PyPI as `riskreg`)
│   ├── phi.py                     # Relevance function computation
│   ├── phi_ctrl_pts.py           # Control point generation
│   ├── phi_density_methods.py    # Density-based methods
│   ├── smoter.py                 # Regression oversampling (smoter)
│   ├── nestedCV.py               # Nested cross-validation
│   ├── SERT&SERA.py              # Evaluation metrics
│   ├── imbreg_region_bias_analysis.py  # Region bias analysis
│   └── ...
├── agents/                        # LangChain + CrewAI multi-agent pipeline
│   ├── config.py                 # LLM config (local GGUF via llama-cpp-python)
│   ├── crew.py                   # CrewAI crew: 5 agents, 5 tasks
│   ├── run.py                    # CLI entry point
│   └── tools/                    # LangChain @tool wrappers
│       ├── data_tools.py         # list_datasets, analyze_dataset
│       ├── phi_tools.py          # list_phi_methods, compute_phi_values
│       ├── oversampling_tools.py # run_oversampling
│       ├── training_tools.py     # run_nested_cv
│       └── eval_tools.py         # compute_sert_sera, run_bias_analysis
├── data/                          # 34 benchmark datasets
├── pypi/                          # PyPI packaging (`pip install riskreg`; imports `riskreg`)
├── results/                       # Experimental results
│   ├── predictions/              # Model predictions
│   ├── tables/                   # Summary tables
│   └── plots/                    # Visualizations
└── notebooks/                     # Analysis notebooks
```

## **Documentation**

### **PyPI Package Documentation**
For detailed API documentation, installation guides, and examples, see the [pypi/](pypi/) directory.

### **Tutorial Notebooks**
- **RiskReg_TUTORIAL.ipynb**: Complete walkthrough of the framework
- **SERA_SERT_Analysis.ipynb**: Advanced analysis techniques
- **Examples/**: Quick-start examples and use cases

## **Experimental Framework**

### **Datasets (34 total)**
- **Housing**: Boston, California, House prices
- **Healthcare**: Medical costs, treatment outcomes
- **Finance**: Mortgage rates, insurance claims
- **Manufacturing**: Engine performance, quality metrics
- **Energy**: Power consumption, efficiency data
- **And more...**

### **Evaluation Protocol**
1. **Data Preprocessing**: Dataset-specific feature engineering
2. **Relevance Computation**: Multiple φ methods
3. **Model Training**: 5×5 nested cross-validation
4. **Metric Calculation**: Standard + relevance-weighted metrics
5. **Statistical Analysis**: Bootstrap confidence intervals
6. **Visualization**: Comprehensive plots and tables
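
The bootstrap step of the protocol can be sketched with a percentile bootstrap. Whether RiskReg uses the percentile variant or another one is not stated here, so treat that choice, the function name `bootstrap_ci`, and its defaults as assumptions.

```python
import numpy as np

def bootstrap_ci(values, stat=np.mean, n_boot=2000, alpha=0.05, seed=0):
    """Percentile bootstrap confidence interval for stat(values)."""
    values = np.asarray(values, dtype=float)
    rng = np.random.default_rng(seed)
    # Each row holds the indices of one resample drawn with replacement.
    idx = rng.integers(0, len(values), size=(n_boot, len(values)))
    stats = np.apply_along_axis(stat, 1, values[idx])
    lo, hi = np.percentile(stats, [100 * alpha / 2, 100 * (1 - alpha / 2)])
    return float(lo), float(hi)

# Placeholder per-sample errors; in practice these would come from model predictions.
errors = np.arange(100.0)
lo, hi = bootstrap_ci(errors)
print(f"95% CI for the mean error: [{lo:.2f}, {hi:.2f}]")
```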

## **Advanced Features**

### **Bias Analysis**
Use `agents.tools.eval_tools.run_bias_analysis` or follow the script in `riskreg/imbreg_region_bias_analysis.py` (configure `INPUT_CSV` and run as a script).
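
As a simplified picture of what a region-level bias analysis measures, the sketch below splits the target into quantile regions and reports the mean signed error in each. It is an assumption-laden illustration: `region_bias` and `n_regions` are hypothetical names, and the script in `riskreg/imbreg_region_bias_analysis.py` may define regions and bias differently.

```python
import numpy as np

def region_bias(y_true, y_pred, n_regions=3):
    """Mean signed error (prediction minus truth) per quantile region of the target."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    edges = np.quantile(y_true, np.linspace(0.0, 1.0, n_regions + 1))
    # Interior edges map each target to a region index in 0..n_regions-1.
    region = np.digitize(y_true, edges[1:-1])
    signed_err = y_pred - y_true
    return [
        float(signed_err[region == r].mean()) if np.any(region == r) else float("nan")
        for r in range(n_regions)
    ]

y_true = np.arange(9.0)
y_pred = y_true.copy()
y_pred[6:] -= 2.0  # the model under-predicts the highest targets
bias = region_bias(y_true, y_pred)
print(bias)
```

A negative value in the top region indicates systematic under-prediction of high targets, which is the pattern imbalanced regression typically produces.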

### **Cross-Dataset Comparison**
```python
from pathlib import Path
from riskreg.combineSertSera import combine
combine(Path("sert&sera"), Path("combined_results.csv"), {"Dataset", "Model", "Method", "Raw SERA", "Adj SERA"})
```

### **Dataset Summarization**
```python
from riskreg.summarize_datasets import summarize
# Configure paths inside summarize() for your phi CSV directory — see riskreg/summarize_datasets.py
```

## **Visualization Examples**

The framework generates comprehensive visualizations:

- **Relevance Curves**: Show φ function across target range
- **SERT Plots**: Cumulative error at different relevance thresholds  
- **Bias Analysis**: Prediction bias across target regions
- **Model Comparison**: Performance across different normalizations
- **Distribution Plots**: Before/after oversampling

## **Contributing**

We welcome contributions! Please see our [Contributing Guidelines](CONTRIBUTING.md) for details.

### **Development Setup**
```bash
git clone https://github.com/Bhavneet345/RiskReg.git
cd RiskReg
pip install -e .
pip install -e ".[dl]"       # For deep learning features
pip install -e ".[agents]"   # For multi-agent pipeline (LangChain + CrewAI)
```

## **License**

This project is licensed under the **MIT License** - see the [LICENSE](LICENSE) file for details.

### **License Summary**
- **Commercial use** allowed
- **Modification** allowed  
- **Distribution** allowed
- **Private use** allowed
- **No liability** accepted by the authors
- **No warranty** provided

**Full License Text**: [LICENSE](LICENSE)

## **Acknowledgments**

- **Regression oversampling**: Based on Branco et al. (2017)
- **Relevance Functions**: Inspired by Ribeiro (2011)
- **Evaluation Metrics**: SERT/SERA methodology
- **Datasets**: Various sources including UCI ML Repository

## **Contact**

- **Author**: Bhavneet Singh
- **Email**: bsing048@uottawa.ca
- **GitHub**: [@Bhavneet345](https://github.com/Bhavneet345)
- **PyPI**: [riskreg](https://pypi.org/project/riskreg/)

## **Links**

- **GitHub Repository**: https://github.com/Bhavneet345/RiskReg
- **PyPI Package**: https://pypi.org/project/riskreg/
- **Documentation**: [pypi/](pypi/)
- **Issues**: https://github.com/Bhavneet345/RiskReg/issues
- **Releases**: https://github.com/Bhavneet345/RiskReg/releases

---

**RiskReg** - Making regression fair for rare but important predictions.
