Metadata-Version: 2.4
Name: riskx
Version: 0.1.0
Summary: End-to-End Automated Risk Scoring Platform for Credit, Fraud, and Churn Prediction
Home-page: https://github.com/idrissbado/RiskX
Author: Idriss Bado
Author-email: Idriss Bado <idrissbadoolivier@gmail.com>
License: MIT
Project-URL: Homepage, https://github.com/idrissbado/RiskX
Project-URL: Documentation, https://github.com/idrissbado/RiskX/blob/main/README.md
Project-URL: Repository, https://github.com/idrissbado/RiskX
Project-URL: Issues, https://github.com/idrissbado/RiskX/issues
Keywords: risk-scoring,credit-scoring,fraud-detection,ml,automl,risk-management,financial-risk
Classifier: Development Status :: 4 - Beta
Classifier: Intended Audience :: Developers
Classifier: Intended Audience :: Financial and Insurance Industry
Classifier: Intended Audience :: Science/Research
Classifier: Topic :: Scientific/Engineering :: Artificial Intelligence
Classifier: Topic :: Office/Business :: Financial
Classifier: License :: OSI Approved :: MIT License
Classifier: Programming Language :: Python :: 3
Classifier: Programming Language :: Python :: 3.8
Classifier: Programming Language :: Python :: 3.9
Classifier: Programming Language :: Python :: 3.10
Classifier: Programming Language :: Python :: 3.11
Classifier: Operating System :: OS Independent
Requires-Python: >=3.8
Description-Content-Type: text/markdown
License-File: LICENSE
Requires-Dist: pandas>=1.3.0
Requires-Dist: numpy>=1.21.0
Requires-Dist: scikit-learn>=1.0.0
Provides-Extra: full
Requires-Dist: xgboost>=1.5.0; extra == "full"
Requires-Dist: lightgbm>=3.3.0; extra == "full"
Requires-Dist: optuna>=2.10.0; extra == "full"
Requires-Dist: shap>=0.40.0; extra == "full"
Requires-Dist: sqlalchemy>=1.4.0; extra == "full"
Requires-Dist: requests>=2.26.0; extra == "full"
Requires-Dist: pyarrow>=6.0.0; extra == "full"
Provides-Extra: ml
Requires-Dist: xgboost>=1.5.0; extra == "ml"
Requires-Dist: lightgbm>=3.3.0; extra == "ml"
Requires-Dist: optuna>=2.10.0; extra == "ml"
Provides-Extra: data
Requires-Dist: sqlalchemy>=1.4.0; extra == "data"
Requires-Dist: requests>=2.26.0; extra == "data"
Requires-Dist: pyarrow>=6.0.0; extra == "data"
Requires-Dist: openpyxl>=3.0.0; extra == "data"
Provides-Extra: explain
Requires-Dist: shap>=0.40.0; extra == "explain"
Provides-Extra: dev
Requires-Dist: pytest>=7.0.0; extra == "dev"
Requires-Dist: pytest-cov>=3.0.0; extra == "dev"
Requires-Dist: black>=22.0.0; extra == "dev"
Requires-Dist: flake8>=4.0.0; extra == "dev"
Requires-Dist: mypy>=0.950; extra == "dev"
Dynamic: author
Dynamic: home-page
Dynamic: license-file
Dynamic: requires-python

# 🚀 RiskX - End-to-End Automated Risk Scoring Platform

**v0.1.0** | Production-Ready Core | Credit • Fraud • Churn Risk Scoring

[![Python 3.8+](https://img.shields.io/badge/python-3.8+-blue.svg)](https://www.python.org/downloads/)
[![License: MIT](https://img.shields.io/badge/License-MIT-yellow.svg)](https://opensource.org/licenses/MIT)
[![PyPI version](https://badge.fury.io/py/riskx.svg)](https://badge.fury.io/py/riskx)

> **RiskX** is a comprehensive, production-ready platform for automated risk scoring. Built for financial institutions, fintech companies, and data scientists working on credit scoring, fraud detection, and customer churn prediction.

---

## 🎯 What is RiskX?

RiskX provides an **end-to-end automated workflow** for risk scoring:

1. **Data Loading** - Multi-source data ingestion (CSV, Excel, SQL, APIs, Cloud)
2. **Data Cleaning** - Automated quality checks and preprocessing
3. **Feature Engineering** - Risk-specific features (WOE/IV, RFM, behavioral)
4. **ML Training** - AutoML with multiple algorithms (LR, RF, XGBoost, LightGBM)
5. **Scoring** - Real-time and batch scoring with interpretability
6. **Monitoring** - Model performance and data drift detection (coming soon)

---

## ✨ Key Features

### 🔥 What's Working NOW (v0.1.0)

#### 1. **Multi-Source Data Loading**
Load data from 8+ different sources:
- CSV, Excel, JSON, Parquet files
- SQL databases (via SQLAlchemy)
- REST APIs
- Cloud data lakes (Azure, AWS, GCP)
- Pandas DataFrames

#### 2. **Automated Data Cleaning**
7 powerful cleaning methods:
- Missing value imputation (6 strategies)
- Outlier detection and handling (IQR, Z-score, clipping)
- Type validation and correction
- Categorical encoding (label, one-hot)
- Feature scaling (standard, min-max)
- Duplicate removal
- **Full automated pipeline** with `auto_clean()`

#### 3. **Risk-Specific Feature Engineering**
Create 50+ features automatically:
- **WOE (Weight of Evidence)** & **IV (Information Value)**
- Optimal binning (quantile, uniform, kmeans)
- **RFM analysis** (Recency, Frequency, Monetary)
- Behavioral features from transactions
- Time-based features (11 datetime extractions)
- Ratio and interaction features
- **Full automated pipeline** with `auto_features()`

#### 4. **AutoML Training**
Train and compare 4 algorithms:
- Logistic Regression
- Random Forest
- XGBoost
- LightGBM
- **Automatic best model selection**
- Model calibration (isotonic, sigmoid)
- Ensemble methods (voting, stacking)
- Hyperparameter optimization (Optuna)

#### 5. **Production-Ready Scoring**
API-ready scoring engine:
- Real-time single predictions
- Batch scoring
- Score range: 300-850 (configurable)
- Risk ratings: Excellent, Very Good, Good, Fair, Poor
- **Reason codes** for interpretability
- Score interpretation and recommendations
- API specification export

---

## 🚀 Quick Start

### Installation

```bash
# Core installation (pandas, numpy, scikit-learn)
pip install riskx

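# Full installation (adds XGBoost, LightGBM, Optuna, SHAP, SQLAlchemy, requests, PyArrow)
# Quote the extras so shells such as zsh do not try to expand the brackets
pip install "riskx[full]"

# ML only (XGBoost, LightGBM, Optuna)
pip install "riskx[ml]"

# Data sources (SQLAlchemy, requests, PyArrow, openpyxl)
pip install "riskx[data]"

# Explainability dependency (SHAP)
pip install "riskx[explain]"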
```
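
Because XGBoost, LightGBM, Optuna and the data-source drivers are optional, it can help to check which of them are importable before requesting them. The snippet below is a minimal sketch using only the standard library; `available_extras` is a hypothetical helper, not part of the RiskX API, and the package lists simply mirror the extras declared in the package metadata.

```python
from importlib.util import find_spec

# Import names of the optional packages declared under each extra.
# (Hypothetical helper data, copied from the package metadata; not a RiskX API.)
EXTRAS = {
    "ml": ["xgboost", "lightgbm", "optuna"],
    "data": ["sqlalchemy", "requests", "pyarrow", "openpyxl"],
    "explain": ["shap"],
}


def available_extras(extras=EXTRAS):
    """Return {extra: {package: bool}} showing which optional packages can be imported."""
    return {
        extra: {pkg: find_spec(pkg) is not None for pkg in packages}
        for extra, packages in extras.items()
    }


if __name__ == "__main__":
    for extra, packages in available_extras().items():
        missing = [pkg for pkg, ok in packages.items() if not ok]
        status = "ok" if not missing else "missing: " + ", ".join(missing)
        print(f"riskx[{extra}]: {status}")
```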

### Basic Usage

```python
from riskx import RiskDataConnector, RiskCleaner, RiskFeatureEngine
from riskx import RiskAutoModel, ScoringEngine

# 1. Load data
connector = RiskDataConnector()
data = connector.from_csv("loan_applications.csv")

# 2. Clean data (automated)
cleaner = RiskCleaner()
data_clean = cleaner.auto_clean(data, target_column="default")

# 3. Engineer features (automated)
feature_engine = RiskFeatureEngine()
data_features = feature_engine.auto_features(data_clean, target="default")

# 4. Train models (AutoML)
model = RiskAutoModel()
X = data_features.drop("default", axis=1)
y = data_features["default"]
# 'xgboost' needs the optional ML extra: pip install "riskx[ml]"
results = model.train_auto(X, y, algorithms=['logistic', 'rf', 'xgboost'])

# 5. Score new applications
scorer = ScoringEngine(model.get_best_model())
new_application = {
    "income": 75000,
    "credit_history_years": 8,
    "debt_to_income": 0.25,
    "age": 35
}
result = scorer.score_single(new_application)

print(f"Credit Score: {result['score']}")
print(f"Rating: {result['rating']}")
print(f"Risk Level: {result['risk_level']}")
print(f"Reason Codes: {result['reason_codes']}")
```

**Example output** (illustrative; actual values depend on your data and model):
```
Credit Score: 742
Rating: Very Good
Risk Level: Low
Reason Codes: [
    {'code': 'RC1', 'feature': 'credit_history_years', 'importance': 0.35},
    {'code': 'RC2', 'feature': 'debt_to_income', 'importance': 0.28},
    {'code': 'RC3', 'feature': 'income', 'importance': 0.22}
]
```

---

## 📖 Detailed Examples

### Example 1: Credit Scoring Pipeline

```python
from riskx import RiskDataConnector, RiskCleaner, RiskFeatureEngine, RiskAutoModel, ScoringEngine

# Load credit application data
connector = RiskDataConnector()
data = connector.from_sql(
    connection_string="postgresql://user:pass@localhost/credit_db",
    query="SELECT * FROM applications WHERE created_date >= '2024-01-01'"
)

# Auto-clean
cleaner = RiskCleaner()
data_clean = cleaner.auto_clean(data, target_column="approved")

print(f"Cleaned {len(data_clean)} records")

# Feature engineering with WOE/IV
feature_engine = RiskFeatureEngine()

# Compute WOE/IV for key features
woe_df, iv = feature_engine.compute_woe_iv(data_clean, 'annual_income', 'approved', n_bins=10)
print(f"Information Value: {iv:.4f}")

# Auto-generate all features
data_features = feature_engine.auto_features(data_clean, target='approved')

# Train models
model = RiskAutoModel()
X = data_features.drop('approved', axis=1)
y = data_features['approved']

results = model.train_auto(
    X, y,
    algorithms=['logistic', 'rf', 'xgboost', 'lightgbm'],
    metric='auc'
)

# Get best model
best_model = model.get_best_model()
print(f"Best model AUC: {model.best_score:.4f}")

# Calibrate for better probabilities
calibrated_model = model.calibrate_model(X, y, method='isotonic')

# Score new applications
scorer = ScoringEngine(calibrated_model)
new_apps = [
    {"annual_income": 50000, "debt_ratio": 0.35, "age": 28},
    {"annual_income": 120000, "debt_ratio": 0.15, "age": 42}
]

for app in new_apps:
    score = scorer.score_single(app)
    print(f"Score: {score['score']}, Rating: {score['rating']}")
```

### Example 2: Fraud Detection

```python
from riskx import RiskDataConnector, RiskFeatureEngine, RiskAutoModel

# Load transaction data from API
connector = RiskDataConnector()
transactions = connector.from_api(
    url="https://api.example.com/transactions",
    headers={"Authorization": "Bearer YOUR_TOKEN"},
    params={"days": 90}
)

# Create behavioral features
feature_engine = RiskFeatureEngine()
behavioral_features = feature_engine.behavioral_features(
    df=transactions,
    customer_id='customer_id',
    time_column='transaction_date',
    value_column='amount'
)

# Features include: recency, frequency, monetary, velocity
print(behavioral_features.head())

# Train fraud detection model (assumes the feature frame carries an `is_fraud` label column)
model = RiskAutoModel()
X = behavioral_features.drop('is_fraud', axis=1)
y = behavioral_features['is_fraud']

results = model.train_auto(X, y, algorithms=['rf', 'xgboost'])
```
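
To make the recency, frequency and monetary features concrete, here is a minimal standard-library sketch of an RFM aggregation. It is independent of RiskX: the function `rfm`, its input layout and the output keys are assumptions chosen for illustration, and `behavioral_features` may compute or name these values differently.

```python
from collections import defaultdict
from datetime import date


def rfm(transactions, as_of):
    """Aggregate (customer_id, date, amount) rows into per-customer RFM values.

    Hypothetical helper for illustration only; not the RiskX implementation.
    """
    last_seen = {}
    count = defaultdict(int)
    total = defaultdict(float)
    for customer, when, amount in transactions:
        if customer not in last_seen or when > last_seen[customer]:
            last_seen[customer] = when
        count[customer] += 1
        total[customer] += amount
    return {
        customer: {
            "recency_days": (as_of - last_seen[customer]).days,  # days since last transaction
            "frequency": count[customer],                        # number of transactions
            "monetary": total[customer],                         # total amount
        }
        for customer in last_seen
    }


transactions = [
    ("c1", date(2024, 1, 5), 120.0),
    ("c1", date(2024, 3, 1), 80.0),
    ("c2", date(2024, 2, 10), 500.0),
]
features = rfm(transactions, as_of=date(2024, 3, 31))
print(features["c1"])  # {'recency_days': 30, 'frequency': 2, 'monetary': 200.0}
```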

### Example 3: Churn Prediction

```python
from riskx import RiskDataConnector, RiskFeatureEngine, RiskAutoModel

# Load customer data from Data Lake
connector = RiskDataConnector()
customers = connector.from_datalake(
    path="abfss://container@account.dfs.core.windows.net/customers/",
    storage_options={
        "account_name": "your_account",
        "account_key": "your_key"
    }
)

# Time-based features
feature_engine = RiskFeatureEngine()
customers_with_time = feature_engine.time_features(customers, 'last_activity_date')

# Transaction aggregations
customers_with_trans = feature_engine.transaction_features(
    customers_with_time,  # build on the time features computed above
    group_by='customer_id',
    agg_columns=['purchase_amount', 'login_count', 'support_tickets']
)

# Ratio features (e.g., support_tickets / login_count)
customers_final = feature_engine.ratio_features(
    customers_with_trans,
    numerator_cols=['support_tickets'],
    denominator_cols=['login_count']
)

# Train churn model
model = RiskAutoModel()
X = customers_final.drop('churned', axis=1)
y = customers_final['churned']

results = model.train_auto(X, y, algorithms=['lightgbm', 'xgboost'])
```
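
Ratio features such as `support_tickets / login_count` need a rule for zero denominators (for example, customers who never logged in). How `ratio_features` treats that case is not documented here, so the sketch below only shows one possible convention; `safe_ratio` is a hypothetical helper, not a RiskX function.

```python
def safe_ratio(numerator, denominator, default=None):
    """Return numerator / denominator, or `default` when the denominator is zero or missing."""
    if denominator in (0, None):
        return default
    return numerator / denominator


rows = [
    {"customer_id": "c1", "support_tickets": 3, "login_count": 12},
    {"customer_id": "c2", "support_tickets": 1, "login_count": 0},
]
for row in rows:
    row["tickets_per_login"] = safe_ratio(row["support_tickets"], row["login_count"])
    print(row["customer_id"], row["tickets_per_login"])
# c1 0.25
# c2 None
```

Returning `None` keeps "undefined" distinguishable from a true ratio of zero; a downstream imputation step can then decide how to fill it.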

---

## 🔧 Advanced Features

### Hyperparameter Optimization

```python
from riskx import RiskAutoModel

model = RiskAutoModel()

# Optimize XGBoost hyperparameters with Optuna (both need the `ml` extra).
# X_train, y_train: your training features and labels.
best_params = model.optimize_hyperparameters(
    X_train, y_train,
    algorithm='xgboost',
    n_trials=50
)

print(f"Best parameters: {best_params}")
```

### Ensemble Models

```python
from riskx import RiskAutoModel

model = RiskAutoModel()

# Train multiple models
model.train_auto(X, y, algorithms=['logistic', 'rf', 'xgboost'])

# Create voting ensemble
ensemble = model.create_ensemble(X, y, method='voting')

# Or stacking ensemble
stacked_ensemble = model.create_ensemble(X, y, method='stacking')
```
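
For intuition, a soft-voting ensemble averages the probabilities predicted by each model. The following is a minimal sketch of that idea in plain Python; `soft_vote` is a hypothetical helper, and `create_ensemble(method='voting')` may be implemented differently (for instance with hard voting or learned weights).

```python
def soft_vote(prob_lists, weights=None):
    """Combine per-model probabilities row by row with a (weighted) average.

    prob_lists: one list of probabilities per model, all of the same length.
    Hypothetical illustration, not the RiskX implementation.
    """
    weights = weights or [1.0] * len(prob_lists)
    if len(weights) != len(prob_lists):
        raise ValueError("need exactly one weight per model")
    total = sum(weights)
    return [
        sum(w * p for w, p in zip(weights, row)) / total
        for row in zip(*prob_lists)
    ]


logistic_probs = [0.25, 0.5]
rf_probs = [0.75, 0.5]
print(soft_vote([logistic_probs, rf_probs]))  # [0.5, 0.5]
```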

### Batch Scoring

```python
from riskx import ScoringEngine
import pandas as pd

scorer = ScoringEngine(model)  # model: a trained model, as in Basic Usage

# Score thousands of applications at once
applications_df = pd.read_csv("new_applications.csv")
scored_df = scorer.score_batch(applications_df)

# Results include score, probability, rating, risk_level for each row
print(scored_df[['score', 'rating', 'risk_level']].head())
```

### Custom Score Binning

```python
from riskx import ScoringEngine

scorer = ScoringEngine(model)

# Custom score bins
custom_bins = {
    'Excellent': (750, 850),
    'Good': (650, 749),
    'Fair': (550, 649),
    'Poor': (300, 549)
}

scorer.set_custom_bins(custom_bins)
```
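
Custom bins are easy to get subtly wrong (a gap at 649/650, or overlapping ranges). Below is a small sketch of how a bin table like the one above can be validated and applied, assuming integer scores and inclusive `(low, high)` ranges. `check_bins` and `rating_for` are hypothetical helpers; whether `set_custom_bins` performs similar validation is not stated here.

```python
def check_bins(bins, score_min=300, score_max=850):
    """Raise ValueError unless the inclusive integer ranges tile [score_min, score_max]."""
    ranges = sorted(bins.values())
    if ranges[0][0] != score_min or ranges[-1][1] != score_max:
        raise ValueError("bins do not cover the full score range")
    for (_, prev_high), (next_low, _) in zip(ranges, ranges[1:]):
        if next_low != prev_high + 1:
            raise ValueError(f"gap or overlap between {prev_high} and {next_low}")


def rating_for(score, bins):
    """Return the label whose inclusive range contains `score`, or None."""
    for label, (low, high) in bins.items():
        if low <= score <= high:
            return label
    return None


custom_bins = {
    "Excellent": (750, 850),
    "Good": (650, 749),
    "Fair": (550, 649),
    "Poor": (300, 549),
}
check_bins(custom_bins)               # passes silently
print(rating_for(742, custom_bins))   # Good
```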

---

## 📊 API Reference

### RiskDataConnector

**Load data from multiple sources:**

```python
connector = RiskDataConnector()

# CSV files
data = connector.from_csv("data.csv")

# Excel files
data = connector.from_excel("data.xlsx", sheet_name="Sheet1")

# SQL databases
data = connector.from_sql("postgresql://localhost/db", "SELECT * FROM table")

# REST APIs
data = connector.from_api("https://api.example.com/data")

# JSON files
data = connector.from_json("data.json")

# Parquet files
data = connector.from_parquet("data.parquet")

# Cloud data lakes (Azure, AWS, GCP)
data = connector.from_datalake("s3://bucket/path/")

# Pandas DataFrame
data = connector.from_dataframe(df)
```

### RiskCleaner

**7 cleaning methods:**

```python
cleaner = RiskCleaner()

# Data quality profiling
profile = cleaner.profile(df)

# Missing value handling
df_clean = cleaner.clean_missing(df, strategy='auto')  # auto, mean, median, mode, forward, drop, fill

# Outlier handling
df_clean = cleaner.clean_outliers(df, method='iqr')  # iqr, zscore, clip

# Type validation
df_clean = cleaner.clean_types(df, type_map={'age': 'int', 'income': 'float'})

# Categorical encoding
df_encoded = cleaner.encode_categorical(df, columns=['category'], method='onehot')

# Feature scaling
df_scaled = cleaner.normalize(df, columns=['income', 'age'], method='standard')

# Duplicate removal
df_unique = cleaner.remove_duplicates(df)

# Full automated pipeline
df_clean = cleaner.auto_clean(df, target_column='default')
```
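
As background for the `iqr` option, IQR-based outlier handling usually means Tukey's fences: values outside `[Q1 - 1.5*IQR, Q3 + 1.5*IQR]` are treated as outliers. The sketch below clips to those fences with the standard library only. `clip_iqr` is a hypothetical helper, and `clean_outliers` may use a different quantile method or remove rows instead of clipping.

```python
from statistics import quantiles


def clip_iqr(values, k=1.5):
    """Clip each value to Tukey's fences [Q1 - k*IQR, Q3 + k*IQR].

    Hypothetical illustration, not the RiskX implementation.
    """
    q1, _, q3 = quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    low, high = q1 - k * iqr, q3 + k * iqr
    return [min(max(v, low), high) for v in values]


incomes = [1, 2, 3, 4, 5, 6, 7, 8, 9, 100]
print(clip_iqr(incomes))  # the 100 is pulled down to the upper fence, 14.5
```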

### RiskFeatureEngine

**Create risk-specific features:**

```python
engine = RiskFeatureEngine()

# WOE/IV calculation
woe_df, iv = engine.compute_woe_iv(df, 'income', 'default', n_bins=10)

# Optimal binning
df_binned = engine.auto_bin(df, 'age', n_bins=10, method='quantile')

# Behavioral features (RFM)
behavioral = engine.behavioral_features(df, 'customer_id', 'date', 'amount')

# Transaction aggregations
trans_features = engine.transaction_features(df, 'customer_id', ['amount', 'count'])

# Time features (11 extractions)
time_features = engine.time_features(df, 'transaction_date')

# Ratio features
ratio_features = engine.ratio_features(df, ['revenue'], ['cost'])

# Interaction features
interaction_features = engine.interaction_features(df, ['age', 'income'])

# Full automated pipeline
all_features = engine.auto_features(df, target='default')
```
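
For readers new to WOE and IV, the standard definitions are: for each bin, WOE is the log of the share of goods over the share of bads, and IV is the sum over bins of (share of goods minus share of bads) times WOE. The sketch below implements that on pre-binned counts. `woe_iv`, its input layout, the sign convention (good over bad) and the additive smoothing are assumptions for illustration; `compute_woe_iv` may differ on any of them.

```python
import math


def woe_iv(bins, smoothing=0.5):
    """Compute WOE per bin and the total IV from (label, n_good, n_bad) tuples.

    Hypothetical illustration, not the RiskX implementation.
    """
    total_good = sum(good for _, good, _ in bins)
    total_bad = sum(bad for _, _, bad in bins)
    n_bins = len(bins)
    rows, iv = [], 0.0
    for label, good, bad in bins:
        # Additive smoothing keeps the log finite when a bin has no goods or no bads.
        dist_good = (good + smoothing) / (total_good + smoothing * n_bins)
        dist_bad = (bad + smoothing) / (total_bad + smoothing * n_bins)
        woe = math.log(dist_good / dist_bad)
        iv += (dist_good - dist_bad) * woe
        rows.append((label, woe))
    return rows, iv


income_bins = [("low", 20, 30), ("high", 80, 20)]  # made-up counts
rows, iv = woe_iv(income_bins)
for label, woe in rows:
    print(f"{label}: WOE = {woe:+.3f}")
print(f"IV = {iv:.3f}")
```

With this sign convention, a negative WOE marks a bin where bads are over-represented.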

### RiskAutoModel

**AutoML training:**

```python
model = RiskAutoModel()

# Train multiple algorithms
results = model.train_auto(X, y, algorithms=['logistic', 'rf', 'xgboost', 'lightgbm'])

# Get best model
best = model.get_best_model()

# Calibrate model
calibrated = model.calibrate_model(X, y, method='isotonic')

# Create ensemble
ensemble = model.create_ensemble(X, y, method='voting')

# Hyperparameter optimization
best_params = model.optimize_hyperparameters(X, y, algorithm='xgboost', n_trials=50)

# Predictions
probs = model.predict_proba(X_test)

# Save/load
model.save_model("model.pkl")
model.load_model("model.pkl")
```
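
Selecting the best model by AUC amounts to scoring a validation set with each candidate and keeping the highest AUC. The sketch below shows that selection step in plain Python. `auc` and `pick_best` are hypothetical helpers, not RiskX functions, and the pairwise AUC here is quadratic in the sample size, so it is suitable for illustration only.

```python
def auc(y_true, scores):
    """ROC AUC: probability that a random positive outranks a random negative (ties count 0.5)."""
    pos = [s for y, s in zip(y_true, scores) if y == 1]
    neg = [s for y, s in zip(y_true, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("AUC needs at least one positive and one negative label")
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def pick_best(y_true, candidate_scores):
    """Return (best_name, {name: auc}) for a dict of per-model validation scores."""
    results = {name: auc(y_true, scores) for name, scores in candidate_scores.items()}
    return max(results, key=results.get), results


y_val = [0, 0, 1, 1]
candidates = {
    "logistic": [0.1, 0.4, 0.35, 0.8],
    "rf": [0.1, 0.2, 0.3, 0.9],
}
best, results = pick_best(y_val, candidates)
print(best, results)  # rf {'logistic': 0.75, 'rf': 1.0}
```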

### ScoringEngine

**Production scoring:**

```python
scorer = ScoringEngine(model, score_min=300, score_max=850)

# Single prediction
result = scorer.score_single({'income': 50000, 'age': 30})
# Returns: {score, probability, rating, risk_level, reason_codes, timestamp}

# Batch scoring
df_scored = scorer.score_batch(df)

# Score interpretation
interpretation = scorer.interpret_score(720)
# Returns: {score, rating, risk_level, recommendation, approval_probability, suggested_interest_rate, percentile}

# Custom bins
scorer.set_custom_bins({'Excellent': (750, 850), 'Good': (650, 749)})

# API specification
api_spec = scorer.export_api_spec()

# Generate scorecard
scorecard = scorer.generate_scorecard(feature_weights)

# Simulate scores (for testing)
simulated = scorer.simulate_score_distribution(n_samples=10000)
```
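
How `ScoringEngine` converts a predicted probability into a score between `score_min` and `score_max` is not described in this README. One simple convention, shown here purely as an assumption, is a linear map in which a lower probability of the bad outcome gives a higher score. `probability_to_score` is a hypothetical helper, not part of the RiskX API.

```python
def probability_to_score(p_bad, score_min=300, score_max=850):
    """Linearly map the probability of the bad outcome to a score (higher = lower risk).

    Assumed convention for illustration; RiskX may use a different scaling.
    """
    if not 0.0 <= p_bad <= 1.0:
        raise ValueError("p_bad must be between 0 and 1")
    return round(score_min + (1.0 - p_bad) * (score_max - score_min))


print(probability_to_score(0.0))  # 850
print(probability_to_score(0.5))  # 575
print(probability_to_score(1.0))  # 300
```

Traditional scorecards often use a log-odds scaling ("points to double the odds") instead of a linear map, so check the actual scores against your own calibration data.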

---

## 🎓 Use Cases

### ✅ Credit Scoring
- Personal loan approvals
- Credit card applications
- Mortgage underwriting
- SME lending

### ✅ Fraud Detection
- Transaction fraud
- Identity fraud
- Account takeover detection
- Payment fraud

### ✅ Churn Prediction
- Customer retention
- Subscription cancellation risk
- Product abandonment
- Service discontinuation

### ✅ Risk Management
- Portfolio risk assessment
- Credit risk monitoring
- Operational risk scoring
- Compliance risk evaluation

---

## 🏗️ Architecture

```
RiskX Architecture
─────────────────

Data Sources → Data Connector → Data Cleaner → Feature Engine → AutoML → Scoring Engine → API/Batch Output
                    ↓               ↓              ↓             ↓           ↓
                  CSV/SQL       Profiling      WOE/IV      XGBoost    Real-time Score
                  Excel/API     Imputation   Behavioral   LightGBM     + Reason Codes
                  Parquet       Outliers      RFM         Ensemble     + Ratings
                  Cloud         Encoding      Time        Calibrated   + Risk Levels
```

---

## 📦 What's Included

### ✅ Core Modules (v0.1.0 - Production Ready)

1. **riskx.core.data_connector** - Multi-source data loading (8+ sources)
2. **riskx.core.data_cleaner** - Automated data cleaning (7 methods)
3. **riskx.core.feature_engineering** - Risk features (WOE/IV, RFM, behavioral)
4. **riskx.core.model_auto** - AutoML training (4 algorithms)
5. **riskx.core.scoring_engine** - Production scoring (real-time + batch)

### ⏳ Coming Soon

6. **riskx.core.monitoring** - PSI, CSI, drift detection
7. **riskx.core.explainability** - SHAP, LIME interpretability
8. **riskx.deployment** - Cloud deployment (Azure, AWS, GCP)
9. **riskx.pipelines** - End-to-end orchestration
10. **riskx.cli** - Command-line interface
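
Until the monitoring module ships, the Population Stability Index (PSI) can be computed by hand. PSI compares the share of records per bin between a baseline and a current sample: the sum over bins of (actual share minus expected share) times the log of their ratio. The sketch below is a standard-library illustration; `psi` is a hypothetical helper and says nothing about the API the future `riskx.core.monitoring` module will expose.

```python
import math


def psi(expected_counts, actual_counts, eps=1e-6):
    """Population Stability Index over matching bins of two samples.

    Hypothetical illustration; shares are floored at `eps` to avoid log(0).
    """
    if len(expected_counts) != len(actual_counts):
        raise ValueError("both samples must use the same bins")
    expected_total = sum(expected_counts)
    actual_total = sum(actual_counts)
    value = 0.0
    for expected, actual in zip(expected_counts, actual_counts):
        expected_share = max(expected / expected_total, eps)
        actual_share = max(actual / actual_total, eps)
        value += (actual_share - expected_share) * math.log(actual_share / expected_share)
    return value


baseline = [100, 300, 400, 200]   # made-up score-band counts at training time
current = [150, 300, 350, 200]    # made-up counts in production
print(f"PSI = {psi(baseline, current):.4f}")
```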

---

## 🔬 Technical Details

### Dependencies

**Core (required):**
- pandas >= 1.3.0
- numpy >= 1.21.0
- scikit-learn >= 1.0.0

**Optional (recommended):**
- xgboost >= 1.5.0
- lightgbm >= 3.3.0
- optuna >= 2.10.0
- shap >= 0.40.0
- sqlalchemy >= 1.4.0
- requests >= 2.26.0
- pyarrow >= 6.0.0
- openpyxl >= 3.0.0 (Excel support; installed by the `data` extra)

### Performance

- **Training:** Optimized with multi-threading (n_jobs=-1)
- **Scoring:** Real-time latency < 10ms
- **Batch Scoring:** 10,000+ records/second
- **Memory:** Efficient column-oriented storage
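
Latency and throughput depend on the model, the number of features and the hardware, so it is worth measuring them in your own environment. The harness below is a minimal sketch using only the standard library. `measure_latency` is a hypothetical helper, and the stand-in `fake_score` function should be replaced with a call such as `scorer.score_single`.

```python
import time
from statistics import median


def measure_latency(score_fn, payload, n_calls=1000):
    """Call score_fn(payload) repeatedly and return (median_ms, p95_ms)."""
    timings = []
    for _ in range(n_calls):
        start = time.perf_counter()
        score_fn(payload)
        timings.append((time.perf_counter() - start) * 1000.0)
    timings.sort()
    p95 = timings[min(len(timings) - 1, int(0.95 * len(timings)))]
    return median(timings), p95


def fake_score(features):
    """Stand-in for a real scoring call; replace with scorer.score_single."""
    return sum(features.values())


median_ms, p95_ms = measure_latency(fake_score, {"income": 75000, "age": 35})
print(f"median: {median_ms:.4f} ms, p95: {p95_ms:.4f} ms")
```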

---

## 💼 Production Deployment

```python
# Save trained model
model.save_model("production_model.pkl")

# Load in production
from riskx import RiskAutoModel, ScoringEngine

model = RiskAutoModel()
model.load_model("production_model.pkl")

scorer = ScoringEngine(model)

# API endpoint example (FastAPI is not a RiskX dependency; install it separately)
from fastapi import FastAPI
app = FastAPI()

@app.post("/score")
def score_application(features: dict):
    result = scorer.score_single(features)
    return result
```

---

## 📜 License

MIT License - see LICENSE file for details

---

## 👨‍💻 Author

**Idriss Bado**  
Email: idrissbadoolivier@gmail.com  
GitHub: [@idrissbado](https://github.com/idrissbado)

---

## 🙏 Acknowledgments

Built with ❤️ for the risk modeling and financial ML community.

---

## 📞 Support

- **Documentation:** [GitHub README](https://github.com/idrissbado/RiskX/blob/main/README.md)
- **Issues:** [GitHub Issues](https://github.com/idrissbado/RiskX/issues)
- **PyPI:** [https://pypi.org/project/riskx/](https://pypi.org/project/riskx/)

---

**Ready to revolutionize your risk scoring? Install RiskX today!**

```bash
pip install "riskx[full]"
```
