Metadata-Version: 2.4
Name: tse_ba_comp
Version: 0.1.2
Summary: A library for estimating Biological Age using classical and ML methods
Author: Mehrdad S. Beni & Gary Tse
Requires-Python: >=3.7
Description-Content-Type: text/markdown
License-File: LICENSE
Requires-Dist: numpy
Requires-Dist: pandas
Requires-Dist: scikit-learn
Requires-Dist: xgboost
Requires-Dist: matplotlib
Requires-Dist: scipy
Dynamic: author
Dynamic: description
Dynamic: description-content-type
Dynamic: license-file
Dynamic: requires-dist
Dynamic: requires-python
Dynamic: summary

# tse_ba_comp

[![PyPI version](https://badge.fury.io/py/tse-ba-comp.svg)](https://badge.fury.io/py/tse-ba-comp)
[![Python 3.7+](https://img.shields.io/badge/python-3.7+-blue.svg)](https://www.python.org/downloads/)

**tse_ba_comp** (Tse Biological Age Comparator, named after Prof. Gary Tse's Research Group) is a robust, easy-to-use Python library for estimating Biological Age (BA) from clinical biomarkers.

Developed by **Mehrdad S. Beni** & **Gary Tse**, this package compares classical mathematical approaches with modern Machine Learning models, and can ensemble their predictions, to provide cross-validated age estimates that are evaluated on a held-out test set.

### Key Features
* **Classical Models:** Fast, vectorized implementations of the Klemera-Doubal Method (KDM) and PCA-Dubina.
* **Machine Learning:** Pipelines for Elastic Net, Random Forest, and XGBoost.
* **Smart Ensembling:** Automatically combine predictions using Mean or Median strategies to smooth out variance.
* **Automated Preprocessing:** Handles train/test splitting, scaling, and missing data imputation safely to prevent data leakage.
* **Built-in Visualization:** Generates standardized, publication-ready scatter plots of Biological Age vs. Chronological Age.

---

## Installation

Install directly from PyPI:

```bash
pip install tse-ba-comp
```

---

## Quick Start

The easiest way to use the library is to pass the path of a CSV dataset directly to the `run_pipeline` function.

```python
import tse_ba_comp

# define your biomarkers
my_biomarkers = ["albumin", "alp", "bun", "creat", "hba1c", "glucose", "sbp"]

# run tse_ba_comp pipeline
results = tse_ba_comp.run_pipeline(
    data="nhanes4_model_input.csv",
    age_col="age",
    biomarkers=my_biomarkers,
    out_dir="my_results_folder"  # Automatically saves plots and CSVs here
)

# view the evaluation metrics
print(results["metrics"])
```

---

## Advanced Control & Hyperparameters

For researchers and data scientists who need programmatic control, `tse_ba_comp` allows you to construct a configuration dictionary (`ml_params`). You can toggle specific models, set fixed parameters, or trigger an automated Grid Search over custom hyperparameter ranges.

```python
import tse_ba_comp

my_biomarkers = [
    "albumin", "alp", "bun", "creat", "hba1c", "lncrp",
    "lymph", "mcv", "glucose", "rdw", "totchol", "wbc", "sbp"
]

# configure machine learning parameters
ml_settings = {
    "grid_search": True,
    "cv_folds": 5,
    "elastic_net": {
        "run": True,
        "param_grid": {"alpha": [0.1, 1.0], "l1_ratio": [0.1, 0.5, 0.9]}
    },
    "random_forest": {
        "run": True,
        "param_grid": {"n_estimators": [100, 200], "max_depth": [None, 10]}
    },
    "xgboost": {
        "run": False  # skip XGBoost entirely
    }
}

# run the customized pipeline
results = tse_ba_comp.run_pipeline(
    data="nhanes4_model_input.csv",
    age_col="age",
    biomarkers=my_biomarkers,
    imputation_method="knn",       # switch imputation to knn
    test_size=0.3,                 # 30% of data for testing
    random_state=101,              # fix random seed for reproducibility
    run_pca_model=False,           # turn off PCA-Dubina model
    kdm_s2_floor=0.05,             # tweak KDM variance floor
    ml_params=ml_settings,         # apply custom ML settings
    ensemble_method="mean",        # use arithmetic mean for ensemble
    out_dir="advanced_results"
)

print(results["metrics"])
```
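
To get a feel for what `grid_search` costs, the sketch below expands a `param_grid` into its individual candidates using only the standard library. It assumes an exhaustive search in which every combination is fitted once per cross-validation fold, in the style of scikit-learn's `GridSearchCV`; this README does not state which search the library runs internally, and `expand_grid` is a hypothetical helper, not part of `tse_ba_comp`.

```python
from itertools import product


def expand_grid(param_grid):
    """Return every parameter combination in a param_grid as a list of dicts."""
    names = list(param_grid)
    return [dict(zip(names, values)) for values in product(*param_grid.values())]


# Grids copied from the ml_settings example above.
elastic_net_grid = {"alpha": [0.1, 1.0], "l1_ratio": [0.1, 0.5, 0.9]}
random_forest_grid = {"n_estimators": [100, 200], "max_depth": [None, 10]}
cv_folds = 5

for name, grid in [("elastic_net", elastic_net_grid),
                   ("random_forest", random_forest_grid)]:
    candidates = expand_grid(grid)
    # Assumption: an exhaustive search fits each candidate once per fold.
    n_fits = len(candidates) * cv_folds
    print(f"{name}: {len(candidates)} candidates, {n_fits} cross-validation fits")
```

Under that assumption, the number of fits grows multiplicatively with each added hyperparameter value, so it is worth keeping the ranges short for slower models such as Random Forest.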

---

## Complete API Reference

Below is the complete list of arguments accepted by the `run_pipeline` function.

### Core Data Settings
* `data` *(str or pandas.DataFrame)*: Path to your CSV file, or a loaded Pandas DataFrame.
* `biomarkers` *(list of str)*: List of column names representing the biomarkers to be used.
* `age_col` *(str)*: The column name containing chronological age. Default: `"age"`.

### Processing & Splitting
* `imputation_method` *(str)*: How to handle missing data. Options: `"median"`, `"mean"`, `"zero"`, `"knn"`. Default: `"median"`.
* `test_size` *(float)*: The fraction of the dataset to hold out for testing and evaluation. Default: `0.2`.
* `random_state` *(int)*: Random seed to ensure reproducible train/test splits. Default: `42`.
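
The settings above can be illustrated with a minimal, standard-library sketch of a seeded hold-out split followed by leakage-safe median imputation, where the fill value is computed from the training rows only. This is not the library's implementation: `split_rows` and `impute_median` are hypothetical helpers, and the toy rows are made up for illustration.

```python
import random
from statistics import median


def split_rows(rows, test_size=0.2, random_state=42):
    """Shuffle rows with a fixed seed and hold out a fraction for testing."""
    shuffled = list(rows)
    random.Random(random_state).shuffle(shuffled)
    n_test = round(len(shuffled) * test_size)
    return shuffled[n_test:], shuffled[:n_test]


def impute_median(train, test, column):
    """Fill missing values (None) using the median of the training rows only."""
    fill = median(r[column] for r in train if r[column] is not None)

    def filled(rows):
        return [{**r, column: fill if r[column] is None else r[column]} for r in rows]

    return filled(train), filled(test)


# Toy data: ten rows, three of them with a missing "albumin" value.
rows = [{"age": 40 + i, "albumin": None if i % 4 == 0 else 4.0 + 0.1 * i}
        for i in range(10)]

train, test = split_rows(rows, test_size=0.2, random_state=42)
train, test = impute_median(train, test, "albumin")
print(len(train), "training rows,", len(test), "test rows")
```

Because the median never sees the test rows, the held-out evaluation is not influenced by information from the data it is meant to assess.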

### Classical Model Toggles
* `run_kdm_model` *(bool)*: Toggle the Klemera-Doubal Method. Default: `True`.
* `kdm_s2_floor` *(float)*: Minimum variance floor for KDM calculations to prevent division by near-zero. Default: `0.1`.
* `run_pca_model` *(bool)*: Toggle the PCA-Dubina method. Default: `True`.
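
To show why a variance floor matters, here is a minimal sketch of the basic Klemera-Doubal estimate (the form without the chronological-age term), in which each biomarker is weighted by the inverse of its residual variance. How `tse_ba_comp` applies `kdm_s2_floor` internally is an assumption here; the function name and the toy regression parameters are hypothetical.

```python
def kdm_biological_age(values, slopes, intercepts, residual_vars, s2_floor=0.1):
    """Estimate biological age from biomarkers with the basic KDM formula.

    Each biomarker j is assumed to follow x_j = slopes[j] * age + intercepts[j]
    with residual variance residual_vars[j].
    """
    numerator = 0.0
    denominator = 0.0
    for x, k, q, s2 in zip(values, slopes, intercepts, residual_vars):
        s2 = max(s2, s2_floor)  # assumption: the floor clamps each residual variance
        numerator += (x - q) * k / s2
        denominator += k * k / s2
    return numerator / denominator


# Two hypothetical biomarkers for a person whose values both match age 50.
slopes = [2.0, -0.5]
intercepts = [10.0, 100.0]
values = [2.0 * 50 + 10.0, -0.5 * 50 + 100.0]
residual_vars = [4.0, 1e-9]  # the second variance is close to zero

print(kdm_biological_age(values, slopes, intercepts, residual_vars, s2_floor=0.05))
```

Without the clamp, a biomarker with a near-zero residual variance would receive an enormous weight and effectively decide the estimate on its own; a larger floor spreads the weight more evenly across biomarkers.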

### Machine Learning Settings
* `run_ml_models_flag` *(bool)*: Toggle all Machine Learning models. Default: `True`.
* `cv_folds` *(int)*: Number of cross-validation folds used during training. Default: `5`.
* `ml_params` *(dict)*: A nested dictionary to configure specific ML models. If `None`, fast default settings are used. 

### Ensemble & Outputs
* `ensemble_method` *(str or None)*: How to combine the model predictions. Options: `"median"`, `"mean"`, or `None` (to skip ensemble). Default: `"median"`.
* `out_dir` *(str or None)*: Directory path to save the generated scatter plots and prediction CSVs. If `None`, no files are saved to disk.
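
The difference between the two `ensemble_method` options can be sketched with the standard library: predictions from several models are combined row by row with either the mean or the median. The `ensemble` helper, the model names used as keys, and the numbers are hypothetical and only illustrate the idea.

```python
from statistics import mean, median


def ensemble(predictions, method="median"):
    """Combine per-model predictions row by row.

    predictions maps a model name to a list of predicted ages, one per person.
    """
    if method is None:
        return None  # mirrors the option to skip the ensemble
    combine = {"mean": mean, "median": median}[method]
    return [combine(row) for row in zip(*predictions.values())]


# Made-up predictions for two people from three models.
predictions = {
    "kdm": [52.0, 61.0],
    "elastic_net": [50.0, 63.0],
    "random_forest": [60.0, 62.0],
}

print(ensemble(predictions, "median"))
print(ensemble(predictions, "mean"))
```

For the first person, one model disagrees with the other two; the median stays with the majority while the mean is pulled toward the outlying prediction, which is the usual reason to prefer a median ensemble when individual models occasionally misfire.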

---

## Outputs

The `run_pipeline` function returns a dictionary with two keys:

1. `results["metrics"]`: A Pandas DataFrame containing the Pearson *r*, R², RMSE, and MAE for all executed models evaluated strictly on the test set.
2. `results["predictions"]`: A Pandas DataFrame mapping the Chronological Age to the estimated Biological Ages for every patient in the test set.
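
For readers who want to see what the four reported metrics measure, the sketch below computes them with the standard library from chronological and estimated ages. It assumes R² is the coefficient of determination (1 minus residual over total sum of squares); the README does not spell out the definition the library uses, and `evaluate` and the sample ages are hypothetical.

```python
import math


def evaluate(y_true, y_pred):
    """Return Pearson r, R^2, RMSE, and MAE for paired true and predicted values."""
    n = len(y_true)
    mean_true = sum(y_true) / n
    mean_pred = sum(y_pred) / n

    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    ss_pred = sum((p - mean_pred) ** 2 for p in y_pred)
    cov = sum((t - mean_true) * (p - mean_pred) for t, p in zip(y_true, y_pred))

    return {
        "pearson_r": cov / math.sqrt(ss_tot * ss_pred),
        "r2": 1 - ss_res / ss_tot,  # assumption: coefficient of determination
        "rmse": math.sqrt(ss_res / n),
        "mae": sum(abs(t - p) for t, p in zip(y_true, y_pred)) / n,
    }


# Made-up chronological ages and biological age estimates.
chronological = [40, 50, 60, 70]
estimated = [42, 49, 63, 68]

for name, value in evaluate(chronological, estimated).items():
    print(f"{name}: {value:.3f}")
```

Pearson *r* only captures how linearly the estimates track chronological age, so a model can score a high *r* while still being biased; RMSE and MAE, expressed in years, show the size of the errors.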

## Developers

Developed by Dr. Mehrdad S. Beni and Prof. Gary Tse at Hong Kong Metropolitan University, 2026.
