DistributionRegressor

Nonparametric distributional regression using LightGBM. Predict full probability distributions p(y|x) - not just point estimates.

$ pip install distribution-regressor

Why predict distributions?

Standard regression gives a single number. But real-world outcomes are uncertain, often skewed, and sometimes multimodal. Full distributional predictions unlock richer decision-making.

Uncertainty Quantification

Know how confident a prediction is. Get calibrated prediction intervals, not just point estimates.

Arbitrary Distribution Shapes

Capture multimodal, skewed, heavy-tailed, or zero-inflated distributions - no parametric assumptions required.

Downstream Decisions

Optimize expected utility, compute Value-at-Risk, or feed full densities into downstream systems.

Point PredictionProbabilistic Prediction
Retail demand "Sell 142 units" "80% chance between 120-170, slight chance of spike above 200"
House price "$450,000" "Most likely $430k-$480k, small secondary mode near $520k"
Patient outcome "12 months" "Bimodal: quick recovery by 6 mo or extended course around 18 mo"

How it works

DistributionRegressor uses a soft-target approach that turns distributional regression into a single gradient boosted model trained with cross-entropy loss.

1. Grid
Discretize y-range
into n_bins points
2. Soft Targets
Gaussian kernel
around true yi
3. Expand
Cross (xi, gj)
pairs + targets
4. Train
Single LightGBM
cross-entropy loss

At inference, for a new input x, we query the model at every grid point and normalize the outputs to obtain a proper probability distribution.

Comparison with NGBoost: NGBoost uses natural gradient boosting to learn parameters of a parametric distribution (e.g., Normal, LogNormal). DistributionRegressor is fully nonparametric - it learns the distribution shape directly, making it better suited for complex real-world distributions. It also trains ~100x faster since it uses a single LightGBM model rather than sequential boosting with Hessian computations.

Installation

$pip install distribution-regressor

Requires Python ≥ 3.8

From source

# Clone and install in development mode
git clone https://github.com/guyko81/DistributionRegressor.git
cd DistributionRegressor
pip install -e .

Dependencies

All dependencies are installed automatically via pip.

LightGBM
≥ 3.0.0
NumPy
≥ 1.20.0
Pandas
≥ 1.3.0
scikit-learn
≥ 1.0.0
SciPy
≥ 1.7.0
joblib
≥ 1.0.0

Verify installation

from distribution_regressor import DistributionRegressor
print("Ready!")

Quick Start

Install and start predicting distributions in under 10 lines of code.

from distribution_regressor import DistributionRegressor

# Initialize
model = DistributionRegressor(
    n_bins=50,           # grid resolution
    n_estimators=500,    # LightGBM trees
    sigma='auto',         # automatic kernel width
)

# Train
model.fit(X_train, y_train)

# Point predictions
y_mean = model.predict(X_test)           # expected value
y_mode = model.predict_mode(X_test)      # most likely value

# Uncertainty
intervals = model.predict_interval(X_test, confidence=0.9)
std = model.predict_std(X_test)

# Full distribution
grids, dists, offsets = model.predict_distribution(X_test)

# Evaluate density at specific points
density = model.pdf(X_test, y_test)
nll = model.nll(X_test, y_test)

Visualization

import matplotlib.pyplot as plt

# Predict distribution for a single sample
grids, dists, offsets = model.predict_distribution(X_test[0:1])

plt.plot(grids[0], dists[0], label='Predicted PDF')
plt.axvline(y_test[0], color='r', linestyle='--', label='True Value')
plt.legend()
plt.show()

Key Parameters

ParameterDefaultDescription
n_bins50 Grid resolution. Higher = finer distributions but more compute. 50-200 is typical.
sigma'auto' Kernel bandwidth for soft targets. 'auto' estimates per-sample noise from CV residuals. Float for fixed width.
use_base_modelFalse If True, fits an LGBM baseline and models residuals. Can improve RMSE at the cost of distribution shape fidelity.
monte_carlo_trainingTrue Sample grid points instead of full expansion. Reduces memory ~20x with minimal quality loss.
mc_samples5 Number of grid points per sample in MC mode. 5-10 is usually enough.
n_estimators100 Number of LightGBM boosting iterations.
learning_rate0.1 Boosting learning rate.
**kwargs- Any LightGBM parameter: max_depth, num_leaves, subsample, colsample_bytree, etc.

API Reference

Scikit-learn compatible interface with rich distributional methods.

Constructor

DistributionRegressor(
    n_bins=50,                # grid resolution (higher = finer distributions)
    sigma='auto',              # kernel bandwidth: 'auto' or float
    use_base_model=False,     # anchor to LGBM point predictions
    monte_carlo_training=True,# memory-efficient MC sampling
    mc_samples=5,             # MC points per sample
    n_estimators=100,         # LightGBM trees
    learning_rate=0.1,        # boosting learning rate
    **kwargs                  # passed to LGBMRegressor
)

Prediction Methods

model.predict(X) / model.predict_mean(X)
Expected value E[Y|X]. When use_base_model=True, returns the base LGBM prediction directly.
model.predict_mode(X)
Most likely value (peak of the predicted density).
model.predict_quantile(X, q=0.5)
Arbitrary quantile(s). Pass a list for multiple quantiles at once.
model.predict_interval(X, confidence=0.95)
Symmetric prediction interval [lower, upper].
model.predict_std(X)
Standard deviation of the predicted distribution.
model.predict_distribution(X)
Full distribution: returns (grids, distributions, base_offsets).

Density Evaluation

model.pdf(X, y)
Evaluate predicted density f(y|X) at specific y values.
model.logpdf(X, y)
Log-density log f(y|X).
model.cdf(X, y)
Cumulative distribution P(Y ≤ y | X).
model.ppf(X, q)
Inverse CDF / percent point function.
model.nll(X, y)
Mean negative log-likelihood - a proper scoring rule for density evaluation.

Citation

@software{distributionregressor2025,
  title  = {DistributionRegressor: Nonparametric Distributional Regression},
  author = {Gabor Gulyas},
  year   = {2025},
  url    = {https://github.com/guyko81/DistributionRegressor}
}

Rossmann Store Sales

Predicting daily sales for a retail store. The model captures day-of-week patterns, promotional effects, and holiday uncertainty - including the hard zero on Sundays.

While the default is use_base_model=False, this is a case where enabling the base model is beneficial. Retail sales for a single store are unimodal and well-behaved - the base model anchors point predictions tightly, and the residual distribution captures the remaining uncertainty with narrow, well-calibrated intervals.

from distribution_regressor import DistributionRegressor

model = DistributionRegressor(
    n_bins=100, n_estimators=1000, learning_rate=0.01,
    sigma='auto', random_state=42,
    use_base_model=True,  # beneficial here: unimodal, narrow uncertainty
)
model.fit(train_data[features], train_data['Sales'])

# Prediction intervals for time series visualization
p10 = model.predict_quantile(X, 0.1)
p50 = model.predict_quantile(X, 0.5)
p90 = model.predict_quantile(X, 0.9)

Time series with uncertainty bands

Rossmann sales time series with uncertainty bands
Predicted sales distribution for the test period (last 5 weeks). Blue = actual sales, red bands = 10th-90th and 25th-75th percentile intervals. The base model anchors predictions tightly, producing narrow uncertainty bands. Sundays (dips to zero) are captured cleanly.

Individual distribution shapes

With use_base_model=True, the per-sample adaptive grid centers on the base prediction, giving high resolution where it matters. The closed-day prediction collapses to a tight spike at zero:

Individual predicted densities for Rossmann samples
Predicted p(Sales|x) for selected test days. The closed day shows a sharp spike at zero. Promo days have slightly wider spreads, while regular weekdays are tighter.

IPUMS Income Distribution

Predicting the full income distribution conditional on demographic features (26 variables, 113k samples). Compared against NGBoost on test-set negative log-likelihood.

Benchmark results

MethodTest NLLWeighted NLLTraining Time
DistributionRegressor 11.30 11.32 5 s
NGBoost (LogNormal) 11.11 11.15 363 s
Key takeaway: DistributionRegressor achieves competitive NLL within 0.2 nats of NGBoost, while training ~70x faster and without requiring any distributional assumptions. NGBoost wins on NLL here because income is approximately log-normal and it uses the matching LogNormal parametric family - a strong inductive bias when it happens to match the true distribution.

The limitation of parametric models

However, NGBoost's strength is also its weakness. With a LogNormal distribution, every prediction is forced into a unimodal, right-skewed shape. When the true conditional distribution is multimodal or has a complex shape, the parametric assumption breaks down.

NGBoost can use more complex parametric families to mitigate this, but the user must choose the right family in advance. DistributionRegressor sidesteps this entirely - it learns whatever shape the data requires.

DistributionRegressor vs NGBoost on IPUMS data
DistributionRegressor (blue) captures multimodal and complex conditional densities that NGBoost's LogNormal (green dashed) cannot represent. The parametric model is forced into a single-mode shape regardless of the true data distribution.

With and without base model

The use_base_model option controls whether DistributionRegressor learns distributions directly on the target, or on residuals from a preliminary LightGBM fit.

use_base_model=False (default)

The soft-target model learns p(y|x) directly on the raw target variable.

  • Simpler - single model, no residual computation
  • Preserves multimodal structure in the data
  • Cleaner theoretical foundation
  • Recommended for most use cases

use_base_model=True

First fits a standard LightGBM for point predictions, then models the residual distribution with soft targets.

  • Can improve point prediction (lower RMSE)
  • Per-sample adaptive grid centered on base prediction
  • Higher resolution in the likely region
  • Useful when RMSE performance is critical
# Without baseline (default) - learn full distribution directly
model_direct = DistributionRegressor(use_base_model=False)
model_direct.fit(X_train, y_train)

# With baseline - learn residual distribution around point prediction
model_baseline = DistributionRegressor(use_base_model=True)
model_baseline.fit(X_train, y_train)

Calibration and RMSE

Calibration comparison with and without base model
Calibration plot and RMSE comparison on Rossmann test set. The baseline model offers better RMSE but both modes show comparable calibration.

Trade-off: RMSE vs. distribution shape

While use_base_model=True often delivers better RMSE and can be better calibrated, it models the residual distribution. This means the model learns a narrow, roughly symmetric distribution centered near zero - which can obscure multimodal structure present in the raw target. In cases where the conditional distribution has multiple modes, the direct model (use_base_model=False) preserves this structure.

Baseline mode losing multimodal structure
The direct model (blue) captures complex distribution shapes on IPUMS income data, while the baseline model (orange) collapses to a narrow unimodal residual distribution, missing the true structure.

Monte Carlo training

Full grid expansion creates n_samples × n_bins rows. Monte Carlo training samples a small number of grid points per sample, dramatically reducing memory usage.

monte_carlo_training=False

Each sample is crossed with all grid points. Training set size = n × n_bins.

100k samples × 100 bins = 10M rows

monte_carlo_training=True (default)

Each sample gets mc_samples strategically chosen points (exact peak + Gaussian slope + uniform tails).

100k samples × 5 MC points = 500k rows

In practice, Monte Carlo training produces nearly identical distributions with a fraction of the memory. The importance-weighted sampling scheme ensures the peak region and tails are both well-represented. For most datasets, the default mc_samples=5 is sufficient.
# Full grid - higher memory, marginally more stable
model = DistributionRegressor(monte_carlo_training=False, n_bins=100)

# Monte Carlo (default) - much less memory, nearly identical quality
model = DistributionRegressor(monte_carlo_training=True, mc_samples=5, n_bins=100)

Side-by-side comparison

Monte Carlo vs full grid comparison
Predicted densities using full grid expansion vs. Monte Carlo with K=5 on Rossmann test samples. The distributions are nearly indistinguishable.