Metadata-Version: 2.4
Name: spark-bestfit
Version: 0.7.2
Summary: Modern Spark distribution fitting library with efficient parallel processing
Project-URL: Documentation, https://spark-bestfit.readthedocs.io/en/latest/
Project-URL: Homepage, https://github.com/dwsmith1983/spark-bestfit
Project-URL: Repository, https://github.com/dwsmith1983/spark-bestfit
Project-URL: Issues, https://github.com/dwsmith1983/spark-bestfit/issues
Author-email: Dustin Smith <dustin.william.smith@gmail.com>
License: MIT
License-File: LICENSE
Keywords: distribution,fitting,pyspark,scipy,spark,statistics
Classifier: Development Status :: 5 - Production/Stable
Classifier: Intended Audience :: Developers
Classifier: Intended Audience :: Science/Research
Classifier: License :: OSI Approved :: MIT License
Classifier: Programming Language :: Python :: 3
Classifier: Programming Language :: Python :: 3.11
Classifier: Programming Language :: Python :: 3.12
Classifier: Programming Language :: Python :: 3.13
Classifier: Topic :: Scientific/Engineering
Classifier: Topic :: Software Development :: Libraries :: Python Modules
Requires-Python: <3.14,>=3.11
Requires-Dist: matplotlib<4.0.0,>=3.7.0
Requires-Dist: numpy<3.0.0,>=1.24.0
Requires-Dist: pandas<3.0.0,>=1.5.0
Requires-Dist: scipy<2.0.0,>=1.11.0
Provides-Extra: dev
Requires-Dist: black>=24.10.0; extra == 'dev'
Requires-Dist: isort>=5.13.0; extra == 'dev'
Requires-Dist: mypy>=1.13.0; extra == 'dev'
Requires-Dist: pandas-stubs>=2.2.0; extra == 'dev'
Requires-Dist: pre-commit>=4.5.0; extra == 'dev'
Requires-Dist: pyarrow<19.0.0,>=12.0.0; extra == 'dev'
Requires-Dist: pyspark<5.0.0,>=3.5.0; extra == 'dev'
Requires-Dist: pytest-cov>=6.0.0; extra == 'dev'
Requires-Dist: pytest-spark>=0.6.0; extra == 'dev'
Requires-Dist: pytest>=8.3.0; extra == 'dev'
Requires-Dist: ruff>=0.8.0; extra == 'dev'
Requires-Dist: scipy-stubs>=1.14.0; extra == 'dev'
Provides-Extra: docs
Requires-Dist: pyarrow<19.0.0,>=12.0.0; extra == 'docs'
Requires-Dist: pyspark<5.0.0,>=3.5.0; extra == 'docs'
Requires-Dist: sphinx-rtd-theme>=3.0.0; extra == 'docs'
Requires-Dist: sphinx>=8.0.0; extra == 'docs'
Provides-Extra: spark
Requires-Dist: pyarrow<19.0.0,>=12.0.0; extra == 'spark'
Requires-Dist: pyspark<5.0.0,>=3.5.0; extra == 'spark'
Provides-Extra: test
Requires-Dist: pyarrow<19.0.0,>=12.0.0; extra == 'test'
Requires-Dist: pyspark<5.0.0,>=3.5.0; extra == 'test'
Requires-Dist: pytest-cov>=6.0.0; extra == 'test'
Requires-Dist: pytest-spark>=0.6.0; extra == 'test'
Requires-Dist: pytest>=8.3.0; extra == 'test'
Provides-Extra: test-base
Requires-Dist: pytest-cov>=6.0.0; extra == 'test-base'
Requires-Dist: pytest-spark>=0.6.0; extra == 'test-base'
Requires-Dist: pytest>=8.3.0; extra == 'test-base'
Description-Content-Type: text/markdown

# spark-bestfit

[![CI](https://github.com/dwsmith1983/spark-bestfit/actions/workflows/ci.yml/badge.svg)](https://github.com/dwsmith1983/spark-bestfit/actions/workflows/ci.yml)
[![PyPI version](https://img.shields.io/pypi/v/spark-bestfit)](https://pypi.org/project/spark-bestfit/)
[![Documentation Status](https://readthedocs.org/projects/spark-bestfit/badge/?version=latest)](https://spark-bestfit.readthedocs.io/en/latest/)
[![License: MIT](https://img.shields.io/badge/License-MIT-yellow.svg)](https://opensource.org/licenses/MIT)
[![Code style: black](https://img.shields.io/badge/code%20style-black-000000.svg)](https://github.com/psf/black)
[![Ruff](https://img.shields.io/endpoint?url=https://raw.githubusercontent.com/astral-sh/ruff/main/assets/badge/v2.json)](https://github.com/astral-sh/ruff)

**Modern Spark distribution fitting library with efficient parallel processing**

Efficiently fit ~100 scipy.stats distributions to your data using Spark's parallel processing with optimized Pandas UDFs and broadcast variables.

## Features

- **Parallel Processing**: Fits distributions in parallel using Spark
- **~100 Continuous Distributions**: Access to nearly all scipy.stats continuous distributions
- **16 Discrete Distributions**: Fit count data with Poisson, negative binomial, geometric, and more
- **Histogram-Based Fitting**: Efficient fitting using histogram representation
- **Multiple Metrics**: Compare fits using K-S statistic, A-D statistic, SSE, AIC, and BIC
- **Statistical Validation**: Kolmogorov-Smirnov and Anderson-Darling tests for goodness-of-fit
- **Confidence Intervals**: Bootstrap confidence intervals for fitted parameters
- **Results API**: Filter, sort, and export results easily
- **Visualization**: Built-in plotting for distribution comparison, Q-Q plots and P-P plots
- **Flexible Configuration**: Customize bins, sampling, and distribution selection

## Installation

```bash
pip install spark-bestfit
```

This installs spark-bestfit without PySpark. You are responsible for providing a compatible Spark environment (see Compatibility Matrix below).

**With PySpark included** (for users without a managed Spark environment):

```bash
pip install spark-bestfit[spark]
```

## Quick Start

```python
from spark_bestfit import DistributionFitter
import numpy as np
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Generate sample data
data = np.random.normal(loc=50, scale=10, size=10_000)

# Create fitter
fitter = DistributionFitter(spark)
df = spark.createDataFrame([(float(x),) for x in data], ["value"])

# Fit distributions
results = fitter.fit(df, column="value")

# Get best fit (by K-S statistic, the default)
best = results.best(n=1)[0]
print(f"Best: {best.distribution} (KS={best.ks_statistic:.4f}, p={best.pvalue:.4f})")

# Plot
fitter.plot(best, df, "value", title="Best Fit Distribution")
```

## Compatibility Matrix

| Spark Version | Python Versions | NumPy | Pandas | PyArrow |
|---------------|-----------------|-------|--------|---------|
| **3.5.x** | 3.11, 3.12 | 1.24+ (< 2.0) | 1.5+ | 12.0 - 16.x |
| **4.x** | 3.12, 3.13 | 2.0+ | 2.2+ | 17.0+ |

> **Note**: Spark 3.5.x does not support NumPy 2.0. If using Spark 3.5 with Python 3.12, ensure `setuptools` is installed (provides `distutils`).

## API Overview

### Fitting Distributions

```python
from spark_bestfit import DistributionFitter

fitter = DistributionFitter(spark, random_seed=123)
results = fitter.fit(
    df,
    column="value",
    bins=100,                    # Number of histogram bins
    support_at_zero=True,        # Only fit non-negative distributions
    enable_sampling=True,        # Enable adaptive sampling
    sample_fraction=0.3,         # Sample 30% of data
    max_distributions=50,        # Limit distributions to fit
)
```

### Working with Results

```python
# Get top 5 distributions (by K-S statistic, the default)
top_5 = results.best(n=5)

# Get best by other metrics
best_sse = results.best(n=1, metric="sse")[0]
best_aic = results.best(n=1, metric="aic")[0]
best_ad = results.best(n=1, metric="ad_statistic")[0]

# Filter by goodness-of-fit
good_fits = results.filter(ks_threshold=0.05)        # K-S statistic < 0.05
significant = results.filter(pvalue_threshold=0.05)  # p-value > 0.05
good_ad = results.filter(ad_threshold=1.0)           # A-D statistic < 1.0

# Convert to pandas for analysis
df_pandas = results.df.toPandas()

# Use fitted distribution
samples = best.sample(size=10000)  # Generate samples
pdf_values = best.pdf(x_array)     # Evaluate PDF
cdf_values = best.cdf(x_array)     # Evaluate CDF

# Access all goodness-of-fit metrics
print(f"K-S: {best.ks_statistic}, p-value: {best.pvalue}")
print(f"A-D: {best.ad_statistic}, A-D p-value: {best.ad_pvalue}")
```

> **Note**: Anderson-Darling p-values are only available for 5 distributions (norm, expon,
> logistic, gumbel_r, gumbel_l) where scipy has critical value tables. For other distributions,
> `ad_pvalue` will be `None` but `ad_statistic` is still valid for ranking fits.

### Parameter Confidence Intervals

```python
# Compute 95% bootstrap confidence intervals
ci = best.confidence_intervals(df, column="value", alpha=0.05, n_bootstrap=1000, random_seed=42)

# Display with parameter names
print(f"Distribution: {best.distribution}")
for param, (lower, upper) in ci.items():
    print(f"  {param}: [{lower:.4f}, {upper:.4f}]")
```

### Custom Plotting

```python
fitter.plot(
    best,
    df,
    "value",
    figsize=(16, 10),
    dpi=300,
    histogram_alpha=0.6,
    pdf_linewidth=3,
    title="Distribution Fit",
    xlabel="Value",
    ylabel="Density",
    save_path="output/distribution.png",
)
```

### Q-Q Plots

```python
# Create Q-Q plot for goodness-of-fit assessment
fitter.plot_qq(
    best,
    df,
    "value",
    max_points=1000,           # Sample size for plotting
    title="Q-Q Plot",
    save_path="output/qq_plot.png",
)
```

### P-P Plots

```python
# Create P-P plot for goodness-of-fit assessment
fitter.plot_pp(
    best,
    df,
    "value",
    max_points=1000,           # Sample size for plotting
    title="P-P Plot",
    save_path="output/pp_plot.png",
)
```

### Discrete Distributions

For count data (integers), use `DiscreteDistributionFitter`:

```python
from spark_bestfit import DiscreteDistributionFitter
import numpy as np

# Generate count data
data = np.random.poisson(lam=7, size=10_000)
df = spark.createDataFrame([(int(x),) for x in data], ["counts"])

# Fit discrete distributions
fitter = DiscreteDistributionFitter(spark)
results = fitter.fit(df, column="counts")

# Get best fit - use AIC for model selection (recommended for discrete)
best = results.best(n=1, metric="aic")[0]
print(f"Best: {best.distribution} (AIC={best.aic:.2f})")

# Plot fitted PMF
fitter.plot(best, df, "counts", title="Best Discrete Fit")
```

**Metric Selection for Discrete Distributions:**

| Metric | Use Case |
|--------|----------|
| `aic` | **Recommended** - Proper model selection criterion with complexity penalty |
| `bic` | Similar to AIC but stronger penalty for complex models |
| `ks_statistic` | Valid for ranking fits, but p-values are not reliable for discrete data |
| `ad_statistic` | Valid for ranking fits (not computed for discrete distributions) |
| `sse` | Simple comparison metric |

> **Note**: The K-S and A-D tests assume continuous distributions. For discrete data, the K-S
> statistic can still rank fits, but p-values are conservative and should not be used for
> hypothesis testing. A-D statistics are not computed for discrete distributions.
> Use AIC/BIC for proper model selection.

### Excluding Distributions

```python
from spark_bestfit import DistributionFitter, DEFAULT_EXCLUDED_DISTRIBUTIONS

# View default exclusions
print(DEFAULT_EXCLUDED_DISTRIBUTIONS)

# Include a specific distribution by removing it from exclusions
exclusions = tuple(d for d in DEFAULT_EXCLUDED_DISTRIBUTIONS if d != "wald")
fitter = DistributionFitter(spark, excluded_distributions=exclusions)

# Or exclude nothing (fit all distributions - may be slow)
fitter = DistributionFitter(spark, excluded_distributions=())
```

## Documentation

Full documentation is available at [spark-bestfit.readthedocs.io](https://spark-bestfit.readthedocs.io/en/latest/).

## Contributing

Contributions are welcome! Please feel free to submit a Pull Request.

1. Fork the repository
2. Create your feature branch (`git checkout -b feat/amazing-feature`)
3. Commit your changes (`git commit -m 'feat: add amazing feature'`)
4. Push to the branch (`git push origin feat/amazing-feature`)
5. Open a Pull Request

## License

This project is licensed under the MIT License - see the [LICENSE](LICENSE) file for details.
