Metadata-Version: 2.4
Name: spark-bestfit
Version: 0.3.1
Summary: Modern Spark distribution fitting library with efficient parallel processing
Project-URL: Documentation, https://spark-bestfit.readthedocs.io/en/latest/
Project-URL: Homepage, https://github.com/dwsmith1983/spark-bestfit
Project-URL: Repository, https://github.com/dwsmith1983/spark-bestfit
Project-URL: Issues, https://github.com/dwsmith1983/spark-bestfit/issues
Author-email: Dustin Smith <dustin.william.smith@gmail.com>
License: MIT
License-File: LICENSE
Keywords: distribution,fitting,pyspark,scipy,spark,statistics
Classifier: Development Status :: 5 - Production/Stable
Classifier: Intended Audience :: Developers
Classifier: Intended Audience :: Science/Research
Classifier: License :: OSI Approved :: MIT License
Classifier: Programming Language :: Python :: 3
Classifier: Programming Language :: Python :: 3.11
Classifier: Programming Language :: Python :: 3.12
Classifier: Programming Language :: Python :: 3.13
Classifier: Topic :: Scientific/Engineering
Classifier: Topic :: Software Development :: Libraries :: Python Modules
Requires-Python: <3.14,>=3.11
Requires-Dist: matplotlib<4.0.0,>=3.7.0
Requires-Dist: numpy<3.0.0,>=1.24.0
Requires-Dist: pandas<3.0.0,>=1.5.0
Requires-Dist: scipy<2.0.0,>=1.11.0
Provides-Extra: dev
Requires-Dist: black>=24.10.0; extra == 'dev'
Requires-Dist: isort>=5.13.0; extra == 'dev'
Requires-Dist: mypy>=1.13.0; extra == 'dev'
Requires-Dist: pandas-stubs>=2.2.0; extra == 'dev'
Requires-Dist: pre-commit>=4.5.0; extra == 'dev'
Requires-Dist: pyarrow<19.0.0,>=12.0.0; extra == 'dev'
Requires-Dist: pyspark<5.0.0,>=3.5.0; extra == 'dev'
Requires-Dist: pytest-cov>=6.0.0; extra == 'dev'
Requires-Dist: pytest-spark>=0.6.0; extra == 'dev'
Requires-Dist: pytest>=8.3.0; extra == 'dev'
Requires-Dist: ruff>=0.8.0; extra == 'dev'
Requires-Dist: scipy-stubs>=1.14.0; extra == 'dev'
Provides-Extra: docs
Requires-Dist: pyarrow<19.0.0,>=12.0.0; extra == 'docs'
Requires-Dist: pyspark<5.0.0,>=3.5.0; extra == 'docs'
Requires-Dist: sphinx-rtd-theme>=3.0.0; extra == 'docs'
Requires-Dist: sphinx>=8.0.0; extra == 'docs'
Provides-Extra: test
Requires-Dist: pyarrow<19.0.0,>=12.0.0; extra == 'test'
Requires-Dist: pyspark<5.0.0,>=3.5.0; extra == 'test'
Requires-Dist: pytest-cov>=6.0.0; extra == 'test'
Requires-Dist: pytest-spark>=0.6.0; extra == 'test'
Requires-Dist: pytest>=8.3.0; extra == 'test'
Provides-Extra: test-base
Requires-Dist: pytest-cov>=6.0.0; extra == 'test-base'
Requires-Dist: pytest-spark>=0.6.0; extra == 'test-base'
Requires-Dist: pytest>=8.3.0; extra == 'test-base'
Description-Content-Type: text/markdown

# spark-bestfit

[![CI](https://github.com/dwsmith1983/spark-bestfit/actions/workflows/ci.yml/badge.svg)](https://github.com/dwsmith1983/spark-bestfit/actions/workflows/ci.yml)
[![PyPI version](https://img.shields.io/pypi/v/spark-bestfit)](https://pypi.org/project/spark-bestfit/)
[![Documentation Status](https://readthedocs.org/projects/spark-bestfit/badge/?version=latest)](https://spark-bestfit.readthedocs.io/en/latest/)
[![License: MIT](https://img.shields.io/badge/License-MIT-yellow.svg)](https://opensource.org/licenses/MIT)
[![Code style: black](https://img.shields.io/badge/code%20style-black-000000.svg)](https://github.com/psf/black)
[![Ruff](https://img.shields.io/endpoint?url=https://raw.githubusercontent.com/astral-sh/ruff/main/assets/badge/v2.json)](https://github.com/astral-sh/ruff)

**Modern Spark distribution fitting library with efficient parallel processing**

Efficiently fit ~100 scipy.stats distributions to your data using Spark's parallel processing with optimized Pandas UDFs and broadcast variables.

## Features

- **Parallel Processing**: Fits distributions in parallel using Spark
- **~100 Distributions**: Access to nearly all scipy.stats continuous distributions
- **Histogram-Based Fitting**: Efficient fitting using histogram representation
- **Multiple Metrics**: Compare fits using K-S statistic (default), SSE, AIC, and BIC
- **Statistical Validation**: Kolmogorov-Smirnov test with p-values for goodness-of-fit
- **Results API**: Filter, sort, and export results easily
- **Visualization**: Built-in plotting for distribution comparison and Q-Q plots
- **Flexible Configuration**: Customize bins, sampling, and distribution selection

## Installation

```bash
pip install spark-bestfit
```

## Quick Start

```python
from spark_bestfit import DistributionFitter
import numpy as np
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Generate sample data
data = np.random.normal(loc=50, scale=10, size=10_000)

# Create fitter
fitter = DistributionFitter(spark)
df = spark.createDataFrame([(float(x),) for x in data], ["value"])

# Fit distributions
results = fitter.fit(df, column="value")

# Get best fit (by K-S statistic, the default)
best = results.best(n=1)[0]
print(f"Best: {best.distribution} (KS={best.ks_statistic:.4f}, p={best.pvalue:.4f})")

# Plot
fitter.plot(best, df, "value", title="Best Fit Distribution")
```

## Compatibility Matrix

| Spark Version | Python Versions | NumPy | Pandas | PyArrow |
|---------------|-----------------|-------|--------|---------|
| **3.5.x** | 3.11, 3.12 | 1.24+ (< 2.0) | 1.5+ | 12.0 - 16.x |
| **4.0.x** | 3.12, 3.13 | 2.0+ | 2.2+ | 17.0+ |

> **Note**: Spark 3.5.x does not support NumPy 2.0. If using Spark 3.5 with Python 3.12, ensure `setuptools` is installed (provides `distutils`).

## API Overview

### Fitting Distributions

```python
from spark_bestfit import DistributionFitter

fitter = DistributionFitter(spark, random_seed=123)
results = fitter.fit(
    df,
    column="value",
    bins=100,                    # Number of histogram bins
    support_at_zero=True,        # Only fit non-negative distributions
    enable_sampling=True,        # Enable adaptive sampling
    sample_fraction=0.3,         # Sample 30% of data
    max_distributions=50,        # Limit distributions to fit
)
```

### Working with Results

```python
# Get top 5 distributions (by K-S statistic, the default)
top_5 = results.best(n=5)

# Get best by other metrics
best_sse = results.best(n=1, metric="sse")[0]
best_aic = results.best(n=1, metric="aic")[0]

# Filter by goodness-of-fit
good_fits = results.filter(ks_threshold=0.05)        # K-S statistic < 0.05
significant = results.filter(pvalue_threshold=0.05)  # p-value > 0.05

# Convert to pandas for analysis
df_pandas = results.df.toPandas()

# Use fitted distribution
samples = best.sample(size=10000)  # Generate samples
pdf_values = best.pdf(x_array)     # Evaluate PDF
cdf_values = best.cdf(x_array)     # Evaluate CDF
```

### Custom Plotting

```python
fitter.plot(
    best,
    df,
    "value",
    figsize=(16, 10),
    dpi=300,
    histogram_alpha=0.6,
    pdf_linewidth=3,
    title="Distribution Fit",
    xlabel="Value",
    ylabel="Density",
    save_path="output/distribution.png",
)
```

### Q-Q Plots

```python
# Create Q-Q plot for goodness-of-fit assessment
fitter.plot_qq(
    best,
    df,
    "value",
    max_points=1000,           # Sample size for plotting
    title="Q-Q Plot",
    save_path="output/qq_plot.png",
)
```

### Excluding Distributions

```python
from spark_bestfit import DistributionFitter, DEFAULT_EXCLUDED_DISTRIBUTIONS

# View default exclusions
print(DEFAULT_EXCLUDED_DISTRIBUTIONS)

# Include a specific distribution by removing it from exclusions
exclusions = tuple(d for d in DEFAULT_EXCLUDED_DISTRIBUTIONS if d != "wald")
fitter = DistributionFitter(spark, excluded_distributions=exclusions)

# Or exclude nothing (fit all distributions - may be slow)
fitter = DistributionFitter(spark, excluded_distributions=())
```

## Documentation

Full documentation is available at [spark-bestfit.readthedocs.io](https://spark-bestfit.readthedocs.io/en/latest/).

## Contributing

Contributions are welcome! Please feel free to submit a Pull Request.

1. Fork the repository
2. Create your feature branch (`git checkout -b feat/amazing-feature`)
3. Commit your changes (`git commit -m 'feat: add amazing feature'`)
4. Push to the branch (`git push origin feat/amazing-feature`)
5. Open a Pull Request

## License

This project is licensed under the MIT License - see the [LICENSE](LICENSE) file for details.
