Metadata-Version: 2.1
Name: opstats
Version: 1.1.0
Summary: Online parallel statistics calculator.
Home-page: https://github.com/arkershaw/opstats
Author: Andy Kershaw
Author-email: arkershaw@users.noreply.github.com
Project-URL: Bug Tracker, https://github.com/arkershaw/opstats/issues
Classifier: Programming Language :: Python :: 3
Classifier: License :: OSI Approved :: MIT License
Classifier: Operating System :: OS Independent
Classifier: Development Status :: 4 - Beta
Classifier: Intended Audience :: Developers
Classifier: License :: OSI Approved :: GNU General Public License v3 (GPLv3)
Classifier: Natural Language :: English
Classifier: Typing :: Typed
Description-Content-Type: text/markdown
License-File: LICENSE

# opstats
A Python implementation of an online, parallel statistics calculator. The library calculates the total, mean, variance, standard deviation, skewness and kurtosis of a sequence of data points. It can also calculate the covariance and correlation between two sequences of data points.

Online calculation is appropriate when the entire dataset is not yet available (e.g. in a streaming environment), so the mean cannot be calculated up front. However, it is more processor-intensive than the traditional methods.

When combined with parallel computation, it is also useful for very large datasets, because it works in a single pass and the work can be distributed.

## Installation

`pip install opstats`

## Usage

### Online Calculator

```
import random
from opstats import OnlineCalculator
data_points = random.sample(range(1, 100), 20)
stats = OnlineCalculator()
for d in data_points:
    stats.add(d)
result = stats.get()
```

The result is a NamedTuple containing the statistics computed so far. More data can be added afterwards and the result retrieved again.
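
For intuition about what an online update involves, the sketch below shows Welford's recurrence for the running mean and variance, one of the methods described in the Wikipedia article listed under Credits. It is an illustration only and not the opstats implementation: the class name `WelfordSketch` and its attributes are hypothetical, and it assumes the sample (n - 1) variance, which may differ from what opstats reports.

```python
import math


class WelfordSketch:
    """Hypothetical illustration of an online mean/variance update (not opstats code)."""

    def __init__(self):
        self.count = 0
        self.mean = 0.0
        self.m2 = 0.0  # running sum of squared deviations from the current mean

    def add(self, x):
        self.count += 1
        delta = x - self.mean            # deviation from the mean before the update
        self.mean += delta / self.count
        self.m2 += delta * (x - self.mean)  # second factor uses the updated mean

    def variance(self):
        # Sample variance (assumption); undefined for fewer than two points.
        if self.count < 2:
            return math.nan
        return self.m2 / (self.count - 1)


sketch = WelfordSketch()
for value in [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]:
    sketch.add(value)
```

Each call to `add` touches only three numbers, so the statistics can be read at any point without storing or revisiting earlier data.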

### Parallel Processing

Data can be split into multiple parts and processed in parallel. The resulting statistics can be combined using the `aggregate_stats` function.

```
from opstats import OnlineCalculator, aggregate_stats
# Reuses data_points from the Online Calculator example above.
# Divide the sample data in half.
left_data = data_points[:len(data_points)//2]
right_data = data_points[len(data_points)//2:]
# Create stats for each half.
left = OnlineCalculator()
for d in left_data:
    left.add(d)
right = OnlineCalculator()
for d in right_data:
    right.add(d)
# Combine the results.
result = aggregate_stats([left.get(), right.get()])
```
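
To show why partial results can be combined without revisiting the data, here is a minimal sketch of the pairwise merge of count, mean and sum of squared deviations described in the parallel-algorithm section of the Wikipedia article listed under Credits. It is not necessarily how `aggregate_stats` works internally. The names `Moments`, `moments_of` and `merge_moments` are hypothetical, and the sketch assumes non-empty inputs.

```python
from typing import NamedTuple


class Moments(NamedTuple):
    """Hypothetical summary of one partition (not an opstats type)."""
    count: int
    mean: float
    m2: float  # sum of squared deviations from the mean


def moments_of(values):
    # Direct two-pass computation for a non-empty partition.
    count = len(values)
    mean = sum(values) / count
    m2 = sum((v - mean) ** 2 for v in values)
    return Moments(count, mean, m2)


def merge_moments(a, b):
    # Combine two partition summaries into the summary of their union.
    count = a.count + b.count
    delta = b.mean - a.mean
    mean = a.mean + delta * b.count / count
    m2 = a.m2 + b.m2 + delta * delta * a.count * b.count / count
    return Moments(count, mean, m2)


values = [3.0, 1.0, 4.0, 1.0, 5.0, 9.0, 2.0, 6.0]
merged = merge_moments(moments_of(values[:3]), moments_of(values[3:]))
whole = moments_of(values)
```

Here `merged` and `whole` agree up to floating-point rounding, even though the two partitions have different sizes.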

### Covariance and Correlation

The `OnlineCovariance` class and `aggregate_covariance` function work in the same manner as above for calculating the covariance and correlation between two sequences of data points.
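
The same single-pass idea extends to pairs of values. The sketch below keeps a running co-moment alongside the two means. It is an illustration only: `CovarianceSketch` and its methods are hypothetical and do not describe the `OnlineCovariance` API, and the sample (n - 1) covariance is an assumption.

```python
import math


class CovarianceSketch:
    """Hypothetical illustration of an online covariance update (not opstats code)."""

    def __init__(self):
        self.count = 0
        self.mean_x = 0.0
        self.mean_y = 0.0
        self.comoment = 0.0  # running sum of (x - mean_x) * (y - mean_y)

    def add(self, x, y):
        self.count += 1
        dx = x - self.mean_x                      # uses the mean of x before the update
        self.mean_x += dx / self.count
        self.mean_y += (y - self.mean_y) / self.count
        self.comoment += dx * (y - self.mean_y)   # uses the mean of y after the update

    def covariance(self):
        # Sample covariance (assumption); undefined for fewer than two pairs.
        if self.count < 2:
            return math.nan
        return self.comoment / (self.count - 1)


cov = CovarianceSketch()
for x, y in zip([1.0, 2.0, 3.0, 4.0], [2.0, 4.0, 6.0, 8.0]):
    cov.add(x, y)
```

Dividing the covariance by the product of the two standard deviations would give the correlation.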

## Credits

Online calculator adapted from:
https://en.wikipedia.org/wiki/Algorithms_for_calculating_variance
(Terriberry, Timothy B)

Aggregation translated from:
https://rdrr.io/cran/utilities/src/R/sample.decomp.R
