Metadata-Version: 2.3
Name: ckmeans
Version: 0.2.5
Classifier: Programming Language :: Python
Classifier: Programming Language :: Python :: 3.10
Classifier: Programming Language :: Python :: 3.11
Classifier: Programming Language :: Python :: 3.12
Classifier: Development Status :: 5 - Production/Stable
Classifier: Intended Audience :: Science/Research
Classifier: License :: OSI Approved :: MIT License
Classifier: Operating System :: OS Independent
Requires-Dist: pytest >=7.4.2 ; extra == 'test'
Requires-Dist: ckmeans-1d-dp >=4.3.4.4 ; extra == 'test'
Provides-Extra: test
License-File: license.txt
Summary: Optimal univariate (1D) clustering based on Ckmeans.1d.dp
Keywords: ckmeans,clustering,jenks
Author-email: Stephan Hügel <urschrei@gmail.com>
Requires-Python: >=3.10
Description-Content-Type: text/markdown; charset=UTF-8; variant=GFM
Project-URL: Repository, https://github.com/urschrei/ckmeans_py
Project-URL: Tracker, https://github.com/urschrei/ckmeans_py/issues

# CKmeans: Optimal Univariate Clustering

Ckmeans clustering is an improvement on 1-dimensional (univariate) heuristic-based clustering approaches such as [Jenks](https://en.wikipedia.org/wiki/Jenks_natural_breaks_optimization). The algorithm was developed by [Haizhou Wang and Mingzhou Song](http://journal.r-project.org/archive/2011-2/RJournal_2011-2_Wang+Song.pdf) (2011) as a [dynamic programming](https://en.wikipedia.org/wiki/Dynamic_programming) approach to the problem of clustering numeric data into groups with the least within-group sum-of-squared-deviations.

Minimizing the difference within groups – what Wang & Song refer to as `withinss`, or within sum-of-squares – means that groups are optimally homogenous within and the data is split into representative groups. This is very useful for visualization, where one may wish to represent a continuous variable in discrete colour or style groups. This function can provide groups that emphasize differences between data.

Being a dynamic approach, this algorithm is based on two matrices that store incrementally-computed values for squared deviations and backtracking indexes.

Unlike the [original implementation](https://cran.r-project.org/web/packages/Ckmeans.1d.dp/index.html), this implementation does not include any code to automatically determine the optimal number of clusters: this information needs to be explicitly provided. It **does** provide the `roundbreaks` method to aid labelling, however.

## Implementation
This library uses the [`ckmeans`](https://crates.io/crates/ckmeans) Rust crate, by the same author.

## Benchmarks
Install optional dependencies, then run `benchmark.py`.

[ckmeans-1d-dp](https://pypi.org/project/ckmeans-1d-dp/) is about 20 % faster, but note that it only returns _indices_ identifying each cluster to which the input belongs; if you actually want to cluster your data, you need to do that yourself which I strongly suspect might be slower overall. On the other hand, if all you want is indices it may be a better choice.

# Example
```python
from ckmeans import ckmeans
import numpy as np


data = np.array([1.0, 2.0, 3.0, 4.0, 100.0, 101.0, 102.0, 103.0])
clusters = 2
result = ckmeans(data, clusters)
assert result == [
    np.array([1.0, 2.0, 3.0, 4.0]),
    np.array([100.0, 101.0, 102.0, 103.0])
]
```

