Metadata-Version: 2.1
Name: diameter-clustering
Version: 0.0.1
Summary: Clustering with maximum distance between points inside clusters
Home-page: https://github.com/antklen/diameter-clustering
Author: Anton Klenitskiy
Author-email: ant-klen@yandex.ru
License: UNKNOWN
Platform: UNKNOWN
Classifier: Programming Language :: Python :: 3
Classifier: License :: OSI Approved :: MIT License
Classifier: Operating System :: OS Independent
Classifier: Development Status :: 3 - Alpha
Description-Content-Type: text/markdown
Requires-Dist: numpy
Requires-Dist: numpy-groupies
Requires-Dist: scikit-learn
Requires-Dist: scipy
Requires-Dist: tqdm

# Clustering with maximum diameter

Clustering algorithms with maximum distance between points inside clusters.

When the metric is interpretable, such as cosine distance, it is useful to bound the maximum distance between points inside a cluster. We can then choose a sensible threshold for that maximum distance and be confident that the points inside each cluster are really similar. Unfortunately, popular clustering algorithms do not offer such a guarantee.

The main algorithm is MaxDiameterClustering. It is a simple greedy algorithm that adds points one by one. If there is a cluster in which all points are close enough to the new point, the new point is added to that cluster. If there is no such cluster, the point starts a new cluster.
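
The greedy procedure described above can be sketched in a few lines of plain Python. This is an illustration only, not the package's implementation: the function names are hypothetical, points are processed in the order given, a point joins the first cluster that fits, and the threshold comparison is assumed to be inclusive (`<=`).

```python
import math


def cosine_distance(a, b):
    """Cosine distance 1 - cos(a, b) for two non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (norm_a * norm_b)


def greedy_max_diameter(points, max_distance, dist=cosine_distance):
    """Hypothetical sketch: put each point into the first cluster whose
    members are all within max_distance of it, else start a new cluster."""
    clusters = []  # clusters[k] holds the indices of the points in cluster k
    labels = []
    for i, point in enumerate(points):
        for label, members in enumerate(clusters):
            if all(dist(point, points[j]) <= max_distance for j in members):
                members.append(i)
                labels.append(label)
                break
        else:
            # No existing cluster accepts the point, so it starts a new one
            clusters.append([i])
            labels.append(len(clusters) - 1)
    return labels


points = [(1.0, 0.0), (0.9, 0.1), (0.0, 1.0), (0.1, 0.9), (1.0, 0.05)]
labels = greedy_max_diameter(points, max_distance=0.3)
print(labels)
```

Because a point is compared against every member of a cluster before joining it, no two points in the same cluster end up farther apart than `max_distance`.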

Two similar algorithms are also included: Leader Clustering and Quality Threshold Clustering.

## Usage

### MaxDiameterClustering

Basic usage of MaxDiameterClustering:
```python
from sklearn.datasets import make_blobs
from diameter_clustering import MaxDiameterClustering

X, y = make_blobs(n_samples=100, n_features=50)

model = MaxDiameterClustering(max_distance=0.3, metric='cosine')
labels = model.fit_predict(X)
```

When we want cosine distance and the vectors are already normalized, it is better to use
`inner_product` as the metric because it is much faster:
```python
import numpy as np

# Scale each row to unit length; the small constant avoids division by zero
X_normalized = X / (np.linalg.norm(X, axis=-1, keepdims=True) + 1e-16)

model = MaxDiameterClustering(max_distance=0.3, metric='inner_product')
labels = model.fit_predict(X_normalized)
```
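
The reason the two metrics agree on normalized input is that the norms in the cosine formula both equal 1, so only the inner product is left. The sketch below checks this numerically in plain Python. It assumes that the `inner_product` metric corresponds to `1 - <a, b>`, which is not stated here; the helper names are hypothetical.

```python
import math


def normalize(v):
    """Scale a vector to unit Euclidean length."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def cosine_distance(a, b):
    """Full cosine distance, which computes both norms."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (norm_a * norm_b)


def inner_product_distance(a, b):
    """Assumed form of the cheaper metric: valid only for unit-length vectors."""
    return 1.0 - sum(x * y for x, y in zip(a, b))


a = normalize([3.0, 4.0, 0.0])
b = normalize([1.0, 2.0, 2.0])

difference = abs(cosine_distance(a, b) - inner_product_distance(a, b))
print(difference < 1e-12)
```

Skipping the norm computation for every pair is where the speed-up would come from under this assumption.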

Instead of the feature matrix `X`, we can pass a precomputed distance matrix:
```python
from diameter_clustering.dist_matrix import compute_dist_matrix

dist_matrix = compute_dist_matrix(X, metric='cosine')

model = MaxDiameterClustering(max_distance=0.3, precomputed_dist=True)
labels = model.fit_predict(dist_matrix)
```

Computing the full distance matrix between all points is expensive, so for large datasets
it is better to use a distance matrix in sparse format:
```python
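# Import path assumed to mirror compute_dist_matrix above
from diameter_clustering.dist_matrix import compute_sparse_dist_matrix

# Option 1: let the model build the sparse distance matrix from the features
model = MaxDiameterClustering(max_distance=0.3, metric='cosine', sparse_dist=True)
labels = model.fit_predict(X)

# Option 2: pass a precomputed sparse distance matrix
dist_matrix = compute_sparse_dist_matrix(X, max_distance=0.3, metric='cosine')
model = MaxDiameterClustering(max_distance=0.3, sparse_dist=True, precomputed_dist=True)
labels = model.fit_predict(dist_matrix)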
```
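
The idea behind a sparse distance matrix can be illustrated as follows: two points farther apart than `max_distance` can never share a cluster, so their distance does not need to be stored. This sketch keeps close pairs in a plain dictionary. The storage format actually used by the package is not described here, and the function names are hypothetical.

```python
import math


def cosine_distance(a, b):
    """Cosine distance 1 - cos(a, b) for two non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (norm_a * norm_b)


def sparse_distances(points, max_distance, dist=cosine_distance):
    """Hypothetical sketch: store only pairs (i, j), i < j, whose distance
    is at most max_distance; every other pair is treated as too far apart."""
    close_pairs = {}
    for i in range(len(points)):
        for j in range(i + 1, len(points)):
            d = dist(points[i], points[j])
            if d <= max_distance:
                close_pairs[(i, j)] = d
    return close_pairs


points = [(1.0, 0.0), (0.9, 0.1), (0.0, 1.0), (0.1, 0.9)]
close_pairs = sparse_distances(points, max_distance=0.3)
print(sorted(close_pairs))
```

Only the pairs inside the threshold are kept, so memory use grows with the number of close pairs rather than with the square of the number of points.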

With `deterministic=True` we can get reproducible results:
```python
model = MaxDiameterClustering(max_distance=0.3, metric='cosine', deterministic=True)
labels = model.fit_predict(X)
```

### Leader Clustering

```python
from diameter_clustering import LeaderClustering

model = LeaderClustering(max_radius=0.15, metric='cosine')
labels = model.fit_predict(X)
```

A precomputed distance matrix, a sparse distance matrix, deterministic behavior, and the
`inner_product` metric can be used in the same way as in MaxDiameterClustering.
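
For intuition, the textbook leader algorithm can be sketched as below. It is shown as background only: the package's implementation may differ in point ordering, tie-breaking, and how `max_radius` is compared. The function name and the choice of Euclidean distance are assumptions made for this illustration.

```python
import math


def leader_clustering(points, max_radius, dist=math.dist):
    """Textbook leader algorithm: a point joins the first leader within
    max_radius of it, otherwise the point becomes a new leader."""
    leaders = []  # indices of the points that started a cluster
    labels = []
    for i, point in enumerate(points):
        for label, leader in enumerate(leaders):
            if dist(point, points[leader]) <= max_radius:
                labels.append(label)
                break
        else:
            # No leader is close enough, so this point leads a new cluster
            leaders.append(i)
            labels.append(len(leaders) - 1)
    return labels


points = [(0.0,), (0.1,), (1.0,), (1.05,), (0.2,)]
labels = leader_clustering(points, max_radius=0.15)
print(labels)
```

Unlike the greedy maximum-diameter procedure, each point is compared only with the cluster leaders. In this example the last point starts its own cluster because it is more than `max_radius` from the first leader, even though it is close to another member of that cluster.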


### Quality Threshold Clustering

```python
from diameter_clustering import QTClustering

model = QTClustering(max_radius=0.15, metric='cosine', min_cluster_size=5)
labels = model.fit_predict(X)
```

A precomputed distance matrix, a sparse distance matrix, and the `inner_product` metric
can be used in the same way as in MaxDiameterClustering. This algorithm is deterministic by design.
