Metadata-Version: 2.3
Name: clusx
Version: 0.3.1
Summary: Tool for clustering, analyzing, and benchmarking text data with advanced embeddings and statistical validation.
License: MIT
Keywords: clustering,text-analysis,nlp,natural-language-processing,machine-learning,dirichlet-process,pitman-yor-process,embeddings,sentence-transformers,power-law,data-science
Author: Serghei Iakovlev
Author-email: oss@serghei.pl
Requires-Python: >=3.11,<4
Classifier: Development Status :: 4 - Beta
Classifier: Environment :: Console
Classifier: Intended Audience :: Developers
Classifier: Intended Audience :: Science/Research
Classifier: License :: OSI Approved :: MIT License
Classifier: Natural Language :: English
Classifier: Operating System :: OS Independent
Classifier: Programming Language :: Python
Classifier: Programming Language :: Python :: 3
Classifier: Programming Language :: Python :: 3.11
Classifier: Programming Language :: Python :: 3.12
Classifier: Programming Language :: Python :: 3.13
Classifier: Programming Language :: Python :: Implementation :: CPython
Classifier: Topic :: Scientific/Engineering
Classifier: Topic :: Scientific/Engineering :: Artificial Intelligence
Classifier: Topic :: Scientific/Engineering :: Information Analysis
Classifier: Topic :: Text Processing
Classifier: Topic :: Text Processing :: Linguistic
Classifier: Typing :: Typed
Requires-Dist: click (>=8.1.8,<9.0.0)
Requires-Dist: matplotlib (>=3.10.1,<4.0.0)
Requires-Dist: numpy (>=2.2.3,<3.0.0)
Requires-Dist: pandas (>=2.2.3,<3.0.0)
Requires-Dist: powerlaw (>=1.5,<2.0)
Requires-Dist: scikit-learn (>=1.6.1,<2.0.0)
Requires-Dist: scipy (>=1.15.2,<2.0.0)
Requires-Dist: sentence-transformers (>=3.4.1,<4.0.0)
Requires-Dist: torch (>=2.3.0,<3.0.0)
Requires-Dist: tqdm (>=4.67.1,<5.0.0)
Project-URL: Bug Tracker, https://github.com/sergeyklay/clusterium/issues
Project-URL: Documentation, https://clusterium.readthedocs.io
Project-URL: Homepage, https://clusterium.readthedocs.io
Project-URL: Repository, https://github.com/sergeyklay/clusterium
Description-Content-Type: text/markdown

# Clusterium

[![CI](https://github.com/sergeyklay/clusterium/actions/workflows/ci.yml/badge.svg)](https://github.com/sergeyklay/clusterium/actions/workflows/ci.yml)
[![codecov](https://codecov.io/gh/sergeyklay/clusterium/graph/badge.svg?token=T5d9KTXtqP)](https://codecov.io/gh/sergeyklay/clusterium)
[![Documentation Status](https://readthedocs.org/projects/clusterium/badge/?version=latest)](https://clusterium.readthedocs.io/en/latest/?badge=latest)

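A toolkit for clustering, analyzing, and benchmarking text data. It combines sentence-transformer embeddings with Bayesian nonparametric clustering (Dirichlet Process and Pitman-Yor Process) and statistical validation of the results.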

## Features

- **Dirichlet Process Clustering**: Implements the Dirichlet Process for text clustering
- **Pitman-Yor Process Clustering**: Implements the Pitman-Yor Process, a generalization of the Dirichlet Process whose discount parameter can better fit power-law cluster size distributions
- **Evaluation**: Evaluates clustering results with several metrics, including the Silhouette Score, the Davies-Bouldin Index, and power-law analysis
- **Visualization**: Generates plots of cluster size distributions
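
To illustrate how the two processes listed above differ, here is a minimal sketch of the textbook Chinese Restaurant Process prior, which gives the probability that the next item joins each existing cluster or opens a new one. It is not taken from the `clusx` source code: the function name `crp_probabilities` and its parameters are hypothetical, and the actual implementation may also weight these priors by embedding similarity.

```python
def crp_probabilities(cluster_sizes, alpha, discount=0.0):
    """Prior assignment probabilities under the Chinese Restaurant Process.

    Hypothetical helper for illustration only; not part of the clusx API.
    Returns one probability per existing cluster, followed by the
    probability of opening a new cluster.
    discount=0.0 gives the Dirichlet Process; 0 < discount < 1 gives
    the Pitman-Yor Process.
    """
    if not 0.0 <= discount < 1.0:
        raise ValueError("discount must be in [0, 1)")
    if alpha <= -discount:
        raise ValueError("alpha must be greater than -discount")

    n = sum(cluster_sizes)   # items assigned so far
    k = len(cluster_sizes)   # clusters opened so far
    denom = n + alpha

    # Each existing cluster is chosen in proportion to its size,
    # reduced by the discount.
    existing = [(size - discount) / denom for size in cluster_sizes]
    # The discount mass removed from existing clusters goes to a new one.
    new = (alpha + discount * k) / denom
    return existing + [new]


sizes = [3, 1]  # two clusters holding 3 and 1 items
print(crp_probabilities(sizes, alpha=1.0))                # [0.6, 0.2, 0.2]
print(crp_probabilities(sizes, alpha=1.0, discount=0.5))  # [0.5, 0.1, 0.4]
```

With a positive discount, the chance of opening a new cluster grows with the number of existing clusters, which tends to produce the heavy-tailed cluster sizes that power-law analysis looks for.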

## Installation

For detailed installation instructions, please see the [Installation Guide](https://clusterium.readthedocs.io/en/latest/installation.html).

### Quick Start

```bash
git clone https://github.com/sergeyklay/clusterium.git
cd clusterium
poetry install
```

## Usage

For detailed usage instructions, use cases, examples, and advanced configuration options, please see the [Usage Guide](https://clusterium.readthedocs.io/en/latest/usage.html).

### Quick Start

```bash
# Run clustering
clusx --input your_data.csv --column your_column --output clusters.csv

# Evaluate clustering results and generate visualizations
clusx evaluate \
  --input your_data.csv \
  --column your_column \
  --dp-clusters output_dp.csv \
  --pyp-clusters output_pyp.csv \
  --plot
```

### Python API Example

```python
from clusx.clustering import DirichletProcess
from clusx.clustering.utils import load_data_from_csv, save_clusters_to_json

# Load data
texts, data = load_data_from_csv("your_data.csv", column="your_column")

# Perform clustering
dp = DirichletProcess(alpha=1.0)
clusters, params = dp.fit(texts)

# Save results
save_clusters_to_json("clusters.json", texts, clusters, "DP", data)
```
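
As a rough follow-up to the example above, the cluster size distribution can be inspected with the standard library alone. This sketch assumes that the clustering result is a sequence of cluster labels, one per text, which the example above does not confirm. The helper `cluster_size_distribution` is hypothetical and not part of the `clusx` API.

```python
from collections import Counter


def cluster_size_distribution(labels):
    """Return cluster sizes sorted from largest to smallest.

    Hypothetical helper; assumes `labels` holds one cluster label per text.
    """
    return sorted(Counter(labels).values(), reverse=True)


# Stand-in labels for illustration; replace with real clustering output.
labels = [0, 0, 1, 0, 2, 1]
print(cluster_size_distribution(labels))  # [3, 2, 1]
```

A few large clusters followed by a long tail of small ones is the shape that a power-law fit would be expected to capture.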

## Documentation

Full documentation is available at [https://clusterium.readthedocs.io/](https://clusterium.readthedocs.io/).

## License

This project is licensed under the MIT License. See the LICENSE file for details.

