Metadata-Version: 2.2
Name: qkmeans
Version: 0.1.0
Summary: Fast k-means++ seeding under the manifold hypothesis
Keywords: clustering,k-means,machine-learning,manifold-hypothesis
Author: Anonymous
License: MIT
Classifier: Development Status :: 4 - Beta
Classifier: Intended Audience :: Science/Research
Classifier: License :: OSI Approved :: MIT License
Classifier: Operating System :: OS Independent
Classifier: Programming Language :: C++
Classifier: Programming Language :: Python :: 3
Classifier: Programming Language :: Python :: 3.9
Classifier: Programming Language :: Python :: 3.10
Classifier: Programming Language :: Python :: 3.11
Classifier: Programming Language :: Python :: 3.12
Classifier: Programming Language :: Python :: 3.13
Classifier: Topic :: Scientific/Engineering :: Artificial Intelligence
Project-URL: Homepage, https://github.com/PoojanCShah/icml2026
Project-URL: Repository, https://github.com/PoojanCShah/icml2026
Requires-Python: >=3.9
Requires-Dist: numpy>=1.20
Requires-Dist: scikit-learn>=1.0
Provides-Extra: faiss
Requires-Dist: faiss-cpu>=1.7.0; extra == "faiss"
Provides-Extra: faiss-gpu
Requires-Dist: faiss-gpu>=1.7.0; extra == "faiss-gpu"
Provides-Extra: dev
Requires-Dist: pytest>=7.0; extra == "dev"
Requires-Dist: pytest-benchmark>=4.0; extra == "dev"
Requires-Dist: build; extra == "dev"
Requires-Dist: cibuildwheel>=2.21; extra == "dev"
Description-Content-Type: text/markdown

# QKMeans

Fast k-means++ seeding under the manifold hypothesis.

[![License: MIT](https://img.shields.io/badge/License-MIT-yellow.svg)](https://opensource.org/licenses/MIT)

## Overview

QKMeans implements the rejection-sampling-based k-means++ seeding algorithm from:

> **Faster k-means Seeding Under The Manifold Hypothesis**
> Anonymous Authors
> ICML 2026 (Under Review)

The algorithm achieves **O(nD) + O(D(k log k)^(1+δ))** runtime under the manifold hypothesis, compared to **O(nkD)** for standard k-means++, while providing an **O(ρ⁻² log k)** approximation guarantee.

## Installation

### From pre-built wheels (recommended)

```bash
pip install qkmeans
```

### From source

```bash
pip install git+https://github.com/PoojanCShah/icml2026.git
```

### With optional FAISS backend

```bash
pip install "qkmeans[faiss]"
```

## Quick Start

```python
from qkmeans import QKMeansSeeding
import numpy as np

# Generate sample data
X = np.random.randn(10000, 128)

# Basic usage
seeder = QKMeansSeeding(n_clusters=100)
seeder.fit(X)

print(f"Centers shape: {seeder.cluster_centers_.shape}")
print(f"Seeding time: {seeder.seeding_time_:.2f} ms")

# Use as sklearn-compatible estimator
labels = seeder.fit_predict(X)
distances = seeder.transform(X)
```

## Features

- **Fast**: Achieves sublinear-in-k runtime under the manifold hypothesis
- **Accurate**: Provides O(log k) approximation guarantee like standard k-means++
- **scikit-learn compatible**: Drop-in replacement with familiar API
- **Configurable ANNS backends**: LSH (default), FAISS, HNSW, or brute force
- **Optimized**: AVX2 SIMD + OpenMP parallelization

## API Reference

### QKMeansSeeding

```python
QKMeansSeeding(
    n_clusters=8,           # Number of clusters
    anns='lsh',             # ANNS backend: 'lsh', 'faiss', 'hnsw', 'brute_force'
    anns_params=None,       # Backend-specific parameters
    max_rejections=None,    # Max rejection iterations (auto if None)
    random_state=None,      # Random seed
    n_jobs=None,            # Number of threads (-1 for all)
)
```

### ANNS Backend Parameters

**LSH** (default):
```python
anns_params = {
    'num_tables': 10,   # Number of hash tables
    'num_hashes': 8,    # Hash functions per table
    'rho': 0.5,         # Approximation parameter
}
```

**HNSW** (requires hnswlib):
```python
anns_params = {
    'M': 16,                  # Bi-directional links per node
    'ef_construction': 200,   # Construction parameter
    'ef_search': 50,          # Search parameter
}
```

**FAISS** (requires faiss-cpu/faiss-gpu):
```python
anns_params = {
    'nlist': 100,   # Number of Voronoi cells
    'nprobe': 10,   # Cells to probe during search
}
```

## Benchmarks

On SIFT1M (n=1M, d=128, k=1000):

| Method | Time | Speedup |
|--------|------|---------|
| sklearn k-means++ | 45.2s | 1x |
| QKMeans (LSH) | 3.8s | 11.9x |
| QKMeans (HNSW) | 2.1s | 21.5x |

## Algorithm Details

The algorithm uses rejection sampling to approximate the D² distribution:

1. **Preprocess**: Center data, compute squared norms in O(nD)
2. **Sample first center**: Uniformly at random
3. **For each subsequent center**:
   - Sample from proposal distribution κ(x|C) using alias method
   - Accept with probability r = cost(x,C) / (2·proposal_weight)
   - If rejected too many times, fall back to uniform sampling
4. **Insert accepted point** into ANNS data structure

Under the manifold hypothesis, the scaling laws guarantee that the oversampling factor β = O(k^ε) where ε = 2/d, leading to the improved runtime.

## Citation

```bibtex
@inproceedings{anonymous2026qkmeans,
  title={Faster k-means Seeding Under The Manifold Hypothesis},
  author={Anonymous},
  booktitle={International Conference on Machine Learning},
  year={2026},
  note={Under review}
}
```

## License

MIT License. See [LICENSE](LICENSE) for details.
