Metadata-Version: 2.4
Name: fast_cooc
Version: 0.1.0
Summary: GPU-first co-occurrence matrix builder for large text corpora
Author: fast_cooc contributors
License-Expression: Apache-2.0
Project-URL: Repository, https://example.com/fast_cooc
Classifier: Development Status :: 3 - Alpha
Classifier: Programming Language :: Python :: 3
Classifier: Topic :: Scientific/Engineering
Requires-Python: >=3.10
Description-Content-Type: text/markdown
Requires-Dist: numpy>=2.0
Requires-Dist: scipy>=1.12
Requires-Dist: warpdata>=0.1.0
Requires-Dist: cupy-cuda12x>=14.0.0
Provides-Extra: hf
Requires-Dist: transformers>=4.45; extra == "hf"

# fast_cooc

GPU-first co-occurrence matrix builder for large text corpora. Uses CUDA kernels via CuPy to count word co-occurrences, with automatic dense/sparse mode selection based on vocabulary size and available VRAM.

## Installation

```bash
pip install .
```

For HuggingFace tokenizer support:

```bash
pip install ".[hf]"
```

**Requirements:** CUDA-capable GPU, Python 3.10+, CUDA 12.x toolkit.

## Quickstart

### CLI

```bash
# Wikipedia with default settings (auto mode, regex tokenizer)
fast-cooc --dataset nlp/wikipedia --workdir wiki_out

# Small vocab → auto-selects dense mode (fast atomicAdd path)
fast-cooc --dataset nlp/wikipedia --max-vocab 30000 --workdir wiki_out

# Large vocab → auto-selects sparse mode
fast-cooc --dataset nlp/wikipedia --max-vocab 300000 --workdir wiki_out

# Force a specific mode
fast-cooc --dataset nlp/wikipedia --mode dense --max-vocab 30000

# Use a HuggingFace tokenizer
fast-cooc --dataset nlp/wikipedia --tokenizer hf --hf-model bert-base-uncased
```

### Python API

```python
from fast_cooc import run_warp_pipeline, VocabConfig, CountConfig

out_path = run_warp_pipeline(
    dataset_id="nlp/wikipedia",
    workdir="wiki_out",
    row_limit=100_000,                       # None for full dataset
    vocab_cfg=VocabConfig(min_freq=10, max_vocab=30_000),
    count_cfg=CountConfig(window=5, max_vram_gb=20.0, mode="auto"),
)
# out_path is either wiki_out/embeddings.npy (dense) or wiki_out/cooc_final.npz (sparse)
```

#### Using the GPU kernel directly

```python
import numpy as np
from fast_cooc import count_cooc_gpu, collapse

# corpus_ids: int32 array where each value is a word index (or -1 for OOV)
corpus_ids = np.array([0, 1, 2, 3, 1, 0, 2, 3, 1, 2], dtype=np.int32)
n_vocab = 4

# Dense mode — returns a CuPy (n_vocab, n_vocab) float32 array on GPU
matrix = count_cooc_gpu(corpus_ids, n_vocab, window=2, mode="dense")

# Collapse: log1p → center columns → L2 normalize rows → CPU numpy array
embeddings = collapse(matrix)

# Sparse mode — returns a scipy.sparse.csr_matrix
sparse_matrix = count_cooc_gpu(corpus_ids, n_vocab, window=2, mode="sparse")
```
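
To make the counting semantics concrete, here is a minimal CPU reference in NumPy. It is a sketch under stated assumptions, not the package's CUDA kernel: it assumes every pair of tokens within `window` positions counts once in each direction, with no distance weighting, and that `-1` (OOV) positions are skipped. `count_cooc_reference` is a hypothetical helper name.

```python
import numpy as np

def count_cooc_reference(corpus_ids, n_vocab, window):
    """Brute-force CPU co-occurrence counts (illustrative stand-in only)."""
    counts = np.zeros((n_vocab, n_vocab), dtype=np.float32)
    n = len(corpus_ids)
    for i in range(n):
        center = corpus_ids[i]
        if center < 0:  # assumed: OOV centers contribute nothing
            continue
        lo, hi = max(0, i - window), min(n, i + window + 1)
        for j in range(lo, hi):
            ctx = corpus_ids[j]
            if j == i or ctx < 0:  # skip the center itself and OOV context
                continue
            counts[center, ctx] += 1.0
    return counts

corpus_ids = np.array([0, 1, 2, 3, 1, 0, 2, 3, 1, 2], dtype=np.int32)
ref = count_cooc_reference(corpus_ids, n_vocab=4, window=2)
```

On small inputs, a reference like this can help sanity-check GPU output, provided the real kernel follows the same conventions.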

#### Custom tokenizers

```python
from fast_cooc import run_warp_pipeline, RegexTokenizer, HFTokenizer, make_tokenizer

# Regex (default)
tok = RegexTokenizer(pattern=r"[A-Za-z0-9_']+", lowercase=True)

# HuggingFace subword tokenizer
tok = HFTokenizer(model_name="bert-base-uncased")

# Factory function
tok = make_tokenizer("regex")
tok = make_tokenizer("hf", hf_model="bert-base-uncased")
tok = make_tokenizer("callable", fn=lambda text: text.lower().split())

out = run_warp_pipeline("nlp/wikipedia", "wiki_out", tokenizer=tok)
```

## Dense vs Sparse mode

|  | Dense (`atomicAdd`) | Sparse (`cp.unique` + triplets) |
|---|---|---|
| **VRAM** | `V * V * 4` bytes (30k vocab = 3.6 GB) | Just chunk buffers |
| **Speed** | Fastest — no GPU sort, no CPU roundtrips | Slower — GPU radix sort per chunk |
| **Vocab limit** | ~50k on a 20 GB card | Unlimited |
| **Output** | `embeddings.npy` (after collapse) | `cooc_final.npz` (raw counts) |
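
The sparse column's "unique + triplets" idea can be illustrated on the CPU with `np.unique`, which mirrors the interface of `cp.unique`. This is a sketch only: the key encoding (`row * n_vocab + col`) and the helper name `aggregate_pairs` are assumptions, not the package's internals.

```python
import numpy as np

def aggregate_pairs(rows, cols, n_vocab):
    """Collapse duplicate (row, col) pairs into unique triplets with counts."""
    # Encode each pair as one int64 key so a single sort can group duplicates.
    keys = rows.astype(np.int64) * n_vocab + cols.astype(np.int64)
    uniq, counts = np.unique(keys, return_counts=True)
    # Decode the keys back into row and column indices.
    return uniq // n_vocab, uniq % n_vocab, counts

rows = np.array([0, 1, 0, 2, 1, 0])
cols = np.array([1, 0, 1, 3, 0, 2])
r, c, v = aggregate_pairs(rows, cols, n_vocab=4)
```

The resulting `(r, c, v)` triplets are the kind of input `scipy.sparse.coo_matrix((v, (r, c)), shape=(n_vocab, n_vocab))` accepts, which can then be converted to CSR.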

**`mode="auto"`** (default) picks dense when the float32 `V x V` matrix (`V * V * 4` bytes) fits within 80% of the VRAM budget set by `--max-vram-gb`, and sparse otherwise.
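
The selection rule can be sketched in a few lines. `pick_mode` is a hypothetical helper, and treating 1 GB as `1e9` bytes is an assumption; the package may measure the budget differently.

```python
def pick_mode(n_vocab, max_vram_gb=20.0, dense_fraction=0.8):
    """Return "dense" if the float32 V x V matrix fits the budget, else "sparse"."""
    dense_bytes = n_vocab * n_vocab * 4                  # float32 V x V matrix
    budget_bytes = max_vram_gb * 1e9 * dense_fraction    # assumes 1 GB = 1e9 bytes
    return "dense" if dense_bytes <= budget_bytes else "sparse"

small = pick_mode(30_000)    # 3.6 GB matrix against a 16 GB allowance
large = pick_mode(300_000)   # 360 GB matrix, far over budget
```

Under these assumptions the cutoff at the default 20 GB budget falls near 63k tokens, somewhat above the approximate figure in the table, which may leave headroom for chunk buffers.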

## CLI reference

| Flag | Default | Description |
|---|---|---|
| `--dataset` | `nlp/wikipedia` | [warpdata](https://github.com/warpdata) dataset ID |
| `--workdir` | `wiki_cooc_work` | Output directory |
| `--row-limit` | `0` (all) | Max dataset rows to process |
| `--min-freq` | `10` | Minimum token frequency for vocabulary |
| `--max-vocab` | `300000` | Maximum vocabulary size |
| `--window` | `5` | Context window size: number of tokens considered on each side |
| `--max-vram-gb` | `20.0` | GPU VRAM budget in GB |
| `--mode` | `auto` | `auto`, `dense`, or `sparse` |
| `--flush-triplets-every` | `20000000` | Sparse mode: flush to CPU every N accumulated entries |
| `--tokenizer` | `regex` | `regex` or `hf` |
| `--hf-model` | `bert-base-uncased` | HuggingFace model name (when `--tokenizer hf`) |

## Pipeline stages

```
1. Build vocabulary     — tokenize dataset, count frequencies, filter by min_freq/max_vocab
2. Write ID memmap      — second pass: map tokens to int32 indices, write to disk
3. GPU co-occurrence    — chunk corpus, launch CUDA kernels, accumulate counts
4. Collapse (dense)     — log1p → column-center → L2 normalize → embeddings.npy
   Save (sparse)        — raw counts → cooc_final.npz
```
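
Stages 1 and 2 can be sketched with the standard library and NumPy. This is an illustration under assumptions: it assigns indices in descending frequency order and applies `max_vocab` before `min_freq`, neither of which the source confirms. `build_vocab` and `encode` are hypothetical helper names.

```python
from collections import Counter
import numpy as np

def build_vocab(token_stream, min_freq=10, max_vocab=300_000):
    """Stage 1: count tokens, keep the most frequent ones above min_freq."""
    freq = Counter(token_stream)
    kept = [w for w, c in freq.most_common(max_vocab) if c >= min_freq]
    return {w: i for i, w in enumerate(kept)}

def encode(tokens, word2idx):
    """Stage 2: map tokens to int32 IDs, using -1 for out-of-vocabulary tokens."""
    return np.array([word2idx.get(t, -1) for t in tokens], dtype=np.int32)

tokens = "the cat sat on the mat the cat".split()
word2idx = build_vocab(tokens, min_freq=2, max_vocab=10)
ids = encode(tokens, word2idx)
```

The real pipeline writes the IDs to `corpus_ids.i32` rather than keeping them in memory; assuming a raw int32 layout, `np.memmap(path, dtype=np.int32, mode="r")` would read them back.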

## Output files

All outputs are written to `--workdir`:

| File | Description |
|---|---|
| `word2idx.json` | `{word: index}` vocabulary mapping |
| `corpus_ids.i32` | Memory-mapped int32 token ID array |
| `embeddings.npy` | Dense mode: collapsed embeddings, shape `(V, V)`, float32 |
| `cooc_final.npz` | Sparse mode: raw co-occurrence counts (scipy CSR) |

## API reference

### `run_warp_pipeline`

```python
run_warp_pipeline(
    dataset_id: str,
    workdir: str,
    row_limit: Optional[int] = None,
    vocab_cfg: VocabConfig = VocabConfig(),
    count_cfg: CountConfig = CountConfig(),
    tokenizer: Optional[TokenizerBackend] = None,
) -> Path
```

End-to-end pipeline. Returns the path to the output file: `embeddings.npy` in dense mode, `cooc_final.npz` in sparse mode.

### `count_cooc_gpu`

```python
count_cooc_gpu(
    corpus_ids: np.ndarray,           # int32 token IDs (-1 = OOV)
    n_vocab: int,
    window: int = 5,
    max_vram_gb: float = 20.0,
    mode: Literal["auto", "dense", "sparse"] = "auto",
    flush_triplets_every: int = 20_000_000,
    progress_every_tokens: int = 0,
) -> Union[cp.ndarray, sp.csr_matrix]
```

Counts co-occurrences on the GPU. Returns a dense CuPy array in dense mode or a `scipy.sparse.csr_matrix` in sparse mode.

### `collapse`

```python
collapse(matrix: cp.ndarray) -> np.ndarray
```

Transforms a dense co-occurrence matrix into normalized embeddings on GPU: `log1p` → subtract column means → L2 normalize rows. Returns a CPU numpy array.
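
The three steps can be written as a CPU reference in NumPy. This is a sketch, not the GPU implementation: `collapse_reference` is a hypothetical name, and the `eps` guard against all-zero rows is an assumption.

```python
import numpy as np

def collapse_reference(counts, eps=1e-12):
    """log1p, subtract column means, then L2-normalize each row."""
    x = np.log1p(counts.astype(np.float32))            # dampen large counts
    x -= x.mean(axis=0, keepdims=True)                 # center every column
    norms = np.linalg.norm(x, axis=1, keepdims=True)   # one L2 norm per row
    return x / np.maximum(norms, eps)                  # eps avoids division by zero

counts = np.array([[0, 2, 1],
                   [2, 0, 3],
                   [1, 3, 0]], dtype=np.float32)
emb = collapse_reference(counts)
```

Each row of `emb` has unit length, so dot products between rows can be read directly as cosine similarities.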

### Config

```python
VocabConfig(min_freq: int = 10, max_vocab: int = 300_000)
CountConfig(window: int = 5, max_vram_gb: float = 20.0, mode: str = "auto", flush_triplets_every: int = 20_000_000)
```

### Tokenizers

All tokenizers implement the `TokenizerBackend` protocol (`tokenize(text: str) -> Iterable[str]`).

| Class | Args |
|---|---|
| `RegexTokenizer` | `pattern=r"[A-Za-z0-9_']+"`, `lowercase=True` |
| `HFTokenizer` | `model_name`, `lowercase=False`, `use_fast=True` |
| `CallableTokenizer` | `fn: Callable[[str], Iterable[str]]` |
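
Since the protocol only requires a `tokenize` method, any class with that method should qualify. The sketch below defines a local stand-in for `TokenizerBackend` so it runs without the package installed; `WhitespaceTokenizer` is a hypothetical example, not part of `fast_cooc`.

```python
from typing import Iterable, Protocol

class TokenizerBackend(Protocol):
    """Local stand-in mirroring the documented protocol."""
    def tokenize(self, text: str) -> Iterable[str]: ...

class WhitespaceTokenizer:
    """Hypothetical tokenizer: optional lowercasing, then split on whitespace."""
    def __init__(self, lowercase: bool = True):
        self.lowercase = lowercase

    def tokenize(self, text: str) -> Iterable[str]:
        if self.lowercase:
            text = text.lower()
        return text.split()

tok: TokenizerBackend = WhitespaceTokenizer()
tokens = list(tok.tokenize("GPU first Co-occurrence"))
```

An instance like `tok` would then be passed as `tokenizer=tok` to `run_warp_pipeline`, as in the custom tokenizer example earlier.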
