Metadata-Version: 2.4
Name: lasr
Version: 0.1.0
Summary: LASR (Least Action Semantic Router): globally optimal text chunking for RAG
Author: LASR contributors
License-Expression: MIT
Project-URL: Homepage, https://github.com/lasr-chunker/lasr
Project-URL: Repository, https://github.com/lasr-chunker/lasr
Project-URL: Bug Tracker, https://github.com/lasr-chunker/lasr/issues
Keywords: chunking,rag,nlp,semantic,text-splitting,retrieval
Classifier: Development Status :: 3 - Alpha
Classifier: Intended Audience :: Developers
Classifier: Programming Language :: Python :: 3
Classifier: Topic :: Scientific/Engineering :: Artificial Intelligence
Classifier: Topic :: Text Processing :: Linguistic
Requires-Python: >=3.10
Description-Content-Type: text/markdown
License-File: LICENSE
Requires-Dist: numpy>=1.24
Requires-Dist: spacy>=3.5
Requires-Dist: click>=8.0
Requires-Dist: sentence-transformers>=2.2
Provides-Extra: openai
Requires-Dist: openai>=1.0; extra == "openai"
Requires-Dist: python-dotenv>=1.0; extra == "openai"
Provides-Extra: dev
Requires-Dist: pytest>=7.0; extra == "dev"
Requires-Dist: pytest-cov>=4.0; extra == "dev"
Dynamic: license-file

# LASR — Least Action Semantic Router

*Pronounced "laser"*

Globally optimal text chunking for RAG pipelines. LASR treats chunking as a physics-inspired optimization problem: among all valid ways to partition a document into contiguous runs of sentences, it selects the one that minimizes a global objective balancing semantic cohesion against boundary cost.

No heuristics. No greedy local decisions. Just dynamic programming that finds the mathematically optimal partition.

## Install

```bash
pip install lasr
python -m spacy download en_core_web_sm
```

## Quick Start

```python
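from pathlib import Path

from lasr import chunk

# read_text opens, reads, and closes the file in one call
chunks = chunk(Path("document.txt").read_text(encoding="utf-8"))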

for c in chunks:
    print(f"[{c.start_char}:{c.end_char}] ({c.num_sentences} sentences)")
    print(c.text)
    print("---")
```

## Control Granularity

```python
from pathlib import Path

from lasr import chunk

document = Path("document.txt").read_text(encoding="utf-8")

# Fewer, larger chunks (higher alpha = more expensive boundaries)
chunks = chunk(document, alpha=3.0)

# More, smaller chunks (lower alpha = cheaper boundaries)
chunks = chunk(document, alpha=1.5)

# Bound the number of sentences per chunk
chunks = chunk(document, min_sentences=3, max_sentences=20)
```

## Power-User API

For full control, use `LaserPipeline` and `LaserConfig` directly:

```python
from lasr import LaserPipeline, LaserConfig

config = LaserConfig(
    alpha_base=2.5,     # boundary cost
    rho=1.0,            # tension coefficient
    l_min=5,            # min sentences per chunk
    l_max=30,           # max sentences per chunk
    model_name="all-MiniLM-L6-v2",
)

pipeline = LaserPipeline(config)
# `text` is the full document as a single string
chunks = pipeline.chunk(text)

# Each chunk has context bleed for richer retrieval
for c in chunks:
    print(c.text)               # core DP-optimal text
    print(c.text_with_context)  # with 1-sentence bleed from neighbors
```

## CLI

```bash
lasr chunk document.txt --alpha 2.5 --format json
lasr chunk document.txt --format text --output chunks.txt
lasr chunk document.txt --encoder openai --model text-embedding-3-large
```

## Parameters

| Parameter | Default | Effect |
|-----------|---------|--------|
| `alpha` / `alpha_base` | 2.5 | Boundary cost. Higher = fewer, larger chunks. |
| `rho` | 1.0 | Tension coefficient (anchor parameter). |
| `min_sentences` / `l_min` | 5 | Minimum sentences per chunk. |
| `max_sentences` / `l_max` | 30 | Maximum sentences per chunk. |
| `w_struct` | 0.25 | Structural discount (headers, double newlines). |
| `w_bind` | 1.0 | Coreference binding penalty (pronouns). |
| `w_disc` | 0.3 | Discourse connective penalty. |

## Benchmark Highlights

All results use `all-MiniLM-L6-v2` (22M parameters, 384 dimensions) with `alpha=2.5`.

| Dataset | Domain | LASR Recall@5 | Next Best | Margin |
|---------|--------|---------------|-----------|--------|
| **MSMARCO** | Web passages | **0.999** | 0.985 | +0.014 |
| **HotpotQA** | Multi-hop QA | **0.974** | 0.972 | +0.002 |
| **FinanceBench** | SEC filings | **0.930** | 0.629 | +0.301 |
| **CUAD** | Legal contracts | **0.826** | 0.775 | +0.051 |

LASR ranks first by Recall@5 on all four retrieval benchmarks listed above. The lead is largest on FinanceBench, at 30.1 percentage points over the next best method (0.930 vs. 0.629), and smallest on HotpotQA, at 0.2 points.

## How It Works

LASR models each document as a chain of semantic units (sentences) and finds the partition that minimizes:

**Action = Tension + Boundary Cost**

- **Tension** measures semantic dispersion inside each chunk (cosine distance to centroid via prefix sums)
- **Boundary Cost** (alpha) penalizes each split, preventing over-fragmentation
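
To make the prefix-sum idea concrete, here is a minimal sketch of one way to compute a chunk's tension in constant time per span. It is an illustration under stated assumptions, not LASR's actual implementation: it assumes unit-normalized sentence embeddings, takes "centroid" to mean the direction of the summed vectors, and uses a hypothetical helper name `span_tension`. Under those assumptions, the summed cosine distance of `n` sentences to their centroid direction reduces to `n - ||S||`, where `S` is the sum of the span's embeddings and can be read off a prefix-sum array.

```python
import numpy as np


def span_tension(prefix: np.ndarray, i: int, j: int) -> float:
    """Summed cosine distance of sentences i..j-1 to their centroid direction.

    Assumes the embeddings behind `prefix` are unit vectors, so that
    sum_k (1 - e_k . S/||S||) = n - ||S|| with S the sum over the span.
    """
    s = prefix[j] - prefix[i]  # sum of the span's embeddings via prefix sums
    return (j - i) - float(np.linalg.norm(s))


# Toy unit embeddings: two identical sentences, then an orthogonal one.
emb = np.array([[1.0, 0.0], [1.0, 0.0], [0.0, 1.0]])

# prefix[k] holds the sum of the first k embeddings (prefix[0] is all zeros).
prefix = np.vstack([np.zeros((1, emb.shape[1])), np.cumsum(emb, axis=0)])

print(span_tension(prefix, 0, 2))  # 0.0: the two sentences point the same way
print(span_tension(prefix, 0, 3))  # about 0.764: the third sentence adds dispersion
```

Because each span's sum is a difference of two prefix rows, evaluating many candidate chunks never re-reads the individual sentence vectors.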

The optimization is solved exactly via dynamic programming in O(T * L_max) time, where T is the number of sentences and L_max is the maximum chunk length in sentences. The procedure is deterministic: no approximations, no sampling, and the same input always produces the same output.
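
The dynamic program can be sketched as follows. This is a simplified illustration, not the library's code: `optimal_partition` is a hypothetical name, the tension term assumes unit-normalized embeddings and the `n - ||S||` form of cosine distance to the centroid direction, and the structural, coreference, and discourse weights from the parameter table are left out. It charges `alpha` once per chunk, which differs from charging per split only by a constant and so selects the same partition.

```python
import numpy as np


def optimal_partition(emb, alpha=2.5, l_min=1, l_max=30):
    """Return [(start, end), ...] minimizing total tension + alpha per chunk."""
    T = len(emb)
    # prefix[k] = sum of the first k embeddings, so any span sum is O(1).
    prefix = np.vstack([np.zeros((1, emb.shape[1])), np.cumsum(emb, axis=0)])

    best = np.full(T + 1, np.inf)  # best[j]: minimal action for sentences 0..j-1
    best[0] = 0.0
    back = np.zeros(T + 1, dtype=int)  # back[j]: start of the last chunk ending at j

    for j in range(1, T + 1):
        # Only spans with l_min <= j - i <= l_max are allowed.
        for i in range(max(0, j - l_max), j - l_min + 1):
            if not np.isfinite(best[i]):
                continue
            tension = (j - i) - np.linalg.norm(prefix[j] - prefix[i])
            cost = best[i] + tension + alpha
            if cost < best[j]:
                best[j], back[j] = cost, i

    if not np.isfinite(best[T]):
        raise ValueError("no partition satisfies the length constraints")

    # Walk the back-pointers from the end to recover the chunk spans.
    spans, j = [], T
    while j > 0:
        spans.append((int(back[j]), j))
        j = int(back[j])
    return spans[::-1]


# Toy document: three sentences on one topic, then three on an orthogonal one.
emb = np.array([[1.0, 0.0]] * 3 + [[0.0, 1.0]] * 3)

print(optimal_partition(emb, alpha=0.5))  # [(0, 3), (3, 6)]
print(optimal_partition(emb, alpha=3.0))  # [(0, 6)]
```

With a cheap boundary the split lands exactly at the topic change; with an expensive one, keeping a single chunk costs less than paying for a second boundary. Each end position `j` examines at most `l_max` start positions, which is where the O(T * L_max) bound comes from.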

## Development

```bash
git clone https://github.com/lasr-chunker/lasr
cd lasr
pip install -e ".[dev]"
python -m spacy download en_core_web_sm
pytest
```

## License

MIT
