Metadata-Version: 2.1
Name: sentences
Version: 0.1.0
Summary: Text segmentation and tokenization utilities for LLMs
Home-page: https://github.com/yourusername/sentences
Author: Your Name
Author-email: Your Name <your.email@example.com>
Project-URL: Homepage, https://github.com/yourusername/sentences
Project-URL: Bug Tracker, https://github.com/yourusername/sentences/issues
Classifier: Development Status :: 4 - Beta
Classifier: Intended Audience :: Developers
Classifier: License :: OSI Approved :: MIT License
Classifier: Operating System :: OS Independent
Classifier: Programming Language :: Python :: 3
Classifier: Programming Language :: Python :: 3.7
Classifier: Programming Language :: Python :: 3.8
Classifier: Programming Language :: Python :: 3.9
Classifier: Programming Language :: Python :: 3.10
Classifier: Programming Language :: Python :: 3.11
Classifier: Programming Language :: Python :: 3.12
Classifier: Topic :: Text Processing :: Linguistic
Classifier: Topic :: Software Development :: Libraries :: Python Modules
Requires-Python: >=3.7
Description-Content-Type: text/markdown
License-File: LICENSE
Provides-Extra: transformers
Requires-Dist: transformers >=4.0.0 ; extra == 'transformers'

# sentences

Text segmentation and tokenization utilities for LLM tokenizers.

## Features

- **Sentence Splitting**: Split text into sentences with exact position tracking
- **Paragraph Splitting**: Split text into paragraphs while preserving structure
- **Token Range Extraction**: Get exact token ranges for each sentence using iterative tokenization
- **Perfect Reconstruction**: Guaranteed text reconstruction from segments
- **LLM-Ready**: Designed for use with transformer tokenizers and chat templates

## Installation

```bash
pip install sentences
```

For transformer tokenizer support:
```bash
pip install "sentences[transformers]"
```

## Quick Start

### Sentence Splitting

```python
from sentences import split_text_to_sentences

text = "Dr. Smith went to the store. They bought some milk. It cost $3.50."
sentences, positions = split_text_to_sentences(text)

for i, (sent, pos) in enumerate(zip(sentences, positions)):
    print(f"{i}: {sent!r}")
    # Verify reconstruction: each sentence ends where the next one starts
    end = positions[i + 1] if i + 1 < len(positions) else len(text)
    assert text[pos:end] == sent
```

### Token Range Extraction

Get exact token ranges for sentences with any tokenizer:

```python
from sentences import get_token_ranges
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen2.5-14B-Instruct")

# Example prompt prefix in the Qwen ChatML chat format; <think> opens a Qwen3-style reasoning block
pre_string = """<|im_start|>system
This system message is just for demonstration purposes.<|im_end|>
<|im_start|>user
Solve this math problem step by step.<|im_end|>
<|im_start|>assistant
<think>
"""

sentences = ["Let me think about this problem. ", "First, I'll break it down. "]
ranges = get_token_ranges(sentences, tokenizer, pre_string)

for sent, (start, end) in zip(sentences, ranges):
    print(f"Tokens [{start}:{end}] = '{sent.strip()}'")
```

### Example with GPT-OSS-20B

```python
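from sentences import get_token_ranges
from transformers import AutoTokenizer

# Load a GPT-OSS tokenizer instead of reusing the Qwen one (model ID assumed to be the Hugging Face name)
tokenizer = AutoTokenizer.from_pretrained("openai/gpt-oss-20b")

# GPT-OSS uses a different format without <think> tags
pre_string = """<|start|>system<|message|>You are ChatGPT, a large language model trained by OpenAI.
Knowledge cutoff: 2024-06
Current date: 2025-11-22<|end|><|start|>user<|message|>Solve this math problem step by step.<|end|><|start|>assistant<|channel|>analysis<|message|>"""

sentences = ["Let me analyze this step by step. ", "The key insight is that... "]
ranges = get_token_ranges(sentences, tokenizer, pre_string)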
```

## Key Concepts

### Exact Position Tracking

The sentence splitter guarantees that:
```python
text == ''.join(sentences)  # Perfect reconstruction
text[positions[i]:positions[i+1]] == sentences[i]  # Exact position match (the last sentence ends at len(text))
```
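
The two invariants can be checked together with a small helper. This is a minimal sketch, not part of the package: `check_segmentation` is a hypothetical name, and it assumes `positions` holds one start offset per sentence, as the Quick Start loop suggests.

```python
from typing import List


def check_segmentation(text: str, sentences: List[str], positions: List[int]) -> bool:
    """Return True if the segments tile `text` exactly (hypothetical helper)."""
    if len(sentences) != len(positions):
        return False
    # Invariant 1: concatenating the segments reproduces the text.
    if "".join(sentences) != text:
        return False
    # Invariant 2: each segment ends where the next one starts;
    # the last segment ends at len(text).
    ends = positions[1:] + [len(text)]
    return all(text[s:e] == sent for s, e, sent in zip(positions, ends, sentences))


# Hand-written segmentation, used here instead of calling the library.
text = "Hello there. How are you?"
sentences = ["Hello there. ", "How are you?"]
positions = [0, 13]
print(check_segmentation(text, sentences, positions))  # True
```

A check like this can run in a test suite against the output of `split_text_to_sentences` to catch dropped whitespace or off-by-one offsets.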

### Iterative Tokenization

Token ranges are computed from cumulative token counts over a growing prefix, which avoids boundary issues:
1. Tokenize `pre_string` → get initial count
2. Tokenize `pre_string + sentence1` → get new count
3. Tokenize `pre_string + sentence1 + sentence2` → get new count
4. Continue for all sentences

Each sentence's range then runs from the previous count to the new count.

This ensures token boundaries align correctly with how the model will process the text.
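
The steps above can be sketched as follows. This illustrates the prefix-counting idea only and is not the package's actual implementation: `prefix_token_ranges` and `toy_tokenize` are hypothetical names, and a whitespace tokenizer stands in for a real one so the example runs without `transformers`.

```python
from typing import Callable, List, Sequence, Tuple


def prefix_token_ranges(
    sentences: List[str],
    tokenize: Callable[[str], Sequence],
    pre_string: str = "",
) -> List[Tuple[int, int]]:
    """Assign each sentence a half-open token range [start, end) by
    re-tokenizing the growing prefix (hypothetical sketch)."""
    ranges = []
    prefix = pre_string
    start = len(tokenize(prefix))  # step 1: token count of pre_string alone
    for sentence in sentences:
        prefix += sentence
        end = len(tokenize(prefix))  # steps 2..n: count after appending the sentence
        ranges.append((start, end))
        start = end
    return ranges


# Toy stand-in tokenizer: splits on whitespace.
toy_tokenize = str.split

ranges = prefix_token_ranges(
    ["Let me think. ", "First, break it down. "],
    toy_tokenize,
    pre_string="System prompt here. ",
)
print(ranges)  # [(3, 6), (6, 10)]
```

With a Hugging Face tokenizer, the callable could be something like `lambda s: tokenizer(s, add_special_tokens=False)["input_ids"]`. Whether the library disables special tokens this way is an assumption. Re-tokenizing the full prefix costs more than tokenizing each sentence once, but the counts then reflect the same context the model sees.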

## License

MIT License - see LICENSE file for details.
