Metadata-Version: 2.4
Name: contextwormhole
Version: 1.1.1
Summary: Teleport beyond context limits with transformers
Home-page: https://github.com/contextwormhole/contextwormhole
Author: ContextWormhole Team
Author-email: ContextWormhole Team <team@contextwormhole.dev>
License-Expression: MIT
Project-URL: Bug Reports, https://github.com/contextwormhole/contextwormhole/issues
Project-URL: Source, https://github.com/contextwormhole/contextwormhole
Project-URL: Documentation, https://contextwormhole.readthedocs.io/
Keywords: transformers,nlp,context,attention,huggingface,pytorch
Classifier: Development Status :: 4 - Beta
Classifier: Intended Audience :: Developers
Classifier: Intended Audience :: Science/Research
Classifier: Operating System :: OS Independent
Classifier: Programming Language :: Python :: 3
Classifier: Programming Language :: Python :: 3.8
Classifier: Programming Language :: Python :: 3.9
Classifier: Programming Language :: Python :: 3.10
Classifier: Programming Language :: Python :: 3.11
Classifier: Topic :: Scientific/Engineering :: Artificial Intelligence
Classifier: Topic :: Software Development :: Libraries :: Python Modules
Requires-Python: >=3.8
Description-Content-Type: text/markdown
License-File: LICENSE
Requires-Dist: torch>=1.9.0
Requires-Dist: transformers>=4.20.0
Requires-Dist: numpy<2.0.0,>=1.20.0
Provides-Extra: dev
Requires-Dist: pytest>=7.0.0; extra == "dev"
Requires-Dist: pytest-cov>=3.0.0; extra == "dev"
Requires-Dist: pytest-mock>=3.6.0; extra == "dev"
Requires-Dist: pytest-benchmark>=3.4.0; extra == "dev"
Provides-Extra: docs
Requires-Dist: sphinx>=4.0.0; extra == "docs"
Requires-Dist: sphinx-rtd-theme>=1.0.0; extra == "docs"
Requires-Dist: myst-parser>=0.18.0; extra == "docs"
Provides-Extra: lint
Requires-Dist: black>=22.0.0; extra == "lint"
Requires-Dist: flake8>=4.0.0; extra == "lint"
Requires-Dist: isort>=5.10.0; extra == "lint"
Requires-Dist: mypy>=0.910; extra == "lint"
Provides-Extra: all
Requires-Dist: pytest>=7.0.0; extra == "all"
Requires-Dist: pytest-cov>=3.0.0; extra == "all"
Requires-Dist: pytest-mock>=3.6.0; extra == "all"
Requires-Dist: pytest-benchmark>=3.4.0; extra == "all"
Requires-Dist: sphinx>=4.0.0; extra == "all"
Requires-Dist: sphinx-rtd-theme>=1.0.0; extra == "all"
Requires-Dist: myst-parser>=0.18.0; extra == "all"
Requires-Dist: black>=22.0.0; extra == "all"
Requires-Dist: flake8>=4.0.0; extra == "all"
Requires-Dist: isort>=5.10.0; extra == "all"
Requires-Dist: mypy>=0.910; extra == "all"
Dynamic: author
Dynamic: home-page
Dynamic: license-file
Dynamic: project-url
Dynamic: provides-extra
Dynamic: requires-python

# ContextWormhole

**Context length extension library for transformers**

ContextWormhole provides practical implementations of three established context extension techniques. When your transformer model reaches its context limit, this library offers clean, tested strategies to handle longer inputs.

```bash
pip install contextwormhole
```

## Purpose

Most transformer models have fixed context windows (e.g., 1024 tokens for GPT-2). This library implements three strategies to work with longer texts while maintaining the model's original architecture.

## Strategies

### 1. Sliding Window

Processes text in overlapping chunks, maintaining continuity between segments.

```python
@sliding_window(window_size=512, overlap=64)
def process_long_document(model, text, **kwargs):
    return model.generate(text, **kwargs)
```

- **Implementation**: Overlapping windows with position ID recycling
- **Time complexity**: O(n)
- **Memory complexity**: O(window_size)
- **Use cases**: Documents, code files, articles
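
The overlapping-chunk idea can be sketched in plain Python. This is an illustrative sketch, not the library's internal code: the helper name `sliding_windows` is hypothetical, and it operates on a list of token IDs rather than on a model.

```python
def sliding_windows(tokens, window_size=512, overlap=64):
    """Yield overlapping windows over a token sequence (hypothetical helper)."""
    if not 0 <= overlap < window_size:
        raise ValueError("overlap must satisfy 0 <= overlap < window_size")
    step = window_size - overlap  # how far each new window advances
    start = 0
    while True:
        yield tokens[start:start + window_size]
        if start + window_size >= len(tokens):
            break  # this window already reached the end of the input
        start += step


tokens = list(range(1200))  # stand-in for real token IDs
windows = list(sliding_windows(tokens, window_size=512, overlap=64))
```

Each window after the first begins with the last `overlap` tokens of the previous one, which is what carries continuity across segments.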

### 2. Hierarchical Context

Creates summaries of text chunks, then combines summaries with final content.

```python
@hierarchical_context(chunk_size=256, summary_length=64)
def analyze_paper(model, paper, **kwargs):
    return model.generate(paper, **kwargs)
```

- **Implementation**: Chunk → summarize → combine → process
- **Time complexity**: O(n log n)
- **Memory complexity**: O(n/chunk_size * summary_length)
- **Use cases**: Research papers, structured documents
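
A minimal sketch of the chunk → summarize → combine flow, assuming a pluggable summarizer. The function name `build_hierarchical_context` and the default "summary" (keeping the first `summary_length` tokens of a chunk) are placeholders for illustration; the library's actual summarization step is not shown in this README.

```python
def build_hierarchical_context(tokens, chunk_size=256, summary_length=64, summarize=None):
    """Summarize every chunk except the last, then append the last chunk verbatim."""
    if summarize is None:
        # Placeholder summarizer (assumption): keep the first tokens of the chunk.
        def summarize(chunk, n):
            return chunk[:n]

    chunks = [tokens[i:i + chunk_size] for i in range(0, len(tokens), chunk_size)]
    if not chunks:
        return []

    *earlier, final = chunks
    combined = []
    for chunk in earlier:
        combined.extend(summarize(chunk, summary_length))
    combined.extend(final)  # the final content is kept in full
    return combined


tokens = list(range(1000))  # stand-in for real token IDs
context = build_hierarchical_context(tokens)
```

With these defaults, 1000 input tokens shrink to three 64-token summaries plus the 232-token final chunk.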

### 3. Attention Sink

Preserves initial tokens plus recent context, discarding middle content.

```python
@attention_sink(sink_tokens=16)
def continue_conversation(model, chat_history, **kwargs):
    return model.generate(chat_history, **kwargs)
```

- **Implementation**: Initial tokens + recent context
- **Time complexity**: O(1)
- **Memory complexity**: O(max_length)
- **Use cases**: Conversations, chat histories
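
The token selection described above can be sketched as follows. The helper name `select_sink_and_recent` is hypothetical, and the `max_length=1024` default is an assumption chosen to match GPT-2's context window.

```python
def select_sink_and_recent(tokens, sink_tokens=16, max_length=1024):
    """Keep the first `sink_tokens` tokens plus the most recent ones, dropping the middle."""
    if not 0 <= sink_tokens < max_length:
        raise ValueError("sink_tokens must satisfy 0 <= sink_tokens < max_length")
    if len(tokens) <= max_length:
        return list(tokens)  # nothing to discard
    recent = max_length - sink_tokens  # budget left for recent context
    return list(tokens[:sink_tokens]) + list(tokens[-recent:])


history = list(range(5000))  # stand-in for real token IDs
kept = select_sink_and_recent(history)
```

The result never exceeds `max_length` tokens, regardless of how long the history grows.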

## Empirical Results

Tests on repetition patterns (10 runs each, distilgpt2):

| Strategy | Uniqueness Ratio | Repeated Phrases | Notes |
|----------|-----------------|------------------|-------|
| Standard (low temp) | 0.59 | 3.5 | Baseline |
| Standard (high temp) | 0.28 | 2.0 | High repetition |
| Attention Sink | 0.67 | 1.8 | Best coherence |

In these runs, the attention sink strategy produced the highest uniqueness ratio (0.67) and the fewest repeated phrases (1.8).
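
The table does not spell out how the two metrics are computed. One plausible definition, given here purely as an assumption, treats the uniqueness ratio as distinct words over total words and counts a repeated phrase as any word trigram that occurs more than once:

```python
from collections import Counter


def uniqueness_ratio(text):
    """Distinct words divided by total words (assumed definition)."""
    words = text.lower().split()
    return len(set(words)) / len(words) if words else 0.0


def repeated_phrases(text, n=3):
    """Number of distinct n-grams that occur more than once (assumed definition)."""
    words = text.lower().split()
    ngrams = Counter(tuple(words[i:i + n]) for i in range(len(words) - n + 1))
    return sum(1 for count in ngrams.values() if count > 1)


sample = "the cat sat on the mat and the cat sat on the rug"
ratio = uniqueness_ratio(sample)
repeats = repeated_phrases(sample)
```

Under these definitions, a higher ratio and a lower repeat count both indicate less repetitive output.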

## Usage

### Basic Example

```python
from contextwormhole import ContextWormholeModel

model = ContextWormholeModel("gpt2")

# Different strategies for different needs
result1 = model.sliding_window_generate(long_document, max_new_tokens=100)
result2 = model.hierarchical_generate(research_paper, max_new_tokens=100)
result3 = model.attention_sink_generate(conversation_history, max_new_tokens=100)
```

### Configuration

```python
from contextwormhole import ContextWormholeModel, ExtendedContextConfig

config = ExtendedContextConfig(
    window_size=256,
    overlap=64,
    chunk_size=256,
    summary_length=64,
    sink_tokens=16,
    use_cache=True,
)

model = ContextWormholeModel("gpt2", **config.__dict__)
```

### CLI Interface

```bash
# Sliding window
contextwormhole --model gpt2 --input document.txt --strategy sliding_window

# Hierarchical
contextwormhole --model gpt2 --input paper.txt --strategy hierarchical

# Attention sink
contextwormhole --model gpt2 --input chat.txt --strategy attention_sink
```

## Performance Characteristics

| Strategy | Max Context | Memory (MB)* | Time (s)* | Best For |
|----------|-------------|--------------|-----------|----------|
| Sliding Window | ~10K tokens | 600 | 1.5-2.0 | Documents, code |
| Hierarchical | ~20K tokens | 400 | 1.0-1.5 | Papers, reports |
| Attention Sink | ~8K tokens | 300 | 0.8-1.2 | Conversations |

*Approximate values for GPT-2 on CPU

## Benchmark Results

Benchmark results with GPT-2 on CPU (input lengths in characters):

```
📊 Benchmark Results
================================================================================
Strategy             Input Length    Processing Time      Memory Used     Output Length
--------------------------------------------------------------------------------
sliding_window       1050            1.96s              659.35 MB       1252
hierarchical         1050            1.28s              27.59 MB        1275
attention_sink       1050            1.21s              11.55 MB        1241
sliding_window       5250            2.48s              655.90 MB       4941
hierarchical         5250            1.44s              68.75 MB        1909
attention_sink       5250            2.47s              272.18 MB       4979
sliding_window       10500           2.27s              50.39 MB        4973
hierarchical         10500           1.88s              137.20 MB       3572
attention_sink       10500           2.27s              9.55 MB         5864
sliding_window       21000           2.40s              50.85 MB        5018
hierarchical         21000           2.20s              3.58 MB         4485
attention_sink       21000           2.42s              24.69 MB        5012

📈 Summary
================================================================================
sliding_window: Avg Time = 2.28s, Avg Memory = 354.12 MB
hierarchical: Avg Time = 1.70s, Avg Memory = 59.28 MB
attention_sink: Avg Time = 2.09s, Avg Memory = 79.50 MB
```

Key observations:
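- **Hierarchical** has the best average processing time (1.70s) and is the fastest strategy at every input length except 1,050 characters, where attention sink is slightly faster (1.21s vs. 1.28s)
- **Attention Sink** stays below 25 MB in three of the four runs, with one outlier of 272.18 MB at 5,250 characters
- **Sliding Window** uses roughly 655-660 MB for the two smaller inputs and about 50 MB for the two larger ones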
- All strategies successfully handle inputs up to 21,000 characters (far beyond the model's native context limit)
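
The averages in the Summary block can be cross-checked against the per-run rows. The short script below copies the time and memory figures from the table and recomputes the means; they agree with the reported summary to within 0.01, with the small differences presumably due to rounding in the printed rows.

```python
from statistics import mean

# (processing time in seconds, memory in MB) per run, copied from the table above
runs = {
    "sliding_window": [(1.96, 659.35), (2.48, 655.90), (2.27, 50.39), (2.40, 50.85)],
    "hierarchical": [(1.28, 27.59), (1.44, 68.75), (1.88, 137.20), (2.20, 3.58)],
    "attention_sink": [(1.21, 11.55), (2.47, 272.18), (2.27, 9.55), (2.42, 24.69)],
}

averages = {
    name: (mean([t for t, _ in rows]), mean([m for _, m in rows]))
    for name, rows in runs.items()
}

for name, (avg_time, avg_mem) in averages.items():
    print(f"{name}: Avg Time = {avg_time:.2f}s, Avg Memory = {avg_mem:.2f} MB")
```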

## Implementation Notes

- Each strategy respects the model's native context limit for individual forward passes
- Position ID recycling enables handling of arbitrarily long inputs
- KV caching improves generation speed and maintains coherence
- All strategies include proper error handling and configuration validation

## Why Position ID Recycling?

Position IDs are critical in transformer models as they provide information about token order. However, they present a significant challenge when working with inputs that exceed the model's maximum context length:

1. **Index Out of Range Errors**: Without proper handling, position IDs for long inputs can exceed the maximum index in the position embedding table, causing runtime errors.

2. **Context Preservation**: Simply truncating inputs loses valuable context. Position ID recycling allows us to maintain more context by intelligently selecting which parts of the input to keep.

3. **Quality Improvements**: Our tests show that proper position ID handling reduces repetition in generated text and improves overall coherence.

4. **Arbitrary Length Handling**: With position ID recycling, the library can process inputs of any length while ensuring position IDs always stay within the valid range (0 to max_position_embeddings-1).

The implementation uses modulo arithmetic to "recycle" position IDs, combined with strategic token selection to preserve the most relevant context from beginning, middle, and end of long documents.
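
The modulo step on its own can be sketched in a few lines. This is an illustration of the arithmetic only, not the library's code: the function name `recycled_position_ids` is hypothetical, the token selection step is omitted, and the default of 1024 matches GPT-2's position embedding table.

```python
def recycled_position_ids(seq_len, max_position_embeddings=1024):
    """Wrap position IDs so they always index a valid row of the embedding table."""
    return [i % max_position_embeddings for i in range(seq_len)]


ids = recycled_position_ids(2500)
# Every ID stays in the valid range 0 .. max_position_embeddings - 1
in_range = all(0 <= i < 1024 for i in ids)
```

Position 1024 wraps back to 0, so no lookup can fall outside the table even when the input is longer than the model's native limit.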

## Requirements

- Python ≥ 3.8
- PyTorch ≥ 1.9.0
- Transformers ≥ 4.20.0
- NumPy ≥ 1.20.0, < 2.0.0

## Technical Background

This library implements well-established context extension techniques:

- **Sliding Window**: Classical attention windowing
- **Hierarchical Context**: Recursive summarization approach
- **Attention Sink**: Based on StreamingLLM research

The focus is on providing clean, tested implementations with practical optimizations rather than novel algorithms.

## License

MIT License
