Metadata-Version: 2.4
Name: inteprelens
Version: 0.2.1
Summary: InterpreLens: A Lens for Interpreting Large Language Models Based on the Transformer Architecture.
Project-URL: Homepage, https://github.com/Aisuko/inteprelens
Project-URL: Repository, https://github.com/Aisuko/inteprelens
Author: Bowen
License: MIT
License-File: LICENSE
Requires-Python: >=3.10
Requires-Dist: einops>=0.8.2
Requires-Dist: huggingface-hub>=0.20.0
Requires-Dist: numpy>=1.24.0
Requires-Dist: pandas>=2.0.0
Requires-Dist: torch>=2.0.0
Requires-Dist: tqdm>=4.65.0
Requires-Dist: transformers>=4.30.0
Provides-Extra: datasets
Requires-Dist: datasets>=2.14.0; extra == 'datasets'
Requires-Dist: pyarrow>=12.0.0; extra == 'datasets'
Description-Content-Type: text/markdown

# inteprelens

[![Ruff](https://img.shields.io/endpoint?url=https://raw.githubusercontent.com/astral-sh/ruff/main/assets/badge/v2.json)](https://github.com/astral-sh/ruff)
[![Publish](https://github.com/Aisuko/inteprelens/actions/workflows/publish.yml/badge.svg)](https://github.com/Aisuko/inteprelens/actions/workflows/publish.yml)
[![PyPI](https://img.shields.io/pypi/v/inteprelens)](https://pypi.org/project/inteprelens/)

`inteprelens` is the extracted core package for transformer architecture analysis.
It keeps the Causal Head Gating (CHG) workflow and adds tracing utilities for
named transformer stages and internal logits.

## Overview

Use `inteprelens` when you want to:

- train CHG masks over attention heads
- inspect necessary, sufficient, and facilitating heads
- trace intermediate transformer states such as attention output projection inputs,
  block outputs, final norm states, and per-layer logits
- score final-token logits, log-probabilities, and probabilities for calibration-style analysis
- run gradient-based attribution through facilitating head circuits
- probe gradients and kwargs at any layer via context-manager hook utilities

This repo is the reusable core package only. It does not include downstream
task pipelines or visualization workflows.

## Installation

<details open>
<summary>Show</summary>

Install the runtime package:

```bash
pip install inteprelens
```

For local development:

```bash
uv sync --group dev
uv run pytest -q
```

For build and publish checks:

```bash
uv sync --group publish
uv run --group publish python -m build
uv run --group publish twine check dist/*
```

If you use a gated Hugging Face model such as `meta-llama/Llama-3.2-1B`, make sure
`HF_TOKEN` is available in your environment or `.env`.

</details>

## Quick Start

<details open>
<summary>Show</summary>

```python
from inteprelens import LensAPI

analyzer = LensAPI.from_pretrained("meta-llama/Llama-3.2-1B")

results = analyzer.fit(
    texts=["What is the capital of France?"],
    targets=["Paris"],
    num_masks=1,
    num_updates=1,
    num_reg_updates=1,
    batch_size=1,
    verbose=False,
)

print(results.summary())
print(results.necessary_heads().head())
```

</details>

## Usage Examples

<details>
<summary>Show all examples</summary>

### CHG analysis

<details>
<summary>Show</summary>

```python
from inteprelens import LensAPI

analyzer = LensAPI.from_pretrained("meta-llama/Llama-3.2-1B")

results = analyzer.fit(
    texts=[
        "The capital of France is",
        "2 + 2 equals",
    ],
    targets=[
        "Paris",
        "4",
    ],
    num_masks=1,
    num_updates=1,
    num_reg_updates=1,
    batch_size=1,
    verbose=False,
)

necessary = results.necessary_heads()
taxonomy = results.head_taxonomy()

print(necessary.head())
print(taxonomy.head())
```
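
CHG works by learning a soft gate for each attention head. The sketch below shows only the gating step, under stated assumptions: each gate is a sigmoid of one learnable scalar, and it scales that head's slice of the output-projection input (the `attn_o_proj_pre` site). `gate_heads` is a hypothetical helper, not part of the inteprelens API.

```python
import torch


def gate_heads(attn_out: torch.Tensor, gate_logits: torch.Tensor, num_heads: int) -> torch.Tensor:
    """Hypothetical helper: scale each head's slice of the o_proj input.

    attn_out:    [batch, seq, num_heads * head_dim], heads concatenated in order
    gate_logits: [num_heads], one learnable scalar per head (assumed parameterization)
    """
    batch, seq, hidden = attn_out.shape
    head_dim = hidden // num_heads
    # Split the concatenated head outputs back into per-head slices.
    per_head = attn_out.reshape(batch, seq, num_heads, head_dim)
    # Map each gate into (0, 1) and broadcast it over batch, seq, and head_dim.
    gates = torch.sigmoid(gate_logits).view(1, 1, num_heads, 1)
    return (per_head * gates).reshape(batch, seq, hidden)


torch.manual_seed(0)
attn_out = torch.randn(2, 4, 12)  # 3 heads x head_dim 4
# Large positive logits ~ gate open; large negative logits ~ gate closed.
all_open = gate_heads(attn_out, torch.tensor([20.0, 20.0, 20.0]), num_heads=3)
head1_closed = gate_heads(attn_out, torch.tensor([20.0, -20.0, 20.0]), num_heads=3)
```

Learning the gate values is what `fit` does; the sketch covers only the forward-pass effect of a gate.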

</details>

### Trace transformer stages and logits

<details>
<summary>Show</summary>

```python
from inteprelens import LensAPI

analyzer = LensAPI.from_pretrained("meta-llama/Llama-3.2-1B")

trace = analyzer.trace(
    texts="Paris is the capital of",
    layers=[0],
    sites=["attn_o_proj_pre", "final_norm", "logits"],
)

print(trace.get("attn_o_proj_pre", 0).shape)
print(trace.get("logits", 0).shape)
print(trace.final_logits.shape)
```

`trace.final_logits` contains the model's final output logits for the traced batch.
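
Per-layer logits are commonly computed with the "logit lens": each block's output is passed through the model's final norm and unembedding head. The sketch below assumes that convention and uses toy modules; inteprelens may define its `logits` site differently. In a Hugging Face Llama model, the corresponding modules are `model.model.norm` and `model.lm_head`.

```python
import torch
from torch import nn

torch.manual_seed(0)
hidden_size, vocab_size, num_layers = 8, 16, 3

# Toy stand-ins for a decoder's blocks, final norm, and unembedding head.
blocks = nn.ModuleList([nn.Linear(hidden_size, hidden_size) for _ in range(num_layers)])
final_norm = nn.LayerNorm(hidden_size)
lm_head = nn.Linear(hidden_size, vocab_size, bias=False)


def logit_lens(hidden: torch.Tensor) -> torch.Tensor:
    """Project an intermediate residual-stream state onto the vocabulary."""
    return lm_head(final_norm(hidden))


x = torch.randn(1, 5, hidden_size)  # [batch, seq, hidden]
per_layer_logits = []
with torch.no_grad():
    h = x
    for block in blocks:
        h = h + block(h)  # toy residual update
        per_layer_logits.append(logit_lens(h))
    final_logits = lm_head(final_norm(h))  # the model's actual output head
```

Under this convention the last layer's lens logits match the model's final logits, which is a quick sanity check for any logit-lens implementation.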

</details>

### Score final-token logits, log-probabilities, and probabilities

<details>
<summary>Show</summary>

```python
from inteprelens import LensAPI

analyzer = LensAPI.from_pretrained("meta-llama/Llama-3.2-1B")

scores = analyzer.score(
    texts=[
        "The capital of France is",
        "2 + 2 equals",
    ],
    temperature=1.0,
)

print(scores.logits.shape)
print(scores.log_probs.shape)
print(scores.probs.shape)

final_logits = scores.final_token_logits()
final_log_probs = scores.final_token_log_probs()
final_probs = scores.final_token_probs()

print(final_logits.shape)
print(final_log_probs.shape)
print(final_probs.shape)
```

Use `logits` as the canonical calibration output, derive `log_probs` for stable
token scoring, and use `probs` when you need confidence-style metrics.
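
A plain-PyTorch sketch of how the three outputs relate, assuming `score` divides the logits by `temperature` before a softmax (the usual convention; the package source is authoritative). The logits and target token ids are hypothetical.

```python
import torch

torch.manual_seed(0)
# Hypothetical final-token logits: 2 prompts over a 10-token vocabulary.
final_logits = torch.randn(2, 10)
temperature = 1.0

# Temperature-scaled log-softmax is the numerically stable route to log-probabilities.
log_probs = torch.log_softmax(final_logits / temperature, dim=-1)
probs = log_probs.exp()

# Score one candidate token per prompt (hypothetical token ids).
target_ids = torch.tensor([3, 7])
target_log_probs = log_probs.gather(-1, target_ids.unsqueeze(-1)).squeeze(-1)

# Confidence-style metric: probability of the top-ranked token.
confidence, predicted = probs.max(dim=-1)
```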

</details>

### Gradient-based attribution through facilitating heads

<details>
<summary>Show</summary>

```python
from inteprelens import CausalCircuitAttribution, LensAPI

analyzer = LensAPI.from_pretrained("meta-llama/Llama-3.2-1B")

results = analyzer.fit(
    texts=["The capital of France is"],
    targets=["Paris"],
    num_masks=1,
    num_updates=1,
    num_reg_updates=1,
    batch_size=1,
    verbose=False,
)

facilitating_mask = results.get_facilitating_mask()
attribution = CausalCircuitAttribution(analyzer.model, analyzer.tokenizer)

sentence_scores = attribution.compute_sentence_importance(
    document="Paris is the capital of France. It is one of Europe's largest cities.",
    sentences=[
        "Paris is the capital of France.",
        "It is one of Europe's largest cities.",
    ],
    summary="Paris is the capital of France.",
    facilitating_mask=facilitating_mask,
)

print(sentence_scores)
```
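
For intuition, here is gradient-times-input, one common gradient-based attribution rule, aggregated over sentence token spans with a toy linear scorer. This is an illustrative assumption: `compute_sentence_importance` may use a different rule, and the sketch does not model how `facilitating_mask` enters the computation.

```python
import torch

torch.manual_seed(0)
seq_len, hidden = 6, 4
# Hypothetical token embeddings for a 6-token document made of two sentences.
embeddings = torch.randn(seq_len, hidden, requires_grad=True)
sentence_spans = [(0, 3), (3, 6)]  # [start, end) token ranges per sentence
readout = torch.randn(hidden)  # toy linear scorer standing in for a target log-prob

score = (embeddings.sum(dim=0) * readout).sum()
score.backward()

# Gradient times input, summed over the hidden dimension: one value per token.
token_attr = (embeddings.grad * embeddings.detach()).sum(dim=-1)
# Aggregate token attributions over each sentence's span.
sentence_attr = [token_attr[start:end].sum().item() for start, end in sentence_spans]
```

Because the toy scorer is linear, the sentence scores sum to `score` (up to floating-point error); for a real model this completeness property does not hold in general.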

</details>

### Hook utilities

<details>
<summary>Show</summary>

All inteprelens hook utilities are context managers: their handles are removed automatically on exit.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
from inteprelens.models.adapters import get_adapter
from inteprelens import capture_gradients, capture_kwargs_hook, tensor_grad_hook

model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-3.2-1B", torch_dtype=torch.bfloat16
).cuda()
adapter = get_adapter(model)
tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-3.2-1B")
inputs = tokenizer("The Eiffel Tower is in", return_tensors="pt").to("cuda")
target_id = tokenizer(" Paris", add_special_tokens=False)["input_ids"][0]

# 1. capture_gradients — grad_output flowing through attention layer 0
with capture_gradients(adapter.get_attention_module(0), site="output") as grads:
    loss = -model(**inputs).logits[0, -1, target_id]
    loss.backward()
print(grads[0][0].shape)   # [batch, seq, hidden]

# 2. capture_kwargs_hook — inspect attention_mask kwargs at layer 0
with capture_kwargs_hook(adapter.get_layers()[0]) as kw:
    with torch.no_grad():
        model(**inputs)
print(kw[0].get("attention_mask"))

# 3. tensor_grad_hook — gradient probe on MLP activation
act_ref = []
h = adapter.get_mlp(0).register_forward_hook(lambda m, i, o: act_ref.append(o))
out = model(**inputs)
h.remove()
with tensor_grad_hook(lambda: act_ref[0]) as grads:
    loss = -out.logits[0, -1, target_id]
    loss.backward()
print(grads[0].shape)   # [batch, seq, hidden]
```
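
The context-manager pattern behind these utilities can be sketched in plain PyTorch. `capture_kwargs_sketch` and `ToyBlock` are hypothetical names, not the package's implementation: the hook is registered with `register_forward_pre_hook(..., with_kwargs=True)` and its handle is removed in a `finally` block, so it stops firing once the `with` block exits, even on an exception.

```python
from contextlib import contextmanager

import torch
from torch import nn


@contextmanager
def capture_kwargs_sketch(module: nn.Module):
    """Hypothetical: record the kwargs of every forward call while active."""
    captured = []

    def hook(mod, args, kwargs):
        captured.append(dict(kwargs))  # shallow copy of the kwargs dict
        return None  # leave args and kwargs unchanged

    handle = module.register_forward_pre_hook(hook, with_kwargs=True)
    try:
        yield captured
    finally:
        handle.remove()  # runs on normal exit and on exceptions


class ToyBlock(nn.Module):
    def __init__(self, hidden: int = 4):
        super().__init__()
        self.proj = nn.Linear(hidden, hidden)

    def forward(self, hidden_states, attention_mask=None):
        out = self.proj(hidden_states)
        if attention_mask is not None:
            out = out * attention_mask.unsqueeze(-1)  # zero out masked positions
        return out


block = ToyBlock()
x = torch.randn(1, 3, 4)
mask = torch.tensor([[1.0, 1.0, 0.0]])

with capture_kwargs_sketch(block) as kw:
    block(x, attention_mask=mask)
block(x, attention_mask=mask)  # hook already removed: not recorded
```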

</details>

</details>

## Public API

<details open>
<summary>Show</summary>

### Classes

| Class | Description |
|---|---|
| `LensAPI` | High-level interface: CHG fitting, token scoring, named-site tracing. |
| `TransformerTracer` | Lower-level tracer for direct transformer-site activation capture. |
| `CausalCircuitAttribution` | Gradient attribution through facilitating CHG heads. |
| `CHGDataset` | Helper for building CHG-ready datasets from prompt/target pairs. |

### Hook utilities

All functions below are context managers exported directly from `inteprelens`.

| Function | PyTorch API | Use case |
|---|---|---|
| `capture_gradients` | `register_full_backward_hook` | Capture `grad_input` / `grad_output` per module during backward |
| `capture_kwargs_hook` | `register_forward_pre_hook(with_kwargs=True)` | Inspect/modify kwargs (`attention_mask`, etc.) before forward (≥ PyTorch 2.0) |
| `tensor_grad_hook` | `Tensor.register_hook` | Gradient probe on any activation tensor |
| `autograd_node_hook` | `autograd.graph.Node.register_hook` | Low-level autograd graph traversal during backward (≥ PyTorch 2.0) |
| `capture_backward_pre` | `register_module_full_backward_pre_hook` | Capture `grad_output` entering a module before backward executes (≥ PyTorch 2.1) |
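
Two of the lower-level PyTorch mechanisms in the table can be exercised directly, without inteprelens, to see what the wrapped hooks receive. The table names `register_module_full_backward_pre_hook`, PyTorch's global variant that fires for every module; this sketch uses the per-module `Module.register_full_backward_pre_hook` instead, plus `Node.register_hook` on a tensor's `grad_fn`. It illustrates the PyTorch mechanics only, not the package's wrappers.

```python
import torch
from torch import nn

torch.manual_seed(0)
layer = nn.Linear(4, 3)
x = torch.randn(2, 4, requires_grad=True)

# Per-module backward pre-hook: sees grad_output before the module's backward runs.
seen_grad_output = []


def pre_hook(module, grad_output):
    seen_grad_output.append(grad_output[0].detach().clone())
    return None  # keep the incoming gradient unchanged


handle = layer.register_full_backward_pre_hook(pre_hook)
layer(x).sum().backward()
handle.remove()

# Autograd-node hook: fires when one specific grad_fn node runs during backward.
seen_node_grads = []


def node_hook(grad_inputs, grad_outputs):
    # grad_inputs: grads w.r.t. the node's forward inputs; grad_outputs: grads flowing in.
    seen_node_grads.append((grad_inputs[0].detach().clone(), grad_outputs[0].detach().clone()))


y = x.pow(2)
node_handle = y.grad_fn.register_hook(node_hook)
y.sum().backward()
node_handle.remove()
```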

</details>

## Acknowledgements

<details>
<summary>Show</summary>

`inteprelens` builds on and adapts code and ideas from the
[Causal Head Gating](https://github.com/jonhanke/nam_causal-head-gating) project.
The extracted core package keeps CHG support and extends it with transformer-stage
tracing and internal-logit inspection for architecture analysis workflows.

</details>

## License

This project is released under the MIT License. See [LICENSE](LICENSE).
