Metadata-Version: 2.4
Name: lmprobe
Version: 0.4.8
Summary: Train linear probes on language model activations for AI safety monitoring
Project-URL: Homepage, https://github.com/AlliedToasters/lmprobe
Project-URL: Documentation, https://github.com/AlliedToasters/lmprobe#readme
Project-URL: Repository, https://github.com/AlliedToasters/lmprobe
Author: Toast
License-Expression: MIT
License-File: LICENSE
Keywords: ai-safety,interpretability,language-models,machine-learning,nlp,probing
Classifier: Development Status :: 3 - Alpha
Classifier: Intended Audience :: Science/Research
Classifier: License :: OSI Approved :: MIT License
Classifier: Programming Language :: Python :: 3
Classifier: Programming Language :: Python :: 3.10
Classifier: Programming Language :: Python :: 3.11
Classifier: Programming Language :: Python :: 3.12
Classifier: Topic :: Scientific/Engineering :: Artificial Intelligence
Requires-Python: >=3.10
Requires-Dist: nnsight>=0.3
Requires-Dist: numpy>=1.20
Requires-Dist: scikit-learn>=1.0
Requires-Dist: torch>=2.0
Requires-Dist: tqdm>=4.0
Requires-Dist: transformers>=4.30
Provides-Extra: auto
Requires-Dist: skglm>=0.3; extra == 'auto'
Provides-Extra: dev
Requires-Dist: black>=23.0; extra == 'dev'
Requires-Dist: pytest-cov>=4.0; extra == 'dev'
Requires-Dist: pytest>=7.0; extra == 'dev'
Requires-Dist: ruff>=0.1; extra == 'dev'
Provides-Extra: embeddings
Requires-Dist: sentence-transformers>=2.0; extra == 'embeddings'
Provides-Extra: plot
Requires-Dist: matplotlib>=3.5; extra == 'plot'
Requires-Dist: seaborn>=0.12; extra == 'plot'
Description-Content-Type: text/markdown

# `lmprobe`: Language Model Probe Library

[![PyPI version](https://badge.fury.io/py/lmprobe.svg)](https://pypi.org/project/lmprobe/)
[![Python 3.10+](https://img.shields.io/badge/python-3.10+-blue.svg)](https://www.python.org/downloads/)
[![License: MIT](https://img.shields.io/badge/License-MIT-yellow.svg)](https://opensource.org/licenses/MIT)

`lmprobe` builds text classifiers from language model activations (also called latents). The intent is to help detect and reduce misuse of AI, such as chemical, biological, radiological, and nuclear (CBRN) weapons development, social engineering at scale, and the development of novel cybersecurity attack vectors.

## Linear and Simple Models for LLMs
Linear probes have emerged as an effective and practical way to monitor large language model activity.

### Background

First introduced by [Alain & Bengio (2016)](https://arxiv.org/abs/1610.01644) as "thermometers" for measuring what neural networks learn at each layer, linear probes have since been refined through work on [probe design and selectivity](https://nlp.stanford.edu/~johnhew/interpreting-probes.html) and validated by evidence supporting the [linear representation hypothesis](https://www.neelnanda.io/mechanistic-interpretability/othello). The [Representation Engineering](https://arxiv.org/abs/2310.01405) framework (Zou et al., 2023) demonstrated that probes can monitor safety-relevant properties like honesty and power-seeking. Recent AI safety research has shown promising results: Anthropic's work on [detecting sleeper agents](https://www.anthropic.com/research/probes-catch-sleeper-agents) achieved >99% AUROC using simple linear classifiers, and Apollo Research's [strategic deception detection](https://arxiv.org/abs/2502.03407) work demonstrates that probes trained on simple contrast pairs can generalize to realistic scenarios like insider trading concealment and sandbagging on safety evaluations.

### `lmprobe` Use Cases

The goal of `lmprobe` is to make text classifiers for language models easy to build, experiment with, and deploy during inference. While much of the research has focused on complex, emergent risky behavior, this library targets simpler use cases, such as detecting human misuse of an AI chatbot.

### Compatibility

By default, `lmprobe` uses Hugging Face `transformers` and `nnsight` to load models and extract latents during inference. The library isolates these dependencies behind modular components so that, ideally, frontier AI labs can extend it for internal use on their bespoke inference systems.

### Installation

```bash
pip install lmprobe
```

### Environment Setup

For remote execution (large models via nnsight/NDIF):

```bash
export NNSIGHT_API_KEY="your-api-key-here"
```
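
Before constructing a probe with `remote=True`, a script can check that the key is actually set. The sketch below is only an illustration: `remote_ready` is a hypothetical helper, not part of `lmprobe` or `nnsight`.

```python
import os


def remote_ready(env=None):
    """Return True if NNSIGHT_API_KEY is present and non-empty.

    `remote_ready` is a hypothetical helper for this README, not a library API.
    """
    env = os.environ if env is None else env
    return bool(env.get("NNSIGHT_API_KEY", "").strip())


# Decide once, then pass the result as `remote=` when building a probe.
use_remote = remote_ready()
print(f"remote execution available: {use_remote}")
```

Falling back to `remote=False` when the key is missing keeps local experiments working without any extra configuration.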

### Example Usage

---

```python
from lmprobe import LinearProbe

positive_prompts = [  # positive class: "dog" without saying "dog"
    "Who wants to go for a walk?",
    "My tail is wagging with delight.",
    "Fetch the ball!",
    "Good boy!",
    "Slobbering, chewing, growling, barking.",
]

negative_prompts = [  # negative class: "cat" without saying "cat"
    "Enjoys lounging in the sun beam all day.",
    "Purring, stalking, pouncing, scratching.",
    "Uses a litterbox, throws sand all over the room.",
    "Tail raised, back arched, eyes alert, whiskers forward.",
]

# Configure the probe
probe = LinearProbe(
    model="meta-llama/Llama-3.1-8B-Instruct",
    layers=16,                              # int, list[int], "all", "middle", or "fast_auto"
    pooling="last_token",                   # applies to both train and inference
    classifier="logistic_regression",       # or pass sklearn estimator
    device="auto",
    remote=False,                           # True for nnsight remote execution
    random_state=42,                        # for reproducibility
)

# Fit using contrastive prompts
probe.fit(positive_prompts, negative_prompts)

# Predict on new examples
test_prompts = [
    "Arf! Arf! Let's go outside!",
    "Knocking things off the counter for sport.",
]
predictions = probe.predict(test_prompts)          # e.g. [1, 0]
probabilities = probe.predict_proba(test_prompts)  # e.g. [[0.12, 0.88], [0.91, 0.09]]

# Evaluate
accuracy = probe.score(test_prompts, [1, 0])

# Save/load for deployment
probe.save("dog_vs_cat_probe.pkl")
loaded_probe = LinearProbe.load("dog_vs_cat_probe.pkl")
```

---

## Remote Execution for Large Models

Use `remote=True` to run inference on large models via nnsight's remote servers:

```python
probe = LinearProbe(
    model="meta-llama/Llama-3.1-70B-Instruct",
    layers="middle",
    remote=True,  # Requires NNSIGHT_API_KEY
)

probe.fit(positive_prompts, negative_prompts)

# Override remote per-call (e.g., train remote, predict local)
predictions = probe.predict(test_prompts, remote=False)
```

---

## Multi-Layer Probing

When selecting multiple layers, activations are **concatenated** along the hidden dimension:

```python
probe = LinearProbe(
    model="meta-llama/Llama-3.1-8B-Instruct",
    layers=[14, 15, 16],  # 3 layers × 4096 dims = 12,288-dim input to classifier
)
```
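
To make the dimension arithmetic concrete, here is a minimal sketch of the concatenation step using plain Python lists and a toy hidden size of 4 instead of 4096. The helper `concat_layers` and the toy activations are illustrative assumptions, not `lmprobe` internals.

```python
from itertools import chain


def concat_layers(layer_vectors):
    """Concatenate per-layer activation vectors into one feature vector."""
    return list(chain.from_iterable(layer_vectors))


# Toy stand-in for pooled activations: 3 layers, hidden size 4.
layer_activations = {
    14: [0.1, 0.2, 0.3, 0.4],
    15: [1.1, 1.2, 1.3, 1.4],
    16: [2.1, 2.2, 2.3, 2.4],
}

# Order matters: the classifier sees layer 14 first, then 15, then 16.
features = concat_layers(layer_activations[i] for i in (14, 15, 16))
print(len(features))  # 3 layers x 4 dims = 12
```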

---

## Advanced: Different Train vs Inference Pooling

For real-time monitoring, train on a stable representation but score every token:

```python
probe = LinearProbe(
    model="meta-llama/Llama-3.1-8B-Instruct",
    layers=16,
    pooling="last_token",          # base strategy
    inference_pooling="all",       # override: return per-token scores
)

probe.fit(positive_prompts, negative_prompts)

# Returns (batch, seq_len) - one score per token
token_scores = probe.predict_proba(["Wagging my tail happily!"])
```

For "flag if ANY token triggers" detection:

```python
probe = LinearProbe(
    model="meta-llama/Llama-3.1-8B-Instruct",
    layers=16,
    pooling="last_token",          # base strategy
    inference_pooling="max",       # override: max score across tokens
)
```
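
The idea behind per-token scoring and `"max"` aggregation can be sketched without a model. The example below assumes a probe is a weight vector plus a bias applied to each token's activation; the function names, weights, and activations are made up for illustration and are not the library's implementation.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def token_scores(token_activations, weights, bias):
    """Score every token position independently with one linear probe."""
    return [
        sigmoid(sum(w * x for w, x in zip(weights, token)) + bias)
        for token in token_activations
    ]


def flag_any(scores, threshold=0.5):
    """Max aggregation: flag the sequence if any token reaches the threshold."""
    return max(scores) >= threshold


# Toy sequence of 3 tokens with hidden size 2 (hypothetical values).
activations = [[0.0, 1.0], [2.0, 0.0], [0.0, 0.0]]
scores = token_scores(activations, weights=[1.0, -1.0], bias=0.0)

print([round(s, 3) for s in scores])
print(flag_any(scores, threshold=0.8))
```

Only the second token scores highly here, yet the sequence is still flagged, which is the behavior wanted for real-time monitoring.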

---

## Configuration Reference

| Parameter | Type | Default | Description |
|-----------|------|---------|-------------|
| `model` | `str` | *required* | HuggingFace model ID or local path |
| `layers` | `int \| list[int] \| "all" \| "middle" \| "fast_auto"` | `"middle"` | Which residual stream layers to probe |
| `pooling` | `str \| callable` | `"last_token"` | Token aggregation (train & inference) |
| `train_pooling` | `str \| callable` | — | Override pooling for `fit()` only |
| `inference_pooling` | `str \| callable` | — | Override pooling for `predict()` only |
| `classifier` | `str \| sklearn estimator` | `"logistic_regression"` | Classification model |
| `device` | `str` | `"auto"` | `"auto"`, `"cuda:0"`, `"cpu"` |
| `remote` | `bool` | `False` | Use nnsight remote execution (requires `NNSIGHT_API_KEY`) |
| `random_state` | `int \| None` | `None` | Random seed for reproducibility (propagates to classifier) |

### Pooling Strategies

| Strategy | Training | Inference | Description |
|----------|:--------:|:---------:|-------------|
| `"last_token"` | ✓ | ✓ | Final token activation (default, matches RepE literature) |
| `"mean"` | ✓ | ✓ | Mean across all tokens |
| `"first_token"` | ✓ | ✓ | First token (e.g., `[CLS]`) |
| `"all"` | ✓ | ✓ | Each token independently |
| `"max"` | | ✓ | Max score across tokens |
| `"min"` | | ✓ | Min score across tokens |
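
As a rough sketch of what the activation-level strategies compute, the function below reduces a `(seq_len, hidden)` list of token vectors to a single vector. It is an assumption-laden illustration in plain Python, not the library's code. The `"all"` strategy keeps every token, and `"max"`/`"min"` act on scores rather than activations, so they are not covered here.

```python
def pool(token_activations, strategy):
    """Reduce a (seq_len, hidden) list of token vectors to one vector."""
    if strategy == "last_token":
        return list(token_activations[-1])
    if strategy == "first_token":
        return list(token_activations[0])
    if strategy == "mean":
        n = len(token_activations)
        # zip(*rows) iterates over hidden dimensions (columns).
        return [sum(column) / n for column in zip(*token_activations)]
    raise ValueError(f"unsupported strategy: {strategy!r}")


# Toy sequence: 3 tokens, hidden size 2.
tokens = [[1.0, 2.0], [3.0, 4.0], [5.0, 9.0]]

print(pool(tokens, "last_token"))   # [5.0, 9.0]
print(pool(tokens, "first_token"))  # [1.0, 2.0]
print(pool(tokens, "mean"))         # [3.0, 5.0]
```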

### Pooling Collision Rules

Explicit parameters override the base `pooling` value:

```python
# pooling="mean", train_pooling="last_token" → train=last_token, inference=mean
# pooling="mean", inference_pooling="max"    → train=mean, inference=max
```
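
The rule can be written as a small resolver. `resolve_pooling` is a hypothetical name used only for this sketch; it mirrors the two cases above rather than the library's actual code.

```python
def resolve_pooling(pooling="last_token", train_pooling=None, inference_pooling=None):
    """Return (train, inference) pooling after applying explicit overrides."""
    train = train_pooling if train_pooling is not None else pooling
    inference = inference_pooling if inference_pooling is not None else pooling
    return train, inference


print(resolve_pooling(pooling="mean", train_pooling="last_token"))
# ('last_token', 'mean')
print(resolve_pooling(pooling="mean", inference_pooling="max"))
# ('mean', 'max')
```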

---

## Classifier Options

`lmprobe` supports several built-in classifiers:

| Classifier | Description |
|------------|-------------|
| `"logistic_regression"` | Standard logistic regression (default) |
| `"ridge"` | Ridge classifier; fast, but no `predict_proba` |
| `"svm"` | Support Vector Machine with probability calibration |
| `"lda"` | Linear Discriminant Analysis |
| `"mass_mean"` | Mass-mean probing; uses the direction between class centroids |
| `"sgd"` | Stochastic Gradient Descent classifier |

```python
# Use Mass-Mean Probing (simple but effective)
probe = LinearProbe(
    model="meta-llama/Llama-3.1-8B-Instruct",
    layers=-1,
    classifier="mass_mean",
)
```
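
For intuition, mass-mean probing can be sketched in a few lines: the probe direction is the difference between the two class centroids. Placing the decision threshold at the midpoint between the centroids is an assumption made for this sketch, and the function names are hypothetical; the library's `"mass_mean"` classifier may differ in detail.

```python
def mean_vector(rows):
    n = len(rows)
    return [sum(column) / n for column in zip(*rows)]


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def fit_mass_mean(positive, negative):
    """Return (direction, threshold) from class centroids."""
    mu_pos = mean_vector(positive)
    mu_neg = mean_vector(negative)
    direction = [p - n for p, n in zip(mu_pos, mu_neg)]
    # Assumption: threshold at the projection of the centroid midpoint.
    midpoint = [(p + n) / 2 for p, n in zip(mu_pos, mu_neg)]
    return direction, dot(direction, midpoint)


def predict_mass_mean(x, direction, threshold):
    return int(dot(direction, x) > threshold)


# Toy 2-dimensional "activations".
positive = [[2.0, 0.0], [4.0, 1.0]]
negative = [[-2.0, 0.0], [-4.0, -1.0]]
direction, threshold = fit_mass_mean(positive, negative)

print(direction, threshold)                                  # [6.0, 1.0] 0.0
print(predict_mass_mean([1.0, 0.0], direction, threshold))   # 1
print(predict_mass_mean([-1.0, 0.5], direction, threshold))  # 0
```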

---

## Layer Importance Analysis

Identify which layers are most informative for your task:

```python
probe = LinearProbe(
    model="meta-llama/Llama-3.1-8B-Instruct",
    layers="all",  # Extract all layers
    classifier="ridge",
)

probe.fit(positive_prompts, negative_prompts)

# Compute per-layer importance scores
importances = probe.compute_layer_importance(metric="l2")

# Visualize layer importance
probe.plot_layer_importance()
```

### Fast Auto Layer Selection

Automatically select the most important layers using fast importance analysis:

```python
probe = LinearProbe(
    model="meta-llama/Llama-3.1-8B-Instruct",
    layers="fast_auto",      # Auto-select best layers
    fast_auto_top_k=3,       # Use top 3 most important layers
    normalize_layers=True,   # Normalize before combining
)

probe.fit(positive_prompts, negative_prompts)
print(f"Selected layers: {probe.selected_layers_}")
```
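
One plausible way to score layers, shown here purely as a sketch, is to split the fitted weight vector into one block per layer and take each block's L2 norm, then keep the top-k layers. Whether `metric="l2"` and `fast_auto_top_k` work exactly this way is an assumption; the helper names and weights below are hypothetical.

```python
import math


def layer_l2_importance(weights, n_layers):
    """L2 norm of each layer's block in a concatenated weight vector."""
    if len(weights) % n_layers != 0:
        raise ValueError("weights must split evenly across layers")
    hidden = len(weights) // n_layers
    return [
        math.sqrt(sum(w * w for w in weights[i * hidden:(i + 1) * hidden]))
        for i in range(n_layers)
    ]


def top_k_layers(importances, k):
    """Indices of the k highest-scoring layers, returned in layer order."""
    ranked = sorted(range(len(importances)), key=lambda i: importances[i], reverse=True)
    return sorted(ranked[:k])


# Toy classifier weights: 3 layers, hidden size 2.
weights = [3.0, 4.0, 0.0, 1.0, 6.0, 8.0]
importances = layer_l2_importance(weights, n_layers=3)

print(importances)                   # [5.0, 1.0, 10.0]
print(top_k_layers(importances, 2))  # [0, 2]
```

Weight norms are only comparable across layers when the layers' activations are on a similar scale, which is one reason to normalize before combining.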

---

## Baseline Comparisons

Use baselines to validate that your probe is learning something beyond surface features.

### Text-Only Baselines

```python
from lmprobe import BaselineProbe

test_labels = [1, 0]  # expected labels for test_prompts from Example Usage

# Bag-of-words baseline
bow_baseline = BaselineProbe(method="bow", classifier="logistic_regression")
bow_baseline.fit(positive_prompts, negative_prompts)
bow_accuracy = bow_baseline.score(test_prompts, test_labels)

# TF-IDF baseline
tfidf_baseline = BaselineProbe(method="tfidf")
tfidf_baseline.fit(positive_prompts, negative_prompts)

# Sentence length baseline (surprisingly predictive for some tasks)
length_baseline = BaselineProbe(method="sentence_length")
length_baseline.fit(positive_prompts, negative_prompts)

# Sentence-transformers embeddings (requires: pip install lmprobe[embeddings])
st_baseline = BaselineProbe(method="sentence_transformers")
st_baseline.fit(positive_prompts, negative_prompts)

# Random baseline (sanity check - should be ~50%)
random_baseline = BaselineProbe(method="random")

# Majority class baseline
majority_baseline = BaselineProbe(method="majority")
```
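
To see why even trivial baselines are worth running, consider the majority-class baseline on the toy data from Example Usage (5 positive and 4 negative prompts). The sketch below uses hypothetical helper names and does not call `lmprobe`.

```python
from collections import Counter


def fit_majority(labels):
    """Return the most frequent training label."""
    return Counter(labels).most_common(1)[0][0]


def accuracy(predictions, labels):
    return sum(p == y for p, y in zip(predictions, labels)) / len(labels)


train_labels = [1] * 5 + [0] * 4  # 5 positive, 4 negative prompts
majority = fit_majority(train_labels)

test_labels = [1, 0]
predictions = [majority] * len(test_labels)
print(majority, accuracy(predictions, test_labels))  # 1 0.5
```

On a heavily imbalanced test set the same baseline can score well above 50%, so a probe's accuracy is only meaningful relative to it.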

### Activation-Based Baselines

Test whether the learned probe direction is special compared to simpler approaches:

```python
from lmprobe import ActivationBaseline

# Random direction baseline - project onto random unit vector
random_dir = ActivationBaseline(
    method="random_direction",
    model="meta-llama/Llama-3.1-8B-Instruct",
    layers=-1,
    remote=True,
)
random_dir.fit(positive_prompts, negative_prompts)
random_accuracy = random_dir.score(test_prompts, test_labels)

# PCA baseline - classify using top principal components
pca_baseline = ActivationBaseline(
    method="pca",
    model="meta-llama/Llama-3.1-8B-Instruct",
    layers=-1,
)

# Layer 0 baseline - use input embeddings instead of deep layers
layer0_baseline = ActivationBaseline(
    method="layer_0",
    model="meta-llama/Llama-3.1-8B-Instruct",
    layers=-1,  # Compare layer 0 to this layer
)
```
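
The random-direction idea can be illustrated without a model: draw a Gaussian vector, normalize it to unit length, and project activations onto it. This is a minimal sketch with hypothetical helper names and toy values, not the `ActivationBaseline` implementation.

```python
import math
import random


def random_unit_vector(dim, seed):
    """Sample a direction uniformly on the unit sphere (seeded for reproducibility)."""
    rng = random.Random(seed)
    v = [rng.gauss(0.0, 1.0) for _ in range(dim)]
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def project(x, direction):
    """Scalar projection of an activation vector onto a direction."""
    return sum(a * b for a, b in zip(x, direction))


direction = random_unit_vector(dim=8, seed=42)
activation = [1.0] * 8  # toy activation vector

print(round(sum(d * d for d in direction), 6))  # 1.0
print(project(activation, direction))
```

If a classifier on this one-dimensional projection performs close to the trained probe, the probe's direction is probably not capturing anything specific to the task.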

### Baseline Battery

Run all applicable baselines at once and compare to your probe:

```python
from lmprobe import BaselineBattery

# Text-only baselines (no model required)
battery = BaselineBattery(model=None, random_state=42)
results = battery.fit(positive_prompts, negative_prompts, test_prompts, test_labels)

print(results.summary())
# Baseline Results:
# ------------------------------------------------------------
#   sentence_transformers          0.7925  (fit: 1.23s, predict: 0.05s)
#   tfidf                          0.7547  (fit: 0.01s, predict: 0.00s)
#   bow                            0.6792  (fit: 0.01s, predict: 0.00s)
#   ...

# Get best baseline
best = results.best()
print(f"Best baseline: {best.name} with {best.score:.2%} accuracy")

# With activation baselines (requires model)
battery = BaselineBattery(
    model="meta-llama/Llama-3.1-8B-Instruct",
    layers=-1,
    remote=True,
    include=["bow", "tfidf", "random_direction", "pca"],  # Select specific baselines
)
results = battery.fit(positive_prompts, negative_prompts, test_prompts, test_labels)
```

### Available Baseline Methods

| Method | Type | Description |
|--------|------|-------------|
| `bow` | Text | Bag-of-words + classifier |
| `tfidf` | Text | TF-IDF + classifier |
| `random` | Text | Random predictions (sanity check) |
| `majority` | Text | Always predict majority class |
| `sentence_length` | Text | Classify by text length |
| `sentence_transformers` | Text | Pretrained embeddings + classifier |
| `random_direction` | Activation | Project onto random unit vector |
| `pca` | Activation | Top principal components |
| `layer_0` | Activation | Input embeddings only |
| `perplexity` | Activation | Model's own token probabilities |

---

## Per-Layer Normalization

When combining multiple layers, normalize each layer's activations independently to prevent high-magnitude layers from dominating:

```python
probe = LinearProbe(
    model="meta-llama/Llama-3.1-8B-Instruct",
    layers=[14, 15, 16],
    normalize_layers="per_layer",  # or "per_neuron" (default), or False
)
```
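
The motivation is easiest to see on toy numbers. The sketch below assumes that per-layer normalization divides each layer's block by a single scale computed over the training set (here the root-mean-square magnitude); the statistic `lmprobe` actually uses is not specified above, so treat `per_layer_scale` as a hypothetical illustration.

```python
import math


def per_layer_scale(samples, n_layers):
    """Divide each layer's block by its RMS magnitude over all samples."""
    hidden = len(samples[0]) // n_layers
    scaled = [list(row) for row in samples]
    for layer in range(n_layers):
        lo, hi = layer * hidden, (layer + 1) * hidden
        values = [x for row in samples for x in row[lo:hi]]
        # Guard against an all-zero block to avoid dividing by zero.
        rms = math.sqrt(sum(x * x for x in values) / len(values)) or 1.0
        for row in scaled:
            row[lo:hi] = [x / rms for x in row[lo:hi]]
    return scaled


# Two samples, two layers of hidden size 2; the second layer is 100x larger.
samples = [
    [1.0, -1.0, 100.0, -100.0],
    [-1.0, 1.0, -100.0, 100.0],
]
scaled = per_layer_scale(samples, n_layers=2)
print(scaled[0])  # [1.0, -1.0, 1.0, -1.0]
```

After scaling, both layers contribute features of comparable magnitude, so a regularized linear classifier no longer favors the larger layer by default.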

