Metadata-Version: 2.4
Name: llm2vec-gen
Version: 0.1.0
Summary: LLM2Vec-Gen: Generative Embeddings from Large Language Models
Project-URL: Homepage, https://mcgill-nlp.github.io/llm2vec-gen/
Project-URL: Source, https://github.com/McGill-NLP/llm2vec-gen
Author-email: Parishad Behnamghader <parishad.behnamghader@mila.quebec>, Vaibhav Adlakha <vaibhav.adlakha@mila.quebec>
License: MIT License
        
        Copyright (c) 2026 McGill NLP
        
        Permission is hereby granted, free of charge, to any person obtaining a copy
        of this software and associated documentation files (the "Software"), to deal
        in the Software without restriction, including without limitation the rights
        to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
        copies of the Software, and to permit persons to whom the Software is
        furnished to do so, subject to the following conditions:
        
        The above copyright notice and this permission notice shall be included in all
        copies or substantial portions of the Software.
        
        THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
        IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
        FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
        AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
        LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
        OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
        SOFTWARE.
License-File: LICENSE
Requires-Python: >=3.8
Requires-Dist: datasets
Requires-Dist: evaluate
Requires-Dist: flash-attn==2.7.4.post1
Requires-Dist: huggingface-hub[cli]
Requires-Dist: hydra-colorlog>=1.2.0
Requires-Dist: hydra-core>=1.3.0
Requires-Dist: hydra-optuna-sweeper>=1.2.0
Requires-Dist: mteb==1.34.10
Requires-Dist: numpy
Requires-Dist: peft==0.18.0
Requires-Dist: ruff
Requires-Dist: torch==2.6.0
Requires-Dist: transformers==4.56.2
Requires-Dist: wandb
Provides-Extra: dev
Requires-Dist: black>=22.0; extra == 'dev'
Requires-Dist: flake8>=4.0; extra == 'dev'
Requires-Dist: ipywidgets; extra == 'dev'
Requires-Dist: jupyter; extra == 'dev'
Requires-Dist: pytest>=7.0; extra == 'dev'
Description-Content-Type: text/markdown

# *LLM2VEC-GEN: Generative Embeddings from Large Language Models* 

[![arxiv](https://img.shields.io/badge/arXiv-XXX-b31b1b.svg)](https://arxiv.org/abs/XXX)
[![PyPi](https://img.shields.io/pypi/v/llm2vec-gen)](https://pypi.org/project/llm2vec-gen/)
[![HF Link](https://img.shields.io/badge/HF%20Models-LLM2Vec--Gen-FFD21E.svg)](https://huggingface.co/collections/McGill-NLP/llm2vec-gen)
[![License: MIT](https://img.shields.io/badge/License-MIT-blue.svg)](https://github.com/McGill-NLP/llm2vec-gen/blob/main/LICENSE)
[![WandB](https://img.shields.io/badge/WandB-Logs-yellow.svg)](https://wandb.ai/siva-reddy-mila-org/llm2vec-gen)

LLM2Vec-Gen is a recipe for training interpretable, generative embeddings that encode an LLM's potential answer to a query rather than the query itself. This repository contains code for training, for evaluation on MTEB, AdvBench-IR, and BRIGHT, and for analysis (logit lens, latent lens, and generations).

## Table of Contents

- [Installation](#-installation)
- [Quick Start](#-quick-start)
- [Training](#-training)
- [Response Generation](#-response-generation)
- [Evaluation](#-evaluation)
- [Analysis Scripts](#-analysis-scripts)
- [Project Structure](#-project-structure)


## 📦 Installation
Install from PyPI, or clone the repository and install it in editable mode:

```bash
pip install llm2vec-gen  # or pip install -e .
```


## 🚀 Quick Start

Load a pretrained model and use it as an encoder or for generation:

```python
from llm2vec_gen import LLM2VecGenModel

model = LLM2VecGenModel.from_pretrained("McGill-NLP/LLM2Vec-Gen-Qwen3-8B")

# Encode text
emb = model.encode("Is Montreal located in Canada?")

# Generate text from embeddings (encoder + decoder)
answer, enc_before_answer = model.generate("Is Montreal located in Canada?", max_new_tokens=100)
```
This snippet returns the answer that the LLM2Vec-Gen model generates from the generative embeddings of the input. The embeddings are available from either `.encode()` or `.generate()`.

> Yes, Montreal is a city in Canada. It is the second-largest city in the country, located in the province of Quebec. Montreal is known for its rich cultural heritage, historic architecture, and vibrant arts scene.<|end_of_text|>
>
> True tensor([[-0.2393,  0.0280, -0.5078,  ...,  0.1270,  0.6484,  0.3574]], device='cuda:0', dtype=torch.bfloat16)

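Embeddings like the tensor above are typically compared with cosine similarity when used for retrieval. The following is a minimal, dependency-free sketch of that comparison; the helper names `cosine_similarity` and `rank_documents` and the toy vectors are illustrative assumptions, not part of the `llm2vec_gen` API. In practice the vectors would come from `model.encode(...)`, converted to plain lists of floats.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def rank_documents(query_vec, doc_vecs):
    """Return document indices sorted from most to least similar to the query."""
    scores = [cosine_similarity(query_vec, d) for d in doc_vecs]
    return sorted(range(len(doc_vecs)), key=lambda i: scores[i], reverse=True)


# Toy stand-ins for embeddings (hypothetical values, 3 dimensions for brevity).
query = [1.0, 0.0, 1.0]
docs = [[0.0, 1.0, 0.0], [2.0, 0.0, 2.0], [1.0, 1.0, 0.0]]

ranking = rank_documents(query, docs)
print(ranking)
```

The second document points in the same direction as the query, so it ranks first regardless of its larger magnitude.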


## 🏋️ Training

Training is driven by **Hydra** configs under `conf/`.

```bash
# LLM2Vec-Gen training for Qwen3-4B
python scripts/train.py \
  training=llm2vec-gen \
  training.per_device_train_batch_size=32 \
  training.learning_rate=3e-4 \
  model=qwen3-4 \
  data=qwen3-4 \
  special_tokens=total_20 \
  run.wandb_run_id=qwen3-4
```

**Useful overrides:**

| Override | Description |
|----------|-------------|
| `model=qwen3-4`, `qwen3-8`, `qwen3-0.6`, … | Different model sizes or variants (see `conf/model/`) |
| `data=qwen3-4`, `tulu`, … | Training data generated by Qwen3-4B or original Tulu QA pairs (see `conf/data/`) |
| `special_tokens=total_20`, `total_10`, … | Special token set (see `conf/special_tokens/`) |
| `run.wandb_run_id=...` | Run name used for logging and saving |

Outputs are written under `outputs/`. The scripts used to train the models in the paper are in the `train_scripts/` directory. Set `CUDA_VISIBLE_DEVICES` appropriately for each run. WandB logs are available in [this WandB project](https://wandb.ai/siva-reddy-mila-org/llm2vec-gen).
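
When launching several runs from Python, the Hydra overrides and the GPU selection can be assembled programmatically. This is a minimal sketch under stated assumptions: `build_train_command` is a hypothetical helper, the GPU IDs `0,1` are example values, and the overrides are the ones from the command above.

```python
import os


def build_train_command(overrides, script="scripts/train.py"):
    """Turn a dict of Hydra overrides into an argv list of key=value pairs."""
    return ["python", script] + [f"{key}={value}" for key, value in overrides.items()]


overrides = {
    "training": "llm2vec-gen",
    "model": "qwen3-4",
    "data": "qwen3-4",
    "special_tokens": "total_20",
    "run.wandb_run_id": "qwen3-4",
}
command = build_train_command(overrides)

# Copy the current environment and restrict the run to GPUs 0 and 1 (example IDs).
env = dict(os.environ, CUDA_VISIBLE_DEVICES="0,1")

print(" ".join(command))
# To launch: import subprocess; subprocess.run(command, env=env, check=True)
```

Passing `env` to the subprocess keeps the device selection local to that run instead of changing it for the whole shell session.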


## 💬 Response Generation

Generate model responses for a dataset (e.g. for creating training data). Supports sharding for parallel runs.

```bash
# Tulu dataset, Qwen3-4B, 4 shards
python scripts/response_generation.py \
  --dataset_name tulu \
  --dataset_path mcgill-nlp/llm2vec-gen-tulu \
  --model_name Qwen/Qwen3-4B \
  --max_new_tokens 500 \
  --max_length 1024 \
  --batch_size 8 \
  --num_shards 4 \
  --shard_id 0 \
  --output_dir data
```
Run the command once for each `--shard_id` value (0, 1, 2, and 3) to process the full dataset in parallel. This repository supports training with data hosted on Hugging Face (HF). Use `scripts/upload_responses_to_hf.py` to upload new generations to an HF dataset in the same format as [our data](https://huggingface.co/datasets/McGill-NLP/llm2vec-gen-tulu).
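
The per-shard runs can be scripted rather than typed by hand. The sketch below builds one command per shard from the flags shown above and includes an optional helper that starts them concurrently. `shard_commands` and `run_in_parallel` are hypothetical helpers; assigning each shard its own GPU (for example through `CUDA_VISIBLE_DEVICES`) is left out and would likely be needed in practice.

```python
import subprocess


def shard_commands(num_shards):
    """Build one response_generation.py command per shard."""
    base = [
        "python", "scripts/response_generation.py",
        "--dataset_name", "tulu",
        "--dataset_path", "mcgill-nlp/llm2vec-gen-tulu",
        "--model_name", "Qwen/Qwen3-4B",
        "--max_new_tokens", "500",
        "--max_length", "1024",
        "--batch_size", "8",
        "--num_shards", str(num_shards),
        "--output_dir", "data",
    ]
    return [base + ["--shard_id", str(i)] for i in range(num_shards)]


def run_in_parallel(commands):
    """Start every command, then wait for all of them; return the exit codes."""
    processes = [subprocess.Popen(cmd) for cmd in commands]
    return [p.wait() for p in processes]


commands = shard_commands(4)
for cmd in commands:
    print(" ".join(cmd))
# exit_codes = run_in_parallel(commands)  # uncomment to actually launch
```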


## 📊 Evaluation

### MTEBv2

Run the **lite** task set (faster) or full MTEB:

```bash
python scripts/mteb_eval.py \
  --model_path McGill-NLP/LLM2Vec-Gen-Qwen3-4B \
  --output_dir outputs/LLM2Vec-Gen-Qwen3-4B \
  --task_set lite  # or `all`
```

### AdvBench-IR

```bash
python scripts/advbenchir_eval.py \
  --model_path McGill-NLP/LLM2Vec-Gen-Qwen3-06B \
  --output_dir outputs/LLM2Vec-Gen-Qwen3-06B
```

### BRIGHT

```bash
python scripts/bright_eval/run.py \
  --task biology \
  --model McGill-NLP/LLM2Vec-Gen-Qwen3-06B \
  --output_dir outputs/LLM2Vec-Gen-Qwen3-06B/bright_eval
```
Other tasks: `earth_science`, `economics`, `pony`, `psychology`, `robotics`, `stackoverflow`, `sustainable_living`, `aops`, `leetcode`, `theoremqa_theorems`, `theoremqa_questions`.
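
To evaluate on every BRIGHT task, the command above can be generated once per task. This is a small sketch: `bright_commands` is a hypothetical helper, and reusing a single `--output_dir` for all tasks is an assumption about how the script organizes its results.

```python
# Task names as listed in this README.
BRIGHT_TASKS = [
    "biology", "earth_science", "economics", "pony", "psychology", "robotics",
    "stackoverflow", "sustainable_living", "aops", "leetcode",
    "theoremqa_theorems", "theoremqa_questions",
]


def bright_commands(model, output_dir):
    """Build one scripts/bright_eval/run.py command per BRIGHT task."""
    return [
        ["python", "scripts/bright_eval/run.py",
         "--task", task, "--model", model, "--output_dir", output_dir]
        for task in BRIGHT_TASKS
    ]


commands = bright_commands(
    "McGill-NLP/LLM2Vec-Gen-Qwen3-06B",
    "outputs/LLM2Vec-Gen-Qwen3-06B/bright_eval",
)
for cmd in commands:
    print(" ".join(cmd))
```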

---

## 🔬 Analysis Scripts

### Logit lens

```bash
python scripts/logit_lens_analysis.py \
  --encoder_model_path McGill-NLP/LLM2Vec-Gen-Qwen3-8B \
  --output_dir outputs/hf_model/LLM2Vec-Gen-Qwen3-8B \
  --dataset generation  # or `nq` / `advbenchir`
```

### Latent lens

Requires a prebuilt index. Example:

```bash
# Build the index with Qwen3-8B
python scripts/latent_lens_build_index.py

# Analyze the special tokens against the prebuilt index
python scripts/latent_lens_analyze_special_tokens.py \
  --checkpoint McGill-NLP/LLM2Vec-Gen-Qwen3-8B \
  --index /path/to/qwen3-8b_index \
  --top_k 1 \
  --layers 36
```

### Generations analysis

```bash
python scripts/generations_analysis.py \
  --encoder_model_path McGill-NLP/LLM2Vec-Gen-Qwen3-8B \
  --output_dir outputs/LLM2Vec-Gen-Qwen3-8B \
  --dataset generation  # or `nq` / `advbenchir`
```

---

## 📁 Project Structure

```
llm2vec-gen/
├── conf/                    # Hydra configs
│   ├── config.yaml          # Main config
│   ├── model/               # Model presets (qwen3-4, qwen3-8, …)
│   ├── data/                # Dataset presets
│   ├── training/            # Training presets (llm2vec-gen, …)
│   ├── special_tokens/      # Special token configs
│   └── run/                 # Run / wandb settings
├── llm2vec_gen/             # Package
│   ├── models/              # LLM2VecGenModel, encoder-decoder, etc.
│   ├── dataset/             # Data loading and collation
│   └── trainer/             # Custom trainer
├── train_scripts/           # Bash scripts used to train the paper's main models and ablations
└── scripts/
    ├── train.py            
    ├── response_generation.py
    ├── mteb_eval.py
    ├── advbenchir_eval.py
    ├── bright_eval/         
    ├── logit_lens_analysis.py
    ├── latent_lens_analyze_special_tokens.py
    └── generations_analysis.py
```


## Citation

If you use this code, models, or data, please cite the LLM2Vec-Gen paper.

```
@article{behnamghader2026llm2vec-gen,
  title={LLM2Vec-Gen: Generative Embeddings from Large Language Models},
  author={BehnamGhader, Parishad and Adlakha, Vaibhav and Schmidt, Fabian David and Chapados, Nicolas and Mosbach, Marius and Reddy, Siva},
  journal={arXiv preprint},
  year={2026}
}
```