Metadata-Version: 2.4
Name: tree-of-attacks
Version: 0.1.0
Summary: TAP: Tree of Attacks with Pruning for black-box LLM jailbreaking
Author: Marcello Politi
License: MIT License
        
        Copyright (c) 2025 Marcello Politi
        
        Permission is hereby granted, free of charge, to any person obtaining a copy
        of this software and associated documentation files (the "Software"), to deal
        in the Software without restriction, including without limitation the rights
        to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
        copies of the Software, and to permit persons to whom the Software is
        furnished to do so, subject to the following conditions:
        
        The above copyright notice and this permission notice shall be included in all
        copies or substantial portions of the Software.
        
        THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
        IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
        FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
        AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
        LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
        OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
        SOFTWARE.
        
Project-URL: Homepage, https://github.com/marcellopoliti/tree-of-attacks
Project-URL: Repository, https://github.com/marcellopoliti/tree-of-attacks
Project-URL: Issues, https://github.com/marcellopoliti/tree-of-attacks/issues
Keywords: llm,jailbreak,red-teaming,adversarial,tap,tree-of-attacks
Classifier: Development Status :: 3 - Alpha
Classifier: Intended Audience :: Science/Research
Classifier: License :: OSI Approved :: MIT License
Classifier: Programming Language :: Python :: 3
Classifier: Programming Language :: Python :: 3.10
Classifier: Programming Language :: Python :: 3.11
Classifier: Programming Language :: Python :: 3.12
Classifier: Programming Language :: Python :: 3.13
Classifier: Topic :: Scientific/Engineering :: Artificial Intelligence
Requires-Python: >=3.10
Description-Content-Type: text/markdown
License-File: LICENSE
Requires-Dist: openai>=1.0.0
Requires-Dist: pyyaml>=6.0
Requires-Dist: python-dotenv>=1.0.0
Provides-Extra: dev
Requires-Dist: pytest>=7.0; extra == "dev"
Dynamic: license-file

# Tree of Attacks with Pruning (TAP)

Implementation of **TAP** from [Tree of Attacks: Jailbreaking Black-Box LLMs Automatically](https://arxiv.org/abs/2312.02119).

**Setup:** The **attacker** and **target** are adversarial agents. The attacker produces a single prompt P; that prompt is sent to the target as the **only user message**. The target replies once, as itself. The **evaluator** scores how much that reply means the target was jailbroken (1–10). So: attacker outputs P → we send P to target → target outputs R → evaluator scores (P, R). For **direct jailbreaking**, the goal should describe the forbidden behavior you want from the target's reply (e.g. "Get the target to output instructions for X").

### How it works

TAP builds a **tree of candidate prompts**. Each round: (1) from every current **leaf** (a candidate prompt and its history), the **attacker** generates `branching_factor` new refinements; (2) the **evaluator** filters out **off-topic** prompts; (3) each remaining prompt is sent to the **target**, and the evaluator **scores** the reply (1–10); (4) we keep only the top **width** leaves by score for the next round. This runs for up to **depth** rounds. If any reply scores ≥ **jailbreak_score**, the run succeeds and returns that prompt and response.

![dag](https://raw.githubusercontent.com/marcellopoliti/tree-of-attacks/main/assets/dag.png)

## Install

```bash
pip install tree-of-attacks
```

Set `OPENAI_API_KEY` or `OPENROUTER_API_KEY` in your environment (or a `.env` file in the project root).

## Usage

```python
import os
from dotenv import load_dotenv
load_dotenv()

from taprune import TAP, OpenAILLM, OpenRouterLLM, TapConfig
from taprune.config import load_named_config

cfg = TapConfig.from_dict(load_named_config("default"))

def build_llm(model: str, temperature: float = 0.7):
    if cfg.api.get("provider") == "openrouter":
        return OpenRouterLLM(model=model, api_key=os.environ.get("OPENROUTER_API_KEY"), temperature=temperature)
    return OpenAILLM(model=model, api_key=os.environ.get("OPENAI_API_KEY"), temperature=temperature)

attacker = build_llm(cfg.models.get("attacker", "gpt-4o-mini"), cfg.api.get("attacker_temperature", 1.0))
evaluator = build_llm(cfg.models.get("evaluator", "gpt-4o"), cfg.api.get("evaluator_temperature", 0.1))
target = build_llm(cfg.models.get("target", "gpt-4o"), cfg.api.get("target_temperature", 0.3))

tap = TAP(
    attacker_llm=attacker,
    evaluator_llm=evaluator,
    target_llm=target,
    branching_factor=cfg.tap.get("branching_factor", 4),
    width=cfg.tap.get("width", 10),
    depth=cfg.tap.get("depth", 10),
    jailbreak_score=cfg.tap.get("jailbreak_score", 10),
)

result = tap.run(cfg.goal, target_system_prompt=cfg.target_system_prompt, target_chat_history=cfg.target_chat_history, prompt_overrides=cfg.resolve_prompts())
print("Success:", result.success)
# result.result_prompt, result.iteration_log, result.target_response, result.extra (None unless judge outputs e.g. Deal: A, B)
```

## Config

Configs are bundled with the package. Load by name with `load_named_config("default")`, or load a custom file with `load_config("path/to/config.yaml")`.

Bundled configs:
- **`default`** — standard jailbreak config
- **`example_extras`** — judge outputs e.g. `Deal: A, B` so `result.extra` is set via `extra_parser: parse_deal_from_reply`
- **`example_chat_history`** — target sees prior dialogue before the attack prompt

Each config defines:

- **goal**: What we want the target's reply to do (override in code if needed).
- **api.provider**: `openai` or `openrouter`
- **models**: `attacker`, `evaluator`, `target` (model IDs)
- **target_context** (optional): How the target agent is prompted. `system_prompt` = target's system message (default "You are a helpful assistant."). `chat_history` = optional list of `{role, content}` before the attack prompt.
- **prompts**: `judge`, `off_topic`, `attacker` (placeholders `[[OBJECTIVE]]`, `[[STARTING_STRING]]`)
- **tap**: `branching_factor`, `width`, `depth`, `jailbreak_score`

| Parameter | Description |
|-----------|-------------|
| **branching_factor** | Refinements per leaf per iteration (default 4) |
| **width** | Max leaves kept after pruning (default 10) |
| **depth** | Max refinement rounds (default 10) |
| **jailbreak_score** | Score ≥ this counts as success (default 10) |

## API

- **tap.run(goal, ..., extra_parser=None)** returns a **RunResult** (attributes: `result_prompt`, `iteration_log`, `target_response`, `extra`, `success`). **extra** is only set when you pass **extra_parser**: a callable `(judge_reply: str) -> Any` that parses the judge LLM reply. If you don't pass it, `result.extra` is always `None`. Use a parser from **taprune.parsers** or implement your own.
- **taprune.parsers** — built-in extra parsers for different judge outputs: **no_extra** (always None), **parse_deal_from_reply** (looks for `Deal: A, B` → `[A, B]`, e.g. negotiation configs), **raw_reply** (pass through the reply string). In YAML set `extra_parser: "parse_deal_from_reply"` (or `"no_extra"`, `"raw_reply"`) so `TapConfig.extra_parser` is resolved automatically; then pass `extra_parser=cfg.extra_parser` to `tap.run(...)`.

### Adding new parsers

An **extra parser** is a callable that takes the judge's raw reply string and returns whatever you want in `result.extra` (e.g. a number, a list, a dict, or `None`).

1. **Use in code**: Define `def my_parser(reply: str) -> Any: ...` and pass it to `tap.run(..., extra_parser=my_parser)`. No config change needed.
2. **Use from config by name**: The built-in names (`no_extra`, `parse_deal_from_reply`, `raw_reply`) are resolved from YAML via `extra_parser: "parse_deal_from_reply"`. To add a **custom** parser that is selectable from config, register it in `taprune.parsers.EXTRA_PARSERS` before loading config:

   ```python
   import taprune.parsers as parsers

   def my_custom_parser(reply: str):
       # parse reply and return whatever should go in result.extra
       return ...

   parsers.EXTRA_PARSERS["my_custom_parser"] = my_custom_parser
   # Then load config with extra_parser: "my_custom_parser" and use cfg.extra_parser in tap.run(...)
   ```

- **TapConfig.from_dict(d)** builds a typed view over the config dict; use `.goal`, `.api`, `.models`, `.tap`, `.target_system_prompt`, `.target_chat_history`, `.resolve_prompts()`.
- **taprune**: `TAP`, `Node`, `Attacker`, `Evaluator`, `Target`, `RunResult`, `TapConfig`, `no_extra`, `parse_deal_from_reply`, `raw_reply`
- **taprune.llm**: `OpenAILLM`, `OpenRouterLLM`
- **taprune.config**: `load_config(path)`, `load_named_config(name)`, `TapConfig.from_dict(load_config(path))`
- **taprune.results**: `save_result(run_id, goal, config_name, config, run_result, run_dir=None)` to write `result.json`

## Layout

```
taprune/
  tap.py       # TAP algorithm
  llm.py       # OpenAILLM, OpenRouterLLM
  config.py    # load_config, load_named_config
  prompts.py   # get_prompts
  parsers.py   # extra parsers (no_extra, parse_deal_from_reply, raw_reply); add your own
  results.py   # save_result
  configs/
    default.yaml               # standard jailbreak config
    example_extras.yaml        # extras (judge outputs Deal: A, B)
    example_chat_history.yaml  # multi-turn context before attack prompt
```

Some target providers apply content moderation and may return 403; TAP treats that as a refusal and continues.
