Metadata-Version: 2.4
Name: treeloom
Version: 0.2.5
Summary: Language-agnostic Code Property Graph library — weave syntax trees into queryable, analyzable graphs
Author: Will Jackson
License-Expression: Apache-2.0
Classifier: Development Status :: 3 - Alpha
Classifier: Intended Audience :: Developers
Classifier: License :: OSI Approved :: Apache Software License
Classifier: Programming Language :: Python :: 3
Classifier: Programming Language :: Python :: 3.10
Classifier: Programming Language :: Python :: 3.11
Classifier: Programming Language :: Python :: 3.12
Classifier: Topic :: Software Development :: Libraries
Classifier: Topic :: Software Development :: Quality Assurance
Requires-Python: >=3.10
Requires-Dist: networkx>=3.0
Requires-Dist: pyyaml>=6.0
Requires-Dist: tree-sitter>=0.23
Provides-Extra: all
Requires-Dist: mypy>=1.10; extra == 'all'
Requires-Dist: pytest-cov>=5.0; extra == 'all'
Requires-Dist: pytest>=8.0; extra == 'all'
Requires-Dist: ruff>=0.4; extra == 'all'
Requires-Dist: tree-sitter-c>=0.23; extra == 'all'
Requires-Dist: tree-sitter-cpp>=0.23; extra == 'all'
Requires-Dist: tree-sitter-go>=0.23; extra == 'all'
Requires-Dist: tree-sitter-java>=0.23; extra == 'all'
Requires-Dist: tree-sitter-javascript>=0.23; extra == 'all'
Requires-Dist: tree-sitter-python>=0.23; extra == 'all'
Requires-Dist: tree-sitter-rust>=0.23; extra == 'all'
Requires-Dist: tree-sitter-typescript>=0.23; extra == 'all'
Provides-Extra: dev
Requires-Dist: mypy>=1.10; extra == 'dev'
Requires-Dist: pytest-cov>=5.0; extra == 'dev'
Requires-Dist: pytest>=8.0; extra == 'dev'
Requires-Dist: ruff>=0.4; extra == 'dev'
Provides-Extra: languages
Requires-Dist: tree-sitter-c>=0.23; extra == 'languages'
Requires-Dist: tree-sitter-cpp>=0.23; extra == 'languages'
Requires-Dist: tree-sitter-go>=0.23; extra == 'languages'
Requires-Dist: tree-sitter-java>=0.23; extra == 'languages'
Requires-Dist: tree-sitter-javascript>=0.23; extra == 'languages'
Requires-Dist: tree-sitter-python>=0.23; extra == 'languages'
Requires-Dist: tree-sitter-rust>=0.23; extra == 'languages'
Requires-Dist: tree-sitter-typescript>=0.23; extra == 'languages'
Description-Content-Type: text/markdown

# treeloom

A language-agnostic Code Property Graph (CPG) library for Python. treeloom parses source code via tree-sitter, builds a unified graph combining AST, control flow, data flow, and call graph layers, and provides query and analysis APIs on top of it.

## Features

- **Multi-language parsing** -- Python, JavaScript, TypeScript, Go, Java, C, C++, and Rust via tree-sitter grammars
- **Unified graph model** -- AST structure, control flow, data flow, and call graphs in a single queryable graph
- **Taint analysis** -- generic label-propagation engine for tracking data flow from sources to sinks, with sanitizer support
- **Pattern matching** -- chain-based pattern queries for finding code patterns across the graph
- **Visualization** -- export to JSON, Graphviz DOT, or interactive HTML (Cytoscape.js)
- **Consumer annotations** -- attach arbitrary metadata to nodes without modifying the structural graph
- **Overlay system** -- inject visual styling for domain-specific visualization (e.g., security analysis results)
- **Serialization** -- full round-trip JSON serialization including annotations

## Quick Start

```python
from pathlib import Path
from treeloom import CPGBuilder, NodeKind, EdgeKind

# Build a CPG from a directory of source files
cpg = CPGBuilder().add_directory(Path("src/")).build()

# Inspect the graph
print(f"{cpg.node_count} nodes, {cpg.edge_count} edges")
print(f"Files: {[str(f) for f in cpg.files]}")

# Find all function definitions
for func in cpg.nodes(kind=NodeKind.FUNCTION):
    print(f"  {func.name} at {func.location}")

# Find all call sites targeting a specific function
for call in cpg.nodes(kind=NodeKind.CALL):
    if call.name == "eval":
        print(f"  eval() called at {call.location}")

# Query: what nodes are reachable from a function via data flow?
func_node = next(iter(cpg.nodes(kind=NodeKind.FUNCTION)))
reachable = cpg.query().reachable_from(
    func_node.id, edge_kinds=frozenset({EdgeKind.DATA_FLOWS_TO})
)
```

## Installation

```bash
pip install treeloom                # core only (networkx, pyyaml, tree-sitter)
pip install "treeloom[languages]"   # with all language grammars
pip install "treeloom[all]"         # everything (grammars + dev tools)
```

For development:

```bash
git clone https://github.com/rdwj/treeloom.git
cd treeloom
pip install -e ".[all]"
```

## Supported Languages

| Language   | Extensions            | Grammar Package          |
|------------|-----------------------|--------------------------|
| Python     | `.py`, `.pyi`         | `tree-sitter-python`     |
| JavaScript | `.js`, `.mjs`, `.cjs` | `tree-sitter-javascript` |
| TypeScript | `.ts`, `.tsx`         | `tree-sitter-typescript` |
| Go         | `.go`                 | `tree-sitter-go`         |
| Java       | `.java`               | `tree-sitter-java`       |
| C          | `.c`, `.h`            | `tree-sitter-c`          |
| C++        | `.cpp`, `.cc`, ...    | `tree-sitter-cpp`        |
| Rust       | `.rs`                 | `tree-sitter-rust`       |

Grammar packages are optional dependencies. The core library imports and runs without them, but a file can only be parsed when the grammar for its language is installed. A missing grammar produces a clear error message rather than a crash.
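
To check which grammars are present before building a graph, one option is to probe for the grammar modules directly. This is a minimal sketch, not part of treeloom: `available_grammars` is a hypothetical helper, and it assumes each grammar package installs a top-level module named after the package with hyphens replaced by underscores (for example `tree_sitter_python`).

```python
from importlib.util import find_spec

# Assumption: each grammar distribution installs a top-level module whose
# name is the package name with hyphens replaced by underscores.
GRAMMAR_PACKAGES = {
    "python": "tree-sitter-python",
    "javascript": "tree-sitter-javascript",
    "typescript": "tree-sitter-typescript",
    "go": "tree-sitter-go",
    "java": "tree-sitter-java",
    "c": "tree-sitter-c",
    "cpp": "tree-sitter-cpp",
    "rust": "tree-sitter-rust",
}


def available_grammars(packages=GRAMMAR_PACKAGES):
    """Map each language to True if its grammar module can be located."""
    return {
        lang: find_spec(pkg.replace("-", "_")) is not None
        for lang, pkg in packages.items()
    }


# Print a small installed/missing report without importing any grammar.
for lang, installed in sorted(available_grammars().items()):
    print(f"{lang:<12} {'installed' if installed else 'missing'}")
```

`find_spec` only locates the module; it does not import it, so the check stays cheap even when every grammar is installed.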

## Architecture

treeloom builds a Code Property Graph -- a single directed graph that unifies four views of source code.

**AST layer.** Module, class, function, parameter, variable, call, and literal nodes connected by containment edges (`CONTAINS`, `HAS_PARAMETER`). This gives you the structural hierarchy of the code.

**Control flow layer.** Statement-level flow between nodes within functions. `FLOWS_TO` edges represent sequential execution; `BRANCHES_TO` edges represent conditional or loop branching.

**Data flow layer.** Tracks where variables are defined and used, and how data propagates through assignments, function calls, and return values. Edges: `DATA_FLOWS_TO`, `DEFINED_BY`, `USED_BY`.

**Call graph layer.** Links call sites to their resolved function definitions. `CALLS` edges connect a call node to the function it invokes. Resolution is best-effort (no full type inference).
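
The idea of several layers sharing one graph can be illustrated without treeloom at all. The sketch below is a toy model in plain Python, not treeloom's internal representation: the node names are made up, edges are `(source, target, kind)` tuples, and `reachable_from` is a hypothetical stand-in for a kind-filtered reachability query.

```python
from collections import deque

# Toy edge list: (source, target, kind). Node names are illustrative only.
EDGES = [
    ("module", "handler", "CONTAINS"),
    ("handler", "user_data", "HAS_PARAMETER"),
    ("user_data", "query", "DATA_FLOWS_TO"),
    ("query", "execute()", "DATA_FLOWS_TO"),
    ("query", "execute()", "FLOWS_TO"),
    ("execute()", "db_execute", "CALLS"),
]


def reachable_from(start, edges, edge_kinds):
    """Breadth-first search that follows only edges whose kind is in edge_kinds."""
    successors = {}
    for src, dst, kind in edges:
        if kind in edge_kinds:
            successors.setdefault(src, []).append(dst)

    seen, queue = set(), deque([start])
    while queue:
        node = queue.popleft()
        for nxt in successors.get(node, []):
            if nxt not in seen and nxt != start:
                seen.add(nxt)
                queue.append(nxt)
    return seen


# Same graph, two different views selected purely by edge kind.
print(sorted(reachable_from("user_data", EDGES, {"DATA_FLOWS_TO"})))
# ['execute()', 'query']
print(sorted(reachable_from("module", EDGES, {"CONTAINS", "HAS_PARAMETER"})))
# ['handler', 'user_data']
```

Because every layer lives in the same edge set, switching between the structural view and the data flow view is only a matter of which edge kinds the traversal is allowed to follow.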

## API Overview

| Class / Function       | Purpose                                              |
|------------------------|------------------------------------------------------|
| `CPGBuilder`           | Fluent builder -- add files/directories, call `build()` |
| `CodePropertyGraph`    | Central graph object -- node/edge access, annotations, traversal, serialization |
| `GraphQuery`           | Path queries, reachability, subgraph extraction, pattern matching |
| `TaintPolicy`          | Consumer-defined source/sink/sanitizer callbacks     |
| `TaintResult`          | Taint analysis output -- paths, labels, filtering    |
| `ChainPattern`         | Declarative pattern for matching node chains          |
| `Overlay`              | Per-node/edge visual styling for HTML export         |
| `to_json` / `from_json`| JSON serialization with full round-trip support      |
| `to_dot`               | Graphviz DOT export                                  |
| `generate_html`        | Interactive HTML visualization with Cytoscape.js     |

For full API details, see `CLAUDE.md`.

## Taint Analysis

treeloom's taint engine propagates labels through data flow edges. It is generic -- the labels can represent anything (security-sensitive data, PII, environment variables). What they mean is up to you.

```python
from pathlib import Path

from treeloom import CPGBuilder, NodeKind, TaintLabel, TaintPolicy

cpg = CPGBuilder().add_directory(Path("myapp/")).build()

# Define what constitutes a source, sink, and sanitizer
policy = TaintPolicy(
    sources=lambda node: (
        TaintLabel("user_input", node.id)
        if node.kind == NodeKind.PARAMETER and node.name == "user_data"
        else None
    ),
    sinks=lambda node: (
        node.kind == NodeKind.CALL and node.name in ("exec", "eval", "os.system")
    ),
    sanitizers=lambda node: (
        node.kind == NodeKind.CALL and node.name == "sanitize"
    ),
)

result = cpg.taint(policy)

for path in result.unsanitized_paths():
    print(f"Unsanitized: {path.source.name} -> {path.sink.name}")
    print(f"  Labels: {[label.name for label in path.labels]}")
    for node in path.intermediates:
        print(f"    {node.kind.value}: {node.name} at {node.location}")
```
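
For intuition about what label propagation over data flow edges involves, here is a self-contained toy version. It is a simplified sketch and not treeloom's engine: `propagate` is a hypothetical function, nodes are plain strings, and a path counts as sanitized as soon as it passes through any sanitizer node. It does not reproduce how treeloom handles convergent paths.

```python
from collections import deque


def propagate(edges, sources, sinks, sanitizers):
    """Return (label, source, sink, sanitized) tuples via worklist propagation.

    edges:      iterable of (src, dst) data flow pairs
    sources:    dict mapping a source node to its taint label
    sinks:      set of sink nodes
    sanitizers: set of nodes that mark the flow as sanitized
    """
    successors = {}
    for src, dst in edges:
        successors.setdefault(src, []).append(dst)

    findings = []
    for origin, label in sources.items():
        # A state is (node, sanitized); remembering visited states
        # guarantees termination on cyclic graphs.
        start = (origin, origin in sanitizers)
        seen = {start}
        work = deque([start])
        while work:
            node, clean = work.popleft()
            if node in sinks:
                findings.append((label, origin, node, clean))
            for nxt in successors.get(node, []):
                state = (nxt, clean or nxt in sanitizers)
                if state not in seen:
                    seen.add(state)
                    work.append(state)
    return findings


# user_data reaches eval only through sanitize, but reaches exec directly.
edges = [("user_data", "sanitize"), ("sanitize", "eval"), ("user_data", "exec")]
results = propagate(
    edges,
    sources={"user_data": "user_input"},
    sinks={"eval", "exec"},
    sanitizers={"sanitize"},
)
unsanitized = [r for r in results if not r[3]]
print(unsanitized)  # [('user_input', 'user_data', 'exec', False)]
```

Tracking the sanitized flag as part of the worklist state lets the same sink be reported once per distinct status, which is the simplest way to keep sanitized and unsanitized routes apart in a toy model.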

## Export and Visualization

### JSON

Full round-trip serialization, including annotations:

```python
from treeloom import to_json, from_json

json_str = to_json(cpg)
restored = from_json(json_str)  # equivalent graph
```

### Graphviz DOT

```python
from treeloom import to_dot, EdgeKind

# Full graph
dot = to_dot(cpg)

# Only data flow edges
dot = to_dot(cpg, edge_kinds=frozenset({EdgeKind.DATA_FLOWS_TO}))

with open("graph.dot", "w", encoding="utf-8") as f:
    f.write(dot)
```

### Interactive HTML

Self-contained HTML with Cytoscape.js. Includes layer toggles, search, click-to-inspect, and overlay support.

```python
from treeloom import generate_html, Overlay, OverlayStyle

html = generate_html(cpg, title="My Project CPG")

with open("cpg.html", "w", encoding="utf-8") as f:
    f.write(html)
```

## Development

Set up a local development environment:

```bash
python -m venv .venv
source .venv/bin/activate
pip install -e ".[all]"
```

Run tests:

```bash
pytest
pytest --cov=treeloom --cov-report=html
```

Lint and type-check:

```bash
ruff check src/ tests/
mypy src/treeloom/
```

## Changelog

### Version 0.2.5

- Chained attribute receivers (`request.form.attr`) resolve recursively through DFG
- Basic field sensitivity: `obj.safe` and `obj.unsafe` tracked as separate variables
- `--output-format` flag on query and edges: table, json, csv, tsv, jsonl
- 862 tests

### Version 0.2.4

- Python visitor: subscript (`dict['key']`) and attribute (`obj.attr`) expressions now generate DFG nodes
- Python visitor: decorated functions (Flask `@app.route`), keyword args, `**kwargs`, comprehensions now tracked
- Java visitor: string concatenation with `+` emits DFG, try-catch bodies visited, annotations captured
- Method call return values flow to assigned variables across both Python and Java
- VAmPI (Python) taint paths: 4 → 40; VulnerableApp (Java) SQL injection/XSS/command injection paths found
- Updated llms.txt and integration guide with `exclude_kinds` and `apply_to` patterns for better discoverability
- 849 tests

### Version 0.2.3

- Fixed data flow through chained method calls (`.format().fetchone()` pattern)
- New `treeloom edges` command for querying edges by kind, source/target name
- `treeloom diff --match-by-basename` and `--strip-prefix` for cross-directory comparison
- `treeloom query --scope`, `--count`, `--annotation`, `--annotation-value` filters
- Fixed `--json-errors` flag (errors now propagate to main handler for JSON formatting)
- Build `--progress` skips unsupported file types, `--language` filter restricts parsing
- DOT `--edge-kind` filter prunes disconnected nodes
- Import nodes hidden by default in HTML visualization (togglable "Imports" layer)
- `treeloom viz --exclude-kind` for consumer-controlled node filtering
- Large graph warning (>500 nodes) suggesting subgraph extraction
- 821 tests

### Version 0.2.2

- Fixed data flow tracking through string formatting (`.format()`, `%` operator, f-strings)
- Fixed parameter references not generating data flow edges (root cause of taint false negatives)
- Implemented CFG edge generation (`flows_to`, `branches_to`) connecting statements within functions
- Implemented inter-procedural data flow: call-site arguments flow to callee parameters, return values flow back
- Taint analysis on vulpy (deliberately vulnerable Flask app) went from 0 to 12 findings including cross-file HTTP-input-to-SQL-injection traces
- 776 tests

### Version 0.2.1

- New CLI commands: `annotate`, `diff`, `pattern`, `subgraph`, `watch`, `serve`, `completions`
- `--json-errors` global flag for machine-readable error output
- `--progress` flag for build command
- Multiple `--policy` files for taint policy composition
- `TaintResult.apply_to(cpg)` stamps taint annotations onto the graph
- `--apply` flag for taint command writes annotated CPG directly
- Fixed variable scoping in all visitors (ScopeStack replaces flat dict)
- Fixed import alias capture in Python, JavaScript, TypeScript visitors
- Fixed taint sanitizer tracking on convergent paths (per-origin intersection)
- Shell completions for bash, zsh, fish
- HTTP JSON API server (`treeloom serve`) with query, node, edges, subgraph endpoints
- 750 tests

### Version 0.2.0

- CLI with 7 subcommands: `build`, `info`, `query`, `taint`, `viz`, `dot`, `config`
- YAML-based taint policies for CLI-driven analysis (sources, sinks, sanitizers, propagators)
- Project and user configuration via `.treeloom.yaml` and `~/.config/treeloom/config.yaml`
- Works with `pip install treeloom`, `uvx treeloom`, and `uv tool install treeloom`
- 585 tests

### Version 0.1.0

- Initial release
- Code Property Graph with four layers: AST, control flow, data flow, call graph
- Language visitors: Python, JavaScript, TypeScript/TSX, Go, Java, C, C++, Rust
- Worklist-based taint analysis engine with inter-procedural propagation
- Pattern matching query API with wildcard support
- Export to JSON (round-trip), Graphviz DOT, and interactive HTML (Cytoscape.js)
- Consumer annotation and overlay system for domain-specific visualization
- 539 tests

## License

Apache-2.0
