Metadata-Version: 2.4
Name: hld-generator
Version: 0.2.0
Summary: Language-agnostic High-Level Design generator powered by Tree-sitter + LLM
Author: Harsh
License: MIT
Requires-Python: >=3.9
Description-Content-Type: text/markdown
Requires-Dist: tree-sitter>=0.22.0
Requires-Dist: tree-sitter-python>=0.23.0
Requires-Dist: tree-sitter-javascript>=0.23.0
Requires-Dist: tree-sitter-java>=0.23.0
Requires-Dist: tree-sitter-go>=0.23.0
Requires-Dist: tree-sitter-rust>=0.23.0
Requires-Dist: tree-sitter-typescript>=0.23.0
Requires-Dist: tree-sitter-c>=0.23.0
Requires-Dist: tree-sitter-cpp>=0.23.0
Requires-Dist: tree-sitter-ruby>=0.23.0
Requires-Dist: anthropic>=0.39.0
Requires-Dist: openai>=1.50.0
Requires-Dist: rich>=13.7.0
Requires-Dist: networkx>=3.2
Provides-Extra: dev
Requires-Dist: pytest; extra == "dev"
Requires-Dist: ruff; extra == "dev"

# HLD Generator

**Language-agnostic High-Level Design document generator** powered by Tree-sitter + LLM.

Point it at any codebase — Python, JavaScript, TypeScript, Java, Go, Rust, C/C++, Ruby, C#, Kotlin, Swift, and more — and get a complete HLD with architecture diagrams, component breakdowns, and dependency analysis.

## How It Works

```
Codebase → Scanner → Tree-sitter AST Parser → Code Graph → LLM Analysis → HLD Document
                          ↓ (fallback)
                     Regex Parser
```

1. **Scan** — Discovers source files, filters out vendor/generated code
2. **Parse** — Extracts classes, functions, imports, endpoints using Tree-sitter (with regex fallback). Parsing runs in parallel using a thread pool
3. **Graph** — Builds a dependency graph with NetworkX, identifies entry points and hub modules
4. **Analyse** — Sends structured context to an LLM (Claude or GPT) for semantic understanding, or runs static fallback analysis with `--provider none`
5. **Render** — Generates a Markdown HLD report + Mermaid architecture diagram
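
Steps 2 and 3 can be illustrated with a small, dependency-free sketch. This is not the project's implementation: the import regex, the `parse_imports`, `build_graph`, and `entry_points_and_hubs` helpers, the in-memory `SOURCES` dict, and the hub threshold are all assumptions made for illustration. Plain dictionaries stand in for NetworkX so the snippet runs on its own.

```python
import re
from collections import defaultdict
from concurrent.futures import ThreadPoolExecutor

# Hypothetical in-memory "codebase": module name -> source text.
SOURCES = {
    "app": "import api\nimport db\n",
    "api": "import db\nfrom utils import helper\n",
    "db": "import utils\n",
    "utils": "",
}

# Simplified pattern for Python-style imports (illustrative only).
IMPORT_RE = re.compile(r"^[ \t]*(?:from|import)[ \t]+([A-Za-z_]\w*)", re.MULTILINE)


def parse_imports(item):
    """Return (module, sorted imported module names) for one source file."""
    name, text = item
    return name, sorted(set(IMPORT_RE.findall(text)))


def build_graph(sources):
    """Parse all files in a thread pool and build importer -> imported edges."""
    with ThreadPoolExecutor(max_workers=4) as pool:
        parsed = dict(pool.map(parse_imports, sources.items()))
    # Keep only edges that point at modules inside the codebase.
    return {m: [d for d in deps if d in sources] for m, deps in parsed.items()}


def entry_points_and_hubs(graph, hub_threshold=2):
    """Entry points are imported by nobody; hubs are imported by many."""
    in_degree = defaultdict(int)
    for deps in graph.values():
        for dep in deps:
            in_degree[dep] += 1
    entry_points = sorted(m for m in graph if in_degree[m] == 0)
    hubs = sorted(m for m in graph if in_degree[m] >= hub_threshold)
    return entry_points, hubs


graph = build_graph(SOURCES)
entry_points, hubs = entry_points_and_hubs(graph)
```

In this toy codebase `app` is the only module nobody imports, so it is reported as the entry point, while `db` and `utils` are each imported twice and count as hubs.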

## Installation

```bash
# Install directly from GitHub (no clone needed)
pip install git+https://github.com/harsh-vishnoi/hld-generator.git

# Or clone and install for development
git clone https://github.com/harsh-vishnoi/hld-generator.git
cd hld-generator
pip install -e .
```

**Requirements:** Python 3.9+

> **`hld` not found after install?** pip may install the script to a directory not on your PATH (e.g. `~/.local/bin` or `~/Library/Python/3.x/bin`). Either add that directory to your PATH:
> ```bash
> # macOS (replace 3.9 with your Python version)
> echo 'export PATH="$PATH:$HOME/Library/Python/3.9/bin"' >> ~/.zshrc && source ~/.zshrc
>
> # Linux
> echo 'export PATH="$PATH:$HOME/.local/bin"' >> ~/.bashrc && source ~/.bashrc
> ```
> Or skip the `hld` command entirely and run via Python:
> ```bash
> python -m hld_generator ./my-project --provider none
> ```

**Updating to latest version:**
```bash
pip install --upgrade git+https://github.com/harsh-vishnoi/hld-generator.git
```

## Quick Start

```bash
# With Anthropic Claude (recommended)
export ANTHROPIC_API_KEY=sk-ant-...
hld ./my-project

# With OpenAI
export OPENAI_API_KEY=sk-...
hld ./my-project --provider openai

# Without LLM (graph-only static analysis, no API key needed)
hld ./my-project --provider none

# Scan a single file
hld ./main.py --provider none

# Custom output directory
hld ./my-project -o ./docs/architecture

# Include test files + verbose logging
hld ./my-project --include-tests -v

# Run as a Python module
python -m hld_generator ./my-project --provider none
```

## Output

The tool writes three files to `./hld_output/` (or the directory given with `-o`):

| File | Description |
|------|-------------|
| `hld_report.md` | Complete HLD document with overview, components, data flow, tech stack, and architecture diagram |
| `architecture.mmd` | Standalone Mermaid diagram file (renderable in GitHub, VS Code, etc.) |
| `graph_summary.md` | Raw code graph statistics — modules, packages, entry points, hub modules, API endpoints |

## CLI Reference

```
hld <target> [options]
```

| Flag | Default | Description |
|------|---------|-------------|
| `target` | *(required)* | Path to a source file or directory to analyse |
| `-o`, `--output` | `./hld_output` | Output directory |
| `--provider` | `anthropic` | LLM provider: `anthropic`, `openai`, `none`, or a plugin-registered name |
| `--model` | auto | LLM model name (defaults: `claude-sonnet-4-20250514` / `gpt-4o`) |
| `--format` | `markdown` | Output format: built-in `markdown` or a plugin-registered renderer name |
| `--include-tests` | off | Include test files in the analysis |
| `--max-files` | `500` | Maximum number of files to scan |
| `--max-file-size` | `512000` | Maximum file size in bytes |
| `--plugins-dir` | none | Directory containing plugin `.py` files to load |
| `-v`, `--verbose` | off | Enable debug logging |
| `--version` | — | Show version |

**Exit codes:** `0` = success, `1` = no files found, `2` = degraded (LLM failed but fallback output was produced).
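
When calling the tool from a script or CI job, the exit codes can be interpreted along these lines. This is a minimal sketch: `describe_exit` and `run_hld` are hypothetical helper names, and the only facts taken from this README are the three exit codes and the `--provider` flag.

```python
import subprocess

# Exit codes as documented in this README.
EXIT_MEANINGS = {
    0: "success",
    1: "no files found",
    2: "degraded: LLM failed, fallback output produced",
}


def describe_exit(code):
    """Map an `hld` exit code to a human-readable status."""
    return EXIT_MEANINGS.get(code, f"unexpected exit code {code}")


def run_hld(target, provider="none"):
    """Run `hld` on `target` and return (exit code, description)."""
    result = subprocess.run(["hld", target, "--provider", provider])
    return result.returncode, describe_exit(result.returncode)
```

A caller might, for example, treat `2` as a warning rather than a hard failure, since output files are still produced in that case.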

**API keys:** Set `ANTHROPIC_API_KEY` or `OPENAI_API_KEY` as environment variables. The `--api-key` flag is deprecated because a key passed on the command line is visible in the process list.

## Supported Languages

| Language | Tree-sitter | Regex Fallback | Endpoint Detection |
|----------|:-----------:|:--------------:|:-----------------:|
| Python | yes | yes | Flask, FastAPI, Django |
| JavaScript | yes | yes | Express |
| TypeScript | yes | yes | Express |
| Java | yes | yes | Spring Boot |
| Go | yes | yes | net/http, Gin, Chi |
| Rust | yes | yes | — |
| C | yes | yes | — |
| C++ | yes | yes | — |
| Ruby | yes | yes | — |
| C# | — | yes | — |
| Kotlin | — | yes | — |
| Swift | — | yes | — |
| PHP | — | yes | — |
| Scala | — | yes | — |

Languages without Tree-sitter grammars automatically use the regex fallback parser. Even completely unknown languages get basic extraction via generic patterns.
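
To give a feel for what generic pattern extraction can recover, here is a minimal sketch. The patterns and the `extract_generic` helper are hypothetical and far cruder than a real fallback parser; they are not taken from `regex_parser.py`.

```python
import re

# Hypothetical generic patterns: a declaration keyword followed by an identifier.
GENERIC_PATTERNS = {
    "class": re.compile(r"\b(?:class|struct|interface)\s+([A-Za-z_]\w*)"),
    "function": re.compile(r"\b(?:def|func|fn|function|fun)\s+([A-Za-z_]\w*)"),
}


def extract_generic(source):
    """Return the names found for each entity kind, in order of appearance."""
    return {kind: pat.findall(source) for kind, pat in GENERIC_PATTERNS.items()}


# Mixed Kotlin-like and Swift-like input, treated as an "unknown" language.
SAMPLE = (
    "class UserService {\n"
    "    fun findUser(id: Int): User? { return null }\n"
    "}\n"
    "struct Point { func norm() -> Double { 0 } }\n"
)

entities = extract_generic(SAMPLE)
```

Keyword matching like this knows nothing about scopes, comments, or strings, which is the trade-off a regex fallback accepts in exchange for working without a grammar.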

## Verifying the Application

Two test suites are included:

```bash
# Quick smoke test (7 tests) — validates core parsing, no external deps
python test_quick.py

# Comprehensive test suite (141 tests) — covers every module end-to-end
python test_comprehensive.py
```

You can also run a full end-to-end validation on any codebase:

```bash
# Static analysis (no API key needed)
hld ./my-project --provider none -v

# Check the exit code immediately, before running any other command (0 = success)
echo $?

# Verify output files were created
ls -la ./hld_output/
```
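
The same check can be scripted. The sketch below assumes the default output directory and the three file names listed in the Output section; `missing_outputs` is a hypothetical helper, not part of the package.

```python
from pathlib import Path

# File names from the Output section of this README.
EXPECTED_OUTPUTS = ("hld_report.md", "architecture.mmd", "graph_summary.md")


def missing_outputs(output_dir="./hld_output"):
    """Return the expected output files that are absent from `output_dir`."""
    root = Path(output_dir)
    return [name for name in EXPECTED_OUTPUTS if not (root / name).is_file()]
```

An empty list means all three files are present.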

### What the Tests Cover

| Suite | Tests | Scope |
|-------|------:|-------|
| `test_quick.py` | 7 | Language map, config defaults, regex parser (Python/JS/Go/Java), file scanner |
| `test_comprehensive.py` | 141 | Config, scanner, regex parser, tree-sitter parser, graph builder, LLM fallback, renderer, plugin system, edge cases, entry point detection, section numbering, integration fixes |

## Architecture

```
hld_generator/
├── __init__.py            # Package version
├── __main__.py            # python -m hld_generator entry point
├── cli.py                 # CLI entry point & pipeline orchestrator
├── config.py              # Configuration, language maps, constants
├── scanner.py             # File discovery & filtering
├── parsers/
│   ├── base.py            # Data structures (ParsedFile, ParsedEntity, ImportInfo)
│   ├── manager.py         # Parser facade (auto-selects tree-sitter or regex, parallel parse_all)
│   ├── tree_sitter_parser.py   # Tree-sitter AST parser
│   └── regex_parser.py    # Regex fallback parser
├── graph.py               # NetworkX dependency graph builder
├── llm.py                 # LLM client (Anthropic + OpenAI + plugin dispatch)
├── fallback.py            # Static fallback analyser (no LLM needed)
├── renderer.py            # Markdown + Mermaid output renderer
├── plugins.py             # Plugin registry & hook system
└── _networkx_shim/        # Lightweight NetworkX fallback for offline use
    └── __init__.py
```

## Plugin System

HLD Generator supports plugins for custom parsers, LLM providers, renderers, and pipeline hooks.

### Loading Plugins

Plugins are loaded from:
1. `--plugins-dir <path>` — explicitly specified directory
2. `.hld_plugins/` — auto-discovered next to the target directory

Each `.py` file in the plugin directory is loaded automatically (files starting with `_` are skipped).
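
The discovery rule can be pictured with the standard library alone. This is a sketch of the general technique, not the code in `plugins.py`: the function names, the `hld_plugin_` module-name prefix, and the alphabetical load order are assumptions.

```python
import importlib.util
from pathlib import Path


def discover_plugin_files(plugins_dir):
    """Return `.py` files in the directory, skipping names that start with `_`."""
    return sorted(
        p for p in Path(plugins_dir).glob("*.py") if not p.name.startswith("_")
    )


def load_plugins(plugins_dir):
    """Import each discovered file as a module and return the modules."""
    modules = []
    for path in discover_plugin_files(plugins_dir):
        spec = importlib.util.spec_from_file_location(f"hld_plugin_{path.stem}", path)
        module = importlib.util.module_from_spec(spec)
        # Executing the module runs its top-level code, including any
        # decorator-based registrations it contains.
        spec.loader.exec_module(module)
        modules.append(module)
    return modules
```

Because registration happens as a side effect of importing the file, a plugin needs no explicit entry point beyond its decorators.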

### Plugin Types

**Custom parser** — add support for a new language:
```python
from hld_generator.plugins import registry

@registry.register_parser("swift")
class SwiftParser:
    def parse_file(self, file_path, language):
        # Return a ParsedFile
        ...
```

**Custom LLM provider** — use a different LLM backend:
```python
from hld_generator.plugins import registry

@registry.register_llm_provider("ollama")
class OllamaProvider:
    def call(self, context: str, system_prompt: str) -> str:
        # Return raw LLM response text
        ...
```

Then use it: `hld ./project --provider ollama`

**Custom renderer** — output in a different format:
```python
from pathlib import Path

from hld_generator.plugins import registry

@registry.register_renderer("html")
class HTMLRenderer:
    def render(self, analysis, code_graph, output_dir) -> list[Path]:
        # Write files and return their paths
        ...
```

Then use it: `hld ./project --format html`

**Pipeline hooks** — modify data between pipeline stages:
```python
from hld_generator.plugins import registry

@registry.register_hook("post_parse")
def enrich(parsed_files):
    # Modify and return parsed_files
    return parsed_files
```

### Available Hook Points

| Hook | Receives | Returns |
|------|----------|---------|
| `pre_scan` | `config` | `config` |
| `post_scan` | `scanned_files` | `scanned_files` |
| `pre_parse` | `scanned_files` | `scanned_files` |
| `post_parse` | `parsed_files` | `parsed_files` |
| `pre_graph` | `parsed_files` | `parsed_files` |
| `post_graph` | `code_graph` | `code_graph` |
| `pre_llm` | `code_graph` | `code_graph` |
| `post_llm` | `analysis` | `analysis` |
| `pre_render` | `analysis, code_graph` | `(analysis, code_graph)` |
| `post_render` | `output_files` | `output_files` |
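
The receive/return contract in this table amounts to threading a value through each registered hook in turn. The toy registry below illustrates that idea only. `HookRegistry`, its `run` method, and the two demo hooks are hypothetical, and how the real `registry` dispatches hooks is not documented here.

```python
from collections import defaultdict


class HookRegistry:
    """Toy registry: hooks for a point run in registration order."""

    def __init__(self):
        self._hooks = defaultdict(list)

    def register_hook(self, point):
        def decorator(func):
            self._hooks[point].append(func)
            return func
        return decorator

    def run(self, point, *args):
        """Thread `args` through every hook registered for `point`."""
        for hook in self._hooks[point]:
            result = hook(*args)
            # Multi-argument hooks (such as pre_render) return a tuple.
            args = tuple(result) if len(args) > 1 else (result,)
        return args if len(args) > 1 else args[0]


demo = HookRegistry()


@demo.register_hook("post_parse")
def drop_empty(parsed_files):
    return [f for f in parsed_files if f]


@demo.register_hook("pre_render")
def tag(analysis, code_graph):
    return {**analysis, "tagged": True}, code_graph
```

Here `demo.run("post_parse", files)` returns the filtered list, and a hook point with nothing registered passes its input through unchanged.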

### Other Extension Points

**Custom endpoint patterns:**
```python
from hld_generator.plugins import registry

registry.register_endpoint_pattern(
    r'@MyFramework\.route\("([^"]+)"\)',
    name="MyFramework"
)
```

**Custom language file extensions:**
```python
from hld_generator.plugins import registry
registry.register_language(".hx", "haxe")
```

## Extending (Without Plugins)

**Add a new language:**
1. Add extension mapping in `config.py` → `LANGUAGE_MAP`
2. Add regex patterns in `parsers/regex_parser.py` → `_PATTERNS`
3. (Optional) Add tree-sitter grammar to `config.py` → `TREE_SITTER_GRAMMARS` and queries to `tree_sitter_parser.py` → `_QUERIES`

**Add framework endpoint detection:**
Add patterns to `parsers/regex_parser.py` → `_ENDPOINT_PATTERNS`

## License

MIT
