Metadata-Version: 2.4
Name: kgrab
Version: 0.1.2
Summary: Scrape framework/package documentation and generate an AGENTS.md knowledge file.
License: MIT
Requires-Python: >=3.10
Description-Content-Type: text/markdown
License-File: LICENSE
Requires-Dist: requests>=2.28
Requires-Dist: beautifulsoup4>=4.11
Requires-Dist: mcp[cli]>=1.0
Provides-Extra: dev
Requires-Dist: pytest>=7; extra == "dev"
Requires-Dist: responses>=0.23; extra == "dev"
Dynamic: license-file

<p align="center">
  <img src="https://raw.githubusercontent.com/Bonhollow/kgrab/main/logo.png" alt="kgrab logo" width="500">
</p>

<p align="center">
  <strong>Scrape any framework's documentation and generate an AGENTS.md knowledge file for AI agents.</strong>
</p>

<p align="center">
  <a href="https://pypi.org/project/kgrab/"><img src="https://img.shields.io/pypi/v/kgrab?color=blue" alt="PyPI"></a>
  <a href="https://github.com/Bonhollow/kgrab"><img src="https://img.shields.io/github/stars/Bonhollow/kgrab?style=social" alt="GitHub Stars"></a>
  <a href="https://github.com/Bonhollow/kgrab/blob/main/LICENSE"><img src="https://img.shields.io/github/license/Bonhollow/kgrab" alt="License"></a>
</p>

---

## What is kgrab?

**kgrab** crawls the documentation of a framework or package, starting from a single entry URL, and generates a structured `AGENTS.md` file: a knowledge base that tells AI agents about the library's features, APIs, and patterns.

It works as a **CLI tool**, a **pip package**, and an **MCP server** (Model Context Protocol) so AI assistants like Claude, Gemini, or Cursor can call it as a tool.

---

## Quick Start

```bash
pip install kgrab
```

### CLI usage

```bash
# Scrape docs and generate AGENTS.md (default max 500 pages)
doc-scrape https://docs.agno.com/introduction

# Compact mode (about 70% smaller output on the Agno docs)
doc-scrape https://docs.agno.com/introduction --compact

# Full scrape (ignores the default 500-page limit)
doc-scrape https://docs.agno.com/introduction --full

# Custom output path and package name
doc-scrape https://docs.agno.com/introduction -o agno_AGENTS.md -n "Agno" --compact
```

### MCP server usage

Add to your MCP client config (Claude Desktop, Gemini CLI, Cursor, etc.):

```json
{
  "mcpServers": {
    "kgrab": {
      "command": "doc-scrape-mcp"
    }
  }
}
```

The MCP server exposes two tools:

| Tool | Description |
|------|-------------|
| `scrape_documentation` | Scrapes docs and saves `AGENTS.md` to disk |
| `scrape_documentation_to_text` | Scrapes docs and returns the markdown content directly |

---

## CLI Options

| Flag | Default | Description |
|------|---------|-------------|
| `url` | *(required)* | Entry-point URL of the documentation |
| `-o / --output` | `AGENTS.md` | Output file path |
| `-n / --name` | auto-detect | Human-friendly package name |
| `--max-pages` | 500 | Maximum pages to scrape |
| `--full` | off | Scrape all reachable pages (ignores `--max-pages`) |
| `--delay` | 0.25 | Seconds between HTTP requests |
| `-c / --compact` | off | Compress output: truncate prose, cap code, remove thin pages |
| `-v / --verbose` | off | Enable debug logging |
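
As a rough guide to how `--max-pages` and `--delay` interact, the sketch below computes the minimum time a crawl spends waiting between requests. It assumes the delay is applied once between each pair of consecutive requests; the helper name `min_crawl_seconds` is hypothetical and not part of kgrab.

```python
def min_crawl_seconds(pages: int, delay: float = 0.25) -> float:
    """Lower bound on crawl time from the inter-request delay alone.

    Assumption: N requests are separated by N - 1 delays.
    Network and parsing time come on top of this.
    """
    return max(pages - 1, 0) * delay


# Defaults from the table above: 500 pages, 0.25 s between requests.
print(min_crawl_seconds(500))  # 124.75
```

With the defaults, a crawl that reaches the page cap therefore takes at least about two minutes, and `--full` on a large site can take considerably longer.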

---

## Compact Mode

Large documentation sites can produce huge `AGENTS.md` files (200K+ tokens). Use `--compact` to aggressively reduce the size:

| What it does | Effect |
|---|---|
| Removes thin/empty pages | Drops index and redirect pages |
| Deduplicates by title | Keeps first occurrence only |
| Strips boilerplate | Removes copyright, "Was this helpful?" etc. |
| Truncates prose | 2 sentences per section |
| Caps code blocks | 10 lines max, first block per section only |
| Cleans titles | Strips redundant site-name suffixes |

**Example:** the Agno docs (500 pages) go from **~247K tokens → ~75K tokens** (about a 70% reduction).
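
Three of the steps in the table can be illustrated with a minimal sketch. This is not kgrab's actual code: the function names are hypothetical, and the sentence splitter is deliberately naive. The limits (2 sentences, 10 lines) are the ones listed above.

```python
import re


def truncate_prose(text: str, max_sentences: int = 2) -> str:
    """Keep only the first `max_sentences` sentences of a paragraph."""
    # Naive split after ., ! or ? followed by whitespace.
    # Abbreviations such as "e.g. this" will be split too early.
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    return " ".join(sentences[:max_sentences])


def cap_code(code: str, max_lines: int = 10) -> str:
    """Keep at most `max_lines` lines of a code block."""
    lines = code.splitlines()
    if len(lines) <= max_lines:
        return code
    return "\n".join(lines[:max_lines])


def dedupe_by_title(pages: list[tuple[str, str]]) -> list[tuple[str, str]]:
    """Keep the first (title, body) pair seen for each title."""
    seen: set[str] = set()
    kept = []
    for title, body in pages:
        if title not in seen:
            seen.add(title)
            kept.append((title, body))
    return kept
```

For example, `truncate_prose("One. Two! Three? Four.")` returns `"One. Two!"`.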

---

## Tested Frameworks & Layouts

kgrab handles HTML layouts generated by a range of tooling (Docusaurus, VitePress, Sphinx, and custom site builders). The table below shows a smoke test across 10 sites, run with `--compact`:

| Framework | Pages* | Sections | Est. Tokens | Size | Status |
|-----------|-------:|---------:|------------:|-----:|:------:|
| FastAPI | 15 | 327 | ~2.4k | 9.4 KB | ✅ |
| Flask | 15 | 115 | ~14.8k | 57.9 KB | ✅ |
| React | 15 | 207 | ~13.8k | 54.2 KB | ✅ |
| Vue.js | 15 | 160 | ~15.4k | 60.4 KB | ✅ |
| Next.js | 15 | 53 | ~3.2k | 12.5 KB | ✅ |
| Pydantic | 15 | 283 | ~17.3k | 67.8 KB | ✅ |
| Tailwind | 15 | 159 | ~7.0k | 27.7 KB | ✅ |
| Stripe | 15 | 168 | ~11.6k | 45.5 KB | ✅ |
| Hono | 15 | 116 | ~7.4k | 29.0 KB | ✅ |
| LangChain | 1 | 1 | ~168 | 0.7 KB | ⚠️ |

*\* Capped at 15 pages for testing speed. Run with `--full` to scrape all pages.*
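
The Est. Tokens column is close to the output size divided by four (for example, 12.5 KB is about 12,800 bytes, or roughly 3.2k tokens), which matches the common rule of thumb of about four characters per token. Whether kgrab computes its estimate exactly this way is an assumption; the sketch below only reproduces the rule of thumb, and `estimate_tokens` is a hypothetical name.

```python
def estimate_tokens(text: str, chars_per_token: int = 4) -> int:
    """Rough token count: one token per `chars_per_token` characters.

    Assumption: a fixed characters-per-token ratio. Real tokenizers
    vary by model and by content (code, prose, non-English text).
    """
    return len(text) // chars_per_token


# To estimate the size of a generated file:
#     with open("AGENTS.md", encoding="utf-8") as f:
#         print(estimate_tokens(f.read()))
```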

### ⚠️ Known Limitation: Single Page Apps (SPAs)

kgrab fetches pages over plain HTTP (`requests`) and does not execute client-side JavaScript. Sites that render their content entirely in the browser, such as LangChain's Python docs, yield only the initial HTML returned by the server, which is often a single, nearly empty page. Support for JS-heavy sites via a headless browser is not available yet and may be considered in the future.
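
To check in advance whether a site is likely to hit this limitation, a simple heuristic is to look for pages that ship scripts but almost no visible text. The sketch below uses only the standard library; the function name and the 200-character threshold are arbitrary assumptions, and nothing here is part of kgrab.

```python
from html.parser import HTMLParser


class VisibleText(HTMLParser):
    """Counts visible characters and <script> tags in an HTML document."""

    def __init__(self):
        super().__init__()
        self.chars = 0
        self.scripts = 0
        self._hidden = 0  # nesting depth inside <script> or <style>

    def handle_starttag(self, tag, attrs):
        if tag == "script":
            self.scripts += 1
        if tag in ("script", "style"):
            self._hidden += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._hidden:
            self._hidden -= 1

    def handle_data(self, data):
        if not self._hidden:
            self.chars += len(data.strip())


def looks_js_rendered(html: str, min_chars: int = 200) -> bool:
    """Heuristic: the page has scripts but fewer than `min_chars` of text."""
    parser = VisibleText()
    parser.feed(html)
    return parser.scripts > 0 and parser.chars < min_chars
```

Pass it the raw response body, for example `looks_js_rendered(requests.get(url).text)`. A `True` result suggests that the scrape will return little or no content.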

---

## How It Works

1. **Crawl** — starting from the given URL, kgrab follows internal navigation links (sidebar, next/prev, etc.) and collects all reachable documentation pages under the same domain scope.
2. **Extract** — for each page it extracts headings, body text, and code examples while discarding chrome (nav, footer, scripts).
3. **Generate** — the collected content is assembled into a structured `AGENTS.md` with a table of contents, per-page sections, and inline code blocks.
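
The three steps above can be sketched in a few dozen lines. This is an illustrative, standard-library-only approximation, not kgrab's implementation (which depends on `requests` and `beautifulsoup4`). All names are hypothetical, "same domain scope" is simplified to "same host", and the `fetch` callable is supplied by the caller so the sketch runs without network access.

```python
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urldefrag, urljoin, urlparse


class PageParser(HTMLParser):
    """Collects links, headings, and visible text from one HTML page."""

    SKIP = {"nav", "footer", "script", "style"}  # page chrome to discard
    HEADINGS = {"h1", "h2", "h3"}

    def __init__(self):
        super().__init__()
        self.links, self.headings, self.text = [], [], []
        self._skip_depth = 0
        self._in_heading = False

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIP:
            self._skip_depth += 1
        elif tag in self.HEADINGS:
            self._in_heading = True
        elif tag == "a":
            # Links are kept even inside <nav>, so sidebars can be followed.
            href = dict(attrs).get("href")
            if href:
                self.links.append(href)

    def handle_endtag(self, tag):
        if tag in self.SKIP and self._skip_depth:
            self._skip_depth -= 1
        elif tag in self.HEADINGS:
            self._in_heading = False

    def handle_data(self, data):
        data = data.strip()
        if not data or self._skip_depth:
            return
        (self.headings if self._in_heading else self.text).append(data)


def crawl(start_url, fetch, max_pages=500):
    """Breadth-first crawl restricted to the start URL's host."""
    host = urlparse(start_url).netloc
    queue, seen, pages = deque([start_url]), {start_url}, {}
    while queue and len(pages) < max_pages:
        url = queue.popleft()
        parser = PageParser()
        parser.feed(fetch(url))
        pages[url] = parser
        for href in parser.links:
            target = urldefrag(urljoin(url, href)).url  # drop #fragments
            if urlparse(target).netloc == host and target not in seen:
                seen.add(target)
                queue.append(target)
    return pages


def generate(pages):
    """Assemble a table of contents followed by one section per page."""
    titles = {u: (p.headings[0] if p.headings else u) for u, p in pages.items()}
    lines = ["# Table of Contents", ""]
    lines += [f"- {title}" for title in titles.values()]
    for url, page in pages.items():
        lines += ["", f"## {titles[url]}", "", " ".join(page.text)]
    return "\n".join(lines)
```

Passing `fetch` as a parameter keeps the crawl logic testable: a real run could supply `lambda url: requests.get(url).text`, while a test can supply a dictionary lookup over canned HTML.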

---

## Development

```bash
git clone https://github.com/Bonhollow/kgrab.git
cd kgrab
python -m venv .venv && source .venv/bin/activate
pip install -e ".[dev]"
pytest -v
```

---

## License

[MIT](LICENSE)
