Metadata-Version: 2.4
Name: kgrab
Version: 0.1.3
Summary: Scrape framework/package documentation and generate an AGENTS.md knowledge file.
License: MIT
Requires-Python: >=3.10
Description-Content-Type: text/markdown
License-File: LICENSE
Requires-Dist: requests>=2.28
Requires-Dist: beautifulsoup4>=4.11
Requires-Dist: mcp[cli]>=1.0
Provides-Extra: dev
Requires-Dist: pytest>=7; extra == "dev"
Requires-Dist: responses>=0.23; extra == "dev"
Dynamic: license-file

<p align="center">
  <img src="https://raw.githubusercontent.com/Bonhollow/kgrab/main/logo.png" alt="kgrab logo" width="500">
</p>

<p align="center">
  <strong>Scrape any framework's documentation and generate an AGENTS.md knowledge file for AI agents.</strong>
</p>

<p align="center">
  <a href="https://pypi.org/project/kgrab/"><img src="https://img.shields.io/pypi/v/kgrab?color=blue" alt="PyPI"></a>
  <a href="https://github.com/Bonhollow/kgrab"><img src="https://img.shields.io/github/stars/Bonhollow/kgrab?style=social" alt="GitHub Stars"></a>
  <a href="https://github.com/Bonhollow/kgrab/blob/main/LICENSE"><img src="https://img.shields.io/github/license/Bonhollow/kgrab" alt="License"></a>
</p>

---

## What is kgrab?

**kgrab** crawls the full documentation of any framework or package and generates a structured `AGENTS.md` file — a knowledge base that makes AI agents aware of all the features, APIs, and patterns of a library.

It works as a **CLI tool**, a **pip-installable Python package**, and an **MCP server** (Model Context Protocol), so AI assistants like Claude, Gemini, or Cursor can call it as a tool.

---

## Quick Start

```bash
pip install kgrab
```

### CLI usage

```bash
# Scrape docs and generate AGENTS.md (default max 500 pages)
doc-scrape https://docs.agno.com/introduction

# Compact mode (roughly 70% smaller output on the Agno docs)
doc-scrape https://docs.agno.com/introduction --compact

# Full scrape (ignores the 500-page limit)
doc-scrape https://docs.agno.com/introduction --full

# Scrape an SPA (Single Page Application) using Cloudflare Browser Rendering
export CF_ACCOUNT_ID="your_account_id"
export CF_API_TOKEN="your_api_token"
doc-scrape https://python.langchain.com/docs/introduction/ --compact
```

### MCP server usage

Add to your MCP client config (Claude Desktop, Gemini CLI, Cursor, etc.):

```json
{
  "mcpServers": {
    "kgrab": {
      "command": "doc-scrape-mcp",
      "env": {
        "CF_ACCOUNT_ID": "optional_cloudflare_account_id",
        "CF_API_TOKEN": "optional_cloudflare_api_token"
      }
    }
  }
}
```

The MCP server exposes two tools: `scrape_documentation` and `scrape_documentation_to_text`. Both optionally accept Cloudflare credentials for scraping JavaScript-heavy sites.

---

## CLI Options

| Argument / Flag | Default | Description |
|------|---------|-------------|
| `url` | *(required)* | Entry-point URL of the documentation |
| `-o / --output` | `AGENTS.md` | Output file path |
| `-n / --name` | auto-detect | Human-friendly package name |
| `--max-pages` | 500 | Maximum pages to scrape |
| `--full` | off | Scrape all reachable pages (ignores `--max-pages`) |
| `--delay` | 0.25 | Seconds between HTTP requests (local scraper only) |
| `-c / --compact` | off | Compress output: truncate prose, cap code, remove thin pages |
| `--cf-account` | `$CF_ACCOUNT_ID` | Cloudflare Account ID for SPA crawling |
| `--cf-token` | `$CF_API_TOKEN` | Cloudflare API Token for SPA crawling |
| `-v / --verbose` | off | Enable debug logging |
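
If you want to drive the CLI from a Python script, the flags above can be assembled into an argument list. The following is a minimal sketch, not part of kgrab itself: `build_doc_scrape_command` is a hypothetical helper, and it only uses flags listed in the table.

```python
import shlex


def build_doc_scrape_command(url, output="AGENTS.md", max_pages=None,
                             full=False, compact=False, delay=None):
    """Build an argv list for the doc-scrape CLI (hypothetical helper)."""
    cmd = ["doc-scrape", url, "--output", output]
    if full:
        # --full ignores --max-pages, so the cap is not passed alongside it.
        cmd.append("--full")
    elif max_pages is not None:
        cmd += ["--max-pages", str(max_pages)]
    if delay is not None:
        cmd += ["--delay", str(delay)]
    if compact:
        cmd.append("--compact")
    return cmd


if __name__ == "__main__":
    cmd = build_doc_scrape_command(
        "https://docs.agno.com/introduction", max_pages=50, compact=True
    )
    # Print the command line; to run it, pass the list to subprocess.run(cmd, check=True).
    print(shlex.join(cmd))
```

Building a list rather than a shell string avoids quoting problems when the URL or output path contains special characters.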

---

## Compact Mode

Large documentation sites can produce huge `AGENTS.md` files (200K+ tokens). Use `--compact` to aggressively reduce the output size:

| What it does | Effect |
|---|---|
| Removes thin/empty pages | Drops index and redirect pages |
| Deduplicates by title | Keeps first occurrence only |
| Strips boilerplate | Removes copyright, "Was this helpful?" etc. |
| Truncates prose | 2 sentences per section |
| Caps code blocks | 10 lines max, first block per section only |
| Cleans titles | Strips redundant site-name suffixes |

**Example:** The Agno docs (500 pages) go from **~247K tokens → ~75K tokens** (a ~70% reduction).
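
To make the rules in the table concrete, here is a rough sketch of three of them (prose truncation, code capping, and title deduplication). It illustrates the idea and is not kgrab's actual implementation: the function names, the naive sentence splitter, and the `{"title": ...}` page shape are all assumptions.

```python
import re


def truncate_prose(text, max_sentences=2):
    """Keep the first `max_sentences` sentences of a prose paragraph."""
    # Naive splitter: break after '.', '!' or '?' when followed by whitespace.
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    return " ".join(sentences[:max_sentences])


def cap_code(code, max_lines=10):
    """Keep at most `max_lines` lines of a code block."""
    return "\n".join(code.splitlines()[:max_lines])


def dedupe_by_title(pages):
    """Keep only the first page seen for each title (pages are dicts with a 'title' key)."""
    seen = set()
    kept = []
    for page in pages:
        if page["title"] not in seen:
            seen.add(page["title"])
            kept.append(page)
    return kept
```

A regex splitter like this will mis-split abbreviations such as "e.g."; a real implementation may need something more careful.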

---

## Cloudflare SPA Crawling

By default, kgrab uses standard HTTP requests to fetch pages. This is fast, but it fails on pure single-page applications (SPAs), such as React-heavy sites (e.g., LangChain's Python docs), that require JavaScript execution to render content.

To handle these sites, kgrab integrates with the **Cloudflare Browser Rendering Crawl API**.

If you provide `--cf-account` and `--cf-token` (or set the `CF_ACCOUNT_ID` and `CF_API_TOKEN` environment variables), the crawl runs as follows:
1. kgrab dispatches an asynchronous headless-browser crawl job to Cloudflare.
2. Cloudflare recursively evaluates JavaScript, follows links, and scrapes content.
3. kgrab polls the job and fetches the rendered Markdown once it completes.

This lets kgrab scrape documentation sites that render their content with JavaScript, including SPAs.
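
The dispatch, poll, and fetch flow described above can be sketched as a generic polling loop. This is an illustration under stated assumptions, not kgrab's code or the Cloudflare API: `start_job`, `get_status`, and `fetch_result` are hypothetical callables standing in for the HTTP calls, and the `"completed"` / `"failed"` status strings are made up for the example.

```python
import time


def wait_for_crawl(start_job, get_status, fetch_result,
                   poll_interval=2.0, timeout=300.0, sleep=time.sleep):
    """Dispatch a job, poll until it finishes, then return its result.

    The three callables are hypothetical stand-ins for calls to the crawl
    service; they are injected so the loop can be tested without a network.
    """
    job_id = start_job()
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        status = get_status(job_id)
        if status == "completed":   # assumed status value
            return fetch_result(job_id)
        if status == "failed":      # assumed status value
            raise RuntimeError(f"crawl job {job_id} failed")
        sleep(poll_interval)
    raise TimeoutError(f"crawl job {job_id} did not finish within {timeout}s")
```

Bounding the loop with a deadline keeps a stuck job from blocking the caller indefinitely.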

---

## Tested Frameworks & Layouts

kgrab handles HTML layouts generated by diverse tooling (Docusaurus, VitePress, Sphinx, custom, etc.). Here are smoke-test results for the sites below, using `--compact`:

| Framework | Pages* | Sections | Est. Tokens | Size | Status |
|-----------|-------:|---------:|------------:|-----:|:------:|
| FastAPI | 15 | 327 | ~2.4k | 9.4 KB | ✅ |
| Flask | 15 | 115 | ~14.8k | 57.9 KB | ✅ |
| React | 15 | 207 | ~13.8k | 54.2 KB | ✅ |
| Vue.js | 15 | 160 | ~15.4k | 60.4 KB | ✅ |
| Next.js | 15 | 53 | ~3.2k | 12.5 KB | ✅ |
| Pydantic | 15 | 283 | ~17.3k | 67.8 KB | ✅ |
| Tailwind | 15 | 159 | ~7.0k | 27.7 KB | ✅ |
| Stripe | 15 | 168 | ~11.6k | 45.5 KB | ✅ |
| Hono | 15 | 116 | ~7.4k | 29.0 KB | ✅ |

*\* Capped at 15 pages for testing speed. Run with `--full` to scrape all pages.*
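
The Est. Tokens column is consistent with a rough heuristic of about four bytes per token (for example, 12.5 KB corresponds to ~3.2k tokens). The sketch below shows that heuristic; the ratio is inferred from the table rather than taken from kgrab's source, `estimate_tokens` is a hypothetical name, and 1 KB is assumed to be 1024 bytes.

```python
def estimate_tokens(size_kb, bytes_per_token=4):
    """Rough token estimate from file size, assuming ~4 bytes per token."""
    return int(size_kb * 1024 / bytes_per_token)


if __name__ == "__main__":
    # Sizes taken from the Next.js and FastAPI rows of the table above.
    for name, size_kb in [("Next.js", 12.5), ("FastAPI", 9.4)]:
        print(f"{name}: ~{estimate_tokens(size_kb) / 1000:.1f}k tokens")
```

Actual token counts depend on the tokenizer of the model that reads the file, so treat these numbers as an order-of-magnitude guide.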

---

## How It Works

1. **Crawl** — starting from the given URL, kgrab follows internal navigation links (sidebar, next/prev, etc.) and collects all reachable documentation pages under the same domain scope.
2. **Extract** — for each page it extracts headings, body text, and code examples while discarding chrome (nav, footer, scripts).
3. **Generate** — the collected content is assembled into a structured `AGENTS.md` with a table of contents, per-page sections, and inline code blocks.
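
The crawl step can be pictured as a breadth-first traversal restricted to the starting host. The following is a simplified sketch, not kgrab's implementation: `get_links` is a hypothetical callable that returns the hrefs found on a page, and "same domain scope" is approximated here as "same host".

```python
from collections import deque
from urllib.parse import urldefrag, urljoin, urlparse


def in_scope(url, start_url):
    """True if `url` is on the same host as `start_url`."""
    return urlparse(url).netloc == urlparse(start_url).netloc


def crawl(start_url, get_links, max_pages=500):
    """Breadth-first crawl from `start_url`; returns visited URLs in visit order.

    `get_links(url)` is a hypothetical callable returning the hrefs on a page.
    """
    queue = deque([start_url])
    seen = {start_url}
    visited = []
    while queue and len(visited) < max_pages:
        url = queue.popleft()
        visited.append(url)
        for href in get_links(url):
            # Resolve relative links and drop #fragments so anchors do not create duplicates.
            link, _fragment = urldefrag(urljoin(url, href))
            if link not in seen and in_scope(link, start_url):
                seen.add(link)
                queue.append(link)
    return visited
```

In a real crawler, `get_links` would fetch the page and parse its anchors; injecting it keeps the traversal logic independent of the HTTP and HTML layers.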

---

## Development

```bash
git clone https://github.com/Bonhollow/kgrab.git
cd kgrab
python -m venv .venv && source .venv/bin/activate
pip install -e ".[dev]"
pytest -v
```

---

## License

[Apache 2.0](LICENSE)
