Metadata-Version: 2.4
Name: doculift
Version: 0.1.0
Summary: A powerful CLI & web scraper that lifts documentation for Large Language Models.
Author: M.J. Shetty
License: MIT
Project-URL: Homepage, https://github.com/mjshetty/doculift
Classifier: Programming Language :: Python :: 3
Classifier: License :: OSI Approved :: MIT License
Classifier: Operating System :: OS Independent
Classifier: Intended Audience :: Developers
Classifier: Topic :: Utilities
Requires-Python: >=3.10
Description-Content-Type: text/markdown
Requires-Dist: flask>=3.0.0
Requires-Dist: requests
Requires-Dist: beautifulsoup4
Requires-Dist: playwright
Requires-Dist: click
Requires-Dist: rich
Provides-Extra: dev
Requires-Dist: black; extra == "dev"
Requires-Dist: flake8; extra == "dev"
Requires-Dist: bandit; extra == "dev"
Requires-Dist: build; extra == "dev"
Requires-Dist: twine; extra == "dev"

# DocuLift

**DocuLift** is a web scraping tool that lifts documentation websites into clean, aggregated files optimized for LLM-based tools such as Google NotebookLM, Claude, or ChatGPT.

It handles dynamic Single Page Applications (SPAs), respects site structure, and produces output in two modes: full content extraction or URL-only extraction.

---

## Features

- **Two Extract Modes** — choose between extracting full page content or just collecting URLs (see [When to Use Each Mode](#when-to-use-each-mode))
- **Dynamic Content Scraping** — uses Playwright (headless Chromium) to render JavaScript-heavy sites (React, Vue, etc.) before extraction
- **Smart Scoping**:
  - **Section Only** — stays within the folder boundary of the starting URL (e.g. starting at `.../docs/agents/overview` scrapes everything under `.../docs/agents/`)
  - **Entire Domain** — crawls all pages under the target domain
- **Intelligent Aggregation** — combines multiple pages into single files, auto-splits at ~500KB (NotebookLM's per-file limit), generates meaningful filenames
- **Multi-URL Support** — submit multiple starting URLs in one job; each is crawled independently and produces its own output file(s)
- **Per-URL stats** — on completion, the UI shows how many pages or URLs were collected per starting URL
- **Clean Extraction** — removes navigation, footers, sidebars, ads, and scripts; focuses on main content
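
The Section Only rule above can be illustrated with a short sketch. It is not DocuLift's actual implementation: `section_prefix` and `in_scope` are hypothetical names, the whole-domain branch assumes "domain" means the same host as the starting URL, and any `scope_type` other than `"section"` is treated as whole-domain.

```python
from urllib.parse import urlparse


def section_prefix(start_url: str) -> str:
    """Return the folder part of the starting URL's path, ending in '/'."""
    path = urlparse(start_url).path
    # Everything up to and including the last '/' is treated as the section folder.
    return path[: path.rfind("/") + 1] or "/"


def in_scope(candidate: str, start_url: str, scope_type: str = "section") -> bool:
    """Decide whether a discovered link belongs to the crawl (hypothetical helper)."""
    start, cand = urlparse(start_url), urlparse(candidate)
    if cand.netloc != start.netloc:
        return False  # assumption: both modes stay on the starting host
    if scope_type != "section":
        return True  # assumption: any other value means whole-domain
    return cand.path.startswith(section_prefix(start_url))
```

With a starting URL of `https://example.com/docs/agents/overview`, the prefix is `/docs/agents/`, so a link to `/docs/agents/tools/search` is in scope while `/docs/billing/` is not.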

---

## When to Use Each Mode

### Extract Content
Crawls each page and converts its content to Markdown (or text/CSV). Use this when you want to feed documentation directly into an LLM as context.

- **Best for**: NotebookLM, Claude Projects, ChatGPT — any tool that accepts uploaded documents
- **Output**: One or more `.md` files per starting URL, split at ~500KB
- **Typical workflow**: Extract content → upload files to NotebookLM → ask questions
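
To make the ~500KB split concrete, here is a minimal sketch of size-based chunking. It assumes pages are already converted to text and that a file boundary only ever falls between pages; `split_pages` is a hypothetical name, and DocuLift's own buffering may differ.

```python
def split_pages(pages: list[str], limit_bytes: int = 500_000) -> list[str]:
    """Group page texts into chunks of at most limit_bytes (UTF-8).

    A single page larger than the limit is kept whole in its own chunk.
    """
    chunks: list[str] = []
    buffer: list[str] = []
    size = 0
    for page in pages:
        # Count the page plus the two-byte separator used when joining.
        page_size = len(page.encode("utf-8")) + 2
        # Flush the buffer before it would grow past the limit.
        if buffer and size + page_size > limit_bytes:
            chunks.append("\n\n".join(buffer))
            buffer, size = [], 0
        buffer.append(page)
        size += page_size
    if buffer:
        chunks.append("\n\n".join(buffer))
    return chunks
```

Each returned string would become one output file. Measuring encoded bytes instead of characters keeps the limit meaningful for documentation that contains non-ASCII text.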

### Extract URLs Only
Crawls the site and collects every discovered URL within scope, writing them to a plain `.txt` file — one URL per line, no other content.

**Use this when NotebookLM's URL limit is the bottleneck.**

NotebookLM supports adding web URLs as sources, but caps how many you can add per notebook. When a documentation section has hundreds of pages, you'll hit that limit quickly. The recommended three-step workflow is:

1. **Run "Extract URLs Only"** on the target documentation to get a full list of all pages within scope
2. **Review and trim** the URL list down to the most relevant pages
3. **Add the trimmed URLs directly to NotebookLM** as web sources — NotebookLM fetches and indexes them itself, giving you live, citable sources rather than static file uploads

This approach gives you fine-grained control over exactly which pages NotebookLM indexes, without wasting your URL quota on irrelevant pages.
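
The review-and-trim step can be partly automated. The sketch below filters a URL list by substring; `trim_urls` and the file names in the usage comment are hypothetical, and the include/exclude terms are whatever suits your documentation set.

```python
def trim_urls(lines, include=(), exclude=()):
    """Keep unique URLs that contain any `include` term and no `exclude` term.

    An empty `include` keeps everything that is not excluded.
    """
    kept, seen = [], set()
    for line in lines:
        url = line.strip()
        if not url or url in seen:
            continue  # skip blank lines and duplicates
        if include and not any(term in url for term in include):
            continue
        if any(term in url for term in exclude):
            continue
        seen.add(url)
        kept.append(url)
    return kept


# Usage (file names are hypothetical):
# from pathlib import Path
# urls = Path("outputs/urls.txt").read_text().splitlines()
# trimmed = trim_urls(urls, include=["/reference/"], exclude=["/archive/"])
# Path("trimmed.txt").write_text("\n".join(trimmed) + "\n")
```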

---

## Tech Stack

| Layer | Technology |
|---|---|
| Backend | Python 3.10+, Flask |
| Scraping | Playwright (headless Chromium) |
| Parsing | BeautifulSoup4 |
| Frontend | HTML5, CSS (Glassmorphism), Vanilla JS |
| CI/CD | GitHub Actions, Black, Flake8, Bandit |

---

## Continuous Integration (CI/CD)

DocuLift includes a pre-configured GitHub Actions pipeline (`.github/workflows/ci.yml`) that automatically runs on every push and pull request to the `main` or `master` branches.

The pipeline executes the following checks to ensure code quality and security:

1. **Code Formatting (Black)**
   - Automatically checks that all Python files adhere to standard `black` formatting rules.
2. **Linting (Flake8)**
   - Scans for syntax errors, undefined names, and unused imports.
   - Enforces a maximum line length and complexity thresholds.
3. **Security Scanning (Bandit)**
   - Analyzes Python code for common security vulnerabilities.
   - Flags unsafe configurations (e.g., running Flask with `debug=True`).

*Note: The pipeline strictly fails if any high-severity security issues are found, preventing insecure code from being merged.*

---

## Installation

DocuLift is published on PyPI as `doculift`. We recommend installing it in a virtual environment or using `pipx`.

### Prerequisites
- Python 3.10 or higher

### Steps

1. **Install the package via pip**
   ```bash
   pip install doculift
   ```

2. **Install Chromium (required for dynamic page scraping)**
   ```bash
   playwright install chromium
   ```

3. **Start the Web UI**
   ```bash
   doculift ui
   ```
   Open `http://127.0.0.1:5001` in your browser.

---

## Usage

DocuLift is a hybrid tool: you can run it through a web interface or directly from your terminal.

### 1. Web User Interface

Start the local server:
```bash
doculift ui
# or
doculift ui --port 5001
```
Then open `http://127.0.0.1:5001` in your browser.

1. **Enter target URLs** — one per line (e.g. `https://docs.docker.com/reference/`)
2. **Choose Extract Mode** — *Extract Content* or *Extract URLs Only*
3. **Choose Scoping Strategy** — *Section Only* (recommended) or *Entire Domain*
4. **Choose Output Format** — Markdown, Plain Text, or CSV (applies to content mode)
5. **Set Max Pages per URL** — default 500; each starting URL is crawled independently up to this limit
6. **Click "Siphon Content"** and watch the progress bar
7. On completion, per-URL stats are shown and files are available for download

### 2. Command Line Interface (CLI)

Run extraction directly from your terminal; a progress bar tracks the crawl. Files are saved to the `./outputs` folder automatically.

```bash
# See all available commands and options
doculift --help

# See options specific to the scrape command
doculift scrape --help

# Example: Extract full markdown content from a documentation section
doculift scrape https://docs.docker.com/reference/

# Example: Extract only URLs from multiple sources, capped at 1000 pages per starting URL
doculift scrape https://paketo.io/docs/ https://docs.docker.com/ --mode urls --max-pages 1000
```
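
If you want to drive the CLI from a Python script, one option is a thin `subprocess` wrapper. This sketch only uses the `scrape` command and the `--mode` and `--max-pages` flags shown above; `build_scrape_command` and `run_scrape` are hypothetical helper names, and `doculift` is assumed to be on your `PATH`.

```python
import subprocess


def build_scrape_command(urls, mode=None, max_pages=None):
    """Build the argv list for `doculift scrape`; options are added only when given."""
    cmd = ["doculift", "scrape", *urls]
    if mode is not None:
        cmd += ["--mode", mode]
    if max_pages is not None:
        cmd += ["--max-pages", str(max_pages)]
    return cmd


def run_scrape(urls, **options):
    """Run the CLI and raise CalledProcessError if it exits with a non-zero status."""
    return subprocess.run(build_scrape_command(urls, **options), check=True)
```

Passing the arguments as a list avoids shell quoting problems with URLs that contain characters such as `&` or `?`.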

---

## How It Works

```
User submits URLs + config
        ↓
Background thread spawned (one per job)
        ↓
For each starting URL:
  ├── Determine scope (section boundary or full domain)
  ├── BFS crawl with Playwright (handles JS rendering)
  ├── [Content mode] Clean HTML → Markdown, buffer → split files at 500KB
  └── [URL mode] Collect discovered links → single .txt file
        ↓
Per-URL stats displayed, files available for download
```

**Key crawl behaviours:**
- Each starting URL gets an independent BFS with its own visited set — URLs are not cross-contaminated between starting points
- `max_pages` applies per starting URL, not globally
- Pages already scraped by an earlier starting URL in the same job are skipped to avoid duplication
- Fragment URLs (`#anchor`) are normalised and deduplicated
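
The crawl behaviours above can be sketched with a small breadth-first loop. This is an illustration under stated assumptions, not DocuLift's actual code: `crawl`, `get_links`, and `in_scope` are hypothetical names, link discovery is injected as a callable instead of driving Playwright, and the job-wide skip of pages already scraped by an earlier starting URL is left out.

```python
from collections import deque
from urllib.parse import urldefrag


def crawl(start_url, get_links, max_pages=500, in_scope=lambda url: True):
    """Breadth-first crawl from one starting URL; returns pages in visit order.

    `get_links(url)` is a hypothetical callable returning the links found on a page.
    """
    start = urldefrag(start_url).url  # strip any '#anchor' from the start URL
    visited = {start}                 # one visited set per starting URL
    queue = deque([start])
    pages = []
    while queue and len(pages) < max_pages:
        url = queue.popleft()
        pages.append(url)
        for link in get_links(url):
            link = urldefrag(link).url  # '#intro' and '#usage' collapse to one URL
            if link not in visited and in_scope(link):
                visited.add(link)
                queue.append(link)
    return pages
```

Because `visited` is created inside the function, calling `crawl` once per starting URL gives each crawl its own state, and `max_pages` bounds each call separately.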

---

## API

Trigger jobs programmatically:

```bash
curl -X POST http://127.0.0.1:5001/scrape \
  -H "Content-Type: application/json" \
  -d '{
    "urls": ["https://docs.docker.com/reference/", "https://paketo.io/docs/"],
    "format": "md",
    "max_pages": 200,
    "scope_type": "section",
    "extract_mode": "content"
  }'
```

Response:
```json
{ "job_id": "abc123" }
```

Poll for status:
```bash
curl http://127.0.0.1:5001/status/abc123
```

Response fields: `status`, `progress`, `is_finished`, `files`, `per_url_stats`, `urls_extracted`.

Download a file:
```
GET /download/<job_id>/<filename>
```
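
The same submit-and-poll flow can be scripted in Python. The sketch below uses only the standard library and the endpoints and fields shown above; the helper names, the polling interval, and the timeout are assumptions, and error handling is kept minimal.

```python
import json
import time
import urllib.request

BASE_URL = "http://127.0.0.1:5001"  # default address used in the examples above


def build_job(urls, fmt="md", max_pages=200, scope_type="section", extract_mode="content"):
    """Assemble the JSON body for POST /scrape (fields as in the curl example)."""
    return {
        "urls": list(urls),
        "format": fmt,
        "max_pages": max_pages,
        "scope_type": scope_type,
        "extract_mode": extract_mode,
    }


def submit_job(job, base_url=BASE_URL):
    """POST the job and return the job_id from the response."""
    request = urllib.request.Request(
        f"{base_url}/scrape",
        data=json.dumps(job).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(request) as response:
        return json.load(response)["job_id"]


def wait_for_job(job_id, base_url=BASE_URL, interval=2.0, timeout=3600.0):
    """Poll /status/<job_id> until is_finished is true; return the final status."""
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        with urllib.request.urlopen(f"{base_url}/status/{job_id}") as response:
            status = json.load(response)
        if status.get("is_finished"):
            return status
        time.sleep(interval)
    raise TimeoutError(f"job {job_id} did not finish within {timeout} seconds")
```

With the server running, `wait_for_job(submit_job(build_job(["https://docs.docker.com/reference/"])))` returns the final status, whose `files` field lists what can be fetched from `/download/<job_id>/<filename>`.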
