Metadata-Version: 2.4
Name: gleaner
Version: 0.1.2
Summary: Web scraper that finds all pages on a domain
License-Expression: MIT
Description-Content-Type: text/markdown
License-File: LICENSE
Requires-Dist: requests
Requires-Dist: beautifulsoup4
Dynamic: license-file

# gleaner

A simple web scraper that finds all the pages on a domain using breadth-first search.

## Installation

```bash
pipx install gleaner
```

## Usage

```bash
glean https://example.com -o output.txt
```

This will:

- Start at the given URL
- Search each page's HTML for links to pages on the same domain
- Follow those links and repeat the process
- Save all discovered URLs to `output.txt`
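The steps above can be sketched roughly as follows. This is an illustrative sketch, not gleaner's actual source: the names `LinkParser`, `crawl`, and the `fetch` callback are hypothetical, and it uses only the standard library's `html.parser` instead of `requests` and `beautifulsoup4` so that it stays self-contained.

```python
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urldefrag, urljoin, urlparse


class LinkParser(HTMLParser):
    """Collects the href value of every <a> tag (hypothetical helper)."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.hrefs.append(value)


def crawl(start_url, fetch):
    """Breadth-first crawl of one domain.

    `fetch` is an assumed callback: it takes a URL and returns the page's
    HTML as a string, or None if the page could not be retrieved.
    """
    domain = urlparse(start_url).netloc
    seen = {start_url}           # URLs already queued, to avoid revisiting
    queue = deque([start_url])   # FIFO queue gives breadth-first order
    discovered = []

    while queue:
        url = queue.popleft()
        discovered.append(url)
        html = fetch(url)
        if html is None:
            continue

        parser = LinkParser()
        parser.feed(html)
        for href in parser.hrefs:
            # Resolve relative links and drop "#fragment" suffixes.
            link, _fragment = urldefrag(urljoin(url, href))
            if urlparse(link).netloc == domain and link not in seen:
                seen.add(link)
                queue.append(link)

    return discovered
```

In a real run, `fetch` could be a thin wrapper around `requests.get(url).text`, and the returned list could be written to the output file one URL per line.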

## Options

- `-o, --output` - Output file (default: `urls.txt`)

## Limitations

- Only follows `<a>`-tag links
- Doesn't read `robots.txt`
- Doesn't set a custom `User-Agent` header
- Duplication bug
- Doesn't handle HTTP 429 (Too Many Requests) responses
- Doesn't parse `sitemap.xml`
- No export formats beyond `.txt` (such as JSON or CSV)
- Only works for entire domains; you can't glean an individual page and its sub-pages
- Doesn't save progress, which matters for larger sites where scraping might last more than a few minutes
- No concurrent processing
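
As an illustration of what addressing the `robots.txt` limitation might involve, here is a minimal sketch using the standard library's `urllib.robotparser`. It is not part of gleaner: the function name `build_robots_checker` and the `"gleaner"` user-agent string are assumptions made for the example.

```python
from urllib.robotparser import RobotFileParser


def build_robots_checker(robots_txt, user_agent="gleaner"):
    """Return a function that reports whether `user_agent` may fetch a URL.

    `robots_txt` is the already-downloaded text of a site's robots.txt.
    The default user-agent name is a placeholder, not something gleaner sends.
    """
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return lambda url: parser.can_fetch(user_agent, url)
```

A crawler could call the returned function before queueing each link and skip any URL for which it returns `False`.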
