Metadata-Version: 2.4
Name: crawler-to-md
Version: 0.2.2
Summary: Convert web content to Markdown & JSON files to fuel your GPTs and agent AI!
Author-email: "Grégoire Compagnon (obeone)" <obeone@obeone.org>
License-Expression: MIT
Project-URL: Homepage, https://github.com/obeone/crawler-to-md
Project-URL: Bug_Tracker, https://github.com/obeone/crawler-to-md/issues
Classifier: Development Status :: 3 - Alpha
Classifier: Intended Audience :: Developers
Classifier: Topic :: Software Development :: Build Tools
Classifier: Programming Language :: Python :: 3
Classifier: Programming Language :: Python :: 3.10
Classifier: Programming Language :: Python :: 3.11
Classifier: Programming Language :: Python :: 3.12
Requires-Python: >=3.10
Description-Content-Type: text/markdown
Requires-Dist: mdformat==0.7.22
Requires-Dist: mdformat-gfm==0.4.1
Requires-Dist: mdformat_footnote==0.1.1
Requires-Dist: mdformat_frontmatter==2.0.8
Requires-Dist: mdformat_tables==1.0.0
Requires-Dist: requests==2.32.4
Requires-Dist: tqdm==4.67.1
Requires-Dist: markitdown==0.1.2
Requires-Dist: coloredlogs==15.0.1
Requires-Dist: beautifulsoup4==4.13.4
Requires-Dist: argcomplete==3.3.0

# Web Scraper to Markdown 🌐✍️

This Python-based web scraper fetches content from URLs and exports it to Markdown and JSON. It is designed for simplicity and extensibility, and its JSON output is meant to be uploaded to GPT models. It is well suited to anyone who wants to use web content for AI training or analysis. 🤖💡

## 🚀 Quick Start

(Or even better, **[use Docker!](#-docker-support) 🐳**)

### Recommended installation using pipx (isolated environment)

```shell
pipx install crawler-to-md
```

### Alternatively, install with pip

```shell
pip install crawler-to-md
```

Then run the scraper:

```shell
crawler-to-md --url https://www.example.com
```

## 🌟 Features

- Scrapes web pages for content and metadata. 📄
- Filters links by base URL. 🔍
- Excludes URLs containing certain strings. ❌
- Discovers links automatically, or scrapes the URLs listed in a file. 🔗
- Rate limiting and delay support. 🕘
- Exports data to Markdown and JSON, ready for GPT uploads. 📤
- Exports each page as an individual Markdown file if `--export-individual` is used. 📝
- Uses SQLite for efficient data management. 📊
- Configurable via command-line arguments. ⚙️
- Docker support. 🐳
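
To make the two URL filters listed above concrete, here is a minimal Python sketch of how base-URL filtering and exclusion strings could combine. It is an illustration only, not the project's actual code: the function name `should_crawl` is hypothetical, and treating the base URL as a simple prefix match is an assumption.

```python
def should_crawl(url: str, base_url: str, excludes: list[str]) -> bool:
    """Return True if `url` passes both filters (hypothetical helper)."""
    # Assumption: "filter by base URL" means the link must start with base_url.
    if not url.startswith(base_url):
        return False
    # Exclusion: drop the URL if it contains any of the excluded strings.
    return not any(fragment in url for fragment in excludes)


links = [
    "https://www.example.com/docs/intro",
    "https://www.example.com/blog/news",
    "https://other.example.org/docs",
]
kept = [u for u in links if should_crawl(u, "https://www.example.com", ["/blog"])]
print(kept)
```

In this sketch the blog link is dropped by the exclusion string and the third link is dropped because it lives on another host.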

## 📋 Requirements

Python 3.10 or higher is required.

Project dependencies are declared in `pyproject.toml`. From a clone of the repository, install the package and its dependencies with:

```shell
pip install .
```

## 🛠 Usage

Start scraping with the following command:

```shell
crawler-to-md --url <URL> [--output-folder ./output] [--cache-folder ./cache] [--base-url <BASE_URL>] [--exclude <KEYWORD_IN_URL>] [--title <TITLE>] [--urls-file <URLS_FILE>]
```

Options:

- `--url`, `-u`: The starting URL. 🌍
- `--urls-file`: Path to a file containing the URLs to scrape, one per line. Use `-` to read from stdin. 📁
- `--output-folder`, `-o`: Where to save Markdown files (default: `./output`). 📂
- `--cache-folder`, `-c`: Where to store the database (default: `./cache`). 💾
- `--base-url`, `-b`: Filter links by base URL (default: URL's base). 🔎
- `--title`, `-t`: Final title of the Markdown file (default: the URL). 🏷️
- `--exclude`, `-e`: Exclude URLs containing this string (repeatable). ❌
- `--export-individual`, `-ei`: Export each page as an individual Markdown file. 📝
- `--rate-limit`, `-rl`: Maximum number of requests per minute (default: 0, no rate limit). ⏱️
- `--delay`, `-d`: Delay between requests in seconds (default: 0, no delay). 🕒

Either `--url` or `--urls-file` is required.

### 📚 Log level

By default, the `WARN` level is used. You can change it with the `LOG_LEVEL` environment variable.
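
As a sketch of setting the variable for a single run from Python, the snippet below copies the current environment and overrides only `LOG_LEVEL`. The value `INFO` is an assumption: this README only names `WARN`, so check which level names the tool accepts.

```python
import os

# Copy the current environment and override only LOG_LEVEL for the child process.
# "INFO" is an assumed level name; only WARN is named in this README.
env = {**os.environ, "LOG_LEVEL": "INFO"}

cmd = ["crawler-to-md", "--url", "https://www.example.com"]
# To launch the crawl with this environment:
#   subprocess.run(cmd, env=env, check=True)
print(env["LOG_LEVEL"])
```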

## 🐳 Docker Support

Run with Docker:

```shell
docker run --rm -v "$(pwd)/output":/app/output -v cache:/app/cache ghcr.io/obeone/crawler-to-md --url <URL>
```

Build from source:

```shell
docker build -t crawler-to-md .
docker run --rm -v "$(pwd)/output":/app/output crawler-to-md --url <URL>
```

## 🤝 Contributing

Contributions are welcome! Feel free to submit pull requests or open issues. 🌟
