Metadata-Version: 2.4
Name: crawlier
Version: 0.0.0.1
Summary: A compact, multi-threaded web crawler for desktop (PansiluBot) and mobile (MethmiBot) modes
Home-page: https://github.com/yourusername/crawlier
Author: Your Name
Author-email: Your Name <your.email@example.com>
License: MIT
Project-URL: Homepage, https://github.com/yourusername/crawlier
Project-URL: Documentation, https://github.com/yourusername/crawlier#readme
Project-URL: Repository, https://github.com/yourusername/crawlier.git
Project-URL: BugTracker, https://github.com/yourusername/crawlier/issues
Keywords: crawler,web-scraping,multi-threading,bot,seo
Classifier: Development Status :: 3 - Alpha
Classifier: Programming Language :: Python :: 3
Classifier: Programming Language :: Python :: 3.10
Classifier: Programming Language :: Python :: 3.11
Classifier: Programming Language :: Python :: 3.12
Classifier: License :: OSI Approved :: MIT License
Classifier: Operating System :: OS Independent
Classifier: Intended Audience :: Developers
Classifier: Topic :: Internet :: WWW/HTTP
Classifier: Topic :: Software Development :: Libraries :: Python Modules
Requires-Python: >=3.10
Description-Content-Type: text/markdown
License-File: LICENSE
Requires-Dist: requests>=2.28.0
Requires-Dist: beautifulsoup4>=4.11.0
Requires-Dist: lxml>=4.9.0
Requires-Dist: dnspython>=2.2.0
Requires-Dist: tqdm>=4.64.0
Requires-Dist: gradio>=3.20.0
Provides-Extra: dev
Requires-Dist: pytest>=7.2.0; extra == "dev"
Requires-Dist: pytest-cov>=4.0.0; extra == "dev"
Requires-Dist: black>=22.0.0; extra == "dev"
Requires-Dist: flake8>=5.0.0; extra == "dev"
Requires-Dist: mypy>=0.991; extra == "dev"
Dynamic: author
Dynamic: home-page
Dynamic: license-file
Dynamic: requires-python

# Crawlier — Alpha v0.0.0.1

> A compact, friendly README for the Crawlier project (PansiluBot / MethmiBot).

---

## What is Crawlier?

Crawlier is a Python web crawler with a desktop mode (PansiluBot) and a mobile mode (MethmiBot). It focuses on configurable crawling and extracts metadata, links, forms, images, and files, with technology detection and optional CAPTCHA integration.

## Features

- Multi-threaded crawling
- Configurable crawl depth and per-request delay
- Optional respect for `robots.txt` (toggleable)
- Keyword, metadata, image, video, form, file, and social-link extraction
- Technology detection and basic SEO analysis
- Outputs to JSON, CSV and an SQLite database
- Optional CAPTCHA solver integration (requires external service)
- Live logging support (suitable for UIs like Gradio)
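
To make the `robots.txt` toggle concrete, here is a minimal sketch built on the standard library's `urllib.robotparser`. It is not Crawlier's actual implementation: the `is_allowed` helper, its `respect_robots` flag, and the sample rules are assumptions made for illustration.

```python
from urllib.robotparser import RobotFileParser

# Sample rules, made up for this illustration.
SAMPLE_ROBOTS = """\
User-agent: *
Disallow: /private/
"""


def is_allowed(robots_text: str, user_agent: str, url: str,
               respect_robots: bool = True) -> bool:
    """Hypothetical helper: may `user_agent` fetch `url` under these rules?"""
    if not respect_robots:
        return True  # toggle off: skip the check entirely
    parser = RobotFileParser()
    parser.parse(robots_text.splitlines())
    return parser.can_fetch(user_agent, url)


print(is_allowed(SAMPLE_ROBOTS, "PansiluBot", "https://example.com/private/page"))  # False
print(is_allowed(SAMPLE_ROBOTS, "PansiluBot", "https://example.com/about"))         # True
```

A real crawler would download each site's `robots.txt` once and cache the parsed rules instead of parsing a string on every call.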

---

## Quick Install

Clone the repo and install dependencies:

```bash
git clone https://github.com/yourusername/crawlier.git
cd crawlier
pip install -r requirements.txt
```

Run tests (optional):

```bash
python -m unittest discover tests
```

Note: Python 3.10 or newer is required.

---

## CLI Usage

Run a basic crawl:

```bash
python -m crawlier -d example.com -m pc
```

Deep crawl (mobile UA, many threads):

```bash
python -m crawlier -d example.com -m mobile -t 20 --delay 0.5 --depth 5
```

Ignore `robots.txt` (only if you have permission):

```bash
python -m crawlier -d example.com -m pc --no-robots
```

CLI options summary:

- `-d, --domain` : Target domain to crawl (required)
- `-m, --mode` : Crawler mode (`mobile` / `pc`), default: `pc`
- `-t, --threads` : Max concurrent threads (default: 10)
- `--delay` : Delay between requests in seconds (default: 1.0)
- `--depth` : Maximum crawl depth (default: 3)
- `--no-robots` : Ignore `robots.txt` rules
- `--captcha-key` : API key for CAPTCHA solving service
- `-o, --output` : Output file (default: `crawl_results.json`)
- `--db` : SQLite database file (default: `crawl_data.db`)
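
The options and defaults listed above map naturally onto `argparse`. The following is a sketch of what such a parser could look like, assuming only the flags and defaults documented in this README. The `build_parser` function is hypothetical and may differ from the project's real CLI code.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Hypothetical parser mirroring the documented flags and defaults."""
    parser = argparse.ArgumentParser(prog="crawlier")
    parser.add_argument("-d", "--domain", required=True, help="Target domain to crawl")
    parser.add_argument("-m", "--mode", choices=["mobile", "pc"], default="pc")
    parser.add_argument("-t", "--threads", type=int, default=10)
    parser.add_argument("--delay", type=float, default=1.0)
    parser.add_argument("--depth", type=int, default=3)
    parser.add_argument("--no-robots", action="store_true")
    parser.add_argument("--captcha-key", default=None)
    parser.add_argument("-o", "--output", default="crawl_results.json")
    parser.add_argument("--db", default="crawl_data.db")
    return parser


# Flags that are not passed fall back to the documented defaults.
args = build_parser().parse_args(["-d", "example.com", "-m", "mobile", "-t", "20"])
print(args.domain, args.mode, args.threads, args.delay, args.depth, args.no_robots)
# example.com mobile 20 1.0 3 False
```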

---

## Python API (programmatic)

Use `run_crawl()` for live logs. It is a generator that yields log lines as the crawl progresses:

```python
from crawlier import run_crawl

for log in run_crawl("https://example.com", mode="pc", max_threads=5, delay=1, max_depth=2):
    print(log)
```

Use the `Crawlier` class directly for fine-grained control:

```python
from crawlier import Crawlier

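# Configure the crawler.
crawler = Crawlier(
    target_domain="example.com",
    mode="pc",                # "pc" (PansiluBot) or "mobile" (MethmiBot)
    max_threads=5,
    delay=1,                  # seconds between requests
    max_depth=2,
    respect_robots=True,      # set to False only if you have permission
    captcha_solver=None,      # no CAPTCHA solving service configured
    db_file="crawl_data.db",
)

crawler.start_crawl()                        # run the crawl
crawler.save_results("crawl_results.json")   # write the JSON output
crawler.close()                              # shut the crawler down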
```
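
If `start_crawl()` raises partway through, the `close()` call in the example above is never reached. One way to guard against that is `contextlib.closing`. This assumes `close()` is safe to call after a failed crawl, which this README does not confirm. `StubCrawler` is a stand-in that only borrows the method names, so the sketch runs without the package installed.

```python
from contextlib import closing


class StubCrawler:
    """Stand-in using the same method names as the Crawlier example."""

    def __init__(self) -> None:
        self.closed = False

    def start_crawl(self) -> None:
        raise RuntimeError("simulated network failure")

    def close(self) -> None:
        self.closed = True


crawler = StubCrawler()
try:
    # closing() calls crawler.close() on exit, even when the body raises.
    with closing(crawler) as c:
        c.start_crawl()
except RuntimeError as exc:
    print(f"crawl failed: {exc}")

print(crawler.closed)  # True
```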

---

## Output

Crawlier writes:

- JSON output (default `crawl_results.json`)
- CSV export of URL details
- An SQLite DB (default `crawl_data.db`)
- A human-friendly text report alongside JSON (if enabled)

If you point the output at a file inside a directory (for example `output/`), make sure that directory exists before running.
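
Creating the output directory up front is easy to script. The sketch below uses `pathlib`. The `prepare_output_path` helper is a hypothetical convenience, not part of Crawlier, and the demo runs in a temporary directory so it leaves nothing behind.

```python
import tempfile
from pathlib import Path


def prepare_output_path(path: str) -> Path:
    """Create the parent directory of `path` if needed and return it as a Path."""
    target = Path(path)
    target.parent.mkdir(parents=True, exist_ok=True)
    return target


with tempfile.TemporaryDirectory() as tmp:
    out = prepare_output_path(f"{tmp}/output/crawl_results.json")
    print(out.parent.is_dir())  # True
```

The returned path can then be passed wherever an output file name is expected, such as the `-o` option.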

---

## Example: `run_crawl()` with mobile UA

```python
from crawlier import run_crawl

for log in run_crawl(
    url="https://example.com",
    mode="mobile",
    max_threads=10,
    delay=0.5,
    max_depth=3,
    ignore_robots=False
):
    print(log)
```

---

## Changelog — Alpha v0.0.0.1

- Initial release
- Supports PC (PansiluBot) and Mobile (MethmiBot) modes
- Multi-threaded crawling with configurable depth/delay
- Toggle for respecting `robots.txt`
- Optional CAPTCHA solving (requires external service)
- Live logging for UI integration (Gradio)

---

## Notes & Tips

- Use conservative `max_threads` / `delay` values on large sites to avoid rate-limiting.
- Respect each site's terms of service and `robots.txt`; ignore `robots.txt` only when you have permission to do so.
- For CAPTCHA handling or browser-level JS challenges, integrate a headless browser or external solver.
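
To illustrate how `max_threads` and `delay` interact, here is a sketch of a throttle shared by all worker threads so that requests stay at least `delay` seconds apart. This README does not say whether Crawlier applies its delay per thread or globally, so treat the `Throttle` class as one possible design rather than a description of the library.

```python
import threading
import time


class Throttle:
    """Enforce a minimum gap between requests across all worker threads."""

    def __init__(self, delay: float) -> None:
        self.delay = delay
        self._lock = threading.Lock()
        self._next_allowed = 0.0

    def wait(self) -> None:
        # Reserve the next free slot under the lock, then sleep outside it
        # so other threads can reserve their own slots in the meantime.
        with self._lock:
            now = time.monotonic()
            slot = max(now, self._next_allowed)
            self._next_allowed = slot + self.delay
        time.sleep(max(0.0, slot - now))


throttle = Throttle(delay=0.05)
start = time.monotonic()
workers = [threading.Thread(target=throttle.wait) for _ in range(4)]
for worker in workers:
    worker.start()
for worker in workers:
    worker.join()
elapsed = time.monotonic() - start
print(f"4 throttled requests took {elapsed:.2f}s")
```

Each worker would call `throttle.wait()` just before sending a request. With four workers and a 0.05 s delay, the last slot falls three delays after the first.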

