Metadata-Version: 2.4
Name: ankur_scraper
Version: 0.1.6
Summary: A robust website scraper that supports static and dynamic pages with intelligent content extraction.
Home-page: https://github.com/AnkurSolutions/ankur-scraper
Author: Ankur Dev
Author-email: Ankur Dev <Dev@ankursolutions.com>
License: MIT
Project-URL: Homepage, https://github.com/AnkurSolutions/ankur-scraper
Project-URL: Repository, https://github.com/AnkurSolutions/ankur-scraper
Project-URL: Issues, https://github.com/AnkurSolutions/ankur-scraper/issues
Project-URL: Documentation, https://github.com/AnkurSolutions/ankur-scraper/blob/main/README.md
Keywords: web scraping,crawler,html extraction,dynamic content,playwright
Classifier: Development Status :: 4 - Beta
Classifier: Intended Audience :: Developers
Classifier: License :: OSI Approved :: MIT License
Classifier: Programming Language :: Python :: 3
Classifier: Programming Language :: Python :: 3.8
Classifier: Programming Language :: Python :: 3.9
Classifier: Programming Language :: Python :: 3.10
Classifier: Programming Language :: Python :: 3.11
Classifier: Programming Language :: Python :: 3.12
Classifier: Topic :: Internet :: WWW/HTTP
Classifier: Topic :: Software Development :: Libraries :: Python Modules
Requires-Python: >=3.8
Description-Content-Type: text/markdown
License-File: LICENSE
Requires-Dist: requests>=2.31.0
Requires-Dist: beautifulsoup4>=4.12.0
Requires-Dist: lxml>=4.9.0
Requires-Dist: playwright>=1.40.0
Requires-Dist: tldextract>=5.0.0
Requires-Dist: rich>=13.0.0
Requires-Dist: httpx>=0.24.0
Provides-Extra: dev
Requires-Dist: pytest>=7.4.0; extra == "dev"
Requires-Dist: pytest-cov>=4.1.0; extra == "dev"
Requires-Dist: black>=23.0.0; extra == "dev"
Requires-Dist: flake8>=6.0.0; extra == "dev"
Requires-Dist: mypy>=1.5.0; extra == "dev"
Dynamic: author
Dynamic: home-page
Dynamic: license-file
Dynamic: requires-python

# 🕷️ Ankur Scraper

**Ankur Scraper** is a modular, production-ready website scraping tool built with Python.
It crawls websites and extracts structured content, including from dynamic pages rendered with JavaScript, and saves the results as JSON.

---

## 🚀 Features

* ✅ Crawl internal links (with max depth)
* ✅ Extract visible, structured text (section-wise)
* ✅ Supports static and dynamic (JS-rendered) pages
* ✅ Respects `robots.txt`
* ✅ CLI interface with arguments
* ✅ Logs to file and to a rich-colored terminal
* ✅ Testable, extensible, and publishable as a Python package
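
Two of the features above, internal-link filtering and `robots.txt` checks, can be illustrated with the standard library alone. This is a minimal sketch, not the package's actual implementation: the helper names `is_internal` and `build_robots_parser` are hypothetical, and the host comparison is a simplified stand-in for the `tldextract`-based domain filtering the package lists as a dependency.

```python
from urllib.parse import urljoin, urlparse
from urllib.robotparser import RobotFileParser


def is_internal(base_url: str, link: str) -> bool:
    """Return True if `link` resolves to the same host as `base_url`."""
    absolute = urljoin(base_url, link)  # resolve relative links first
    return urlparse(absolute).netloc == urlparse(base_url).netloc


def build_robots_parser(robots_txt: str) -> RobotFileParser:
    """Parse the text of a robots.txt file that was already downloaded."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser


# Hand-written robots.txt content, used here only for illustration.
robots = build_robots_parser("User-agent: *\nDisallow: /private/\n")

print(is_internal("https://example.com/", "/about"))
print(is_internal("https://example.com/", "https://other.org/"))
print(robots.can_fetch("ankur-scraper", "https://example.com/private/data"))
```

A crawler built this way would queue a link only when both checks pass and the current depth is below the configured maximum.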

---

## 📦 Project Structure

```text
ankur_scraper/
├── ankur_scraper
│   ├── core/              # Crawling and extraction logic
│   ├── logs/              # Package-level logs
│   ├── cli.py             # Command-line interface
│   ├── dispatcher.py      # Orchestrates scraper execution
│   └── _version.py
├── tests/                 # Unit & integration tests
├── docs/                  # 📖 Documentation (per-module)
├── logging_config.py
├── setup.py
├── requirements.txt
└── README.md
```

### 📖 Module Documentation

* [Core Scrapers (`core/`)](docs/core.md)
* [Command Line Interface (`cli.py`)](docs/cli.md)
* [Dispatcher (`dispatcher.py`)](docs/dispatcher.md)
* [Logging System](docs/logging.md)
* [Testing Guide](docs/testing.md)

---

## 🔧 Installation

Clone the project and install dependencies:

```bash
git clone https://github.com/AnkurSolutions/ankur-scraper.git
cd ankur-scraper
python -m venv venv
source venv/bin/activate  # or venv\Scripts\activate on Windows

pip install -r requirements.txt

# Install Playwright drivers (for dynamic scraping)
playwright install
```

---

## 🕹️ Usage (CLI)

```bash
python -m ankur_scraper.cli \
  --url "https://example.com" \
  --depth 1 \
  --dynamic \
  --timeout 10
```

### Command Line Options

| Option      | Description                         |
| ----------- | ----------------------------------- |
| `--url`     | Starting URL to scrape (required)   |
| `--depth`   | How deep to crawl within the domain |
| `--dynamic` | Use dynamic scraping (Playwright)   |
| `--timeout` | Timeout for each page (seconds)     |
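
For readers who want to see how the four options in the table fit together, here is a minimal `argparse` sketch. It is an illustration rather than the contents of `cli.py`: the `build_parser` helper is hypothetical, and the default values for `--depth` and `--timeout` are assumptions, since the table does not state any.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Build a parser exposing the four documented options."""
    parser = argparse.ArgumentParser(prog="ankur-scraper")
    parser.add_argument("--url", required=True, help="Starting URL to scrape")
    # The defaults below are illustrative assumptions, not documented values.
    parser.add_argument("--depth", type=int, default=1,
                        help="How deep to crawl within the domain")
    parser.add_argument("--dynamic", action="store_true",
                        help="Use dynamic scraping (Playwright)")
    parser.add_argument("--timeout", type=int, default=10,
                        help="Timeout for each page (seconds)")
    return parser


args = build_parser().parse_args(
    ["--url", "https://example.com", "--depth", "2", "--dynamic"]
)
print(args.url, args.depth, args.dynamic, args.timeout)
```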

---

## 📝 Usage Examples

```bash
# Basic usage
ankur-scraper --url https://example.com

# With a crawl depth of 2
ankur-scraper --url https://example.com --depth 2

# With dynamic scraping and timeout
ankur-scraper --url https://example.com --dynamic --timeout 30
```

The scraper can also be called from Python:

```python
from ankur_scraper.dispatcher import run_scraper

data = run_scraper("https://example.com/", depth=1)

# `data` is a dictionary with the following shape:
# {
#   "data": [
#     {
#       "content": "...",
#       "metadata": {...}
#     }
#   ],
#   "summary": {
#     "successful_pages": <success count>,
#     "failed_pages": <failure count>,
#     "total_sections": <total section count>
#   }
# }
```
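
Once a result in the shape shown above is available, it can be post-processed like any other dictionary. The sketch below counts extracted sections per page. It runs on a hand-written sample rather than real scraper output, the `sections_per_page` helper is hypothetical, and it assumes each item's metadata carries a `source_url` key as described under Output Format.

```python
from collections import Counter
from typing import Any, Dict


def sections_per_page(result: Dict[str, Any]) -> Counter:
    """Count extracted sections per source URL."""
    return Counter(item["metadata"]["source_url"] for item in result["data"])


# Hand-written sample in the documented shape (not real scraper output).
sample = {
    "data": [
        {"content": "Welcome",
         "metadata": {"section": "Home", "source_url": "https://example.com/"}},
        {"content": "Our story",
         "metadata": {"section": "About Us", "source_url": "https://example.com/about"}},
        {"content": "Our team",
         "metadata": {"section": "Team", "source_url": "https://example.com/about"}},
    ],
    "summary": {"successful_pages": 2, "failed_pages": 0, "total_sections": 3},
}

counts = sections_per_page(sample)
for url, n in counts.items():
    print(f"{url}: {n} section(s)")
print("failed pages:", sample["summary"]["failed_pages"])
```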

---

## 🧪 Running Tests

Unit tests:

```bash
pytest tests/ --tb=short
```

Integration tests (live web):

```bash
pytest tests/test_integration.py -m integration
```

Register the `integration` marker in `pytest.ini`:

```ini
[pytest]
markers =
    integration: mark tests as integration
```

---

## 📄 Output Format

Every page section is saved as a structured object:

```json
{
  "content": "Text content here...",
  "metadata": {
    "section": "About Us",
    "source_url": "https://example.com/about",
    "extraction_time": "2025-07-14 12:34:56"
  }
}
```
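
As a rough illustration of how such a record could be assembled, the sketch below builds one section object and serializes it. The `make_section` helper is hypothetical and not part of the package; the timestamp format is inferred from the example value above.

```python
import json
from datetime import datetime
from typing import Any, Dict, Optional


def make_section(content: str, section: str, source_url: str,
                 when: Optional[datetime] = None) -> Dict[str, Any]:
    """Build one section record in the documented output shape."""
    when = when or datetime.now()
    return {
        "content": content,
        "metadata": {
            "section": section,
            "source_url": source_url,
            # Format inferred from the sample value "2025-07-14 12:34:56".
            "extraction_time": when.strftime("%Y-%m-%d %H:%M:%S"),
        },
    }


record = make_section(
    "Text content here...",
    "About Us",
    "https://example.com/about",
    when=datetime(2025, 7, 14, 12, 34, 56),
)
print(json.dumps(record, indent=2))
```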

---

## 📚 Logging

Logs are written to both the terminal and the following files:

* `logs/info.log` → general operations
* `logs/error.log` → failed links and errors
* `logs/general.log` → warnings, summaries

Terminal output is **rich-colored** with emojis and timestamps.
(See [Logging Docs](docs/logging.md) for details.)
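
The three-file split can be approximated with the standard `logging` module. The sketch below is an assumption about how records might be routed, not a copy of `logging_config.py`: it sends INFO records to `info.log`, WARNING records to `general.log`, and ERROR and above to `error.log`. The logger name and helper functions are hypothetical, and terminal output (for example via `rich.logging.RichHandler`) is left out to keep the example dependency-free.

```python
import logging
import tempfile
from pathlib import Path


def level_range(low: int, high: int):
    """Build a filter accepting records with low <= levelno <= high."""
    def _filter(record: logging.LogRecord) -> bool:
        return low <= record.levelno <= high
    return _filter


def configure_logging(log_dir: Path) -> logging.Logger:
    """Attach one file handler per log file, each limited to a level range."""
    log_dir.mkdir(parents=True, exist_ok=True)
    logger = logging.getLogger("ankur_scraper_demo")  # hypothetical name
    logger.setLevel(logging.INFO)
    logger.propagate = False
    # Assumed routing; the real configuration may differ.
    routes = {
        "info.log": (logging.INFO, logging.INFO),
        "general.log": (logging.WARNING, logging.WARNING),
        "error.log": (logging.ERROR, logging.CRITICAL),
    }
    formatter = logging.Formatter("%(asctime)s %(levelname)s %(message)s")
    for filename, (low, high) in routes.items():
        handler = logging.FileHandler(log_dir / filename, encoding="utf-8")
        handler.addFilter(level_range(low, high))
        handler.setFormatter(formatter)
        logger.addHandler(handler)
    return logger


# Write to a temporary directory so the example leaves no files behind in the project.
log_dir = Path(tempfile.mkdtemp()) / "logs"
logger = configure_logging(log_dir)
logger.info("Scraped https://example.com/")
logger.warning("Skipped a page disallowed by robots.txt")
logger.error("Failed to fetch https://example.com/broken")
```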

---

## 📦 Packaging & Publishing

This scraper is structured as a pip-installable package.

Install locally:

```bash
pip install .
```

Run from anywhere:

```bash
ankur-scraper --url "https://example.com"
```

### Publish to PyPI

```bash
pip install build twine
python -m build
twine upload dist/*
```

⚠️ Update version in `setup.py` before publishing!

---

## 📌 Dependencies

* `httpx` – Fast, async-capable HTTP requests
* `requests` – HTTP requests
* `beautifulsoup4` + `lxml` – HTML parsing
* `tldextract` – Domain filtering
* `playwright` – JS rendering (headless)
* `rich` – Beautiful terminal output
* `pytest` – Testing

---

## 🤝 Contributing

1. Fork the repo
2. Make changes in a branch
3. Run tests: `pytest`
4. Submit a PR

---

## 🧠 Future Roadmap

* Save to Markdown or plaintext
* URL exclusion filters
* Config file / ENV mode
* Docker support
* CI/CD pipeline

---

## 🧑‍💻 Maintainer

Made with ❤️ by **Ankur Global Solutions**
