Metadata-Version: 2.2
Name: scrapegen
Version: 0.1.0
Summary: AI-driven web scraping framework
Author-email: Affan Shaikhsurab <affanshaikhsurab@gmail.com>
License: MIT
Project-URL: Homepage, https://github.com/affanshaikhsurab/scrapegen
Project-URL: Documentation, https://github.com/affanshaikhsurab/scrapegen/docs
Project-URL: Repository, https://github.com/affanshaikhsurab/scrapegen
Project-URL: Issues, https://github.com/affanshaikhsurab/scrapegen/issues
Keywords: AI,bing,search,scraper,web scraping,automation
Classifier: Development Status :: 3 - Alpha
Classifier: Intended Audience :: Developers
Classifier: License :: OSI Approved :: MIT License
Classifier: Programming Language :: Python :: 3
Classifier: Programming Language :: Python :: 3.7
Classifier: Programming Language :: Python :: 3.8
Classifier: Programming Language :: Python :: 3.9
Classifier: Programming Language :: Python :: 3.10
Classifier: Programming Language :: Python :: 3.11
Classifier: Operating System :: OS Independent
Requires-Python: >=3.7
Description-Content-Type: text/markdown
License-File: LICENSE
Requires-Dist: requests>=2.26.0
Requires-Dist: beautifulsoup4>=4.9.3
Requires-Dist: langchain-google-genai>=0.0.3
Requires-Dist: pydantic>=2.0.0
Requires-Dist: lxml>=4.9.0
Dynamic: requires-python

# ScrapeGen

<img src="https://github.com/user-attachments/assets/2f458a05-66f9-47a4-bc40-6069e3c9e849" alt="Logo" width="80" height="80">

ScrapeGen 🚀 is a Python library that combines web scraping with AI-driven data extraction to collect and structure information from websites. It uses Google's Gemini models to turn page content into structured data and provides a configurable framework for scraping operations.

## ✨ Features

- **🤖 AI-Powered Data Extraction**: Utilizes Google's Gemini models for intelligent parsing.
- **⚙️ Configurable Web Scraping**: Supports depth control and flexible extraction rules.
- **📊 Structured Data Modeling**: Uses Pydantic for well-defined data structures.
- **🛡️ Robust Error Handling**: Implements retry mechanisms and detailed error reporting.
- **🔧 Customizable Scraping Configurations**: Adjust settings dynamically based on needs.
- **🌐 Comprehensive URL Handling**: Supports both relative and absolute URLs.
- **📦 Modular Architecture**: Ensures clear separation of concerns for maintainability.

## 📥 Installation

```bash
pip install scrapegen
```

## 📌 Requirements

- Python 3.7+
- Google API Key (for Gemini models)
- Required Python packages (installed automatically by pip):
  - requests >= 2.26.0
  - beautifulsoup4 >= 4.9.3
  - langchain-google-genai >= 0.0.3
  - pydantic >= 2.0.0
  - lxml >= 4.9.0

## 🚀 Quick Start

```python
from scrapegen import ScrapeGen, CompanyInfo, CompaniesInfo

# Initialize ScrapeGen with your Google API key
scraper = ScrapeGen(api_key="your-google-api-key", model="gemini-1.5-pro")

# Define the target URL
url = "https://example.com"

# Scrape and extract company information
companies_data = scraper.scrape(url, CompaniesInfo)

# Display extracted data
for company in companies_data.companies:
    print(f"🏢 Company Name: {company.company_name}")
    print(f"📄 Description: {company.company_description}")
```

## ⚙️ Configuration

### 🔹 ScrapeConfig Options

```python
from scrapegen import ScrapeConfig

config = ScrapeConfig(
    max_pages=20,      # Max pages to scrape per depth level
    max_subpages=2,    # Max subpages to scrape per page
    max_depth=1,       # Max depth to follow links
    timeout=30,        # Request timeout in seconds
    retries=3,         # Number of retry attempts
    user_agent="Mozilla/5.0 (compatible; ScrapeGen/1.0)",
    headers=None       # Additional HTTP headers
)
```

### 🔄 Updating Configuration

```python
scraper = ScrapeGen(api_key="your-api-key", model="gemini-1.5-pro", config=config)

# Dynamically update configuration
scraper.update_config(max_pages=30, timeout=45)
```

## 📌 Custom Data Models

Define Pydantic models to structure extracted data:

```python
from pydantic import BaseModel
from typing import Optional, List

class CustomDataModel(BaseModel):
    title: str
    description: Optional[str] = None  # Pydantic v2 treats the field as required without a default
    date: str
    tags: List[str]

class CustomDataCollection(BaseModel):
    items: List[CustomDataModel]

# Scrape using the custom model
data = scraper.scrape(url, CustomDataCollection)
```
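
To see what such a model enforces independently of ScrapeGen, the following minimal sketch validates a hand-written dictionary with Pydantic v2's `model_validate`. The payload is invented for illustration, and how ScrapeGen passes model output to Pydantic internally is not shown here.

```python
from typing import List, Optional

from pydantic import BaseModel, ValidationError


class CustomDataModel(BaseModel):
    title: str
    description: Optional[str] = None  # may be missing on the source page
    date: str
    tags: List[str]


class CustomDataCollection(BaseModel):
    items: List[CustomDataModel]


# Hypothetical payload, shaped like what an extractor might return.
payload = {
    "items": [
        {"title": "Launch notes", "date": "2024-01-15", "tags": ["release"]},
    ]
}

collection = CustomDataCollection.model_validate(payload)
first = collection.items[0]

# A payload that lacks required fields is rejected with a ValidationError.
try:
    CustomDataCollection.model_validate({"items": [{"title": "No date"}]})
    rejected = False
except ValidationError:
    rejected = True
```

Keeping fields that are often absent on real pages optional, as `description` is here, reduces the chance that one incomplete record fails the whole collection.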

## 🤖 Supported Gemini Models

- gemini-1.5-flash-8b
- gemini-1.5-pro
- gemini-2.0-flash-exp
- gemini-1.5-flash

## ⚠️ Error Handling

ScrapeGen provides specific exception classes for detailed error handling:

- **❗ ScrapeGenError**: Base exception class.
- **⚙️ ConfigurationError**: Errors related to scraper configuration.
- **🕷️ ScrapingError**: Issues encountered during web scraping.
- **🔍 ExtractionError**: Problems with AI-driven data extraction.

Example usage:

```python
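# Assumption: the exception classes are exported from the package top level.
from scrapegen import ConfigurationError, ExtractionError, ScrapingError

try:
    data = scraper.scrape(url, CustomDataCollection)
except ConfigurationError as e:
    print(f"⚙️ Configuration error: {e}")
except ScrapingError as e:
    print(f"🕷️ Scraping error: {e}")
except ExtractionError as e:
    print(f"🔍 Extraction error: {e}")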
```

## 🏗️ Architecture

ScrapeGen follows a modular design for scalability and maintainability:

1. **🕷️ WebsiteScraper**: Handles core web scraping logic.
2. **📑 InfoExtractorAi**: Performs AI-driven content extraction.
3. **🤖 LlmManager**: Manages interactions with language models.
4. **🔗 UrlParser**: Parses and normalizes URLs.
5. **📥 ContentExtractor**: Extracts structured data from HTML elements.
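
As an illustration of the kind of work a URL parsing component does, the sketch below resolves relative links against a base page and filters out external hosts using only the standard library. The helper names `normalize_link` and `same_site` are hypothetical and are not taken from ScrapeGen's `UrlParser`.

```python
from urllib.parse import urldefrag, urljoin, urlparse


def normalize_link(base_url: str, href: str) -> str:
    """Resolve href against base_url and drop any #fragment."""
    absolute = urljoin(base_url, href)
    without_fragment, _ = urldefrag(absolute)
    return without_fragment


def same_site(base_url: str, candidate: str) -> bool:
    """Return True when both URLs share a host, a common crawl filter."""
    return urlparse(base_url).netloc == urlparse(candidate).netloc


base = "https://example.com/companies/index.html"
links = ["about.html", "/contact", "https://other.example.org/x", "team.html#staff"]

# Relative and absolute links end up in one canonical, absolute form.
resolved = [normalize_link(base, href) for href in links]

# Only links on the same host as the base page are kept for crawling.
internal = [u for u in resolved if same_site(base, u)]
```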

## ✅ Best Practices

### 1️⃣ Rate Limiting

- ⏳ Use delays between requests.
- 📜 Respect robots.txt guidelines.
- ⚖️ Configure max_pages and max_depth responsibly.
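
The points above can be sketched with the standard library's `urllib.robotparser`. This is a hedged example, not part of ScrapeGen: the robots.txt content, the `ScrapeGen` user agent string, and the `crawl` helper are all assumptions made for illustration, and a real crawler would fetch robots.txt from the target site.

```python
import time
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content, inlined so the example runs offline.
ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
Crawl-delay: 2
"""

USER_AGENT = "ScrapeGen"  # assumed user agent token

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())


def allowed_urls(urls):
    """Keep only the URLs that robots.txt permits for USER_AGENT."""
    return [u for u in urls if parser.can_fetch(USER_AGENT, u)]


def polite_delay(default: float = 1.0) -> float:
    """Use the site's Crawl-delay if it declares one, else a default."""
    delay = parser.crawl_delay(USER_AGENT)
    return float(delay) if delay is not None else default


def crawl(urls, fetch, sleep=time.sleep):
    """Fetch each permitted URL, pausing between consecutive requests."""
    results = []
    for index, url in enumerate(allowed_urls(urls)):
        if index > 0:
            sleep(polite_delay())
        results.append(fetch(url))
    return results
```

Passing `sleep` as a parameter keeps the pause easy to replace in tests; in production the default `time.sleep` applies.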

### 2️⃣ Error Handling

- 🔄 Wrap scraping operations in try-except blocks.
- 📋 Implement proper logging for debugging.
- 🔁 Handle network timeouts and retries effectively.
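
A minimal sketch of these practices follows: a wrapper that logs each failure and retries with exponential backoff. The `with_retries` helper is hypothetical, and the exception types in `retry_on` are placeholders, since this README does not state which exceptions ScrapeGen raises on network timeouts.

```python
import logging
import time

logging.basicConfig(level=logging.INFO)
logger = logging.getLogger("scrape_job")


def with_retries(operation, retries=3, base_delay=1.0, sleep=time.sleep,
                 retry_on=(TimeoutError, ConnectionError)):
    """Call operation(), retrying transient failures with exponential backoff."""
    for attempt in range(1, retries + 1):
        try:
            return operation()
        except retry_on as exc:
            if attempt == retries:
                # Out of attempts: log and let the caller see the error.
                logger.error("Giving up after %d attempts: %s", attempt, exc)
                raise
            delay = base_delay * 2 ** (attempt - 1)  # 1s, 2s, 4s, ...
            logger.warning("Attempt %d failed (%s); retrying in %.1fs",
                           attempt, exc, delay)
            sleep(delay)
```

A call such as `with_retries(lambda: scraper.scrape(url, CustomDataCollection))` would wrap a whole scrape. Because `ScrapeConfig` already has its own `retries` setting, an outer wrapper like this is mainly useful for failures that setting does not cover.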

### 3️⃣ Resource Management

- 🖥️ Monitor memory usage for large-scale operations.
- 📚 Implement pagination for large datasets.
- ⏱️ Adjust timeout settings based on expected response times.
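
One way to keep memory bounded on large jobs is to process URLs in fixed-size batches and persist each batch's results before moving on. The `batched` helper below is a standard-library sketch written for this README, not a ScrapeGen function.

```python
from itertools import islice
from typing import Iterable, Iterator, List


def batched(urls: Iterable[str], size: int) -> Iterator[List[str]]:
    """Yield successive lists of at most `size` URLs from any iterable."""
    if size < 1:
        raise ValueError("size must be at least 1")
    iterator = iter(urls)
    while True:
        batch = list(islice(iterator, size))
        if not batch:
            return
        yield batch


# Example: five URLs split into batches of two.
pages = [f"https://example.com/page/{n}" for n in range(1, 6)]
batches = list(batched(pages, 2))
```

Because `batched` is a generator, only one batch is held in memory at a time when the input is itself a lazy iterable.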

## 🤝 Contributing

Contributions are welcome! 🎉 Feel free to submit a Pull Request to improve ScrapeGen.
