Metadata-Version: 2.4
Name: ScraperSage
Version: 1.2.2
Summary: A comprehensive web scraping and content summarization library with explicit AI provider and model selection (Gemini, OpenAI, OpenRouter, DeepSeek)
Home-page: https://github.com/akillabs/ScraperSage
Author: Akil
Author-email: Akil <akil@example.com>
License: MIT
Project-URL: Homepage, https://github.com/akil/ScraperSage
Project-URL: Bug Reports, https://github.com/akil/ScraperSage/issues
Project-URL: Source, https://github.com/akil/ScraperSage
Project-URL: Documentation, https://github.com/akil/ScraperSage/blob/main/README.md
Keywords: web scraping,content summarization,search,AI,playwright,gemini,openai,openrouter,deepseek
Classifier: Development Status :: 4 - Beta
Classifier: Intended Audience :: Developers
Classifier: Topic :: Software Development :: Libraries :: Python Modules
Classifier: Topic :: Internet :: WWW/HTTP :: Indexing/Search
Classifier: Topic :: Scientific/Engineering :: Artificial Intelligence
Classifier: License :: OSI Approved :: MIT License
Classifier: Programming Language :: Python :: 3
Classifier: Programming Language :: Python :: 3.8
Classifier: Programming Language :: Python :: 3.9
Classifier: Programming Language :: Python :: 3.10
Classifier: Programming Language :: Python :: 3.11
Classifier: Programming Language :: Python :: 3.12
Requires-Python: >=3.8
Description-Content-Type: text/markdown
License-File: LICENSE
Requires-Dist: requests
Requires-Dist: ddgs
Requires-Dist: playwright
Requires-Dist: google-generativeai
Requires-Dist: openai
Requires-Dist: beautifulsoup4
Requires-Dist: tenacity
Requires-Dist: httpx
Provides-Extra: dev
Requires-Dist: pytest>=6.0; extra == "dev"
Requires-Dist: pytest-asyncio; extra == "dev"
Requires-Dist: black; extra == "dev"
Requires-Dist: flake8; extra == "dev"
Requires-Dist: mypy; extra == "dev"
Dynamic: author
Dynamic: home-page
Dynamic: license-file
Dynamic: requires-python

# ScraperSage

A comprehensive web scraping and content summarization library that combines Google/DuckDuckGo search with web scraping and AI-powered summarization using multiple providers: **Gemini**, **OpenAI**, **OpenRouter**, and **DeepSeek**.

⚠️ **Provider and model are now required.** There are no default models, so every scraper must name both explicitly.

[![Python Version](https://img.shields.io/badge/python-3.8%2B-blue.svg)](https://python.org)
[![License](https://img.shields.io/badge/license-MIT-green.svg)](LICENSE)
[![PyPI Version](https://img.shields.io/pypi/v/ScraperSage.svg)](https://pypi.org/project/ScraperSage/)

## 🚀 Features

- **Multi-Engine Search**: Combines Google (via Serper API) and DuckDuckGo search results
- **Advanced Web Scraping**: Uses Playwright for robust, JavaScript-enabled web scraping  
- **Multiple AI Providers**: Support for Gemini, OpenAI, OpenRouter, and DeepSeek
- **Explicit Model Selection**: Must specify both provider and model - no defaults
- **Dynamic Model Support**: Use any model supported by your chosen provider
- **Parallel Processing**: Concurrent scraping and summarization for improved performance
- **Retry Mechanisms**: Built-in retry logic for reliable operations
- **Structured Output**: Clean JSON output format for easy integration
- **Error Handling**: Comprehensive error handling and graceful degradation
- **Configurable Parameters**: Flexible configuration for different use cases
- **Real-time Processing**: Live status updates during processing

## 🤖 Supported AI Providers & Example Models

> **Important**: You **must** specify both provider and model - there are no default models.

### Gemini (Google)
- `gemini-1.5-flash` - Fast and efficient
- `gemini-1.5-pro` - Most capable model
- `gemini-1.0-pro` - Original Gemini model
- **Any other Gemini models** as they become available

### OpenAI
- `gpt-4o-mini` - Faster and cost-effective
- `gpt-4o` - Latest and most capable
- `gpt-4-turbo` - High performance
- `gpt-3.5-turbo` - Cost-effective option
- **Any other OpenAI models** as they become available

### OpenRouter
- `openai/gpt-4o-mini` - GPT-4o mini via OpenRouter
- `anthropic/claude-3.5-sonnet` - Anthropic's latest
- `anthropic/claude-3-haiku` - Fast Anthropic model
- `meta-llama/llama-3.1-8b-instruct` - Meta's Llama
- **Any other models** available on OpenRouter

### DeepSeek
- `deepseek-chat` - General purpose
- `deepseek-coder` - Optimized for code
- **Any other DeepSeek models** as they become available

## 📦 Installation

### From PyPI (Recommended)

```bash
pip install ScraperSage
```

### Install Playwright Browsers (Required)

```bash
playwright install chromium
```

## 🔑 API Keys Setup

You need API keys for:
1. **Serper API** (for Google Search) - [Get it here](https://serper.dev)
2. **Your chosen AI provider**:
   - **Gemini**: [Google AI Studio](https://aistudio.google.com/app/apikey)
   - **OpenAI**: [OpenAI Platform](https://platform.openai.com/api-keys)
   - **OpenRouter**: [OpenRouter](https://openrouter.ai/keys)
   - **DeepSeek**: [DeepSeek Platform](https://platform.deepseek.com/api-keys)

### Set Environment Variables

```bash
# Required for search
export SERPER_API_KEY="your_serper_api_key"

# Set the key for the AI provider you plan to use (one is enough)
export GEMINI_API_KEY="your_gemini_key"
export OPENAI_API_KEY="your_openai_key"
export OPENROUTER_API_KEY="your_openrouter_key"
export DEEPSEEK_API_KEY="your_deepseek_key"
```
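
Before constructing a scraper, it can help to confirm which keys are actually set. The sketch below uses only the standard library and the environment variable names listed above; `PROVIDER_ENV_VARS`, `configured_providers`, and `search_key_present` are illustrative names and are not part of ScraperSage.

```python
import os

# Environment variable names as listed above; the mapping itself is illustrative.
PROVIDER_ENV_VARS = {
    "gemini": "GEMINI_API_KEY",
    "openai": "OPENAI_API_KEY",
    "openrouter": "OPENROUTER_API_KEY",
    "deepseek": "DEEPSEEK_API_KEY",
}


def configured_providers(env=None):
    """Return the providers whose API key is set to a non-empty value."""
    env = os.environ if env is None else env
    return [
        name
        for name, var in PROVIDER_ENV_VARS.items()
        if env.get(var, "").strip()
    ]


def search_key_present(env=None):
    """Return True if SERPER_API_KEY is set to a non-empty value."""
    env = os.environ if env is None else env
    return bool(env.get("SERPER_API_KEY", "").strip())


if __name__ == "__main__":
    print("Serper key set:", search_key_present())
    print("Providers with keys:", configured_providers())
```

Passing a plain dictionary as `env` makes the check easy to exercise in tests without touching the real environment.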

## 📚 Usage Guide

### Basic Usage - Provider and Model Required

```python
from ScraperSage import scrape_and_summarize

# ✅ CORRECT: Specify both provider and model
scraper = scrape_and_summarize(provider="gemini", model="gemini-1.5-flash")
result = scraper.run({"query": "AI trends 2024"})

# ✅ CORRECT: Using OpenAI
scraper = scrape_and_summarize(provider="openai", model="gpt-4o-mini")
result = scraper.run({"query": "AI trends 2024"})

# ❌ INCORRECT: This will raise an error
# scraper = scrape_and_summarize()  # Missing provider and model
# scraper = scrape_and_summarize(provider="openai")  # Missing model
```

### Get Available Models

```python
from ScraperSage import get_available_models, get_supported_providers

# See all supported providers
providers = get_supported_providers()
print(f"Providers: {providers}")

# Get example models for each provider
for provider in providers:
    models = get_available_models(provider)
    print(f"\n{provider.upper()} example models:")
    for model_id, description in models.items():
        print(f"  - {model_id}: {description}")
```

### Advanced Configuration

```python
from ScraperSage import scrape_and_summarize

# Run parameters shared by every provider/model attempt
params = {
    "query": "machine learning in healthcare",
    "max_results": 8,
    "max_urls": 12,
    "save_to_file": True
}

# Try different providers/models
providers_to_try = [
    {"provider": "gemini", "model": "gemini-1.5-pro"},
    {"provider": "openai", "model": "gpt-4o"},
    {"provider": "openrouter", "model": "anthropic/claude-3.5-sonnet"},
    {"provider": "deepseek", "model": "deepseek-chat"}
]

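for config in providers_to_try:
    try:
        scraper = scrape_and_summarize(**config)
        result = scraper.run(params)
    except Exception as e:
        print(f"❌ {config['provider']}/{config['model']} failed: {e}")
        continue
    if result["status"] == "success":
        print(f"✅ {config['provider']} with {config['model']} worked!")
        break
    # A non-success status is not an exception, so report it before moving on
    print(f"⚠️ {config['provider']}/{config['model']} returned status: {result['status']}")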
```

## ⚙️ Configuration Parameters

### Constructor Parameters

| Parameter | Type | Required | Description |
|-----------|------|----------|-------------|
| `provider` | str | ✅ **YES** | AI provider: gemini, openai, openrouter, deepseek |
| `model` | str | ✅ **YES** | Specific model name supported by the provider |
| `serper_api_key` | str | Optional | Serper API key (uses env var if not provided) |
| `provider_api_key` | str | Optional | AI provider API key (uses env var if not provided) |
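
If you prefer to pass keys explicitly rather than through environment variables, one way to assemble the constructor arguments is sketched below. It relies only on the parameter names in the table above; `build_scraper_kwargs` is a hypothetical helper, not a ScraperSage function.

```python
def build_scraper_kwargs(provider, model, serper_api_key=None, provider_api_key=None):
    """Hypothetical helper: assemble keyword arguments for scrape_and_summarize."""
    if not provider or not model:
        raise ValueError("both provider and model must be given")

    kwargs = {"provider": provider, "model": model}
    # Only include keys that were passed explicitly; per the table above,
    # omitted keys are read from environment variables instead.
    if serper_api_key is not None:
        kwargs["serper_api_key"] = serper_api_key
    if provider_api_key is not None:
        kwargs["provider_api_key"] = provider_api_key
    return kwargs
```

The result can then be unpacked into the constructor, for example `scrape_and_summarize(**build_scraper_kwargs("openai", "gpt-4o-mini"))`.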

### Run Parameters

| Parameter | Type | Default | Description |
|-----------|------|---------|-------------|
| `query` | str | **Required** | The search query to process |
| `max_results` | int | 5 | Maximum search results per engine (1-20) |
| `max_urls` | int | 8 | Maximum URLs to scrape (1-50) |
| `save_to_file` | bool | False | Save results to timestamped JSON file |
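
The defaults and ranges in the table can be checked before calling `run()`. The following is a minimal sketch, assuming you want out-of-range values rejected up front; whether the library itself validates or clamps these values is not documented here, and `normalize_run_params` is a hypothetical helper.

```python
# Defaults and ranges copied from the Run Parameters table.
DEFAULTS = {"max_results": 5, "max_urls": 8, "save_to_file": False}
RANGES = {"max_results": (1, 20), "max_urls": (1, 50)}


def normalize_run_params(params):
    """Hypothetical helper: fill in defaults and reject out-of-range values."""
    query = params.get("query")
    if not isinstance(query, str) or not query.strip():
        raise ValueError("'query' is required and must be a non-empty string")

    normalized = {**DEFAULTS, **params}
    for name, (low, high) in RANGES.items():
        value = normalized[name]
        # bool is a subclass of int, so exclude it explicitly.
        if isinstance(value, bool) or not isinstance(value, int):
            raise ValueError(f"'{name}' must be an integer")
        if not low <= value <= high:
            raise ValueError(f"'{name}' must be between {low} and {high}, got {value}")

    if not isinstance(normalized["save_to_file"], bool):
        raise ValueError("'save_to_file' must be a boolean")
    return normalized
```

The returned dictionary has the same shape as the `params` dictionary used in the examples above, so it can be passed straight to `scraper.run(...)`.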

## 🚨 Error Handling

### Common Errors and Solutions

```python
from ScraperSage import scrape_and_summarize, get_available_models

# ❌ Missing provider
try:
    scraper = scrape_and_summarize(model="gpt-4o")
except ValueError as e:
    print(f"Error: {e}")
    # Shows: Provider is required. Please specify one of: ['gemini', 'openai', 'openrouter', 'deepseek']

# ❌ Missing model
try:
    scraper = scrape_and_summarize(provider="openai")
except ValueError as e:
    print(f"Error: {e}")
    # Shows: Model is required for openai. Example models: ['gpt-4o-mini', 'gpt-4o', 'gpt-4-turbo']

# ✅ Get help with model selection
models = get_available_models("openai")
print(f"Available OpenAI models: {list(models.keys())}")

# ✅ Correct usage
scraper = scrape_and_summarize(provider="openai", model="gpt-4o-mini")
```

### Safe Model Selection Helper

```python
from ScraperSage import scrape_and_summarize


def safe_create_scraper(provider, model_preferences):
    """Try models in order of preference."""
    for model in model_preferences:
        try:
            scraper = scrape_and_summarize(provider=provider, model=model)
            print(f"✅ Successfully initialized {provider} with {model}")
            return scraper
        except Exception as e:
            print(f"❌ {provider}/{model} failed: {e}")
            continue
    
    raise ValueError(f"No working models found for {provider}")

# Usage
openai_preferences = ["gpt-4o", "gpt-4o-mini", "gpt-3.5-turbo"]
scraper = safe_create_scraper("openai", openai_preferences)
```

## 💡 Model Selection Strategy

### Recommended Approach

```python
from ScraperSage import scrape_and_summarize

def create_scraper_with_fallback(provider_preferences):
    """Create scraper with provider/model fallbacks."""
    
    for provider_config in provider_preferences:
        provider = provider_config["provider"]
        models = provider_config["models"]
        
        print(f"🔍 Trying {provider}...")
        for model in models:
            try:
                scraper = scrape_and_summarize(provider=provider, model=model)
                print(f"✅ Success: {provider}/{model}")
                return scraper
            except Exception as e:
                print(f"❌ Failed: {provider}/{model} - {str(e)[:50]}...")
                continue
    
    raise ValueError("No working provider/model combinations found")

# Define your preferences
preferences = [
    {
        "provider": "openai",
        "models": ["gpt-4o", "gpt-4o-mini", "gpt-3.5-turbo"]
    },
    {
        "provider": "gemini", 
        "models": ["gemini-1.5-pro", "gemini-1.5-flash"]
    },
    {
        "provider": "openrouter",
        "models": ["anthropic/claude-3.5-sonnet", "openai/gpt-4o-mini"]
    }
]

scraper = create_scraper_with_fallback(preferences)
result = scraper.run({"query": "your search query"})
```

## 📊 Benefits of Explicit Model Selection

### ✅ Advantages
- **No surprises**: You always know which model is being used
- **Cost control**: Explicitly choose cost-effective models
- **Performance predictability**: Know exactly what capabilities you're getting
- **Future-proof**: New models don't change existing behavior
- **Debugging**: Easier to identify model-specific issues
- **Transparency**: Clear model usage in logs and results

### 📈 Best Practices
1. **Always specify both provider and model**
2. **Use `get_available_models()` to see examples**
3. **Implement fallback strategies for reliability**
4. **Test models with small queries first**
5. **Monitor costs when using premium models**
6. **Keep model preferences in configuration files**
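
As one way to follow the last practice, model preferences can live in a JSON file and be loaded at startup. This is a sketch under assumptions: the JSON layout and the `load_preferences` helper are illustrative, chosen to match the `preferences` list used in the Recommended Approach example, and are not a format defined by ScraperSage.

```python
import json


def load_preferences(text):
    """Parse a JSON list of {"provider": ..., "models": [...]} entries."""
    data = json.loads(text)
    if not isinstance(data, list):
        raise ValueError("expected a JSON list of provider entries")

    preferences = []
    for entry in data:
        if not isinstance(entry, dict):
            raise ValueError(f"each entry must be an object, got {entry!r}")
        provider = entry.get("provider")
        models = entry.get("models")
        if not isinstance(provider, str) or not provider:
            raise ValueError(f"entry is missing a provider name: {entry!r}")
        if not isinstance(models, list) or not models:
            raise ValueError(f"'{provider}' needs a non-empty list of models")
        preferences.append({"provider": provider, "models": [str(m) for m in models]})
    return preferences


# Inline text stands in for the contents of a configuration file.
EXAMPLE_CONFIG = """
[
  {"provider": "openai", "models": ["gpt-4o-mini", "gpt-4o"]},
  {"provider": "gemini", "models": ["gemini-1.5-flash"]}
]
"""

preferences = load_preferences(EXAMPLE_CONFIG)
```

In a real project the text would come from a file, for example `load_preferences(Path("models.json").read_text())` with a path of your choosing.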

## 🔄 Changelog

### v1.2.0
- ✅ **BREAKING CHANGE**: Removed default models - provider and model are now required
- ✅ **ENHANCED**: Explicit error messages when provider/model missing
- ✅ **IMPROVED**: Better model validation and error handling
- ✅ **ADDED**: Helper functions for model selection
- ✅ **UPDATED**: Documentation with explicit usage examples

### v1.1.0
- ✅ Multiple AI provider support (Gemini, OpenAI, OpenRouter, DeepSeek)
- ✅ Dynamic model support
- ✅ Provider comparison capabilities

## 🤝 Contributing

Areas where you can help:
- 🔧 Add support for more AI providers
- 🎯 Improve model validation and discovery
- 📊 Add model performance benchmarking
- 🧪 Expand test coverage for various models
- 📚 Add more model selection examples

## 📄 License

MIT License - see the [LICENSE](LICENSE) file for details.

---

**Made with ❤️ by AkilLabs**

*Now requires explicit provider and model selection for better control!*

**📦 Available on PyPI**: [https://pypi.org/project/ScraperSage/](https://pypi.org/project/ScraperSage/)
