Metadata-Version: 2.4
Name: universal-scraper
Version: 1.0.0
Summary: AI-powered web scraping with customizable field extraction
Home-page: https://github.com/Ayushi0405/Universal_Scrapper
Author: Ayushi Gupta & Pushpender Singh
Author-email: aayushi.gupta0405@gmail.com
Project-URL: Bug Reports, https://github.com/Ayushi0405/Universal_Scrapper/issues
Project-URL: Source, https://github.com/Ayushi0405/Universal_Scrapper
Project-URL: Documentation, https://github.com/Ayushi0405/Universal_Scrapper/wiki
Keywords: web scraping,ai,data extraction,beautifulsoup,gemini,automation,html parsing,structured data
Classifier: Development Status :: 4 - Beta
Classifier: Intended Audience :: Developers
Classifier: License :: OSI Approved :: MIT License
Classifier: Operating System :: OS Independent
Classifier: Programming Language :: Python :: 3
Classifier: Programming Language :: Python :: 3.7
Classifier: Programming Language :: Python :: 3.8
Classifier: Programming Language :: Python :: 3.9
Classifier: Programming Language :: Python :: 3.10
Classifier: Programming Language :: Python :: 3.11
Classifier: Programming Language :: Python :: 3.12
Classifier: Topic :: Internet :: WWW/HTTP :: Browsers
Classifier: Topic :: Software Development :: Libraries :: Python Modules
Classifier: Topic :: Text Processing :: Markup :: HTML
Requires-Python: >=3.7
Description-Content-Type: text/markdown
Requires-Dist: google-generativeai>=0.3.0
Requires-Dist: beautifulsoup4>=4.11.0
Requires-Dist: requests>=2.28.0
Requires-Dist: selenium>=4.0.0
Requires-Dist: lxml>=4.9.0
Requires-Dist: html5lib>=1.1
Requires-Dist: fake-useragent>=1.2.0
Provides-Extra: dev
Requires-Dist: pytest>=7.0.0; extra == "dev"
Requires-Dist: pytest-cov>=4.0.0; extra == "dev"
Requires-Dist: black>=22.0.0; extra == "dev"
Requires-Dist: flake8>=5.0.0; extra == "dev"
Requires-Dist: mypy>=1.0.0; extra == "dev"
Dynamic: author
Dynamic: author-email
Dynamic: classifier
Dynamic: description
Dynamic: description-content-type
Dynamic: home-page
Dynamic: keywords
Dynamic: project-url
Dynamic: provides-extra
Dynamic: requires-dist
Dynamic: requires-python
Dynamic: summary

# Universal Scraper Python Module

A Python module for AI-powered web scraping with customizable field extraction using Google's Gemini AI.

## Features

- 🤖 **AI-Powered Extraction**: Uses Google Gemini to intelligently extract structured data
- 🎯 **Customizable Fields**: Define exactly which fields you want to extract (e.g., company name, job title, salary)
- 🔧 **Easy to Use**: Simple API for both quick scraping and advanced use cases
- 📦 **Modular Design**: Built with clean, modular components
- 🛡️ **Robust**: Handles edge cases, missing data, and various HTML structures
- 💾 **JSON Output**: Clean, structured JSON output with metadata

## Installation

1. **Clone the repository**:
   ```bash
   git clone <repository-url>
   cd Universal_Scrapper
   ```

2. **Install dependencies**:
   ```bash
   pip install -r requirements.txt
   ```

   Or install manually:
   ```bash
   pip install google-generativeai beautifulsoup4 requests selenium lxml html5lib fake-useragent
   ```

3. **Install the module**:
   ```bash
   pip install -e .
   ```

## Quick Start

### 1. Set up your API key

Get a Gemini API key from [Google AI Studio](https://makersuite.google.com/app/apikey) and set it as an environment variable:

```bash
export GEMINI_API_KEY="your_gemini_api_key_here"
```

### 2. Basic Usage

```python
from universal_scraper import UniversalScraper

# Initialize the scraper
scraper = UniversalScraper(api_key="your_gemini_api_key")

# Set the fields you want to extract
scraper.set_fields([
    "company_name", 
    "job_title", 
    "apply_link", 
    "salary_range",
    "location"
])

# Scrape a URL
result = scraper.scrape_url("https://example.com/jobs", save_to_file=True)

print(f"Extracted {result['metadata']['items_extracted']} items")
print(f"Data saved to: {result.get('saved_to')}")
```

### 3. Convenience Function

For quick one-off scraping:

```python
from universal_scraper import scrape

data = scrape(
    url="https://example.com/jobs",
    api_key="your_gemini_api_key",
    fields=["company_name", "job_title", "apply_link"]
)

print(data['data'])  # The extracted data
```

## Advanced Usage

### Multiple URLs

```python
scraper = UniversalScraper(api_key="your_api_key")
scraper.set_fields(["title", "price", "description"])

urls = [
    "https://site1.com/products",
    "https://site2.com/items", 
    "https://site3.com/listings"
]

results = scraper.scrape_multiple_urls(urls, save_to_files=True)

for result in results:
    if result.get('error'):
        print(f"Failed {result['url']}: {result['error']}")
    else:
        print(f"Success {result['url']}: {result['metadata']['items_extracted']} items")
```

### Custom Configuration

```python
scraper = UniversalScraper(
    api_key="your_api_key",
    temp_dir="custom_temp",      # Custom temporary directory
    output_dir="custom_output",  # Custom output directory  
    log_level=logging.DEBUG      # Enable debug logging
)

# Configure for e-commerce scraping
scraper.set_fields([
    "product_name",
    "product_price", 
    "product_rating",
    "product_reviews_count",
    "product_availability",
    "product_description"
])

result = scraper.scrape_url("https://ecommerce-site.com", save_to_file=True)
```

## API Reference

### UniversalScraper Class

#### Constructor
```python
UniversalScraper(api_key=None, temp_dir="temp", output_dir="output", log_level=logging.INFO)
```

- `api_key`: Gemini API key (optional if GEMINI_API_KEY env var is set)
- `temp_dir`: Directory for temporary files
- `output_dir`: Directory for output files
- `log_level`: Logging level

#### Methods

- `set_fields(fields: List[str])`: Set the fields to extract
- `get_fields() -> List[str]`: Get current fields configuration
- `scrape_url(url: str, save_to_file=False, output_filename=None) -> Dict`: Scrape a single URL
- `scrape_multiple_urls(urls: List[str], save_to_files=True) -> List[Dict]`: Scrape multiple URLs

### Convenience Function

```python
scrape(url: str, api_key: str, fields: List[str]) -> Dict
```

Quick scraping function for simple use cases.

## Output Format

The scraped data is returned in a structured format:

```json
{
  "url": "https://example.com",
  "timestamp": "2025-01-01T12:00:00",
  "fields": ["company_name", "job_title", "apply_link"],
  "data": [
    {
      "company_name": "Example Corp",
      "job_title": "Software Engineer", 
      "apply_link": "https://example.com/apply/123"
    }
  ],
  "metadata": {
    "raw_html_length": 50000,
    "cleaned_html_length": 15000,
    "items_extracted": 1
  }
}
```

## Common Field Examples

### Job Listings
```python
scraper.set_fields([
    "company_name",
    "job_title", 
    "apply_link",
    "salary_range",
    "location",
    "job_description",
    "employment_type",
    "experience_level"
])
```

### E-commerce Products
```python
scraper.set_fields([
    "product_name",
    "product_price",
    "product_rating", 
    "product_reviews_count",
    "product_availability",
    "product_image_url",
    "product_description"
])
```

### News Articles
```python
scraper.set_fields([
    "article_title",
    "article_content",
    "article_author",
    "publish_date", 
    "article_url",
    "article_category"
])
```

## Testing

Run the test suite to verify everything works:

```bash
python test_module.py
```

## Example Files

- `example_usage.py`: Comprehensive examples of different usage patterns
- `test_module.py`: Test suite for the module

## How It Works

1. **HTML Fetching**: Uses cloudscraper to fetch HTML content, handling anti-bot measures
2. **HTML Cleaning**: Removes noise, ads, and unnecessary elements while preserving structure
3. **AI Extraction**: Uses Google Gemini to generate custom BeautifulSoup code for your specific fields
4. **Data Processing**: Executes the generated code to extract structured data
5. **Output**: Returns clean JSON data with metadata

## Troubleshooting

### Common Issues

1. **API Key Error**: Make sure your Gemini API key is valid and set correctly
2. **Empty Results**: The AI might need more specific field names or the page might not contain the expected data
3. **Network Errors**: Some sites block scrapers - the tool uses cloudscraper to handle most cases

### Debug Mode

Enable debug logging to see what's happening:

```python
import logging
scraper = UniversalScraper(api_key="your_key", log_level=logging.DEBUG)
```

## Contributing

1. Fork the repository
2. Create a feature branch
3. Make your changes
4. Run tests: `python test_module.py`
5. Submit a pull request

## License

MIT License - see LICENSE file for details.

## Changelog

### v1.0.0
- Initial release
- AI-powered field extraction
- Customizable field configuration
- Multiple URL support
- Comprehensive test suite
