Metadata-Version: 2.1
Name: fraudcrawler
Version: 0.8.1
Summary: Intelligent Market Monitoring
Home-page: https://github.com/open-veanu/fraudcrawler
License: MIT
Author: Domingo Bertus
Author-email: hello@veanu.ch
Requires-Python: >=3.11,<4.0
Classifier: License :: OSI Approved :: MIT License
Classifier: Programming Language :: Python :: 3
Classifier: Programming Language :: Python :: 3.11
Classifier: Programming Language :: Python :: 3.12
Requires-Dist: aiocache (>=0.12.2,<0.13.0)
Requires-Dist: beautifulsoup4 (>=4.13.4,<5.0.0)
Requires-Dist: httpx (>=0.28.1,<0.29.0)
Requires-Dist: openai (>=1.68.2,<2.0.0)
Requires-Dist: pandas (>=2.2.3,<3.0.0)
Requires-Dist: pydantic-settings (>=2.8.1,<3.0.0)
Requires-Dist: redis (>=5.0.0,<6.0.0)
Requires-Dist: requests (>=2.32.3,<3.0.0)
Requires-Dist: tenacity (>=9.1.2,<10.0.0)
Project-URL: Repository, https://github.com/open-veanu/fraudcrawler
Description-Content-Type: text/markdown

# fraudcrawler

![CI Status](https://github.com/open-veanu/fraudcrawler/workflows/CI/badge.svg)
![Python Version](https://img.shields.io/badge/python-3.11+-blue.svg)
![License](https://img.shields.io/badge/license-MIT-green.svg)
![PyPI](https://img.shields.io/pypi/v/fraudcrawler.svg)

Fraudcrawler is an intelligent **market monitoring** tool that searches the web for products, extracts product details, and classifies them using LLMs. It combines search APIs, web scraping, and AI to automate product discovery and relevance assessment.

## Features

- **Asynchronous pipeline** - Products move through search, extraction, and classification stages independently
- **Multiple search engines** - Google Search, Google Shopping, and more...
- **Search term enrichment** - Automatically find related terms and expand your search
- **Product extraction** - Get structured product data via Zyte API
- **LLM classification** - Assess product relevance using OpenAI API with custom prompts
- **Marketplace filtering** - Focus searches on specific domains
- **Deduplication** - Avoid reprocessing previously collected URLs
- **CSV export** - Results saved with timestamps for easy tracking

## Prerequisites

- Python 3.11 or higher
- API keys for:
  - **SerpAPI** - Google search results
  - **Zyte API** - Product data extraction
  - **OpenAI API** - Product classification
  - **DataForSEO** (optional) - Search term enrichment

## Installation

```bash
python3.11 -m venv .venv
source .venv/bin/activate
pip install fraudcrawler
```

**Using Poetry:**
```bash
poetry install
```

## Configuration

Create a `.env` file with your API credentials (see `.env.example` for template):

```bash
SERPAPI_KEY=your_serpapi_key
ZYTEAPI_KEY=your_zyte_key
OPENAIAPI_KEY=your_openai_key
DATAFORSEO_USER=your_user  # optional
DATAFORSEO_PWD=your_pwd    # optional
REDIS_URL=redis://localhost:6379/0  # optional, for response caching
```

## Caching

Fraudcrawler uses Redis-backed caching to avoid duplicate expensive API calls when re-running pipelines during debugging. External API responses (OpenAI, Zyte, SerpAPI, DataForSEO) are automatically cached with a default 24-hour TTL.

**Setup:**
- Install Redis locally via docker: `docker run -d -p 6379:6379 redis:8` or use a cloud Redis instance
- Set `REDIS_USE_CACHE` in your `.env` file (defaults to `true`, switch to `false`if you do not want to use the cache)
- Set `REDIS_URL` in your `.env` file (defaults to `redis://localhost:6379/0` if not set)
- Set `REDIS_CACHE_TTL` in your `.env` file (defaults to `86400` which is 24h if not set)


**Benefits:**
- Prevents re-paying for identical API calls during development
- Supports multiple workers/processes with shared cache
- Automatic stampede protection prevents duplicate requests
- Gracefully degrades if Redis is unavailable

The cache is automatically invalidated when request parameters change, ensuring you always get fresh results for new queries.

## Usage

### Basic Configuration
For a complete working example, see `fraudcrawler/launch_demo_pipeline.py`. After setting up the necessary parameters you can launch and analyse the results with:
```python
# Run pipeline
await client.run(
    search_term=search_term,
    search_engines=search_engines,
    language=language,
    location=location,
    deepness=deepness,
    excluded_urls=excluded_urls,
)

# Load results
df = client.load_results()
print(df.head())
```

### Advanced Configuration

**Search term enrichment** - Find and search related terms:
```python
from fraudcrawler import Enrichment

deepness.enrichment = Enrichment(
    additional_terms=5,
    additional_urls_per_term=10
)
```

**Marketplace filtering** - Focus on specific domains:
```python
from fraudcrawler import Host

marketplaces = [
    Host(name="International", domains="zavamed.com,apomeds.com"),
    Host(name="National", domains="netdoktor.ch,nobelpharma.ch"),
]

await client.run(..., marketplaces=marketplaces)
```

**Exclude domains** - Exclude specific domains from your results:
```python
excluded_urls = [
    Host(name="Compendium", domains="compendium.ch"),
]

await client.run(..., excluded_urls=excluded_urls)
```

**Skip previously collected URLs**:
```python
previously_collected_urls = [
    "https://example.com/product1",
    "https://example.com/product2",
]

await client.run(..., previously_collected_urls=previously_collected_urls)
```

**View all results** from a client instance:
```python
client.print_available_results()
```

## Output

Results are saved as CSV files in `data/results/` with the naming pattern:

```
<search_term>_<language_code>_<location_code>_<timestamp>.csv
```

Example: `sildenafil_de_ch_20250115143022.csv`

The CSV includes product details, URLs, and classification scores from your workflows.

## Development

For detailed contribution guidelines, see [CONTRIBUTING.md](CONTRIBUTING.md).

## License

This project is licensed under the MIT License - see the [LICENSE](LICENSE) file for details.

## Architecture

Fraudcrawler uses an asynchronous pipeline where products can be at different processing stages simultaneously. Product A might be in classification while Product B is still being scraped. This is enabled by async workers for each stage (Search, Context Extraction, Processing) using `httpx.AsyncClient`.

![Async Setup](https://github.com/open-veanu/fraudcrawler/raw/master/docs/assets/images/Fraudcrawler_Async_Setup.svg)

For more details on the async design, see the [httpx documentation](https://www.python-httpx.org/api/#asyncclient).

