Metadata-Version: 2.4
Name: omnifetch-lib
Version: 1.2.1
Summary: Universal content extraction library with tiered fetching strategies and anti-bot bypass
License: MIT
Project-URL: Homepage, https://github.com/visy-ani/omni-fetch
Project-URL: Repository, https://github.com/visy-ani/omni-fetch
Project-URL: Documentation, https://github.com/visy-ani/omni-fetch#readme
Keywords: fetch,scraper,web-scraping,content-extraction,headless,cloudflare-bypass,anti-bot
Classifier: Development Status :: 4 - Beta
Classifier: Intended Audience :: Developers
Classifier: License :: OSI Approved :: MIT License
Classifier: Programming Language :: Python :: 3
Classifier: Programming Language :: Python :: 3.8
Classifier: Programming Language :: Python :: 3.9
Classifier: Programming Language :: Python :: 3.10
Classifier: Programming Language :: Python :: 3.11
Classifier: Programming Language :: Python :: 3.12
Classifier: Topic :: Internet :: WWW/HTTP
Classifier: Topic :: Software Development :: Libraries :: Python Modules
Requires-Python: >=3.8
Description-Content-Type: text/markdown
Requires-Dist: requests>=2.28.0
Requires-Dist: beautifulsoup4>=4.12.0
Requires-Dist: html2text>=2020.1.16
Requires-Dist: lxml>=4.9.0
Provides-Extra: stealth
Requires-Dist: curl_cffi>=0.6.0; extra == "stealth"
Provides-Extra: dev
Requires-Dist: pytest>=7.0.0; extra == "dev"
Requires-Dist: pytest-asyncio>=0.21.0; extra == "dev"
Provides-Extra: all
Requires-Dist: curl_cffi>=0.6.0; extra == "all"
Requires-Dist: pytest>=7.0.0; extra == "all"
Requires-Dist: pytest-asyncio>=0.21.0; extra == "all"

# OmniFetch Python Library

Python implementation of OmniFetch, a universal content extraction library.

## Features

- **Universal Extraction**: Fetches content from any URL, handling standard sites, SPAs, and paywalls.
- **Tiered System**:
  1.  **Light Fetch**: Fast, standard HTTP request.
  2.  **Headless Browser**: Handles dynamic JS-heavy sites (requires a Netlify endpoint).
  3.  **Search Fallback**: Finds alternative sources for paywalled or blocked content.
- **Smart Parsing**: Converts HTML to clean Markdown or JSON.
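
The tiered approach above can be pictured as a chain of strategies tried in order until one returns usable content. The following is a minimal sketch of that idea only, not OmniFetch's actual internals: `fetch_with_fallback`, `light_fetch`, `headless_fetch`, and `search_fallback` are hypothetical stand-ins invented for illustration.

```python
from typing import Callable, Optional, Sequence

# Each tier takes a URL and returns extracted text, or None if it could not
# produce usable content. These are hypothetical stand-ins, not OmniFetch APIs.
Tier = Callable[[str], Optional[str]]


def fetch_with_fallback(url: str, tiers: Sequence[Tier]) -> Optional[str]:
    """Try each tier in order and return the first non-empty result."""
    for tier in tiers:
        try:
            content = tier(url)
        except Exception:
            # A failing tier is treated like an empty one: move to the next.
            continue
        if content:
            return content
    return None


def light_fetch(url: str) -> Optional[str]:
    # Stub for Tier 1: pretend the plain HTTP request was blocked.
    return None


def headless_fetch(url: str) -> Optional[str]:
    # Stub for Tier 2: pretend a headless browser rendered the page.
    return f"rendered content for {url}"


def search_fallback(url: str) -> Optional[str]:
    # Stub for Tier 3: pretend a search found an alternative source.
    return f"alternative source for {url}"


result = fetch_with_fallback(
    "https://example.com", [light_fetch, headless_fetch, search_fallback]
)
print(result)
```

Ordering the tiers from cheapest to most expensive keeps the common case fast, with the slower strategies paid for only when the earlier ones come back empty.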

## Installation

```bash
pip install omnifetch-lib
```

## Quick Start

```python
from omnifetch import omni_fetch

# Text extraction (Markdown)
result = omni_fetch('https://example.com', mode='TEXT')
print(result.content)

# JSON extraction (Structured Data)
json_result = omni_fetch('https://example.com', mode='JSON')
print(json_result.content['title'])
```

## Configuration

`omni_fetch` accepts the following parameters (signature shown for reference):

```python
def omni_fetch(
    url: str,
    mode: str = 'TEXT',           # 'JSON' for structured, 'TEXT' for markdown
    timeout: int = 30,            # Request timeout in seconds
    netlify_endpoint: str = None, # Headless browser endpoint (Tier 2)
    headers: dict = None,         # Custom headers
    skip_headless: bool = False,  # Skip Tier 2
    skip_search: bool = False,    # Skip Tier 3
    force_title: str = None       # Override title for search fallback
) -> OmniFetchResult
```
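
As one way to combine these options, the sketch below bundles a few keyword arguments from the signature above into a reusable preset. The preset name `FAST_ONLY` and the helper `fetch_fast` are hypothetical, and the fetch function is passed in as an argument so the snippet runs without the library installed.

```python
from typing import Any, Callable, Dict

# Hypothetical preset: only the keyword names come from the omni_fetch
# signature; the values and the preset itself are illustrative.
FAST_ONLY: Dict[str, Any] = {
    "timeout": 10,          # fail quickly instead of waiting the default 30 s
    "skip_headless": True,  # skip Tier 2
    "skip_search": True,    # skip Tier 3
}


def fetch_fast(url: str, fetch: Callable[..., Any]) -> Any:
    """Call the supplied fetch function with the Tier 1-only preset."""
    return fetch(url, mode="TEXT", **FAST_ONLY)
```

With the library installed, this would presumably be called as `fetch_fast('https://example.com', omni_fetch)`.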

### Advanced Usage

#### Handling Blocked Domains (e.g., X/Twitter)

Some domains block direct scraping. OmniFetch handles this automatically by falling back to search (Tier 3). For opaque URLs that carry no descriptive text, such as a numeric status ID, pass `force_title` to override the title used for the search.

```python
result = omni_fetch(
    'https://x.com/someuser/status/12345',
    mode='TEXT',
    force_title='Specific Tweet Content Title' # Helps find the content via search
)
```

#### Headless Browser Support

To enable Tier 2 (Headless Browser) for dynamic sites, deploy the provided Netlify function and pass its URL as `netlify_endpoint`.

```python
result = omni_fetch(
    'https://dynamic-site.com',
    netlify_endpoint='https://your-site.netlify.app/.netlify/functions/headless-fetch'
)
```
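
To avoid hard-coding the endpoint, it could be read from the environment. The sketch below assumes a hypothetical variable name, `OMNIFETCH_NETLIFY_ENDPOINT`, which the library itself does not define, and a hypothetical helper `headless_kwargs`. Skipping Tier 2 when no endpoint is set is a design choice made for this example.

```python
import os
from typing import Any, Dict, Mapping

# Hypothetical environment variable name; OmniFetch does not define it.
ENDPOINT_VAR = "OMNIFETCH_NETLIFY_ENDPOINT"


def headless_kwargs(env: Mapping[str, str] = os.environ) -> Dict[str, Any]:
    """Build keyword arguments for omni_fetch from the environment."""
    endpoint = env.get(ENDPOINT_VAR, "").strip()
    if endpoint:
        return {"netlify_endpoint": endpoint}
    # No endpoint configured: ask for Tier 2 to be skipped explicitly.
    return {"skip_headless": True}
```

The result can then be unpacked into the call, for example `omni_fetch('https://dynamic-site.com', **headless_kwargs())`.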

## Development Installation

```bash
pip install -e .
```

## Running Tests

```bash
pip install -e ".[dev]"
pytest
```

See the main [README.md](../../README.md) for full documentation.
