Metadata-Version: 2.1
Name: xcrawl
Version: 1.1.0
Summary: Official Python SDK for XCrawl - A powerful web scraping API service
Author: XCrawl Team
License: MIT
Project-URL: Homepage, https://github.com/xcrawl-api/xcrawl-sdk
Project-URL: Repository, https://github.com/xcrawl-api/xcrawl-sdk.git
Project-URL: Issues, https://github.com/xcrawl-api/xcrawl-sdk/issues
Keywords: xcrawl,web-scraping,scraper,crawler,sdk,api-client
Classifier: Development Status :: 3 - Alpha
Classifier: Intended Audience :: Developers
Classifier: License :: OSI Approved :: MIT License
Classifier: Programming Language :: Python :: 3
Classifier: Programming Language :: Python :: 3.8
Classifier: Programming Language :: Python :: 3.9
Classifier: Programming Language :: Python :: 3.10
Classifier: Programming Language :: Python :: 3.11
Classifier: Programming Language :: Python :: 3.12
Classifier: Topic :: Software Development :: Libraries :: Python Modules
Classifier: Topic :: Internet :: WWW/HTTP
Requires-Python: >=3.8
Description-Content-Type: text/markdown
License-File: LICENSE
Requires-Dist: requests>=2.31.0
Requires-Dist: urllib3>=2.0.0
Provides-Extra: dev
Requires-Dist: pytest>=7.4.0; extra == "dev"
Requires-Dist: pytest-cov>=4.1.0; extra == "dev"
Requires-Dist: black>=23.7.0; extra == "dev"
Requires-Dist: mypy>=1.5.0; extra == "dev"
Requires-Dist: types-requests>=2.31.0; extra == "dev"
Requires-Dist: flake8>=6.1.0; extra == "dev"

# XCrawl Python SDK

The XCrawl Python SDK is a client for the XCrawl API. It covers scraping, search, sitemap discovery (map), and site crawling.

## Table of Contents

- [Installation](#installation)
- [Quick Start](#quick-start)
- [Core APIs](#core-apis)
- [Configuration](#configuration)
- [Output Formats](#output-formats)
- [Error Handling](#error-handling)
- [Requirements](#requirements)
- [Development](#development)
- [License](#license)

## Installation

Install from PyPI:

```bash
pip install xcrawl
```

Install from source:

```bash
git clone <repository-url>
cd <repo-dir>/xcrawl-sdk-py
pip install -e .
```

## Quick Start

1. Get an API key from [xcrawl.com](https://xcrawl.com).
2. Set `XCRAWL_API_KEY` or pass `api_key` directly.

```bash
export XCRAWL_API_KEY=your-api-key
```

```python
from xcrawl import XcrawlClient

client = XcrawlClient()

# Sync scrape
sync_result = client.scrape('https://example.com', {
    'output': {'formats': ['markdown']}
})
print(sync_result['data']['markdown'])

# Async scrape + auto polling
job = client.scrape('https://example.com', {
    'mode': 'async',
    'output': {'formats': ['markdown', 'summary']}
})
result = client.wait_for_job(job['scrape_id'], timeout=60)
print(result['status'])
```

## Core APIs

### Scrape

```python
result = client.scrape('https://example.com', {
    'proxy': {'location': 'US'},
    'request': {
        'device': 'desktop',
        'only_main_content': True,
        'block_ads': True,
    },
    'output': {
        'formats': ['html', 'markdown', 'links', 'screenshot'],
        'screenshot': 'full_page'
    }
})
```

### Async Status Check

```python
status = client.get_job_result('scrape-id-123')
if status['status'] == 'completed':
    print(status['data']['markdown'])
elif status['status'] == 'failed':
    print(status.get('message'))
```

### Structured Extraction (`json` format)

`output.json.prompt` and `output.json.json_schema` are both optional.

```python
result = client.scrape('https://example.com', {
    'output': {
        'formats': ['json'],
        'json': {
            'prompt': 'Extract product name and price',
            'json_schema': {
                'type': 'object',
                'properties': {
                    'name': {'type': 'string'},
                    'price': {'type': 'number'}
                }
            }
        }
    }
})
```

### Search

```python
result = client.search({
    'query': 'web scraping',
    'location': 'New York, NY',
    'language': 'en',
    'limit': 10
})
print(result['data'])
```

The shape of `result['data']` depends on the backend response and may vary between requests. Stable fields include `credits_used` and `credits_detail`.
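Because the payload can vary, it may help to read it defensively. The following is a minimal sketch, not part of the SDK: `summarize_search` is a hypothetical helper, and it assumes the stable fields are keys of `result['data']`, which this README does not state explicitly.

```python
from typing import Any, Dict

STABLE_KEYS = ('credits_used', 'credits_detail')


def summarize_search(result: Dict[str, Any]) -> Dict[str, Any]:
    """Read the documented stable fields without assuming the rest of the payload."""
    data = result.get('data')
    # Assumption: the stable fields live inside result['data'] when it is a dict.
    if not isinstance(data, dict):
        data = {}
    return {
        'credits_used': data.get('credits_used'),
        'credits_detail': data.get('credits_detail'),
        # Everything else is backend-dependent, so only report which keys arrived.
        'other_keys': sorted(k for k in data if k not in STABLE_KEYS),
    }
```

Calling `summarize_search(result)` on a search response returns `None` for any stable field that is missing instead of raising `KeyError`.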

### Map

```python
result = client.map('https://example.com', {
    'filter': r'.*\.html$',
    'limit': 1000,
    'include_subdomains': True,
    'ignore_query_parameters': True
})
print(result['data']['links'])
```

### Crawl

```python
job = client.crawl('https://example.com', {
    'crawler': {
        'limit': 100,
        'max_depth': 3,
        'include': [r'.*\.html$'],
        'exclude': [r'.*/admin/.*']
    },
    'output': {'formats': ['markdown']}
})

crawl_status = client.get_crawl_status(job['crawl_id'])
print(crawl_status['status'])
```
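
The example above checks the crawl status once. To wait for a crawl to finish, a polling loop along these lines could be used. This is a sketch under assumptions: `wait_for_crawl` is a hypothetical helper rather than an SDK method, and it assumes crawls report the same `completed` and `failed` terminal states shown for scrape jobs.

```python
import time
from typing import Any, Dict


def wait_for_crawl(client: Any, crawl_id: str, timeout: float = 300.0,
                   interval: float = 2.0) -> Dict[str, Any]:
    """Poll client.get_crawl_status until the crawl ends or the timeout expires."""
    deadline = time.monotonic() + timeout
    while True:
        status = client.get_crawl_status(crawl_id)
        # Assumption: 'completed' and 'failed' are the terminal states for crawls.
        if status.get('status') in ('completed', 'failed'):
            return status
        if time.monotonic() >= deadline:
            raise TimeoutError(f'crawl {crawl_id} not finished after {timeout}s')
        time.sleep(interval)
```

Usage would look like `final = wait_for_crawl(client, job['crawl_id'], timeout=600)`, followed by a check of `final['status']`.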

### Webhook (Async)

```python
client.scrape('https://example.com', {
    'mode': 'async',
    'webhook': {
        'url': 'https://your-server.com/webhook',
        'events': ['completed', 'failed']
    }
})
```

## Configuration

```python
client = XcrawlClient(
    api_key='your-api-key',
    api_url='https://run.xcrawl.com',
    timeout=60,
    max_retries=3,
    backoff_factor=0.5,
)
```

Retry behavior:
- Retried: HTTP 5xx responses and network failures
- Not retried: HTTP 4xx responses and validation errors
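
To illustrate how `max_retries` and `backoff_factor` might interact with these rules, here is a small sketch. Both function names are hypothetical, and the exponential schedule (`backoff_factor * 2**attempt`) is an assumption; the SDK's actual delay formula is not documented here.

```python
from typing import List, Optional


def should_retry(status: Optional[int]) -> bool:
    """Return True for network failures (no status) and HTTP 5xx responses."""
    if status is None:  # no response at all: treated as a network failure
        return True
    return 500 <= status <= 599


def backoff_delays(max_retries: int, backoff_factor: float) -> List[float]:
    """Assumed schedule: wait backoff_factor * 2**attempt seconds before each retry."""
    return [backoff_factor * (2 ** attempt) for attempt in range(max_retries)]
```

Under this assumed schedule, `max_retries=3` with `backoff_factor=0.5` would wait 0.5, 1.0, and 2.0 seconds before the three retries.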

## Output Formats

Supported values in `output.formats`:
- `markdown`
- `html`
- `raw_html`
- `links`
- `summary`
- `screenshot`
- `json`

If `output.formats` is omitted or set to `[]`, the response returns `metadata` only.
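
A typo in a format name is easy to catch before sending a request. The sketch below checks a list against the values listed above; `check_formats` and `SUPPORTED_FORMATS` are hypothetical names for illustration, not SDK exports, and the SDK may perform its own validation.

```python
from typing import Iterable, List

# Values copied from the supported list in this section.
SUPPORTED_FORMATS = frozenset({
    'markdown', 'html', 'raw_html', 'links', 'summary', 'screenshot', 'json',
})


def check_formats(formats: Iterable[str]) -> List[str]:
    """Return the formats as a list, or raise ValueError naming unsupported values."""
    requested = list(formats)
    unknown = [name for name in requested if name not in SUPPORTED_FORMATS]
    if unknown:
        raise ValueError(f'unsupported output formats: {unknown}')
    return requested
```

The result can be passed straight into a request, for example `{'output': {'formats': check_formats(['markdown', 'links'])}}`.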

## Error Handling

```python
from xcrawl import XcrawlClient, XcrawlError, JobTimeoutError

client = XcrawlClient(api_key='your-api-key')

try:
    result = client.scrape('https://example.com', {
        'output': {'formats': ['markdown']}
    })
    print(result['data']['markdown'])
except JobTimeoutError as error:
    print(f'Job {error.job_id} timed out after {error.timeout_seconds}s')
except XcrawlError as error:
    print(error.code, error.message, error.status, error.request_id)
```

## Requirements

- Python >= 3.8

## Development

```bash
pip install -e ".[dev]"
pytest
black xcrawl
mypy xcrawl
```

## License

MIT
