Metadata-Version: 2.4
Name: screenshot-client-batch
Version: 0.1.0
Summary: Screenshot service client with polling, batch orchestration, and CLI for Azure screenshot service.
Author: Screenshot Package maintainers
Maintainer: Screenshot Package maintainers
License-Expression: MIT
Project-URL: Homepage, https://github.com/pj-ms/screenshot-service
Project-URL: Repository, https://github.com/pj-ms/screenshot-service
Project-URL: Issues, https://github.com/pj-ms/screenshot-service/issues
Keywords: screenshot,batch,azure,orchestration,bulk,web-scraping,client,api
Classifier: Development Status :: 3 - Alpha
Classifier: Intended Audience :: Developers
Classifier: Programming Language :: Python :: 3
Classifier: Programming Language :: Python :: 3.11
Classifier: Programming Language :: Python :: 3.12
Classifier: Topic :: Software Development :: Libraries :: Python Modules
Classifier: Topic :: Internet :: WWW/HTTP
Classifier: Typing :: Typed
Requires-Python: >=3.11
Description-Content-Type: text/markdown
Requires-Dist: infra-screenshot>=0.1.3
Requires-Dist: screenshot-client-core>=0.2.0
Requires-Dist: infra-core>=0.4.0
Requires-Dist: httpx>=0.24.0
Provides-Extra: azure
Requires-Dist: azure-storage-blob>=12.12; extra == "azure"
Requires-Dist: azure-identity>=1.17; extra == "azure"
Requires-Dist: azure-data-tables>=12.4; extra == "azure"
Provides-Extra: dev
Requires-Dist: pytest>=8.0; extra == "dev"
Requires-Dist: pytest-asyncio>=0.23; extra == "dev"
Requires-Dist: mypy>=1.13; extra == "dev"
Requires-Dist: ruff>=0.8.0; extra == "dev"

# Screenshot Client Batch

[![PyPI version](https://badge.fury.io/py/screenshot-client-batch.svg)](https://pypi.org/project/screenshot-client-batch/)
[![Python 3.11+](https://img.shields.io/badge/python-3.11+-blue.svg)](https://www.python.org/downloads/)

A screenshot service client and batch orchestration package. It provides both a
low-level API client (`ScreenshotServiceClient`) and high-level coordinators for
batch processing of screenshot jobs via the Azure-hosted screenshot service.

Built on top of the auto-generated `screenshot-client-core`, this package adds:

- Polling with configurable timeout/interval (using `infra_core.polling`)
- Environment-based configuration (`from_env()`)
- Custom exception hierarchy for better error handling
- Batch coordination with chunking, parallel execution, and progress tracking
- CLI tool for processing JSONL files

## Installation

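```bash
pip install screenshot-client-batch

# Optional: also install the Azure dependencies declared by the "azure" extra
pip install "screenshot-client-batch[azure]"
```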

## Requirements

- Python 3.11+
- Access to a screenshot service endpoint (Azure Function)

### Environment Variables

| Variable | Required | Description |
|----------|----------|-------------|
| `SCREENSHOT_FUNC_BASE_URL` | Yes | Base URL of the screenshot service |
| `SCREENSHOT_FUNC_API_KEY` | No | API key for authenticated endpoints |
| `SCREENSHOT_AZURE_STORAGE_CONNECTION_STRING` | No | Azure Storage connection for upload/download |

## Quick Start: API Client

```python
from screenshot_batch import ScreenshotServiceClient

# Create client from environment variables
client = ScreenshotServiceClient.from_env()  # Uses SCREENSHOT_FUNC_BASE_URL, SCREENSHOT_FUNC_API_KEY

# Start a batch
handle = client.start_batch(
    "my-batch",
    {"jobs": [{"job_id": "1", "url": "https://example.com"}]}
)

# Poll until completion
result = client.poll_run(handle.batch_id, handle.job_id, timeout=600)
print(f"Status: {result['status']}")

# Get results
manifest = client.get_result(handle.batch_id, handle.job_id)
```
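
Conceptually, polling with a timeout and an interval looks like the loop below. This is a generic sketch, not the package's implementation (which builds on `infra_core.polling`). The function name `poll_until`, its parameters, and the assumption that `succeeded` and `failed` are the terminal statuses are all illustrative.

```python
import time
from collections.abc import Callable


def poll_until(
    check: Callable[[], dict],
    *,
    timeout: float,
    interval: float,
    sleep: Callable[[float], None] = time.sleep,
    clock: Callable[[], float] = time.monotonic,
) -> dict:
    """Call `check` repeatedly until it reports a terminal status.

    Hypothetical helper: the terminal statuses below are an assumption.
    """
    terminal = {"succeeded", "failed"}
    deadline = clock() + timeout
    while True:
        state = check()
        if state.get("status") in terminal:
            return state
        # Give up if the next wait would run past the deadline.
        if clock() + interval > deadline:
            raise TimeoutError(f"No terminal status within {timeout} seconds")
        sleep(interval)
```

Injecting `sleep` and `clock` keeps the loop testable without real waiting.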

## Batch Orchestration

Use `ScreenshotBatchCoordinator` for a single batch run with automatic polling and optional artifact download:

```python
from pathlib import Path

from screenshot import ScreenshotOptions, CaptureOptions
from screenshot_batch import ScreenshotBatchCoordinator, ScreenshotJobSpec

coordinator = ScreenshotBatchCoordinator(
    base_url="https://site-screenshot.azurewebsites.net",
    batch_id="marketing-refresh",
    store_dir=Path("data/site-screens"),
    download_results=True,
)

jobs = [
    ScreenshotJobSpec.from_url(
        "https://example.com",
        options=ScreenshotOptions(
            capture=CaptureOptions(enabled=True, max_pages=3, depth=1),
        ),
    ),
]

result = coordinator.run_batch_sync(jobs)
print(result.status, result.job_completed)
```

## Chunked Runs

For large collections (e.g., 4,000 URLs), use `ScreenshotMultiRunCoordinator` to split the jobs
into chunks and fan them out across multiple API calls, keeping results grouped in per-run folders:

```python
from pathlib import Path

from screenshot_batch import (
    ScreenshotMultiRunCoordinator,
    ScreenshotJobSpec,
)

runner = ScreenshotMultiRunCoordinator(
    base_url="https://site-screenshot.azurewebsites.net",
    batch_id="yc-companies",
    store_dir=Path("data/site-screens"),
    download_results=True,
    upload_results=True,  # Optional: upload to Azure Storage
)

# my_options is a ScreenshotOptions instance, built as in the
# Batch Orchestration example.
specs = [
    ScreenshotJobSpec.from_url("https://example.com", options=my_options),
    # ... more specs
]

summaries = runner.run_jobs(specs, chunk_size=4, max_parallel_runs=4)
for summary in summaries:
    print(summary.chunk_index, summary.result.status, summary.result.job_failed)
```
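
To get a feel for how `chunk_size` and `max_parallel_runs` interact, the arithmetic can be sketched independently of the package. The `chunked` helper below is hypothetical and only illustrates the splitting. It does not reproduce how the coordinator schedules runs internally, and the "rounds" figure assumes runs proceed in lockstep.

```python
import math
from collections.abc import Iterator, Sequence
from typing import TypeVar

T = TypeVar("T")


def chunked(items: Sequence[T], chunk_size: int) -> Iterator[Sequence[T]]:
    """Yield consecutive slices of at most `chunk_size` items."""
    if chunk_size < 1:
        raise ValueError("chunk_size must be at least 1")
    for start in range(0, len(items), chunk_size):
        yield items[start : start + chunk_size]


urls = [f"https://example.com/page/{i}" for i in range(4000)]
chunks = list(chunked(urls, chunk_size=4))

# With 4 runs in flight at a time, a lockstep schedule needs this many rounds.
rounds = math.ceil(len(chunks) / 4)
print(len(chunks), rounds)  # prints: 1000 250
```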

## CLI

Install the package and run the CLI to process a JSONL file in chunks:

```bash
screenshot-client-batch input.jsonl my-batch-id \
  --base-url https://site-screenshot.azurewebsites.net \
  --chunk-size 4 \
  --max-parallel-runs 32 \
  --per-run-concurrency 4 \
  --store-dir data/screenshots \
  --download-results \
  --upload-results \
  --state-path data/state.json \
  --metadata '{"source":"my-project"}' \
  --output data/results.jsonl
```

### CLI Options

| Option | Default | Description |
|--------|---------|-------------|
| `--base-url` | `$SCREENSHOT_FUNC_BASE_URL` | Screenshot service URL |
| `--chunk-size` | 4 | Jobs per API call |
| `--max-parallel-runs` | 32 | Concurrent API calls |
| `--per-run-concurrency` | 4 | Concurrency per run |
| `--store-dir` | None | Directory for downloaded artifacts |
| `--download-results` | False | Download artifacts after completion |
| `--upload-results` | False | Upload artifacts to Azure Storage |
| `--state-path` | None | JSON file for resumable state |
| `--metadata` | None | JSON metadata merged into all jobs |
| `--output` | `{input}.results.jsonl` | Output file path |
| `--sample N` | None | Limit to first N jobs (for testing) |

### Input Format

The input JSONL file should contain one JSON object per line:

```json
{"url": "https://example.com", "job_id": "site-1", "metadata": {"category": "tech"}}
{"url": "https://another.com", "job_id": "site-2"}
```
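
A file in this shape can be produced with the standard `json` module. The sketch below assumes only the fields shown above (`url`, `job_id`, and an optional `metadata` object); the helper name `write_jobs_jsonl` is hypothetical.

```python
import json
from pathlib import Path

jobs = [
    {"url": "https://example.com", "job_id": "site-1", "metadata": {"category": "tech"}},
    {"url": "https://another.com", "job_id": "site-2"},
]


def write_jobs_jsonl(path: Path, jobs: list[dict]) -> int:
    """Write one JSON object per line and return the number of jobs written."""
    with path.open("w", encoding="utf-8") as fh:
        for job in jobs:
            fh.write(json.dumps(job) + "\n")
    return len(jobs)
```

Calling `write_jobs_jsonl(Path("input.jsonl"), jobs)` would then produce a file that can be passed to the CLI as its first argument.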

### Output Format

The output JSONL mirrors the input, with result fields added to each record (such as `status`, `blob_url`, and `errors`):

```json
{"url": "https://example.com", "job_id": "site-1", "status": "succeeded", "blob_url": "https://..."}
{"url": "https://another.com", "job_id": "site-2", "status": "failed", "errors": ["timeout"]}
```
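
A results file can be summarized with a few lines of standard-library Python. This sketch assumes the `status` and `job_id` fields and the `failed` status value shown in the example above; the function name `summarize_results` is hypothetical.

```python
import json
from collections import Counter
from collections.abc import Iterable


def summarize_results(lines: Iterable[str]) -> tuple[Counter[str], list[str]]:
    """Count records per status and collect the job IDs of failed jobs."""
    statuses: Counter[str] = Counter()
    failed_ids: list[str] = []
    for line in lines:
        line = line.strip()
        if not line:
            continue  # tolerate blank lines
        record = json.loads(line)
        status = record.get("status", "unknown")
        statuses[status] += 1
        if status == "failed":
            failed_ids.append(record.get("job_id", ""))
    return statuses, failed_ids
```

Because an open file iterates line by line, `summarize_results(open("data/results.jsonl", encoding="utf-8"))` works directly on the CLI output.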

## Exception Handling

The package provides a custom exception hierarchy:

```python
from screenshot_batch import (
    ScreenshotServiceError,  # Base exception
    ScreenshotConfigError,   # Configuration/environment errors
    ScreenshotAPIError,      # API communication errors
    ScreenshotTimeoutError,  # Polling timeout
)

# `client` and `handle` are created as in the Quick Start example.
try:
    result = client.poll_run(handle.batch_id, handle.job_id, timeout=60)
except ScreenshotTimeoutError:
    print("Batch did not complete in time")
except ScreenshotAPIError as e:
    print(f"API error: {e}")
```

## Development

```bash
# Install with dev dependencies
pip install -e ".[dev]"

# Run tests
pytest tests/ -v

# Type checking
mypy src/screenshot_batch

# Linting
ruff check src/ tests/
ruff format src/ tests/
```

## License

MIT
