Metadata-Version: 2.4
Name: meter-sdk
Version: 0.7.2
Summary: Python SDK for Meter Scraper API
Author: Meter
Project-URL: Homepage, https://meter.sh
Project-URL: Documentation, https://docs.meter.sh/
Keywords: scraping,api,sdk
Classifier: Development Status :: 3 - Alpha
Classifier: Intended Audience :: Developers
Classifier: License :: OSI Approved :: MIT License
Classifier: Programming Language :: Python :: 3
Classifier: Programming Language :: Python :: 3.8
Classifier: Programming Language :: Python :: 3.9
Classifier: Programming Language :: Python :: 3.10
Classifier: Programming Language :: Python :: 3.11
Requires-Python: >=3.8
Description-Content-Type: text/markdown
Requires-Dist: httpx>=0.24.0
Dynamic: requires-python

# Meter Scraper API SDK

Python SDK for the [Meter Scraper API](https://api.meter.sh), a web scraping service with LLM-powered strategy generation, job execution, and scheduling.

## Features

- **Simple API**: Clean, Pythonic interface for all API operations
- **LLM-Powered Strategies**: Generate extraction strategies using natural language descriptions
- **API-Based Scraping**: Capture underlying APIs with `force_api` for dynamic sites
- **Strategy Refinement**: Iteratively improve strategies with feedback
- **Job Execution**: Run scrapes with saved strategies (no LLM costs on execution)
- **API Parameters**: Override parameters at runtime for API-based strategies
- **Batch Jobs**: Scrape multiple URLs in a single request
- **Content Analysis**: Track changes with content hashing, structural signatures, and semantic similarity
- **Scheduling**: Set up recurring scrapes with interval or cron expressions
- **Keyword Filtering**: Filter change results with Lucene-style syntax
- **Error Handling**: Comprehensive error handling with custom exceptions
- **Type Hints**: Full type annotations for better IDE support

## Installation

```bash
pip install meter-sdk
```

Or install from source:

```bash
git clone https://github.com/reverse/meter-sdk
cd meter-sdk
pip install -e .
```

## Quick Start

```python
from meter_sdk import MeterClient

# Initialize client with your API key
client = MeterClient(api_key="sk_live_...")

# Generate a strategy using LLM
result = client.generate_strategy(
    url="https://example.com/products",
    description="Extract product names and prices",
    name="Product Scraper"
)

strategy_id = result["strategy_id"]
print(f"Generated strategy: {strategy_id}")
print(f"Preview data: {result['preview_data']}")

# Create and run a scrape job
job = client.create_job(
    strategy_id=strategy_id,
    url="https://example.com/products"
)

# Wait for job to complete (automatically polls)
completed_job = client.wait_for_job(job["job_id"])
results = completed_job["results"]

print(f"Scraped {len(results)} items")
for item in results:
    print(item)
```

## Authentication

### Getting an API Key

The SDK uses API key authentication. API keys are created on the frontend using Supabase Auth. Once you have an API key (starts with `sk_live_`), use it to initialize the client:

```python
from meter_sdk import MeterClient
import os

# Load from environment variable (recommended)
api_key = os.getenv("METER_API_KEY")
client = MeterClient(api_key=api_key)

# Or pass the key directly (avoid committing real keys to source control)
client = MeterClient(api_key="sk_live_...")
```
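
When the environment variable is missing, `os.getenv` returns `None`, which may only surface later as an authentication error. The helper below is a minimal sketch for failing early; `load_api_key` is a hypothetical name (not part of the SDK), and the prefix check relies only on the `sk_live_` prefix mentioned above.

```python
import os
from typing import Mapping, Optional


def load_api_key(env: Optional[Mapping[str, str]] = None,
                 var: str = "METER_API_KEY") -> str:
    """Return the API key from the environment or raise a descriptive error.

    Hypothetical helper, not part of meter-sdk.
    """
    env = os.environ if env is None else env
    key = env.get(var, "").strip()
    if not key:
        raise RuntimeError(f"{var} is not set; export it before creating a MeterClient")
    if not key.startswith("sk_live_"):
        raise RuntimeError(f"{var} does not start with the expected 'sk_live_' prefix")
    return key
```

It can then be used as `client = MeterClient(api_key=load_api_key())`.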

## Core Concepts

### Strategies

A **strategy** is an extraction plan generated by the LLM that tells the scraper how to extract data from a webpage. Strategies are reusable - once created, you can run multiple jobs with the same strategy without incurring LLM costs.

### Jobs

A **job** is a single execution of a scrape using a strategy. Jobs run asynchronously and can be polled for status and results.

### Schedules

A **schedule** automatically runs jobs at specified intervals or cron times, making it easy to monitor websites for changes.

## Usage Guide

### Strategy Management

#### Generate a Strategy

Generate a new extraction strategy using natural language:

```python
result = client.generate_strategy(
    url="https://example.com/products",
    description="Extract product names, prices, and descriptions",
    name="E-commerce Product Scraper"
)

# Response includes:
# - strategy_id: UUID of the created strategy
# - strategy: The extraction strategy (JSON)
# - preview_data: Sample extracted data
# - attempts: Number of LLM attempts (usually 1)

strategy_id = result["strategy_id"]
print(f"Strategy created: {strategy_id}")
print(f"Preview: {result['preview_data']}")
```

Strategy generation uses a two-stage LLM approach:

1. **Haiku analysis**: Quick analysis of the page structure
2. **Sonnet generation**: Detailed strategy creation

#### Refine a Strategy

If the initial strategy doesn't capture everything you need, refine it with feedback:

```python
# First, check the preview data
result = client.generate_strategy(...)

# If something is missing, refine it
refined = client.refine_strategy(
    strategy_id=result["strategy_id"],
    feedback="The strategy is missing the product images. Also, extract the SKU field."
)

# The refined strategy uses cached HTML (no re-fetching)
# You can refine multiple times
refined_again = client.refine_strategy(
    strategy_id=result["strategy_id"],
    feedback="The price should include the currency symbol"
)
```

Refinement is fast and cost-effective because it uses cached HTML from the initial generation.
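
One way to decide whether a refinement is needed is to check the preview for fields that never appear. The sketch below assumes `preview_data` is a list of dictionaries, which this README does not guarantee; `missing_fields` is a hypothetical helper, not an SDK function.

```python
from typing import Any, Dict, Iterable, List


def missing_fields(preview_data: List[Dict[str, Any]],
                   required: Iterable[str]) -> List[str]:
    """Return the required field names that are absent or empty in every preview item."""
    missing = []
    for field in required:
        # A field counts as present if at least one item has a non-empty value for it.
        if not any(item.get(field) not in (None, "", []) for item in preview_data):
            missing.append(field)
    return missing


# Possible usage with the calls shown above:
# gaps = missing_fields(result["preview_data"], ["name", "price", "sku"])
# if gaps:
#     client.refine_strategy(
#         strategy_id=result["strategy_id"],
#         feedback=f"Also extract these fields: {', '.join(gaps)}",
#     )
```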

#### List Strategies

```python
# Get all strategies
strategies = client.list_strategies(limit=20, offset=0)

for strategy in strategies:
    print(f"{strategy['name']}: {strategy['strategy_id']}")
    print(f"  URL: {strategy['url']}")
    print(f"  Created: {strategy['created_at']}")
```

#### Get Strategy Details

```python
strategy = client.get_strategy(strategy_id)

print(f"Name: {strategy['name']}")
print(f"Description: {strategy['description']}")
print(f"Preview data: {strategy['preview_data']}")
print(f"Attempts: {strategy['attempts']}")
```

#### Delete a Strategy

```python
client.delete_strategy(strategy_id)
```

### Job Execution

#### Create a Job

Create a scrape job using an existing strategy:

```python
job = client.create_job(
    strategy_id="your-strategy-uuid",
    url="https://example.com/products"
)

job_id = job["job_id"]
status = job["status"]  # "pending"
```

Jobs run asynchronously in the background. No LLM costs are incurred during job execution - the strategy is reused.

#### Check Job Status

```python
job = client.get_job(job_id)

print(f"Status: {job['status']}")  # pending, running, completed, failed

if job["status"] == "completed":
    results = job["results"]
    print(f"Scraped {job['item_count']} items")
    print(f"Content hash: {job['content_hash']}")
elif job["status"] == "failed":
    print(f"Error: {job['error']}")
```

#### Wait for Job Completion

The SDK provides a convenient method to poll a job until it completes:

```python
from meter_sdk import MeterError

# Wait indefinitely (default: polls every 1 second)
completed_job = client.wait_for_job(job_id)

# With timeout (raises MeterError if timeout exceeded)
try:
    completed_job = client.wait_for_job(
        job_id,
        poll_interval=2.0,  # Check every 2 seconds
        timeout=300.0  # 5 minute timeout
    )
    results = completed_job["results"]
except MeterError as e:
    print(f"Job failed or timed out: {e}")
```
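
To illustrate what polling with a timeout involves, here is a conceptual sketch. It is not the SDK's implementation: `poll_until_done` is a hypothetical function, and it raises the built-in `RuntimeError` and `TimeoutError` instead of `MeterError` so that it stays self-contained.

```python
import time
from typing import Any, Callable, Dict, Optional


def poll_until_done(
    get_job: Callable[[str], Dict[str, Any]],
    job_id: str,
    poll_interval: float = 1.0,
    timeout: Optional[float] = None,
) -> Dict[str, Any]:
    """Call get_job(job_id) until the job completes, fails, or the timeout elapses."""
    # A monotonic clock is unaffected by system clock adjustments.
    deadline = None if timeout is None else time.monotonic() + timeout
    while True:
        job = get_job(job_id)
        if job["status"] == "completed":
            return job
        if job["status"] == "failed":
            raise RuntimeError(f"Job {job_id} failed: {job.get('error')}")
        if deadline is not None and time.monotonic() >= deadline:
            raise TimeoutError(f"Job {job_id} did not finish within {timeout} seconds")
        time.sleep(poll_interval)
```

Passing `client.get_job` as `get_job` would reproduce the behaviour described above under these assumptions.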

#### List Jobs

```python
# Get all jobs
all_jobs = client.list_jobs(limit=50, offset=0)

# Filter by strategy
strategy_jobs = client.list_jobs(strategy_id="your-strategy-uuid")

# Filter by status
completed_jobs = client.list_jobs(status="completed")

# Combined filters
recent_completed = client.list_jobs(
    strategy_id="your-strategy-uuid",
    status="completed",
    limit=10
)
```

#### Compare Jobs

Compare two jobs to detect changes:

```python
comparison = client.compare_jobs(job_id_1, job_id_2)

print(f"Content hash match: {comparison['content_hash_match']}")
print(f"Structural match: {comparison['structural_match']}")
print(f"Semantic similarity: {comparison['semantic_similarity']}")  # 0.0-1.0
print(f"Item count difference: {comparison['item_count_diff']}")

if comparison['structural_changes']:
    print("Structural changes detected:")
    for change in comparison['structural_changes']:
        print(f"  - {change}")
```
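
The comparison fields can be folded into a single verdict for alerting. The following is a sketch only: `classify_change` is a hypothetical helper, and the 0.9 similarity threshold is an arbitrary assumption to tune for your data.

```python
from typing import Any, Dict


def classify_change(comparison: Dict[str, Any],
                    similarity_threshold: float = 0.9) -> str:
    """Map a compare_jobs() result to 'unchanged', 'minor', or 'major'."""
    if comparison["content_hash_match"]:
        return "unchanged"  # identical content hash: nothing changed
    if not comparison["structural_match"]:
        return "major"  # the shape of the data itself changed
    if comparison["semantic_similarity"] < similarity_threshold:
        return "major"  # same structure, but the content drifted a lot
    return "minor"
```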

#### Get Strategy History

Get a timeline of all jobs for a strategy:

```python
history = client.get_strategy_history(strategy_id)

for entry in history:
    print(f"Job {entry['job_id']}: {entry['status']}")
    print(f"  Items: {entry['item_count']}")
    print(f"  Has changes: {entry['has_changes']}")
    print(f"  Created: {entry['created_at']}")
```

The `has_changes` field indicates if content changed compared to the previous job.
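
To find the most recent job that actually changed, the history can be scanned for that flag. This sketch assumes the history is ordered newest first (as the monitoring example in this README does when it reads `history[0]`); `latest_changed_job` is a hypothetical helper.

```python
from typing import Any, Dict, List, Optional


def latest_changed_job(history: List[Dict[str, Any]]) -> Optional[Dict[str, Any]]:
    """Return the most recent completed entry whose has_changes flag is set."""
    for entry in history:  # assumes newest-first ordering
        if entry.get("status") == "completed" and entry.get("has_changes"):
            return entry
    return None
```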

### Schedule Management

#### Create a Schedule (Interval)

Run a scrape at regular intervals:

```python
# Run every hour (3600 seconds)
schedule = client.create_schedule(
    strategy_id="your-strategy-uuid",
    url="https://example.com/products",
    interval_seconds=3600
)

print(f"Schedule ID: {schedule['schedule_id']}")
print(f"Next run: {schedule['next_run_at']}")
```

#### Create a Schedule (Cron)

Use cron expressions for more complex schedules:

```python
# Run daily at 9 AM
schedule = client.create_schedule(
    strategy_id="your-strategy-uuid",
    url="https://example.com/products",
    cron_expression="0 9 * * *"
)

# Run every weekday at 8 AM
schedule = client.create_schedule(
    strategy_id="your-strategy-uuid",
    url="https://example.com/products",
    cron_expression="0 8 * * 1-5"
)
```
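
Since a schedule takes either an interval or a cron expression, a small client-side check can catch mistakes before the request is sent. This is a sketch under assumptions: `schedule_timing` is a hypothetical helper, the API's own validation rules are not documented here, and the five-field check simply mirrors the cron examples above.

```python
from typing import Any, Dict, Optional


def schedule_timing(interval_seconds: Optional[int] = None,
                    cron_expression: Optional[str] = None) -> Dict[str, Any]:
    """Validate that exactly one timing option is set and return it as keyword arguments."""
    if (interval_seconds is None) == (cron_expression is None):
        raise ValueError("Provide exactly one of interval_seconds or cron_expression")
    if interval_seconds is not None:
        if interval_seconds <= 0:
            raise ValueError("interval_seconds must be positive")
        return {"interval_seconds": interval_seconds}
    # Fields: minute hour day-of-month month day-of-week
    if len(cron_expression.split()) != 5:
        raise ValueError("Expected a five-field cron expression")
    return {"cron_expression": cron_expression.strip()}
```

The result can be unpacked into the call, for example `client.create_schedule(strategy_id=..., url=..., **schedule_timing(cron_expression="0 9 * * *"))`.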

#### Create a Schedule with Webhook

You can optionally provide a webhook URL to receive scrape results:

```python
# Create schedule with webhook for receiving results
schedule = client.create_schedule(
    strategy_id="your-strategy-uuid",
    url="https://example.com/products",
    interval_seconds=3600,
    webhook_url="https://your-app.com/webhooks/scrape-results"
)
```
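
On the receiving side, a webhook endpoint could look like the standard-library sketch below. The payload format is not described in this README, so the code assumes results arrive as a JSON body in a POST request; `parse_webhook_body`, `ScrapeResultHandler`, and `serve` are hypothetical names.

```python
import json
from http.server import BaseHTTPRequestHandler, HTTPServer
from typing import Any


def parse_webhook_body(raw: bytes) -> Any:
    """Decode a webhook request body, assuming it is UTF-8 encoded JSON."""
    return json.loads(raw.decode("utf-8"))


class ScrapeResultHandler(BaseHTTPRequestHandler):
    def do_POST(self) -> None:
        length = int(self.headers.get("Content-Length", 0))
        try:
            payload = parse_webhook_body(self.rfile.read(length))
        except (UnicodeDecodeError, json.JSONDecodeError):
            self.send_response(400)  # reject bodies that are not valid JSON
            self.end_headers()
            return
        print("Received webhook payload:", payload)
        self.send_response(204)  # acknowledge quickly; process the payload elsewhere
        self.end_headers()


def serve(port: int = 8080) -> None:
    """Block forever, handling webhook deliveries on the given port."""
    HTTPServer(("", port), ScrapeResultHandler).serve_forever()
```

Calling `serve()` starts the listener. A production deployment would normally sit behind HTTPS and verify that requests really come from the scraping service.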

#### List Schedules

```python
schedules = client.list_schedules()

for schedule in schedules:
    print(f"{schedule['schedule_id']}: {schedule['schedule_type']}")
    print(f"  Enabled: {schedule['enabled']}")
    print(f"  Next run: {schedule['next_run_at']}")
```

#### Update a Schedule

```python
# Disable a schedule
client.update_schedule(schedule_id, enabled=False)

# Change the interval
client.update_schedule(
    schedule_id,
    interval_seconds=7200  # Every 2 hours
)

# Change to cron expression
client.update_schedule(
    schedule_id,
    cron_expression="0 10 * * *"  # Daily at 10 AM
)

# Update webhook URL
client.update_schedule(
    schedule_id,
    webhook_url="https://your-new-webhook-url.com/results"
)
```

#### Delete a Schedule

```python
client.delete_schedule(schedule_id)
```

## Complete Workflow Examples

### Example 1: API-Based Scraping with Parameters

For sites that load data via JavaScript APIs, use `force_api=True` to capture the underlying API:

```python
from meter_sdk import MeterClient

client = MeterClient(api_key="sk_live_...")

# Generate strategy with API capture
strategy = client.generate_strategy(
    url="https://jobs.example.com/listings",
    description="Extract job titles, companies, salaries, and locations",
    name="Job Listings API",
    force_api=True  # Force API-based capture
)

# Check the scraper type and available parameters
print(f"Scraper type: {strategy['scraper_type']}")  # 'api' or 'css'
if strategy.get('api_parameters'):
    print(f"Available parameters: {strategy['api_parameters']}")
    # e.g., {'page': 1, 'limit': 20, 'category': 'all', 'location': 'remote'}

# Run job with custom parameters
job = client.create_job(
    strategy_id=strategy["strategy_id"],
    url="https://jobs.example.com/api/listings",
    parameters={
        "category": "engineering",
        "location": "remote",
        "limit": 100
    }
)

results = client.wait_for_job(job["job_id"])
print(f"Found {results['item_count']} matching jobs")
```
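
Misspelled parameter names are easy to miss, so it can help to check overrides against the captured parameters first. The sketch below assumes `api_parameters` is a dictionary of parameter names to default values, as in the comment above; `override_parameters` is a hypothetical helper, and whether the API expects the full merged set or only the overrides is not specified here.

```python
from typing import Any, Dict


def override_parameters(defaults: Dict[str, Any],
                        overrides: Dict[str, Any]) -> Dict[str, Any]:
    """Merge runtime overrides into the captured defaults, rejecting unknown names."""
    unknown = sorted(set(overrides) - set(defaults))
    if unknown:
        raise KeyError(f"Unknown API parameters: {', '.join(unknown)}")
    # Build a new dict so the captured defaults stay untouched.
    return {**defaults, **overrides}
```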

### Example 2: E-commerce Product Monitoring

```python
from meter_sdk import MeterClient
import os

client = MeterClient(api_key=os.getenv("METER_API_KEY"))

# Step 1: Generate strategy
strategy = client.generate_strategy(
    url="https://example-store.com/products",
    description="Extract product name, price, availability status, and product URL",
    name="Product Monitor"
)

strategy_id = strategy["strategy_id"]
print(f"Strategy created: {strategy_id}")

# Step 2: Run initial scrape
job = client.create_job(strategy_id, "https://example-store.com/products")
initial_results = client.wait_for_job(job["job_id"])

print(f"Initial scrape: {initial_results['item_count']} products")

# Step 3: Set up daily monitoring
schedule = client.create_schedule(
    strategy_id=strategy_id,
    url="https://example-store.com/products",
    cron_expression="0 9 * * *"  # Daily at 9 AM
)

print(f"Monitoring schedule created: {schedule['schedule_id']}")

# Step 4: Check for changes later
history = client.get_strategy_history(strategy_id)
if len(history) > 1:
    latest = history[0]
    previous = history[1]

    if latest["has_changes"]:
        print("Changes detected!")
        comparison = client.compare_jobs(latest["job_id"], previous["job_id"])
        print(f"Semantic similarity: {comparison['semantic_similarity']}")
```

### Example 3: News Article Scraping

```python
from meter_sdk import MeterClient

client = MeterClient(api_key="sk_live_...")

# Generate strategy for news articles
strategy = client.generate_strategy(
    url="https://news.example.com/latest",
    description="Extract article headlines, authors, publication dates, and article URLs",
    name="News Scraper"
)

# Refine to include article summaries
refined = client.refine_strategy(
    strategy_id=strategy["strategy_id"],
    feedback="Also extract the article summary/excerpt if available"
)

# Run scrape
job = client.create_job(
    strategy_id=strategy["strategy_id"],
    url="https://news.example.com/latest"
)

results = client.wait_for_job(job["job_id"])["results"]

for article in results:
    print(f"{article['headline']} by {article['author']}")
    print(f"  Published: {article['publication_date']}")
    print(f"  URL: {article['url']}")
```

### Example 4: Real Estate Listings

```python
from meter_sdk import MeterClient

client = MeterClient(api_key="sk_live_...")

# Create strategy
strategy = client.generate_strategy(
    url="https://realestate.example.com/listings",
    description="Extract property address, price, bedrooms, bathrooms, square footage, and listing URL",
    name="Real Estate Monitor"
)

# Set up hourly monitoring
schedule = client.create_schedule(
    strategy_id=strategy["strategy_id"],
    url="https://realestate.example.com/listings",
    interval_seconds=3600  # Every hour
)

# Check results periodically
jobs = client.list_jobs(
    strategy_id=strategy["strategy_id"],
    status="completed",
    limit=10
)

for job_data in jobs:
    job = client.get_job(job_data["job_id"])
    print(f"Scrape at {job['completed_at']}: {job['item_count']} listings")
```

## Error Handling

The SDK raises `MeterError` for all API errors:

```python
from meter_sdk import MeterClient, MeterError

client = MeterClient(api_key="sk_live_...")

try:
    strategy = client.generate_strategy(
        url="https://example.com",
        description="Extract data",
        name="Test"
    )
except MeterError as e:
    print(f"API error: {e}")
    # Handle error (invalid API key, rate limit, etc.)

# job_id comes from an earlier create_job() call
try:
    job = client.wait_for_job(job_id, timeout=60.0)
except MeterError as e:
    print(f"Job error: {e}")
    # Handle timeout or job failure
```

Common error scenarios:

- **401 Unauthorized**: Invalid or missing API key
- **400 Bad Request**: Invalid request parameters
- **404 Not Found**: Resource doesn't exist
- **500 Internal Server Error**: Server-side error
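
Transient failures such as a 500 response are often worth retrying, while a 401 or 400 is not. The sketch below shows a generic retry with exponential backoff; `with_retries` is a hypothetical helper, and because this README documents only a single `MeterError` class, deciding which errors are transient is left to the caller.

```python
import time
from typing import Callable, Tuple, Type, TypeVar

T = TypeVar("T")


def with_retries(
    call: Callable[[], T],
    retry_on: Tuple[Type[BaseException], ...],
    attempts: int = 3,
    base_delay: float = 1.0,
) -> T:
    """Run call(), retrying on the given exceptions with exponential backoff."""
    for attempt in range(attempts):
        try:
            return call()
        except retry_on:
            if attempt == attempts - 1:
                raise  # out of attempts: surface the last error
            time.sleep(base_delay * 2 ** attempt)  # 1x, 2x, 4x, ... base_delay
    raise ValueError("attempts must be at least 1")
```

For example: `with_retries(lambda: client.get_job(job_id), retry_on=(MeterError,))`.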

## Advanced Usage

### Context Manager

The client can be used as a context manager for automatic cleanup:

```python
with MeterClient(api_key="sk_live_...") as client:
    strategies = client.list_strategies()
    # Client automatically closes HTTP connections
```

### Custom Base URL

For development or custom deployments:

```python
client = MeterClient(
    api_key="sk_live_...",
    base_url="http://localhost:8000"  # Local development
)
```

### Pagination

For endpoints that support pagination:

```python
# List strategies with pagination
offset = 0
limit = 20
all_strategies = []

while True:
    strategies = client.list_strategies(limit=limit, offset=offset)
    if not strategies:
        break
    all_strategies.extend(strategies)
    offset += limit
```
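
The same loop can be wrapped in a reusable generator. This is a sketch that works with any callable accepting `limit` and `offset` keyword arguments, such as `client.list_strategies` or `client.list_jobs`; `iter_items` is a hypothetical helper. It stops at the first page shorter than `limit`, which avoids one extra request when the last page is only partly full.

```python
from typing import Any, Callable, Dict, Iterator, List


def iter_items(fetch_page: Callable[..., List[Dict[str, Any]]],
               limit: int = 20) -> Iterator[Dict[str, Any]]:
    """Yield items page by page, stopping at the first page shorter than limit."""
    offset = 0
    while True:
        page = fetch_page(limit=limit, offset=offset)
        yield from page
        if len(page) < limit:
            return  # a short (or empty) page means there is nothing left
        offset += limit
```

Usage could be as simple as `for strategy in iter_items(client.list_strategies): ...`.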

## API Reference

### MeterClient

Main client class for interacting with the API.

#### Constructor

```python
MeterClient(api_key: str, base_url: str = "https://api.meter.sh")
```

#### Strategy Methods

- `generate_strategy(url: str, description: str, name: str, force_api: bool = False) -> Dict`
- `refine_strategy(strategy_id: str, feedback: str) -> Dict`
- `list_strategies(limit: int = 20, offset: int = 0) -> List[Dict]`
- `get_strategy(strategy_id: str) -> Dict`
- `delete_strategy(strategy_id: str) -> Dict`

#### Job Methods

- `create_job(strategy_id: str, url: Optional[str] = None, urls: Optional[List[str]] = None, parameters: Optional[Dict] = None) -> Dict`
- `execute_job(strategy_id: str, url: str, parameters: Optional[Dict] = None) -> Dict`
- `get_job(job_id: str) -> Dict`
- `list_jobs(strategy_id: Optional[str] = None, status: Optional[str] = None, limit: int = 20, offset: int = 0) -> List[Dict]`
- `wait_for_job(job_id: str, poll_interval: float = 1.0, timeout: Optional[float] = None) -> Dict`
- `compare_jobs(job_id: str, other_job_id: str) -> Dict`
- `get_strategy_history(strategy_id: str) -> List[Dict]`
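
Based on the `urls` parameter in the `create_job` signature above, a batch submission might look like the following sketch. `create_batch_job` is a hypothetical helper, and the shape of the batch response is not documented in this README, so it is returned unchanged.

```python
from typing import Any, Dict, Iterable, List


def create_batch_job(client: Any, strategy_id: str,
                     urls: Iterable[str]) -> Dict[str, Any]:
    """Submit several URLs in one create_job() call, dropping blanks and duplicates."""
    unique: List[str] = []
    for url in urls:
        url = url.strip()
        if url and url not in unique:  # keep first occurrence, preserve order
            unique.append(url)
    if not unique:
        raise ValueError("At least one URL is required")
    return client.create_job(strategy_id=strategy_id, urls=unique)
```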

#### Schedule Methods

- `create_schedule(strategy_id: str, url: Optional[str] = None, urls: Optional[List[str]] = None, interval_seconds: Optional[int] = None, cron_expression: Optional[str] = None, webhook_url: Optional[str] = None, parameters: Optional[Dict] = None) -> Dict`
- `list_schedules() -> List[Dict]`
- `update_schedule(schedule_id: str, enabled: Optional[bool] = None, url: Optional[str] = None, urls: Optional[List[str]] = None, interval_seconds: Optional[int] = None, cron_expression: Optional[str] = None, webhook_url: Optional[str] = None, parameters: Optional[Dict] = None) -> Dict`
- `delete_schedule(schedule_id: str) -> Dict`
- `get_schedule_changes(schedule_id: str, mark_seen: bool = True, filter: Optional[str] = None) -> Dict`
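
Going by the `get_schedule_changes` signature above, a keyword-filtered lookup might be wrapped as in this sketch. `peek_changes` is a hypothetical helper; the assumption that `mark_seen=False` leaves changes unread, and the exact filter grammar beyond "Lucene-style", are not confirmed by this README.

```python
from typing import Any, Dict


def peek_changes(client: Any, schedule_id: str, query: str) -> Dict[str, Any]:
    """Fetch changes matching a keyword filter without marking them as seen."""
    return client.get_schedule_changes(schedule_id, mark_seen=False, filter=query)


# Possible usage, with a Lucene-style boolean query as an illustrative guess:
# changes = peek_changes(client, schedule_id, 'price AND "in stock"')
```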

### MeterError

Exception raised for all API errors.

```python
class MeterError(Exception):
    """Base exception for Meter SDK errors"""
    pass
```

## Response Formats

All methods return dictionaries matching the API response format. See the [API documentation](https://api.meter.sh/docs) for detailed response schemas.

Key response fields:

- **Strategy responses**: `strategy_id`, `strategy`, `preview_data`, `attempts`, `scraper_type` ('css' or 'api'), `api_parameters` (for API strategies)
- **Job responses**: `job_id`, `status`, `results`, `item_count`, `content_hash`, `structural_signature`, `parameters` (if API strategy)
- **Schedule responses**: `id`, `strategy_id`, `url`, `urls`, `schedule_type`, `interval_seconds`, `cron_expression`, `enabled`, `webhook_url`, `parameters`, `next_run_at`, `last_run_at`, `created_at`, `updated_at`

## Best Practices

1. **Store API keys securely**: Use environment variables or secure storage, never hardcode
2. **Handle errors gracefully**: Always wrap API calls in try/except blocks
3. **Use timeouts**: Set appropriate timeouts for `wait_for_job()` to avoid hanging
4. **Reuse strategies**: Generate once, use many times to avoid LLM costs
5. **Monitor schedules**: Regularly check schedule status and job history
6. **Use context managers**: Use `with` statement for automatic resource cleanup
7. **Poll efficiently**: Use appropriate `poll_interval` values for `wait_for_job()`

## Troubleshooting

### Connection Errors

If you see connection errors, check:

- API key is valid and not expired
- Base URL is correct (default: `https://api.meter.sh`)
- Network connectivity

### Job Timeouts

If jobs frequently time out:

- Check if the target URL is accessible
- Verify the strategy is correct
- Check API logs for errors

### Strategy Generation Fails

If strategy generation fails:

- Ensure the URL is accessible
- Provide clear, specific descriptions
- Check API logs for LLM errors

## License

MIT

## Support

For API documentation and interactive testing, visit [https://docs.meter.sh/](https://docs.meter.sh/)
