Metadata-Version: 2.4
Name: azcrawlerpy
Version: 0.4.7
Summary: Agentic Crawler Discovery Framework.
Classifier: Programming Language :: Python :: 3
Classifier: Operating System :: OS Independent
Requires-Python: <3.12,>=3.11.10
Description-Content-Type: text/markdown
Requires-Dist: playwright==1.58.0
Requires-Dist: pydantic>=2.11.10
Requires-Dist: camoufox[geoip]==0.4.11
Provides-Extra: dev
Requires-Dist: pytest==8.4.2; extra == "dev"
Requires-Dist: pytest-asyncio==1.0.0; extra == "dev"
Requires-Dist: pytest-mock==3.14.1; extra == "dev"
Requires-Dist: pytest-cov>=4.1.0; extra == "dev"
Requires-Dist: pyright>=1.1.390; extra == "dev"
Requires-Dist: ruff>=0.8.0; extra == "dev"

# azcrawlerpy

A framework for navigating and filling multi-step web forms programmatically. Supports Camoufox anti-detect browser (built on Firefox with C++ level fingerprint spoofing) and standard Chromium via Playwright. Uses JSON instruction files to define form navigation workflows, making it ideal for automated form submission, web scraping, and AI agent-driven web interactions.

## Table of Contents

- [Installation](#installation)
- [Quick Start](#quick-start)
- [Core Concepts](#core-concepts)
- [Instructions Schema](#instructions-schema)
  - [Top-Level Structure](#top-level-structure)
  - [Browser Configuration](#browser-configuration)
  - [Cookie Consent Handling](#cookie-consent-handling)
  - [Step Definitions](#step-definitions)
  - [Field Types](#field-types)
  - [Optional Fields](#optional-fields)
  - [Action Types](#action-types)
  - [Final Page Configuration](#final-page-configuration)
  - [Data Extraction Configuration](#data-extraction-configuration)
- [Data Points (input_data)](#data-points-input_data)
- [Element Discovery](#element-discovery)
- [AI Agent Guidance](#ai-agent-guidance)
- [Retry and Resilience](#retry-and-resilience)
- [Error Handling and Diagnostics](#error-handling-and-diagnostics)
- [Browser Profile Building (Profiler)](#browser-profile-building-profiler)
- [Examples](#examples)

## Installation

```bash
uv add azcrawlerpy
```

Or install from source:

```bash
uv pip install -e .
```

## Quick Start

```python
import asyncio
from pathlib import Path
from azcrawlerpy import FormCrawler, DebugMode, CrawlerBrowserConfig, HumanizeConfig, RetryConfig

async def main():
    # CrawlerBrowserConfig controls runtime browser settings (proxy, stealth, humanize)
    browser_config = CrawlerBrowserConfig(
        humanize=HumanizeConfig(enabled=True),
    )
    crawler = FormCrawler(headless=True, browser_config=browser_config)

    instructions = {
        "url": "https://example.com/form",
        "browser_config": {
            "browser_type": "camoufox",
            "viewport_width": 1920,
            "viewport_height": 1080
        },
        "steps": [
            {
                "name": "step_1",
                "wait_for": "input[name='email']",
                "timeout_ms": 15000,
                "fields": [
                    {
                        "type": "text",
                        "selector": "input[name='email']",
                        "data_key": "email"
                    }
                ],
                "next_action": {
                    "type": "click",
                    "selector": "button[type='submit']"
                }
            }
        ],
        "final_page": {
            "wait_for": ".success-message",
            "timeout_ms": 60000
        }
    }

    input_data = {
        "email": "user@example.com"
    }

    result = await crawler.crawl(
        url=instructions["url"],
        input_data=input_data,
        instructions=instructions,
        output_dir=Path("./output"),
        debug_mode=DebugMode.ALL,
    )

    print(f"Final URL: {result.final_url}")
    print(f"Steps completed: {result.steps_completed}")
    print(f"Screenshot saved: {result.screenshot_path}")
    print(f"Extracted data: {result.extracted_data}")

asyncio.run(main())
```

## Core Concepts

The framework operates on two primary inputs:

1. **Instructions (instructions.json)**: Defines the form structure, selectors, navigation flow, and field types
2. **Data Points (input_data)**: Contains the actual values to fill into form fields

The crawler processes each step sequentially:
1. Wait for the step's `wait_for` selector to become visible
2. Fill all fields defined in the step using values from `input_data`
3. Execute the `next_action` to navigate to the next step
4. Repeat until all steps are complete
5. Wait for and capture the final page

## Instructions Schema

### Top-Level Structure

```json
{
  "url": "https://example.com/form",
  "browser_config": { ... },
  "cookie_consent": { ... },
  "steps": [ ... ],
  "final_page": { ... },
  "data_extraction": { ... }
}
```

| Field | Type | Required | Description |
|-------|------|----------|-------------|
| `url` | string | Yes | Starting URL for the form |
| `browser_config` | object | Yes | Browser engine, viewport, and user agent settings |
| `cookie_consent` | object | No | Cookie banner handling configuration |
| `captcha` | object | No | CAPTCHA handling configuration (e.g., Cloudflare Turnstile) |
| `steps` | array | Yes | Ordered list of form steps |
| `final_page` | object | Yes | Configuration for the result page |
| `data_extraction` | object | No | Configuration for extracting data from final page |
| `profiler` | object | No | Browser profile building configuration (visit sites to accumulate cookies before crawling) |

### Browser Configuration

```json
{
  "browser_config": {
    "viewport_width": 1920,
    "viewport_height": 1080,
    "user_agent": "Mozilla/5.0 ..."
  }
}
```

| Field | Type | Required | Description |
|-------|------|----------|-------------|
| `browser_type` | string | Yes | Browser engine: `camoufox` (anti-detect Firefox) or `chromium` (standard Playwright) |
| `viewport_width` | integer | Yes | Browser viewport width in pixels |
| `viewport_height` | integer | Yes | Browser viewport height in pixels |
| `user_agent` | string | No | Custom user agent string |
| `blocked_url_patterns` | array | No | URL glob patterns to block via `page.route()` (e.g., `**/analytics/**`) |

### Cookie Consent Handling

The framework supports two modes for handling cookie consent banners:

**Standard Mode** (regular DOM elements):
```json
{
  "cookie_consent": {
    "banner_selector": "dialog:has-text('cookies')",
    "accept_selector": "button:has-text('Accept')"
  }
}
```

**Shadow DOM Mode** (for Usercentrics, OneTrust, etc.):
```json
{
  "cookie_consent": {
    "banner_selector": "#usercentrics-cmp-ui",
    "shadow_host_selector": "#usercentrics-cmp-ui",
    "accept_button_texts": ["Accept All", "Alle akzeptieren"]
  }
}
```

| Field | Type | Required | Description |
|-------|------|----------|-------------|
| `banner_selector` | string | Yes | CSS selector for the banner container |
| `accept_selector` | string | No | CSS selector for accept button (standard mode) |
| `shadow_host_selector` | string | No | CSS selector for shadow DOM host |
| `accept_button_texts` | array | No | Text patterns to match accept buttons in shadow DOM |
| `banner_settle_delay_ms` | integer | No | Wait time before checking for banner |
| `banner_visible_timeout_ms` | integer | No | Timeout for banner visibility |
| `accept_button_timeout_ms` | integer | No | Timeout for accept button visibility |
| `post_consent_delay_ms` | integer | No | Wait time after handling consent |
| `js_fallback_texts` | array | No | Custom text patterns for JS fallback button matching (overrides defaults) |

The JS fallback matches button text against: `klar`, `akzept`, `accept`, `agree`, `ok`, `verstanden`, `zustimm`. Set `js_fallback_texts` to override this list with site-specific patterns.

### Step Definitions

Each step represents a form page or section:

```json
{
  "name": "personal_info",
  "wait_for": "input[name='firstName']",
  "timeout_ms": 15000,
  "fields": [ ... ],
  "next_action": { ... }
}
```

| Field | Type | Required | Description |
|-------|------|----------|-------------|
| `name` | string | Yes | Unique identifier for the step |
| `wait_for` | string | Yes | CSS selector to wait for before processing |
| `timeout_ms` | integer | Yes | Timeout in milliseconds for wait condition |
| `fields` | array | Yes | List of field definitions |
| `next_action` | object | Yes | Action to navigate to next step |
| `data_extraction` | array | No | Data to extract BEFORE field handling in this step |
| `post_field_extraction` | array | No | Data to extract AFTER field handling (for modal results, dynamic values) |
| `optional` | boolean | No | Skip step gracefully if wait_for selector is not found (see [Optional Steps](#optional-steps)) |

### Optional Steps

Steps marked with `"optional": true` are skipped gracefully when the `wait_for` selector is not found within `timeout_ms`. The entire step (fields, data extraction, next_action) is skipped with an info log. This is useful for conditional workflow pages that only appear depending on prior input values or dynamic form behavior.

```json
{
  "name": "additional_driver_details",
  "wait_for": "#additional-driver-form",
  "timeout_ms": 5000,
  "optional": true,
  "fields": [ ... ],
  "next_action": { "type": "click", "selector": "#next" }
}
```

The `optional` check takes precedence over the `strict` parameter. When a step is optional and the wait_for selector times out, the step is always skipped regardless of strict mode. Non-optional steps follow the existing behavior: raise `CrawlerTimeoutError` when `strict=True`, or log a warning and continue into the step when `strict=False`.

**Tip:** Use a shorter `timeout_ms` (e.g., 3000-5000ms) on optional steps to avoid waiting the full timeout when the step is absent.

### Field Types

#### TEXT

For text inputs, email fields, phone numbers, and similar:

```json
{
  "type": "text",
  "selector": "input[name='email']",
  "data_key": "email"
}
```

#### TEXTAREA

For multi-line text areas:

```json
{
  "type": "textarea",
  "selector": "textarea[name='message']",
  "data_key": "message"
}
```

#### DROPDOWN / SELECT

For native `<select>` elements:

```json
{
  "type": "dropdown",
  "selector": "select[name='country']",
  "data_key": "country",
  "type_config": {
    "select_by": "text"
  }
}
```

| type_config Parameter | Values | Description |
|-----------------------|--------|-------------|
| `select_by` | `text`, `value`, `index` | How to match the option |
| `option_visible_timeout_ms` | integer | Timeout in ms for option visibility |

#### RADIO

For radio button groups:

```json
{
  "type": "radio",
  "selector": "input[type='radio'][value='${value}']",
  "data_key": "payment_method"
}
```

**Pattern A - Value-driven selector**: Use `${value}` placeholder in selector, data provides the value:
```json
{
  "type": "radio",
  "selector": "input[type='radio'][value='${value}']",
  "data_key": "gender"
}
// data: { "gender": "male" }
```

**Pattern B - Boolean flags**: Use explicit selectors with boolean data values:
```json
{
  "type": "radio",
  "selector": "[role='radio']:has-text('Yes')",
  "data_key": "accept_terms",
  "force_click": true
}
// data: { "accept_terms": true }  // clicks if truthy, skips if null
```

`force_click` is supported on `radio`, `click_only`, and `click_select` field types. When set, the click fallback chain tries force click before normal click.

#### CHECKBOX

For checkbox inputs:

```json
{
  "type": "checkbox",
  "selector": "input[type='checkbox'][name='newsletter']",
  "data_key": "subscribe_newsletter"
}
```

Data value `true` checks the box, `false` unchecks it, and `null` skips the field entirely (no interaction).

#### DATE

For date inputs with format conversion:

```json
{
  "type": "date",
  "selector": "input[name='birthdate']",
  "data_key": "birthdate",
  "type_config": {
    "format": "DD.MM.YYYY"
  }
}
```

Supported formats (mapped to strftime internally):

| Format | Example | Description |
|--------|---------|-------------|
| `DD.MM.YYYY` | 15.06.1985 | Day.Month.Year |
| `MM/DD/YYYY` | 06/15/1985 | Month/Day/Year |
| `YYYY-MM-DD` | 1985-06-15 | ISO format |
| `DD/MM/YYYY` | 15/06/1985 | Day/Month/Year |
| `YYYY/MM/DD` | 1985/06/15 | Year/Month/Day |
| `DD-MM-YYYY` | 15-06-1985 | Day-Month-Year |
| `MM-DD-YYYY` | 06-15-1985 | Month-Day-Year |

Data must be provided in ISO format (`YYYY-MM-DD`) in `input_data`. The `type_config.format` specifies the output format for typing into the field. Native `<input type="date">` fields are auto-detected and use the value as-is, no `type_config` needed.

#### SLIDER

For range inputs:

```json
{
  "type": "slider",
  "selector": "input[type='range'][name='coverage']",
  "data_key": "coverage_amount"
}
```

#### FILE

For file upload fields:

```json
{
  "type": "file",
  "selector": "input[type='file']",
  "data_key": "document_path"
}
```

Data value should be the absolute file path.

#### COMBOBOX

For autocomplete/typeahead inputs:

```json
{
  "type": "combobox",
  "selector": "input[aria-label='City']",
  "data_key": "city",
  "type_config": {
    "option_selector": ".autocomplete-option",
    "type_delay_ms": 50,
    "wait_after_type_ms": 500,
    "press_enter": true
  }
}
```

| type_config Parameter | Description |
|-----------------------|-------------|
| `option_selector` | CSS selector for dropdown options (required) |
| `type_delay_ms` | Delay between keystrokes (simulates human typing) |
| `wait_after_type_ms` | Wait time for options to appear |
| `press_enter` | Press Enter after selecting option |
| `clear_before_type` | Clear field before typing |
| `option_visible_timeout_ms` | Timeout in ms for option visibility |

#### CLICK_SELECT

For custom dropdowns requiring click-then-select:

```json
{
  "type": "click_select",
  "selector": ".custom-dropdown-trigger",
  "data_key": "option_value",
  "post_click_delay_ms": 300,
  "type_config": {
    "option_selector": ".dropdown-item:has-text('${value}')"
  }
}
```

#### CLICK_ONLY

For elements that only need clicking (no data input):

```json
{
  "type": "click_only",
  "selector": "button.expand-section"
}
```

With conditional clicking based on data:

```json
{
  "type": "click_only",
  "selector": "button:has-text('${value}')",
  "data_key": "selected_option"
}
```

#### IFRAME_FIELD

For fields inside iframes (alternative to `iframe_selector`):

```json
{
  "type": "iframe_field",
  "selector": "input[name='card_number']",
  "iframe_selector": "iframe#payment-frame",
  "data_key": "card_number"
}
```

### Common Field Parameters

| Parameter | Type | Description |
|-----------|------|-------------|
| `data_key` | string | Key in input_data to get value from |
| `selector` | string | CSS/Playwright selector for the element |
| `type_config` | object | Type-specific configuration (see field type sections above) |
| `iframe_selector` | string | Selector for parent iframe if field is embedded |
| `field_visible_timeout_ms` | integer | Timeout for field to become visible |
| `post_click_delay_ms` | integer | Wait after clicking the field |
| `skip_verification` | boolean | Skip value verification after filling |
| `force_click` | boolean | Use force click to bypass overlays (`click_only`, `radio`, and `click_select`) |
| `optional` | boolean | Skip field gracefully if element is not found or interaction fails |
| `retry_config` | object | Retry configuration for transient failures (see [Retry and Resilience](#retry-and-resilience)) |

### Optional Fields

Fields marked with `"optional": true` are skipped gracefully when the element is not found on the page or when interaction fails. This is useful for elements that may or may not appear depending on dynamic page behavior, A/B tests, or conditional rendering that cannot be predicted by data alone.

```json
{
  "type": "click_only",
  "selector": "#promotional-banner button.dismiss",
  "data_key": "dismiss_promo",
  "optional": true,
  "field_visible_timeout_ms": 2000
}
```

**How `optional` differs from `null` values:**

| Mechanism | Element lookup | Use case |
|-----------|---------------|----------|
| Value = `null` in input_data | No (skipped immediately) | Field exists but should not be filled for this data row |
| `"optional": true` | Yes (waits for visibility) | Field may or may not exist on the page |

When `optional` is set and the element is not found or interaction fails, the handler logs an info message and continues to the next field. No error is raised regardless of the `strict` mode.

**Tip:** Set a low `field_visible_timeout_ms` (e.g., 1000-2000ms) on optional fields to avoid waiting the full default timeout when the element is absent.

**Note:** For skipping entire steps (all fields + next_action), use [Optional Steps](#optional-steps) instead.

### Action Types

#### CLICK

Click a button or link:

```json
{
  "type": "click",
  "selector": "button[type='submit']"
}
```

With iframe support:

```json
{
  "type": "click",
  "selector": "button:has-text('Next')",
  "iframe_selector": "iframe#form-frame"
}
```

#### WAIT

Wait for an element to appear:

```json
{
  "type": "wait",
  "selector": ".loading-complete"
}
```

#### WAIT_HIDDEN

Wait for an element to disappear:

```json
{
  "type": "wait_hidden",
  "selector": ".loading-spinner"
}
```

#### SCROLL

Scroll to an element:

```json
{
  "type": "scroll",
  "selector": "#section-bottom"
}
```

#### DELAY

Wait for a fixed time:

```json
{
  "type": "delay",
  "delay_ms": 2000
}
```

#### CONDITIONAL

Execute actions based on conditions:

```json
{
  "type": "conditional",
  "condition": {
    "type": "selector_visible",
    "selector": ".error-message"
  },
  "actions": [
    {
      "type": "click",
      "selector": "button.dismiss-error"
    }
  ]
}
```

Condition types:
- `selector_visible`: True if selector is visible on the page
- `selector_hidden`: True if selector is NOT visible on the page
- `data_equals`: True if `input_data[key]` equals `value`
- `data_exists`: True if `input_data[key]` is truthy

### Common Action Parameters

| Parameter | Type | Description |
|-----------|------|-------------|
| `selector` | string | Target element selector |
| `iframe_selector` | string | Selector for parent iframe |
| `delay_ms` | integer | Delay/timeout in ms (for `delay`, `wait`, `wait_hidden` actions) |
| `condition` | object | Condition definition (for `conditional` actions) |
| `actions` | array | Nested actions to execute if condition is met (for `conditional` actions) |
| `pre_action_delay_ms` | integer | Wait before executing action |
| `post_action_delay_ms` | integer | Wait after executing action |
| `retry_config` | object | Retry configuration for transient failures (see [Retry and Resilience](#retry-and-resilience)) |

### Final Page Configuration

```json
{
  "final_page": {
    "wait_for": ".result-container, .confirmation",
    "timeout_ms": 60000,
    "post_wait_delay_ms": 2000,
    "screenshot_selector": ".result-panel"
  }
}
```

| Field | Type | Required | Description |
|-------|------|----------|-------------|
| `wait_for` | string | Conditional | CSS selector to wait for (required if `wait_for_data_key` not set) |
| `wait_for_data_key` | string | Conditional | Data key for exact text match (uses `text="{value}"` selector). Required if `wait_for` not set. |
| `timeout_ms` | integer | Yes | Timeout in milliseconds for waiting |
| `post_wait_delay_ms` | integer | No | Delay in ms after selector found, for SPA content to render (default: 0) |
| `screenshot_selector` | string | No | Element to screenshot (null for full page) |

Only one of `wait_for` or `wait_for_data_key` can be set (not both).

### Data Extraction Configuration

Extract structured data from the final page using CSS selectors:

```json
{
  "data_extraction": {
    "fields": {
      "tier_prices": {
        "selector": ".price-value",
        "attribute": null,
        "regex": "([0-9]+[.,][0-9]{2})",
        "multiple": true,
        "iframe_selector": "iframe#form-frame"
      },
      "selected_price": {
        "selector": "#total-amount",
        "attribute": "data-value",
        "regex": null,
        "multiple": false
      }
    }
  }
}
```

| Field | Type | Required | Description |
|-------|------|----------|-------------|
| `selector` | string | Yes | CSS selector to locate element(s) |
| `attribute` | string | No | Element attribute to extract (null for text content) |
| `regex` | string | No | Regex pattern to apply (uses first capture group if present) |
| `multiple` | boolean | Yes | True for list of all matches, False for first match only |
| `iframe_selector` | string | No | CSS selector for iframe if element is inside one |

The `data_extraction` config also supports `nested_fields` to combine flat extracted arrays into structured outputs (`paired_dict` or `object_list`). See [docs/README.md](azcrawlerpy/docs/README.md) for details.

Extracted data is available in the crawl result:

```python
result = await crawler.crawl(...)
print(result.extracted_data)
# {'tier_prices': ['32,28', '35,26', '50,34'], 'selected_price': '35,26'}
```

## Data Points (input_data)

The `input_data` dictionary provides values for form fields. Keys must match `data_key` values in the instructions.

### Structure

```json
{
  "email": "user@example.com",
  "first_name": "John",
  "last_name": "Doe",
  "birthdate": "1985-06-15",
  "country": "Germany",
  "accept_terms": true,
  "newsletter": false,
  "premium_option": null
}
```

### Value Types

| Type | Description | Example |
|------|-------------|---------|
| String | Text values, dropdown selections | `"John"` |
| Boolean | Checkbox/radio toggle | `true`, `false` |
| Null | Skip this field entirely (no DOM interaction) | `null` |
| Integer/Float | Numeric inputs, sliders | `12000`, `99.99` |

Note: Setting a value to `null` skips the field without any DOM interaction. This is different from `"optional": true` on the field definition, which attempts the interaction but tolerates failure. See [Optional Fields](#optional-fields).

### Radio Button Patterns

**Pattern A - Mutually exclusive options with value selector**:
```json
{
  "gender": "male"
}
```
Selector uses `${value}` placeholder: `input[value='${value}']`

**Pattern B - Boolean flags for each option**:
```json
{
  "option_a": true,
  "option_b": null,
  "option_c": null
}
```
Only the option with `true` gets clicked.

### Date Handling

Dates in input_data should use ISO format (`YYYY-MM-DD`):
```json
{
  "birthdate": "1985-06-15",
  "start_date": "2024-01-01"
}
```

The framework converts to the format specified in the field definition.

## Element Discovery

The `ElementDiscovery` class scans web pages to identify interactive elements, helping build instructions.json files.

```python
from pathlib import Path
from azcrawlerpy import ElementDiscovery

async def discover_elements():
    discovery = ElementDiscovery(headless=False)

    report = await discovery.discover(
        url="https://example.com/form",
        output_dir=Path("./discovery_output"),
        cookie_consent={
            "banner_selector": "#cookie-banner",
            "accept_selector": "button.accept"
        },
        explore_iframes=True,
        screenshot=True,
    )

    print(f"Found {report.total_elements} elements")

    for text_input in report.text_inputs:
        print(f"Text input: {text_input.selector}")
        print(f"  Suggested type: {text_input.suggested_field_type}")

    for dropdown in report.selects:
        print(f"Dropdown: {dropdown.selector}")
        print(f"  Options: {dropdown.options}")

    for radio_group in report.radio_groups:
        print(f"Radio group: {radio_group.name}")
        for option in radio_group.options:
            print(f"  - {option.label}: {option.selector}")
```

### Discovery Report Contents

- `text_inputs`: Text, email, phone, password fields
- `textareas`: Multi-line text areas
- `selects`: Native dropdown elements with options
- `radio_groups`: Grouped radio buttons
- `checkboxes`: Checkbox inputs
- `buttons`: Clickable buttons
- `links`: Anchor elements
- `date_inputs`: Date picker fields
- `file_inputs`: File upload fields
- `sliders`: Range inputs
- `custom_components`: Non-standard interactive elements
- `iframes`: Discovered iframes with their elements

## AI Agent Guidance

This section provides instructions for AI agents tasked with creating `instructions.json` and `input_data` files.

### Workflow for Creating Instructions

1. **Discovery Phase**: Use `ElementDiscovery` to scan each page/step of the form
2. **Mapping Phase**: Map discovered elements to field definitions
3. **Flow Definition**: Define step transitions and actions
4. **Data Schema**: Create the input_data structure

### Step-by-Step Process

#### 1. Analyze the Form Structure

- Identify how many pages/steps the form has
- Note the URL pattern changes (if any)
- Identify what element appears when each step loads

#### 2. For Each Step, Define:

```json
{
  "name": "<descriptive_step_name>",
  "wait_for": "<selector_that_confirms_step_loaded>",
  "timeout_ms": 15000,
  "fields": [...],
  "next_action": {...}
}
```

**Naming conventions**:
- Use snake_case for step names: `personal_info`, `payment_details`
- Use descriptive data_keys: `first_name`, `email_address`, `accepts_terms`

#### 3. Selector Priority

When choosing selectors, prefer in order:
1. `[data-testid='...']` or `[data-cy='...']` - Most stable
2. `[aria-label='...']` or `[aria-labelledby='...']` - Accessible and stable
3. `input[name='...']` - Form field names
4. `:has-text('...')` - Text content (use for buttons/labels)
5. CSS class selectors - Least stable, avoid if possible

#### 4. Handle Dynamic Content

For AJAX-loaded content:
- Use `wait` action before interacting
- Add `field_visible_timeout_ms` to field definitions
- Use `post_click_delay_ms` for fields that trigger updates

#### 5. Radio Button Strategy

**Option A - When radio values are meaningful**:
```json
{
  "type": "radio",
  "selector": "input[type='radio'][value='${value}']",
  "data_key": "payment_type"
}
// data: { "payment_type": "credit_card" }
```

**Option B - When you need individual control**:
```json
{
  "type": "radio",
  "selector": "[role='radio']:has-text('Credit Card')",
  "data_key": "payment_credit_card",
  "force_click": true
},
{
  "type": "radio",
  "selector": "[role='radio']:has-text('PayPal')",
  "data_key": "payment_paypal",
  "force_click": true
}
// data: { "payment_credit_card": true, "payment_paypal": null }
```

#### 6. Iframe Handling

When elements are inside iframes:
```json
{
  "type": "text",
  "selector": "input[name='card_number']",
  "iframe_selector": "iframe#payment-iframe",
  "data_key": "card_number"
}
```

### Creating input_data

#### 1. Analyze Required Fields

From the instructions, extract all unique `data_key` values:
```python
data_keys = set()
for step in instructions["steps"]:
    for field in step["fields"]:
        if field.get("data_key"):
            data_keys.add(field["data_key"])
```

#### 2. Determine Value Types

| Field Type | Data Type | Example |
|------------|-----------|---------|
| text, textarea | string | `"John Doe"` |
| dropdown | string | `"Germany"` |
| radio (value-driven) | string | `"option_a"` |
| radio (boolean) | boolean/null | `true` or `null` |
| checkbox | boolean | `true` / `false` |
| date | string (ISO) | `"1985-06-15"` |
| slider | number | `50000` |
| file | string (path) | `"/path/to/file.pdf"` |

#### 3. Handle Mutually Exclusive Options

For radio groups with boolean flags, only ONE should be `true`:
```json
{
  "employment_fulltime": true,
  "employment_parttime": null,
  "employment_selfemployed": null,
  "employment_unemployed": null
}
```

#### 4. Date Format

Always provide dates in ISO format in input_data:
```json
{
  "birthdate": "1985-06-15",
  "policy_start": "2024-01-01"
}
```

The instructions specify the output format for the specific form.

### Common Patterns

#### Multi-Step Wizard
```json
{
  "steps": [
    {
      "name": "step_1_personal",
      "wait_for": "input[name='firstName']",
      "fields": [...],
      "next_action": { "type": "click", "selector": "button:has-text('Next')" }
    },
    {
      "name": "step_2_address",
      "wait_for": "input[name='street']",
      "fields": [...],
      "next_action": { "type": "click", "selector": "button:has-text('Next')" }
    }
  ]
}
```

#### Form with Loading States
```json
{
  "next_action": {
    "type": "click",
    "selector": "button[type='submit']",
    "post_action_delay_ms": 1000
  }
}
```

#### Conditional Fields
```json
{
  "type": "conditional",
  "condition": {
    "type": "data_equals",
    "key": "has_additional_driver",
    "value": true
  },
  "actions": [
    {
      "type": "click",
      "selector": "button:has-text('Add Driver')"
    }
  ]
}
```

## Retry and Resilience

The framework provides built-in retry and fallback mechanisms for handling transient failures during web interactions.

### Retry Configuration

Add `retry_config` to any field, action, step extraction, or data extraction to enable automatic retry with exponential backoff:

```json
{
  "type": "text",
  "selector": "input[name='email']",
  "data_key": "email",
  "retry_config": {
    "max_attempts": 3,
    "base_delay_ms": 500,
    "backoff_multiplier": 1.5
  }
}
```

| Parameter | Type | Required | Description |
|-----------|------|----------|-------------|
| `max_attempts` | integer | Yes | Total attempts (1 = no retry, 2 = one retry, etc.) |
| `base_delay_ms` | integer | Yes | Base delay between retries in milliseconds |
| `backoff_multiplier` | float | No | Multiplier per retry (default: 1.5). Delay = `base_delay_ms * backoff_multiplier^(attempt-1)` |

Retryable errors include Playwright `TimeoutError` and `Error` (element not found, intercepted clicks, etc.).

### Click Fallback Chain

All click interactions (`click` actions, `click_only`, `radio`, `click_select` fields) use an escalating fallback strategy when a click fails:

1. **Normal click** (or force click if `force_first=True`)
2. **Force click** (or normal click if `force_first=True`) -- bypasses actionability checks
3. **JS dispatchEvent** -- fires a `MouseEvent` via `element.dispatchEvent()`
4. **JS element.click()** -- calls `element.click()` via `document.querySelector()`

This handles common real-world issues like overlay interception, sticky headers, and elements that pass Playwright's visibility checks but fail to receive pointer events.

### Value Verification

`text` and `textarea` fields verify the actual input value after filling. If the value written to the field does not match the expected value, a `FieldInteractionError` is raised. This catches silent data corruption from autofill interference, input masks, or JavaScript reformatting.

Set `skip_verification: true` on the field to disable this check for fields where the site intentionally transforms the input (e.g., phone number formatting).

### ComboBox Degraded Retry

When `retry_config` is set on a `combobox` field, retries use degraded parameters to handle slow autocomplete responses:
- Typing delay scales by `1.5^(attempt-1)` (slower typing gives autocomplete more time)
- The field is always cleared before retyping on retry attempts

## Error Handling and Diagnostics

The framework provides detailed error information when failures occur.

### Exception Types

| Exception | When Raised |
|-----------|-------------|
| `FieldNotFoundError` | Selector doesn't match any element (not raised for `optional` fields) |
| `FieldInteractionError` | Element found but interaction failed, or value mismatch after fill (not raised for `optional` fields) |
| `CrawlerTimeoutError` | Wait condition not met within timeout |
| `NavigationError` | Navigation action failed |
| `MissingDataError` | Required data_key not in input_data |
| `InvalidInstructionError` | Malformed instructions JSON |
| `UnsupportedFieldTypeError` | Unknown field type specified |
| `UnsupportedActionTypeError` | Unknown action type specified |
| `IframeNotFoundError` | Specified iframe not found |
| `DataExtractionError` | Data extraction from final page failed |

### Debug Mode

Enable debug mode to capture screenshots at various stages:

```python
from azcrawlerpy import DebugMode

result = await crawler.crawl(
    ...,
    debug_mode=DebugMode.ALL,  # Capture all screenshots
)
```

| Mode | Description |
|------|-------------|
| `NONE` | No debug screenshots |
| `START` | Screenshot at form start |
| `END` | Screenshot at form end |
| `ALL` | Screenshots after every field and action |

### AI Diagnostics

When errors occur with debug mode enabled, the framework captures:

- Current page URL and title
- Available `data-cy` and `data-testid` selectors
- Visible buttons and input fields
- Similar selectors (fuzzy matching suggestions)
- Console errors and warnings
- Failed network requests
- XHR/fetch API responses (URL, status, method) for debugging SPA extraction failures
- HTML snippet of the form area
- Error screenshot (saved to disk or captured in-memory)

This information is included in the exception message and saved to `error_diagnostics.json` (when `output_dir` is set).

### In-Memory Error Diagnostics

When running with `output_dir=None`, error diagnostics are fully available in-memory. The error screenshot is captured as bytes and accessible in two ways:

1. On the exception's diagnostics: `exception.diagnostics.screenshot_bytes`
2. As the last entry in the partial result: `exception.partial_result.screenshots[-1]`

```python
try:
    result = await crawler.crawl(
        url=url,
        input_data=input_data,
        instructions=instructions,
        output_dir=None,
        debug_mode=DebugMode.ALL,
    )
except CrawlerError as e:
    # Access the error screenshot bytes directly from diagnostics
    if e.diagnostics and e.diagnostics.screenshot_bytes:
        error_screenshot = e.diagnostics.screenshot_bytes

    # Or from the partial result (includes all screenshots taken during the crawl)
    if e.partial_result:
        all_screenshots = e.partial_result.screenshots  # includes error screenshot as last item
        extracted_so_far = e.partial_result.extracted_data
        steps_done = e.partial_result.steps_completed
```

## Browser Profile Building (Profiler)

The profiler visits random sites before the main crawl to accumulate cookies and storage, making the browser appear more natural. This is configured via the `profiler` field in instructions.json.

### Profiler Modes

The profiler supports two storage modes:

**Disk mode** (`storage_path` set): Persists the browser profile to disk.
- Camoufox: saves as a Firefox user data directory (`{storage_path}.camoufox/`)
- Chromium: saves as a JSON file (`{storage_path}.chromium.json`)

**In-memory mode** (`storage_path` omitted or `null`): No permanent files written.
- Chromium: returns storage state as a dict, passed directly to `browser.new_context(storage_state=dict)`
- Camoufox: uses a temporary directory (cleaned up on error, passed as `user_data_dir` during crawl)

### Profiler Configuration

```json
{
  "profiler": {
    "sites": [
      {
        "url": "https://www.google.de",
        "cookie_consent": {
          "banner_selector": "[role='dialog'], .cookie-banner",
          "accept_selector": "button[id*='accept'], button[id*='agree']",
          "js_fallback_texts": ["akzept", "accept", "zustimm"]
        },
        "browse_delay_ms": 2000
      },
      {
        "url": "https://en.wikipedia.org",
        "browse_delay_ms": 3000
      }
    ],
    "visit_count": 3,
    "ignore_errors": true,
    "storage_path": null,
    "inter_site_delay_ms": 1000
  }
}
```

### ProfilerConfig Parameters

| Field | Type | Required | Description |
|-------|------|----------|-------------|
| `sites` | array | Yes | List of `ProfilerSiteConfig` objects to randomly select from |
| `visit_count` | integer | Yes | Number of sites to randomly visit (must be <= length of `sites`) |
| `ignore_errors` | boolean | Yes | If true, continue profiling when a single site visit fails |
| `storage_path` | string or null | No | Base path for profile storage. When null (default), runs in-memory mode |
| `inter_site_delay_ms` | integer | No | Delay in ms between visiting each site |

### ProfilerSiteConfig Parameters

Each site in the `sites` array supports per-site cookie consent handling:

| Field | Type | Required | Description |
|-------|------|----------|-------------|
| `url` | string | Yes | URL to visit for profile building |
| `cookie_consent` | object | No | Cookie consent config for this specific site (same schema as top-level `cookie_consent`) |
| `browse_delay_ms` | integer | No | Time in ms to linger on the page after cookie consent handling |

### ProfilerResult

The profiler returns a `ProfilerResult` with:

| Field | Type | Description |
|-------|------|-------------|
| `visited_urls` | list[str] | URLs that were visited during profiling |
| `storage_state_path` | Path or None | Path to persisted profile (disk mode), or temp dir path (Camoufox in-memory) |
| `storage_state` | dict or None | In-memory storage state dict (Chromium in-memory mode only) |

The crawler automatically passes the profiler result to the browser context creation, so no manual wiring is needed.

## Examples

### Insurance Quote Form

**instructions.json**:
```json
{
  "url": "https://insurance.example.com/quote",
  "browser_config": {
    "viewport_width": 1920,
    "viewport_height": 1080
  },
  "cookie_consent": {
    "banner_selector": "#cookie-banner",
    "accept_selector": "button:has-text('Accept')"
  },
  "steps": [
    {
      "name": "vehicle_info",
      "wait_for": "input[name='hsn']",
      "timeout_ms": 15000,
      "fields": [
        {
          "type": "text",
          "selector": "input[name='hsn']",
          "data_key": "vehicle_hsn"
        },
        {
          "type": "text",
          "selector": "input[name='tsn']",
          "data_key": "vehicle_tsn"
        },
        {
          "type": "date",
          "selector": "input[name='registration_date']",
          "data_key": "first_registration",
          "type_config": {
            "format": "MM.YYYY"
          }
        }
      ],
      "next_action": {
        "type": "click",
        "selector": "button:has-text('Continue')"
      }
    },
    {
      "name": "personal_info",
      "wait_for": "input[name='birthdate']",
      "timeout_ms": 15000,
      "fields": [
        {
          "type": "date",
          "selector": "input[name='birthdate']",
          "data_key": "birthdate",
          "type_config": {
            "format": "DD.MM.YYYY"
          }
        },
        {
          "type": "text",
          "selector": "input[name='zipcode']",
          "data_key": "postal_code"
        }
      ],
      "next_action": {
        "type": "click",
        "selector": "button:has-text('Get Quote')"
      }
    }
  ],
  "final_page": {
    "wait_for": ".quote-result",
    "timeout_ms": 60000,
    "screenshot_selector": ".quote-panel"
  }
}
```

**data_row.json**:
```json
{
  "vehicle_hsn": "0603",
  "vehicle_tsn": "AKZ",
  "first_registration": "2020-03-15",
  "birthdate": "1985-06-20",
  "postal_code": "80331"
}
```

### Form with Iframes

```json
{
  "steps": [
    {
      "name": "embedded_form",
      "wait_for": "iframe#form-frame",
      "timeout_ms": 15000,
      "fields": [
        {
          "type": "text",
          "selector": "input[name='email']",
          "iframe_selector": "iframe#form-frame",
          "data_key": "email"
        },
        {
          "type": "dropdown",
          "selector": "select[name='plan']",
          "iframe_selector": "iframe#form-frame",
          "data_key": "selected_plan",
          "type_config": {
            "select_by": "text"
          }
        }
      ],
      "next_action": {
        "type": "click",
        "selector": "button:has-text('Submit')",
        "iframe_selector": "iframe#form-frame"
      }
    }
  ]
}
```

## License

MIT
