Metadata-Version: 2.4
Name: sf-config-builder
Version: 0.1.6
Summary: Manage Screaming Frog configs programmatically
Author: Antonio
License: MIT
Project-URL: Homepage, https://github.com/Amaculus/sf-config-builder
Project-URL: Documentation, https://github.com/Amaculus/sf-config-builder#readme
Project-URL: Repository, https://github.com/Amaculus/sf-config-builder
Project-URL: Issues, https://github.com/Amaculus/sf-config-builder/issues
Keywords: seo,screaming-frog,crawling,automation,config
Classifier: Development Status :: 4 - Beta
Classifier: Intended Audience :: Developers
Classifier: License :: OSI Approved :: MIT License
Classifier: Operating System :: OS Independent
Classifier: Programming Language :: Python :: 3
Classifier: Programming Language :: Python :: 3.8
Classifier: Programming Language :: Python :: 3.9
Classifier: Programming Language :: Python :: 3.10
Classifier: Programming Language :: Python :: 3.11
Classifier: Programming Language :: Python :: 3.12
Classifier: Topic :: Internet :: WWW/HTTP :: Indexing/Search
Classifier: Topic :: Software Development :: Libraries :: Python Modules
Requires-Python: >=3.8
Description-Content-Type: text/markdown
License-File: LICENSE
Provides-Extra: dev
Requires-Dist: pytest>=7.0; extra == "dev"
Requires-Dist: pytest-cov>=4.0; extra == "dev"
Dynamic: license-file

# sf-config-builder

Manage Screaming Frog `.seospiderconfig` files programmatically.

## Installation

```bash
pip install sf-config-builder
```

### Requirements

- **Screaming Frog SEO Spider** must be installed (provides JARs for deserialization)
- Python 3.8+

## Quick Start

```python
from sfconfig import SFConfig

# Load existing config (auto-detects SF installation)
config = SFConfig.load("base.seospiderconfig")

# Or specify custom SF path
config = SFConfig.load("base.seospiderconfig", sf_path="D:/Apps/Screaming Frog SEO Spider")

# Configure for e-commerce audit
config.max_urls = 100000
config.rendering_mode = "JAVASCRIPT"

# Enable CSS/JS crawling (recommended for JS rendering)
config.set("mCrawlConfig.mCrawlCSS", True)
config.set("mCrawlConfig.mCrawlJavaScript", True)

# Add custom extractions
config.add_extraction("Price", "//span[@class='price']")
config.add_extraction("SKU", "//span[@itemprop='sku']")
config.add_extraction("Stock", ".availability", selector_type="CSS")

# Add exclude patterns
config.add_exclude(r".*\.pdf$")
config.add_exclude(r".*/admin/.*")

# Save and run
config.save("client-audit.seospiderconfig")
config.run_crawl("https://example.com", output_folder="./results")
```

## Features

### Inspect Configs

```python
config = SFConfig.load("my.seospiderconfig")

# Get specific field
max_urls = config.get("mCrawlConfig.mMaxUrls")

# List all fields
for field in config.fields():
    print(f"{field['path']}: {field['value']}")

# Filter by prefix
crawl_fields = config.fields(prefix="mCrawlConfig")
```

### Modify Configs

```python
# Direct field access
config.set("mCrawlConfig.mMaxUrls", 100000)

# Convenience properties
config.max_urls = 100000
config.max_depth = 10
config.rendering_mode = "JAVASCRIPT"  # STATIC | JAVASCRIPT
config.robots_mode = "IGNORE"         # RESPECT | IGNORE
config.crawl_delay = 0.5
config.user_agent = "MyBot/1.0"

# CSS/JS crawling & storage
config.set("mCrawlConfig.mCrawlCSS", True)
config.set("mCrawlConfig.mStoreCSS", True)
config.set("mCrawlConfig.mCrawlJavaScript", True)
config.set("mCrawlConfig.mStoreJavaScript", True)

# Performance / rate limiting
config.set("mPerformanceConfig.mLimitPerformance", True)
config.set("mPerformanceConfig.mUrlRequestsPerSecond", 5.0)
```

> **Note**: When switching to `JAVASCRIPT` rendering mode, Screaming Frog's GUI
> automatically enables CSS/JS crawling. When building configs programmatically,
> you should explicitly set `mCrawlCSS` and `mCrawlJavaScript` to `True` to
> ensure proper rendering.

### Custom Extractions

```python
# Add extraction rules
config.add_extraction(
    name="Price",
    selector="//span[@class='price']",
    selector_type="XPATH",      # XPATH | CSS | REGEX
    extract_mode="TEXT"         # TEXT | HTML_ELEMENT | INNER_HTML
)

# List extractions
for ext in config.extractions:
    print(f"{ext['name']}: {ext['selector']}")

# Remove by name
config.remove_extraction("Price")

# Clear all
config.clear_extractions()
```

### Custom Searches

```python
# Add custom search filters
config.add_custom_search(
    name="Filter 1",
    query=".*",
    mode="CONTAINS",
    data_type="REGEX",
    scope="HTML",
    case_sensitive=False
)

# Remove by name
config.remove_custom_search("Filter 1")

# Clear all
config.clear_custom_searches()
```

### Custom JavaScript

```python
# Add custom JavaScript extraction
config.add_custom_javascript(
    name="Extractor 1",
    javascript="return document.title;",
    script_type="EXTRACTION",
    timeout_secs=10,
    content_types="text/html"
)

# Remove by name
config.remove_custom_javascript("Extractor 1")

# Clear all
config.clear_custom_javascript()
```

### Exclude/Include Patterns

```python
# Excludes (URLs matching these patterns are skipped)
config.add_exclude(r".*\.pdf$")
config.add_exclude(r".*/admin/.*")

# Includes (only URLs matching these are crawled)
config.add_include(r".*/products/.*")

# List patterns
print(config.excludes)
print(config.includes)
```

### Compare Configs

```python
from sfconfig import SFConfig

diff = SFConfig.diff("old.seospiderconfig", "new.seospiderconfig")

if diff.has_changes:
    print(f"Found {diff.change_count} differences:")
    print(diff)

# Filter by prefix
crawl_changes = diff.changes_for("mCrawlConfig")
```

### Test Extractions

```python
# Test selector against live URL before full crawl
result = config.test_extraction(
    url="https://example.com/product",
    selector="//span[@class='price']",
    selector_type="XPATH"
)

if result["match_count"] > 0:
    print(f"Found: {result['matches']}")
else:
    print("Selector didn't match - fix before crawling")
```

### Run Crawls

```python
# Blocking crawl
config.run_crawl(
    url="https://example.com",
    output_folder="./results",
    export_tabs=["Internal:All", "Response Codes:All"],
    export_format="csv",
    timeout=3600
)

# Async crawl
process = config.run_crawl_async(
    url="https://example.com",
    output_folder="./results"
)
# Do other work...
process.wait()  # Block until complete
```

## Multi-Client Workflow

```python
from sfconfig import SFConfig

clients = [
    {"domain": "client1.com", "max_urls": 50000},
    {"domain": "client2.com", "max_urls": 100000},
]

for client in clients:
    config = SFConfig.load("agency-base.seospiderconfig")
    config.max_urls = client["max_urls"]
    config.add_extraction("Price", "//span[@class='price']")

    config.save(f"/tmp/{client['domain']}.seospiderconfig")
    config.run_crawl(
        url=f"https://{client['domain']}",
        output_folder=f"./results/{client['domain']}"
    )
```

## Error Handling

```python
from sfconfig import (
    SFConfig,
    SFNotFoundError,
    SFValidationError,
    SFParseError,
    SFCrawlError
)

try:
    config = SFConfig.load("my.seospiderconfig")
    config.set("mInvalidField", 123)
    config.save()
except SFNotFoundError:
    print("Install Screaming Frog first")
except SFValidationError as e:
    print(f"Invalid field: {e}")
except SFParseError as e:
    print(f"Could not parse config: {e}")
except SFCrawlError as e:
    print(f"Crawl failed: {e}")
```

## Environment Variables

| Variable | Description |
|----------|-------------|
| `SF_PATH` | Custom path to SF's JAR directory |
| `SF_CLI_PATH` | Custom path to SF CLI executable |
| `JAVA_HOME` | Custom Java installation path |

## Architecture

```
User Python code
       |
       v
+------------------+
|  sfconfig        |  (Python wrapper)
|  - SFConfig      |
|  - SFDiff        |
+--------+---------+
         | subprocess.run()
         v
+------------------+
|  ConfigBuilder   |  (Java CLI, bundled ~50KB)
|  .jar            |
+--------+---------+
         | classpath includes
         v
+------------------+
|  SF's JARs       |  (from user's local SF install, NOT bundled)
+------------------+
```

At runtime, the library builds a classpath combining:
- `ConfigBuilder.jar` (bundled with this package)
- `{SF_INSTALL_PATH}/*` (user's local Screaming Frog JARs)

This means:
- Only our small JAR is distributed (no licensing issues)
- SF's proprietary JARs are used from the user's existing installation
- Compatibility is maintained across SF versions

## Development

### Building the Java CLI

The Java source (`ConfigBuilder.java`) is included alongside the JAR in `sfconfig/java/`. To rebuild after modifications:

```bash
cd sfconfig/java

# Compile against SF's JARs (as compile-time dependency)
javac -cp "/path/to/Screaming Frog SEO Spider/lib/*" ConfigBuilder.java

# Package into JAR (include source for transparency)
jar cfe ConfigBuilder.jar ConfigBuilder *.class ConfigBuilder.java

# Clean up loose class files
rm -f *.class
```

**Important**: Only bundle `ConfigBuilder.jar`. Do NOT bundle any JARs from SF's install directory - those are proprietary and already on the user's machine.

### Installing for Development

```bash
cd sf-config-tool
pip install -e ".[dev]"
pytest tests/
```

## License

MIT
