Metadata-Version: 2.3
Name: eurlxp
Version: 0.5.0
Summary: A modern EUR-Lex parser for Python - fetch and parse EU legal documents
Keywords: eurlex,eu,legal,parser,regulations,directives,celex
Author: Moritz Schlichting
Author-email: Moritz Schlichting <moritz@info.nl>
License: MIT
Classifier: Development Status :: 4 - Beta
Classifier: Intended Audience :: Developers
Classifier: Intended Audience :: Legal Industry
Classifier: License :: OSI Approved :: MIT License
Classifier: Operating System :: OS Independent
Classifier: Programming Language :: Python :: 3
Classifier: Programming Language :: Python :: 3.10
Classifier: Programming Language :: Python :: 3.11
Classifier: Programming Language :: Python :: 3.12
Classifier: Programming Language :: Python :: 3.13
Classifier: Programming Language :: Python :: 3.14
Classifier: Topic :: Text Processing :: Markup :: HTML
Classifier: Topic :: Text Processing :: Markup :: XML
Classifier: Typing :: Typed
Requires-Dist: httpx>=0.27.0
Requires-Dist: beautifulsoup4>=4.12.0
Requires-Dist: lxml>=5.0.0
Requires-Dist: pydantic>=2.0.0
Requires-Dist: pandas>=2.0.0
Requires-Dist: typer>=0.12.0
Requires-Dist: rich>=13.0.0
Requires-Dist: pymupdf>=1.24.0
Requires-Dist: eurlxp[sparql,dev] ; extra == 'all'
Requires-Dist: pytest>=8.0.0 ; extra == 'dev'
Requires-Dist: pytest-cov>=4.0.0 ; extra == 'dev'
Requires-Dist: pytest-asyncio>=0.23.0 ; extra == 'dev'
Requires-Dist: syrupy>=5.1.0 ; extra == 'dev'
Requires-Dist: ruff>=0.4.0 ; extra == 'dev'
Requires-Dist: pyright>=1.1.0 ; extra == 'dev'
Requires-Dist: pre-commit>=3.0.0 ; extra == 'dev'
Requires-Dist: sparqlwrapper>=2.0.0 ; extra == 'sparql'
Requires-Dist: rdflib>=7.0.0 ; extra == 'sparql'
Maintainer: Moritz Schlichting
Maintainer-email: Moritz Schlichting <moritz@info.nl>
Requires-Python: >=3.10
Project-URL: Homepage, https://github.com/morrieinmaas/eurlxp
Project-URL: Documentation, https://github.com/morrieinmaas/eurlxp#readme
Project-URL: Repository, https://github.com/morrieinmaas/eurlxp.git
Project-URL: Issues, https://github.com/morrieinmaas/eurlxp/issues
Project-URL: Changelog, https://github.com/morrieinmaas/eurlxp/blob/main/CHANGELOG.md
Provides-Extra: all
Provides-Extra: dev
Provides-Extra: sparql
Description-Content-Type: text/markdown

# eurlxp

<p>
    <a href="https://github.com/morrieinmaas/eurlxp/actions/workflows/ci.yml"><img src="https://github.com/morrieinmaas/eurlxp/actions/workflows/ci.yml/badge.svg" alt="CI" height="18"></a>
    <a href="https://badge.fury.io/py/eurlxp"><img src="https://badge.fury.io/py/eurlxp.svg" alt="PyPI version" height="18"></a>
    <a href="https://github.com/astral-sh/ruff"><img src="https://img.shields.io/endpoint?url=https://raw.githubusercontent.com/astral-sh/ruff/main/assets/badge/v2.json" alt="Ruff" height="18"></a>
    <img src="https://img.shields.io/badge/python-3.10%20%7C%203.11%20%7C%203.12%20%7C%203.13%20%7C%203.14-blue" alt="Python versions" height="18">
</p>

A modern EUR-Lex parser for Python. Fetch and parse EU legal documents with async support, type hints, and a CLI.

> **Note**: This is a modern rewrite inspired by [kevin91nl/eurlex](https://github.com/kevin91nl/eurlex), built with uv, httpx, Pydantic, and Typer.

## Features

- **Modern Python** - Supports Python 3.10-3.14
- **Async support** - Fetch multiple documents concurrently
- **Type hints** - Full type annotations for IDE support
- **CLI** - Command-line interface with Typer
- **Pydantic models** - Validated, structured data
- **Drop-in compatible** - Same API as the original eurlex package
- **Bot detection handling** - Browser-like headers and WAF challenge detection
- **Rate limiting** - Configurable delays between requests
- **SPARQL support** - Alternative data source that bypasses HTML scraping
- **PDF extraction** - Automatic text extraction from PDF for older documents without HTML

## Installation

```bash
# Using pip
pip install eurlxp

# Using uv
uv add eurlxp

# With SPARQL support (required for get_celex_dataframe, run_query, get_regulations, etc.)
# Quoting the extra keeps shells such as zsh from treating the brackets as a glob pattern
pip install "eurlxp[sparql]"
# or
uv add "eurlxp[sparql]"
```

> **Note**: SPARQL functions (`get_celex_dataframe`, `run_query`, `get_regulations`, `get_documents`, `guess_celex_ids_via_eurlex`) require the optional `sparql` dependencies. If you see `ImportError: SPARQL dependencies not installed`, install with `pip install eurlxp[sparql]`.

> **PDF extraction**: Included by default (no extra install needed). For older documents that have no HTML version, the text is extracted automatically from the PDF.

## How It Works

This package fetches EU legal documents from EUR-Lex using its public HTML endpoint:

```text
https://eur-lex.europa.eu/legal-content/{LANG}/TXT/HTML/?uri=CELEX:{CELEX_ID}
```

You can verify this manually with curl:

```bash
# Fetch a regulation (EU Drone Regulation 2019/947)
curl -s "https://eur-lex.europa.eu/legal-content/EN/TXT/HTML/?uri=CELEX:32019R0947" | head -50

# Or with a different language (German)
curl -s "https://eur-lex.europa.eu/legal-content/DE/TXT/HTML/?uri=CELEX:32019R0947" | head -50
```

The equivalent using this package's CLI:

```bash
# Fetch as HTML
uvx eurlxp fetch 32019R0947 --format html | head -50

# Fetch and parse to JSON
uvx eurlxp fetch 32019R0947 --format json | head -30

# Fetch and parse to CSV
uvx eurlxp fetch 32019R0947 --format csv | head -10

# Get document info (shows row count, articles, etc.)
uvx eurlxp info 32019R0947
```
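
The URL template shown earlier in this section can also be filled in programmatically. The following is a minimal, stdlib-only sketch and not the package's own code: the names `EURLEX_HTML_TEMPLATE` and `build_html_url` are hypothetical, and upper-casing the language code is an assumption based on the `EN`/`DE` examples above.

```python
from urllib.parse import quote

# Template copied from the "How It Works" section of this README.
EURLEX_HTML_TEMPLATE = (
    "https://eur-lex.europa.eu/legal-content/{lang}/TXT/HTML/?uri=CELEX:{celex_id}"
)


def build_html_url(celex_id: str, language: str = "en") -> str:
    """Return the public EUR-Lex HTML URL for a CELEX ID (hypothetical helper)."""
    # Parentheses are kept as-is so IDs with a suffix such as "R(06)" stay readable.
    safe_id = quote(celex_id.strip(), safe="()")
    return EURLEX_HTML_TEMPLATE.format(lang=language.upper(), celex_id=safe_id)


print(build_html_url("32019R0947"))
print(build_html_url("32019R0947", language="de"))
```

The two printed URLs are the same ones used in the curl commands above.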

## Quick Start

```python
from eurlxp import get_html_by_celex_id, parse_html, WAFChallengeError

# Fetch and parse a regulation
celex_id = "32019R0947"
try:
    html = get_html_by_celex_id(celex_id)
    df = parse_html(html)

    # Get Article 1
    df_article_1 = df[df.article == "1"]
    print(df_article_1.iloc[0].text)
    # "This Regulation lays down detailed provisions for the operation of unmanned aircraft systems..."
except WAFChallengeError:
    print("Bot detection triggered - try using SPARQL functions instead")
```

### Async Usage

```python
import asyncio
from eurlxp import AsyncEURLexClient, parse_html

async def fetch_documents():
    # Use rate limiting to avoid bot detection
    async with AsyncEURLexClient(request_delay=2.0) as client:
        # Fetch multiple documents concurrently
        docs = await client.fetch_multiple(["32019R0947", "32019R0945"])
        for celex_id, html in docs.items():
            df = parse_html(html)
            print(f"{celex_id}: {len(df)} rows")

asyncio.run(fetch_documents())
```

### Handling Bot Detection

EUR-Lex uses AWS WAF (Web Application Firewall) with JavaScript challenges to detect automated requests. **This cannot be bypassed in pure Python** because it requires JavaScript execution to solve a cryptographic puzzle. The library provides several strategies:

```python
from eurlxp import EURLexClient, ClientConfig, WAFChallengeError, get_html_by_celex_id

# Strategy 1: Automatic SPARQL fallback (recommended)
# When WAF blocks HTML scraping, automatically fetch metadata via SPARQL
config = ClientConfig(sparql_fallback=True)
with EURLexClient(config=config) as client:
    html = client.get_html_by_celex_id("32019R0947")  # Falls back to SPARQL if blocked

# Strategy 2: Use rate limiting to avoid triggering WAF
with EURLexClient(request_delay=2.0) as client:  # 2 second delay between requests
    html = client.get_html_by_celex_id("32019R0947")

# Strategy 3: Use custom configuration
config = ClientConfig(
    request_delay=3.0,           # Delay between requests
    use_browser_headers=True,    # Use browser-like headers (default)
    referer="https://eur-lex.europa.eu/",  # Add referer header
)
with EURLexClient(config=config) as client:
    html = client.get_html_by_celex_id("32019R0947")

# Strategy 4: Handle WAF challenges manually
try:
    html = get_html_by_celex_id("32019R0947")
except WAFChallengeError:
    # Fall back to SPARQL manually
    from eurlxp import get_documents
    docs = get_documents(types=["REG"], limit=10)

# Strategy 5: Disable WAF exception (get raw challenge HTML)
config = ClientConfig(raise_on_waf=False)
with EURLexClient(config=config) as client:
    html = client.get_html_by_celex_id("32019R0947")  # Returns challenge HTML if blocked
```
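
Strategy 2 works by spacing requests out. What a `request_delay` setting implies can be sketched with the standard library alone. This is an illustration, not eurlxp's implementation: the class name `MinIntervalLimiter` and its injectable `clock` and `sleep` parameters are hypothetical, and exist mainly so the behavior can be checked without real waiting.

```python
import time
from typing import Callable, Optional


class MinIntervalLimiter:
    """Ensure at least `delay` seconds pass between consecutive calls to wait()."""

    def __init__(
        self,
        delay: float,
        clock: Callable[[], float] = time.monotonic,
        sleep: Callable[[float], None] = time.sleep,
    ) -> None:
        self._delay = delay
        self._clock = clock
        self._sleep = sleep
        self._last: Optional[float] = None  # time of the previous request, if any

    def wait(self) -> float:
        """Sleep if the previous call was too recent; return the seconds slept."""
        pause = 0.0
        if self._last is not None:
            elapsed = self._clock() - self._last
            pause = max(0.0, self._delay - elapsed)
            if pause > 0:
                self._sleep(pause)
        self._last = self._clock()
        return pause


limiter = MinIntervalLimiter(delay=0.05)
for celex_id in ["32019R0947", "32019R0945"]:
    limiter.wait()  # a real client would issue its HTTP request after this call
    print(f"would fetch {celex_id}")
```

The first call never sleeps; later calls sleep only for the part of the delay that has not already elapsed.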

> **Why can't we bypass WAF in Python?** AWS WAF requires a real browser to execute JavaScript that solves a cryptographic challenge and sets a cookie. HTTP libraries like httpx can't execute JavaScript. For browser automation, consider Playwright or Selenium, but SPARQL is the cleaner solution.

### Using SPARQL (Recommended for Bulk Data)

The SPARQL endpoint (`https://publications.europa.eu/webapi/rdf/sparql`) doesn't trigger bot detection and is ideal for bulk operations. It's the **recommended approach** when HTML scraping is blocked.

```python
from eurlxp import get_documents, get_regulations, run_query, guess_celex_ids_via_eurlex

# Convert slash notation to CELEX ID (uses SPARQL, not HTML scraping)
celex_ids = guess_celex_ids_via_eurlex("2019/947")
# Returns: ['32019R0947']

# Get list of regulations (returns CELLAR IDs)
cellar_ids = get_regulations(limit=100)

# Get documents with metadata
docs = get_documents(types=["REG", "DIR"], limit=50)
for doc in docs:
    print(f"{doc['celex']}: {doc['date']} - {doc['type']}")

# Run custom SPARQL queries
results = run_query("""
    SELECT ?doc ?celex WHERE {
        ?doc cdm:resource_legal_id_celex ?celex .
    } LIMIT 10
""")
```

**SPARQL functions include automatic retry with exponential backoff** for handling temporary 503 errors:

```python
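from eurlxp import run_query, SPARQLServiceError

# Any SPARQL query string; this one is only a placeholder
query = "SELECT ?s WHERE { ?s ?p ?o } LIMIT 1"

try:
    # Default behavior: automatic retry, 3 attempts with 2s, 4s, 8s delays
    results = run_query(query)

    # Or customize retry behavior
    results = run_query(query, max_retries=5, retry_delay=3.0)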
except SPARQLServiceError as e:
    print(f"SPARQL endpoint unavailable: {e}")
```
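
To make the retry schedule concrete, here is a small stdlib-only sketch of exponential backoff. It is not the package's implementation: `backoff_delays` and `call_with_retry` are hypothetical names, the parameter names simply mirror `max_retries`, `retry_delay`, and `retry_backoff` from the configuration table, and treating `max_retries` as the number of retries after the first attempt is an assumption.

```python
import time
from typing import Callable, Tuple, Type, TypeVar

T = TypeVar("T")


def backoff_delays(
    max_retries: int = 3, retry_delay: float = 2.0, retry_backoff: float = 2.0
) -> list:
    """Return the wait before each retry: retry_delay * retry_backoff ** attempt."""
    return [retry_delay * retry_backoff**attempt for attempt in range(max_retries)]


def call_with_retry(
    func: Callable[[], T],
    *,
    max_retries: int = 3,
    retry_delay: float = 2.0,
    retry_backoff: float = 2.0,
    retry_on: Tuple[Type[BaseException], ...] = (ConnectionError,),
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Call func(), sleeping with growing delays after each retryable failure."""
    for delay in backoff_delays(max_retries, retry_delay, retry_backoff):
        try:
            return func()
        except retry_on:
            sleep(delay)
    # Final attempt: any exception raised here propagates to the caller.
    return func()


print(backoff_delays())
```

With the defaults, the computed delays match the 2s, 4s, 8s sequence mentioned in the example above.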

> **Note**: SPARQL functions require `pip install eurlxp[sparql]`

### Fetching Documents by Date (Bulk Downloads)

The most reliable way to bulk download EUR-Lex documents is to query by date range, which returns both the document IDs and direct cellar URLs:

```python
from eurlxp import get_ids_and_urls_via_date, get_html_by_cellar_url, parse_html, DateType

# Get documents published on a specific date
docs = get_ids_and_urls_via_date("2026-01-15")

# Or find documents MODIFIED in a date range (catches updates to old documents)
docs = get_ids_and_urls_via_date(
    "2026-01-01", "2026-01-31",
    date_type=DateType.MODIFIED
)

# Process each document
for doc in docs:
    print(f"ID: {doc.raw_id}")
    print(f"Valid CELEX: {doc.celex_id}")  # None if format is non-standard
    print(f"Cellar URL: {doc.cellar_url}")  # Always works for fetching

    # Fetch using the cellar URL (always works)
    html = get_html_by_cellar_url(doc.cellar_url)
    df = parse_html(html)
```

**Date type options:**
- `DateType.DOCUMENT` (default) - Publication date
- `DateType.MODIFIED` - Last modification date (finds amendments to old documents)
- `DateType.CREATED` - Creation date in CELLAR

### Understanding Document Identifiers

EUR-Lex uses several identifier formats. This package handles them all:

| Format | Example | Description |
|--------|---------|-------------|
| **CELEX ID** | `32019R0947` | Standard format: `[sector][year][type][number]` |
| **CELEX with suffix** | `32012L0029R(06)` | CELEX + revision indicator |
| **Cellar URL** | `http://publications.europa.eu/resource/cellar/abc123` | Direct URL (always works) |
| **Cellar ID** | `cellar:abc-123-def` or `abc-123-def` | UUID-based identifier |
| **OJ Reference** | `C/2026/00064` | Official Journal reference (not a CELEX) |

```python
from eurlxp import detect_id_type, get_html, fetch_documents, parse_celex_id

# Detect identifier type
detect_id_type("32019R0947")  # Returns: "celex"
detect_id_type("http://publications.europa.eu/resource/cellar/abc")  # Returns: "cellar_url"
detect_id_type("C/2026/00064")  # Returns: "oj_reference"

# Parse CELEX ID into components
parse_celex_id("32019R0947")
# Returns: {'sector': '3', 'year': '2019', 'doc_type': 'R', 'number': '0947', 'suffix': None}

# Fetch a document by any identifier type (auto-detects)
html = get_html("32019R0947")  # CELEX
html = get_html("http://publications.europa.eu/resource/cellar/abc123")  # URL
html = get_html("C/2026/00064")  # OJ reference - uses SPARQL to find cellar URL

# Batch fetch documents with mixed identifier types
results = fetch_documents([
    "32019R0947",  # CELEX
    "http://publications.europa.eu/resource/cellar/abc123",  # URL
    "C/2026/00064",  # OJ reference - looked up via SPARQL
])
```

**CELEX ID structure:**
- **Sector** (1 char): `3` = legislation, `5` = preparatory docs, `6` = case law, etc.
- **Year** (4 digits): Publication year
- **Type** (1-3 chars): `R` = regulation, `L` = directive, `D` = decision, etc.
- **Number** (2-5 digits): Document number

See the [official EUR-Lex documentation](https://eur-lex.europa.eu/content/help/eurlex-content/celex-number.html) for complete details.
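
The structure above can be approximated with a regular expression. The sketch below uses only the standard library and is not how eurlxp itself parses identifiers: the patterns, the names `split_celex_id` and `classify_identifier`, and the `"cellar_id"` and `"unknown"` labels are assumptions made for illustration, and real CELEX numbers have more variants than this pattern covers.

```python
import re
from typing import Optional

# Assumed pattern: [sector][year][type][number] plus an optional suffix.
CELEX_RE = re.compile(
    r"(?P<sector>[0-9A-Z])"
    r"(?P<year>\d{4})"
    r"(?P<doc_type>[A-Z]{1,3})"
    r"(?P<number>\d{2,5})"
    r"(?P<suffix>\S+)?"
)
OJ_RE = re.compile(r"[A-Z]/\d{4}/\d+")  # e.g. C/2026/00064
UUID_RE = re.compile(r"[0-9a-f]{8}(-[0-9a-f]{4}){3}-[0-9a-f]{12}", re.IGNORECASE)


def split_celex_id(celex_id: str) -> Optional[dict]:
    """Split a CELEX ID into its components, or return None if it does not match."""
    match = CELEX_RE.fullmatch(celex_id.strip())
    return match.groupdict() if match else None


def classify_identifier(identifier: str) -> str:
    """Guess the identifier format with simple heuristics."""
    value = identifier.strip()
    if value.startswith(("http://", "https://")):
        return "cellar_url" if "/resource/cellar/" in value else "unknown"
    if OJ_RE.fullmatch(value):
        return "oj_reference"
    # UUIDs are checked before CELEX so an upper-case UUID is not misread as one.
    if UUID_RE.fullmatch(value.removeprefix("cellar:")):
        return "cellar_id"
    if CELEX_RE.fullmatch(value):
        return "celex"
    return "unknown"


print(split_celex_id("32012L0029R(06)"))
print(classify_identifier("C/2026/00064"))
```

Checking the more specific formats (URL, OJ reference, UUID) before the looser CELEX pattern keeps the heuristics from overlapping.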

### CLI Usage

```bash
# Fetch a document
eurlxp fetch 32019R0947 -o regulation.html

# Parse and convert to CSV
eurlxp fetch 32019R0947 -f csv -o regulation.csv

# Get document info
eurlxp info 32019R0947

# Convert slash notation to CELEX ID
eurlxp celex 2019/947
# Output: 32019R0947
```
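
The slash-notation conversion can be illustrated in a few lines of plain Python. This is a sketch under stated assumptions rather than the package's code: `slash_to_celex` is a hypothetical name, the input is assumed to be `year/number` as in the example above, and the number is assumed to be zero-padded to four digits.

```python
def slash_to_celex(
    slash_notation: str, document_type: str = "R", sector_id: str = "3"
) -> str:
    """Build a CELEX ID from 'year/number' notation (hypothetical helper)."""
    year, number = slash_notation.strip().split("/")
    if not (len(year) == 4 and year.isdigit() and number.isdigit()):
        raise ValueError(f"expected 'YYYY/number', got {slash_notation!r}")
    # Sector + year + document type + zero-padded document number.
    return f"{sector_id}{year}{document_type}{number.zfill(4)}"


print(slash_to_celex("2019/947"))
print(slash_to_celex("2012/29", document_type="L"))
```

Slash notation does not say whether a document is a regulation, directive, or decision, so the `document_type` default is only a guess; `get_possible_celex_ids` and `guess_celex_ids_via_eurlex` exist for cases where the type is unknown.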

## API Reference

### Functions

| Function | Description |
|----------|-------------|
| `get_html(identifier, language="en")` | Fetch HTML by any identifier (auto-detects type, uses SPARQL fallback) |
| `get_html_by_celex_id(celex_id, language="en")` | Fetch HTML by CELEX ID |
| `get_html_by_cellar_id(cellar_id, language="en")` | Fetch HTML by CELLAR ID |
| `get_html_by_cellar_url(cellar_url)` | Fetch HTML by cellar URL |
| `fetch_documents(identifiers, language="en", on_error="skip")` | Batch fetch documents (uses SPARQL fallback) |
| `detect_id_type(identifier)` | Detect identifier type |
| `lookup_cellar_url(identifier)` | Look up cellar URL for any identifier via SPARQL |
| `parse_html(html)` | Parse HTML to DataFrame |
| `get_celex_id(slash_notation, document_type="R", sector_id="3")` | Convert slash notation to CELEX ID |
| `get_possible_celex_ids(slash_notation)` | Get all possible CELEX IDs |
| `parse_celex_id(celex_id)` | Parse CELEX ID into components |
| `is_valid_celex_id(celex_id)` | Check if string is valid CELEX format |
| `get_ids_and_urls_via_date(from_date, to_date, date_type)` | Get document refs for a single date or a date range (`to_date` and `date_type` are optional) |

### Classes

| Class | Description |
|-------|-------------|
| `EURLexClient` | Synchronous HTTP client with rate limiting and WAF detection |
| `AsyncEURLexClient` | Asynchronous HTTP client with rate limiting and WAF detection |
| `ClientConfig` | Configuration dataclass for client behavior |
| `WAFChallengeError` | Exception raised when bot detection is triggered |
| `SPARQLServiceError` | Exception raised when the SPARQL endpoint is unavailable |
| `DateType` | Date field used by `get_ids_and_urls_via_date` (`DOCUMENT`, `MODIFIED`, `CREATED`) |

### ClientConfig Options

| Option | Type | Default | Description |
|--------|------|---------|-------------|
| `timeout` | float | 30.0 | Request timeout in seconds |
| `headers` | dict | None | Custom headers to merge with defaults |
| `request_delay` | float | 0.0 | Delay between requests (rate limiting) |
| `use_browser_headers` | bool | True | Use browser-like headers to avoid detection |
| `referer` | str | None | Optional referer header |
| `raise_on_waf` | bool | True | Raise exception on WAF challenge |
| `sparql_fallback` | bool | True | Auto-fallback to SPARQL when WAF blocks requests |
| `max_retries` | int | 3 | Max retry attempts for transient HTTP errors (500/502/503/504) |
| `retry_delay` | float | 2.0 | Initial delay between retries (seconds) |
| `retry_backoff` | float | 2.0 | Exponential backoff multiplier |

### DataFrame Columns

| Column | Description |
|--------|-------------|
| `text` | The text content |
| `type` | Content type (text, link, etc.) |
| `document` | Document title |
| `article` | Article number |
| `article_subtitle` | Article subtitle |
| `paragraph` | Paragraph number |
| `group` | Group heading |
| `section` | Section heading |
| `ref` | Reference path (e.g., `["(1)", "(a)"]`) |

## Development

```bash
# Clone the repository
git clone https://github.com/morrieinmaas/eurlxp.git
cd eurlxp

# Install with dev dependencies
just dev

# Run tests
just test-unit

# Run all checks (lint + type check)
just check

# Format code
just format

# Run live tests with real documents (all ID formats)
just test-live

# See all available commands
just --list
```

## Publishing to PyPI

```bash
# Build the package
just build

# Publish to PyPI (requires PYPI_TOKEN)
just publish
```

## License

MIT License - see [LICENSE](LICENSE) for details.

## Credits

Inspired by [kevin91nl/eurlex](https://github.com/kevin91nl/eurlex).