Metadata-Version: 2.4
Name: universal-pii-firewall
Version: 0.1.0
Summary: Universal PII Firewall (UPF) Python SDK
Author: Aleksandr Kunavich
License-Expression: Apache-2.0
Project-URL: Homepage, https://github.com/akunavich/universal-pii-firewall
Project-URL: Repository, https://github.com/akunavich/universal-pii-firewall
Project-URL: Issues, https://github.com/akunavich/universal-pii-firewall/issues
Keywords: pii,privacy,redaction,security,ocr,gdpr
Classifier: Development Status :: 4 - Beta
Classifier: Intended Audience :: Developers
Classifier: Programming Language :: Python :: 3
Classifier: Programming Language :: Python :: 3.9
Classifier: Programming Language :: Python :: 3.10
Classifier: Programming Language :: Python :: 3.11
Classifier: Programming Language :: Python :: 3.12
Classifier: Topic :: Security
Classifier: Topic :: Software Development :: Libraries
Requires-Python: >=3.9
Description-Content-Type: text/markdown
License-File: LICENSE
Provides-Extra: image
Requires-Dist: Pillow>=10.0.0; extra == "image"
Requires-Dist: pytesseract>=0.3.10; extra == "image"
Provides-Extra: face
Requires-Dist: opencv-python-headless>=4.8.0; extra == "face"
Requires-Dist: numpy>=1.24.0; extra == "face"
Provides-Extra: ml
Requires-Dist: spacy>=3.7.0; extra == "ml"
Provides-Extra: dev
Requires-Dist: build>=1.2.0; extra == "dev"
Requires-Dist: twine>=5.0.0; extra == "dev"
Requires-Dist: pytest>=8.0.0; extra == "dev"
Requires-Dist: ruff>=0.6.0; extra == "dev"
Dynamic: license-file

# Universal PII Firewall (UPF)

Production-ready Python package for privacy-first sanitization of text and OCR/image inputs before LLM processing.

`universal-pii-firewall` is designed for high-recall detection and deterministic redaction with a lightweight core install.

## Why UPF

- Multi-layer detection pipeline (deterministic IDs, context, multilingual heuristics, optional ML).
- Text and image workflows through one API surface.
- Zero runtime dependencies for text sanitization; the image, face, and ML extras are opt-in.
- In-memory processing model for sensitive content handling.
- Configurable redaction and risk controls for enterprise integration.
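
To make the "deterministic IDs" layer concrete, here is a minimal standalone sketch of deterministic detection with label-style redaction. It is not UPF's implementation: the regular expressions, `luhn_valid`, and `redact_deterministic` are hypothetical names chosen for illustration, and only the `[REDACTED:...]` label format is taken from the Quick Start output.

```python
import re

# Illustrative patterns only (assumptions, not UPF's detectors).
EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")
# 13 to 19 digits, optionally separated by single spaces or hyphens.
CARD_RE = re.compile(r"\b\d(?:[ -]?\d){12,18}\b")


def luhn_valid(number: str) -> bool:
    """Return True if the digits in `number` pass the Luhn checksum."""
    digits = [int(ch) for ch in number if ch.isdigit()]
    total = 0
    for i, d in enumerate(reversed(digits)):
        if i % 2 == 1:  # double every second digit, counting from the right
            d *= 2
            if d > 9:
                d -= 9
        total += d
    return total % 10 == 0


def redact_deterministic(text: str) -> str:
    """Replace Luhn-valid card numbers and email addresses with labels."""

    def card_sub(match: re.Match) -> str:
        candidate = match.group()
        # Only redact digit runs that pass the checksum, to limit false positives.
        return "[REDACTED:CREDIT_CARD]" if luhn_valid(candidate) else candidate

    text = CARD_RE.sub(card_sub, text)
    return EMAIL_RE.sub("[REDACTED:EMAIL]", text)


print(redact_deterministic("paid with 4111-1111-1111-1111 and emailed alice@example.com"))
# -> paid with [REDACTED:CREDIT_CARD] and emailed [REDACTED:EMAIL]
```

Validating the checksum before redacting is one way a deterministic layer can avoid masking arbitrary long digit runs; names and context-dependent entities need the other layers listed above.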

## Install

Core package:

```bash
pip install universal-pii-firewall
```

Optional extras:

```bash
# OCR/image text extraction + image redaction
pip install "universal-pii-firewall[image]"

# Optional face and signature blur for the image pipeline (OpenCV)
pip install "universal-pii-firewall[face]"

# Optional ML NER layer
pip install "universal-pii-firewall[ml]"

# Dev and release tooling
pip install "universal-pii-firewall[dev]"
```

## Quick Start

### Text sanitization

```python
from upf import sanitize_text

text = "Alice Smith paid with 4111-1111-1111-1111 and emailed alice@example.com"
print(sanitize_text(text))
# -> [REDACTED:NAME] paid with [REDACTED:CREDIT_CARD] and emailed [REDACTED:EMAIL]
```

For a long-form, realistic before/after example (with risk score and detected entities),
run `examples/text_example.py --mode detailed`.

### OCR text sanitization

```python
from upf import sanitize_image

ocr_text = "John Doe IBAN DE89370400440532013000"
print(sanitize_image(ocr_text))
```
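
The IBAN in the example above carries its own checksum, which is what makes identifiers like this suitable for deterministic detection. The sketch below shows the standard mod-97 IBAN check in plain Python. It is an illustration under assumptions, not UPF's detector: `IBAN_RE`, `iban_checksum_ok`, `redact_ibans`, and the `[REDACTED:IBAN]` label are hypothetical, and per-country length rules are omitted.

```python
import re

# Country code, two check digits, then 11 to 30 alphanumeric characters.
# Assumption: a simplified pattern without per-country length validation.
IBAN_RE = re.compile(r"\b[A-Z]{2}\d{2}[A-Z0-9]{11,30}\b")


def iban_checksum_ok(iban: str) -> bool:
    """Apply the mod-97 check: move the first four characters to the end,
    map letters to numbers (A=10 ... Z=35), and test remainder == 1."""
    rearranged = iban[4:] + iban[:4]
    numeric = "".join(str(int(ch, 36)) for ch in rearranged)
    return int(numeric) % 97 == 1


def redact_ibans(text: str) -> str:
    """Replace checksum-valid IBANs with a label (hypothetical label name)."""

    def sub(match: re.Match) -> str:
        candidate = match.group()
        return "[REDACTED:IBAN]" if iban_checksum_ok(candidate) else candidate

    return IBAN_RE.sub(sub, text)


print(redact_ibans("John Doe IBAN DE89370400440532013000"))
# -> John Doe IBAN [REDACTED:IBAN]
```

This sketch leaves the name untouched because it only covers the checksum-based part of detection.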

### Image bytes sanitization

```python
from upf import sanitize_image_bytes

with open("examples/inputs/1.png", "rb") as f:
    image_bytes = f.read()

result = sanitize_image_bytes(
    image_bytes,
    ocr_text="John Doe paid with 4111 1111 1111 1111 and email john@example.com",
)
print(result.sanitized_text)
print(result.risk_score, result.risk_level)
```

### Optional face blur

```python
from upf import UPFConfig, sanitize_image_bytes

cfg = UPFConfig(blur_faces=True, face_blur_strength=31)
with open("examples/inputs/2.png", "rb") as f:
    result = sanitize_image_bytes(
        f.read(),
        ocr_text="Alice Smith alice@example.com",
        config=cfg,
    )
```

### Optional signature blur

Detects handwritten signature regions via contour heuristics and blurs them.
Requires the `face` extra (OpenCV).

```python
from upf import UPFConfig, sanitize_image_bytes

cfg = UPFConfig(blur_signatures=True, signature_blur_strength=31)
with open("examples/inputs/1.png", "rb") as f:
    result = sanitize_image_bytes(
        f.read(),
        ocr_text="Signed by John Doe on 2026-03-06",
        config=cfg,
    )
```

Both face and signature blur can be combined:

```python
cfg = UPFConfig(
    blur_faces=True,
    face_blur_strength=51,
    blur_signatures=True,
    signature_blur_strength=31,
)
```

To enable both blurs when running `examples/image_example.py --mode detailed`, set the corresponding environment variables:

```bash
UPF_BLUR_FACES=true UPF_BLUR_SIGNATURES=true uv run python examples/image_example.py --mode detailed
```

If you omit `ocr_text`, the text has to be extracted from the image itself: install the `image` extra and make sure the Tesseract OCR binary is available on your system.

## Configuration Knobs

Key `UPFConfig` controls:

- Detection toggles: `redact_names`, `redact_emails`, `redact_phones`, `redact_secrets`, `redact_addresses`, `redact_urls`, `redact_numeric_ids`, `redact_national_ids`
- Detector layers: `use_ml_ner`, `use_multilingual`, `use_relationship_detector`
- Image behavior: `blur_faces`, `face_blur_strength`, `blur_signatures`, `signature_blur_strength`
- Redaction behavior: `redaction_mode` (`label`, `mask`, `partial`, `pseudonym`, `skeleton`)
- Risk policy: `risk_mode`, `low_threshold`, `high_threshold`, `block_high_risk`, `deterministic_floor_types`
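
As a rough mental model for the risk policy knobs, the following standalone sketch maps a score to a level using two thresholds and optionally blocks high-risk input. Everything in it is an assumption made for illustration: the 0 to 1 score scale, the default threshold values, the level names, the inclusive boundaries, and the `RiskPolicy` and `HighRiskBlocked` names. Check `UPFConfig` and `HighRiskBlockedError` in the package for the actual behavior.

```python
from dataclasses import dataclass


class HighRiskBlocked(Exception):
    """Hypothetical stand-in for illustration; not UPF's HighRiskBlockedError."""


@dataclass
class RiskPolicy:
    # Assumed defaults on an assumed 0.0-1.0 scale; not UPF's real values.
    low_threshold: float = 0.3
    high_threshold: float = 0.7
    block_high_risk: bool = False

    def classify(self, score: float) -> str:
        """Map a risk score to 'low', 'medium', or 'high'."""
        if score >= self.high_threshold:
            level = "high"
        elif score >= self.low_threshold:
            level = "medium"
        else:
            level = "low"
        # With blocking enabled, high-risk input raises instead of returning.
        if level == "high" and self.block_high_risk:
            raise HighRiskBlocked(f"score {score:.2f} >= {self.high_threshold:.2f}")
        return level


policy = RiskPolicy(block_high_risk=True)
print(policy.classify(0.1), policy.classify(0.5))
try:
    policy.classify(0.9)
except HighRiskBlocked as exc:
    print("blocked:", exc)
```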

## Public API Reference

Stable exported interface from `upf`:

- `sanitize_text`
- `sanitize_text_with_details`
- `sanitize_image`
- `sanitize_image_bytes`
- `sanitize_image_base64`
- `secure_llm_call`
- `SecureLLMResult`
- `UPFConfig`
- `RedactionMode`
- `PseudonymSession`
- `HighRiskBlockedError`
- `ImageSanitizeResult`
- `TextSanitizeResult`

## Benchmark Methodology and Results

Text benchmark metrics below are from the included dataset and scripts:

- Command: `uv run python benchmarks/run_tests.py`
- Command: `uv run python benchmarks/run_tests_strict.py`
- Interpreter: Python 3.11.14
- Measurement date: March 6, 2026
- Dataset size: 74 labeled text cases (`text`, `multilingual`, `edge_cases`)

Measured results:

| Metric | Value |
| --- | --- |
| Cases | 74 |
| Precision | 0.9733 |
| Recall | 1.0000 |
| Avg latency (ms) | 0.2495 |
| P95 latency (ms) | 0.3505 |
| Strict F1 | 0.9865 |

Language coverage in this dataset: EN, ES, PL, PT, PT-BR.

Image precision/recall benchmark is not published yet because labeled image sidecar cases are currently absent (`benchmarks/run_image_tests.py` reports 0 cases).
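
As a sanity check on the table, the reported Strict F1 is consistent with the harmonic mean of the listed precision and recall. The snippet below assumes that relationship (the strict script may apply its own matching criteria before computing the score); `f1_score` is a helper written for this check, not part of the benchmark scripts.

```python
def f1_score(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall; 0.0 when both are zero."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Values copied from the results table above.
print(round(f1_score(0.9733, 1.0), 4))
# -> 0.9865
```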

## Showcase Gallery

Sample assets are included under `examples/inputs/` and `examples/outputs/`.

### Case 1

| Input | Redacted | Results Panel |
| --- | --- | --- |
| ![Input 1](examples/inputs/1.png) | ![Redacted 1](examples/outputs/1_redacted.png) | ![Results 1](examples/outputs/1_results.png) |

### Case 2

| Input | Redacted | Results Panel |
| --- | --- | --- |
| ![Input 2](examples/inputs/2.png) | ![Redacted 2](examples/outputs/2_redacted.png) | ![Results 2](examples/outputs/2_results.png) |

### Case 3

| Input | Redacted | Results Panel |
| --- | --- | --- |
| ![Input 3](examples/inputs/3.png) | ![Redacted 3](examples/outputs/3_redacted.png) | ![Results 3](examples/outputs/3_results.png) |

## Running Local Examples

From repository root:

```bash
uv run python examples/text_example.py --mode quick
uv run python examples/text_example.py --mode detailed
uv run python examples/image_example.py --mode quick
uv run python examples/image_example.py --mode detailed
```

`image_example.py --mode detailed` requires the `image` extra and image inputs in `examples/inputs/`.

## Limitations

- Image benchmark precision/recall is not yet formalized due to missing labeled sidecar cases.
- OCR quality directly affects image-text extraction quality.
- The optional ML detector (`[ml]` extra, spaCy) depends on an external model and runtime being available.

## Release Notes

### v0.1.0

- First PyPI-ready package layout with flat repository structure.
- Stable public API exports via `upf/__init__.py`.
- Optional extras separated by capability (`image`, `face`, `ml`, `dev`).
- Included reproducible text benchmark dataset and scripts.

## License

Apache License 2.0.
