Metadata-Version: 2.4
Name: maskingengine
Version: 1.1.0
Summary: Local-first PII redaction for LLM integration - mask before AI processing, restore after. Uses default patterns + multilingual NER, no network calls.
Home-page: https://github.com/foofork/maskingengine
Author: MaskingEngine Team
Author-email: MaskingEngine Team <contact@maskingengine.dev>
License-Expression: MIT
Project-URL: Homepage, https://github.com/foofork/maskingengine
Project-URL: Documentation, https://github.com/foofork/maskingengine/blob/main/docs/README.md
Project-URL: Repository, https://github.com/foofork/maskingengine
Project-URL: Bug Tracker, https://github.com/foofork/maskingengine/issues
Keywords: pii,privacy,redaction,llm,openai,claude,gpt,langchain,local-first,multilingual,ner,regex,ai-pipelines
Classifier: Development Status :: 4 - Beta
Classifier: Intended Audience :: Developers
Classifier: Topic :: Software Development :: Libraries :: Python Modules
Classifier: Topic :: Security
Classifier: Topic :: Text Processing
Classifier: Programming Language :: Python :: 3
Classifier: Programming Language :: Python :: 3.8
Classifier: Programming Language :: Python :: 3.9
Classifier: Programming Language :: Python :: 3.10
Classifier: Programming Language :: Python :: 3.11
Classifier: Programming Language :: Python :: 3.12
Classifier: Operating System :: OS Independent
Requires-Python: >=3.8
Description-Content-Type: text/markdown
License-File: LICENSE
Requires-Dist: pyyaml>=6.0
Requires-Dist: transformers>=4.21.0
Requires-Dist: torch>=1.12.0
Requires-Dist: click>=8.0.0
Requires-Dist: fastapi>=0.68.0
Requires-Dist: uvicorn>=0.15.0
Requires-Dist: pydantic>=1.8.0
Requires-Dist: requests>=2.25.0
Requires-Dist: jsonschema>=4.0.0
Provides-Extra: dev
Requires-Dist: pytest>=6.0; extra == "dev"
Requires-Dist: pytest-cov>=2.0; extra == "dev"
Requires-Dist: black>=21.0; extra == "dev"
Requires-Dist: flake8>=3.8; extra == "dev"
Requires-Dist: mypy>=0.910; extra == "dev"
Requires-Dist: pre-commit>=2.15; extra == "dev"
Provides-Extra: api
Requires-Dist: fastapi>=0.68.0; extra == "api"
Requires-Dist: uvicorn>=0.15.0; extra == "api"
Provides-Extra: minimal
Requires-Dist: pyyaml>=6.0; extra == "minimal"
Requires-Dist: click>=8.0.0; extra == "minimal"
Dynamic: author
Dynamic: home-page
Dynamic: license-file
Dynamic: requires-python

# MaskingEngine

[![PyPI version](https://badge.fury.io/py/maskingengine.svg)](https://pypi.org/project/maskingengine/)
[![Python Support](https://img.shields.io/pypi/pyversions/maskingengine.svg)](https://pypi.org/project/maskingengine/)
[![License: MIT](https://img.shields.io/badge/License-MIT-yellow.svg)](https://opensource.org/licenses/MIT)

Privacy-first, blazing-fast PII redaction for AI pipelines.

MaskingEngine is a local-first, multilingual PII sanitizer built for AI applications, logs, and data workflows. It detects and masks emails, phone numbers, names, IDs, and more before text is sent to large language models or stored in logs.

**Input:**   `Contact John Smith at john@example.com or call 555-123-4567`  
**Output:**  `Contact John Smith at <<EMAIL_7A9B2C_1>> or call <<PHONE_4D8E1F_1>>`

## 🚀 Features

* 🧠 **Multilingual NER** — DistilBERT model for contextual PII detection in 100+ languages
* ⚡ **Regex-only mode** — No model loading, <50ms masking for structured PII
* 🧩 **YAML pattern packs** — Easily extend detection for your org or domain
* 📋 **Configuration profiles** — Pre-built configs for healthcare, finance, legal (v1.1.0+)
* 🔧 **Modular architecture** — Drop-in config validation, streaming support, model registry
* 💬 **Format-aware** — Preserves structure in JSON, HTML, plain text
* 🔐 **Fully local** — No network calls, no telemetry, production-ready
* 🔁 **Optional Rehydration** — Restore original PII when needed (most use cases don't need this)
* 🔧 **CLI, REST API, SDK** — Drop into LangChain, Python pipelines, or microservices

## 🛠 Installation

### From PyPI (Recommended)
```bash
pip install maskingengine
```

### Installation Options
```bash
# Default: full capabilities (default patterns + multilingual NER)
pip install maskingengine

# Minimal extra: pyyaml + click for regex-only use. Extras only add packages,
# and the base requirements already include transformers and torch.
pip install "maskingengine[minimal]"

# API extra: fastapi + uvicorn for the REST API server
pip install "maskingengine[api]"

# Development extra: testing and code quality tools
pip install "maskingengine[dev]"
```

### From Source
```bash
# Clone the repository
git clone https://github.com/foofork/maskingengine.git
cd maskingengine

# Install in development mode
pip install -e .
```

### Requirements
- Python 3.8+
- Dependencies are automatically installed with pip

### Quick Installation Test
```bash
# Test CLI
echo "Email: test@example.com" | maskingengine mask --stdin --regex-only

# Test Python SDK
python -c "from maskingengine import Sanitizer; print('✅ Installation successful!')"
```

## 🚀 Quick Start

```bash
# CLI (Regex-only mode)
echo "Email john@example.com or call 555-123-4567" | maskingengine mask --stdin --regex-only
```

```python
# Python usage
from maskingengine import Sanitizer

sanitizer = Sanitizer()
masked, mask_map = sanitizer.sanitize("Email john@example.com")
print(masked)
# => "Email <<EMAIL_7A9B2C_1>>"

# mask_map contains original values for optional restoration
# Most use cases just use 'masked' and discard 'mask_map'
```
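
To make the two return values concrete, here is a minimal standard-library sketch of the idea behind `masked` and `mask_map`. It is not MaskingEngine's implementation: the `EMAIL_RE` pattern, the `mask_emails` helper, the placeholder-to-original layout of the map, and the way the six-character token is derived are all assumptions chosen for illustration.

```python
import hashlib
import re

# Deliberately simple email pattern for illustration only.
EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")


def mask_emails(text):
    """Replace each email with a placeholder and record the original value."""
    mask_map = {}
    counter = 0

    def _replace(match):
        nonlocal counter
        counter += 1
        # Hypothetical token scheme: first 6 hex chars of a SHA-256 digest.
        digest = hashlib.sha256(match.group().encode("utf-8")).hexdigest()
        placeholder = f"<<EMAIL_{digest[:6].upper()}_{counter}>>"
        mask_map[placeholder] = match.group()
        return placeholder

    return EMAIL_RE.sub(_replace, text), mask_map


masked, mask_map = mask_emails("Email john@example.com")
print(masked)
print(mask_map)
```

If you only need permanent redaction, keep `masked` and drop `mask_map`; the map matters only when the original values have to be restored later.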

## 🔎 What It Detects

### Built-in (Regex-based)

| Type | Example | Global Support |
|------|---------|----------------|
| Email | `john@example.com` | ✅ Universal |
| Phone | `+1 555-123-4567` | ✅ US/EU/Intl |
| IP Address | `192.168.1.1` | ✅ IPv4/IPv6 |
| Credit Card | `4111-1111-1111-1111` | ✅ Luhn-validated |
| SSN | `123-45-6789` | 🇺🇸 US only |
| ID Numbers | `X1234567B, BSN, INSEE` | 🇪🇸 🇳🇱 🇫🇷 etc. |
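
The "Luhn-validated" entry refers to the standard Luhn checksum, which filters out digit strings that merely look like card numbers. The sketch below shows the checksum itself; the `luhn_valid` helper is a hypothetical name and is not taken from MaskingEngine's code.

```python
def luhn_valid(number):
    """Return True if the digits in `number` pass the Luhn checksum."""
    # Keep decimal digits only, so separators such as "-" or " " are ignored.
    digits = [int(ch) for ch in number if ch.isdecimal()]
    if len(digits) < 2:
        return False
    total = 0
    # Walk right to left, doubling every second digit.
    for index, digit in enumerate(reversed(digits)):
        if index % 2 == 1:
            digit *= 2
            if digit > 9:
                digit -= 9
        total += digit
    return total % 10 == 0


print(luhn_valid("4111-1111-1111-1111"))  # the example from the table
print(luhn_valid("4111-1111-1111-1112"))  # last digit changed
```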

### NER-based (DistilBERT model)

| Type | Example | Languages |
|------|---------|-----------|
| Email | `john@example.com` | Multilingual |
| Phone | `555-123-4567` | Multilingual |
| Social Numbers | `123-45-6789` | Multilingual |

*Note: NER model complements regex patterns and excels at contextual detection*

## 🧩 Pattern Packs

Define your own redaction rules using YAML:

```yaml
# patterns/custom.yaml
name: "custom"
description: "Enterprise-specific patterns"
version: "1.0.0"

patterns:
  - name: EMPLOYEE_ID
    description: "Employee ID numbers"
    tier: 1
    language: "universal"
    patterns:
      - '\bEMP\d{6}\b'
```

Then load:
```python
from maskingengine import Config, Sanitizer
config = Config(pattern_packs=["default", "custom"])
sanitizer = Sanitizer(config)
```
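
Before shipping a pack, it can help to check each pattern against known positive and negative samples. The snippet below does this for the `EMPLOYEE_ID` pattern using Python's `re` module. It assumes the engine applies patterns with Python-compatible regex semantics, and the sample strings are made up.

```python
import re

# Same expression as in patterns/custom.yaml above.
EMPLOYEE_ID = re.compile(r"\bEMP\d{6}\b")

# Map each sample text to the matches we expect from it.
samples = {
    "Badge EMP123456 was issued today": ["EMP123456"],
    "EMP12345 is one digit short": [],
    "XEMP123456 has a leading letter": [],
    "EMP1234567 has seven digits": [],
}

for text, expected in samples.items():
    found = EMPLOYEE_ID.findall(text)
    assert found == expected, (text, found)
print("all samples behave as expected")
```

The `\b` word boundaries are what reject the last two samples; without them the pattern would match inside longer identifiers.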

## 📄 Input Formats

```python
from maskingengine import Sanitizer

sanitizer = Sanitizer()

# JSON - structure preserved
result, mask_map = sanitizer.sanitize({"email": "jane@company.com"}, format="json")

# HTML - tags preserved
html = '<a href="mailto:john@example.com">Email</a>'
result, mask_map = sanitizer.sanitize(html, format="html")

# Plain text - auto-detected
text = "Contacta a María García en maria@empresa.es"
result, mask_map = sanitizer.sanitize(text)
```
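
"Structure preserved" for JSON can be pictured as masking only the string leaves while keys, nesting, and non-string values stay untouched. The following is a rough standard-library sketch of that idea, not the library's actual parser; `mask_leaves`, `redact_email`, and the fixed `<<EMAIL>>` placeholder are illustrative assumptions.

```python
import re

EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")


def mask_leaves(value, mask_text):
    """Apply `mask_text` to every string leaf; keys and other types are kept."""
    if isinstance(value, dict):
        return {key: mask_leaves(item, mask_text) for key, item in value.items()}
    if isinstance(value, list):
        return [mask_leaves(item, mask_text) for item in value]
    if isinstance(value, str):
        return mask_text(value)
    return value


def redact_email(text):
    # Fixed placeholder for brevity; MaskingEngine emits unique placeholders.
    return EMAIL_RE.sub("<<EMAIL>>", text)


record = {
    "user": {"email": "jane@company.com", "age": 41},
    "tags": ["vip", "cc: bob@company.com"],
}
masked_record = mask_leaves(record, redact_email)
print(masked_record)
```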

## ⚙️ Configuration Options

```python
from maskingengine import Config, Sanitizer

config = Config(
    regex_only=True,                    # Speed mode (no NER)
    pattern_packs=["default", "custom"], # Load specific pattern packs
    whitelist=["support@company.com"],   # Terms to exclude from masking
    min_confidence=0.9,                 # NER confidence threshold
    strict_validation=True              # Enable validation (Luhn check, etc.)
)
sanitizer = Sanitizer(config)
```

## 🖥 REST API

Start the API server:
```bash
python scripts/run_api.py
# API available at http://localhost:8000
# Interactive docs at http://localhost:8000/docs
```

Example usage:
```bash
curl -X POST http://localhost:8000/sanitize \
  -H "Content-Type: application/json" \
  -d '{
    "content": "Email john@example.com",
    "format": "text",
    "regex_only": true
  }'
```
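
The same call can be made from Python without extra dependencies. This sketch builds the request with `urllib.request` using only the URL and payload fields shown in the curl example. The `build_sanitize_request` helper is a hypothetical name, and because the response schema is not described here, the commented-out part simply prints whatever JSON comes back.

```python
import json
import urllib.request

API_URL = "http://localhost:8000/sanitize"


def build_sanitize_request(content, regex_only=True):
    """Build a POST request mirroring the curl example above."""
    payload = {"content": content, "format": "text", "regex_only": regex_only}
    return urllib.request.Request(
        API_URL,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


request = build_sanitize_request("Email john@example.com")

# With the API server running, send the request like this:
# with urllib.request.urlopen(request, timeout=10) as response:
#     print(json.loads(response.read()))
```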

## 💡 Framework Integration Examples

```python
# LangChain
from langchain.text_splitter import RecursiveCharacterTextSplitter
from maskingengine import Sanitizer

sanitizer = Sanitizer()


class PrivacyTextSplitter(RecursiveCharacterTextSplitter):
    """Text splitter that masks PII before splitting."""

    def split_text(self, text):
        masked, _ = sanitizer.sanitize(text)
        return super().split_text(masked)


# Pandas (reuses the sanitizer created above)
import pandas as pd

df = pd.DataFrame({"message": ["Email john@example.com"]})  # sample data
df["message"] = df["message"].apply(lambda x: sanitizer.sanitize(str(x))[0])
```

## 🧪 Performance Modes

| Mode | Speed | Accuracy | Use Case |
|------|-------|----------|----------|
| Regex-only | <50ms | High for structured PII | Logs, structured data |
| NER + Regex | <200ms* | Highest | Unstructured text, contextual |
| Custom patterns | <100ms | Domain-specific | Enterprise rules |
| Streaming (v1.1.0+) | Efficient for large files | Same as base mode | Large documents, batch processing |

*Note: First NER run includes ~8s model loading time. Subsequent runs are <200ms.*
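
Latency depends on hardware and input size, so it is worth measuring on your own data. Below is a small timing harness built on the standard library; `median_latency_ms` is a hypothetical helper, and `str.upper` stands in for the function under test so the snippet runs without the package installed.

```python
import statistics
import time


def median_latency_ms(func, text, runs=20):
    """Return the median wall-clock time of `func(text)` in milliseconds."""
    func(text)  # warm-up call, so one-time setup such as model loading is excluded
    samples = []
    for _ in range(runs):
        start = time.perf_counter()
        func(text)
        samples.append((time.perf_counter() - start) * 1000.0)
    return statistics.median(samples)


latency = median_latency_ms(str.upper, "Email john@example.com")
print(f"median latency: {latency:.3f} ms")
```

To time MaskingEngine itself, pass something like `lambda t: sanitizer.sanitize(t)` in place of `str.upper`.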

### Streaming Support (v1.1.0+)
```python
from maskingengine import Config
from maskingengine.pipeline import StreamingMaskingSession

config = Config(regex_only=True)

# Process large files efficiently
session = StreamingMaskingSession(config, chunk_size=4096)
for chunk in large_file_chunks:  # any iterable of text chunks
    masked_chunk = session.process_chunk(chunk)
    # Process masked chunk
```
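
The loop above iterates over `large_file_chunks`, which can be any iterable of text chunks. One possible way to produce it is a generator that reads a text stream in fixed-size pieces. `iter_chunks` is a hypothetical helper, not part of MaskingEngine.

```python
import io


def iter_chunks(stream, chunk_size=4096):
    """Yield successive text chunks of at most `chunk_size` characters."""
    while True:
        chunk = stream.read(chunk_size)
        if not chunk:
            break
        yield chunk


# Demo with an in-memory stream; use open(path, encoding="utf-8") for real files.
sample = io.StringIO("a" * 10000)
sizes = [len(chunk) for chunk in iter_chunks(sample)]
print(sizes)
```

A fixed-size split can cut a value such as an email address across two chunks. Whether the streaming session handles that case is not stated here, so check the documentation or split on line boundaries if it matters for your data.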

## 📦 CLI Usage

### Basic Masking
```bash
# Regex-only (fastest)
maskingengine mask input.txt --regex-only -o output.txt

# Using configuration profiles
maskingengine mask input.txt --profile healthcare-en -o output.txt

# Multiple pattern packs
maskingengine mask input.txt --pattern-packs default --pattern-packs healthcare -o output.txt

# From stdin
echo "Call 555-123-4567" | maskingengine mask --stdin --regex-only
```

### Getting Started (v1.1.0+)
```bash
# Interactive getting started guide
maskingengine getting-started

# Or jump right to discovery
maskingengine list-profiles      # See available profiles with recommendations
maskingengine test-sample "Patient ID: 123-45-6789" --profile healthcare-en
```

### Configuration & Discovery
```bash
# Validate configuration with profiles
maskingengine validate-config --profile healthcare-en
maskingengine validate-config config.yaml

# Test sample text with different profiles
maskingengine test-sample "Email: john@example.com" --regex-only

# Discover available resources
maskingengine list-models        # Available NER models
maskingengine list-packs         # Available pattern packs
maskingengine list-profiles      # Configuration profiles with usage guidance
```

### Configuration Profiles
- **minimal** - Regex-only mode for basic PII types
- **standard** - Balanced regex + NER detection  
- **healthcare-en** - HIPAA-focused patterns for healthcare
- **finance-en** - Financial PII patterns (SSN, credit cards)
- **high-security** - Maximum detection with strict validation

## 📚 Documentation

### Core Guides
* **[Workflow Guide](docs/workflows.md)** - Visual workflow diagrams and decision guide
* **[API Reference](docs/api.md)** - Complete REST API documentation
* **[Features Overview](docs/features.md)** - Comprehensive feature documentation  
* **[Usage Examples](docs/examples.md)** - Python, CLI, API, and framework examples
* **[Architecture Overview](docs/architecture.md)** - System design and components

### Customization & Advanced Usage
* **[Custom Pattern Packs](docs/patterns.md)** - Create organization-specific PII patterns
* **[Pattern Sourcing Guide](docs/sourcing.md)** - Guidelines for developing and maintaining pattern packs
* **[Security Best Practices](docs/security.md)** - Comprehensive security guidance and compliance recommendations
* **[Performance & Production](docs/architecture.md#performance-architecture)** - Scaling and deployment guidance

### Getting Started
* **[Quick Start](#-quick-start)** - Basic usage examples
* **[Installation](#-installation)** - Setup instructions
* **[CLI Usage](#-cli-usage)** - Command-line interface guide

## 🔁 Rehydration System

**Rehydration is completely optional** — most use cases only need sanitization for permanent PII removal (logs, analytics, training data).

For AI pipeline integration, MaskingEngine can restore original PII after LLM processing:

```python
from maskingengine import RehydrationPipeline, Sanitizer, RehydrationStorage

# Setup pipeline
sanitizer = Sanitizer()
storage = RehydrationStorage()
pipeline = RehydrationPipeline(sanitizer, storage)

# Step 1: Mask before sending to LLM
masked_content, storage_path = pipeline.sanitize_with_session(
    "Contact john@example.com about the project", 
    session_id="user_123"
)

# Step 2: Send masked_content to LLM
# (`llm` is a placeholder for your own LLM client)
llm_response = llm.process(masked_content)

# Step 3: Restore original PII in response
final_response = pipeline.rehydrate_with_session(llm_response, "user_123")
```
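
Conceptually, step 3 is a lookup from placeholder back to original value. The standard-library sketch below illustrates that step in isolation. It assumes the map is a plain dict keyed by placeholder and that placeholders follow the `<<TYPE_XXXXXX_N>>` shape seen in the examples; `rehydrate` and `PLACEHOLDER_RE` are illustrative names, not the library's API.

```python
import re

# Matches tokens shaped like <<EMAIL_7A9B2C_1>>.
PLACEHOLDER_RE = re.compile(r"<<[A-Z_]+_[0-9A-F]{6}_\d+>>")


def rehydrate(text, mask_map):
    """Swap known placeholders back to their original values."""
    # Placeholders missing from the map (for example, ones an LLM made up)
    # are left unchanged rather than guessed at.
    return PLACEHOLDER_RE.sub(lambda m: mask_map.get(m.group(), m.group()), text)


mask_map = {"<<EMAIL_7A9B2C_1>>": "john@example.com"}
reply = "I emailed <<EMAIL_7A9B2C_1>> and <<EMAIL_FFFFFF_2>>."
restored = rehydrate(reply, mask_map)
print(restored)
```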

### Common Workflows
* ✅ **Sanitize-only**: Logs, analytics, training data (no rehydration needed)
* 🔄 **Round-trip**: AI pipelines where you restore PII in responses

📖 **[Complete examples and patterns →](docs/examples.md#session-based-workflow)**

## 🤝 Contributing

1. Fork and clone
2. Add tests for new features
3. Submit a PR with a clear description

We welcome contributors from privacy, AI, and data tooling backgrounds.

## 🔐 License

MIT License. Fully open-source and local-first — no cloud APIs required.
