Metadata-Version: 2.4
Name: cloakllm
Version: 0.2.2
Summary: AI Compliance Middleware — PII protection, tamper-evident audit logs, and EU AI Act compliance for LLM gateways
Author-email: The CloakLLM Authors <cloakllm@gmail.com>
License: MIT
Project-URL: Homepage, https://github.com/cloakllm/CloakLLM-PY
Project-URL: Issues, https://github.com/cloakllm/CloakLLM-PY/issues
Project-URL: Repository, https://github.com/cloakllm/CloakLLM-PY
Project-URL: Changelog, https://github.com/cloakllm/CloakLLM-PY/blob/main/CHANGELOG.md
Keywords: llm,privacy,pii,compliance,eu-ai-act,litellm,audit,security
Classifier: Development Status :: 3 - Alpha
Classifier: Intended Audience :: Developers
Classifier: License :: OSI Approved :: MIT License
Classifier: Programming Language :: Python :: 3
Classifier: Topic :: Security
Classifier: Topic :: Software Development :: Libraries :: Python Modules
Requires-Python: >=3.10
Description-Content-Type: text/markdown
License-File: LICENSE
Requires-Dist: spacy>=3.7.0
Provides-Extra: litellm
Requires-Dist: litellm>=1.40.0; extra == "litellm"
Provides-Extra: dev
Requires-Dist: pytest>=7.0; extra == "dev"
Requires-Dist: pytest-asyncio>=0.21; extra == "dev"
Provides-Extra: all
Requires-Dist: opentelemetry-api>=1.20; extra == "all"
Requires-Dist: opentelemetry-sdk>=1.20; extra == "all"
Dynamic: license-file

# 🛡️ CloakLLM

**Cloak your prompts. Prove your compliance.**

Every prompt you send to an LLM provider is visible in plaintext — names, emails, SSNs, API keys, medical records. CloakLLM intercepts, cloaks, and audits every call.

```
┌──────────────┐    ┌─────────────────────┐     ┌──────────────┐
│   Your App   │───▶│     CLOAKLLM        │────▶│  Claude/GPT  │
│              │    │                     │     │  /Gemini     │
│  "Email      │    │  "Email [PERSON_0]  │     │              │
│   john@..."  │    │   [EMAIL_0]..."     │     │  Never sees  │
│              │◀───│                     │◀────│  real data   │
└──────────────┘    └─────────────────────┘     └──────────────┘
                          │
                    ┌──────────────┐
                    │  Hash-Chain  │
                    │  Audit Log   │
                    │  (EU AI Act) │
                    └──────────────┘
```

> **Also available for JavaScript/TypeScript:** `npm install cloakllm` (zero dependencies, OpenAI SDK integration). See [CloakLLM JS](https://github.com/cloakllm/CloakLLM-JS). | [Project Hub](https://github.com/cloakllm/CloakLLM)

## ⏰ Why Now?

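**Most EU AI Act obligations, including those for many high-risk AI systems, apply from August 2, 2026.** Article 12 requires high-risk systems to automatically record events (logs) over their lifetime, and a tamper-evident log that can be verified mathematically is a strong way to show those records are intact. Penalties under the Act reach up to **7% of global annual turnover** for the most serious violations.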

Your current logging (`logger.info()`) won't survive an audit. CloakLLM provides:

- 🔒 **PII Detection** — Names, emails, SSNs, API keys, IPs, credit cards, IBANs via NER + regex
- 🎭 **Context-Preserving Cloaking** — `John Smith` → `[PERSON_0]` (the LLM still understands the prompt)
- ⛓️ **Tamper-Evident Audit Chain** — Every event hash-linked. Any tampering breaks the chain.
- ⚡ **One-Line Middleware** — Drop-in protection for OpenAI SDK and LiteLLM (100+ providers)
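
To make the cloak/uncloak round trip concrete, here is a minimal regex-only sketch. It is not CloakLLM's implementation: the `cloak` and `uncloak` helpers and the two patterns are illustrative assumptions, and real detection also relies on NER.

```python
import re

# Illustrative patterns only (assumed); a real detector covers many more categories.
PATTERNS = {
    "EMAIL": re.compile(r"[\w.+-]+@[\w-]+(?:\.[\w-]+)+"),
    "SSN": re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),
}

def cloak(text):
    """Replace each match with a [CATEGORY_n] token; return the text and a token map."""
    token_map = {}  # token -> original value
    for category, pattern in PATTERNS.items():
        values = {}  # original value -> token, so a repeated value reuses its token

        def substitute(match, category=category, values=values):
            value = match.group(0)
            if value not in values:
                values[value] = f"[{category}_{len(values)}]"
                token_map[values[value]] = value
            return values[value]

        text = pattern.sub(substitute, text)
    return text, token_map

def uncloak(text, token_map):
    """Put the original values back wherever their tokens appear."""
    for token, value in token_map.items():
        text = text.replace(token, value)
    return text

cloaked, token_map = cloak("Send report to john@acme.com, SSN 123-45-6789")
print(cloaked)                                   # Send report to [EMAIL_0], SSN [SSN_0]
print(uncloak("I emailed [EMAIL_0].", token_map))  # I emailed john@acme.com.
```

Numbering tokens per category keeps the prompt readable for the model, and reusing one token for a repeated value preserves the fact that both mentions refer to the same thing.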

## 🚀 Quick Start

### Install

```bash
pip install cloakllm                  # standalone usage
pip install "cloakllm[litellm]"       # with LiteLLM integration (quotes keep zsh from globbing the brackets)
python -m spacy download en_core_web_sm
```

### Option A: With OpenAI SDK (one line)

```python
from cloakllm import enable_openai
from openai import OpenAI

client = OpenAI()
enable_openai(client)  # Done. All calls are now cloaked.

response = client.chat.completions.create(
    model="gpt-4o-mini",
    messages=[{"role": "user", "content": "Email john@acme.com about Project X"}]
)
# Provider never sees "john@acme.com" — only "[EMAIL_0]"
# Response is automatically uncloaked before you see it
```

### Option B: With LiteLLM (one line)

```python
import cloakllm
cloakllm.enable()  # Done. All LiteLLM calls are now cloaked.

import litellm
response = litellm.completion(
    model="anthropic/claude-sonnet-4-20250514",
    messages=[{"role": "user", "content": "Email john@acme.com about Project X"}]
)
# Provider never sees "john@acme.com" — only "[EMAIL_0]"
# Response is automatically uncloaked before you see it
```

### Option C: Standalone

```python
from cloakllm import Shield

shield = Shield()

# Cloak
cloaked, token_map = shield.sanitize(
    "Send report to john@acme.com, SSN 123-45-6789"
)
# cloaked: "Send report to [EMAIL_0], SSN [SSN_0]"

# ... send the cloaked prompt to any LLM; a placeholder reply stands in here ...
llm_response = "Report sent to [EMAIL_0]."

# Uncloak response
clean = shield.desanitize(llm_response, token_map)
```

### Redaction Mode (irreversible)

```python
from cloakllm import Shield, ShieldConfig

shield = Shield(ShieldConfig(mode="redact"))
redacted, _ = shield.sanitize("Email john@acme.com about Sarah Johnson")
# redacted: "Email [EMAIL_REDACTED] about [PERSON_REDACTED]"
# No token map stored — cannot be reversed
```

### Entity Details (compliance metadata)

```python
from cloakllm import Shield

shield = Shield()
sanitized, token_map = shield.sanitize("Email john@acme.com, SSN 123-45-6789")

# Per-entity metadata (no original text — PII-safe)
token_map.entity_details
# [
#   {"category": "EMAIL", "start": 6, "end": 19, "length": 13, "confidence": 0.95, "source": "regex", "token": "[EMAIL_0]"},
#   {"category": "SSN", "start": 25, "end": 36, "length": 11, "confidence": 0.95, "source": "regex", "token": "[SSN_0]"}
# ]

# Full report for dashboards
token_map.to_report()
# {"entity_count": 2, "categories": {...}, "tokens": [...], "mode": "tokenize", "entity_details": [...]}
```
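
Because `entity_details` carries no original text, it is safe to aggregate downstream. The sketch below is a hypothetical helper (`summarize` is not part of CloakLLM) that assumes the entries are plain dicts shaped like the ones shown above, and rolls them up for a dashboard.

```python
from collections import Counter

# Sample data copied from the README comment above.
entity_details = [
    {"category": "EMAIL", "start": 6, "end": 19, "length": 13,
     "confidence": 0.95, "source": "regex", "token": "[EMAIL_0]"},
    {"category": "SSN", "start": 25, "end": 36, "length": 11,
     "confidence": 0.95, "source": "regex", "token": "[SSN_0]"},
]

def summarize(details):
    """Count entities per category and per detection source (hypothetical helper)."""
    return {
        "entity_count": len(details),
        "by_category": dict(Counter(d["category"] for d in details)),
        "by_source": dict(Counter(d["source"] for d in details)),
        # Lowest confidence is a quick signal for entries worth a manual review.
        "min_confidence": min((d["confidence"] for d in details), default=None),
    }

print(summarize(entity_details))
```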

### Option D: CLI

```bash
# Scan text for sensitive data
python -m cloakllm scan "Email john@acme.com, SSN 123-45-6789"

# Verify audit chain integrity
python -m cloakllm verify ./cloakllm_audit/

# View audit statistics
python -m cloakllm stats ./cloakllm_audit/
```

## ⛓️ Tamper-Evident Audit Chain

Every cloaking event is recorded in a hash-chained append-only log:

```json
{
  "seq": 42,
  "event_id": "a1b2c3d4-...",
  "timestamp": "2026-02-27T14:30:00+00:00",
  "event_type": "sanitize",
  "model": "claude-sonnet-4-20250514",
  "entity_count": 3,
  "categories": {"PERSON": 1, "EMAIL": 1, "SSN": 1},
  "tokens_used": ["[PERSON_0]", "[EMAIL_0]", "[SSN_0]"],
  "prompt_hash": "sha256:9f86d0...",
  "sanitized_hash": "sha256:a3f2b1...",
  "entity_details": [
    {"category": "PERSON", "start": 0, "end": 10, "length": 10, "confidence": 0.85, "source": "spacy", "token": "[PERSON_0]"},
    {"category": "EMAIL", "start": 12, "end": 25, "length": 13, "confidence": 0.95, "source": "regex", "token": "[EMAIL_0]"},
    {"category": "SSN", "start": 27, "end": 38, "length": 11, "confidence": 0.95, "source": "regex", "token": "[SSN_0]"}
  ],
  "latency_ms": 4.2,
  "prev_hash": "sha256:7c4d2e...",
  "entry_hash": "sha256:b5e8f3..."
}
```

**Chain verification:**
```bash
python -m cloakllm verify ./cloakllm_audit/
# ✅ Audit chain integrity verified — no tampering detected.
```

If anyone modifies a single entry, every subsequent hash breaks:
```
Entry #40 ✅ → #41 ✅ → #42 ❌ TAMPERED → #43 ❌ BROKEN → ...
```

Verifiable records like this are how CloakLLM supports the logging obligations in EU AI Act Article 12.
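
The idea behind a hash chain can be sketched in a few lines of standard-library Python. This is an illustration of the general technique, not CloakLLM's on-disk format: the canonical JSON encoding, the all-zero `GENESIS` value, and the helper names are assumptions; only the `prev_hash` and `entry_hash` field names come from the example entry above.

```python
import hashlib
import json

GENESIS = "sha256:" + "0" * 64  # assumed prev_hash for the very first entry

def entry_hash(entry):
    """Hash every field except entry_hash itself, using canonical JSON."""
    body = {k: v for k, v in entry.items() if k != "entry_hash"}
    payload = json.dumps(body, sort_keys=True, separators=(",", ":")).encode("utf-8")
    return "sha256:" + hashlib.sha256(payload).hexdigest()

def append(chain, event):
    """Link a new event to the previous entry and seal it with its own hash."""
    prev = chain[-1]["entry_hash"] if chain else GENESIS
    entry = {**event, "seq": len(chain), "prev_hash": prev}
    entry["entry_hash"] = entry_hash(entry)
    chain.append(entry)

def verify(chain):
    """Return the index of the first bad entry, or None if the chain is intact."""
    prev = GENESIS
    for index, entry in enumerate(chain):
        if entry["prev_hash"] != prev or entry["entry_hash"] != entry_hash(entry):
            return index
        prev = entry["entry_hash"]
    return None

chain = []
for count in range(3):
    append(chain, {"event_type": "sanitize", "entity_count": count})

print(verify(chain))             # None
chain[1]["entity_count"] = 99    # tamper with one entry
print(verify(chain))             # 1
```

Recomputing the tampered entry's own hash does not hide the change: the next entry still stores the old value in `prev_hash`, so the break simply moves one position down the chain.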

## ⚙️ Configuration

```python
from cloakllm import Shield, ShieldConfig

shield = Shield(config=ShieldConfig(
    # Detection
    spacy_model="en_core_web_lg",       # Larger model = better accuracy
    detect_emails=True,
    detect_phones=True,
    detect_api_keys=True,
    custom_patterns=[                    # Your own regex patterns
        ("PROJECT_CODE", r"PRJ-\d{4}-\w+"),
        ("INTERNAL_ID", r"EMP-\d{6}"),
    ],

    # Audit
    log_dir="./compliance_audit",
    log_original_values=False,           # Never log original PII

    # Middleware
    skip_models=["ollama/", "local/"],   # Don't cloak local model calls
))
```
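
Before registering custom patterns, it can help to try them against sample text with Python's `re` module. The sketch below uses only the standard library and the two patterns from the configuration above; the sample sentence and the `find_custom` helper are made up for illustration, and how CloakLLM applies the patterns internally is not shown here.

```python
import re

custom_patterns = [
    ("PROJECT_CODE", r"PRJ-\d{4}-\w+"),
    ("INTERNAL_ID", r"EMP-\d{6}"),
]

def find_custom(text):
    """Return (category, match, start, end) tuples in order of appearance."""
    hits = []
    for category, pattern in custom_patterns:
        for match in re.finditer(pattern, text):
            hits.append((category, match.group(0), match.start(), match.end()))
    return sorted(hits, key=lambda hit: hit[2])

sample = "Ask EMP-004217 about PRJ-2024-ATLAS."
for hit in find_custom(sample):
    print(hit)
# ('INTERNAL_ID', 'EMP-004217', 4, 14)
# ('PROJECT_CODE', 'PRJ-2024-ATLAS', 21, 35)
```

Note that `EMP-\d{6}` also matches the first six digits of a longer number such as `EMP-0042178`; adding `\b` at the end of the pattern is one way to tighten it if that matters for your IDs.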

**LLM Detection (opt-in)** — uses a local Ollama instance to catch semantic PII (addresses, medical info, etc.):

```python
shield = Shield(config=ShieldConfig(
    llm_detection=True,                  # Enable LLM-based detection
    llm_model="llama3.2",               # Ollama model to use
    llm_ollama_url="http://localhost:11434",  # Ollama endpoint
    llm_timeout=10.0,                   # Timeout in seconds
    llm_confidence=0.85,                # Confidence score for LLM detections
))
```
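
For readers curious what LLM-based detection against a local Ollama server might look like, here is a rough sketch. It is not CloakLLM's code: the prompt, the expected JSON shape, and all three helper names are assumptions. The only external interface used is Ollama's `/api/generate` endpoint with `stream` disabled and `format` set to `json`.

```python
import json
import urllib.request

# Hypothetical instruction; a production prompt would need far more care.
PROMPT = (
    'List the personal data in the text below as JSON of the form '
    '{"entities": [{"category": "ADDRESS", "value": "..."}]}.\n\nText: '
)

def build_request(text, model="llama3.2", url="http://localhost:11434"):
    """Build a non-streaming POST request for Ollama's /api/generate endpoint."""
    body = {"model": model, "prompt": PROMPT + text, "stream": False, "format": "json"}
    return urllib.request.Request(
        url + "/api/generate",
        data=json.dumps(body).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )

def parse_entities(raw, text, confidence=0.85):
    """Keep well-formed entities whose value appears verbatim in the text."""
    try:
        items = json.loads(raw)["entities"]
    except (ValueError, KeyError, TypeError):
        return []  # malformed model output is treated as "nothing found"
    found = []
    for item in items if isinstance(items, list) else []:
        if not isinstance(item, dict):
            continue
        category, value = item.get("category"), item.get("value")
        if not (isinstance(category, str) and isinstance(value, str) and value):
            continue
        start = text.find(value)
        if start == -1:
            continue  # the model invented or paraphrased the value; drop it
        found.append({"category": category, "start": start, "end": start + len(value),
                      "confidence": confidence, "source": "llm"})
    return found

def detect(text, timeout=10.0):
    """Call a local Ollama server; raises URLError if none is running."""
    with urllib.request.urlopen(build_request(text), timeout=timeout) as resp:
        raw = json.loads(resp.read())["response"]
    return parse_entities(raw, text)
```

Discarding any value that does not occur verbatim in the input is a simple guard against hallucinated entities, at the cost of missing values the model reformatted.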

Environment variables:
```bash
CLOAKLLM_LOG_DIR=./audit
CLOAKLLM_SPACY_MODEL=en_core_web_sm
CLOAKLLM_OTEL_ENABLED=true
CLOAKLLM_LLM_DETECTION=true
CLOAKLLM_LLM_MODEL=llama3.2
CLOAKLLM_OLLAMA_URL=http://localhost:11434
```

## 🔍 What Gets Detected

| Category | Examples | Method |
|----------|----------|--------|
| `PERSON` | John Smith, Sarah Johnson | spaCy NER |
| `ORG` | Acme Corp, Google | spaCy NER |
| `GPE` | New York, Israel | spaCy NER |
| `EMAIL` | john@acme.com | Regex |
| `PHONE` | +1-555-0142, 050-123-4567 | Regex |
| `SSN` | 123-45-6789 | Regex |
| `CREDIT_CARD` | 4111111111111111 | Regex |
| `IP_ADDRESS` | 192.168.1.100 | Regex |
| `API_KEY` | sk-abc123..., AKIA... | Regex |
| `IBAN` | DE89370400440532013000 | Regex |
| `JWT` | eyJhbGciOi... | Regex |
| Custom | Your patterns | Regex |
| `ADDRESS` | 742 Evergreen Terrace | LLM (Local) |
| `DATE_OF_BIRTH` | 1990-01-15 | LLM (Local) |
| `MEDICAL` | diabetes mellitus | LLM (Local) |
| `FINANCIAL` | account 4521-XXX | LLM (Local) |
| `NATIONAL_ID` | TZ 12345678 | LLM (Local) |
| `BIOMETRIC` | fingerprint hash | LLM (Local) |
| `USERNAME` | @johndoe42 | LLM (Local) |
| `PASSWORD` | P@ssw0rd123 | LLM (Local) |
| `VEHICLE` | plate ABC-1234 | LLM (Local) |

## 🗺️ Roadmap

- [x] PII detection (NER + regex)
- [x] Deterministic tokenization
- [x] Hash-chain audit logging
- [x] LiteLLM middleware integration
- [x] OpenAI SDK middleware integration
- [x] CLI tool
- [x] Redaction / scrubbing mode
- [x] Field-level PII metadata (entity_details)
- [x] Local LLM detection (opt-in, via Ollama)
- [ ] OpenTelemetry span emission (with auto-redaction)
- [ ] RFC 3161 trusted timestamping
- [ ] Signed audit snapshots
- [ ] MCP security gateway (tool validation, permission enforcement)
- [ ] Sensitivity-based routing (PII → local model, general → cloud)
- [ ] Admin dashboard
- [ ] EU AI Act conformity report generator

## 📜 License

MIT

## 🤝 Contributing

PRs welcome. Highest-impact areas:
1. **Non-English NER** — Hebrew, Arabic, Chinese PII detection
2. **De-tokenization accuracy** — handling LLM paraphrasing
3. **OpenTelemetry integration** — GenAI semantic conventions
4. **MCP security** — tool validation middleware

---

**Built for the EU AI Act deadline. Ships before the auditors do.**
