Metadata-Version: 2.4
Name: nerguard
Version: 1.1.0
Summary: Entropy-gated hybrid NER for privacy-compliant PII detection
Project-URL: Homepage, https://github.com/exdsgift/NerGuard
Project-URL: HuggingFace Model, https://huggingface.co/exdsgift/NerGuard-0.3B
Project-URL: Bug Tracker, https://github.com/exdsgift/NerGuard/issues
Author: Gabriele Durante
License: MIT
License-File: LICENSE
Keywords: gdpr,llm,named-entity-recognition,ner,nlp,pii,privacy,rag,redaction,transformers
Classifier: Development Status :: 4 - Beta
Classifier: Intended Audience :: Developers
Classifier: Intended Audience :: Science/Research
Classifier: License :: OSI Approved :: MIT License
Classifier: Programming Language :: Python :: 3
Classifier: Programming Language :: Python :: 3.11
Classifier: Programming Language :: Python :: 3.12
Classifier: Topic :: Scientific/Engineering :: Artificial Intelligence
Classifier: Topic :: Security
Classifier: Topic :: Text Processing :: Linguistic
Requires-Python: >=3.11
Requires-Dist: numpy>=1.26
Requires-Dist: ollama>=0.1
Requires-Dist: openai>=1.12
Requires-Dist: python-dotenv>=1.0
Requires-Dist: torch>=2.0
Requires-Dist: transformers>=4.40
Provides-Extra: all
Requires-Dist: accelerate>=0.27; extra == 'all'
Requires-Dist: datasets>=2.18; extra == 'all'
Requires-Dist: debugpy>=1.8; extra == 'all'
Requires-Dist: evaluate>=0.4; extra == 'all'
Requires-Dist: gliner>=0.2; extra == 'all'
Requires-Dist: ipython>=8.0; extra == 'all'
Requires-Dist: jupyter>=1.0; extra == 'all'
Requires-Dist: langchain-core>=0.2; extra == 'all'
Requires-Dist: matplotlib>=3.8; extra == 'all'
Requires-Dist: onnxruntime>=1.17; extra == 'all'
Requires-Dist: optimum[onnxruntime]>=1.17; extra == 'all'
Requires-Dist: pandas>=2.2; extra == 'all'
Requires-Dist: presidio-analyzer>=2.2; extra == 'all'
Requires-Dist: psutil>=7.2.1; extra == 'all'
Requires-Dist: pydantic>=2.0; extra == 'all'
Requires-Dist: pytest>=8.0; extra == 'all'
Requires-Dist: scikit-learn>=1.4; extra == 'all'
Requires-Dist: scipy>=1.12; extra == 'all'
Requires-Dist: seaborn>=0.13; extra == 'all'
Requires-Dist: seqeval>=1.2; extra == 'all'
Requires-Dist: spacy>=3.7; extra == 'all'
Requires-Dist: tqdm>=4.66; extra == 'all'
Requires-Dist: wandb>=0.16; extra == 'all'
Provides-Extra: benchmark
Requires-Dist: datasets>=2.18; extra == 'benchmark'
Requires-Dist: evaluate>=0.4; extra == 'benchmark'
Requires-Dist: gliner>=0.2; extra == 'benchmark'
Requires-Dist: matplotlib>=3.8; extra == 'benchmark'
Requires-Dist: pandas>=2.2; extra == 'benchmark'
Requires-Dist: presidio-analyzer>=2.2; extra == 'benchmark'
Requires-Dist: psutil>=7.2.1; extra == 'benchmark'
Requires-Dist: scikit-learn>=1.4; extra == 'benchmark'
Requires-Dist: scipy>=1.12; extra == 'benchmark'
Requires-Dist: seaborn>=0.13; extra == 'benchmark'
Requires-Dist: seqeval>=1.2; extra == 'benchmark'
Requires-Dist: spacy>=3.7; extra == 'benchmark'
Requires-Dist: tqdm>=4.66; extra == 'benchmark'
Requires-Dist: wandb>=0.16; extra == 'benchmark'
Provides-Extra: dev
Requires-Dist: debugpy>=1.8; extra == 'dev'
Requires-Dist: ipython>=8.0; extra == 'dev'
Requires-Dist: jupyter>=1.0; extra == 'dev'
Requires-Dist: pytest>=8.0; extra == 'dev'
Provides-Extra: langchain
Requires-Dist: langchain-core>=0.2; extra == 'langchain'
Requires-Dist: pydantic>=2.0; extra == 'langchain'
Provides-Extra: quantization
Requires-Dist: accelerate>=0.27; extra == 'quantization'
Requires-Dist: onnxruntime>=1.17; extra == 'quantization'
Requires-Dist: optimum[onnxruntime]>=1.17; extra == 'quantization'
Description-Content-Type: text/markdown

<div align="center">
  <h1>NerGuard</h1>
  <p><strong>Entropy-Gated Hybrid NER for Privacy-Compliant PII Detection</strong></p>
  <a href="https://www.python.org/"><img src="https://img.shields.io/badge/Python-3.11+-3776AB?style=flat&logo=python&logoColor=white" alt="Python"></a>
  <a href="https://pytorch.org/"><img src="https://img.shields.io/badge/PyTorch-2.0+-EE4C2C?style=flat&logo=pytorch&logoColor=white" alt="PyTorch"></a>
  <a href="https://huggingface.co/"><img src="https://img.shields.io/badge/HuggingFace-Transformers-FFD21E?style=flat&logo=huggingface&logoColor=black" alt="HuggingFace"></a>
  <a href="https://ollama.com/"><img src="https://img.shields.io/badge/Ollama-local%20inference-black?style=flat&logo=ollama&logoColor=white" alt="Ollama"></a>
  <a href="https://github.com/astral-sh/uv"><img src="https://img.shields.io/badge/uv-package%20manager-DE5FE9?style=flat&logo=astral&logoColor=white" alt="uv"></a>
  <img src="https://img.shields.io/badge/License-MIT-yellow?style=flat" alt="MIT License">
  <br><br>
  <a href="https://huggingface.co/exdsgift/NerGuard-0.3B">🤗 Model on HuggingFace</a>
  &nbsp;·&nbsp;
  <a href="https://pypi.org/project/nerguard/">📦 PyPI: nerguard</a>
  &nbsp;·&nbsp;
  <a href="https://colab.research.google.com/github/exdsgift/NerGuard/blob/main/scripts/NerGuard_Demo.ipynb"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open in Colab"></a>
  <br><br>
</div>

NerGuard is a pre-ingestion privacy layer for RAG pipelines: it detects and redacts PII before documents are indexed, keeping sensitive data out of vector databases and LLM context windows. A multilingual mDeBERTa-v3 base model makes fast, high-confidence predictions, and only the spans it is uncertain about (typically fewer than 3% of tokens) are routed to an LLM (OpenAI or local Ollama) for correction. A three-stage regex layer handles structured PII (credit cards, SSNs, IBANs) with deterministic validation. The result is a hybrid pipeline that matches or exceeds larger models on PII recall while remaining GDPR-auditable: every prediction carries its source, confidence score, and routing decision.

<div align="center">
  <img src="https://github.com/user-attachments/assets/2a250234-d7c8-4378-bc06-fd66705ea400" width="800" alt="NerGuard demo">
</div>

## Install

```bash
pip install nerguard
```

The NER model (~300 MB) downloads automatically from HuggingFace on first use.

## Quick start

```python
from nerguard import Redactor

ng = Redactor(
    model_path=None,        # str  — local path or HuggingFace Hub ID for the NER model
    llm_routing=False,      # bool — enable entropy-gated LLM routing
    llm_source="openai",    # str  — "openai" or "ollama"
    llm_model="gpt-4o",     # str  — LLM model name
    api_key=None,           # str  — API key for OpenAI (or None to use OPENAI_API_KEY env var)
    typed=True,             # bool — typed placeholders ([NAME]) vs generic ([PII])
)
result = ng.redact("Hi, I'm John Smith. Email: john@acme.com")

print(result.text)
# "Hi, I'm [NAME] [NAME]. Email: [EMAIL]"

print(result.mapping)
# {"NAME_0": "John", "NAME_1": "Smith", "EMAIL_0": "john@acme.com"}

print(result.entities)
# [{"label": "GIVENNAME", "text": "John", "confidence": 0.998, "source": "base"}, ...]
```

**Batch:**

```python
texts = [
    "Hi, I'm John Smith. Email: john@acme.com",
    "Call me at +1-800-555-0199 or find me on LinkedIn.",
]

results = [ng.redact(t) for t in texts]  # model stays cached across calls

for r in results:
    print(r.text)
# "Hi, I'm [NAME] [NAME]. Email: [EMAIL]"
# "Call me at [PHONE] or find me on LinkedIn."

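# Collect all mappings into one dict. If two texts produce the same key
# (e.g. "NAME_0"), the later value overwrites the earlier one.
all_mappings = {k: v for r in results for k, v in r.mapping.items()}
# {"NAME_0": "John", "NAME_1": "Smith", "EMAIL_0": "john@acme.com", "PHONE_0": "+1-800-555-0199"}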
```
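
When several texts are redacted separately, their mappings may reuse the same keys. The sketch below keeps them apart by prefixing each key with a document index. It assumes placeholder indices restart on every `redact()` call, which this README does not state; `merge_mappings` and the `doc<N>:` prefix are hypothetical names, and plain dicts stand in for `r.mapping`.

```python
def merge_mappings(mappings):
    """Merge per-document mappings under keys such as 'doc0:NAME_0'."""
    merged = {}
    for doc_index, mapping in enumerate(mappings):
        for placeholder, original in mapping.items():
            # The prefix keeps identical placeholders from different texts apart.
            merged[f"doc{doc_index}:{placeholder}"] = original
    return merged


# Plain dicts standing in for r.mapping from two separate redact() calls.
per_doc = [
    {"NAME_0": "John", "EMAIL_0": "john@acme.com"},
    {"NAME_0": "Alice", "PHONE_0": "+1-800-555-0199"},
]
merged = merge_mappings(per_doc)
print(merged["doc0:NAME_0"], merged["doc1:NAME_0"])
# John Alice
```

With real results, the call would be `merge_mappings(r.mapping for r in results)`.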

## LLM routing

LLM routing improves recall on ambiguous spans (phone numbers, IDs, dates) by sending the base model's uncertain predictions to an LLM for re-evaluation.

```python
# Cloud — pass key explicitly or set OPENAI_API_KEY env var
ng = Redactor(llm_routing=True, llm_source="openai", llm_model="gpt-4o", api_key="sk-...")

# Local — no data leaves the machine (requires Ollama)
ng = Redactor(llm_routing=True, llm_source="ollama", llm_model="qwen2.5:7b")
```
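
To illustrate what "entropy-gated" can mean, here is a minimal sketch that computes the normalized Shannon entropy of each token's label distribution and selects the tokens above a threshold. It is not NerGuard's implementation: the function names, the normalization, and the `0.5` threshold are assumptions chosen for illustration.

```python
import math


def softmax(logits):
    """Convert raw scores to probabilities (shifted by the max for stability)."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def normalized_entropy(probs):
    """Shannon entropy divided by log(K), so the result lies in [0, 1]."""
    h = -sum(p * math.log(p) for p in probs if p > 0.0)
    return h / math.log(len(probs))


def tokens_to_route(token_logits, threshold=0.5):
    """Return indices of tokens whose label distribution is too uncertain."""
    return [
        i
        for i, logits in enumerate(token_logits)
        if normalized_entropy(softmax(logits)) > threshold
    ]


# Hypothetical per-token logits over three labels.
token_logits = [
    [8.0, 0.0, 0.0],  # confident: one label dominates
    [1.0, 0.9, 1.1],  # uncertain: nearly uniform
    [0.0, 7.5, 0.2],  # confident
]
print(tokens_to_route(token_logits))
# [1]
```

Only the second token would be sent to the LLM; the confident tokens keep the base model's label.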

## CLI / interactive REPL

```bash
nerguard                                         # interactive REPL
nerguard --file report.txt                       # redact a file
nerguard --llm --backend ollama --model qwen2.5:7b  # with local LLM
nerguard --format rag                            # RAG-optimised output
```

| REPL command | Description |
|---|---|
| `/mode [human\|rag\|json\|generic]` | Switch output format |
| `/llm` | Toggle LLM routing |
| `/backend [openai\|ollama]` | Switch LLM backend |
| `/model NAME` | Set LLM model |
| `/file PATH` | Redact a file |
| `/help` | Show all commands |

## Constructor parameters

```python
Redactor(
    model_path=None,        # str  — local path or HuggingFace Hub ID for the NER model
    llm_routing=False,      # bool — enable entropy-gated LLM routing
    llm_source="openai",    # str  — "openai" or "ollama"
    llm_model="gpt-4o",     # str  — LLM model name
    api_key=None,           # str  — API key for OpenAI (or None to use OPENAI_API_KEY env var)
    typed=True,             # bool — typed placeholders ([NAME]) vs generic ([PII])
)
```

| Parameter | Type | Default | Description |
|---|---|---|---|
| `model_path` | `str` | `None` | Local filesystem path or HuggingFace Hub ID for the NER model. Leave as `None` to download `exdsgift/NerGuard-0.3B` automatically on first use. |
| `llm_routing` | `bool` | `False` | Enable entropy-gated LLM routing. When `True`, spans where the base model is uncertain are re-evaluated by the LLM. Improves recall on ambiguous tokens (phone numbers, dates, IDs) at the cost of extra latency. |
| `llm_source` | `str` | `"openai"` | LLM backend to use when `llm_routing=True`. `"openai"` calls the OpenAI API; `"ollama"` runs inference locally via Ollama (no data leaves the machine). |
| `llm_model` | `str` | `"gpt-4o"` | Model name passed to the selected LLM backend. Examples: `"gpt-4o"`, `"gpt-4o-mini"` for OpenAI; `"qwen2.5:7b"`, `"llama3.1:8b"` for Ollama. Only used when `llm_routing=True`. |
| `api_key` | `str` | `None` | API key for the OpenAI backend. If `None`, falls back to the `OPENAI_API_KEY` environment variable. Ignored when `llm_source="ollama"`. |
| `typed` | `bool` | `True` | Controls placeholder style. `True` → typed placeholders such as `[NAME]`, `[EMAIL]`, `[PHONE]` (preserves semantic context for downstream LLMs). `False` → every entity becomes `[PII]` regardless of type (maximum compression, no semantic signal). |
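
One way to combine these parameters is to pick the routing backend from the environment. The sketch below is a hypothetical helper, not part of NerGuard: `routing_kwargs` and its `prefer_local` flag are assumptions, and it only builds a dict of the documented constructor arguments.

```python
import os


def routing_kwargs(env=None, prefer_local=True):
    """Build keyword arguments for Redactor(...) from the environment."""
    env = os.environ if env is None else env
    if prefer_local:
        # Local Ollama backend: no text is sent to a remote API.
        return {"llm_routing": True, "llm_source": "ollama", "llm_model": "qwen2.5:7b"}
    if env.get("OPENAI_API_KEY"):
        # api_key is left unset so the documented OPENAI_API_KEY fallback applies.
        return {"llm_routing": True, "llm_source": "openai", "llm_model": "gpt-4o-mini"}
    # No local preference and no key: run the base model alone.
    return {"llm_routing": False}


print(routing_kwargs(env={}, prefer_local=False))
# {'llm_routing': False}
```

The result can be unpacked into the constructor, for example `ng = Redactor(**routing_kwargs())`.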

## RedactResult fields

`ng.redact(text)` returns a `RedactResult` dataclass with three fields:

| Field | Type | Description |
|---|---|---|
| `text` | `str` | Redacted text with placeholders replacing PII spans. |
| `entities` | `list[dict]` | One dict per detected entity, with keys: `label` (entity type), `text` (original value), `start`/`end` (char offsets), `confidence` (0–1), `source` (`"base"` or `"llm"`). |
| `mapping` | `dict[str, str]` | Maps each placeholder instance to its original value, keyed as `"<LABEL>_<index>"` (e.g. `"NAME_0"`, `"EMAIL_0"`). Useful for auditing or selective de-redaction. |

```python
result = ng.redact("Hi, I'm John Smith. Email: john@acme.com")

result.text
# "Hi, I'm [NAME] [NAME]. Email: [EMAIL]"

result.mapping
# {"NAME_0": "John", "NAME_1": "Smith", "EMAIL_0": "john@acme.com"}

result.entities
# [
#   {"label": "GIVENNAME", "text": "John",          "start": 8,  "end": 12, "confidence": 0.998, "source": "base"},
#   {"label": "SURNAME",   "text": "Smith",         "start": 13, "end": 18, "confidence": 0.995, "source": "base"},
#   {"label": "EMAIL",     "text": "john@acme.com", "start": 27, "end": 40, "confidence": 0.991, "source": "base"},
# ]
```
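
Because each entity carries character offsets, the original text can be re-redacted selectively, for example masking only some labels. The sketch below works on plain dicts shaped like `result.entities`; `redact_selected` is a hypothetical helper, and it assumes `end` is exclusive, as the offsets in the example above suggest.

```python
def redact_selected(original, entities, labels, placeholder="[PII]"):
    """Replace only entities whose label is in `labels`, using char offsets."""
    out = original
    # Work right to left so earlier offsets stay valid after each replacement.
    for ent in sorted(entities, key=lambda e: e["start"], reverse=True):
        if ent["label"] in labels:
            out = out[: ent["start"]] + placeholder + out[ent["end"] :]
    return out


original = "Hi, I'm John Smith. Email: john@acme.com"
entities = [
    {"label": "GIVENNAME", "text": "John", "start": 8, "end": 12},
    {"label": "SURNAME", "text": "Smith", "start": 13, "end": 18},
    {"label": "EMAIL", "text": "john@acme.com", "start": 27, "end": 40},
]
print(redact_selected(original, entities, {"EMAIL"}))
# Hi, I'm John Smith. Email: [PII]
```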

## Detected entity types

`GIVENNAME` · `SURNAME` · `EMAIL` · `TELEPHONENUM` · `SOCIALNUM` · `CREDITCARDNUMBER` · `IBAN` · `PASSPORTNUM` · `IDCARDNUM` · `DRIVERLICENSENUM` · `TAXNUM` · `STREET` · `BUILDINGNUM` · `CITY` · `ZIPCODE` · `DATE` · `TIME` · `AGE` · `SEX` · `TITLE`
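
For structured types such as `CREDITCARDNUMBER`, deterministic validation usually means a checksum rather than a model prediction. This README does not say which checks NerGuard runs; as an illustration, the sketch below implements the Luhn checksum, the standard check for payment card numbers. The function name and the minimum length of 12 digits are assumptions.

```python
def luhn_valid(number: str) -> bool:
    """Return True if the digits in `number` pass the Luhn checksum."""
    digits = [int(c) for c in number if c.isdigit()]
    if len(digits) < 12:  # assumed lower bound for card-like numbers
        return False
    total = 0
    # Starting from the rightmost digit, double every second digit.
    for i, d in enumerate(reversed(digits)):
        if i % 2 == 1:
            d *= 2
            if d > 9:
                d -= 9  # same as summing the two digits of the product
        total += d
    return total % 10 == 0


print(luhn_valid("4111 1111 1111 1111"))  # well-known test card number
# True
print(luhn_valid("4111 1111 1111 1112"))  # last digit changed
# False
```

A regex match that fails such a check can be dropped without consulting the model, which reduces false positives on arbitrary digit strings.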

## LangChain integration

NerGuard works as a LangChain **DocumentTransformer** and **Tool** out of the box.

```bash
pip install "nerguard[langchain]"
```

**Anonymize documents in a RAG pipeline:**

```python
from langchain_core.documents import Document
from nerguard.langchain import NerGuardAnonymizer

anonymizer = NerGuardAnonymizer()
docs = [Document(page_content="John Smith's email is john@acme.com")]
anon_docs = anonymizer.transform_documents(docs)

print(anon_docs[0].page_content)
# "[NAME] [NAME]'s email is [EMAIL]"

print(anon_docs[0].metadata["nerguard_mapping"])
# {"NAME_0": "John", "NAME_1": "Smith", "EMAIL_0": "john@acme.com"}
```

**As a Tool for LangChain agents:**

```python
from nerguard.langchain import NerGuardTool

tool = NerGuardTool()
result = tool.invoke({"text": "Call Alice at +33 6 12 34 56 78"})
# "Call [NAME] at [PHONE]"
```

## Links

- [Model on HuggingFace](https://huggingface.co/exdsgift/NerGuard-0.3B)
- [GitHub](https://github.com/exdsgift/NerGuard)

## License

MIT
