Metadata-Version: 2.4
Name: healthsecure
Version: 0.2.0a1
Summary: Runtime sensitive data exposure detection SDK
Author: Your Org
Keywords: security,data-leakage,privacy,llm,api,sensitive-data,compliance
Classifier: Development Status :: 3 - Alpha
Classifier: Intended Audience :: Developers
Classifier: Topic :: Security
Classifier: Topic :: Software Development :: Libraries :: Python Modules
Classifier: Programming Language :: Python :: 3
Classifier: Programming Language :: Python :: 3.9
Classifier: Programming Language :: Python :: 3.10
Classifier: Programming Language :: Python :: 3.11
Classifier: Programming Language :: Python :: 3.12
Classifier: Programming Language :: Python :: 3.13
Requires-Python: >=3.9
Description-Content-Type: text/markdown
Requires-Dist: requests>=2.31.0
Requires-Dist: pydantic>=2.0.0

# HealthSecure

**HealthSecure** is a runtime sensitive data exposure detection library for APIs and LLM outputs.

It helps engineering teams detect when **sensitive data** (medical, financial, credentials, personal identifiers) is unintentionally exposed in **production responses**, without sending raw data outside their systems.

HealthSecure analyzes responses **locally**, generates **privacy-preserving risk signals**, and sends only metadata to a centralized risk engine. Raw content is never stored or transmitted.

---

## Why HealthSecure

Modern applications increasingly rely on:

- APIs returning user data

- LLMs generating dynamic responses

- Logs and services emitting free-text output

Sensitive data leaks often happen **at runtime**, after the usual safeguards have already been applied:

- schema design

- code review

- static analysis

- compliance documentation

HealthSecure provides visibility where traditional tools fall short: **in live outputs, not in databases or schemas**.

---

## How It Works

### 1. Local Analysis (SDK)

- Inspects API or LLM responses in memory

- Detects sensitive data patterns using schema-agnostic heuristics

- Discards raw data immediately

### 2. Signal Generation

- Produces a minimal, irreversible signal describing:

  - detected data classes

  - identifiability

  - confidence

  - environment and region
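As a rough illustration of what a metadata-only signal might look like, the sketch below models one as a frozen dataclass. The names `SketchSignal` and `make_signal`, the field layout, and the toy email check are all assumptions made for this example; they are not the SDK's actual `SignalPayload` schema or detection logic.

```python
from dataclasses import dataclass, asdict
from typing import Tuple


@dataclass(frozen=True)
class SketchSignal:
    """Hypothetical metadata-only signal; not the SDK's real SignalPayload."""
    detected_data_classes: Tuple[str, ...]
    identifiers_present: bool
    confidence: float
    environment: str
    region: str


def make_signal(raw_text: str, environment: str, region: str) -> SketchSignal:
    # Toy detection, for illustration only: treat any "@" as an email hint.
    has_email = "@" in raw_text
    classes = ("personal_identifier",) if has_email else ()
    # Only derived metadata is kept; raw_text is never stored on the signal.
    return SketchSignal(
        detected_data_classes=classes,
        identifiers_present=has_email,
        confidence=0.9 if has_email else 0.0,
        environment=environment,
        region=region,
    )


signal = make_signal("contact: jane@example.com", "production", "EU")
print(asdict(signal))
```

The point of the design is that the signal describes what kind of data was seen, never the data itself, so the original value cannot be recovered from it.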

### 3. Risk Evaluation

- Applies deterministic risk policy

- Returns a clear risk level: `LOW`, `MEDIUM`, or `HIGH`

- Backend never processes raw content

---

## Key Principles

- No raw data ingestion

- Schema-agnostic detection

- Deterministic and explainable behavior

- SDK-first design

- Privacy-by-design

- Production-focused

---

## Supported Data Classes

- Medical data

- Financial data

- Credentials (API keys, tokens, secrets)

- Personal identifiers (email, phone)

## v2 Features

- **Confidence Bands**: Automatic mapping of confidence scores to LOW/MEDIUM/HIGH bands
- **Reason Codes**: Explainable detection reasons (MEDICAL_TERM, CREDENTIAL_PATTERN, EMAIL_PATTERN, etc.)
- **Signal Fingerprinting**: Privacy-safe hashing for deduplication
- **Policy Context**: Optional policy configuration (min_confidence, blocked_classes)
- **Execution Context**: Enhanced metadata (channel, mode, sdk_version)
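
To make confidence bands and fingerprinting more concrete, here is a minimal sketch of how they could work. The thresholds (0.8 and 0.5), the function names, and the choice of SHA-256 over canonical JSON are assumptions for illustration; the SDK's actual mapping and hashing scheme may differ.

```python
import hashlib
import json


def confidence_band(confidence: float) -> str:
    """Map a 0.0-1.0 score to a band. Thresholds are assumed, not the SDK's."""
    if confidence >= 0.8:
        return "HIGH"
    if confidence >= 0.5:
        return "MEDIUM"
    return "LOW"


def fingerprint(data_classes, reasons, environment: str, region: str) -> str:
    """Hash metadata only, in a canonical order, so equal signals match."""
    canonical = json.dumps(
        {
            "classes": sorted(data_classes),
            "reasons": sorted(reasons),
            "environment": environment,
            "region": region,
        },
        sort_keys=True,
        separators=(",", ":"),
    )
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


a = fingerprint(["medical", "credentials"], ["MEDICAL_TERM"], "production", "EU")
b = fingerprint(["credentials", "medical"], ["MEDICAL_TERM"], "production", "EU")
print(confidence_band(0.92), a == b)  # HIGH True
```

Because the hash input contains no raw content, two signals with the same metadata produce the same fingerprint, which is what allows deduplication.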

---

## Typical Use Cases

- Detect accidental leakage in LLM responses

- Monitor API responses for sensitive exposure

- Catch credential leaks in logs or service outputs

- Add a runtime safety layer without refactoring systems

---

## What HealthSecure Does NOT Do

- Does not store or transmit raw data

- Does not guarantee regulatory compliance

- Does not interpret business schemas

- Does not perform audits or certifications

- Does not replace encryption or access control

HealthSecure is a **risk detection layer**, not a compliance authority.

---

## Installation

```bash
pip install healthsecure==0.2.0a1
```

---

## Quick Example

```python
from healthsecure import extract_v2_json, build_v2_signal, HealthSecureClient

raw_response = {
    "message": "API token leaked: sk_live_ABC123",
    "status": "error"
}

# v2 extraction with reason codes and confidence bands
extract = extract_v2_json(raw_response)
# Returns: detected_data_classes, identifiers_present, confidence,
#          confidence_band, reasons

# Build v2 signal with enhanced metadata
signal = build_v2_signal(
    extract,
    source="api_response",
    region="EU",
    environment="production"
)
# Signal includes: fingerprint, confidence_band, reasons

client = HealthSecureClient(api_key="YOUR_API_KEY")
result = client.analyze_signal(signal)

print(f"Risk: {result.risk_level}")
print(f"Reasons: {signal.reasons}")
print(f"Confidence Band: {signal.confidence_band}")
```

---

## Killer Demo: Safe LLM Wrapper

**Instant relevance for AI teams** - Wrap your LLM calls to detect sensitive data leakage:

```python
from healthsecure import safe_llm_call

# Wrap any LLM response
response, risk = safe_llm_call(
    "Patient John Doe (john@hospital.com) was diagnosed with HIV"
)

if risk == "HIGH":
    print("⚠️  Sensitive data detected! Blocking response.")
    # Block or sanitize response
else:
    print("✅ Response safe to return")
```

**See full example**: `example-app/safe_llm_call.py`

---

## API Reference

#### `extract_v2_json(raw: Any) -> Dict[str, Any]`

v2 extraction for JSON payloads with reason codes. Returns:

- `detected_data_classes`: List of data classes
- `identifiers_present`: Boolean
- `confidence`: Float (0.0-1.0)
- `confidence_band`: "LOW", "MEDIUM", or "HIGH"
- `reasons`: Sorted list of reason codes

#### `extract_v2_text(text: str) -> Dict[str, Any]`

v2 extraction for text payloads. Same return format as `extract_v2_json`.

#### `build_v2_signal(extract_result: Dict, *, source: str, region: str, environment: str, ...) -> SignalPayload`

Build a v2 SignalPayload from extraction results. Automatically includes:

- Confidence band mapping
- Reason codes
- Privacy-safe fingerprint
- Optional policy and execution context

### Helper Functions

#### `safe_llm_call(llm_response: str, ...) -> Tuple[str, str]`

Wrapper for LLM responses that detects sensitive data leakage. Returns `(response, risk_level)` where risk_level is "LOW", "MEDIUM", or "HIGH".

#### `HealthSecureClient.analyze_signal(payload: SignalPayload) -> SignalResponse`

Sends the signal to the backend for risk assessment. Returns the risk level (`LOW`, `MEDIUM`, or `HIGH`) and an explanation. Accepts both v1 and v2 signals.

---

## Risk Assessment Logic

- **HIGH Risk**:
  - Credentials detected in production (inherently high-risk)
  - Medical/biometric/children data + identifiers in production
- **MEDIUM Risk**:
  - Identifiers present with other sensitive data
  - Sensitive data in staging/development
- **LOW Risk**:
  - No identifiers, no high-risk classes
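
One possible reading of these rules as a deterministic function is sketched below. It is not the backend's actual code: the function name `assess_risk`, the class labels (`"credentials"`, `"medical"`, and so on), and the handling of cases the rules leave open (they fall through to `LOW` here) are assumptions.

```python
from typing import Iterable

# Classes treated as high-risk when combined with identifiers (from the rules above).
HIGH_RISK_CLASSES = {"medical", "biometric", "children"}


def assess_risk(data_classes: Iterable[str], identifiers_present: bool,
                environment: str) -> str:
    """Illustrative reading of the rules above; labels are assumed."""
    classes = set(data_classes)
    in_production = environment == "production"

    # HIGH: credentials in production are inherently high-risk.
    if in_production and "credentials" in classes:
        return "HIGH"
    # HIGH: high-risk class plus identifiers in production.
    if in_production and identifiers_present and classes & HIGH_RISK_CLASSES:
        return "HIGH"
    # MEDIUM (assumption): "other sensitive data" means any detected class.
    if identifiers_present and classes:
        return "MEDIUM"
    # MEDIUM (assumption): any detected class outside production.
    if classes and not in_production:
        return "MEDIUM"
    # Everything else falls through to LOW in this sketch.
    return "LOW"


print(assess_risk(["medical"], True, "production"))
```

The same inputs always yield the same level, which is what makes the policy explainable: each result can be traced back to a single matching rule.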

---

## Detection Limits

### Heuristic-Based Detection

HealthSecure uses **keyword matching and pattern detection**, not machine learning or schema understanding:

- ✅ **High-confidence signals**: Detects obvious sensitive data (credit cards, API keys, medical terms)
- ⚠️ **Best-effort coverage**: May miss context-dependent or obfuscated data
- ⚠️ **False positives possible**: Generic terms may trigger false alarms
- ⚠️ **No semantic understanding**: Cannot distinguish between "patient" (medical) and "patient" (waiting)

### Supported Detection Patterns

- **Medical**: 9 keywords (hiv, cancer, diabetes, diagnosis, treatment, medical, patient, disease, diagnosed)
- **Financial**: 9 keywords (credit, debit, card, iban, account, payment, paid, transaction, billing)
- **Credentials**: 8 keywords (token, `api_key`, apikey, secret, password, auth, bearer, `sk_`)
- **Personal Identifiers**: Email addresses (regex), phone numbers (regex)
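
The sketch below shows roughly how keyword and regex detection of this kind can work, using the keyword lists above (with the credential entries written as `api_key` and `sk_`). The substring matching strategy, the `detect` function, and both regular expressions are assumptions for illustration, not the SDK's actual implementation.

```python
import re

# Keyword sets copied from the list above.
KEYWORDS = {
    "medical": {"hiv", "cancer", "diabetes", "diagnosis", "treatment",
                "medical", "patient", "disease", "diagnosed"},
    "financial": {"credit", "debit", "card", "iban", "account", "payment",
                  "paid", "transaction", "billing"},
    "credentials": {"token", "api_key", "apikey", "secret", "password",
                    "auth", "bearer", "sk_"},
}

# Simplified patterns (assumed); real-world email/phone matching is messier.
EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")
PHONE_RE = re.compile(r"\+?\d[\d\s().-]{7,}\d")


def detect(text: str) -> dict:
    """Case-insensitive substring match per class, plus identifier regexes."""
    lowered = text.lower()
    classes = sorted(
        name for name, words in KEYWORDS.items()
        if any(word in lowered for word in words)
    )
    identifiers = bool(EMAIL_RE.search(text) or PHONE_RE.search(text))
    return {"detected_data_classes": classes, "identifiers_present": identifiers}


print(detect("Patient John Doe (john@hospital.com) was diagnosed with HIV"))
print(detect("Please be patient while the page loads"))
```

The second call also reports `medical`, because the word "patient" matches regardless of meaning. This is the kind of false positive that keyword matching cannot avoid.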

### What Gets Detected

✅ Credit card numbers (pattern matching)  
✅ API keys (pattern matching: `sk_`, `pk_`, `api_`)  
✅ Email addresses (regex)  
✅ Medical keywords in text  
✅ Financial keywords in text  
✅ Credential keywords in text

### What May Be Missed

⚠️ Encrypted or encoded data  
⚠️ Context-dependent sensitive information  
⚠️ Industry-specific terminology not in keyword set  
⚠️ Structured data in non-standard formats
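
A small, self-contained illustration of the first limitation: once a string is Base64-encoded, a plain keyword check no longer sees it. The sample value and the single-keyword check are assumptions chosen for the example; they stand in for any keyword-based detector.

```python
import base64

KEYWORD = "password"  # one of the credential keywords listed earlier

plain = "password=hunter2"
encoded = base64.b64encode(plain.encode("utf-8")).decode("ascii")

print(KEYWORD in plain.lower())    # True: keyword visible in plain text
print(KEYWORD in encoded.lower())  # False: keyword hidden once encoded
```

If encoded payloads are common in your outputs, decoding them before analysis is one way to narrow this gap.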

---

## Stability & Contracts

**v2 contracts** - See [V2_IMPLEMENTATION.md](V2_IMPLEMENTATION.md) for technical details:

- Extended signal schema with optional v2 fields
- Risk policy unchanged (uses v1 fields for evaluation)
- Documented extractor limitations
- Backward compatible with v1 signals

**v2 features**:

- Confidence bands (LOW/MEDIUM/HIGH)
- Reason codes for explainability
- Privacy-safe signal fingerprinting
- Policy and execution context support
- Backend accepts v2 signals (v2 fields ignored in risk calculation)

## Migration Guide

**Upgrading from v1?** See [MIGRATION.md](MIGRATION.md) for a step-by-step guide.

- ✅ v1 continues to work unchanged
- ✅ No breaking changes
- ✅ Incremental adoption supported

---

## Target Users

- Backend engineers

- AI / LLM engineers

- Platform and security teams

- Startups and SaaS teams running production APIs

---

## Requirements

- Python >= 3.9
- requests >= 2.31.0
- pydantic >= 2.0.0

---

## Project Status

- Version: `0.2.0a1` (alpha)
- SDKs: Python (Node.js planned)
- Backend API: Stable v1 (accepts v2 signals)
- License: _(add license here)_

## What's New in v2

- ✨ Enhanced extraction with reason codes
- ✨ Confidence band mapping
- ✨ Privacy-safe signal fingerprinting
- ✨ Policy and execution context support
- ✅ Backward compatible with v1

---

## One-Line Summary

**HealthSecure detects sensitive data leaks in API and LLM outputs at runtime, without raw data ever leaving your systems.**

---

## Support

For issues, questions, or contributions, please [open an issue](link-to-issues).
