Metadata-Version: 2.4
Name: cloakdf
Version: 0.0.3
Summary: Pseudonymisation/anonymisation engine with encrypted mapping storage
Author-email: frangs <FGiordano-Silva@lambeth.gov.uk>
License: Apache-2.0
Project-URL: Repository, https://github.com/giordafrancis/cloakdf
Project-URL: Documentation, https://giordafrancis.github.io/cloakdf/
Keywords: nbdev
Classifier: Programming Language :: Python :: 3
Classifier: Programming Language :: Python :: 3 :: Only
Requires-Python: >=3.10
Description-Content-Type: text/markdown
License-File: LICENSE
Requires-Dist: cryptography
Requires-Dist: pandas
Dynamic: license-file

# cloakdf


<!-- WARNING: THIS FILE WAS AUTOGENERATED! DO NOT EDIT! -->

## Install

    pip install cloakdf

## How to use

We’ll demonstrate `cloakdf` using the [Northwind
dataset](https://github.com/neo4j-contrib/northwind-neo4j), a classic
sample database whose customers, orders, and employees tables share
keys.

``` python
#| eval: false
import httpx
from pathlib import Path

base = "https://raw.githubusercontent.com/neo4j-contrib/northwind-neo4j/refs/heads/master/data"
files = ["customers.csv", "orders.csv", "employees.csv"]
data = Path("../data")
data.mkdir(exist_ok=True)

for f in files:
    resp = httpx.get(f"{base}/{f}")
    resp.raise_for_status()  # fail loudly rather than saving an error page as CSV
    (data/f).write_bytes(resp.content)
```

### 1. Define your table configuration

Each table’s config specifies two kinds of columns:

- **`id` groups** (e.g. `id1`, `id2`): key columns that are shared
  across tables. Columns in the same group get **consistent
  pseudonyms**: `customerID` in both `customers` and `orders` maps to
  the same UUID, preserving referential integrity.

  **Note:** All `id` columns must be string dtype before encoding; cast
  with `.astype(str)` if needed.
- **`mask`**: sensitive columns to replace with opaque hex tokens.
  These are stored in a shared vault, so identical values (e.g. the
  same address appearing twice) get the same token.

``` python
tables = {
    'customers': {
        'id1': 'customerID',
        'mask': ['contactName', 'address', 'phone', 'companyName', 'fax', 'city']
    },
    'orders': {
        'id1': 'customerID',
        'id2': 'employeeID',
        'mask': ['shipName', 'shipAddress']
    },
    'employees': {
        'id2': 'employeeID',
        'mask': ['firstName', 'lastName', 'address', 'homePhone', 'birthDate', 'notes']
    },
}
```
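To make the idea of shared groups and a shared vault concrete, here is a minimal standard-library sketch of how consistent pseudonyms could be produced from a config like the one above. It is an illustration only, not `cloakdf`'s actual implementation: `pseudonymise_key`, `mask_value`, and `encode_row` are hypothetical helpers, the 12-character token length is assumed from the sample output, and the rows are made-up examples.

``` python
import secrets
import uuid
from collections import defaultdict

# Hypothetical stand-ins for the two mapping stores.
key_maps = defaultdict(dict)  # group name (e.g. 'id1') -> {original: UUID string}
vault = {}                    # original value -> hex token, shared by all tables

def pseudonymise_key(group, value):
    """Return the UUID for `value` in `group`, creating one on first sight."""
    mapping = key_maps[group]
    if value not in mapping:
        mapping[value] = str(uuid.uuid4())
    return mapping[value]

def mask_value(value):
    """Return the opaque token for `value`, reusing it for repeated values."""
    if value not in vault:
        vault[value] = secrets.token_hex(6)  # 12 hex characters (assumed length)
    return vault[value]

def encode_row(config, row):
    """Encode one row (a dict) according to a single table's config."""
    out = dict(row)
    for field, column in config.items():
        if field.startswith('id'):  # 'id1', 'id2', ... name the shared group
            out[column] = pseudonymise_key(field, row[column])
    for column in config.get('mask', []):
        out[column] = mask_value(row[column])
    return out

demo_tables = {
    'customers': {'id1': 'customerID', 'mask': ['contactName']},
    'orders': {'id1': 'customerID', 'id2': 'employeeID', 'mask': ['shipName']},
}
customer = encode_row(
    demo_tables['customers'],
    {'customerID': 'C001', 'contactName': 'Example Person', 'country': 'Germany'},
)
order = encode_row(
    demo_tables['orders'],
    {'customerID': 'C001', 'employeeID': '6', 'shipName': 'Example Shop'},
)

# Both tables use group 'id1', so the same customer gets the same pseudonym.
print(customer['customerID'] == order['customerID'])
```

Because the lookup is keyed by group name rather than by table, any table that declares `customerID` under `id1` resolves to the same UUID, which is what keeps joins working after encoding.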

### 2. Encode your DataFrames

``` python
import pandas as pd
from pathlib import Path

data = Path("../data")
# Assumes `CloakDF` has already been imported from the `cloakdf` package.
unk = CloakDF(tables)

originals, encoded = {}, {}
for name in tables:
    df = pd.read_csv(data/f"{name}.csv", on_bad_lines='skip')
    for k, v in tables[name].items():
        if k.startswith('id'): df[v] = df[v].astype(str)
    originals[name] = df
    encoded[name] = unk.encode(name, df)

encoded['customers'].head(3)
```

<div>

|  | customerID | companyName | contactName | contactTitle | address | city | region | postalCode | country | phone | fax |
|----|----|----|----|----|----|----|----|----|----|----|----|
| 0 | 9eb2c3d4-32d3-434c-863c-44d8ac4067d9 | 9288db060f26 | 5c1dfdfc8f62 | Sales Representative | 80e2676376a1 | c1f7bd6a22f0 | NaN | 12209 | Germany | 715e53fda07f | cd0efa7df34e |
| 1 | 82d494d5-a9e7-4690-b5ec-3104dad5b87d | 18db1aa91f8c | 3c4420341c1a | Owner | 135d66291b39 | 42d81129a68e | NaN | 05021 | Mexico | a5611db2c349 | 25fabc23caa4 |
| 2 | fa465d8d-ead0-4d03-ae91-8a65715c7fd1 | 9458f0d21f98 | 08b3087d3dda | Owner | a93f1f586e03 | 42d81129a68e | NaN | 05023 | Mexico | 2a648576b9be | NaN |

</div>

### 3. Save encrypted mappings

The mapping dictionaries (`key_maps` and `vault`) are the sensitive
artefacts: anyone who holds them can reverse the pseudonymisation.
Generate a Fernet key and encrypt them at rest. You can store the key in
a file or an environment variable:

> ⚠️ **Never commit your Fernet key or encrypted mappings to version
> control.** Make sure your `.gitignore` covers them (e.g. `*.key`,
> `*.enc`, `data/`).

``` python
import os

key = CloakDF.generate_key()
unk.save(data/"mappings.enc", key)

# Optionally expose the key to the current process via an environment variable
os.environ['CLOAKDF_KEY'] = key.decode()
print("✓ Mappings saved and key stored in env var")
```

    ✓ Mappings saved and key stored in env var
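
If you prefer to keep the key in a file, it helps to create that file with owner-only permissions. The sketch below uses only the standard library; `write_key_file` is a hypothetical helper, not part of `cloakdf`, and `demo_key` is a stand-in with the same shape as a Fernet key (32 random bytes, URL-safe base64 encoded) so the example runs without the library.

``` python
import base64
import os
import tempfile
from pathlib import Path

def write_key_file(path, key):
    """Write `key` (bytes) to `path`, readable and writable by the owner only."""
    path = Path(path)
    # O_EXCL refuses to overwrite an existing key file, which would make
    # mappings encrypted with the old key unreadable.
    fd = os.open(path, os.O_WRONLY | os.O_CREAT | os.O_EXCL, 0o600)
    with os.fdopen(fd, "wb") as fh:
        fh.write(key)
    return path

# Stand-in key: in real use this would be the value returned by the library.
demo_key = base64.urlsafe_b64encode(os.urandom(32))

with tempfile.TemporaryDirectory() as tmp:
    key_path = write_key_file(Path(tmp) / "cloakdf.key", demo_key)
    stored = key_path.read_bytes()
    mode = key_path.stat().st_mode & 0o777  # permission bits (meaningful on POSIX)
```

On Windows the permission bits are largely ignored, so treat the `0o600` mode as a POSIX-only safeguard.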

### 4. Load and decode

Load the encrypted mappings (using the key directly or from an
environment variable) and reverse the encoding:

``` python
# Load key from env var (or pass `key` directly)
loaded_key = CloakDF.load_key(env_var='CLOAKDF_KEY')
# loaded_key = CloakDF.load_key(path="path/to/keyfile")  # alternative: from file

unk2 = CloakDF.load(data/"mappings.enc", loaded_key, tables)

for name in tables:
    decoded = unk2.decode(name, encoded[name])
    pd.testing.assert_frame_equal(decoded, originals[name])
    print(f"✓ {name} round-trip OK")

decoded.head(3)  # employees: the last table decoded in the loop
```

    ✓ customers round-trip OK
    ✓ orders round-trip OK
    ✓ employees round-trip OK

<div>

|  | employeeID | lastName | firstName | title | titleOfCourtesy | birthDate | hireDate | address | city | region | postalCode | country | homePhone | extension | photo | notes | reportsTo | photoPath |
|----|----|----|----|----|----|----|----|----|----|----|----|----|----|----|----|----|----|----|
| 0 | 1 | Davolio | Nancy | Sales Representative | Ms. | 1948-12-08 00:00:00.000 | 1992-05-01 00:00:00.000 | 507 - 20th Ave. E. Apt. 2A | Seattle | WA | 98122 | USA | \(206\) 555-9857 | 5467 | 0x151C2F00020000000D000E0014002100FFFFFFFF4269... | Education includes a BA in psychology from Col... | 2.0 | http://accweb/emmployees/davolio.bmp |
| 1 | 2 | Fuller | Andrew | Vice President, Sales | Dr. | 1952-02-19 00:00:00.000 | 1992-08-14 00:00:00.000 | 908 W. Capital Way | Tacoma | WA | 98401 | USA | \(206\) 555-9482 | 3457 | 0x151C2F00020000000D000E0014002100FFFFFFFF4269... | Andrew received his BTS commercial in 1974 and... | NaN | http://accweb/emmployees/fuller.bmp |
| 2 | 3 | Leverling | Janet | Sales Representative | Ms. | 1963-08-30 00:00:00.000 | 1992-04-01 00:00:00.000 | 722 Moss Bay Blvd. | Kirkland | WA | 98033 | USA | \(206\) 555-3412 | 3355 | 0x151C2F00020000000D000E0014002100FFFFFFFF4269... | Janet has a BS degree in chemistry from Boston... | 2.0 | http://accweb/emmployees/leverling.bmp |

</div>
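
When the key comes from an environment variable, a quick sanity check before loading can turn a confusing decryption failure into a clear error. The following is a rough sketch using only the standard library; `key_from_env` and the `CLOAKDF_DEMO_KEY` variable are hypothetical and not part of `cloakdf`. It relies on the fact that a Fernet key is 32 bytes encoded as URL-safe base64.

``` python
import base64
import binascii
import os

def key_from_env(name):
    """Fetch a Fernet-style key from an environment variable and sanity-check it."""
    raw = os.environ.get(name)
    if raw is None:
        raise KeyError(f"environment variable {name!r} is not set")
    key = raw.encode()
    try:
        decoded = base64.urlsafe_b64decode(key)
    except (binascii.Error, ValueError) as exc:
        raise ValueError(f"{name} is not valid URL-safe base64") from exc
    if len(decoded) != 32:
        raise ValueError(f"{name} must decode to 32 bytes, got {len(decoded)}")
    return key

# Demo with a stand-in key of the right shape.
os.environ["CLOAKDF_DEMO_KEY"] = base64.urlsafe_b64encode(os.urandom(32)).decode()
demo = key_from_env("CLOAKDF_DEMO_KEY")
```

The check only confirms that the value has the right shape; it cannot tell whether it is the key that actually encrypted your mappings.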

### 5. Compare encoded vs decoded

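Here’s what the employees table looks like, encoded (pseudonymised)
versus decoded (original). `decoded` still holds the employees table
because it was the last one processed in the loop above: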

``` python
cols = ['employeeID', 'firstName', 'lastName', 'title', 'address', 'homePhone']
display(encoded['employees'][cols].head(3))
display(decoded[cols].head(3))
```
