Metadata-Version: 2.1
Name: detectpii
Version: 0.1.8
Summary: Detect PII columns in your database and warehouse
Home-page: https://github.com/thescalaguy/detectpii
Author: Fasih Khatib
Author-email: hellofasih.confound928@passinbox.com
Requires-Python: >=3.11,<4.0
Classifier: Programming Language :: Python :: 3
Classifier: Programming Language :: Python :: 3.11
Classifier: Programming Language :: Python :: 3.12
Provides-Extra: bigquery
Provides-Extra: hive
Provides-Extra: postgres
Provides-Extra: snowflake
Provides-Extra: trino
Provides-Extra: yugabyte
Requires-Dist: attrs (>=24.2.0,<25.0.0)
Requires-Dist: cattrs (>=23.2.3,<24.0.0)
Requires-Dist: google-cloud-bigquery (>=3.27.0,<4.0.0) ; extra == "bigquery"
Requires-Dist: psycopg2-binary (>=2.9.9,<3.0.0) ; extra == "postgres" or extra == "yugabyte"
Requires-Dist: pyhive[hive-pure-sasl] (>=0.7.0,<0.8.0) ; extra == "hive"
Requires-Dist: snowflake-sqlalchemy (>=1.6.1,<2.0.0) ; extra == "snowflake"
Requires-Dist: sqlalchemy (>=2.0.32,<3.0.0) ; extra == "hive" or extra == "postgres" or extra == "snowflake" or extra == "trino" or extra == "yugabyte"
Requires-Dist: tabulate (>=0.9.0,<0.10.0)
Requires-Dist: trino[sqlalchemy] (>=0.329.0,<0.330.0) ; extra == "trino"
Project-URL: Repository, https://github.com/thescalaguy/detectpii
Description-Content-Type: text/markdown

# 🔍 Detect PII

Detect PII is a library, inspired by [piicatcher](https://github.com/tokern/piicatcher) and [CommonRegex](https://github.com/madisonmay/CommonRegex), that detects table columns which may contain PII. It runs regex matches against column names and column values and flags the columns that match.
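
To make the idea concrete, here is a minimal, standalone sketch of regex matching on column names and on sampled values. It does not use detectpii's API: the patterns and the helper names (`match_column_name`, `match_values`) are hypothetical and are not the ones shipped with the library.

```python
import re

# Hypothetical patterns for illustration only; not the ones shipped with detectpii.
COLUMN_NAME_PATTERNS = {
    "email": re.compile(r"e[-_]?mail", re.IGNORECASE),
    "phone": re.compile(r"phone|mobile", re.IGNORECASE),
}
VALUE_PATTERNS = {
    "email": re.compile(r"^[\w.+-]+@[\w-]+(\.[\w-]+)+$"),
}


def match_column_name(name: str) -> list[str]:
    """Return the PII labels whose pattern matches the column name."""
    return [label for label, pattern in COLUMN_NAME_PATTERNS.items() if pattern.search(name)]


def match_values(values: list) -> list[str]:
    """Return the PII labels whose pattern matches at least one sampled value."""
    return [
        label
        for label, pattern in VALUE_PATTERNS.items()
        if any(pattern.search(str(value)) for value in values)
    ]


name_hits = match_column_name("user_email")
value_hits = match_values(["alice@example.com", "n/a"])
```

Name matching needs only table metadata, while value matching needs a sample of rows, which is roughly the split between a metadata scan and a data scan.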

## Usage

### Installation

Install the package with the [extra](https://setuptools.pypa.io/en/latest/userguide/dependency_management.html#optional-dependencies) for your database or warehouse, e.g.:

```shell
pip install detectpii[postgres]
```

See all supported [databases and data warehouses](#supported-databases--warehouses).

### Scan tables for PII

```python
from detectpii.catalog import PostgresCatalog
from detectpii.pipeline import PiiDetectionPipeline
from detectpii.scanner import DataScanner, MetadataScanner
from detectpii.util import print_columns

# -- Create a catalog to connect to a database / warehouse
pg_catalog = PostgresCatalog(
    host="localhost",
    user="postgres",
    password="my-secret-pw",
    database="postgres",
    port=5432,
    schema="public"
)

# -- Create a pipeline to detect PII in tables using an English dictionary
pipeline = PiiDetectionPipeline(
    catalog=pg_catalog,
    scanners=[
        MetadataScanner(),
        DataScanner(),
    ],
    times=1,
    percentage=20,
)

# -- Scan for PII columns.
pii_columns = pipeline.scan()

# -- Print them to the console
print_columns(pii_columns)
```

### Persist the pipeline

```python
import json
from detectpii.pipeline import pipeline_to_dict

# -- Create a pipeline
pipeline = ...

# -- Convert it into a dictionary
dictionary = pipeline_to_dict(pipeline)

# -- Print it
print(json.dumps(dictionary, indent=4))

# {
#     "catalog": {
#         "tables": [],
#         "resolver": {
#             "name": "PlaintextResolver",
#             "_type": "PlaintextResolver"
#         },
#         "user": "postgres",
#         "password": "my-secret-pw",
#         "host": "localhost",
#         "port": 5432,
#         "database": "postgres",
#         "schema": "public",
#         "_type": "PostgresCatalog"
#     },
#     "scanners": [
#         {
#             "_type": "MetadataScanner"
#         },
#         {
#             "_type": "DataScanner"
#         }
#     ],
#     "times": 1,
#     "percentage": 10
# }
```

### Load the pipeline

```python
from detectpii.pipeline import dict_to_pipeline

# -- Load the persisted pipeline as a dictionary
dictionary: dict = ...

# -- Convert it back to a pipeline object
pipeline = dict_to_pipeline(dictionary=dictionary)
```
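
Between persisting and loading, the dictionary has to be stored somewhere. The following is a minimal sketch of a JSON file round trip using only the standard library. The `dictionary` here is an abridged stand-in with the same shape as the example output above, and the helper names `save_pipeline_dict` and `load_pipeline_dict` are hypothetical, not part of detectpii.

```python
import json
import tempfile
from pathlib import Path

# Abridged stand-in for the output of pipeline_to_dict(pipeline).
dictionary = {
    "catalog": {"_type": "PostgresCatalog", "host": "localhost", "port": 5432},
    "scanners": [{"_type": "MetadataScanner"}, {"_type": "DataScanner"}],
    "times": 1,
    "percentage": 10,
}


def save_pipeline_dict(dictionary: dict, path: Path) -> None:
    """Write the pipeline dictionary to a JSON file."""
    path.write_text(json.dumps(dictionary, indent=4), encoding="utf-8")


def load_pipeline_dict(path: Path) -> dict:
    """Read a pipeline dictionary back from a JSON file."""
    return json.loads(path.read_text(encoding="utf-8"))


with tempfile.TemporaryDirectory() as tmp:
    path = Path(tmp) / "pipeline.json"
    save_pipeline_dict(dictionary, path)
    restored = load_pipeline_dict(path)
```

The restored dictionary can then be passed to `dict_to_pipeline`. Because the serialized catalog in the example output includes the password in plain text, a file like this should be handled as a secret.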

For more detailed documentation, please see the `docs` folder.

## Supported databases / warehouses

| Database / Warehouse | Package              |
|----------------------|----------------------|
| Hive                 | detectpii[hive]      |
| Postgres             | detectpii[postgres]  |
| Snowflake            | detectpii[snowflake] |
| Trino                | detectpii[trino]     |
| Yugabyte             | detectpii[yugabyte]  |
| BigQuery             | detectpii[bigquery]  |

## Available languages

The following languages are available for metadata detection:

| Language | Detector                         |
|----------|----------------------------------|
| English  | `EnglishColumnNameRegexDetector` |
| Spanish  | `SpanishColumnNameRegexDetector` |
