Metadata-Version: 2.1
Name: ner-anonymizer
Version: 0.1.2
Summary: Anonymizes pandas dataset and provides a hash dictionary to de-anonymize
Home-page: UNKNOWN
Author: Kelvin Tay
Author-email: btkelvin@gmail.com
License: MIT license
Platform: UNKNOWN
Requires-Python: >=3.6.0
Description-Content-Type: text/markdown
Requires-Dist: transformers (>=3.0.0)
Requires-Dist: torch (>=1.5.0)
Requires-Dist: torchvision (>=0.6.0)
Requires-Dist: pandas (>=1.0.0)

# NER Anonymizer
This package contains some developmental tools to anonymize a pandas dataframe.

NER Anonymizer contains a class `DataAnonymizer` which handles anonymization for both free text and categorical columns in a pandas dataframe:
* For free text columns, it uses a pretrained model from the [transformers](https://huggingface.co/transformers/) package to perform named entity recognition (NER) to pick up user specified entities such as location and person, generate a MD5 hash for the entity, replaces the entity with the hash, and stores the hash to entity in a dictionary
* For categorical columns, it simply generates a MD5 hash for every category, replaces the category with the hash, and stores the hash to category in a dictionary

The saved dictionary can then be used for de-anonymization and the original dataset is obtained. Referential integrity is preserved as the same hash will be generated for the same category / entity.

## Installation
Install the package with pip

    pip install ner-anonymizer

## Example Usage
The package uses the NER model [dslim/bert-base-NER](https://huggingface.co/dslim/bert-base-NER) by default. To anonymize a particular pandas dataframe, `df`, using a pretrained NER model:

    import ner_anonymizer

    # to anonymize
    anonymizer = ner_anonymizer.DataAnoynmizer(df)
    anonymized_df, hash_dictionary = anonymizer.anonymize(
        free_text_columns=["free_text_column_1", "free_text_column_2"],
        categorical_columns=["categorical_column_1"],
        pretrained_model_name="dslim/bert-base-NER",
        label_list=["O", "B-MISC", "I-MISC", "B-PER", "I-PER", "B-ORG", "I-ORG", "B-LOC", "I-LOC"], # list of labels used in the specified pretrained NER model
        labels_to_anonymize=["B-PER", "I-PER", "B-LOC", "I-LOC"] # list of labels to anonymize
    )

    # to de-anonymize
    de_anonymized_df = ner_anonymizer.de_anonymize_data(df, hash_dictionary)

You may specify for the argument `pretrained_model_name` any available pre-trained NER model from the [transformers](https://huggingface.co/transformers/) package in the links below (do note that you will need to specify the labels that the NER model uses, `label_list`, and from that list, the labels to be anonymized, `labels_to_anonymize`):
* https://huggingface.co/transformers/pretrained_models.html
* https://huggingface.co/models

You may also view an example notebook in the following directory `examples/example_usage.ipynb`.


