Metadata-Version: 2.1
Name: cdeid
Version: 0.1.2
Summary: A Customized De-identification framework
Home-page: https://github.com/CBDRH/cdeid
Author: Leibo Liu
Author-email: liuleibo@gmail.com
License: UNKNOWN
Keywords: DE-IDENTIFICATION,NLP
Platform: UNKNOWN
Classifier: Development Status :: 3 - Alpha
Classifier: Programming Language :: Python :: 3
Classifier: License :: OSI Approved :: Apache Software License
Classifier: Operating System :: OS Independent
Requires-Python: >=3.7
Description-Content-Type: text/markdown
Requires-Dist: spaCy (>=2.3.2)
Requires-Dist: stanza (>=1.1.1)
Requires-Dist: flair (==0.4.5)
Requires-Dist: mako (>=1.1.3)

![cDeid](resources/cdeid_logo_2020.png)

> A framework for training de-identification models that automatically remove protected health information (PHI) from free text.

cDeid is a customized de-identification framework. Users can train their own de-identification models on
data sets extracted from their own free-text corpora. cDeid is built on three popular NLP toolkits: [spaCy][spacy],
[Stanza][stanza] and [FLAIR][flair].

## Installation
This project requires Python 3.7 or later; please install it if you do not already have it. The FLAIR and Stanza
toolkits depend on PyTorch, which must be installed from [here][pytorch] before you install this project.
```sh
pip install cdeid
```
## Usage example
The examples below use the [pre-trained word2vec embeddings][word2vec] released for the CoNLL 2017 Shared Task. It is
important to specify the customized PHI types used in the corpus; otherwise, model training will fail with a runtime error.
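
Because a PHI type missing from the list leads to a runtime error during training, it can help to check the corpus labels first. The following is a minimal sketch, not part of cDeid: it assumes the training data holds one token and one BIO tag per line (for example `John B-PERSON`), which this README does not confirm, and the helper names `find_label_types` and `missing_phi_types` are hypothetical.

```python
from typing import Iterable, List, Set


def find_label_types(lines: Iterable[str]) -> Set[str]:
    """Collect entity types from lines of the assumed form 'token tag'."""
    types = set()
    for line in lines:
        parts = line.split()
        if len(parts) < 2:
            continue  # blank line (sentence boundary) or line without a tag
        tag = parts[-1]
        if tag == "O":
            continue  # token outside any PHI entity
        # Strip a leading "B-" or "I-" prefix if the tag has one.
        prefix, sep, name = tag.partition("-")
        types.add(name if sep and prefix in {"B", "I"} else tag)
    return types


def missing_phi_types(lines: Iterable[str], phi_types: List[str]) -> Set[str]:
    """Return the types found in the data that are absent from phi_types."""
    return find_label_types(lines) - set(phi_types)


# Hypothetical sample data in the assumed format.
sample = [
    "John B-PERSON",
    "Smith I-PERSON",
    "called O",
    "0412345678 B-PHONE",
    "",
    "Born O",
    "01/02/1980 B-DOB",
]
phi_types = ["PHONE", "PERSON"]
missing = missing_phi_types(sample, phi_types)
print(sorted(missing))  # ['DOB']
```

If the result is not empty, add the reported types to `phi_types` before training. If your data uses a different layout, adjust the parsing accordingly.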
### Using the Python API
#### Train the models
```python
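from cdeid.models.trainer import Trainer

# PHI types annotated in the training corpus
phi_types = ['PHONE', 'PERSON', 'ADDRESS', 'IDN', 'DOB']
# Arguments: data directory, workspace, PHI types, word vector file
trainer = Trainer("C:/data", "C:/workspace", phi_types, "C:/wordvec/English/en.vectors.xz")
trainer.train()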
```
#### De-identify a sample document
```python
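from cdeid.deidentifier.phi_deid import PHIDeid

# Arguments: workspace, de-identification output directory
deider = PHIDeid("C:/workspace", "C:/output")
# De-identify one raw text file, then output the result
doc = deider("C:/raw/example.txt")
deider.output(doc)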
```
### Using the command line
#### Train the models
```sh
python -m cdeid --command train --workspace C:/workspace --data_dir C:/data --phi_types PHONE PERSON ADDRESS IDN DOB --wordvec_file C:/wordvec/English/en.vectors.xz
```
#### De-identify a sample document
```sh
python -m cdeid --command deid --workspace C:/workspace --deid_output_dir C:/output --deid_file C:/raw/example.txt
```
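
The command above handles one file at a time. To process a whole folder, one option is to build the same command once per file from Python. This is a minimal sketch that uses only the flags shown above; the helper names `build_deid_command`, `build_batch_commands` and `run_batch` are hypothetical and not part of cDeid, and the `*.txt` pattern is an assumption about how the raw files are named.

```python
import subprocess
import sys
from pathlib import Path
from typing import List


def build_deid_command(workspace: str, output_dir: str, deid_file: str) -> List[str]:
    """Build the argument list for one 'deid' run, using the flags shown above."""
    return [
        sys.executable, "-m", "cdeid",
        "--command", "deid",
        "--workspace", workspace,
        "--deid_output_dir", output_dir,
        "--deid_file", deid_file,
    ]


def build_batch_commands(workspace: str, output_dir: str, raw_dir: str) -> List[List[str]]:
    """Build one command per .txt file in raw_dir, sorted by file name."""
    files = sorted(Path(raw_dir).glob("*.txt"))
    return [build_deid_command(workspace, output_dir, f.as_posix()) for f in files]


def run_batch(workspace: str, output_dir: str, raw_dir: str) -> None:
    """Run the commands one after another; stop at the first failure."""
    for cmd in build_batch_commands(workspace, output_dir, raw_dir):
        subprocess.run(cmd, check=True)
```

Calling `run_batch("C:/workspace", "C:/output", "C:/raw")` would then de-identify every `.txt` file in `C:/raw`. Each call starts a new Python process, so for large batches it may be faster to reuse a single `PHIDeid` object through the Python API instead.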

## Release History

* 0.1.1
    * The first release
* 0.1.2
    * Updated the README and setup

## Contributors

Leibo Liu - initial work - [leiboliu](https://github.com/leiboliu/)

## License
[Apache License, Version 2.0](/LICENSE)

<!-- Markdown link & img dfn's -->
[spacy]: https://spacy.io/
[stanza]: https://stanfordnlp.github.io/stanza/
[flair]: https://github.com/flairNLP/flair
[word2vec]: https://lindat.mff.cuni.cz/repository/xmlui/bitstream/handle/11234/1-1989/word-embeddings-conll17.tar?sequence=9&isAllowed=y
[pytorch]: https://pytorch.org/

