Metadata-Version: 2.1
Name: legal-pre-processing
Version: 0.3.2
Summary: Pre processing tools for documents with legal content.
Home-page: https://gitlab.com/israel.oliveira.softplan/legal-pre-processing.git
License: MIT
Keywords: legal,text,extract
Author: Daniel Henrique Arruda Boeing
Author-email: daniel.boeing@softplan.com.br
Requires-Python: >=3.7
Classifier: License :: OSI Approved :: MIT License
Classifier: Programming Language :: Python :: 3
Classifier: Programming Language :: Python :: 3.7
Classifier: Programming Language :: Python :: 3.8
Classifier: Programming Language :: Python :: 3.9
Requires-Dist: PyMuPDF (>=1.10)
Requires-Dist: Unidecode (>=1.0)
Requires-Dist: nltk (>=3.5)
Project-URL: Repository, https://gitlab.com/israel.oliveira.softplan/legal-pre-processing.git
Description-Content-Type: text/markdown

# Legal Pre-processing

Pre-processing tools for documents with legal content.
Authors: [Daniel Henrique Arruda Boeing](mailto:daniel.boeing@softplan.com.br) and [Israel Oliveira](mailto:israel.oliveira@softplan.com.br).

[![Python 3.7](https://img.shields.io/badge/Python-3.7-gree.svg)](https://www.python.org/downloads/release/python-370/)
[![Python 3.8](https://img.shields.io/badge/Python-3.8-gree.svg)](https://www.python.org/downloads/release/python-380/)
[![Python 3.9](https://img.shields.io/badge/Python-3.9-gree.svg)](https://www.python.org/downloads/release/python-390/)

## Usage:

### Download the *JSON* files that can be used as examples.

```bash
$ mkdir -p data_dicts && cd data_dicts

$ wget https://gitlab.com/israel.oliveira.softplan/legal-pre-processing/-/raw/master/data/LegalRegExPatterns.json

$ wget https://gitlab.com/israel.oliveira.softplan/legal-pre-processing/-/raw/master/data/LegalStopwords.json

$ wget https://gitlab.com/israel.oliveira.softplan/legal-pre-processing/-/raw/master/data/TesauroRevisado.json
```

### Load the helper class and load the dictionaries.

```python
>>> from legal_pre_processing.utils import LoadDicts
>>>
>>> dicts = LoadDicts('data_dicts/')
>>> dicts.List
['LegalRegExPatterns', 'TesauroRevisado', 'LegalStopwords']
```

### Load the class LegalPreprocess and instantiate it.

```python
>>> from legal_pre_processing.legal_pre_processing import LegalPreprocess
>>>
>>> model = LegalPreprocess(domain_stopwords=dicts.LegalStopwords, tesauro=dicts.TesauroRevisado, regex_pattern=dicts.LegalRegExPatterns)
```

### Load a PDF file with [PyMuPDF](https://pymupdf.readthedocs.io/en/latest/) (or other extractor) and do some tests:

```python
>>> import fitz
>>>
>>> doc = fitz.open('some_pdf_file_with_legal_content.pdf')
>>> page_number = 1  # 1-based number of the page to read (example value)
>>> page = doc[page_number-1].get_text()
>>> print(page)
"...Com a concordância das partes foi utilizada prova emprestada em relação aos depoimentos de algumas testemunhas de defesa (decisões de 28/10/2016,  07/11/2016, de 10/11/2016 e de 09/02/2017, nos eventos 114, 175 e 199, e depoimentos nos eventos 187, 200, 287 e 513)...."
>>> page_preprocess = model.ProcessText(page)
>>> print(page_preprocess)
"...concordancia utilizada PROVA_EMPRESTADA relacao depoimentos algumas testemunhas defesa decisoes eventos depoimentos eventos..."
```
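To make the transformation above easier to follow, here is a minimal standard-library sketch of the kind of steps it suggests: lowercasing, accent removal, joining thesaurus terms into single tokens, dropping numbers and dates, and removing stopwords. It is not the implementation of `LegalPreprocess`; the function names, the stopword set, and the thesaurus list below are hypothetical and chosen only for illustration.

```python
import re
import unicodedata


def strip_accents(text):
    # Decompose accented characters and drop the combining marks.
    decomposed = unicodedata.normalize("NFKD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


def toy_preprocess(text, stopwords, thesaurus_terms):
    # Hypothetical helper: assumes stopwords and thesaurus terms are
    # already lowercase and accent-free.
    text = strip_accents(text.lower())
    # Join each multi-word thesaurus term into one upper-case token,
    # longest terms first so shorter terms do not split longer ones.
    for term in sorted(thesaurus_terms, key=len, reverse=True):
        token = term.upper().replace(" ", "_")
        text = re.sub(r"\b" + re.escape(term) + r"\b", token, text)
    # Keep runs of letters and underscores; digits and punctuation are dropped.
    tokens = re.findall(r"[A-Za-z_]+", text)
    return " ".join(t for t in tokens if t not in stopwords)


# Illustrative inputs only; the real dictionaries come from the JSON files.
stopwords = {"com", "a", "das", "partes", "foi", "em", "aos", "de"}
thesaurus = ["prova emprestada"]
sample = ("Com a concordância das partes foi utilizada prova emprestada "
          "em relação aos depoimentos (decisões de 28/10/2016).")

result = toy_preprocess(sample, stopwords, thesaurus)
print(result)
```

In this sketch the thesaurus replacement runs before stopword removal, so that a stopword inside a multi-word legal term does not break the term apart.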

### Use the available heuristics:

```python
>>> from heuristics import Heuristics
>>> path_pdf = 'example-of-rotated-text-in-latex.pdf'
>>> h = Heuristics(path_pdf)
>>> h.set_all_heuristics()
>>> txt = h.Extract()
```

#### Class Heuristics, input parameters:
```python
pdf_path : str
    Path to the PDF file.
th_font : float, optional
    Threshold (between 0 and 1) for filtering outliers of font types.
    (default is 0.9)
th_size : float, optional
    Threshold (between 0 and 1) for filtering outliers of font sizes.
    (default is 0.9)
filter_font_by_cum : bool, optional
    Filters outliers by the accumulated sum, for font types.
    If False, filters by individual counting. (default is True)
filter_size_by_cum : bool, optional
    Filters outliers by the accumulated sum, for font sizes.
    If False, filters by individual counting. (default is True)
```
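
The two filtering modes described above can be read in more than one way. The sketch below shows one plausible reading, not the package's actual rule: in cumulative mode, the most frequent values are kept until their accumulated share reaches the threshold; in individual mode, a value is kept when its own share is at least `1 - threshold`. Both function names and the individual-mode rule are assumptions made for illustration.

```python
from collections import Counter


def keep_by_cumulative_share(values, threshold=0.9):
    # Assumed rule: walk values from most to least frequent and keep them
    # until the share already kept reaches the threshold.
    counts = Counter(values)
    total = sum(counts.values())
    kept, running = set(), 0
    for value, count in counts.most_common():
        if running / total >= threshold:
            break
        kept.add(value)
        running += count
    return kept


def keep_by_individual_share(values, threshold=0.9):
    # Assumed rule: keep a value when its own share is at least 1 - threshold.
    counts = Counter(values)
    total = sum(counts.values())
    return {v for v, c in counts.items() if c / total >= 1 - threshold}


# Made-up font sizes: body text, footnotes, and a rare large heading.
sizes = [12.0] * 80 + [10.0] * 15 + [30.0] * 5

print(keep_by_cumulative_share(sizes))
print(keep_by_individual_share(sizes))
```

With these made-up counts, both readings keep the two common sizes and treat the rare 30.0 size as an outlier.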

- **Remove duplicated phrases**:
```python
>>> h.set_filter_duplicated_phrases()
```

- **Keep only horizontal text**:
```python
>>> h.set_let_horinzontal_text()
```

- **Remove text set in rarely used font types**:
```python
>>> h.set_filter_outlier_font_types()
```

- **Remove text set in rarely used font sizes**:
```python
>>> h.set_filter_outlier_font_sizes()
```
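
As an illustration of the horizontal-text heuristic listed above, the sketch below filters lines by writing direction. It works on a plain dictionary shaped like the output of PyMuPDF's `page.get_text("dict")`, where each text line carries a `dir` pair (cosine and sine of the writing direction). The function name, the tolerance, and the sample data are hypothetical, and this is not necessarily how `Heuristics` implements the filter.

```python
def horizontal_text(page_dict, tolerance=1e-3):
    # Collect text from lines written left to right, i.e. dir close to (1, 0).
    kept = []
    for block in page_dict.get("blocks", []):
        # Image blocks carry no "lines" key, so they are skipped here.
        for line in block.get("lines", []):
            cos, sin = line["dir"]
            if abs(cos - 1.0) <= tolerance and abs(sin) <= tolerance:
                kept.append("".join(span["text"] for span in line["spans"]))
    return kept


# Hand-made sample in the same shape as page.get_text("dict").
sample_page = {
    "blocks": [
        {
            "type": 0,
            "lines": [
                {"dir": (1.0, 0.0), "spans": [{"text": "Body "}, {"text": "text"}]},
                {"dir": (0.0, -1.0), "spans": [{"text": "Rotated margin note"}]},
            ],
        },
        {"type": 1},  # image block
    ]
}

lines = horizontal_text(sample_page)
print(lines)
```

With a real document, `sample_page` would be replaced by `doc[0].get_text("dict")` on a file opened with `fitz.open`.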

### TODO:

- Update the README with a project image and shields (see `random-forest-mc`).
- Activate [LGTM](https://lgtm.com/) (see `random-forest-mc`).

## References:

- [PyMuPDF documentation](https://pymupdf.readthedocs.io/en/latest/index.html) (based on version `1.18.15`).
- [Legal Thesaurus](https://scon.stj.jus.br/SCON/thesaurus/) (*Vocabulário Jurídico*).

