Metadata-Version: 2.1
Name: pdfner
Version: 0.1.1
Summary: Information extraction and named-entity recognition for indexing PDFs
Home-page: https://github.com/hnluu8/pdfner
Author: Hung Luu
Author-email: hung.luu@outlook.com
License: MIT
Keywords: pdf, ner
Platform: UNKNOWN
Classifier: Development Status :: 4 - Beta
Classifier: Intended Audience :: Science/Research
Classifier: Programming Language :: Python
Classifier: Programming Language :: Python :: 3
Classifier: Programming Language :: Python :: 3.6
Classifier: Programming Language :: Python :: 3.7
Requires-Python: <3.8,>=3.6
Description-Content-Type: text/markdown; charset=UTF-8
Requires-Dist: dask[delayed] (==2.9.0)
Requires-Dist: simplejson (==3.17.0)
Requires-Dist: pdf2image (==1.4.2)
Requires-Dist: pymupdf (==1.16.8)
Requires-Dist: pikepdf (==1.1.0)
Requires-Dist: img2pdf (==0.3.3)
Requires-Dist: textacy (==0.6.2)
Requires-Dist: msgpack-python (==0.5.6)
Requires-Dist: pynlp (==0.4.2)
Provides-Extra: test
Requires-Dist: pytest (>=5.0.0) ; extra == 'test'
Requires-Dist: pytest-cov ; extra == 'test'

# pdfner
Information extraction and named entity recognition for indexing PDFs

## Install NLP tools
1. Download language-specific model data in spaCy
    ```
        $ python -m spacy download en
    ```
2. Download Stanford CoreNLP from https://stanfordnlp.github.io/CoreNLP/download.html and extract to {project root}/pdfner/tests/tools


## Install OCRmyPDF
https://ocrmypdf.readthedocs.io/en/latest/installation.html

## Installation
```commandline
pip install pdfner
```

## Usage
### Processing a PDF
```python
from typing import List
from pdfner import *

# Each page of the PDF is processed to an NerDocument.
processed_pdf: List[NerDocument] = process_pdf('scanned.pdf', entities_detector=SpacyDetectEntities())
print(f"Extracted text: {processed_pdf[0].text}")
print(f"Detected entities: {processed_pdf[0].entities}") 
```

### Indexing with Elasticsearch
```python
import simplejson as json
from elasticsearch import Elasticsearch
es = Elasticsearch()

# NerDocument implements for_json function for easy serialization with simplejson.
doc: NerDocument
for doc in processed_pdf:
    res = es.index(index='pdfner', id=doc.id, body=json.dumps(doc, for_json=True))
    print(res['result'])
```

### Indexing with Solr
```python
import pysolr
# Collection "gettingstarted" auto created by: solr -c -e schemaless
solr = pysolr.Solr('http://localhost:8983/solr/gettingstarted', always_commit=True)

# encode returns NerDocument object as dict which is required by pysolr 
solr.add([doc.encode() for doc in processed_pdf])
```

### API

#### process_pdf
A function that converts a scanned PDF to a text-based PDF and applies the NER detector object to the text to extract entities. Returns a list of NerDocument objects.

- **filepath: str** - path to PDF file
- **make_thumbnail: Optional[bool]=False** - whether to create a thumbnail PNG for the first page
- **cache_entities: Optional[bool]=False** - whether to cache entities to the local filesystem
- **parallelize_pages: Optional[bool]=True** - whether to process multiple pages in parallel
- **out_filepath: Optional[str]=None** - optional location of resulting processed PDF
- **entities_detector: AbstractDetectEntities** - named argument for NER detector object (SpacyDetectEntities, CoreNlpDetectEntities)
- **\*\*kwargs** - additional named arguments to attach to the returned NerDocument objects

#### AbstractDetectEntities
Roll your own NER detector by subclassing AbstractDetectEntities and overriding detect_entities.

- **detect_entities(text: str, \*\*kwargs)** - extract entities from input text and returns a list of NamedEntity objects

#### NerDocument
A class representing a single page of a processed PDF.

##### Attributes
- **id: str** - auto-generated random UUID 
- **text: str** - text extracted from PDF page
- **page_number: int** - PDF page number
- **entities: List[str]** - entities extracted from PDF text
- **processed_location: str** - location of processed PDF
- **original_location: str** - location of original PDF
- **redacted_location: str** - location of redacted PDF
- **thumbnail_location: str** - location of thumbnail PNG for first page of processed PDF
- **\*\*kwargs** - additional named arguments to store with object

##### Instance methods
- **encode()** - returns dict representation of object
- **for_json()** - for simplejson to serialize object to JSON

##### Class methods
- **decode(d: Dict)** - object_hook function for simplejson's loads function to deserialize JSON to a proper NerDocument object


