Metadata-Version: 2.1
Name: pypdf-lib
Version: 0.0.2
Summary: pypdf
Home-page: http://github.com/trisongz/pypdf-lib
Author: Tri Songz
Author-email: ts@growthengineai.com
License: UNKNOWN
Platform: UNKNOWN
Classifier: Development Status :: 3 - Alpha
Classifier: Environment :: Console
Classifier: Intended Audience :: Developers
Requires-Python: >=3.6
Description-Content-Type: text/markdown
Requires-Dist: file-io

# pypdf-lib

A (maybe) Better PDF Parsing for Python focused on textual extraction. WIP. 

This library is a Python Wrapper built around [PdfAct](https://github.com/ad-freiburg/pdfact), which is built using Java.

## Pre-requisites

- `Java`

```bash
# Linux
apt-get update && apt-get install -y default-jre # openjdk-8-jre-headless / openjdk-11-jdk / openjdk-11-jre-headless

# Mac
brew install java

# Windows
# idk
```

## Installation

```python
!pip install --upgrade git+https://github.com/trisongz/pypdf-lib.git
!pip install --upgrade pypdf-lib

```

## Usage

```python
from pypdf import PyPDF
from fileio import File

base_dir = '/content/output'
File.mkdirs(base_dir)

# Using a remap function expects the file extension to be mapped properly - i.e. if 'txt' is selected, .txt file extension should be returned.

def remap_fnames(fname):
    fname = File.base(fname).replace('- ', '').replace(' ', '_').strip().replace('.pdf', '.json')
    return File.join(base_dir, fname)

converter = PyPDF(input_dir='/content/inputs', output_dir='/content/output', units=['paragraphs', 'blocks'], visualize=True)

# remap_funct is optional. 
for res in converter.extract(remap_funct=remap_fnames):
    print(res)
    # > /content/output/your_json_file_1.json

converter.extracted
'''
{'/content/inputs/input_1.pdf': '/content/output/your_json_file_1.json',
 '/content/inputs/input_2.pdf': '/content/output/your_json_file_2.json',
 '/content/inputs/input_3.pdf': '/content/output/your_json_file_3.json',
 'params': {'exclude_roles': None,
  'format': 'json',
  'include_roles': ['title',
   'body',
   'appendix',
   'keywords',
   'heading',
   'general_terms',
   'toc',
   'caption',
   'table',
   'other',
   'categories',
   'keywords',
   'page_header'],
  'units': ['paragraphs', 'blocks'],
  'visualize': True,
  'with_control_characters': False}}
'''
```

