Metadata-Version: 2.1
Name: pdf4py
Version: 0.0.1
Summary: A PDF parser written in Python3 with no external dependencies.
Home-page: https://github.com/Halolegend94/pdf4py
Author: Cristian Di Pietrantonio
Author-email: cristiandipietrantonio@gmail.com
License: UNKNOWN
Platform: UNKNOWN
Classifier: Programming Language :: Python :: 3
Classifier: License :: OSI Approved :: MIT License
Classifier: Operating System :: OS Independent
Requires-Python: >=3.6
Description-Content-Type: text/markdown

# pdf4py

A PDF parser written in Python 3 with no external dependencies.

The pdf4py package lets you interact with a PDF file at a low level and build higher-level functionality on top of it (e.g. text and/or image extraction). In particular, it defines the `Parser` class, which reads the cross-reference table of a PDF document and uses its entries to locate PDF objects within the file and parse them into suitable Python objects.

## Quick example

Here is a quick demonstration of how to use pdf4py.

```python
>>> from pdf4py.parser import Parser
>>> fp = open('tests/pdfs/0000.pdf', 'rb')
>>> parser = Parser(fp)
>>> info_ref = parser.trailer['Info']
>>> print(info_ref)
PDFReference(object_number=114, generation_number=0)
>>> info = parser.parse_reference(info_ref).value
>>> print(info)
{'Creator': PDFLiteralString(value=b'PaperCept Conference Management System'),
    ... , 'Producer': PDFLiteralString(value=b'PDFlib+PDI 7.0.3 (Perl 5.8.0/Linux)')}
>>> creator = info['Creator'].value.decode('utf8')
>>> print(creator)
PaperCept Conference Management System
```
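The `decode('utf8')` call above is enough for ASCII metadata. As a slightly more defensive sketch, the helper below decodes every string value of an Info-like dictionary, assuming (as in the output above) that each value exposes its raw bytes through a `.value` attribute. `LiteralString` is a hypothetical stand-in defined here so the snippet runs without a PDF file, and `decode_text_string` and `decode_info` are illustrative names, not part of pdf4py. Latin-1 is used only as an approximation of PDFDocEncoding.

```python
import codecs
from collections import namedtuple

# Hypothetical stand-in for the string objects shown above;
# the only assumption is a `.value` attribute holding raw bytes.
LiteralString = namedtuple('LiteralString', ['value'])


def decode_text_string(raw):
    """Decode the raw bytes of a PDF text string into a Python str."""
    # Text strings that start with the UTF-16BE byte order mark are UTF-16BE.
    if raw.startswith(codecs.BOM_UTF16_BE):
        return raw[len(codecs.BOM_UTF16_BE):].decode('utf-16-be')
    # Otherwise fall back to Latin-1, an approximation of PDFDocEncoding.
    return raw.decode('latin-1')


def decode_info(info):
    """Return a plain {key: str} dict, skipping values without bytes in `.value`."""
    decoded = {}
    for key, item in info.items():
        raw = getattr(item, 'value', None)
        if isinstance(raw, bytes):
            decoded[key] = decode_text_string(raw)
    return decoded


info = {
    'Creator': LiteralString(b'PaperCept Conference Management System'),
    'Title': LiteralString(codecs.BOM_UTF16_BE + 'Résumé'.encode('utf-16-be')),
}
print(decode_info(info))
```

With a real document, the dictionary returned by `parser.parse_reference(info_ref).value` would take the place of the hand-built `info` above.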

## Extracting text or images

Extracting text from a PDF and other higher-level analysis tasks are not natively supported, for
two reasons:

- they are far from trivial and would require a considerable amount of work, which for now I prefer
to invest in developing a complete and reliable parser;
- they are conceptually different tasks from PDF parsing, since the PDF format does not define the concept of
a structured document from the content point of view.

Therefore, they require a separate implementation built on top of `pdf4py`.
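As an illustration of what "built on top" can look like, here is a minimal sketch that counts the pages of a document by walking its page tree (the `Pages` entry of the document catalog, whose intermediate nodes list their children under `Kids`, as described in the PDF specification). It is written against a `resolve` callable rather than a concrete parser. With pdf4py one might pass something like `lambda ref: parser.parse_reference(ref).value`, assuming that dictionaries and arrays are parsed into Python `dict` and `list` objects and that references are hashable, which this README does not confirm. The toy `objects` table and its integer references are stand-ins so the snippet runs on its own.

```python
def count_pages(root_ref, resolve):
    """Count the leaf nodes of a PDF page tree.

    `resolve` is a hypothetical callable that maps a reference to the
    parsed dictionary it points to.
    """
    count = 0
    stack = [root_ref]
    seen = set()  # guards against malformed files with cyclic Kids entries
    while stack:
        ref = stack.pop()
        if ref in seen:
            continue
        seen.add(ref)
        node = resolve(ref)
        kids = node.get('Kids')
        if kids is None:
            count += 1  # no Kids entry: treat the node as a leaf page
        else:
            stack.extend(kids)
    return count


# Toy object table: integers play the role of indirect references.
objects = {
    1: {'Type': 'Pages', 'Kids': [2, 5], 'Count': 3},
    2: {'Type': 'Pages', 'Kids': [3, 4], 'Count': 2},
    3: {'Type': 'Page'},
    4: {'Type': 'Page'},
    5: {'Type': 'Page'},
}
print(count_pages(1, objects.__getitem__))  # prints 3
```

Text or image extraction would follow the same pattern, with considerably more work needed to interpret each page's content streams.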

## Why this package

One day at work I was asked to analyze some PDF files; to my surprise, I discovered that
there was no established Python module to easily parse a PDF document. To understand
why, I delved into the PDF 1.7 specification; from that moment on, I got more and more interested
in the inner workings of one of the most important and ubiquitous file formats. And what
better way to understand PDF than writing a parser for it?


## PDF standard coverage

You can check which features of the standard are implemented, and the progress on
supporting the missing ones, on the standard coverage [page](docs/source/standard_coverage.rst).


## Contributing

Contributions are more than welcome! You can contribute by:

- filing a new issue;
- proposing changes and additions through a pull request.


