Metadata-Version: 2.1
Name: pdftxt
Version: 0.3.2
Summary: PDF text extractor.
Home-page: https://bitbucket.org/mgemmill/pdftxt
License: BSD-3-Clause
Author: Mark Gemmill
Author-email: bitbucket@markgemmill.com
Requires-Python: >=3.6,<4.0
Classifier: Development Status :: 4 - Beta
Classifier: Intended Audience :: Developers
Classifier: License :: OSI Approved :: BSD License
Classifier: Programming Language :: Python :: 3
Classifier: Programming Language :: Python :: 3.6
Classifier: Programming Language :: Python :: 3.7
Requires-Dist: docopt (>=0.6.2,<0.7.0)
Requires-Dist: pdfminer.six (>=20181108.0,<20181109.0)
Description-Content-Type: text/markdown

### pdftxt

The goal of this project is to provide an API for extracting text
from specific regions of a PDF document or page, and a CLI to help
identify the location of text within a document.

### Installation

    ... pip install pdftxt


### Basic Command Line Usage

Let's say we have a PDF file (PDF-DOC.pdf) that looks like this:

![Source File Image](https://bytebucket.org/mgemmill/pdftxt/raw/36ef6c80f953ac5d4eae712d5c7943c23e8914bc/assets/readme_src_doc_.jpg)

The `pdftxt` command:

    ... pdftxt PDF-DOC.pdf

will output a visual layout of the PDF document's pages and text elements to an HTML page:

![Output File Image](https://bytebucket.org/mgemmill/pdftxt/raw/36ef6c80f953ac5d4eae712d5c7943c23e8914bc/assets/readme_output_doc_.jpg)


### API Usage


    from pdftxt import api

    filepath = 'tests/Word_PDF.pdf'

    with api.PdfTxtContext(filepath) as pdf:

        for page in pdf:

            # To fetch text objects from a specific region
            # of the page, first define the region:
            region = api.Region(400, 300, 512, 317)

            # Initialize layout parameters:
            params = api.PdfTxtParams()

            # Then analyze that area of the page for text objects:
            text = page.analyze(region, params)

            # Do whatever it is we need to do with the results:
            for txt in text:
                print(txt.text)
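
If the same region needs to be read from every page, the loop above can be wrapped in a small helper. The following is only a sketch: `collect_region_text` is a hypothetical name, not part of `pdftxt`, and it assumes nothing beyond what the example above shows, namely that each page has an `analyze(region, params)` method and that each returned object has a `text` attribute.

```python
from typing import Any, Iterable, List


def collect_region_text(pages: Iterable[Any], region: Any, params: Any) -> List[List[str]]:
    """Return one list of strings per page, taken from the given region.

    Hypothetical helper, not part of pdftxt. `pages` is any iterable of
    page objects, such as the `pdf` object yielded by api.PdfTxtContext.
    """
    results = []
    for page in pages:
        # page.analyze(region, params) is the call used in the example above;
        # each returned object is assumed to expose a `.text` attribute.
        results.append([txt.text for txt in page.analyze(region, params)])
    return results
```

Inside the `with api.PdfTxtContext(filepath) as pdf:` block, this could be called as `collect_region_text(pdf, api.Region(400, 300, 512, 317), api.PdfTxtParams())`. Returning one list per page keeps page boundaries visible, so an empty list indicates a page where the region contained no text.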

