Metadata-Version: 2.1
Name: leaf-focus
Version: 0.4.1
Summary: Extract structured text from pdf files.
Project-URL: Homepage, https://github.com/anotherbyte-net/leaf-focus
Project-URL: Changelog, https://github.com/anotherbyte-net/leaf-focus/blob/main/CHANGELOG.md
Project-URL: Source, https://github.com/anotherbyte-net/leaf-focus
Project-URL: Tracker, https://github.com/anotherbyte-net/leaf-focus/issues
Classifier: Programming Language :: Python :: 3
Classifier: License :: OSI Approved :: Apache Software License
Classifier: Operating System :: POSIX :: Linux
Classifier: Operating System :: Microsoft :: Windows
Classifier: Development Status :: 3 - Alpha
Classifier: Topic :: Scientific/Engineering :: Information Analysis
Classifier: Topic :: Utilities
Requires-Python: >=3.7
Description-Content-Type: text/markdown
License-File: LICENSE
Requires-Dist: matplotlib (==3.5.3)
Requires-Dist: defusedxml (==0.7.1)
Requires-Dist: pyoxidizer (==0.22)
Requires-Dist: importlib-resources (==5.9.0)
Requires-Dist: keras-ocr (==0.9.1) ; python_version < "3.10"
Requires-Dist: tensorflow (==2.10.0) ; python_version < "3.10"
Requires-Dist: numpy (==1.21.6) ; python_version < "3.8"
Requires-Dist: importlib-metadata (==4.2.0) ; python_version < "3.8"
Requires-Dist: typing-inspect (==0.8.0) ; python_version < "3.8"
Requires-Dist: numpy (==1.23.3) ; python_version >= "3.8"
Requires-Dist: importlib-metadata (==4.12.0) ; python_version >= "3.8"
Provides-Extra: dev
Requires-Dist: pip (==22.2.2) ; extra == 'dev'
Requires-Dist: setuptools (==65.3.0) ; extra == 'dev'
Requires-Dist: wheel (==0.37.1) ; extra == 'dev'
Requires-Dist: build (==0.8.0) ; extra == 'dev'
Requires-Dist: twine (==4.0.1) ; extra == 'dev'
Requires-Dist: pytest (==7.1.3) ; extra == 'dev'
Requires-Dist: pytest-mock (==3.8.2) ; extra == 'dev'
Requires-Dist: pytest-cov (==3.0.0) ; extra == 'dev'
Requires-Dist: tblib (==1.7.0) ; extra == 'dev'
Requires-Dist: tox (==3.26.0) ; extra == 'dev'
Requires-Dist: coverage (==6.4.4) ; extra == 'dev'
Requires-Dist: black (==22.8.0) ; extra == 'dev'
Requires-Dist: flake8 (==5.0.4) ; extra == 'dev'
Requires-Dist: flake8-annotations-coverage (==0.0.6) ; extra == 'dev'
Requires-Dist: flake8-black (==0.3.3) ; extra == 'dev'
Requires-Dist: flake8-bugbear (==22.8.23) ; extra == 'dev'
Requires-Dist: flake8-comprehensions (==3.10.0) ; extra == 'dev'
Requires-Dist: flake8-unused-arguments (==0.0.11) ; extra == 'dev'
Requires-Dist: mypy (==0.971) ; extra == 'dev'
Requires-Dist: pylint (==2.15.2) ; extra == 'dev'
Requires-Dist: pydocstyle (==6.1.1) ; extra == 'dev'
Requires-Dist: pyright (==1.1.270) ; extra == 'dev'
Requires-Dist: types-dateparser (==1.1.4) ; extra == 'dev'
Requires-Dist: types-PyYAML (==6.0.11) ; extra == 'dev'
Requires-Dist: types-requests (==2.28.10) ; extra == 'dev'
Requires-Dist: types-backports (==0.1.3) ; extra == 'dev'
Requires-Dist: types-urllib3 (==1.26.24) ; extra == 'dev'
Requires-Dist: pdoc3 (==0.10.0) ; extra == 'dev'
Requires-Dist: pyre-check (==0.9.16) ; (platform_system != "Windows") and extra == 'dev'
Requires-Dist: pytype (==2022.8.3) ; (python_version <= "3.10" and platform_system != "Windows") and extra == 'dev'

# leaf-focus

Extract structured text from PDF files.

## Install

Install from PyPI using pip:

```bash
pip install leaf-focus
```

[![PyPI](https://img.shields.io/pypi/v/leaf-focus)](https://pypi.org/project/leaf-focus/)
![PyPI - Python Version](https://img.shields.io/pypi/pyversions/leaf-focus)
![GitHub Workflow Status (branch)](https://img.shields.io/github/workflow/status/anotherbyte-net/leaf-focus/Create%20Package/main)

leaf-focus needs the Xpdf command line tools. Download the [Xpdf command line tools](https://www.xpdfreader.com/download.html) and extract the executable files.

Pass the directory containing the executable files as `--exe-dir`; this option is required.
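
Before running leaf-focus, it can help to confirm that the directory really contains the Xpdf executables. The sketch below is only an illustration: `missing_xpdf_tools` is a hypothetical helper, not part of leaf-focus, and the tool names `pdfinfo`, `pdftotext`, and `pdftopng` are an assumption, since this README does not state which Xpdf tools are called. Adjust the names as needed.

```python
import shutil
from pathlib import Path
from typing import List, Sequence

# Assumed tool names: they ship with the Xpdf command line tools, but this
# README does not state which of them leaf-focus actually calls.
ASSUMED_TOOLS = ("pdfinfo", "pdftotext", "pdftopng")


def missing_xpdf_tools(
    exe_dir: Path, names: Sequence[str] = ASSUMED_TOOLS
) -> List[str]:
    """Return the tool names that cannot be found as executables in exe_dir."""
    # Passing `path` makes shutil.which search exe_dir instead of the PATH
    # environment variable; on Windows it also tries extensions such as .exe.
    return [
        name for name in names if shutil.which(name, path=str(exe_dir)) is None
    ]
```

An empty list means every assumed tool was found; otherwise the returned names show what is missing from the directory.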


## Usage

```text
usage: leaf-focus [-h] [--version] --exe-dir EXE_DIR [--page-images] [--ocr]
                  [--first FIRST] [--last LAST]
                  [--log-level {debug,info,warning,error,critical}]
                  input_pdf output_dir

Extract structured text from a pdf file.

positional arguments:
  input_pdf             path to the pdf file to read
  output_dir            path to the directory to save the extracted text files

optional arguments:
  -h, --help            show this help message and exit
  --version             show program's version number and exit
  --exe-dir EXE_DIR     path to the directory containing xpdf executable files
  --page-images         save each page of the pdf as a separate image
  --ocr                 run optical character recognition on each page of the
                        pdf
  --first FIRST         the first pdf page to process
  --last LAST           the last pdf page to process
  --log-level {debug,info,warning,error,critical}
                        the log level: debug, info, warning, error, critical
```

### Examples

```bash
# Extract the PDF information and embedded text.
leaf-focus --exe-dir [path-to-xpdf-exe-dir] file.pdf file-pages

# Extract the PDF information, embedded text, an image of each page,
# and optical character recognition (OCR) results for each page.
leaf-focus --exe-dir [path-to-xpdf-exe-dir] file.pdf file-pages --ocr
```
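
To call the command line tool from a Python script, the options listed in the usage text can be assembled into an argument list. This is a minimal sketch: `build_leaf_focus_command` is a hypothetical helper rather than part of the leaf-focus API, and it assumes the `leaf-focus` executable is available on `PATH`.

```python
from pathlib import Path
from typing import List, Optional


def build_leaf_focus_command(
    exe_dir: Path,
    input_pdf: Path,
    output_dir: Path,
    page_images: bool = False,
    ocr: bool = False,
    first: Optional[int] = None,
    last: Optional[int] = None,
    log_level: Optional[str] = None,
) -> List[str]:
    """Build an argument list using only the options shown in the usage text."""
    # Assumption: the `leaf-focus` console script is on PATH.
    cmd = ["leaf-focus", "--exe-dir", str(exe_dir)]
    if page_images:
        cmd.append("--page-images")
    if ocr:
        cmd.append("--ocr")
    if first is not None:
        cmd += ["--first", str(first)]
    if last is not None:
        cmd += ["--last", str(last)]
    if log_level is not None:
        cmd += ["--log-level", log_level]
    # The two positional arguments come last: input pdf, then output directory.
    cmd += [str(input_pdf), str(output_dir)]
    return cmd
```

The resulting list can be passed to `subprocess.run(cmd, check=True)`, which raises `CalledProcessError` if the command exits with a non-zero status. Building a list rather than a single string avoids shell quoting problems with paths that contain spaces.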
