Metadata-Version: 1.2
Name: pdftotree
Version: 0.2.4
Summary: Parse PDFs into HTML-like trees.
Home-page: https://github.com/HazyResearch/pdftotree
Author: Hazy Research
Author-email: senwu@cs.stanford.edu
License: MIT
Project-URL: Source, https://github.com/HazyResearch/pdftotree
Project-URL: Tracker, https://github.com/HazyResearch/pdftotree/issues
Description-Content-Type: UNKNOWN
Description: pdftotree
        =========
        
        |GitHub issues| |GitHub license| |GitHub stars| |Build Status| |PyPI|
        |PyPI - Python Version|
        
        `Fonduer <https://hazyresearch.github.io/snorkel/blog/fonduer.html>`__
        has been successfully extended to perform information extraction from
        richly formatted data such as tables. A crucial step in this process is
        the construction of the hierarchical tree of context objects such as
        text blocks, figures, tables, etc. The system currently uses PDF to HTML
        conversion provided by Adobe Acrobat. However, Adobe Acrobat is not an
        open source tool, which may be inconvenient for Fonduer users.
        
        This package is the result of building our own module as a replacement
        for Adobe Acrobat. Several open-source tools are available for
        PDF-to-HTML conversion, but these tools do not preserve the cell
        structure in a table. Our goal in this project is to develop a tool
        that extracts text, figures, and tables from a PDF document and
        maintains the structure of the document using a tree data structure.
        
        Dependencies
        ------------
        
        ::
        
            sudo apt-get install python3-tk
        
        Installation
        ------------
        
        ``pip install pdftotree``
        
        Usage
        -----
        
        pdftotree as a Python package
        ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
        
        .. code:: py
        
            import pdftotree
        
            pdftotree.parse(pdf_file, html_path=None, model_path=None, favor_figures=True, visualize=False)
        
        extract\_tree
        ~~~~~~~~~~~~~
        
        This is the primary command-line utility provided with this Python
        package. This takes a PDF file as input, and produces an HTML-like
        representation of the data.
        
        ::
        
            usage: extract_tree [-h] [--model_path MODEL_PATH] --pdf_file PDF_FILE
                                [--html_path HTML_PATH] [--favor_figures FAVOR_FIGURES]
                                [--visualize] [-v] [-vv]
        
            Script to extract tree structure from PDF files. Takes a PDF as input and
            outputs an HTML-like representation of the document's structure. By default,
            this conversion is done using heuristics. However, a model can be provided as
            a parameter to use a machine-learning-based approach.
        
            optional arguments:
              -h, --help            show this help message and exit
              --model_path MODEL_PATH
                                    Pretrained model, generated by extract_tables tool
              --pdf_file PDF_FILE   PDF file name for which tree structure needs to be
                                    extracted
              --html_path HTML_PATH
                                    Path where tree structure should be saved. If none,
                                    HTML is printed to stdout.
              --favor_figures FAVOR_FIGURES
                                    Whether figures must be favored over other parts such
                                    as tables and section headers
              --visualize           Whether to output visualization images for the tree
              -v                    Output INFO level logging.
              -vv                   Output DEBUG level logging.
        
        extract\_tables
        ~~~~~~~~~~~~~~~
        
        ::
        
            usage: extract_tables [-h] [--mode MODE] --model-path MODEL_PATH
                                  [--train-pdf TRAIN_PDF] --test-pdf TEST_PDF
                                  [--gt-train GT_TRAIN] --gt-test GT_TEST --datapath
                                  DATAPATH [--iou-thresh IOU_THRESH] [-v] [-vv]
        
            Script to extract tables bounding boxes from PDF files using machine learning.
            If `model.pkl` is saved in the model-path, the pickled model will be used for
            prediction. Otherwise the model will be retrained. If --mode is test (by
            default), the script will create a .bbox file containing the tables for the
            pdf documents listed in the file --test-pdf. If --mode is dev, the script will
            also extract ground truth labels for the test data and compute statistics.
        
            optional arguments:
              -h, --help            show this help message and exit
              --mode MODE           Usage mode dev or test, default is test
              --model-path MODEL_PATH
                                    Path to the model. If the file exists, it will be
                                    used. Otherwise, a new model will be trained.
              --train-pdf TRAIN_PDF
                                    List of pdf file names used for training. These files
                                    must be saved in the --datapath directory. Required if
                                    no pretrained model is provided.
              --test-pdf TEST_PDF   List of pdf file names used for testing. These files
                                    must be saved in the --datapath directory.
              --gt-train GT_TRAIN   Ground truth train tables. Required if no pretrained
                                    model is provided.
              --gt-test GT_TEST     Ground truth test tables.
              --datapath DATAPATH   Path to directory containing the input documents.
              --iou-thresh IOU_THRESH
                                    Intersection over union threshold to remove duplicate
                                    tables
              -v                    Output INFO level logging
              -vv                   Output DEBUG level logging
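        
        The ``--iou-thresh`` option refers to intersection over union (IoU),
        the ratio of the area shared by two boxes to the area covered by
        either. The snippet below is a minimal sketch of that ratio, assuming
        boxes are given as ``(top, left, bottom, right)`` in the same order
        used by the ground truth labels. The function name ``iou`` is
        hypothetical and this is not necessarily how ``extract_tables``
        computes it internally.
        
        .. code:: py
        
            def iou(box_a, box_b):
                """Intersection over union of two (top, left, bottom, right) boxes.
        
                Illustrative helper only; not part of the pdftotree API.
                """
                # Edges of the overlapping rectangle, if there is one.
                top = max(box_a[0], box_b[0])
                left = max(box_a[1], box_b[1])
                bottom = min(box_a[2], box_b[2])
                right = min(box_a[3], box_b[3])
        
                # Clamp at zero so disjoint boxes yield no intersection.
                inter = max(0, bottom - top) * max(0, right - left)
        
                area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
                area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
                union = area_a + area_b - inter
        
                return inter / union if union > 0 else 0.0
        
        Two detections whose IoU exceeds the chosen threshold would then be
        treated as duplicates of the same table.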
        
        PDF List Format
        ^^^^^^^^^^^^^^^
        
        The list of PDFs is a plain text file with one filename per line. For example:
        
        ::
        
            1-s2.0-S000925411100369X-main.pdf
            1-s2.0-S0009254115301030-main.pdf
            1-s2.0-S0012821X12005717-main.pdf
            1-s2.0-S0012821X15007487-main.pdf
            1-s2.0-S0016699515000601-main.pdf
        
        Ground Truth File Format
        ^^^^^^^^^^^^^^^^^^^^^^^^
        
        The ground truth is formatted to mirror the PDF list. That is, the
        first line of the ground truth file provides the labels for the first
        document in the corresponding PDF list. Labels take the form of
        semicolon-separated tuples containing the values
        ``(page_num, page_width, page_height, top, left, bottom, right)``. For
        example:
        
        ::
        
            (10, 696, 951, 634, 366, 832, 653);(14, 696, 951, 720, 62, 819, 654);(4, 696, 951, 152, 66, 813, 654);(7, 696, 951, 415, 57, 833, 647);(8, 696, 951, 163, 370, 563, 652)
            (11, 713, 951, 97, 47, 204, 676);(11, 713, 951, 261, 45, 357, 673);(3, 713, 951, 110, 44, 355, 676);(8, 713, 951, 763, 55, 903, 687)
            (5, 672, 951, 88, 57, 203, 578);(5, 672, 951, 593, 60, 696, 579)
            (5, 718, 951, 131, 382, 403, 677)
            (13, 713, 951, 119, 56, 175, 364);(13, 713, 951, 844, 57, 902, 363);(14, 713, 951, 109, 365, 164, 671);(8, 713, 951, 663, 46, 890, 672)
        
        One method to label these tables is to use
        `DocumentAnnotation <https://github.com/payalbajaj/DocumentAnnotation>`__,
        which allows you to select table regions in your web browser and
        produces the bounding box file.
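        
        To make the line-by-line correspondence between the two files
        concrete, here is a minimal sketch that parses ground truth lines and
        pairs them with filenames from the PDF list. The helper names
        ``parse_ground_truth_line`` and ``pair_labels`` are hypothetical and
        are not part of pdftotree. The sketch assumes every tuple holds
        exactly seven integers, as in the example above.
        
        .. code:: py
        
            import re
        
            # Matches the contents of each parenthesized tuple on a line.
            TUPLE_RE = re.compile(r"\(([^()]*)\)")
        
        
            def parse_ground_truth_line(line):
                """Parse one ground truth line into a list of 7-integer tuples.
        
                Each tuple is (page_num, page_width, page_height,
                top, left, bottom, right).
                """
                boxes = []
                for match in TUPLE_RE.finditer(line):
                    values = tuple(int(v) for v in match.group(1).split(","))
                    if len(values) != 7:
                        raise ValueError("expected 7 values, got %d" % len(values))
                    boxes.append(values)
                return boxes
        
        
            def pair_labels(pdf_names, gt_lines):
                """Map each PDF filename to the labels on the matching line."""
                if len(pdf_names) != len(gt_lines):
                    raise ValueError("PDF list and ground truth differ in length")
                return {
                    name: parse_ground_truth_line(line)
                    for name, line in zip(pdf_names, gt_lines)
                }
        
        For example, calling ``pair_labels`` with the stripped lines of a PDF
        list file and the lines of its ground truth file yields a dictionary
        keyed by filename, which makes it easy to spot a document whose labels
        are missing or malformed.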
        
        Example Dataset: Paleontological Papers
        ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
        
        A full set of documents and ground truth labels can be `downloaded
        here <http://i.stanford.edu/hazy/share/fonduer/pdftotree_paleo.tar.gz>`__.
        You can train a machine-learning model to extract table regions by
        downloading this dataset and extracting it into a directory named
        ``data`` and then running the command below. Double check that the paths
        in the command match wherever you have downloaded the data.
        
        ::
        
            extract_tables \
                --train-pdf data/paleo/ml/train.pdf.list.paleo.not.scanned \
                --gt-train data/paleo/ml/gt.train \
                --test-pdf data/paleo/ml/test.pdf.list.paleo.not.scanned \
                --gt-test data/paleo/ml/gt.test \
                --datapath data/paleo/documents/ \
                --model-path data/model.pkl
        
        The model trained by this example command is saved as
        ``data/model.pkl``.
        
        For Developers
        --------------
        
        Tests
        ~~~~~
        
        Once you've cloned this repository, first make sure you have the
        dependencies installed:
        
        ::
        
            pip install -r requirements.txt
        
        Then you can run our tests:
        
        ::
        
            pytest tests -rs
        
        To test changes in the package, you can also install it locally in your
        virtualenv by running
        
        ::
        
            python setup.py develop
        
        .. |GitHub issues| image:: https://img.shields.io/github/issues/HazyResearch/pdftotree.svg
           :target: https://github.com/HazyResearch/pdftotree/projects/2
        .. |GitHub license| image:: https://img.shields.io/github/license/HazyResearch/pdftotree.svg
           :target: https://github.com/HazyResearch/pdftotree/blob/master/LICENSE
        .. |GitHub stars| image:: https://img.shields.io/github/stars/HazyResearch/pdftotree.svg
           :target: https://github.com/HazyResearch/pdftotree/stargazers
        .. |Build Status| image:: https://travis-ci.org/HazyResearch/pdftotree.svg?branch=master
           :target: https://travis-ci.org/HazyResearch/pdftotree
        .. |PyPI| image:: https://img.shields.io/pypi/v/pdftotree.svg
           :target: https://pypi.python.org/pypi/pdftotree
        .. |PyPI - Python Version| image:: https://img.shields.io/pypi/pyversions/pdftotree.svg
           :target: https://pypi.python.org/pypi/pdftotree
        
Keywords: pdf,parsing,html
Platform: UNKNOWN
Classifier: Development Status :: 3 - Alpha
Classifier: Intended Audience :: Developers
Classifier: Programming Language :: Python
Classifier: Programming Language :: Python :: 3
Classifier: Programming Language :: Python :: 3.4
Classifier: Programming Language :: Python :: 3.5
Classifier: Programming Language :: Python :: 3.6
Classifier: Programming Language :: Python :: 3 :: Only
Requires-Python: >3
