Metadata-Version: 2.0
Name: pdfx
Version: 1.2.1
Summary: Extract metadata and URLs from PDF files, and download all referenced PDFs
Home-page: http://www.metachris.com/pdfx
Author: Chris Hager
Author-email: chris@linuxuser.at
License: Apache
Keywords: pdf extract download urls
Platform: UNKNOWN
Classifier: Development Status :: 4 - Beta
Classifier: Environment :: Console
Classifier: Intended Audience :: Developers
Classifier: Intended Audience :: Education
Classifier: Intended Audience :: End Users/Desktop
Classifier: Intended Audience :: Information Technology
Classifier: Intended Audience :: Science/Research
Classifier: License :: OSI Approved :: Apache Software License
Classifier: Programming Language :: Python :: 2
Classifier: Programming Language :: Python :: 2.6
Classifier: Programming Language :: Python :: 2.7
Classifier: Programming Language :: Python :: 3
Classifier: Programming Language :: Python :: 3.2
Classifier: Programming Language :: Python :: 3.3
Classifier: Programming Language :: Python :: 3.4
Classifier: Programming Language :: Python :: 3.5
Classifier: Topic :: Software Development :: Build Tools
Classifier: Topic :: Scientific/Engineering
Classifier: Topic :: Software Development :: Libraries
Classifier: Topic :: Utilities
Requires-Dist: chardet
Requires-Dist: pdfminer2
Provides-Extra: dev
Requires-Dist: check-manifest; extra == 'dev'
Provides-Extra: test
Requires-Dist: tox; extra == 'test'

====
PDFx
====

.. image:: https://badge.fury.io/py/pdfx.svg
   :target: https://pypi.python.org/pypi/pdfx

.. image:: https://travis-ci.org/metachris/pdfx.svg?branch=master
   :target: https://travis-ci.org/metachris/pdfx

.. image:: https://img.shields.io/badge/license-Apache-blue.svg
   :target: https://github.com/metachris/pdfx/blob/master/LICENSE

Introduction
============

Extract metadata and references from a local or remote PDF, and optionally download all referenced PDFs.

**Features**

* Extract metadata and references from a given PDF
* Detect PDF, URL, arXiv and DOI references
* **Fast, parallel download of all referenced PDFs**
* Output as text or JSON (using the ``-j`` flag)
* Extract the PDF text (using the ``--text`` flag)
* Use as command-line tool or Python package
* Compatible with Python 2 and 3
* Works with local and online PDFs


Getting Started
===============

Install ``pdfx`` with ``easy_install`` or ``pip``, then run it on a PDF file or URL::

    $ easy_install -U pdfx
    ...
    $ pdfx <pdf-file-or-url>

Run ``pdfx -h`` to see the help output::

    $ pdfx -h
    usage: pdfx [-h] [-d OUTPUT_DIRECTORY] [-j] [-v] [-t] [-o OUTPUT_FILE]
                [--version]
                pdf

    Extract metadata and references from a PDF, and optionally download all
    referenced PDFs. Visit https://www.metachris.com/pdfx for more information.

    positional arguments:
      pdf                   Filename or URL of a PDF file

    optional arguments:
      -h, --help            show this help message and exit
      -d OUTPUT_DIRECTORY, --download-pdfs OUTPUT_DIRECTORY
                            Download all referenced PDFs into specified directory
      -j, --json            Output infos as JSON (instead of plain text)
      -v, --verbose         Print all references (instead of only PDF)
      -t, --text            Only extract text (no metadata or references)
      -o OUTPUT_FILE, --output-file OUTPUT_FILE
                            Output to specified file instead of console
      --version             show program's version number and exit


Examples
========

Let's take a look at the paper at https://weakdh.org/imperfect-forward-secrecy.pdf. Running ``pdfx`` on it prints the document metadata and the referenced PDFs::

    $ pdfx https://weakdh.org/imperfect-forward-secrecy.pdf
    Document infos:
    - CreationDate = D:20150821110623-04'00'
    - Creator = LaTeX with hyperref package
    - ModDate = D:20150821110805-04'00'
    - PTEX.Fullbanner = This is pdfTeX, Version 3.1415926-2.5-1.40.14 (TeX Live 2013/Debian) kpathsea version 6.1.1
    - Pages = 13
    - Producer = pdfTeX-1.40.14
    - Title = Imperfect Forward Secrecy: How Diffie-Hellman Fails in Practice
    - Trapped = False
    - dc = {'title': {'x-default': 'Imperfect Forward Secrecy: How Diffie-Hellman Fails in Practice'}, 'creator': [None], 'description': {'x-default': None}, 'format': 'application/pdf'}
    - pdf = {'Keywords': None, 'Producer': 'pdfTeX-1.40.14', 'Trapped': 'False'}
    - pdfx = {'PTEX.Fullbanner': 'This is pdfTeX, Version 3.1415926-2.5-1.40.14 (TeX Live 2013/Debian) kpathsea version 6.1.1'}
    - xap = {'CreateDate': '2015-08-21T11:06:23-04:00', 'ModifyDate': '2015-08-21T11:08:05-04:00', 'CreatorTool': 'LaTeX with hyperref package', 'MetadataDate': '2015-08-21T11:08:05-04:00'}
    - xapmm = {'InstanceID': 'uuid:4e570f88-cd0f-4488-85ad-03f4435a4048', 'DocumentID': 'uuid:98988d37-b43d-4c1a-965b-988dfb2944b6'}

    References: 36
    - URL: 18
    - PDF: 18

    PDF References:
    - http://www.spiegel.de/media/media-35533.pdf
    - http://www.spiegel.de/media/media-35513.pdf
    - http://www.spiegel.de/media/media-35509.pdf
    - http://www.spiegel.de/media/media-35529.pdf
    - http://www.spiegel.de/media/media-35527.pdf
    - http://cr.yp.to/factorization/smoothparts-20040510.pdf
    - http://www.spiegel.de/media/media-35517.pdf
    - http://www.spiegel.de/media/media-35526.pdf
    - http://www.spiegel.de/media/media-35519.pdf
    - http://www.spiegel.de/media/media-35522.pdf
    - http://cryptome.org/2013/08/spy-budget-fy13.pdf
    - http://www.spiegel.de/media/media-35515.pdf
    - http://www.spiegel.de/media/media-35514.pdf
    - http://www.hyperelliptic.org/tanja/SHARCS/talks06/thorsten.pdf
    - http://www.spiegel.de/media/media-35528.pdf
    - http://www.spiegel.de/media/media-35671.pdf
    - http://www.spiegel.de/media/media-35520.pdf
    - http://www.spiegel.de/media/media-35551.pdf

You can use the ``-v`` flag to output all references instead of just the PDFs.

Download all referenced PDFs into a directory of your choice (e.g. ``./``) with ``-d`` (short for ``--download-pdfs``)::

    $ pdfx https://weakdh.org/imperfect-forward-secrecy.pdf -d ./
    ...
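
The ``CreationDate`` and ``ModDate`` values in the metadata output above use the PDF date format (``D:YYYYMMDDHHmmSS`` followed by a UTC offset). The following is a minimal sketch of how such a string could be parsed in Python. It is not part of ``pdfx``: ``parse_pdf_date`` is a hypothetical helper, and it assumes the full form with seconds as shown in the example, not the shortened variants the PDF format also permits.

```python
import re
from datetime import datetime, timedelta

# Matches the full form, e.g. D:20150821110623-04'00' (offset part optional).
_PDF_DATE = re.compile(
    r"^D:(\d{4})(\d{2})(\d{2})(\d{2})(\d{2})(\d{2})"
    r"([+\-Z])?(\d{2})?'?(\d{2})?'?$"
)


def parse_pdf_date(value):
    """Return (naive local datetime, UTC offset as timedelta or None).

    Hypothetical helper for illustration; not provided by pdfx.
    """
    match = _PDF_DATE.match(value)
    if match is None:
        raise ValueError("unsupported PDF date: %r" % (value,))
    groups = match.groups()
    year, month, day, hour, minute, second = (int(g) for g in groups[:6])
    local = datetime(year, month, day, hour, minute, second)
    sign, off_hours, off_minutes = groups[6:]
    if sign is None:
        return local, None          # no offset information in the string
    if sign == "Z":
        return local, timedelta(0)  # "Z" means the time is already UTC
    offset = timedelta(hours=int(off_hours or 0), minutes=int(off_minutes or 0))
    return local, (-offset if sign == "-" else offset)


if __name__ == "__main__":
    local, offset = parse_pdf_date("D:20150821110623-04'00'")
    print(local.isoformat(), offset)
```

For the ``CreationDate`` above, the parsed local time and offset agree with the ``CreateDate`` entry in the ``xap`` metadata (``2015-08-21T11:06:23-04:00``).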


Usage as Python library
=======================

::

    >>> import pdfx
    >>> pdf = pdfx.PDFx("filename-or-url.pdf")
    >>> metadata = pdf.get_metadata()
    >>> references_list = pdf.get_references()
    >>> references_dict = pdf.get_references_as_dict()
    >>> pdf.download_pdfs("target-directory")
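
As a rough sketch of how the library calls above might be combined, the following script prints a per-type reference count similar to the ``References:`` block of the command-line output. It assumes that ``get_references_as_dict()`` returns a mapping from reference type (such as ``pdf`` or ``url``) to a collection of references; ``summarize_references``, ``format_summary`` and ``summarize_pdf`` are hypothetical helper names, and the sample data is hand-written rather than real ``pdfx`` output.

```python
from collections import OrderedDict


def summarize_references(refs_by_type):
    """Return (total, per-type counts) for a {type: [references]} mapping."""
    counts = OrderedDict(
        (ref_type, len(refs)) for ref_type, refs in sorted(refs_by_type.items())
    )
    return sum(counts.values()), counts


def format_summary(refs_by_type):
    """Render the counts in a layout similar to the CLI's summary."""
    total, counts = summarize_references(refs_by_type)
    lines = ["References: %d" % total]
    lines.extend("- %s: %d" % (t.upper(), n) for t, n in counts.items())
    return "\n".join(lines)


def summarize_pdf(path_or_url):
    """Load a PDF with pdfx and summarize its references.

    pdfx is imported lazily so the helpers above work without it installed.
    Assumption: get_references_as_dict() returns {type: collection}.
    """
    import pdfx

    pdf = pdfx.PDFx(path_or_url)
    return format_summary(pdf.get_references_as_dict())


if __name__ == "__main__":
    # Hand-written sample standing in for get_references_as_dict() output.
    sample = {
        "pdf": ["http://cr.yp.to/factorization/smoothparts-20040510.pdf"],
        "url": ["https://example.org/a", "https://example.org/b"],
    }
    print(format_summary(sample))
```

To run it against a real document, call ``summarize_pdf("filename-or-url.pdf")`` with ``pdfx`` installed.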


TODO
====

* https://github.com/metachris/pdfx/issues


Various
=======

* Author: Chris Hager <chris@linuxuser.at>
* Homepage: https://www.metachris.com/pdfx
* License: Apache

Feedback, ideas and pull requests are welcome!


