Metadata-Version: 2.0
Name: minecart
Version: 0.2
Summary: Simple, Pythonic extraction of images, text, and shapes from PDFs
Home-page: https://github.com/felipeochoa/minecart
Author: Felipe Ochoa
Author-email: find me through Github
License: MIT
Keywords: pdf pdfminer extract mining images
Platform: UNKNOWN
Classifier: Development Status :: 4 - Beta
Classifier: Intended Audience :: Developers
Classifier: Operating System :: OS Independent
Classifier: Programming Language :: Python :: 2.7
Classifier: Programming Language :: Python :: 2 :: Only
Classifier: License :: OSI Approved :: MIT License
Requires-Dist: pdfminer
Provides-Extra: PIL
Requires-Dist: Pillow; extra == 'PIL'

minecart: A Pythonic interface to PDF documents
===============================================

|Travis CI build status (Linux)| |Coverage Status|

``minecart`` is a Python package that simplifies the extraction of text,
images, and shapes from a PDF document. It provides a very Pythonic
interface to extract positioning, color, and font metadata for all of
the objects in the PDF. It is a pure-Python package (it depends on
`pdfminer`_ for the low-level parsing). ``minecart`` takes
inspiration from Tim McNamara’s `slate`_ , but aims to provide more
detailed information:

.. code:: python

    >>> pdffile = open('example.pdf', 'rb')
    >>> doc = minecart.Document(pdffile)
    >>> page = doc.get_page(3)
    >>> for shape in page.shapes.iter_in_bbox((0, 0, 100, 200)):
    ...     print shape.path, shape.fill.color.as_rgb()
    >>> im = page.images[0].as_pil()  # requires pillow
    >>> im.show()

Installation
------------

Currently only Python 2.7 is supported. Plans for supporting 3.4+ (using
`pdfminer.six`_) is planned.

1. The easy way: ``pip instal minecart``
2. The hard way: download the source code, change into the working
   directory, and run ``python setup.py install``

**For CJK languages**: Supporting the CJK languages requires an
addtional step, as detailed `in pdfminer`_.

Currently supported features
----------------------------

-  *Shapes*: You can extract path information, bounding box, stroke
   parameters, and stroke/fill colors. Color support is fairly robust,
   allowing the simple ``.as_rgb()`` in most cases. (To be concrete,
   ``minecart`` supports the ``DeviceRGB``, ``DeviceCMYK``,
   ``DeviceGray``, and ``CIE-based`` color spaces. ``Indexed`` colors
   are supported if they index into one of the above.)
-  *Images*: ``minecart`` can easily extract images to ``PIL.Image``
   objects.
-  *Text*: (Called ``Lettering`` in the source) In addition to
   extracting plain text from the PDF, you have access to
   position/bounding box information and the font used.

If there’s a feature you’d like to extract from a PDF that’s not
currently supported, open up an issue or submit a pull request! I’m
especially interested in hearing whether there are many PDFs using color
spaces outside of the ones currently supported.

Documentation
-------------

The main entry point will always be ``minecart.Document``, which accepts
a single parameter, an open file-like object which will be read to
create the document. The ``Document`` has two primary methods for
accessing its contents: ``.get_page(num)`` and ``iter_pages()``. Both
methods return ``minecart.Page`` objects, which provide access to the
graphical elements found on the page. ``Page`` objects have three main
attributes:

-  ``.images``: A list of all the ``minecart.Image`` objects found on
   the page.

-  ``.letterings``: A list of all the text objects found on the page, as
   ``Lettering`` objects. ``Lettering`` is a ``unicode`` subclass which
   adds bounding box and font information (using ``.get_bbox()`` or
   ``.font``).

-  ``.shapes``: A list of all the squares, circles, lines, etc. found on
   the page as ``Shape`` objects. ``Shape`` objects have three main
   attributes of interest:

-  ``stroke``: An object containing the stroke parameters used to draw
   the shape. ``.stroke`` has ``.color``, ``.linewidth``, ``.linecap``,
   ``.linejoin``, ``.miterlimit``, and ``.dash`` attributes. If the
   shape was not stroked, ``.stroke`` will be ``None``.

-  ``.fill``: An object containing the fill parameters used to draw the
   shape. Right now, ``.fill`` only has a ``.color``\ parameter.

-  ``.path``: A list with the coordinates used to defined the shape, as
   well as the type of line segment each set of coordinates defines.
   Refer to the ``minecart.Shape`` documentation for more details

I try to keep docstrings complete and up to date, so you can read
through the source or use ``dir`` and ``help`` to see what methods are
available. Most of the public interface is implemented in the
``content`` class, and ``miner`` has more of the PDF nitty-gritty stuff.

Contributing
------------

Bug reports are always welcome (using the GitHub tracker) as are feature
requests. The PDF spec has so many corners, it is hard for me to
prioritize implementing access to its various features. If there’s
something you’d like to extract from a document but isn’t currently
supported, please `create a new issue`_.

If you’d like to contribute code, you can either create an issue and
include a patch (if the changes are small) or fork the project and
create a pull request.

.. _`create a new issue`: https://github.com/felipeochoa/minecart/issues/new
.. _pdfminer: https://github.com/euske/pdfminer
.. _`in pdfminer`: https://github.com/euske/pdfminer#for-cjk-languages
.. _slate: https://github.com/timClicks/slate
.. _pdfminer.six: https://github.com/goulu/pdfminer
.. |Travis CI build status (Linux)| image:: https://travis-ci.org/felipeochoa/minecart.svg?branch=master
   :target: https://travis-ci.org/felipeochoa/minecart
.. |Coverage Status| image:: https://coveralls.io/repos/felipeochoa/minecart/badge.svg
   :target: https://coveralls.io/r/felipeochoa/minecart


