Metadata-Version: 2.0
Name: pdftabextract
Version: 0.2.0
Summary: A set of tools for data mining (OCR-processed) PDFs
Home-page: https://github.com/WZBSocialScienceCenter/pdftabextract
Author: Markus Konrad
Author-email: markus.konrad@wzb.eu
License: Apache 2.0
Description-Content-Type: UNKNOWN
Keywords: datamining ocr pdf tabular data mining extract extraction
Platform: UNKNOWN
Classifier: Development Status :: 4 - Beta
Classifier: Intended Audience :: Science/Research
Classifier: Intended Audience :: Developers
Classifier: License :: OSI Approved :: Apache Software License
Classifier: Operating System :: OS Independent
Classifier: Programming Language :: Python
Classifier: Programming Language :: Python :: 3
Classifier: Programming Language :: Python :: 3.4
Classifier: Programming Language :: Python :: 3.5
Classifier: Topic :: Scientific/Engineering :: Information Analysis
Classifier: Topic :: Software Development :: Libraries :: Python Modules
Classifier: Topic :: Utilities
Requires-Dist: numpy
Requires-Dist: opencv-python
Requires-Dist: scipy
Provides-Extra: pandas_dataframes
Requires-Dist: pandas; extra == 'pandas_dataframes'

This repository contains a set of Python 3 tools for extracting tabular data from scanned,
OCR-processed documents that are available as PDF files. Before these files can be processed, they need
to be converted to XML files in pdf2xml format using poppler-utils. Further information and examples can be found
in the GitHub repository.
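
The conversion step can be scripted. The following is a minimal sketch, not part of pdftabextract itself: it assumes that the pdftohtml program from poppler-utils is installed and on the PATH, and that the flags -c, -hidden and -xml are a suitable choice for OCR-processed scans. The function names build_pdftohtml_command and convert_pdf_to_xml are hypothetical helpers introduced only for this illustration.

```python
import shutil
import subprocess


def build_pdftohtml_command(pdf_path, xml_path):
    """Build the argument list for a pdftohtml call that writes XML output.

    The flag selection is an assumption for this sketch, not a requirement
    stated by the package metadata.
    """
    return [
        "pdftohtml",
        "-c",       # generate complex output
        "-hidden",  # force extraction of hidden text (e.g. an OCR text layer)
        "-xml",     # output for XML post-processing
        str(pdf_path),
        str(xml_path),
    ]


def convert_pdf_to_xml(pdf_path, xml_path):
    """Run pdftohtml on one PDF; raises if the tool is missing or fails."""
    if shutil.which("pdftohtml") is None:
        raise RuntimeError("pdftohtml not found; install poppler-utils first")
    subprocess.run(build_pdftohtml_command(pdf_path, xml_path), check=True)
```

A call such as convert_pdf_to_xml("scan.pdf", "scan.xml") would then produce the XML input for the tools in this package. The exact name of the written file may depend on the installed pdftohtml version, so it is worth checking the output directory after the first run.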

