Metadata-Version: 2.1
Name: texta-parsers
Version: 2.8.2
Summary: texta-parsers
Author: TEXTA
Author-email: info@texta.ee
License: GPLv3
Description-Content-Type: text/markdown
Provides-Extra: mlp
License-File: LICENSE

# texta-parsers

A Python package for file parsing.

The main class in the package is **DocParser**. The package also supports sophisticated parsing of emails which is implemented in class **EmailParser**. If you only need to parse emails then you can specify it with parameter `parse_only_extensions`. It is possible to use **EmailParser** independently as well but then attachments will not be parsed. 


## Requirements

Most of the file types are parsed with **[tika](http://tika.apache.org/)**. Other tools that are required:

| Tool | File Type |
|---|---|
| pst-utils | .pst  |
| digidoc-tool | .ddoc .bdoc .asics .asice |
| rar-nonfree  | .rar |
| lxml | XML HTML |

Installation of required packages on Ubuntu/Debian:

`sudo apt-get install pst-utils rar python3-lxml cmake build-essential -y`

`sudo sh install-digidoc.sh`

Requires our custom version of Apache TIKA with relevant Tesseract language packs installed:

`sudo docker run -p 9998:9998 docker.texta.ee/texta/texta-parsers-python/tikaserver:latest`

## Installation

Base install (without MLP & Face Analyzer):

`pip install texta-parsers`

Install with MLP:

`pip install texta-parsers[mlp]`


Install with whole bundle:

`pip install texta-parsers[mlp]`


## Testing

`python -m  pytest -rx -v tests`


## Description

#### DocParser

A file parser. Input can either be in bytes or a path to the file as a string. See [user guide](https://git.texta.ee/texta/email-parser/-/wikis/DocParser/User-Guide/Getting-started) more information. DocParser also includes EmailParser.

#### EmailParser

For parsing email messages and mailboxes. Supported file formats are Outlook Data File (**.pst**), mbox (**.mbox**) and EML (**.eml**). Can be used separately from DocParser but then attachments are not parsed.
User guide can be found [here](https://git.texta.ee/texta/email-parser/-/wikis/EmailParser/User-Guide/Getting-started) and documentation [here](https://git.texta.ee/texta/email-parser/-/wikis/EmailParser/Documentation/1.2.1).
