Metadata-Version: 2.4
Name: docat
Version: 1.0.0
Summary: Easy and simple document to plain text tool. Supported formats: doc, docx, xls, xlsx, pdf, and many more!
Project-URL: Homepage, https://github.com/lluises/docat
Project-URL: Issues, https://github.com/lluises/docat/issues
Author-email: LluísE <lluise@github.com>
License-File: LICENCE
Classifier: License :: OSI Approved :: Apache Software License
Classifier: Operating System :: OS Independent
Classifier: Programming Language :: Python :: 3
Requires-Python: >=3.6
Requires-Dist: docx2txt>=0.8
Requires-Dist: ppt2txt>=0.1
Requires-Dist: pypdf==4
Requires-Dist: python-pptx>=0.6
Requires-Dist: xlrd>=2.0
Description-Content-Type: text/markdown

# Docat

An easy, simple tool for converting documents to plain text. Supported formats: doc, docx, xls, xlsx, pdf, and many more!

**GitHub**: [https://github.com/lluises/docat](https://github.com/lluises/docat)


## How it works

Docat identifies each document by its MIME type, then selects a matching parser to extract all the text from it.
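
The idea of dispatching on MIME type can be sketched in a few lines. This is only an illustration, not docat's actual implementation: the `PARSERS` table, `parse_plain_text`, and `extract_text` are hypothetical names, and the sketch guesses the MIME type from the file name with the standard library's `mimetypes` module, which may differ from how docat detects it.

```python
import mimetypes
from pathlib import Path


def parse_plain_text(path):
    """Hypothetical parser: read the file as UTF-8 text."""
    return Path(path).read_text(encoding="utf-8", errors="replace")


# Hypothetical dispatch table: MIME type -> parser function.
PARSERS = {
    "text/plain": parse_plain_text,
    "application/json": parse_plain_text,
}


def extract_text(path):
    """Guess the MIME type from the file name and run the matching parser."""
    mime_type, _encoding = mimetypes.guess_type(str(path))
    parser = PARSERS.get(mime_type)
    if parser is None:
        raise ValueError(f"Unsupported MIME type: {mime_type!r}")
    return parser(path)
```

Supporting a new format in a design like this would mean adding one entry to the table, which keeps detection and parsing independent of each other.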

OCR is not currently implemented, so no text is extracted from images.


# Use from CLI

```
usage: docat [-h] [-o OUTPUT] [-l] [-nl] [documents ...]

Document to plain text transformation tool

positional arguments:
  documents            Documents to process

options:
  -h, --help           show this help message and exit
  -o, --output OUTPUT  Output file. By default outputs to stdout
  -l, --list           List all supported mime types and exit
  -nl, --newline       Ensure that the output ends with a newline (\n)
```

## Example

```bash
docat myfile.pdf
```

This prints all the text from myfile.pdf to the console (stdout).



# Use as a Python library

```python
import docat

text = docat.process("path/to/myfile.pdf")

print(text)
```

Using _Path_ from _pathlib_ is also supported:

```python
from pathlib import Path
import docat

file_path = Path(".") / "myfile.pdf"
text = docat.process(file_path)

print(text)
```
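
To convert several files at once, the single-file call can be wrapped in a small loop. The helper below is a sketch rather than part of docat: `extract_directory` and its parameters are hypothetical, and it assumes that a failed conversion raises an exception, which this document does not confirm. The processing function is passed in as an argument so the helper does not depend on any particular library.

```python
from pathlib import Path


def extract_directory(directory, process, pattern="*.pdf"):
    """Run `process` on every matching file and map file names to text."""
    results = {}
    for path in sorted(Path(directory).glob(pattern)):
        try:
            results[path.name] = process(path)
        except Exception as error:
            # Assumption: a file that cannot be converted raises an exception.
            print(f"Skipping {path.name}: {error}")
    return results
```

With docat installed, a call such as `extract_directory("reports", docat.process)` would return a dictionary that maps each PDF file name in `reports` to its extracted text.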



# Supported files

Currently, docat supports:

- Microsoft Word
- Microsoft Excel
- Microsoft PowerPoint
- OpenDocument (LibreOffice, OpenOffice...)
- PDF
- Plain text files
- SVG with plain text embedded

Suggestions for additional document formats are welcome.


## MIME types

The following MIME types are currently supported by **docat**:

- application/javascript
- application/json
- application/msword
- application/pdf
- application/vnd.ms-excel
- application/vnd.ms-excel.sheet.macroEnabled.12
- application/vnd.ms-powerpoint
- application/vnd.ms-word.document.macroEnabled.12
- application/vnd.oasis.opendocument.presentation
- application/vnd.oasis.opendocument.spreadsheet
- application/vnd.oasis.opendocument.text
- application/vnd.openxmlformats-officedocument.presentationml.presentation
- application/vnd.openxmlformats-officedocument.presentationml.slideshow
- application/vnd.openxmlformats-officedocument.spreadsheetml.sheet
- application/vnd.openxmlformats-officedocument.wordprocessingml.document
- image/svg+xml
- text


# License

All the code in this repository is licensed under the **Apache License Version 2.0**. A copy of the license is available in the [LICENCE](./LICENCE) file, or online at [https://www.apache.org/licenses/LICENSE-2.0.txt](https://www.apache.org/licenses/LICENSE-2.0.txt).

This program depends on other software packages, which have their own licenses. Check them to ensure compatibility with your project.


