Metadata-Version: 2.1
Name: qctc_doc
Version: 0.1.2
Summary: A lightweight toolbox to manipulate documents
Home-page: https://github.com/QCTC-chain/magic-doc
License: Apache 2.0
Requires-Python: >=3.10, <3.11
Description-Content-Type: text/markdown
License-File: LICENSE
Requires-Dist: alembic==1.13.1
Requires-Dist: aniso8601==9.0.1
Requires-Dist: blinker==1.8.2
Requires-Dist: cchardet==2.1.7
Requires-Dist: certifi==2024.2.2
Requires-Dist: charset-normalizer==3.3.2
Requires-Dist: docopt==0.6.2
Requires-Dist: Flask==3.0.3
Requires-Dist: Flask-Cors==4.0.1
Requires-Dist: Flask-JWT-Extended==4.6.0
Requires-Dist: flask-marshmallow==1.2.1
Requires-Dist: Flask-Migrate==4.0.7
Requires-Dist: Flask-RESTful==0.3.10
Requires-Dist: Flask-SQLAlchemy==3.1.1
Requires-Dist: func-timeout==4.3.5
Requires-Dist: greenlet==3.0.3
Requires-Dist: idna==3.7
Requires-Dist: itsdangerous==2.2.0
Requires-Dist: Jinja2==3.1.4
Requires-Dist: lark-parser==0.12.0
Requires-Dist: lxml==5.1.1
Requires-Dist: Mako==1.3.5
Requires-Dist: MarkupSafe==2.1.5
Requires-Dist: marshmallow==3.21.2
Requires-Dist: marshmallow-sqlalchemy==1.0.0
Requires-Dist: packaging==24.0
Requires-Dist: py-asciimath==0.3.0
Requires-Dist: PyJWT==2.8.0
Requires-Dist: pytz==2024.1
Requires-Dist: PyYAML==6.0.1
Requires-Dist: requests==2.32.2
Requires-Dist: six==1.16.0
Requires-Dist: SQLAlchemy==2.0.30
Requires-Dist: typing_extensions==4.11.0
Requires-Dist: urllib3
Requires-Dist: Werkzeug==3.0.3
Requires-Dist: python-pptx
Requires-Dist: s3pathlib
Requires-Dist: PyMuPDF>=1.24.9
Requires-Dist: smart-open[s3]
Provides-Extra: gpu
Requires-Dist: paddlepaddle==3.0.0b1; platform_system == "Linux" and extra == "gpu"
Requires-Dist: paddlepaddle==2.6.1; (platform_system == "Windows" or platform_system == "Darwin") and extra == "gpu"
Requires-Dist: paddleocr==2.7.3; extra == "gpu"
Requires-Dist: magic-pdf[full]==0.7.1; extra == "gpu"
Provides-Extra: cpu
Requires-Dist: paddlepaddle==3.0.0b1; platform_system == "Linux" and extra == "cpu"
Requires-Dist: paddlepaddle==2.6.1; (platform_system == "Windows" or platform_system == "Darwin") and extra == "cpu"
Requires-Dist: paddleocr==2.7.3; extra == "cpu"
Requires-Dist: magic-pdf[full]==0.7.1; extra == "cpu"


<div id="top"></div>
<div align="center">

[![license](https://img.shields.io/github/license/InternLM/magic-doc.svg)](https://github.com/InternLM/magic-doc/tree/main/LICENSE)
[![issue resolution](https://img.shields.io/github/issues-closed-raw/InternLM/magic-doc)](https://github.com/InternLM/magic-doc/issues)
[![open issues](https://img.shields.io/github/issues-raw/InternLM/magic-doc)](https://github.com/InternLM/magic-doc/issues)

<p align="center">
    👋 join us on <a href="https://discord.gg/xa29JuW87d" target="_blank">Discord</a> and <a href="https://github.com/InternLM/InternLM/assets/25839884/a6aad896-7232-4220-ac84-9e070c2633ce" target="_blank">WeChat</a>
</p>

[English](README.md) | [简体中文](README_zh-CN.md)

</div>



## Install

Prerequisites: Python 3.10

Install Dependencies

**Linux/macOS**

```bash
# Run the one that matches your package manager
apt-get install libreoffice   # or: yum install libreoffice, brew install libreoffice
```

**Windows**
```text
Install LibreOffice
Append "install_dir\LibreOffice\program" to the PATH environment variable
```
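
To confirm that LibreOffice is reachable before converting anything, a quick check along these lines may help. This is a minimal sketch, not part of Magic-Doc: the helper `find_libreoffice` is hypothetical, and the executable names `soffice` and `libreoffice` are assumptions about how LibreOffice is exposed on your system.

```python
import shutil


def find_libreoffice(candidates=("soffice", "libreoffice")):
    """Return the path of the first candidate executable found on PATH, or None.

    The candidate names are assumptions; adjust them for your installation.
    """
    for name in candidates:
        path = shutil.which(name)
        if path:
            return path
    return None


if __name__ == "__main__":
    location = find_libreoffice()
    if location is None:
        print("LibreOffice was not found on PATH")
    else:
        print(f"LibreOffice found at {location}")
```

If the check reports nothing found on Windows, the `program` directory is probably missing from PATH.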


Install Magic-Doc


```bash
pip install "qctc-doc[cpu]" --extra-index-url https://wheels.myhloli.com  # CPU version
# or
pip install "qctc-doc[gpu]" --extra-index-url https://wheels.myhloli.com  # GPU version
```



## Introduction

Magic-Doc is a lightweight open-source tool that converts multiple file types (PPT/PPTX/DOC/DOCX/PDF) to Markdown. It supports both local files and files stored in S3.


## Example

```python
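# Convert a local file
from magic_doc.docconv import DocConverter

converter = DocConverter(s3_config=None)  # no S3 configuration is needed for local files
# convert() returns the Markdown content and the time the conversion took
markdown_content, time_cost = converter.convert("some_doc.pptx", conv_timeout=300)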
```

```python
# Convert a remote file stored in AWS S3
from magic_doc.docconv import DocConverter, S3Config

s3_config = S3Config(ak='${ak}', sk='${sk}', endpoint='${endpoint}')
converter = DocConverter(s3_config=s3_config)
markdown_content, time_cost = converter.convert("s3://some_bucket/some_doc.pptx", conv_timeout=300)
```
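
To keep the converted output, the result can be written to a `.md` file. The sketch below is illustrative only: `save_as_markdown` is a hypothetical helper, not part of Magic-Doc. It assumes `converter.convert` returns the Markdown as a `str` together with the time cost, as in the examples above, and it takes the converter as an argument so it works with any object that offers the same `convert` method.

```python
from pathlib import Path


def save_as_markdown(converter, src, out_dir, conv_timeout=300):
    """Convert `src` and write the result to `<out_dir>/<file stem>.md`.

    Hypothetical helper: assumes convert() returns (markdown_text, time_cost)
    and that markdown_text is a str.
    """
    markdown_content, time_cost = converter.convert(src, conv_timeout=conv_timeout)

    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)

    # Take the last path component so that s3:// URLs work as well as local paths.
    stem = Path(src.rsplit("/", 1)[-1]).stem
    out_path = out_dir / f"{stem}.md"
    out_path.write_text(markdown_content, encoding="utf-8")
    return out_path, time_cost
```

With a `DocConverter` created as shown above, a call such as `save_as_markdown(converter, "some_doc.pptx", "output")` would write `output/some_doc.md`.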

## Performance

Environment: AMD EPYC 7742 64-Core Processor, NVIDIA A100, CentOS 7

| File Type     | Speed (pages/s) |
| ------------- | --------------- |
| PDF (digital) | 347             |
| PDF (OCR)     | 2.7             |
| PPT           | 20              |
| PPTX          | 149             |
| DOC           | 600             |
| DOCX          | 1482            |
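
These figures can give a rough idea of how long a document might take, for example when choosing a `conv_timeout`. The sketch below simply divides the page count by the speed in the table. The helper `estimate_seconds` is hypothetical, and the numbers apply only to the benchmark environment above, so treat the result as a rough guide rather than a guarantee.

```python
# Speeds copied from the table above (pages per second, benchmark environment only).
PAGES_PER_SECOND = {
    "PDF (digital)": 347,
    "PDF (OCR)": 2.7,
    "PPT": 20,
    "PPTX": 149,
    "DOC": 600,
    "DOCX": 1482,
}


def estimate_seconds(file_type, pages):
    """Rough conversion time in seconds: pages divided by pages per second."""
    return pages / PAGES_PER_SECOND[file_type]


if __name__ == "__main__":
    # A 100-page PPT at 20 pages/s works out to 5 seconds.
    print(estimate_seconds("PPT", 100))
```

On slower hardware the real time will be longer, so it is safer to leave generous headroom when setting a timeout.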

## All Thanks To Our Contributors:

![image](https://github.com/InternLM/magic-doc/blob/main/assets/contributor.png)


## Acknowledgments

- [Antiword](https://github.com/rsdoiel/antiword)
- [LibreOffice](https://www.libreoffice.org/)
- [PyMuPDF](https://pymupdf.readthedocs.io/en/latest/)
- [paddleocr](https://github.com/PaddlePaddle/PaddleOCR)


## 🖊️ Citation

```bibtex
@misc{2024magic-doc,
    title={Magic-Doc: A Toolkit that Converts Multiple File Types to Markdown},
    author={Magic-Doc Contributors},
    howpublished = {\url{https://github.com/InternLM/magic-doc}},
    year={2024}
}
```

## License

This project is released under the [Apache 2.0 license](LICENSE).

<p align="right"><a href="#top">🔼 Back to top</a></p>
