Metadata-Version: 2.1
Name: wisup_e2m
Version: 0.1.4
Summary: Everything to Markdown.
Home-page: https://github.com/wisupai/e2m
License: MIT
Author: Wisup Team
Author-email: team@wisup.ai
Requires-Python: >=3.10,<3.13
Classifier: License :: OSI Approved :: MIT License
Classifier: Programming Language :: Python :: 3
Classifier: Programming Language :: Python :: 3.10
Classifier: Programming Language :: Python :: 3.11
Classifier: Programming Language :: Python :: 3.12
Requires-Dist: diskcache (>=5.6.3,<6.0.0)
Requires-Dist: ebooklib (>=0.18,<0.19)
Requires-Dist: litellm (>=1.42.12,<2.0.0)
Requires-Dist: marker-pdf (>=0.2.16,<0.3.0)
Requires-Dist: matplotlib (==3.9.0)
Requires-Dist: nltk (>=3.9,<4.0)
Requires-Dist: pikepdf (>=9.1.0,<10.0.0)
Requires-Dist: pillow-heif (>=0.18.0,<0.19.0)
Requires-Dist: platformdirs (>=4.2.2,<5.0.0)
Requires-Dist: pydantic (>=2.8.2,<3.0.0)
Requires-Dist: pypandoc (>=1.13,<2.0)
Requires-Dist: python-docx (>=1.1.2,<2.0.0)
Requires-Dist: python-pptx (>=1.0.0,<2.0.0)
Requires-Dist: setuptools-rust (>=1.10.1,<2.0.0)
Requires-Dist: speechrecognition (>=3.10.4,<4.0.0)
Requires-Dist: surya-ocr (>=0.4.15,<0.5.0)
Requires-Dist: tomlkit (==0.12.0)
Requires-Dist: torch (==2.3.1)
Requires-Dist: unstructured (>=0.15.0,<0.16.0)
Requires-Dist: unstructured-inference (>=0.7.36,<0.8.0)
Requires-Dist: unstructured-pytesseract (>=0.3.12,<0.4.0)
Project-URL: Repository, https://github.com/wisupai/e2m
Description-Content-Type: text/markdown

<p align="center">
  <img src="docs/images/wisup_e2m_banner.jpg" width="800px" alt="wisup_e2m Logo">
</p>

<p align="center">
    <a href="https://github.com/wisupai/e2m">
        <img src="https://img.shields.io/badge/e2m-repo-blue" alt="E2M Repo">
    </a>
    <a href="https://github.com/Jing-yilin/E2M/tags/0.1.4">
        <img src="https://img.shields.io/badge/version-0.1.4-blue" alt="E2M Version">
    </a>
    <a href="https://www.python.org/downloads/">
        <img src="https://img.shields.io/badge/python-3.10%20%7C%203.11%20%7C%203.12-blue" alt="Python Version">
    </a>
</p>

# E2M: Everything to Markdown

E2M is a versatile tool that converts a wide range of file types into Markdown format.

## Supported File Types

-   doc
-   docx
-   epub
-   html
-   htm
-   url
-   pdf
-   pptx
-   mp3
-   m4a

## Installation

To install E2M, use pip:

```bash
pip install wisup_e2m
```

## Usage

Here's a simple example demonstrating how to use E2M:

```python
from wisup_e2m import E2MParser

# Initialize the parser with your configuration file
ep = E2MParser.from_config("config.yaml")

# Parse the desired file
data = ep.parse(file_name="/path/to/file.pdf")

# Print the parsed data as a dictionary
print(data.to_dict())
```
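
To convert several files in one run, the same two calls can be wrapped in a loop. This is a minimal sketch that relies only on `parse(file_name=...)` and `to_dict()` as shown above; `parse_many` is a hypothetical helper, and catching a broad `Exception` is an assumption made because the library's error types are not described here.

```python
from typing import Any, Dict, Iterable


def parse_many(parser: Any, paths: Iterable[str]) -> Dict[str, Any]:
    """Parse each path with the given parser; failures are recorded instead of raised."""
    results: Dict[str, Any] = {}
    for path in paths:
        try:
            results[path] = parser.parse(file_name=path).to_dict()
        except Exception as exc:  # broad on purpose: specific error types are not documented here
            results[path] = {"error": str(exc)}
    return results
```

With a parser created as in the example above, a call could look like `parse_many(ep, ["/path/to/a.pdf", "/path/to/b.docx"])`.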

## Config Template

```yaml
parsers:
  doc_parser:
    engine: "unstructured"
    langs: ["en", "zh"]
  docx_parser:
    engine: "unstructured"
    langs: ["en", "zh"]
  epub_parser:
    engine: "unstructured"
    langs: ["en", "zh"]
  html_parser:
    engine: "unstructured"
    langs: ["en", "zh"]
  url_parser:
    engine: "jina"
    langs: ["en", "zh"]
  pdf_parser:
    engine: "marker"
    langs: ["en", "zh"]
  pptx_parser:
    engine: "unstructured"
    langs: ["en", "zh"]
  voice_parser:
    # option 1: use openai whisper api
    # engine: "openai_whisper_api"
    # api_base: "https://api.openai.com/v1"
    # api_key: "your_api_key"
    # model: "whisper"

    # option 2: use local whisper model
    engine: "openai_whisper_local"
    model: "large" # available models: https://github.com/openai/whisper#available-models-and-languages

converter:
  text_converter:
    engine: "litellm"
    model: "deepseek/deepseek-chat"
    api_key: "your_api_key"
    # base_url: ""
  image_converter:
    engine: "litellm"
    model: "gpt-4o-mini"
    api_key: "your_api_key"
    # base_url: ""

```
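
The template ships with `"your_api_key"` placeholders that must be replaced before use. The sketch below is a hypothetical sanity check, not part of `wisup_e2m`: it assumes the YAML has already been loaded into a nested dict (for example by a YAML loader of your choice) and reports every key whose value is still the placeholder.

```python
from typing import Any, List


def find_placeholders(config: Any, placeholder: str = "your_api_key", prefix: str = "") -> List[str]:
    """Return dotted paths of every value in a nested dict that still equals the placeholder."""
    found: List[str] = []
    if isinstance(config, dict):
        for key, value in config.items():
            path = f"{prefix}.{key}" if prefix else str(key)
            found.extend(find_placeholders(value, placeholder, path))
    elif config == placeholder:
        found.append(prefix)
    return found


# A fragment of the template above, written as the dict a YAML loader would produce.
config = {
    "converter": {
        "text_converter": {"engine": "litellm", "api_key": "your_api_key"},
        "image_converter": {"engine": "litellm", "api_key": "example-key"},
    }
}
print(find_placeholders(config))
```

Only nested dicts are traversed; list values such as `langs` are compared as a whole and therefore never match the placeholder.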

## Q&A

- Resource wordnet not found.
  - Uninstall `nltk` completely: `pip uninstall nltk`
  - Reinstall `nltk`: `pip install nltk`
  - Download [corpora/wordnet.zip](https://raw.githubusercontent.com/nltk/nltk_data/gh-pages/packages/corpora/wordnet.zip) manually and unzip it into the directory named in the error message. Alternatively, download and unzip it from the command line:
    - Windows: `wget https://raw.githubusercontent.com/nltk/nltk_data/gh-pages/packages/corpora/wordnet.zip -O ~\AppData\Roaming\nltk_data\corpora\wordnet.zip` and `unzip ~\AppData\Roaming\nltk_data\corpora\wordnet.zip -d ~\AppData\Roaming\nltk_data\corpora\`
    - Unix: `wget https://raw.githubusercontent.com/nltk/nltk_data/gh-pages/packages/corpora/wordnet.zip -O ~/nltk_data/corpora/wordnet.zip` and `unzip ~/nltk_data/corpora/wordnet.zip -d ~/nltk_data/corpora/`
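
If `unzip` is not available, the extraction step can also be done from Python. This is a minimal standard-library sketch; `unpack_corpus` is a hypothetical helper, and the target directory is an assumption that should be replaced with the path shown in your error message.

```python
import zipfile
from pathlib import Path


def unpack_corpus(zip_path: str, corpora_dir: str) -> Path:
    """Extract a downloaded corpus archive into corpora_dir and return that directory."""
    target = Path(corpora_dir).expanduser()
    target.mkdir(parents=True, exist_ok=True)  # create nltk_data/corpora if it is missing
    with zipfile.ZipFile(zip_path) as archive:
        archive.extractall(target)
    return target
```

For example, `unpack_corpus("wordnet.zip", "~/nltk_data/corpora")` mirrors the Unix command above. NLTK also provides its own downloader, `nltk.download("wordnet")`, which may be simpler when it can reach the network.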

## License

This project is licensed under the MIT License. See the [LICENSE](LICENSE) file for details.

## Contact

For any questions or inquiries, please open an issue on [GitHub](https://github.com/wisupai/e2m) or contact us at [team@wisup.ai](mailto:team@wisup.ai).

## 🌟 Contributing

<a href="https://github.com/wisupai/e2m/graphs/contributors">
  <img src="https://contrib.rocks/image?repo=wisupai/e2m" />
</a>

