Metadata-Version: 2.1
Name: docling
Version: 2.5.1
Summary: SDK and CLI for parsing PDF, DOCX, HTML, and more, to a unified document representation for powering downstream workflows such as gen AI applications.
Home-page: https://github.com/DS4SD/docling
License: MIT
Keywords: docling,convert,document,pdf,docx,html,markdown,layout model,segmentation,table structure,table former
Author: Christoph Auer
Author-email: cau@zurich.ibm.com
Requires-Python: >=3.10,<4.0
Classifier: Development Status :: 5 - Production/Stable
Classifier: Intended Audience :: Developers
Classifier: Intended Audience :: Science/Research
Classifier: License :: OSI Approved :: MIT License
Classifier: Operating System :: MacOS :: MacOS X
Classifier: Operating System :: POSIX :: Linux
Classifier: Programming Language :: Python :: 3
Classifier: Programming Language :: Python :: 3.10
Classifier: Programming Language :: Python :: 3.11
Classifier: Programming Language :: Python :: 3.12
Classifier: Topic :: Scientific/Engineering :: Artificial Intelligence
Provides-Extra: tesserocr
Requires-Dist: beautifulsoup4 (>=4.12.3,<5.0.0)
Requires-Dist: certifi (>=2024.7.4)
Requires-Dist: deepsearch-glm (>=0.26.1,<0.27.0)
Requires-Dist: docling-core (>=2.3.0,<3.0.0)
Requires-Dist: docling-ibm-models (>=2.0.3,<3.0.0)
Requires-Dist: docling-parse (>=2.0.2,<3.0.0)
Requires-Dist: easyocr (>=1.7,<2.0)
Requires-Dist: filetype (>=1.2.0,<2.0.0)
Requires-Dist: huggingface_hub (>=0.23,<1)
Requires-Dist: marko (>=2.1.2,<3.0.0)
Requires-Dist: pandas (>=2.1.4,<3.0.0)
Requires-Dist: pyarrow (>=16.1.0,<17.0.0)
Requires-Dist: pydantic (>=2.0.0,<3.0.0)
Requires-Dist: pydantic-settings (>=2.3.0,<3.0.0)
Requires-Dist: pypdfium2 (>=4.30.0,<5.0.0)
Requires-Dist: python-docx (>=1.1.2,<2.0.0)
Requires-Dist: python-pptx (>=1.0.2,<2.0.0)
Requires-Dist: requests (>=2.32.3,<3.0.0)
Requires-Dist: rtree (>=1.3.0,<2.0.0)
Requires-Dist: scipy (>=1.14.1,<2.0.0)
Requires-Dist: tesserocr (>=2.7.1,<3.0.0) ; extra == "tesserocr"
Requires-Dist: typer (>=0.12.5,<0.13.0)
Project-URL: Repository, https://github.com/DS4SD/docling
Description-Content-Type: text/markdown

<p align="center">
  <a href="https://github.com/ds4sd/docling">
    <img loading="lazy" alt="Docling" src="https://github.com/DS4SD/docling/raw/main/docs/assets/docling_processing.png" width="100%"/>
  </a>
</p>

# Docling

<p align="center">
  <a href="https://trendshift.io/repositories/12132" target="_blank"><img src="https://trendshift.io/api/badge/repositories/12132" alt="DS4SD%2Fdocling | Trendshift" style="width: 250px; height: 55px;" width="250" height="55"/></a>
</p>

[![arXiv](https://img.shields.io/badge/arXiv-2408.09869-b31b1b.svg)](https://arxiv.org/abs/2408.09869)
[![Docs](https://img.shields.io/badge/docs-live-brightgreen)](https://ds4sd.github.io/docling/)
[![PyPI version](https://img.shields.io/pypi/v/docling)](https://pypi.org/project/docling/)
![Python](https://img.shields.io/badge/python-3.10%20%7C%203.11%20%7C%203.12-blue)
[![Poetry](https://img.shields.io/endpoint?url=https://python-poetry.org/badge/v0.json)](https://python-poetry.org/)
[![Code style: black](https://img.shields.io/badge/code%20style-black-000000.svg)](https://github.com/psf/black)
[![Imports: isort](https://img.shields.io/badge/%20imports-isort-%231674b1?style=flat&labelColor=ef8336)](https://pycqa.github.io/isort/)
[![Pydantic v2](https://img.shields.io/endpoint?url=https://raw.githubusercontent.com/pydantic/pydantic/main/docs/badge/v2.json)](https://pydantic.dev)
[![pre-commit](https://img.shields.io/badge/pre--commit-enabled-brightgreen?logo=pre-commit&logoColor=white)](https://github.com/pre-commit/pre-commit)
[![License MIT](https://img.shields.io/github/license/DS4SD/docling)](https://opensource.org/licenses/MIT)

Docling parses documents and exports them to the desired format with ease and speed.

## Features

* 🗂️ Reads popular document formats (PDF, DOCX, PPTX, Images, HTML, AsciiDoc, Markdown) and exports to Markdown and JSON
* 📑 Advanced PDF document understanding including page layout, reading order & table structures
* 🧩 Unified, expressive [DoclingDocument](https://ds4sd.github.io/docling/concepts/docling_document/) representation format
* 🤖 Easy integration with LlamaIndex 🦙 & LangChain 🦜🔗 for powerful RAG / QA applications
* 🔍 OCR support for scanned PDFs
* 💻 Simple and convenient CLI

Explore the [documentation](https://ds4sd.github.io/docling/) to discover plenty examples and unlock the full power of Docling!

### Coming soon

* ♾️ Equation & code extraction
* 📝 Metadata extraction, including title, authors, references & language
* 🦜🔗 Native LangChain extension

## Installation

To use Docling, simply install `docling` from your package manager, e.g. pip:
```bash
pip install docling
```

Works on macOS, Linux and Windows environments. Both x86_64 and arm64 architectures.

More [detailed installation instructions](https://ds4sd.github.io/docling/installation/) are available in the docs.

## Getting started

To convert individual documents, use `convert()`, for example:

```python
from docling.document_converter import DocumentConverter

source = "https://arxiv.org/pdf/2408.09869"  # document per local path or URL
converter = DocumentConverter()
result = converter.convert(source)
print(result.document.export_to_markdown())  # output: "## Docling Technical Report[...]"
```

Check out [Getting started](https://ds4sd.github.io/docling/).
You will find lots of tuning options to leverage all the advanced capabilities.

## Get help and support

Please feel free to connect with us using the [discussion section](https://github.com/DS4SD/docling/discussions).

## Technical report

For more details on Docling's inner workings, check out the [Docling Technical Report](https://arxiv.org/abs/2408.09869).

## Contributing

Please read [Contributing to Docling](https://github.com/DS4SD/docling/blob/main/CONTRIBUTING.md) for details.

## References

If you use Docling in your projects, please consider citing the following:

```bib
@techreport{Docling,
  author = {Deep Search Team},
  month = {8},
  title = {Docling Technical Report},
  url = {https://arxiv.org/abs/2408.09869},
  eprint = {2408.09869},
  doi = {10.48550/arXiv.2408.09869},
  version = {1.0.0},
  year = {2024}
}
```

## License

The Docling codebase is under MIT license.
For individual model usage, please refer to the model licenses found in the original packages.

## IBM ❤️ Open Source AI

Docling has been brought to you by IBM.

