Metadata-Version: 2.1
Name: quackling
Version: 0.4.0
Summary: Quackling enables document-native generative AI applications
Home-page: https://github.com/DS4SD/quackling
License: MIT
Keywords: document,PDF,RAG,generative AI,chunking,docling,llama index
Author: Panos Vagenas
Author-email: pva@zurich.ibm.com
Maintainer: Panos Vagenas
Maintainer-email: pva@zurich.ibm.com
Requires-Python: >=3.10,<4.0
Classifier: Development Status :: 5 - Production/Stable
Classifier: Intended Audience :: Developers
Classifier: Intended Audience :: Science/Research
Classifier: License :: OSI Approved :: MIT License
Classifier: Natural Language :: English
Classifier: Operating System :: OS Independent
Classifier: Programming Language :: Python :: 3
Classifier: Programming Language :: Python :: 3.10
Classifier: Programming Language :: Python :: 3.11
Classifier: Programming Language :: Python :: 3.12
Classifier: Topic :: Scientific/Engineering :: Artificial Intelligence
Classifier: Topic :: Software Development :: Libraries :: Python Modules
Classifier: Typing :: Typed
Provides-Extra: examples
Requires-Dist: docling (>=1.8.2,<2.0.0)
Requires-Dist: docling-core (>=1.1.2,<2.0.0)
Requires-Dist: flagembedding (>=1.2.10,<2.0.0) ; extra == "examples"
Requires-Dist: jsonpath-ng (>=1.6.1,<2.0.0) ; extra == "examples"
Requires-Dist: langchain-core (>=0.2.38,<0.3.0)
Requires-Dist: langchain-huggingface (>=0.0.3,<0.0.4) ; extra == "examples"
Requires-Dist: langchain-milvus (>=0.1.4,<0.2.0) ; extra == "examples"
Requires-Dist: langchain-text-splitters (>=0.2.4,<0.3.0) ; extra == "examples"
Requires-Dist: llama-index-core (>=0.11.1,<0.12.0)
Requires-Dist: llama-index-embeddings-huggingface (>=0.3.1,<0.4.0) ; extra == "examples"
Requires-Dist: llama-index-llms-huggingface-api (>=0.2.0,<0.3.0) ; extra == "examples"
Requires-Dist: llama-index-postprocessor-flag-embedding-reranker (>=0.2.0,<0.3.0) ; extra == "examples"
Requires-Dist: llama-index-vector-stores-milvus (>=0.2.1,<0.3.0) ; extra == "examples"
Requires-Dist: peft (>=0.12.0,<0.13.0) ; extra == "examples"
Requires-Dist: python-dotenv (>=1.0.1,<2.0.0) ; extra == "examples"
Requires-Dist: torch (>=2.2.2,<2.3.0) ; sys_platform == "darwin" and platform_machine == "x86_64"
Requires-Dist: torch (>=2.2.2,<3.0.0) ; sys_platform != "darwin" or platform_machine != "x86_64"
Requires-Dist: torchvision (>=0,<1) ; sys_platform != "darwin" or platform_machine != "x86_64"
Requires-Dist: torchvision (>=0.17.2,<0.18.0) ; sys_platform == "darwin" and platform_machine == "x86_64"
Project-URL: Repository, https://github.com/DS4SD/quackling
Description-Content-Type: text/markdown

<p align="center">
  <a href="https://github.com/DS4SD/quackling">
    <img loading="lazy" alt="Quackling" src="https://raw.githubusercontent.com/DS4SD/quackling/main/resources/logo.jpeg" width="150" />
  </a>
</p>

# Quackling

[![PyPI version](https://img.shields.io/pypi/v/quackling)](https://pypi.org/project/quackling/)
![Python](https://img.shields.io/badge/python-3.10%20%7C%203.11%20%7C%203.12-blue)
[![Poetry](https://img.shields.io/endpoint?url=https://python-poetry.org/badge/v0.json)](https://python-poetry.org/)
[![Code style: black](https://img.shields.io/badge/code%20style-black-000000.svg)](https://github.com/psf/black)
[![Imports: isort](https://img.shields.io/badge/%20imports-isort-%231674b1?style=flat&labelColor=ef8336)](https://pycqa.github.io/isort/)
[![Pydantic v2](https://img.shields.io/endpoint?url=https://raw.githubusercontent.com/pydantic/pydantic/main/docs/badge/v2.json)](https://pydantic.dev)
[![pre-commit](https://img.shields.io/badge/pre--commit-enabled-brightgreen?logo=pre-commit&logoColor=white)](https://github.com/pre-commit/pre-commit)
[![License MIT](https://img.shields.io/github/license/DS4SD/quackling)](https://opensource.org/licenses/MIT)

Easily build document-native generative AI applications, such as RAG, leveraging [Docling](https://github.com/DS4SD/docling)'s efficient PDF extraction and rich data model — while still using your favorite framework, [🦙 LlamaIndex](https://docs.llamaindex.ai/en/stable/) or [🦜🔗 LangChain](https://python.langchain.com/).

## Features

- 🧠 Enables rich gen AI applications by providing capabilities at the native document level, not just plain text / Markdown!
- ⚡️ Leverages Docling's conversion quality and speed.
- ⚙️ Plug-and-play integration with LlamaIndex and LangChain for building powerful applications like RAG.

<p align="center">
  <a href="https://raw.githubusercontent.com/DS4SD/quackling/main/resources/doc_native_rag.png">
    <img loading="lazy" alt="Doc-native RAG" src="https://raw.githubusercontent.com/DS4SD/quackling/main/resources/doc_native_rag.png" width="350" />
  </a>
</p>


## Installation

To use Quackling, install `quackling` with your package manager, e.g. pip:

```sh
pip install quackling
```

## Usage

Quackling offers core capabilities (`quackling.core`), as well as framework integration components (`quackling.llama_index` and `quackling.langchain`). Below are examples of both.

### Basic RAG

Here is a basic RAG pipeline using LlamaIndex:

> [!NOTE]
> To use as is, first `pip install llama-index-embeddings-huggingface llama-index-llms-huggingface-api`
> in addition to `quackling` to install the model integrations.
> Otherwise, you can set `EMBED_MODEL` & `LLM` as desired, e.g. using
> [local models](https://docs.llamaindex.ai/en/stable/getting_started/starter_example_local).

```python
import os

from llama_index.core import VectorStoreIndex
from llama_index.embeddings.huggingface import HuggingFaceEmbedding
from llama_index.llms.huggingface_api import HuggingFaceInferenceAPI
from quackling.llama_index.node_parsers import HierarchicalJSONNodeParser
from quackling.llama_index.readers import DoclingPDFReader

DOCS = ["https://arxiv.org/pdf/2206.01062"]
QUESTION = "How many pages were human annotated?"
EMBED_MODEL = HuggingFaceEmbedding(model_name="BAAI/bge-small-en-v1.5")
LLM = HuggingFaceInferenceAPI(
    token=os.getenv("HF_TOKEN"),
    model_name="mistralai/Mistral-7B-Instruct-v0.3",
)

index = VectorStoreIndex.from_documents(
    documents=DoclingPDFReader(parse_type=DoclingPDFReader.ParseType.JSON).load_data(DOCS),
    embed_model=EMBED_MODEL,
    transformations=[HierarchicalJSONNodeParser()],
)
query_engine = index.as_query_engine(llm=LLM)
result = query_engine.query(QUESTION)
print(result.response)
# > 80K pages were human annotated
```

### Chunking

You can also use Quackling standalone with any pipeline.
For instance, to split a document into chunks based on its structure, with each chunk carrying a pointer
to the corresponding node of the Docling document:

```python
from docling.document_converter import DocumentConverter
from quackling.core.chunkers import HierarchicalChunker

doc = DocumentConverter().convert_single("https://arxiv.org/pdf/2408.09869").output
chunks = list(HierarchicalChunker().chunk(doc))
# > [
# >     ChunkWithMetadata(
# >         path='$.main-text[4]',
# >         text='Docling Technical Report\n[...]',
# >         page=1,
# >         bbox=[117.56, 439.85, 494.07, 482.42]
# >     ),
# >     [...]
# > ]
```
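
Each chunk's `path` in the output above (e.g. `$.main-text[4]`) points back into the document's JSON structure. As a minimal sketch of how such a pointer could be followed, the snippet below resolves paths of that simple shape using only the standard library. Note that `resolve_path` and `toy_doc` are hypothetical names introduced for illustration and are not part of Quackling or Docling; the toy dictionary merely stands in for a Docling document exported as a dict.

```python
import re

# Hypothetical helper, not part of Quackling: handles only the simple subset
# of JSONPath seen in the chunk output above (`$`, `.key`, and `[index]`).
_TOKEN = re.compile(r"\.([A-Za-z0-9_-]+)|\[(\d+)\]")


def resolve_path(doc: dict, path: str):
    """Follow a path such as '$.main-text[4]' through nested dicts and lists."""
    if not path.startswith("$"):
        raise ValueError(f"unsupported path: {path!r}")
    node = doc
    pos = 1  # skip the leading '$'
    while pos < len(path):
        match = _TOKEN.match(path, pos)
        if match is None:
            raise ValueError(f"unsupported path: {path!r}")
        key, index = match.groups()
        # A `.key` token indexes a dict, an `[index]` token indexes a list.
        node = node[key] if key is not None else node[int(index)]
        pos = match.end()
    return node


# Toy stand-in for a document exported as a dict (illustrative data only).
toy_doc = {
    "main-text": [
        {"text": "Title"},
        {"text": "First paragraph."},
    ]
}

node = resolve_path(toy_doc, "$.main-text[1]")
print(node["text"])
```

For anything beyond these simple paths, a full JSONPath library such as `jsonpath-ng` (listed in the `examples` extra) is likely the more robust choice.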

## More examples

### LlamaIndex

- [Milvus basic RAG (dense embeddings)](examples/llama_index/basic_pipeline.ipynb)
- [Milvus hybrid RAG (dense & sparse embeddings combined e.g. via RRF) & reranker model usage](examples/llama_index/hybrid_pipeline.ipynb)
- [Milvus RAG also fetching native document metadata for search results](examples/llama_index/native_nodes.ipynb)
- [Local node transformations (e.g. embeddings)](examples/llama_index/node_transformations.ipynb)
- ...

### LangChain

- [Milvus basic RAG (dense embeddings)](examples/langchain/basic_pipeline.ipynb)

## Contributing

Please read [Contributing to Quackling](./CONTRIBUTING.md) for details.

## References

If you use Quackling in your projects, please consider citing the following:

```bib
@techreport{Docling,
  author = "Deep Search Team",
  month = 8,
  title = "Docling Technical Report",
  url = "https://arxiv.org/abs/2408.09869",
  eprint = "2408.09869",
  doi = "10.48550/arXiv.2408.09869",
  version = "1.0.0",
  year = 2024
}
```

## License

The Quackling codebase is released under the MIT license.
For individual component usage, please refer to the component licenses found in the original packages.

