Module: parsers/pdfparser.py

Purpose:

This module provides the implementation for the docling-based PDF parser. This parser is specifically designed for converting content from a PDF file to Markdown and/or HTML format.

Platform:

Linux/Windows | Python 3.11+

Developer:

J Berendt

Email:

development@s3dev.uk

Comments:

The PDFParser class requires the docling project model to be accessible. The following guidance can be used to obtain the model and set the model’s path in the config file.

Model Pre-Fetching: The docling project model must be downloaded and available for use before this module can be used. Below is guidance for pre-fetching the model for offline usage.

  1. Download the model:

    docling-tools models download \
         --output-dir /path/to/models/docling-project
    
  2. Update config.toml:

    In the docp-core/config/config.toml file, update the docling key in the paths.models table to match the download path specified in the previous step.

GPU Support: GPU support (CUDA) should be enabled automatically by the internals. However, guidance for enabling GPU support is available if needed.

class PDFParser(*args: Any, **kwargs: Any)

Bases: PDFParser

Docling-based PDF parser class.

Parameters:
  • path (str) – Path to the PDF file to be parsed.

  • detailed_extraction (bool, optional) –

    Optimise extraction of additional features such as code and formulae. Defaults to False.

    Tip

    While useful in certain cases, this extraction mode increases processing time by ~2x.

Note

For basic text or table extraction from PDFs, the PDFParser class available from the docp-parsers library is recommended as it’s fast and straightforward.

For converting PDFs into Markdown or HTML formats, this class provides the functionality you need.

As an extension of docp-parsers.PDFParser, it supports all the core PDF extraction features, so you can also use it for text and table extraction.

Important

If a single document is parsed several times (e.g. to test different method options), the content of each parse is appended to the texts attribute. This can lead to unexpected content. If this applies to your use case, be sure to call the initialise() method between parses to clear the content.
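
The reset-between-parses pattern described above can be sketched as follows. The helper name convert_per_mode is hypothetical and the parser is passed in as an argument; the only members it relies on are initialise(), to_markdown() and content, as documented on this page.

```python
from typing import Any


def convert_per_mode(pdf: Any, image_modes: list[str]) -> dict[str, str]:
    """Convert one document once per image mode, resetting between parses.

    ``pdf`` is expected to be a PDFParser-like object (an assumption of
    this sketch); nothing is imported here so the helper stays generic.
    """
    results: dict[str, str] = {}
    for index, mode in enumerate(image_modes):
        if index > 0:
            # Clear the texts collected by the previous parse so the
            # new content is not appended to the old content.
            pdf.initialise()
        pdf.to_markdown(image_mode=mode)
        results[mode] = pdf.content
    return results
```

For example: convert_per_mode(PDFParser(path='/path/to/file.pdf'), ['placeholder', 'embedded']).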

Example:

Parse a PDF into Markdown format:

>>> from docp_docling import PDFParser

# Convert
>>> pdf = PDFParser(path='/path/to/file.pdf')
>>> pdf.to_markdown()

# Access the converted content
>>> pdf.content

# Render extracted text as HTML and preview in a browser.
>>> pdf.preview()

Parse a single page from a PDF into Markdown format, including images, and store to a file:

>>> from docp_docling import PDFParser

# Convert
>>> pdf = PDFParser(path='/path/to/file.pdf')
>>> pdf.to_markdown(page_no=1,
                    image_mode='embedded',  # <-- Include images
                    to_file=True)

# Render extracted text as HTML and preview in a browser.
>>> pdf.preview()

Parse a single page from a PDF into Markdown format, including images, and store to a file (manually):

>>> from docp_docling import PDFParser

# Convert
>>> pdf = PDFParser(path='/path/to/file.pdf')
>>> pdf.to_markdown(page_no=1,
                    image_mode='embedded')  # <-- Include images

# Render extracted text as HTML and preview in a browser.
>>> pdf.preview()

# Write the converted Markdown content to a file.
>>> pdf.write(ext='.md')

Parse a single page from a PDF into HTML format, including images:

>>> from docp_docling import PDFParser

# Convert
>>> pdf = PDFParser(path='/path/to/file.pdf')
>>> pdf.to_html(page_no=1,
                image_mode='embedded')  # <-- Include images

# Render extracted text and preview in a browser.
>>> pdf.preview(raw=True)

property content: str

Accessor to all content by merging all texts.

Returns:

A continuous string of converted text, formed by joining the content attribute of every element in the texts property.

Return type:

str

property texts: list

Accessor to parsed text as TextObject instances.

For each text in the list, use the content attribute to access the extracted text.
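
A minimal sketch of walking the list and reading each item's content attribute is shown below. The helper name summarise_texts is hypothetical, and SimpleNamespace objects stand in for TextObject instances so the example runs without a parsed PDF.

```python
from types import SimpleNamespace
from typing import Any, Iterable


def summarise_texts(texts: Iterable[Any]) -> list[tuple[int, int]]:
    """Return (index, character count) for each TextObject-like item."""
    return [(index, len(text.content)) for index, text in enumerate(texts)]


# Stand-in objects; real items would come from ``pdf.texts``.
demo_texts = [
    SimpleNamespace(content="# Title"),
    SimpleNamespace(content="Body text"),
]
summary = summarise_texts(demo_texts)
print(summary)
```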

initialise() -> None

Clean up the previous extraction activities and start over.

to_html(*, page_no: int = None, image_mode: str = 'placeholder', include_annotations: bool = True, unique_lines: bool = False, to_file: bool = False, auto_open: bool = False, **kwargs) -> str | None

Convert a PDF to HTML format.

Parameters:
  • page_no (int, optional) – Page number to convert. Defaults to None (for all pages).

  • image_mode (str, optional) – The mode to use for including images in the HTML. Options are: ‘embedded’, ‘placeholder’, ‘referenced’. Defaults to ‘placeholder’.

  • include_annotations (bool, optional) – Whether to include annotations in the export. Defaults to True.

  • unique_lines (bool, optional) – Remove any duplicated lines from the document’s content. Generally used to remove repeated header and footer strings. Defaults to False.

  • to_file (bool, optional) –

    Write the converted text to a text file. Defaults to False.

    Tip

    If you change your mind, call the write() method to store the converted text to a file.

  • auto_open (bool, optional) –

    On completion, display the converted text as rendered HTML in a web browser. Defaults to False.

    Tip

    To view later, simply call the preview() method.

    Be sure to pass raw=True so that the converted HTML is displayed in the browser as-is, rather than being converted from HTML to Markdown and back to HTML.
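
The effect of the unique_lines option listed above can be pictured with the sketch below, which keeps the first occurrence of each line and drops later repeats. This illustrates the idea only; how the parser actually detects duplicates (for example, how blank lines or whitespace differences are treated) is not stated here, and the function name drop_repeated_lines is hypothetical.

```python
def drop_repeated_lines(text: str) -> str:
    """Keep the first occurrence of each non-blank line, preserving order."""
    seen: set[str] = set()
    kept: list[str] = []
    for line in text.splitlines():
        # Blank lines are always kept so paragraph spacing survives
        # (an assumption of this sketch).
        if line.strip() and line in seen:
            continue
        seen.add(line)
        kept.append(line)
    return "\n".join(kept)


page_text = "ACME Report\nFirst page body.\nACME Report\nSecond page body."
deduplicated = drop_repeated_lines(page_text)
print(deduplicated)
```

Here the repeated "ACME Report" header line appears only once in the result, which is the typical header and footer case the option is aimed at.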

Keyword Arguments:

All **kwargs are passed directly into docling’s export_to_html() function.

Returns:

If the file is written successfully, a string containing the full path to the output file is returned. Otherwise, None.

Return type:

str | None

to_markdown(*, page_no: int = None, image_mode: str = 'placeholder', include_annotations: bool = True, unique_lines: bool = False, to_file: bool = False, auto_open: bool = False, **kwargs) -> str | None

Convert a PDF to Markdown format.

Parameters:
  • page_no (int, optional) – Page number to convert. Defaults to None (for all pages).

  • image_mode (str, optional) – The mode to use for including images in the Markdown. Options are: ‘embedded’, ‘placeholder’, ‘referenced’. Defaults to ‘placeholder’.

  • include_annotations (bool, optional) – Whether to include annotations in the export. Defaults to True.

  • unique_lines (bool, optional) – Remove any duplicated lines from the document’s content. Generally used to remove repeated header and footer strings. Defaults to False.

  • to_file (bool, optional) –

    Write the converted text to a text file. Defaults to False.

    Tip

    If you change your mind, call the write() method to store the converted text to a file.

  • auto_open (bool, optional) –

    On completion, display the converted text as rendered HTML in a web browser. Defaults to False.

    Tip

    To view later, simply call the preview() method.

Keyword Arguments:

All **kwargs are passed directly into docling’s export_to_markdown() function.

Returns:

If the file is written successfully, a string containing the full path to the output file is returned. Otherwise, None.

Return type:

str | None

preview(raw: bool = False, offline: bool = False, **kwargs) -> None

Preview the conversion as rendered text in a web browser.

Note

Each conversion (TextObject) is rendered to its own page in the web browser.

Parameters:
  • raw (bool, optional) – If viewing a Markdown formatted file, preview the raw Markdown (i.e. do not render as HTML). Defaults to False.

  • offline (bool, optional) – If True, ghmdlib does not call the GitHub Markdown conversion API and performs the conversion internally instead. Defaults to False.

Keyword Arguments:

These arguments are passed directly into the ghmdlib.ghmd.Converter.convert() method. Refer to that method’s documentation for the accepted arguments.

write(ext: str) -> str | None

Write the extracted Markdown or HTML content to disk.

Parameters:

ext (str) – File extension to be applied to the output file. For example: '.html'

Returns:

If the file is written successfully, a string containing the full path to the output file is returned. Otherwise, None.

Return type:

str | None