Module: parsers/pdfparser.py
- Purpose:
This module provides the implementation for the docling-based PDF parser. This parser is specifically designed for converting content from a PDF file to Markdown and/or HTML format.
- Platform:
Linux/Windows | Python 3.11+
- Developer:
J Berendt
- Email:
- Comments:
The PDFParser class requires the docling project model to be accessible. The following guidance can be used to obtain the model and set the model’s path in the config file.
Model Pre-Fetching: The docling project model must be downloaded and available for use before this module can be used. Below is guidance for pre-fetching the model for offline usage.
Download the model:
docling-tools models download \
    --output-dir /path/to/models/docling-project
Update config.toml:
In the docp-core/config/config.toml file, update the docling key in the paths.models table to match the download path specified in the previous step.
GPU Support: GPU support (CUDA) should be enabled automatically by the internals. However, guidance for enabling GPU-support can be found here.
- class PDFParser(*args: Any, **kwargs: Any)
Bases: PDFParser
Docling-based PDF parser class.
- Parameters:
path (str) – Path to the PDF file to be parsed.
detailed_extraction (bool, optional) –
Optimise extraction of additional features such as code and formulae. Defaults to False.
Tip
While useful in certain cases, this extraction mode increases processing time by ~2x.
Note
For basic text or table extraction from PDFs, the PDFParser class available from the docp-parsers library is recommended as it’s fast and straightforward.
For converting PDFs into Markdown or HTML formats, this class provides the functionality you need:
HTML: Use the to_html() method.
Markdown: Use the to_markdown() method.
As an extension of docp-parsers.PDFParser, it also supports all the core PDF extraction features, so you can also use it for text and table extraction.
Important
If parsing a single document several times (e.g. for testing different method options), the content of each parse will be appended to the texts attribute. This can lead to unexpected content. If applicable to your use case, be sure to call the initialise() method between parsings to clear the content.
- Example:
Parse a PDF into Markdown format:
>>> from docp_docling import PDFParser

# Convert
>>> pdf = PDFParser(path='/path/to/file.pdf')
>>> pdf.to_markdown()

# Access the converted content
>>> pdf.content

# Render extracted text as HTML and preview in a browser.
>>> pdf.preview()
Parse a single page from a PDF into Markdown format, including images, and store to a file:
>>> from docp_docling import PDFParser

# Convert
>>> pdf = PDFParser(path='/path/to/file.pdf')
>>> pdf.to_markdown(page_no=1,
...                 image_mode='embedded',  # <-- Include images
...                 to_file=True)

# Render extracted text as HTML and preview in a browser.
>>> pdf.preview()
Parse a single page from a PDF into Markdown format and store to a file (manually):
>>> from docp_docling import PDFParser

# Convert
>>> pdf = PDFParser(path='/path/to/file.pdf')
>>> pdf.to_markdown(page_no=1)

# Render extracted text as HTML and preview in a browser.
>>> pdf.preview()

# Write the converted Markdown content to a file.
>>> pdf.write(ext='.md')
Parse a single page from a PDF into HTML format, including images:
>>> from docp_docling import PDFParser

# Convert
>>> pdf = PDFParser(path='/path/to/file.pdf')
>>> pdf.to_html(page_no=1, image_mode='embedded')  # <-- Include images

# Render extracted text and preview in a browser.
>>> pdf.preview(raw=True)
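The caution above about parsing one document several times can be illustrated with a small helper. This is a minimal sketch, not part of the library: convert_variants is a hypothetical name, and it assumes only what this page states, namely that to_markdown() accepts the documented keyword options, that texts holds objects with a content attribute, and that initialise() can be called without arguments to clear previously parsed content.

```python
def convert_variants(parser, option_sets):
    """Convert the same document once per option set.

    ``parser`` is expected to behave like the PDFParser documented here;
    ``convert_variants`` itself is a hypothetical helper.
    """
    results = []
    for index, options in enumerate(option_sets):
        if index > 0:
            # Clear the previous parse so new content is not appended to it.
            # Assumes initialise() takes no arguments.
            parser.initialise()
        parser.to_markdown(**options)
        # Copy the content out now, before the next reset discards it.
        results.append([text.content for text in parser.texts])
    return results
```

Called as convert_variants(pdf, [{'page_no': 1}, {'page_no': 1, 'unique_lines': True}]), this would return one list of strings per option set, so the two variants can be compared without one leaking into the other.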
- property texts: list
Accessor to parsed text as TextObject instances.
For each text in the list, use the content attribute to access the extracted text.
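As a rough illustration of working with this attribute, the sketch below builds a short summary of each parsed text. The helper name summarise_texts and the preview_chars parameter are hypothetical; the only assumption about the library is the one stated above, that each item in texts exposes its extracted text through a content attribute.

```python
def summarise_texts(texts, preview_chars=40):
    """Return (index, character count, preview) for each parsed text."""
    summary = []
    for index, text in enumerate(texts):
        content = text.content
        # Collapse whitespace so the preview stays on a single line.
        preview = ' '.join(content.split())[:preview_chars]
        summary.append((index, len(content), preview))
    return summary
```

Passing pdf.texts after a conversion gives a quick way to spot content left over from an earlier parse.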
- to_html(*, page_no: int = None, image_mode: str = 'placeholder', include_annotations: bool = True, unique_lines: bool = False, to_file: bool = False, auto_open: bool = False, **kwargs) -> str | None
Convert a PDF to HTML format.
- Parameters:
page_no (int, optional) – Page number to convert. Defaults to None (for all pages).
image_mode (str, optional) – The mode to use for including images in the HTML. Options are: ‘embedded’, ‘placeholder’, ‘referenced’. Defaults to ‘placeholder’.
include_annotations (bool, optional) – Whether to include annotations in the export. Defaults to True.
unique_lines (bool, optional) – Remove any duplicated lines from the document’s content. Generally used to remove repeated header and footer strings. Defaults to False.
to_file (bool, optional) –
Write the converted text to a text file. Defaults to False.
Tip
If you change your mind, call the write() method to store the converted text to a file.
auto_open (bool, optional) –
On completion, display the converted text as rendered HTML in a web browser. Defaults to False.
Tip
To view later, simply call the preview() method.
Be sure to pass raw=True to display the converted HTML in the browser rather than converting HTML to MD and back to HTML.
- Keyword Arguments:
All **kwargs are passed directly into docling’s export_to_html() function.
- Returns:
If the file is written successfully, a string containing the full path to the output file is returned. Otherwise, None.
- Return type:
str | None
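Because the return value is None unless a file was written, callers that rely on the output file may want to check it explicitly. The helper below is a minimal sketch of that check; require_output_path is a hypothetical name and is not provided by the library.

```python
from pathlib import Path


def require_output_path(result):
    """Turn a 'str | None' conversion result into a Path, or fail loudly."""
    if result is None:
        # No file was written, for example because to_file was left as False.
        raise RuntimeError('The conversion did not produce an output file.')
    return Path(result)
```

A call such as require_output_path(pdf.to_html(page_no=1, to_file=True)) then either yields the output path or raises, rather than passing None further down the pipeline.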
- to_markdown(*, page_no: int = None, image_mode: str = 'placeholder', include_annotations: bool = True, unique_lines: bool = False, to_file: bool = False, auto_open: bool = False, **kwargs) -> str | None
Convert a PDF to Markdown format.
- Parameters:
page_no (int, optional) – Page number to convert. Defaults to None (for all pages).
image_mode (str, optional) – The mode to use for including images in the markdown. Options are: ‘embedded’, ‘placeholder’, ‘referenced’. Defaults to ‘placeholder’.
include_annotations (bool, optional) – Whether to include annotations in the export. Defaults to True.
unique_lines (bool, optional) – Remove any duplicated lines from the document’s content. Generally used to remove repeated header and footer strings. Defaults to False.
to_file (bool, optional) –
Write the converted text to a text file. Defaults to False.
Tip
If you change your mind, call the write() method to store the converted text to a file.
auto_open (bool, optional) –
On completion, display the converted text as rendered HTML in a web browser. Defaults to False.
Tip
To view later, simply call the preview() method.
- Keyword Arguments:
All **kwargs are passed directly into docling’s export_to_markdown() function.
- Returns:
If the file is written successfully, a string containing the full path to the output file is returned. Otherwise, None.
- Return type:
str | None
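To make the effect of the unique_lines option more concrete, the sketch below shows one way order-preserving line de-duplication could work. It is an illustration only and not the library's implementation, which may differ (for instance in how blank lines or whitespace are treated); drop_duplicate_lines is a hypothetical name.

```python
def drop_duplicate_lines(text: str) -> str:
    """Keep the first occurrence of each non-blank line, in original order."""
    seen = set()
    kept = []
    for line in text.splitlines():
        key = line.strip()
        # Blank lines are always kept so paragraph spacing survives;
        # this is an assumption made for the illustration.
        if key and key in seen:
            continue
        seen.add(key)
        kept.append(line)
    return '\n'.join(kept)
```

With a page header repeated on every page, only the first copy of that header line would remain, which matches the typical use described for unique_lines.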
- preview(raw: bool = False, offline: bool = False, **kwargs) -> None
Preview the conversion as rendered text in a web browser.
Note
Each conversion (TextObject) is rendered to its own page in the web browser.
- Parameters:
raw (bool, optional) – If viewing a Markdown formatted file, preview the raw Markdown (i.e. do not render as HTML). Defaults to False.
offline (bool, optional) – If True, this prevents ghmdlib from calling the GitHub Markdown conversion API; the conversion is performed internally instead. Defaults to False.
- Keyword Arguments:
These arguments are passed directly into the ghmdlib.ghmd.Converter.convert() method. Refer to that method’s documentation for the accepted arguments.
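The choice of raw depends on what was converted: per the to_html() tip, HTML output is best previewed with raw=True, while Markdown output is normally rendered first. The wrapper below is a minimal sketch of that rule; preview_conversion and its fmt labels ('html', 'markdown') are hypothetical and not part of the library.

```python
def preview_conversion(parser, fmt: str, offline: bool = False) -> None:
    """Preview a conversion with a ``raw`` flag that suits its format.

    ``fmt`` is a hypothetical label: 'html' or 'markdown'.
    """
    if fmt not in ('html', 'markdown'):
        raise ValueError(f'Unsupported format: {fmt!r}')
    # HTML is shown as-is; Markdown is rendered to HTML before display.
    parser.preview(raw=(fmt == 'html'), offline=offline)
```

Passing offline=True keeps the Markdown rendering local, which may be preferable when the GitHub API is unreachable.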
- write(ext: str) -> str | None
Write the extracted Markdown or HTML content to disk.
- Parameters:
ext (str) – File extension to be applied to the output file. For example: '.html'
- Returns:
If the file is written successfully, a string containing the full path to the output file is returned. Otherwise, None.
- Return type:
str | None