Metadata-Version: 2.4
Name: pdfannots
Version: 0.5
Summary: Tool to extract and pretty-print PDF annotations for reviewing
Project-URL: Homepage, https://github.com/0xabu/pdfannots
Author-email: Andrew Baumann <pdfannots.pypi.org@ab.id.au>
License: MIT License
        
        Copyright (c) Microsoft Corporation (2016-2022) and Andrew Baumann (2022-). All rights reserved.
        
        Permission is hereby granted, free of charge, to any person obtaining
        a copy of this software and associated documentation files (the
        "Software"), to deal in the Software without restriction, including
        without limitation the rights to use, copy, modify, merge, publish,
        distribute, sublicense, and/or sell copies of the Software, and to
        permit persons to whom the Software is furnished to do so, subject to
        the following conditions:
        
        The above copyright notice and this permission notice shall be
        included in all copies or substantial portions of the Software.
        
        THE SOFTWARE IS PROVIDED *AS IS*, WITHOUT WARRANTY OF ANY KIND,
        EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF
        MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND
        NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS BE
        LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION
        OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION
        WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE.
License-File: LICENSE.txt
Classifier: Intended Audience :: Science/Research
Classifier: License :: OSI Approved :: MIT License
Classifier: Programming Language :: Python :: 3
Classifier: Programming Language :: Python :: 3.8
Classifier: Programming Language :: Python :: 3.9
Classifier: Programming Language :: Python :: 3.10
Classifier: Programming Language :: Python :: 3.11
Classifier: Programming Language :: Python :: 3.12
Classifier: Programming Language :: Python :: 3.13
Classifier: Topic :: Text Processing
Requires-Python: >=3.8
Requires-Dist: pdfminer-six!=20240706,>=20220319
Description-Content-Type: text/markdown

## pdfannots

[![Build status](https://github.com/0xabu/pdfannots/actions/workflows/python-checks.yml/badge.svg)](https://github.com/0xabu/pdfannots/actions/workflows/python-checks.yml)
[![PyPI version](https://img.shields.io/pypi/v/pdfannots)](https://pypi.org/project/pdfannots/)

This program extracts annotations (highlights, comments, etc.) from a PDF file,
and formats them as Markdown or exports them to JSON. It is primarily intended
for use in reviewing submissions to scientific conferences/journals.

![Sample/demo of pdfannots extracting Markdown from an annotated PDF](doc/demo.png)

For the default Markdown format, the output is as follows:

 * Highlights without an attached comment are output first, as
   "highlights" with just the highlighted text included. Note that
   these are not typically suitable for use in a review, since they're
   unlikely to have any meaning to the recipient; they are just meant
   to serve as a reminder to the reviewer.

 * Highlights with an attached comment, and text annotations (not
   attached to any particular text/highlight) are output next, as
   "detailed comments". Typically most comments on a reviewed paper
   are of this form.

 * Underline, strikeout, and squiggly underline annotations are output
   last, as "nits", with or without an attached comment. This separates
   formatting or grammatical corrections from more substantial comments
   about the content of the document.
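
The grouping rules above can be sketched in a few lines of Python. This is only
an illustration of the rules as described here, not the tool's actual code: the
`Annot` class, the `section_for` function, and the `NIT_SUBTYPES` set are
hypothetical names, and the subtype strings are the standard PDF annotation
subtypes.

```python
from dataclasses import dataclass
from typing import Optional

# Hypothetical names for illustration; pdfannots' own classes may differ.
NIT_SUBTYPES = {"Underline", "StrikeOut", "Squiggly"}


@dataclass
class Annot:
    subtype: str                   # PDF annotation subtype, e.g. "Highlight" or "Text"
    text: str = ""                 # text covered by the annotation, if any
    comment: Optional[str] = None  # attached comment, if any


def section_for(annot: Annot) -> str:
    """Return the output section an annotation falls into under the rules above."""
    if annot.subtype in NIT_SUBTYPES:
        return "nits"                  # with or without a comment
    if annot.subtype == "Highlight" and not annot.comment:
        return "highlights"            # bare highlight: a reminder for the reviewer
    return "detailed comments"         # commented highlight or free text annotation
```

For example, `section_for(Annot("Highlight", "some text", "Needs a citation"))`
falls under detailed comments, while the same highlight without a comment
falls under highlights.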

For each annotation, the page number is given, along with the associated
(highlighted/underlined) text, if any. Additionally, if the document embeds
outlines (aka bookmarks), such as those generated by the LaTeX
[hyperref](https://ctan.org/pkg/hyperref) package, they are printed to help
identify which section of the document the annotation refers to.


### Installation

To install the latest released version from PyPI, use a command such as:
```
python3 -m pip install pdfannots
```


### Usage

See `pdfannots --help` (in a source tree: `pdfannots.py --help`) for
options and invocation.
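
When calling the tool from a script, a small helper along the following lines
may be convenient. It is a sketch only: `build_command` is a hypothetical name,
it uses only options mentioned elsewhere in this README, and it assumes the
input PDF is given as a positional argument (check `pdfannots --help` for the
authoritative list).

```python
import shlex
from typing import List, Optional


def build_command(pdf_path: str,
                  cols: Optional[int] = None,
                  keep_hyphens: bool = False,
                  word_margin: Optional[float] = None) -> List[str]:
    """Assemble an argument list for pdfannots (hypothetical helper)."""
    cmd = ["pdfannots"]
    if cols is not None:
        cmd.append(f"--cols={cols}")          # force a fixed column layout
    if keep_hyphens:
        cmd.append("--keep-hyphens")          # do not join words split across lines
    if word_margin is not None:
        cmd.append(f"--word-margin={word_margin}")  # layout analysis tweak
    cmd.append(pdf_path)                      # assumption: input file is positional
    return cmd


# Show the command line that would be run for a two-column paper.
print(shlex.join(build_command("paper.pdf", cols=2, keep_hyphens=True)))
```

The returned list can be passed directly to `subprocess.run`, which avoids
shell quoting problems with file names that contain spaces.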


### Dependencies

 * Python >= 3.8
 * [pdfminer.six](https://github.com/pdfminer/pdfminer.six)


### Known issues and limitations

 * While it is generally reliable, pdfminer (the underlying PDF parser) is
   not infallible at extracting text from a PDF. It has been known to fail
   in several different ways:

    * Sometimes it misses or misplaces individual characters, resulting in
      annotations with some or all of the text missing (in the latter case,
      you'll see a warning).

    * Sometimes the characters are captured, but not spaces between the words.
      Tweaking the advanced layout analysis parameters (e.g., `--word-margin`)
      may help with this.

    * Sometimes it extracts all the text but returns it out of order, for
      example, reporting that text at the top of a second column comes before
      text at the end of the first column. This causes pdfannots to return the
      annotations out of order, or to report the wrong outlines (section
      headings) for annotations. You can mostly work around this issue by using
      the `--cols` parameter to force a fixed page layout for the document
      (e.g., `--cols=2` for a typical 2-column document).

 * If an annotation (such as a StrikeOut) covers solely whitespace, no text is
   extracted for the annotation, and it will be skipped (with a warning). This
   is an artifact of the way pdfminer reports whitespace with only an implicit
   position defined by surrounding characters.

 * When extracting text, we remove all hyphens that immediately precede a line
   break and join the adjacent words. This usually produces the best results
   with LaTeX multi-column documents (e.g. "soft-`\n`ware" becomes "software"),
   but sometimes the hyphen needs to stay (e.g. "memory-`\n`mapped", which will be
   extracted as "memorymapped"), and we can't tell the difference. To disable
   this behaviour, pass `--keep-hyphens`.
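
The hyphen handling described above amounts to something like the following
sketch. It is not the tool's actual implementation: `join_hyphenated` is a
hypothetical name, and what happens to the line break when hyphens are kept is
an assumption made for illustration.

```python
import re


def join_hyphenated(text: str, keep_hyphens: bool = False) -> str:
    """Join words that were split across a line break by a hyphen."""
    if keep_hyphens:
        # Assumption: keep the hyphen and drop only the line break after it.
        return re.sub(r"-\n", "-", text)
    # Default: drop both the hyphen and the line break that follows it.
    return re.sub(r"-\n", "", text)
```

With the default, `"soft-\nware"` becomes `"software"` and `"memory-\nmapped"`
becomes `"memorymapped"`; the text alone does not say which hyphens are part
of the word, which is why the choice is left to a command-line flag.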


### FAQ

 1. I'd like to change how the output is formatted.

    Some minor tweaks (e.g., word wrap, skipping or reordering output sections)
    can be accomplished via command-line arguments.

    All of the output comes from the relevant `Printer` subclass; more elaborate
    changes can be accomplished there. Pull requests to introduce new output
    formats or variants as printers are welcome.

 2. I think I got a review generated by this tool...

    I hope that it was a constructive review, and that the annotations
    helped the reviewer give you more detailed feedback so you can improve
    your paper. This is, after all, just a tool, and it should not be an
    excuse for reviewer sloppiness.
