Metadata-Version: 2.1
Name: pdf2doi
Version: 0.1.4a1
Summary: A python library/command-line tool to retrieve the DOI of a paper from a pdf file.
Home-page: https://github.com/MicheleCotrufo/pdf2doi
Author: Michele Cotrufo
Author-email: michele.cotrufo@gmail.com
License: MIT
Platform: UNKNOWN
Description-Content-Type: text/markdown
Requires-Dist: bibtexparser (>=1.2.0)
Requires-Dist: google (>=3.0.0)
Requires-Dist: pyparsing (>=2.4.7)
Requires-Dist: requests (>=2.25.1)
Requires-Dist: pypdf2 (>=1.26)
Requires-Dist: textract (>=1.6.3)
Requires-Dist: beautifulsoup4 (==4.8.0)
Requires-Dist: feedparser (>=6.0.2)
Requires-Dist: chardet (==3.0.4)

# pdf2doi

pdf2doi is a Python library to extract the DOI or other identifiers (e.g. arXiv) from a pdf file of a publication.

## Installation

Use the package manager [pip] to install pdf2doi.

```bash
pip install pdf2doi
```

## Usage

pdf2doi can be used either as a stand-alone application invoked from the command line, or by importing it in your python project.

Example of usage from command line:

```
pdf2doi 'path/filename.pdf'
pdf2doi './folder'
```
```
pdf2doi --h

usage: pdf2doi [-h] [-v] [-nws] [-nwv] [-google_results GOOGLE_RESULTS] filename

Retrieve the DOI of a paper from a PDF file.

positional arguments:
  filename              Relative path of the pdf file or of a folder.

optional arguments:
  -h, --help            show this help message and exit
  -v, --verbose         Increase output verbosity.
  -nws, --nowebsearch   Disable any DOI retrieval method which requires internet searches (e.g. queries to google).
  -nwv, --nowebvalidation
                        Disable the DOI online validation via queries (e.g., to http://dx.doi.org/).
  -google_results GOOGLE_RESULTS
                        Set how many results should be considered when doing a google search for the DOI (default=6).

```
Example of usage inside a python script:
```python
import pdf2doi
#Try to identify the DOI/identifier of the file 'path/filename.pdf'
result = pdf2doi.pdf2doi('path/filename.pdf',verbose=True)
#the output is a list with three strings
#result = [identifier, type_identifier, file_name]

#Try to identify the DOIs of all pdf files contained in the folder
result  = pdf2doi.pdf2doi('./folder',verbose=True,webvalidation=True) 
#The output is a list containing an element for each .pdf file in the folder,
#and each element has the format [identifier, type_identifier, file_name]

```

## Contributing
Pull requests are welcome. For major changes, please open an issue first to discuss what you would like to change.


## License
[MIT](https://choosealicense.com/licenses/mit/)

