Metadata-Version: 2.4
Name: data_gatherer
Version: 0.1.7
Summary: DataGatherer Library
Author-email: Pietro Marini <pgm7072@nyu.edu>
Maintainer-email: Pietro Marini <pgm7072@nyu.edu>
License: MIT
Project-URL: Homepage, https://github.com/VIDA-NYU/data-gatherer
Project-URL: Repository, https://github.com/VIDA-NYU/data-gatherer
Project-URL: Issues, https://github.com/VIDA-NYU/data-gatherer/issues
Keywords: Information Extraction,NYU
Classifier: Development Status :: 4 - Beta
Classifier: Intended Audience :: Developers
Classifier: Operating System :: OS Independent
Classifier: Programming Language :: Python :: 3
Classifier: Programming Language :: Python :: 3.11
Classifier: Programming Language :: Python :: 3.12
Classifier: Topic :: Scientific/Engineering :: Information Analysis
Requires-Python: >=3.11
Description-Content-Type: text/x-rst
Requires-Dist: beautifulsoup4>=4.12.3
Requires-Dist: bs4
Requires-Dist: lxml
Requires-Dist: numpy
Requires-Dist: ollama
Requires-Dist: openai
Requires-Dist: pandas
Requires-Dist: pydantic
Requires-Dist: pydantic_core
Requires-Dist: python-dotenv
Requires-Dist: PyYAML
Requires-Dist: regex
Requires-Dist: requests
Requires-Dist: selenium>=4.28.0
Requires-Dist: tokenizers
Requires-Dist: transformers
Requires-Dist: typing_extensions
Requires-Dist: webdriver-manager
Requires-Dist: google-generativeai
Requires-Dist: tiktoken
Requires-Dist: cloudscraper
Requires-Dist: pyui
Requires-Dist: pysdl2-dll
Requires-Dist: streamlit
Requires-Dist: ipywidgets
Requires-Dist: portkey-ai
Requires-Dist: xlsxwriter
Requires-Dist: sentence-transformers
Requires-Dist: pymupdf
Requires-Dist: json-repair

.. image:: https://readthedocs.org/projects/data-gatherer/badge/?version=latest
   :target: https://data-gatherer.readthedocs.io/en/latest/
   :alt: Documentation Status

Data Gatherer
=============

**Data Gatherer** is a Python library for automatically extracting dataset references from scientific publications.
It processes full-text articles in HTML or XML format and combines rule-based and LLM-based methods
to identify and structure dataset citations.

What It Does
------------

- Parses scientific articles from open-access sources like PubMed Central (PMC).
- Extracts dataset mentions from structured sections (e.g., Data Availability, Supplementary Material).
- Supports two main strategies:

  - **Retrieve-Then-Read (RTR)**: First retrieves relevant sections using hand-crafted rules, then applies LLMs.
  - **Full-Document Read (FDR)**: Applies LLMs to the full text without section filtering.

- Outputs structured results in JSON format.
- Includes support for known repositories (e.g., GEO, PRIDE, MassIVE) via a configurable ontology.
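
The difference between the two strategies, and the role of a repository ontology, can be sketched in a few lines.
This is only an illustration under stated assumptions, not the library's API: the function names, the output keys
(``dataset_identifier``, ``repository``), and the three-entry ontology are hypothetical, and the ``read`` step uses
regular expressions as a stand-in for the LLM call so that the example runs offline.

.. code-block:: python

   import json
   import re

   # Hypothetical mini-ontology: repository name -> accession pattern.
   REPOSITORY_PATTERNS = {
       "GEO": re.compile(r"\bGSE\d+\b"),
       "PRIDE": re.compile(r"\bPXD\d+\b"),
       "MassIVE": re.compile(r"\bMSV\d+\b"),
   }

   # Hand-crafted retrieval rule: section titles considered relevant.
   RELEVANT_TITLES = re.compile(r"data availability|supplementary", re.IGNORECASE)


   def read(text):
       """Stand-in for the LLM call: find accession IDs with regexes."""
       found = []
       for repository, pattern in REPOSITORY_PATTERNS.items():
           for accession in pattern.findall(text):
               found.append({"dataset_identifier": accession, "repository": repository})
       return found


   def retrieve_then_read(sections):
       """RTR: keep only sections whose title matches the rule, then read them."""
       selected = [body for title, body in sections.items() if RELEVANT_TITLES.search(title)]
       return read("\n".join(selected))


   def full_document_read(sections):
       """FDR: read every section, with no filtering."""
       return read("\n".join(sections.values()))


   # Toy article, keyed by section title (made-up text and accessions).
   article = {
       "Introduction": "We reanalysed the public series GSE12345.",
       "Methods": "Samples were processed as described previously.",
       "Data Availability": "Proteomics data are available under PXD000001.",
   }

   rtr_result = retrieve_then_read(article)
   fdr_result = full_document_read(article)
   print(json.dumps({"rtr": rtr_result, "fdr": fdr_result}, indent=2))

In this toy article, RTR reads only the Data Availability section and therefore misses the accession mentioned in
the Introduction, while FDR finds both at the cost of reading the whole text.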

.. image:: docs/H_Flowchart.png
   :alt: Flowchart illustrating the main steps of the process_url function.

Use Cases
---------

- Helping data curators and librarians identify datasets cited in publications.
- Supporting meta-analysis and secondary data discovery.
- Enabling dataset indexing and retrieval across the open-access literature.
