Metadata-Version: 2.4
Name: pmcgrab
Version: 0.3.1
Summary: AI-ready retrieval and parsing of PubMed Central articles for RAG applications
Project-URL: Homepage, https://github.com/rajdeepmondaldotcom/pmcgrab
Project-URL: Documentation, https://github.com/rajdeepmondaldotcom/pmcgrab#readme
Project-URL: Repository, https://github.com/rajdeepmondaldotcom/pmcgrab.git
Project-URL: Issues, https://github.com/rajdeepmondaldotcom/pmcgrab/issues
Project-URL: Changelog, https://github.com/rajdeepmondaldotcom/pmcgrab/releases
Project-URL: Bug Reports, https://github.com/rajdeepmondaldotcom/pmcgrab/issues
Project-URL: Source Code, https://github.com/rajdeepmondaldotcom/pmcgrab
Project-URL: Download, https://pypi.org/project/pmcgrab/
Author-email: Rajdeep Mondal <rajdeep@rajdeepmondal.com>
Maintainer-email: Rajdeep Mondal <rajdeep@rajdeepmondal.com>
License-Expression: Apache-2.0
License-File: LICENSE
Keywords: ai,bioinformatics,entrez,llm,ncbi,pmc,pubmed,rag,research papers,retrieval augmented generation,scientific literature,text mining,xml parsing
Classifier: Development Status :: 4 - Beta
Classifier: Environment :: Console
Classifier: Intended Audience :: Developers
Classifier: Intended Audience :: Science/Research
Classifier: License :: OSI Approved :: Apache Software License
Classifier: Natural Language :: English
Classifier: Operating System :: OS Independent
Classifier: Programming Language :: Python :: 3
Classifier: Programming Language :: Python :: 3.9
Classifier: Programming Language :: Python :: 3.10
Classifier: Programming Language :: Python :: 3.11
Classifier: Programming Language :: Python :: 3.12
Classifier: Programming Language :: Python :: 3.13
Classifier: Topic :: Scientific/Engineering :: Bio-Informatics
Classifier: Topic :: Scientific/Engineering :: Information Analysis
Classifier: Topic :: Software Development :: Libraries :: Python Modules
Classifier: Topic :: Text Processing :: Linguistic
Classifier: Topic :: Text Processing :: Markup :: XML
Requires-Python: >=3.9
Requires-Dist: beautifulsoup4>=4.13.4
Requires-Dist: biopython>=1.83
Requires-Dist: certifi>=2024.0.0
Requires-Dist: html5lib>=1.1
Requires-Dist: lxml>=4.9.0
Requires-Dist: numpy>=1.24.0
Requires-Dist: pandas>=2.0.0
Requires-Dist: requests>=2.28.0
Requires-Dist: tqdm>=4.64.0
Provides-Extra: dev
Requires-Dist: bandit[toml]>=1.7.0; extra == 'dev'
Requires-Dist: build>=0.10.0; extra == 'dev'
Requires-Dist: mypy>=1.5.0; extra == 'dev'
Requires-Dist: pre-commit>=3.0.0; extra == 'dev'
Requires-Dist: pytest-cov>=4.0.0; extra == 'dev'
Requires-Dist: pytest-xdist>=3.0.0; extra == 'dev'
Requires-Dist: pytest>=7.0.0; extra == 'dev'
Requires-Dist: ruff>=0.1.0; extra == 'dev'
Requires-Dist: safety>=2.0.0; extra == 'dev'
Requires-Dist: twine>=4.0.0; extra == 'dev'
Provides-Extra: docs
Requires-Dist: mkdocs-material>=9.0.0; extra == 'docs'
Requires-Dist: mkdocs>=1.5.0; extra == 'docs'
Requires-Dist: mkdocstrings[python]>=0.20.0; extra == 'docs'
Provides-Extra: test
Requires-Dist: pytest-cov>=4.0.0; extra == 'test'
Requires-Dist: pytest-xdist>=3.0.0; extra == 'test'
Requires-Dist: pytest>=7.0.0; extra == 'test'
Requires-Dist: responses>=0.23.0; extra == 'test'
Description-Content-Type: text/markdown

# PMCGrab

From PMCID to structured JSON - bridge PubMed Central and your AI pipeline.

[![PyPI](https://img.shields.io/pypi/v/PMCGrab.svg)](https://pypi.org/project/PMCGrab/)
[![License](https://img.shields.io/badge/License-Apache%202.0-blue.svg)](LICENSE)

`PMCGrab` is a specialised Python toolkit for **retrieving, validating and restructuring PubMed Central (PMC) articles into clean, section-aware JSON** that large-language-model (LLM) pipelines can ingest directly for Retrieval-Augmented Generation (RAG), question-answering, summarisation and other downstream tasks.

---

## Table of Contents

1. [Key Features](#key-features)
2. [Why PMCGrab?](#why-PMCGrab)
3. [Installation](#installation)
4. [Quick Start (Python)](#quick-start-python)
5. [Command-Line Interface](#command-line-interface)
6. [Batch Processing & Scaling](#batch-processing--scaling)
7. [Output Schema](#output-schema)
8. [Configuration](#configuration)
9. [Logging](#logging)
10. [Development](#development)
11. [Contributing](#contributing)
12. [License](#license)
13. [Citation](#citation)
14. [Acknowledgements](#acknowledgements)

---

## Key Features

- **Effortless Retrieval** – Fetch full-text articles with a single PMCID using NCBI Entrez.
- **AI-Optimised JSON** – Output is pre-segmented into `Introduction`, `Methods`, `Results`, `Discussion`, etc., dramatically improving context relevance in RAG pipelines.
- **Highly Concurrent** – Multithreaded batch downloader with configurable worker count, retries and timeouts.
- **HTML & Reference Cleaning** – Utilities to strip or normalise embedded HTML, citations and footnotes.

---

## Why PMCGrab?

While the NCBI Entrez API already provides raw XML, consuming it directly is burdensome:

|                                       | Entrez XML | PMCGrab JSON |
| ------------------------------------- | ---------: | -----------: |
| Section delineation                   |         ❌ |           ✅ |
| Straightforward to embed in vector DB |         ❌ |           ✅ |
| Ready for LLM chunking                |         ❌ |           ✅ |
| Batch parallelism                     |    Limited |    Automatic |

Put simply, `PMCGrab` turns _pubmed central_ papers into _AI-centric_ assets.

---

## Installation

### Requirements

- Python ≥ 3.9
- GCC or compatible compiler for `lxml` wheels on some platforms

### From PyPI

```bash
pip install pmcgrab
```

### From Source

```bash
git clone https://github.com/rajdeepmondaldotcom/pmcgrab.git
cd pmcgrab
pip install .
```

For optional development utilities:

```bash
pip install .[dev,test,docs]
```

---

## Quick Start (Python)

Fetch and inspect a single article:

```python
from pmcgrab import Paper

# NCBI requires an email for identification
paper = Paper.from_pmc("7181753", email="rajdeep@rajdeepmondal.com")

print(paper.title)
print(paper.body["Introduction"][:500])  # first 500 chars of Introduction
```

### Example: Process five PMC articles

Run the helper script located in `examples/run_five_pmcs.py`:

```bash
python examples/run_five_pmcs.py
```

The script will:

1. Download five predefined PMCIDs (see the source).
2. Print a brief summary for each article (title, abstract snippet, author count).
3. Persist the full JSON output into `pmc_output/PMC<id>.json` for further inspection.

---

## Command-Line Interface

Batch-process multiple PMCIDs directly from the shell:

```bash
python -m pmcgrab.cli.pmcgrab_cli \
  --pmcids 7181753 3539614 5454911 \
  --output-dir ./pmc_output \
  --batch-size 8
```

After completion you will find:

```
pmc_output/
├── PMC7181753.json
├── PMC3539614.json
└── PMC5454911.json
```

A `summary.json` file captures success/failure for each ID.

---

## Batch Processing & Scaling

Programmatic interface for large experiments:

```python
from PMCGrab.application.processing import process_pmc_ids

pmc_ids = ["7181753", "3539614", "5454911", ...]
stats = process_pmc_ids(pmc_ids, workers=32)

success_rate = sum(stats.values()) / len(stats)
print(f"{success_rate:%} downloaded successfully")
```

Internally, downloads are sharded across a thread pool and guarded by per-article timeouts.

---

## Output Schema

Below is an abridged view of the generated JSON (actual output contains >30 fields):

```json
{
  "pmc_id": "7181753",
  "title": "...",
  "abstract": "...",
  "authors": [
    {
      "Contributor_Type": "Author",
      "First_Name": "Llorenç",
      "Last_Name": "Solé-Boldo"
    }
  ],
  "body": {
    "Introduction": "The skin is the outermost protective barrier...",
    "Methods": "Clinical samples were obtained...",
    "Results": "...",
    "Discussion": "..."
  },
  "published_date": { "epub": "2020-04-23" },
  "journal_title": "Communications Biology"
}
```

This shape maps cleanly to embeddings or vector stores where section titles become metadata tags for context-aware retrieval.

---

## Configuration

`PMCGrab` follows the 12-factor methodology: **environment variables override defaults**.

| Variable          | Purpose                                                                 | Default              |
| ----------------- | ----------------------------------------------------------------------- | -------------------- |
| `PMCGrab_EMAILS`  | Comma-separated pool of email addresses rotated for API                 | Internal sample list |
| `PMCGrab_WORKERS` | Default worker count for batch processing (if not set programmatically) | `16`                 |

Set them in your shell or orchestrator:

```bash
export PMCGrab_EMAILS="you@org.com,lab@org.com"
export PMCGrab_WORKERS=32
```

---

## Logging

`PMCGrab` uses the standard Python `logging` library and leaves configuration to the host application:

```python
import logging
logging.basicConfig(level=logging.INFO, format="%(levelname)s %(name)s: %(message)s")
```

Switch to `DEBUG` for verbose network and parsing diagnostics.

---

## Development

1. Clone the repository and install the dev extras:
   `pip install -e .[dev,test,docs]`
2. Run the test-suite (100 % coverage):
   `pytest -n auto --cov=pmcgrab`
3. Lint & type-check:
   `ruff check . && mypy src/pmcgrab`
4. Build documentation (MkDocs):
   `mkdocs serve`

Continuous Integration replicates the above on every pull request.

---

## Contributing

Contributions are welcome! Please read the [CONTRIBUTING.md](.github/CONTRIBUTING.md) for details on:

- Code style and commit guidelines
- Branching and release process
- Reporting bugs or suggesting enhancements
- Security disclosures (please email the maintainer directly)

---

## License

`PMCGrab` is licensed under the [Apache 2.0](LICENSE) License.

---

## Citation

If this project contributes to your research, please consider citing it:

```bibtex
@software{PMCGrab,
  author       = {Rajdeep Mondal},
  title        = {PMCGrab: AI-ready retrieval of PubMed Central articles},
  year         = {2025},
  url          = {https://github.com/rajdeepmondaldotcom/pmcgrab},
}
```

---

## Acknowledgements

- The National Center for Biotechnology Information (NCBI) for maintaining PubMed Central.
- The open-source community behind **Biopython**, **BeautifulSoup**, **lxml** and other dependencies that make this project possible.
