Metadata-Version: 2.4
Name: paperpipe
Version: 0.1.1
Summary: Unified paper database for coding agents + PaperQA2
Project-URL: Homepage, https://github.com/hummat/paperpipe
Project-URL: Documentation, https://github.com/hummat/paperpipe#readme
Project-URL: Repository, https://github.com/hummat/paperpipe
Author: Matthias Humt
License: MIT License
        
        Copyright (c) 2025 Matthias Humt
        
        Permission is hereby granted, free of charge, to any person obtaining a copy
        of this software and associated documentation files (the "Software"), to deal
        in the Software without restriction, including without limitation the rights
        to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
        copies of the Software, and to permit persons to whom the Software is
        furnished to do so, subject to the following conditions:
        
        The above copyright notice and this permission notice shall be included in all
        copies or substantial portions of the Software.
        
        THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
        IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
        FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
        AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
        LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
        OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
        SOFTWARE.
License-File: LICENSE
Keywords: arxiv,coding-agent,llm,paperqa,papers,research
Classifier: Development Status :: 3 - Alpha
Classifier: Intended Audience :: Developers
Classifier: Intended Audience :: Science/Research
Classifier: License :: OSI Approved :: MIT License
Classifier: Programming Language :: Python :: 3
Classifier: Programming Language :: Python :: 3.10
Classifier: Programming Language :: Python :: 3.11
Classifier: Programming Language :: Python :: 3.12
Requires-Python: >=3.10
Requires-Dist: arxiv>=2.0.0
Requires-Dist: click>=8.0.0
Requires-Dist: requests>=2.28.0
Provides-Extra: all
Requires-Dist: litellm>=1.0.0; extra == 'all'
Requires-Dist: paper-qa>=5.0.0; (python_version >= '3.11') and extra == 'all'
Requires-Dist: paper-qa[pypdf-media]>=5.0.0; (python_version >= '3.11') and extra == 'all'
Provides-Extra: dev
Requires-Dist: build>=1.0.0; extra == 'dev'
Requires-Dist: pyright>=1.1.385; extra == 'dev'
Requires-Dist: pytest-cov>=7.0.0; extra == 'dev'
Requires-Dist: ruff>=0.1.0; extra == 'dev'
Requires-Dist: twine>=5.0.0; extra == 'dev'
Provides-Extra: llm
Requires-Dist: litellm>=1.0.0; extra == 'llm'
Provides-Extra: paperqa
Requires-Dist: paper-qa>=5.0.0; (python_version >= '3.11') and extra == 'paperqa'
Provides-Extra: paperqa-media
Requires-Dist: paper-qa[pypdf-media]>=5.0.0; (python_version >= '3.11') and extra == 'paperqa-media'
Description-Content-Type: text/markdown

# paperpipe

A unified paper database for coding agents + [PaperQA2](https://github.com/Future-House/paper-qa).

**The problem:** You want AI coding assistants (Claude Code, Codex CLI, Gemini CLI) to reference scientific papers while implementing algorithms. But:
- PDFs are token-heavy and lose equation fidelity
- PaperQA2 is great for research but not optimized for code verification
- No simple way to ask "does my code match equation 7?"

**The solution:** A local database that stores:
- PDFs (for PaperQA2 RAG queries)
- LaTeX source (for exact equation comparison)
- Summaries optimized for coding context
- Extracted equations with explanations

## Installation

### With uv (recommended)

```bash
# Basic installation
uv pip install paperpipe

# With LLM support (for better summaries/equations)
uv pip install 'paperpipe[llm]'

# With PaperQA2 integration
uv pip install 'paperpipe[paperqa]'

# Everything
uv pip install 'paperpipe[all]'
```

Or install from source:
```bash
git clone https://github.com/hummat/paperpipe
cd paperpipe
uv pip install -e ".[all]"
```

### With pip

```bash
# Basic installation
pip install paperpipe

# With LLM support (for better summaries/equations)
pip install 'paperpipe[llm]'

# With PaperQA2 integration
pip install 'paperpipe[paperqa]'

# With PaperQA2 + multimodal PDF parsing (images/tables; installs Pillow)
pip install 'paperpipe[paperqa-media]'

# Everything
pip install 'paperpipe[all]'
```

Or install from source:
```bash
git clone https://github.com/hummat/paperpipe
cd paperpipe
pip install -e ".[all]"
```

## Development

```bash
# Install app + dev tooling (ruff, pyright, pytest)
uv sync --group dev

uv run ruff check .
uv run pyright
uv run pytest -m "not integration"
```

## Quick Start

```bash
# Add papers (names auto-generated from title; auto-tags from arXiv + LLM)
papi add 2303.13476 2106.10689 2112.03907

# Override auto-generated name with --name (single paper only):
papi add https://arxiv.org/abs/1706.03762 --name attention

# Re-adding the same arXiv ID is idempotent (skips). Use --update to refresh, or --duplicate for another copy:
papi add 1706.03762
papi add 1706.03762 --update --name attention
papi add 1706.03762 --duplicate

# List papers
papi list
papi list --tag sdf

# Search
papi search "surface reconstruction"

# Export for coding session
papi export neuralangelo neus --level equations --to ./paper-context/

# Query with PaperQA2 (if installed)
papi ask "What are the key differences between NeuS and Neuralangelo loss functions?"
```

`papi ask` runs PaperQA2 (`pqa`) directly on your local paper database. The first query can be slow because
PaperQA2 has to build its index; subsequent queries reuse it (stored at `<paper_db>/.pqa_index/` by default).
Override the index location by passing `--agent.index.index_directory ...` through to `pqa`, or with
`PAPERPIPE_PQA_INDEX_DIR`.
By default, `papi ask` indexes **PDFs only** (it avoids indexing paperpipe’s generated `summary.md` / `equations.md`
Markdown files by staging PDFs under `<paper_db>/.pqa_papers/`). If you previously ran `papi ask` and PaperQA2
indexed Markdown, delete `<paper_db>/.pqa_index/` once to force a clean rebuild.
You can also override the models PaperQA2 uses for summarization/enrichment with
`PAPERPIPE_PQA_SUMMARY_LLM` and `PAPERPIPE_PQA_ENRICHMENT_LLM` (or pass `--summary_llm` / `--parsing.enrichment_llm`).
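
The PDF-only staging described above can be pictured roughly as follows. This is a sketch, not paperpipe's actual implementation: the helper name `stage_pdfs`, the use of copies rather than links, and the `<name>.pdf` naming inside `.pqa_papers/` are all assumptions.

```python
import shutil
from pathlib import Path


def stage_pdfs(db_root: Path) -> list[Path]:
    """Copy each papers/<name>/paper.pdf to .pqa_papers/<name>.pdf and return the staged paths."""
    staging = db_root / ".pqa_papers"
    staging.mkdir(parents=True, exist_ok=True)
    staged = []
    # Only paper.pdf is picked up, so generated Markdown never reaches the staging directory.
    for pdf in sorted((db_root / "papers").glob("*/paper.pdf")):
        target = staging / f"{pdf.parent.name}.pdf"  # hypothetical naming scheme
        shutil.copy2(pdf, target)
        staged.append(target)
    return staged
```

Pointing the indexer at a directory that contains only PDFs is what keeps `summary.md` and `equations.md` out of the index.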

## Database Structure

Default database root is `~/.paperpipe/` (override with `PAPER_DB_PATH`; see `papi path`).

```
<paper_db>/
├── index.json                    # Quick lookup index
├── .pqa_papers/                  # PaperQA2 input staging (PDF-only; created on first `papi ask`)
├── .pqa_index/                   # PaperQA2 index cache (created on first `papi ask`)
├── papers/
│   ├── neuralangelo/
│   │   ├── meta.json             # Metadata + tags
│   │   ├── paper.pdf             # For PaperQA2
│   │   ├── source.tex            # Full LaTeX (if available)
│   │   ├── summary.md            # Coding-context summary
│   │   └── equations.md          # Key equations extracted
│   └── neus/
│       └── ...
```
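
If you want to inspect this layout from a script, a minimal sketch could look like the one below. It relies only on the directory structure shown above and the `PAPER_DB_PATH` override; the function names are hypothetical and it does not read `index.json`, whose schema is not described here.

```python
import os
from pathlib import Path

# File names taken from the layout above.
PAPER_FILES = ("meta.json", "paper.pdf", "source.tex", "summary.md", "equations.md")


def default_db_root() -> Path:
    """Resolve the database root: PAPER_DB_PATH if set, else ~/.paperpipe."""
    return Path(os.environ.get("PAPER_DB_PATH", "~/.paperpipe")).expanduser()


def list_paper_files(db_root: Path) -> dict[str, list[str]]:
    """Map each paper directory name to the expected files it actually contains."""
    papers_dir = db_root / "papers"
    if not papers_dir.is_dir():
        return {}
    return {
        paper.name: [name for name in PAPER_FILES if (paper / name).is_file()]
        for paper in sorted(papers_dir.iterdir())
        if paper.is_dir()
    }


if __name__ == "__main__":
    for name, files in list_paper_files(default_db_root()).items():
        print(f"{name}: {', '.join(files) or '(empty)'}")
```

This is handy for spotting papers without `source.tex`, which is only stored when LaTeX is available.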

## Integration with Coding Agents

> **Tip:** See [AGENT_INTEGRATION.md](AGENT_INTEGRATION.md) for a ready-to-use snippet you can append to your
> repo's agent instructions file (for example `AGENTS.md`, `CLAUDE.md`, `GEMINI.md`).

### Claude Code / Codex CLI Skill

paperpipe includes a skill that automatically activates when you ask about papers,
verification, or equations. Install it for Claude Code and/or Codex CLI:

```bash
# Install for both Claude Code and Codex CLI
papi install-skill

# Or install for a specific CLI only
papi install-skill --claude
papi install-skill --codex
```

Restart your CLI after installing the skill.

Most coding-agent CLIs can read local files directly. The best workflow is:

1. Use `papi` to build/manage your paper collection.
2. For code verification, have the agent read `{paper}/equations.md` (and `source.tex` when needed).
3. For research-y questions across many papers, use `papi ask` (PaperQA2).

Minimal snippet to add to your agent instructions:

```markdown
## Paper References (PaperPipe)

PaperPipe manages papers via `papi`. Find the active database root with:
`papi path`

Per-paper files are under `<paper_db>/papers/{paper}/`:
- `equations.md` — best for implementation verification
- `summary.md` — high-level overview
- `source.tex` — exact definitions (if available)

Use `papi search "query"` to find papers/tags quickly.
Use `papi ask "question"` for PaperQA2 multi-paper queries (if installed).
```

If you want paper context inside your repo (useful for agents that can’t access `~`), export it:

```bash
papi export neuralangelo neus --level equations --to ./paper-context/
```

If you want to paste context directly into a terminal agent session, print to stdout:

```bash
papi show neuralangelo neus --level eq
```

## Commands

| Command | Description |
|---------|-------------|
| `papi add <ids-or-urls...>` | Add one or more papers (idempotent by arXiv ID; use `--update`/`--duplicate` for existing) |
| `papi regenerate <papers...>` | Regenerate summary/equations/tags (use `--overwrite name` to rename) |
| `papi regenerate --all` | Regenerate for all papers |
| `papi audit [papers...]` | Audit generated summaries/equations and optionally regenerate flagged papers |
| `papi remove <papers...>` | Remove one or more papers (by name or arXiv ID/URL) |
| `papi list [--tag TAG]` | List papers, optionally filtered by tag |
| `papi search <query>` | Exact search (with fuzzy fallback if no exact matches) across title/tags/metadata + local summaries/equations (use `--exact` to disable fallback; `--tex` includes LaTeX) |
| `papi show <papers...>` | Show paper details or print stored content |
| `papi export <papers...>` | Export context files to a directory |
| `papi ask <query> [args]` | Query papers via PaperQA2 (supports all pqa args) |
| `papi models` | Probe which models work with your API keys |
| `papi tags` | List all tags with counts |
| `papi path` | Print database location |
| `papi install-skill` | Install the papi skill for Claude Code / Codex CLI |
| `--quiet/-q` | Suppress progress messages |
| `--verbose/-v` | Enable debug output |
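
The "exact search with fuzzy fallback" behavior of `papi search` can be illustrated with a small sketch. It is an assumption-laden illustration using the standard library's `difflib`, not paperpipe's actual matching or ranking logic; the function name, the `cutoff` value, and the word-level comparison are all hypothetical.

```python
from difflib import SequenceMatcher


def search(query: str, docs: dict[str, str], exact_only: bool = False, cutoff: float = 0.6) -> list[str]:
    """Return names of docs containing the query; fall back to fuzzy word matching if none do."""
    needle = query.lower()
    exact = [name for name, text in docs.items() if needle in text.lower()]
    if exact or exact_only:
        return exact

    def best_ratio(text: str) -> float:
        # Best similarity between the query and any single word of the text.
        return max(
            (SequenceMatcher(None, needle, word).ratio() for word in text.lower().split()),
            default=0.0,
        )

    scored = [(best_ratio(text), name) for name, text in docs.items()]
    return [name for score, name in sorted(scored, reverse=True) if score >= cutoff]
```

Here `exact_only=True` plays the role of `--exact`: a misspelled query then returns nothing instead of approximate matches.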

## Tagging

Tags come from three sources (the first two are applied automatically):

1. **arXiv categories** → human-readable tags (cs.CV → computer-vision)
2. **LLM-generated** → semantic tags from title/abstract
3. **User-provided** → via `--tags` flag

```bash
# Auto-tags from arXiv + LLM
papi add 2303.13476
# → name: neuralangelo, tags: computer-vision, graphics, neural-radiance-field, sdf, hash-encoding

# Add custom tags (and override auto-name)
papi add 2303.13476 --name my-neuralangelo --tags my-project,priority
```
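
Conceptually, the three sources are merged into one tag list. The sketch below shows one way that merge could work; only the `cs.CV` → `computer-vision` mapping comes from the text above, while the other mapping entries, the normalization rules, and the function name are assumptions for illustration.

```python
# Only cs.CV -> computer-vision is documented above; the other entries are illustrative guesses.
ARXIV_TAGS = {"cs.CV": "computer-vision", "cs.GR": "graphics", "cs.LG": "machine-learning"}


def merge_tags(categories: list[str], llm_tags: list[str], user_tags: list[str]) -> list[str]:
    """Combine the three tag sources into one normalized, de-duplicated list (order preserved)."""
    # Unknown categories fall back to a lowercase, dash-separated form (an assumption).
    mapped = [ARXIV_TAGS.get(cat, cat.lower().replace(".", "-")) for cat in categories]
    merged: list[str] = []
    for tag in [*mapped, *llm_tags, *user_tags]:
        normalized = tag.strip().lower().replace(" ", "-")
        if normalized and normalized not in merged:
            merged.append(normalized)
    return merged
```

De-duplicating after normalization means a tag suggested by both arXiv and the LLM appears only once.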

## Export Levels

```bash
# Just summaries (smallest, good for overview)
papi export neuralangelo neus --level summary

# Equations only (best for code verification)
papi export neuralangelo neus --level equations

# Full LaTeX source (most complete)
papi export neuralangelo neus --level full
```
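
As a mental model, each level selects which stored file is copied out of the database. The sketch below assumes a one-file-per-level mapping and a `<paper>_<file>` naming scheme in the target directory; both are guesses for illustration, and the real `papi export` may name or bundle files differently.

```python
import shutil
from pathlib import Path

# Assumed mapping from --level values to stored files.
LEVEL_FILES = {"summary": "summary.md", "equations": "equations.md", "full": "source.tex"}


def export_context(db_root: Path, papers: list[str], level: str, dest: Path) -> list[Path]:
    """Copy the file for `level` from each paper into dest as <paper>_<file>."""
    filename = LEVEL_FILES[level]  # raises KeyError for an unknown level
    dest.mkdir(parents=True, exist_ok=True)
    exported = []
    for paper in papers:
        src = db_root / "papers" / paper / filename
        if not src.is_file():
            continue  # e.g. source.tex exists only when LaTeX was available
        target = dest / f"{paper}_{filename}"  # hypothetical naming scheme
        shutil.copy2(src, target)
        exported.append(target)
    return exported
```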

## Show Levels (stdout)

```bash
# Metadata (default)
papi show neuralangelo

# Print equations (for piping into agent sessions)
papi show neuralangelo neus --level eq

# Print summary / LaTeX
papi show neuralangelo --level summary
papi show neuralangelo --level tex
```

## Workflow Example

```bash
# 1. Build your paper collection (names auto-generated)
papi add 2303.13476 2106.10689 2104.06405
# → neuralangelo, neus, volsdf

# 2. Research phase: use PaperQA2
papi ask "Compare the volume rendering approaches in NeuS, VolSDF, and Neuralangelo"

# 3. Implementation phase: export equations to project
cd ~/my-neural-surface-project
papi export neuralangelo neus volsdf --level equations --to ./paper-context/

# 4. In Claude Code / Codex / Gemini:
# "Compare my eikonal_loss() implementation with the formulations in paper-context/"

# 5. Clean up: remove papers you no longer need
papi remove volsdf neus
```

## Configuration

Set custom database location:
```bash
export PAPER_DB_PATH=/path/to/your/papers
```

## Environment Setup

To use PaperQA2 via `papi ask` with the built-in default models, set the environment variables for your
chosen provider (PaperQA2 uses LiteLLM identifiers for `--llm` and `--embedding`).

| Provider | Required Env Var | Used For |
|----------|------------------|----------|
| **Google** | `GEMINI_API_KEY` | Gemini models & embeddings |
| **Anthropic** | `ANTHROPIC_API_KEY` | Claude models |
| **Voyage AI** | `VOYAGE_API_KEY` | Embeddings (recommended when using Claude) |
| **OpenAI** | `OPENAI_API_KEY` | GPT models & embeddings |
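
Before running `papi ask`, it can help to check which of these keys are actually set. The following is a small standalone sketch based on the table above; the helper name is hypothetical and it only checks that a variable is non-empty, not that the key is valid (use `papi models` for a live check).

```python
import os
from collections.abc import Mapping

# Provider -> environment variable, taken from the table above.
PROVIDER_KEYS = {
    "Google": "GEMINI_API_KEY",
    "Anthropic": "ANTHROPIC_API_KEY",
    "Voyage AI": "VOYAGE_API_KEY",
    "OpenAI": "OPENAI_API_KEY",
}


def configured_providers(env: Mapping[str, str] = os.environ) -> list[str]:
    """Return the providers whose API key variable is set to a non-empty value."""
    return [provider for provider, var in PROVIDER_KEYS.items() if env.get(var, "").strip()]


if __name__ == "__main__":
    print("Configured providers:", ", ".join(configured_providers()) or "none")
```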

## LLM Support

For better summaries and equation extraction, install with LLM support:

```bash
pip install 'paperpipe[llm]'
# or with uv:
uv pip install 'paperpipe[llm]'
```

This installs LiteLLM, which supports many providers. Set the appropriate API key:

```bash
export GEMINI_API_KEY=...      # For Gemini (default)
export OPENAI_API_KEY=...      # For OpenAI/GPT
export ANTHROPIC_API_KEY=...   # For Claude
```

paperpipe defaults to `gemini/gemini-3-flash-preview`. Override via:
```bash
export PAPERPIPE_LLM_MODEL=gpt-4o  # or any LiteLLM model identifier
```

You can also tune LLM generation:
```bash
export PAPERPIPE_LLM_TEMPERATURE=0.3  # default: 0.3
```
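
The two overrides above amount to reading environment variables with documented defaults. A minimal sketch of that lookup is shown here; the function name is hypothetical, and falling back to the default on an unparsable temperature is an assumption rather than confirmed paperpipe behavior.

```python
import os
from collections.abc import Mapping

DEFAULT_MODEL = "gemini/gemini-3-flash-preview"
DEFAULT_TEMPERATURE = 0.3


def llm_settings(env: Mapping[str, str] = os.environ) -> tuple[str, float]:
    """Read the model and temperature overrides, falling back to the documented defaults."""
    model = env.get("PAPERPIPE_LLM_MODEL") or DEFAULT_MODEL
    raw = env.get("PAPERPIPE_LLM_TEMPERATURE")
    try:
        temperature = float(raw) if raw else DEFAULT_TEMPERATURE
    except ValueError:
        # Assumption: an unparsable value falls back to the default instead of raising.
        temperature = DEFAULT_TEMPERATURE
    return model, temperature
```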

Without LLM support, paperpipe falls back to:
- Metadata + section headings from LaTeX
- Regex-based equation extraction
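
To give a feel for what regex-based equation extraction involves, here is a simplified sketch. It is not paperpipe's actual pattern: the set of environments and the function name are assumptions, and it ignores edge cases such as nested environments or `\\[2pt]` line-break spacing outside math.

```python
import re

# Matches common display-math environments (starred or not) and \[ ... \] blocks.
EQUATION_RE = re.compile(
    r"\\begin\{(equation|align|gather|multline)(\*?)\}(.*?)\\end\{\1\2\}|\\\[(.*?)\\\]",
    re.DOTALL,
)


def extract_equations(tex: str) -> list[str]:
    """Return the bodies of display-math blocks found in a LaTeX string, in document order."""
    equations = []
    for match in EQUATION_RE.finditer(tex):
        # Group 3 holds an environment body, group 4 a \[ ... \] body.
        body = match.group(3) if match.group(3) is not None else match.group(4)
        equations.append(body.strip())
    return equations
```

A regex pass like this recovers the raw LaTeX but no explanation of the symbols, which is where the LLM-backed path adds value.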

## PaperQA2 Integration

When both paperpipe and [PaperQA2](https://github.com/Future-House/paper-qa) are installed, they share the same PDFs:

```bash
# paperpipe stores PDFs in <paper_db>/papers/*/paper.pdf (see `papi path`)
# `papi ask` stages PDFs under <paper_db>/.pqa_papers/ so PaperQA2 doesn’t index generated Markdown.
# `papi ask` routes complex queries to PaperQA2

papi ask "What optimizer settings do these papers recommend?"

# PaperQA2 uses LiteLLM model identifiers for `--llm` and `--embedding`.
# You can also pass through any other `pqa ask` flags after the query/options.
# By default, `papi ask` uses `pqa --settings default` to avoid failures caused by stale user
# settings files; pass `-s/--settings <name>` to use a specific PaperQA2 settings profile.
# `papi ask` also defaults to `--llm gemini/gemini-3-flash-preview` and `--embedding gemini/gemini-embedding-001`
# unless you pick a PaperQA2 settings profile with `-s/--settings` (in that case, the profile controls).
# If Pillow is not installed, `papi ask` also forces `--parsing.multimodal OFF` to avoid PDF
# image extraction errors; pass your own `--parsing...` args to override.
#
# Examples (specify LLM + embedding):
# Gemini 3 Flash + Google Embeddings
papi ask "Explain the architecture" --llm "gemini/gemini-3-flash-preview" --embedding "gemini/gemini-embedding-001"

# Gemini 3 Pro + Google Embeddings
papi ask "Give a detailed derivation of eq. 4 and explain implementation pitfalls" --llm "gemini/gemini-3-pro-preview" --embedding "gemini/gemini-embedding-001"

# Claude Sonnet 4.5 + Voyage AI Embeddings
papi ask "Compare the loss functions" --llm "claude-sonnet-4-5" --embedding "voyage/voyage-3-large"

# GPT-5.2 + OpenAI Embeddings
papi ask "How to implement eq 4?" --llm "gpt-5.2" --embedding "text-embedding-3-large"

# Pass any arbitrary PaperQA2 arguments (e.g., temperature, verbosity)
papi ask "Summarize the methods" --summary-llm gpt-4o-mini --temperature 0.2 --verbosity 2
```

### Model Probing

To see which model IDs work with your currently configured API keys (this makes small live API calls):

```bash
papi models
# (default: probes one "latest" completion model and one embedding model per provider for
# which you have an API key set; pass `latest` (or `--preset latest`) to probe a broader list.)
# or probe specific models only:
papi models --kind completion --model gemini/gemini-3-flash-preview --model gemini/gemini-2.5-flash --model gpt-4o-mini
papi models --kind embedding --model gemini/gemini-embedding-001 --model text-embedding-3-small
# probe "latest" defaults (gpt-5.2/5.1, gemini 3 preview, claude-sonnet-4-5; plus text-embedding-3-large if enabled):
papi models latest
# probe "last-gen" defaults (gpt-4.1/4o, gemini 2.5, older/smaller embeddings; Claude 3.5 is retired):
papi models last-gen
# probe a broader superset:
papi models all
# show underlying provider errors (noisy):
papi models --verbose
```

## Non-arXiv Papers

paperpipe currently focuses on arXiv ingestion (`papi add <arxiv-id-or-url>`). For papers not on arXiv, you can still
store files for agents to read, but they will not show up in `papi list` or `papi search` unless you also add
index/meta entries.

```bash
PAPER_DB="$(papi path)"
mkdir -p "$PAPER_DB/papers/my-paper"
cp /path/to/paper.pdf "$PAPER_DB/papers/my-paper/paper.pdf"
# Create:
# - "$PAPER_DB/papers/my-paper/summary.md"
# - "$PAPER_DB/papers/my-paper/equations.md"
# (optional) "$PAPER_DB/papers/my-paper/source.tex"
```
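
The same manual steps can be scripted. The sketch below mirrors the shell commands above in Python; the function name and parameters are hypothetical, and it deliberately does not write index or meta entries (their schema is not documented here), so the paper still will not appear in `papi list` or `papi search`.

```python
import shutil
from pathlib import Path


def add_manual_paper(db_root: Path, name: str, pdf: Path, summary: str = "", equations: str = "") -> Path:
    """Create papers/<name>/ with paper.pdf, summary.md and equations.md; return the directory."""
    paper_dir = db_root / "papers" / name
    paper_dir.mkdir(parents=True, exist_ok=True)
    shutil.copy2(pdf, paper_dir / "paper.pdf")
    # Hand-written notes go into the same files paperpipe generates for arXiv papers.
    (paper_dir / "summary.md").write_text(summary, encoding="utf-8")
    (paper_dir / "equations.md").write_text(equations, encoding="utf-8")
    return paper_dir
```

Pass the path printed by `papi path` as `db_root`.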

## Credits

- **[PaperQA2](https://github.com/Future-House/paper-qa)** by Future House — the RAG engine powering `papi ask`.
  *Skarlinski et al., "Language Agents Achieve Superhuman Synthesis of Scientific Knowledge", 2024.*
  [arXiv:2409.13740](https://arxiv.org/abs/2409.13740)

## License

MIT (see [LICENSE](LICENSE))
