Metadata-Version: 2.4
Name: crab-scholar
Version: 0.2.0
Summary: Research paper analysis pipeline with citation crawling, pluggable LLM prompts, and knowledge graph building
Project-URL: Repository, https://github.com/imnotdev25/crab-scholar
Author: imnotdev25
License: MIT
Keywords: citation-analysis,evaluation,knowledge-graph,llm,papers,research
Classifier: Development Status :: 3 - Alpha
Classifier: Intended Audience :: Developers
Classifier: Intended Audience :: Science/Research
Classifier: License :: OSI Approved :: MIT License
Classifier: Programming Language :: Python :: 3
Classifier: Programming Language :: Python :: 3.11
Classifier: Programming Language :: Python :: 3.12
Classifier: Topic :: Scientific/Engineering :: Artificial Intelligence
Classifier: Topic :: Text Processing :: Linguistic
Requires-Python: >=3.11
Requires-Dist: beautifulsoup4>=4.12.0
Requires-Dist: httpx>=0.27.0
Requires-Dist: kreuzberg>=4.0.0
Requires-Dist: litellm>=1.0.0
Requires-Dist: networkx>=3.2
Requires-Dist: pydantic-settings>=2.1.0
Requires-Dist: pydantic>=2.5.0
Requires-Dist: pyvis>=0.3.0
Requires-Dist: pyyaml>=6.0.1
Requires-Dist: rich>=13.0.0
Requires-Dist: typer>=0.9.0
Requires-Dist: unidecode>=1.3.0
Description-Content-Type: text/markdown

# 🦀 CrabScholar

Research paper analysis pipeline with citation crawling, pluggable LLM prompts, and knowledge graph building.

## Features

- **Multi-input**: Analyze papers by title, DOI, keywords, URL, local PDF, or raw text
- **Citation Crawling**: BFS traversal of references/citations via Semantic Scholar API (configurable depth, default 3)
- **5 Default Analysis Dimensions** (LLM Evaluation focus):
  1. **Paper Analysis** — overview, contributions, methodology
  2. **Dataset Crafting** — data creation, annotation, preprocessing
  3. **Evaluation Method** — benchmarks, baselines, evaluation setup
  4. **Metrics** — specific metrics, reported results
  5. **Statistical Tests** — significance tests, confidence intervals, rigor
- **Pluggable Prompts**: Add YAML files for custom dimensions, override defaults
- **Knowledge Graph**: NetworkX-based graph with paper/author/method/dataset/metric entities
- **Multi-Provider LLM**: Via LiteLLM — OpenAI, Anthropic, Ollama, vLLM, etc. with fallback chain
- **Export**: JSON, GraphML, GEXF, CSV

## Installation

```bash
uv sync
```

## Quick Start

```bash
# Initialize project config
uv run crab init

# Edit .env with your API key
nano .env

# Analyze a paper by title
uv run crab analyze "attention is all you need"

# Search by keywords
uv run crab analyze --keywords "LLM evaluation, benchmark contamination"

# Analyze a local PDF
uv run crab analyze --pdf paper.pdf

# Control crawl depth
uv run crab analyze "GPT-4 Technical Report" --depth 5

# Search without analyzing
uv run crab search "transformer evaluation"

# Build knowledge graph from results
uv run crab build

# Export graph
uv run crab export json
uv run crab export graphml
uv run crab export csv

# List analysis dimensions
uv run crab dimensions

# Show config
uv run crab info
```

## Configuration

Settings load from: CLI flags > env vars (`CRAB_` prefix) > `.env` > `crab.yaml` > defaults.

```yaml
# crab.yaml
default_model: openai/gpt-4o-mini
fallback_models:
  - openai/gpt-3.5-turbo
  - anthropic/claude-3-haiku-20240307

citation_depth: 3
max_papers: 50
output: output
concurrency: 4
```

## Custom Prompts

Create YAML files in a custom directory:

```yaml
# my_prompts/bias_analysis.yaml
name: bias_analysis
display_name: "Bias Analysis"
description: "Analyze papers for bias in LLM evaluation"
system_message: "You are a bias analysis expert..."
extraction_prompt: |
  Analyze the paper for potential biases...
  Paper: {title}
  Text: {paper_text}
  ...
```

Then use: `uv run crab analyze "paper" --prompts-dir my_prompts/`

## Python API

```python
from crab_scholar.pipeline import run_pipeline
from crab_scholar.config import CrabConfig

config = CrabConfig(
    default_model="openai/gpt-4o-mini",
    citation_depth=3,
)

kg = run_pipeline(input_query="attention is all you need", config=config)
print(f"Entities: {kg.entity_count}, Relations: {kg.relation_count}")
```

## Architecture

```
Input (query/DOI/PDF/text)
    ↓
Scholar API → Resolve paper
    ↓
BFS Crawler → Expand citations/references (depth=N)
    ↓
Fetcher → Download PDFs, extract text
    ↓
Analyzer → Run pluggable dimensions (5 defaults)
    ↓
Graph Builder → Entities + Relations → NetworkX
    ↓
Export → JSON / GraphML / GEXF / CSV
```

## License

MIT
