Metadata-Version: 2.4
Name: readanybook
Version: 0.1.22
Summary: A RAG-based cheat sheet generator for books and papers
Home-page: https://github.com/readanybook/readanybook
Author: ReadAnyBook Team
Author-email: ReadAnyBook Team <team@readanybook.dev>
License: MIT
Project-URL: Homepage, https://github.com/readanybook/readanybook
Project-URL: Documentation, https://readanybook.dev/docs
Project-URL: Repository, https://github.com/readanybook/readanybook
Project-URL: Issues, https://github.com/readanybook/readanybook/issues
Keywords: rag,retrieval-augmented-generation,cheat-sheet,pdf,summarization,llm,nlp,education
Classifier: Development Status :: 3 - Alpha
Classifier: Intended Audience :: Developers
Classifier: Intended Audience :: Science/Research
Classifier: Intended Audience :: Education
Classifier: License :: OSI Approved :: MIT License
Classifier: Operating System :: OS Independent
Classifier: Programming Language :: Python :: 3
Classifier: Programming Language :: Python :: 3.10
Classifier: Programming Language :: Python :: 3.11
Classifier: Programming Language :: Python :: 3.12
Classifier: Topic :: Scientific/Engineering :: Artificial Intelligence
Classifier: Topic :: Text Processing :: General
Classifier: Topic :: Education
Requires-Python: >=3.10
Description-Content-Type: text/markdown
Requires-Dist: pydantic>=2.0.0
Requires-Dist: pydantic-settings>=2.0.0
Requires-Dist: PyYAML>=6.0
Requires-Dist: pypdf>=3.0.0
Requires-Dist: pdfplumber>=0.9.0
Requires-Dist: pymupdf>=1.23.0
Requires-Dist: ebooklib>=0.18
Requires-Dist: beautifulsoup4>=4.12.0
Requires-Dist: lxml>=4.9.0
Requires-Dist: chardet>=5.0.0
Requires-Dist: sentence-transformers>=2.2.0
Requires-Dist: transformers>=4.30.0
Requires-Dist: torch>=2.0.0
Requires-Dist: chromadb>=0.4.0
Requires-Dist: tiktoken>=0.5.0
Requires-Dist: rank-bm25>=0.2.2
Requires-Dist: jinja2>=3.1.0
Requires-Dist: tqdm>=4.65.0
Requires-Dist: numpy>=1.24.0
Provides-Extra: cli
Requires-Dist: typer>=0.9.0; extra == "cli"
Requires-Dist: rich>=13.0.0; extra == "cli"
Provides-Extra: api
Requires-Dist: fastapi>=0.100.0; extra == "api"
Requires-Dist: uvicorn>=0.23.0; extra == "api"
Requires-Dist: python-multipart>=0.0.6; extra == "api"
Provides-Extra: qdrant
Requires-Dist: qdrant-client>=1.6.0; extra == "qdrant"
Provides-Extra: weaviate
Requires-Dist: weaviate-client>=3.24.0; extra == "weaviate"
Provides-Extra: ollama
Requires-Dist: ollama>=0.1.0; extra == "ollama"
Provides-Extra: openai
Requires-Dist: openai>=1.0.0; extra == "openai"
Provides-Extra: quantization
Requires-Dist: bitsandbytes>=0.41.0; extra == "quantization"
Requires-Dist: accelerate>=0.24.0; extra == "quantization"
Provides-Extra: eval
Requires-Dist: nltk>=3.8.0; extra == "eval"
Requires-Dist: rouge-score>=0.1.2; extra == "eval"
Provides-Extra: observability
Requires-Dist: opentelemetry-api>=1.20.0; extra == "observability"
Requires-Dist: opentelemetry-sdk>=1.20.0; extra == "observability"
Requires-Dist: opentelemetry-exporter-otlp>=1.20.0; extra == "observability"
Provides-Extra: mlops
Requires-Dist: mlflow>=2.0.0; extra == "mlops"
Requires-Dist: wandb>=0.15.0; extra == "mlops"
Provides-Extra: dev
Requires-Dist: pytest>=7.0.0; extra == "dev"
Requires-Dist: pytest-cov>=4.0.0; extra == "dev"
Requires-Dist: pytest-asyncio>=0.21.0; extra == "dev"
Requires-Dist: requests>=2.28.0; extra == "dev"
Requires-Dist: black>=23.0.0; extra == "dev"
Requires-Dist: isort>=5.12.0; extra == "dev"
Requires-Dist: mypy>=1.5.0; extra == "dev"
Requires-Dist: ruff>=0.1.0; extra == "dev"
Provides-Extra: all
Requires-Dist: readanybook[api,cli,eval,mlops,observability,ollama,openai,qdrant,quantization]; extra == "all"
Dynamic: author
Dynamic: home-page
Dynamic: requires-python

# ReadAnyBook 📚

[![PyPI Version](https://img.shields.io/pypi/v/readanybook.svg)](https://pypi.org/project/readanybook/)
[![Python 3.10+](https://img.shields.io/badge/python-3.10+-blue.svg)](https://www.python.org/downloads/)
[![License: MIT](https://img.shields.io/badge/License-MIT-yellow.svg)](https://opensource.org/licenses/MIT)
[![Tests](https://img.shields.io/badge/tests-101%20passed-brightgreen.svg)](https://github.com/readanybook/readanybook)

A RAG-based cheat sheet generator that transforms books and papers into structured, 12-page LaTeX cheat sheets.

## 🚀 Try It Now

[![Open In Kaggle](https://kaggle.com/static/images/open-in-kaggle.svg)](https://www.kaggle.com/code/ameck/readanybook-demo)
[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/ameck/readanybook/blob/main/examples/readanybook_demo.ipynb)

## Features

- **Multi-format Document Support**: PDF, EPUB, HTML, LaTeX, Markdown
- **Intelligent Chunking**: Math-aware and code-aware text splitting
- **Hybrid Retrieval**: Dense embeddings + BM25 with reciprocal rank fusion
- **Multi-pass Generation**: Separate extraction for concepts, formulas, algorithms, and models
- **LaTeX Output**: Professional cheat sheets compiled to PDF
- **Multiple LLM Backends**: HuggingFace, Ollama, vLLM, OpenAI-compatible APIs
- **Vector Store Options**: ChromaDB, Qdrant, Weaviate
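
The hybrid retrieval bullet above names reciprocal rank fusion. As a rough illustration of that technique (not ReadAnyBook's actual implementation), the sketch below fuses a dense ranking and a BM25 ranking. The function name, the chunk IDs, and the constant `k=60` are assumptions chosen for the example.

```python
from collections import defaultdict


def reciprocal_rank_fusion(rankings, k=60):
    """Fuse several ranked lists of document IDs into one ranking.

    Each document scores sum(1 / (k + rank)) over the lists it appears in,
    with ranks starting at 1. k=60 is a commonly used constant; it is an
    assumption here, not a documented ReadAnyBook default.
    """
    scores = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    # Highest fused score first; ties broken by document ID for determinism.
    return sorted(scores, key=lambda doc_id: (-scores[doc_id], doc_id))


# Hypothetical result lists from the two retrievers.
dense_hits = ["chunk_7", "chunk_2", "chunk_9"]   # embedding search
sparse_hits = ["chunk_2", "chunk_4", "chunk_7"]  # BM25
fused = reciprocal_rank_fusion([dense_hits, sparse_hits])
print(fused)
```

Chunks that appear in both lists accumulate two contributions, so they tend to rise above chunks found by only one retriever.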

## Quick Start

### Installation

```bash
# Basic installation
pip install readanybook

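# With CLI support (quotes keep shells such as zsh from expanding the brackets)
pip install "readanybook[cli]"

# With the "all" extra (every optional extra except weaviate and dev)
pip install "readanybook[all]"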
```

### From Source

```bash
git clone https://github.com/readanybook/readanybook.git
cd readanybook
pip install -e ".[dev]"
```

### Usage

#### Command Line

```bash
# Generate a cheat sheet from a PDF
read-any-book build document.pdf -o cheatsheet.pdf

# Use a specific profile
read-any-book build document.pdf --profile math_paper

# Index a document
read-any-book index document.pdf --collection my_collection

# Search indexed documents
read-any-book search "gradient descent" --collection my_collection
```

#### Python API

```python
from readanybook import CheatSheetPipeline, Settings

# Initialize pipeline
settings = Settings()
pipeline = CheatSheetPipeline(settings)

# Process document
pipeline.ingest("textbook.pdf")
pipeline.index(collection_name="textbook")

# Generate cheat sheet
content = pipeline.generate_content()
cheat_sheet = pipeline.build(content, "output/cheatsheet.pdf")

print(f"Generated: {cheat_sheet.pdf_path}")
```

#### Quick Start (One-Liner)

```python
from readanybook import build_cheatsheet

# Generate a cheat sheet with a single function call
output = build_cheatsheet(
    "textbook.pdf",
    llm_backend="huggingface",      # or "ollama" for local
    llm_model="Qwen/Qwen2.5-1.5B-Instruct",
    output_format="markdown",        # "latex", "markdown", or "both"
    in_memory=True                   # Required for Kaggle/Colab
)
```

#### REST API

```bash
# Start the API server
uvicorn readanybook.api:app --host 0.0.0.0 --port 8000

# Upload a document
curl -X POST "http://localhost:8000/upload" \
  -F "file=@document.pdf" \
  -F "collection_name=my_docs"

# Generate cheat sheet
curl -X POST "http://localhost:8000/generate" \
  -H "Content-Type: application/json" \
  -d '{"collection_name": "my_docs", "title": "My Cheat Sheet"}'
```

## Configuration

Create a `config.yaml` file or use environment variables:

```yaml
# Embedding model
embedding:
  model_name: "BAAI/bge-base-en-v1.5"
  device: "cuda"

# Vector store
vectordb:
  store_type: "chroma"
  persist_directory: "./data/chroma"

# LLM settings
llm:
  backend: "ollama"
  model_name: "llama3:8b"

# Retrieval
retrieval:
  mode: "hybrid"
  top_k: 15

# LaTeX output
latex:
  columns: 2
  font_size: 10
  paper_size: "a4paper"
```

### Configuration Profiles

Use built-in profiles for different document types:

```bash
# For technical books
read-any-book build book.pdf --profile technical_book

# For math papers
read-any-book build paper.pdf --profile math_paper

# For non-technical books
read-any-book build novel.pdf --profile nontechnical_book
```
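
One plausible way to think about a profile is as a partial override layered on top of the default settings. The sketch below illustrates that idea with a recursive dictionary merge. It is an assumption about how profiles could work, not a description of ReadAnyBook's internals, and the `MATH_PAPER_PROFILE` values are hypothetical.

```python
from copy import deepcopy


def deep_merge(base, override):
    """Return a new dict where override's values win, merging nested dicts."""
    merged = deepcopy(base)
    for key, value in override.items():
        if isinstance(value, dict) and isinstance(merged.get(key), dict):
            merged[key] = deep_merge(merged[key], value)
        else:
            merged[key] = deepcopy(value)
    return merged


# Defaults mirror the config.yaml example; the profile values are hypothetical.
DEFAULTS = {
    "retrieval": {"mode": "hybrid", "top_k": 15},
    "latex": {"columns": 2, "font_size": 10, "paper_size": "a4paper"},
}
MATH_PAPER_PROFILE = {"retrieval": {"top_k": 25}, "latex": {"columns": 3}}

effective = deep_merge(DEFAULTS, MATH_PAPER_PROFILE)
print(effective["retrieval"])
print(effective["latex"])
```

Keys the profile does not mention (here `mode`, `font_size`, and `paper_size`) keep their default values, and `DEFAULTS` itself is left unmodified.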

## Architecture

ReadAnyBook follows a modular pipeline architecture with clear separation between ingestion, retrieval, and generation layers.

### System Design

```
┌─────────────┐    ┌─────────────┐    ┌─────────────┐    ┌─────────────┐    ┌─────────────┐    ┌─────────────┐
│  Ingestion  │───▶│  Chunking   │───▶│  Indexing   │───▶│  Retrieval  │───▶│ Generation  │───▶│   LaTeX     │
│  (PDF/EPUB) │    │  (Math-     │    │ (Embeddings)│    │  (Hybrid    │    │  (LLM +     │    │  (Compile   │
│             │    │   aware)    │    │             │    │   Search)   │    │   RAG)      │    │   to PDF)   │
└─────────────┘    └─────────────┘    └──────┬──────┘    └──────┬──────┘    └──────┬──────┘    └─────────────┘
                                             │                  │                  │
                                             ▼                  ▼                  ▼
                                      ┌─────────────┐    ┌─────────────┐    ┌─────────────┐
                                      │ Vector DB   │    │ BM25 Index  │    │ LLM Backend │
                                      │ (ChromaDB)  │    │             │    │  (Ollama/   │
                                      │             │    │             │    │   HF/vLLM)  │
                                      └─────────────┘    └─────────────┘    └─────────────┘
```

### Key Components

| Component | Description |
|-----------|-------------|
| **Ingestion** | Multi-format document parsing (PDF, EPUB, HTML, LaTeX, Markdown) |
| **Chunking** | Hierarchical, semantic, or fixed-size with math/code awareness |
| **Indexing** | BGE/E5 embeddings stored in ChromaDB/Qdrant/Weaviate |
| **Retrieval** | Hybrid dense+sparse search with reciprocal rank fusion (RRF) and cross-encoder reranking |
| **Generation** | Multi-pass extraction: concepts, formulas, algorithms, models |
| **Output** | Jinja2 LaTeX templates compiled to 12-page PDF |
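
To make the "math awareness" in the Chunking row concrete, here is a minimal sketch of a splitter that never cuts through a display-math block. It assumes math is delimited by `$$ ... $$` or an `equation` environment. The function names and the `max_chars` limit are illustrative and do not come from the package.

```python
import re

# Display math: $$ ... $$ or \begin{equation} ... \end{equation}.
MATH_BLOCK = re.compile(
    r"\$\$.*?\$\$|\\begin\{equation\}.*?\\end\{equation\}", re.DOTALL
)


def _paragraphs(text):
    """Split prose on blank lines and drop empty pieces."""
    return [p.strip() for p in re.split(r"\n\s*\n", text) if p.strip()]


def split_atomic(text):
    """Split text into segments, keeping each display-math block whole."""
    segments, pos = [], 0
    for match in MATH_BLOCK.finditer(text):
        segments.extend(_paragraphs(text[pos:match.start()]))
        segments.append(match.group(0))
        pos = match.end()
    segments.extend(_paragraphs(text[pos:]))
    return segments


def chunk_math_aware(text, max_chars=200):
    """Greedily pack segments into chunks of at most max_chars.

    A single segment longer than max_chars becomes its own oversized
    chunk rather than being cut in the middle.
    """
    chunks, current = [], ""
    for segment in split_atomic(text):
        candidate = f"{current}\n\n{segment}" if current else segment
        if current and len(candidate) > max_chars:
            chunks.append(current)
            current = segment
        else:
            current = candidate
    if current:
        chunks.append(current)
    return chunks


sample = "Intro paragraph.\n\n$$\na = b\n\n+ c\n$$\n\nClosing paragraph."
for chunk in chunk_math_aware(sample, max_chars=20):
    print(repr(chunk))
```

The blank line inside the formula would split it under a naive paragraph splitter. Here the formula stays in one chunk because math blocks are extracted before the prose is split.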

### Package Structure

```
readanybook/
├── core/           # Domain logic
│   ├── ingestion.py    # Document parsing
│   ├── chunking.py     # Text splitting
│   ├── indexing.py     # Embedding & indexing
│   ├── retrieval.py    # Hybrid retrieval
│   ├── models.py       # LLM clients
│   ├── prompts.py      # Prompt templates
│   └── pipeline.py     # Main orchestrator
├── generation/     # Content generation
│   ├── concepts.py     # Concept extraction
│   ├── formulas.py     # Formula extraction
│   ├── algorithms.py   # Algorithm synthesis
│   ├── models_theory.py # Model summarization
│   └── latex_builder.py # LaTeX generation
├── evaluation/     # Quality metrics
│   ├── rag_eval.py     # RAG evaluation
│   └── metrics.py      # Content metrics
├── infra/          # Infrastructure
│   ├── settings.py     # Configuration
│   ├── vectordb.py     # Vector stores
│   ├── logging.py      # Logging
│   └── tracing.py      # Observability
├── api/            # REST API
├── cli/            # Command line interface
├── templates/      # LaTeX templates
└── config/         # Default configs
```

### Design Principles

- **Hexagonal Architecture**: Domain services isolated from external adapters
- **Configuration-Driven**: All behavior controlled via Pydantic settings
- **Pluggable Backends**: LLM, vector store, and embedding model abstractions
- **Observability**: Structured logging and tracing throughout
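
The hexagonal and pluggable-backend principles above can be illustrated with a small port-and-adapter sketch. Every name in it (`LLMBackend`, `EchoBackend`, `summarize_concept`) is hypothetical and chosen for the example. The package's real interfaces may look different.

```python
from typing import Protocol


class LLMBackend(Protocol):
    """Port: the only thing domain code needs from a language model."""

    def generate(self, prompt: str) -> str: ...


class EchoBackend:
    """Stand-in adapter for tests; a real adapter would call an LLM service."""

    def generate(self, prompt: str) -> str:
        return f"[echo] {prompt}"


def summarize_concept(backend: LLMBackend, passage: str) -> str:
    """Domain service: depends on the port, not on any concrete backend."""
    return backend.generate(f"Summarize the key concept in: {passage}")


print(summarize_concept(EchoBackend(), "Gradient descent minimizes a loss."))
```

Because `summarize_concept` only relies on the `generate` method, swapping one backend for another would not require changes to the domain code in a design like this.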

📄 **Full documentation**: See [docs/architecture.pdf](docs/architecture.pdf) for the complete software architecture document.

## Requirements

- Python 3.10+
- PyTorch 2.0+
- LaTeX distribution (for PDF compilation)
  - TeX Live, MiKTeX, or Tectonic

### LaTeX Installation

```bash
# Ubuntu/Debian
sudo apt install texlive-full

# macOS
brew install --cask mactex

# Or use Tectonic (lightweight)
cargo install tectonic
```

## Development

```bash
# Install dev dependencies
pip install -e ".[dev]"

# Run tests
pytest

# Format code
black readanybook tests
isort readanybook tests

# Type check
mypy readanybook

# Lint
ruff check readanybook
```

## 📓 Kaggle Experiment

We tested ReadAnyBook on Kaggle with Hull's "Options, Futures, and Other Derivatives" (902 pages) using a T4 GPU:

| Model | Backend | Time | Notes |
|-------|---------|------|-------|
| `microsoft/phi-2` | HuggingFace | ~15 min | Fast, basic quality |
| `Qwen/Qwen2.5-1.5B-Instruct` | HuggingFace | ~20 min | Better reasoning |
| `meta-llama/Llama-3.2-3B-Instruct` | HuggingFace | ~30 min | Best quality (needs HF token) |

**Note**: Use `latex_only=True` on Kaggle/Colab since `pdflatex` is not available.

[![Open In Kaggle](https://kaggle.com/static/images/open-in-kaggle.svg)](https://www.kaggle.com/code/ameck/readanybook-demo)

## Examples

See the [examples](examples/) directory for:

- 📓 **[readanybook_demo.ipynb](examples/readanybook_demo.ipynb)** - Interactive notebook tutorial
- Processing academic papers
- Creating ML textbook cheat sheets
- Custom template usage
- API integration examples

## Testing

The ReadAnyBook test suite contains **101 tests** across three levels:

```bash
# Run all tests
pytest tests/

# Run specific test suites
pytest tests/unit/           # 81 unit tests
pytest tests/integration/    # 6 integration tests
pytest tests/e2e/            # 14 e2e tests (uses Project Gutenberg books)
```

**Test Coverage:**
- **Unit Tests**: Core modules (chunking, ingestion, indexing, retrieval), summarization passes, experiment tracking, regression suite
- **Integration Tests**: Pipeline flows, retrieval service, orchestrator integration
- **E2E Tests**: Full pipeline with real books from Project Gutenberg (Art of War, Flatland, Metamorphosis)

## License

MIT License - see [LICENSE](LICENSE) for details.

## Contributing

Contributions welcome! Please read [CONTRIBUTING.md](CONTRIBUTING.md) first.

## Acknowledgments

- Built with 🤗 Transformers, ChromaDB, and FastAPI
- Inspired by the need for better study materials
