Metadata-Version: 2.4
Name: mineru-rag
Version: 0.1.1
Summary: A Python package for MinerU document processing and RAG knowledge base construction
Author-email: zhangshuo <zs1907159989@gmail.com>
License: MIT
Project-URL: Homepage, https://github.com/yourusername/mineru-rag
Project-URL: Documentation, https://github.com/yourusername/mineru-rag#readme
Project-URL: Repository, https://github.com/yourusername/mineru-rag
Project-URL: Issues, https://github.com/yourusername/mineru-rag/issues
Keywords: mineru,rag,document-processing,llm,knowledge-base
Classifier: Development Status :: 3 - Alpha
Classifier: Intended Audience :: Developers
Classifier: Topic :: Software Development :: Libraries :: Python Modules
Classifier: License :: OSI Approved :: MIT License
Classifier: Programming Language :: Python :: 3
Classifier: Programming Language :: Python :: 3.8
Classifier: Programming Language :: Python :: 3.9
Classifier: Programming Language :: Python :: 3.10
Classifier: Programming Language :: Python :: 3.11
Classifier: Programming Language :: Python :: 3.12
Requires-Python: >=3.8
Description-Content-Type: text/markdown
License-File: LICENSE
Requires-Dist: requests>=2.31.0
Provides-Extra: rag
Requires-Dist: langchain>=0.3.0; extra == "rag"
Requires-Dist: langchain-openai>=0.2.0; extra == "rag"
Requires-Dist: langchain-community>=0.3.0; extra == "rag"
Requires-Dist: faiss-cpu>=1.7.0; extra == "rag"
Requires-Dist: sentence-transformers>=2.2.0; extra == "rag"
Dynamic: license-file

# MinerU RAG

A Python package for MinerU document processing and RAG (Retrieval-Augmented Generation) knowledge base construction.

## Features

- 📄 **MinerU Integration**: Supports both the online MinerU API and a local vLLM backend
- 🤖 **RAG Knowledge Base**: Easy-to-use RAG system for building knowledge bases
- 🔗 **LLM Connection**: Seamless integration with LLM APIs
- 🚀 **Simple API**: Clean and intuitive Python API

## Installation

```bash
pip install mineru-rag
```

For RAG functionality, install with extras:

```bash
pip install "mineru-rag[rag]"
```

**📖 For detailed usage documentation, see [USER_GUIDE.md](USER_GUIDE.md)**

## Quick Start

### 1. Process Documents with MinerU

#### Using Online API

```python
from mineru_rag import MinerUClient

# Initialize client with API token
client = MinerUClient(api_token="your-mineru-api-token")

# Process a single file
result = client.process_file(
    input_path="document.pdf",
    output_path="./output"
)

# Process multiple files
results = client.process_files_batch(
    file_paths=["doc1.pdf", "doc2.pdf"],
    output_dir="./output"
)
```

#### Using Local vLLM Backend

```python
from mineru_rag import MinerUClient

# Initialize client for local mode
# Make sure MinerU vLLM backend is running at http://127.0.0.1:30000
client = MinerUClient(use_local=True, local_url="http://127.0.0.1:30000")

# Process files (same API as online mode)
result = client.process_file(
    input_path="document.pdf",
    output_path="./output"
)
```

### 2. Build RAG Knowledge Base

```python
from mineru_rag import RAGBuilder
from pathlib import Path

# Initialize RAG builder
rag = RAGBuilder()

# Build from processed markdown files
markdown_files = [
    Path("./output/doc1/full.md"),
    Path("./output/doc2/full.md")
]

rag.build_from_files(
    file_paths=markdown_files,
    library_id="my_library"
)

# Or load an existing vector store
rag.load_vector_store(library_id="my_library")
```
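
When many documents have been processed, listing each `full.md` by hand gets tedious. The sketch below gathers them with `pathlib`, assuming the `./output/<document>/full.md` layout used in the example above. `collect_markdown_files` is a hypothetical helper and not part of `mineru_rag`.

```python
from pathlib import Path
from typing import List


def collect_markdown_files(output_dir: str, name: str = "full.md") -> List[Path]:
    """Return every `<output_dir>/<document>/<name>` file, sorted by path.

    Assumes one sub-directory per processed document, as in the example
    above. This helper is illustrative and not part of mineru_rag.
    """
    root = Path(output_dir)
    if not root.is_dir():
        return []
    # Look exactly one level below the output directory.
    return sorted(p for p in root.glob(f"*/{name}") if p.is_file())
```

The returned list can be passed as `file_paths` to `rag.build_from_files(...)`.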

### 3. Query with LLM

```python
from mineru_rag import LLMClient, RAGBuilder

# Initialize LLM client
llm = LLMClient(
    api_key="your-openai-api-key",
    base_url="http://your-api-server/v1/",
    model="gpt-3.5-turbo"
)

# Initialize RAG builder
rag = RAGBuilder()
rag.load_vector_store(library_id="my_library")

# Query
rag_result = rag.query("What is the main contribution of this paper?", k=4)
answer = llm.query_with_rag(rag_result)

print(answer['answer'])
```
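
The `k` argument presumably sets how many chunks are retrieved for the question. As a rough illustration of top-k retrieval, the sketch below ranks text chunks by bag-of-words cosine similarity. This is not the package's implementation (the `rag` extra suggests embeddings and a FAISS index are used instead); `top_k_chunks` and the sample chunks are hypothetical.

```python
import math
import re
from collections import Counter
from typing import List


def _vectorize(text: str) -> Counter:
    """Lower-case bag-of-words term counts."""
    return Counter(re.findall(r"\w+", text.lower()))


def _cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two term-count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def top_k_chunks(question: str, chunks: List[str], k: int = 4) -> List[str]:
    """Return the k chunks most similar to the question, best first."""
    q = _vectorize(question)
    ranked = sorted(chunks, key=lambda c: _cosine(q, _vectorize(c)), reverse=True)
    return ranked[:k]


# Hypothetical sample data for illustration only.
chunks = [
    "The paper proposes a new layout parser for PDF documents.",
    "Training used eight GPUs for three days.",
    "The main contribution is a faster PDF layout parser.",
]
best = top_k_chunks("What is the main contribution?", chunks, k=2)
```

A larger `k` gives the LLM more context at the cost of a longer prompt; a smaller `k` keeps the prompt short but may miss relevant passages.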

### 4. Complete Workflow

```python
from mineru_rag import MinerUClient, RAGBuilder, LLMClient
from pathlib import Path

# 1. Process documents
client = MinerUClient(api_token="your-mineru-api-token")
result = client.process_file("paper.pdf", "./output")

# 2. Build RAG knowledge base
rag = RAGBuilder()
md_file = Path(result['md_file'])
rag.build_from_files([md_file], library_id="papers")

# 3. Query
llm = LLMClient(
    api_key="your-api-key",
    base_url="http://your-api-server/v1/"
)
rag.load_vector_store("papers")
rag_result = rag.query("What are the key findings?", k=4)
answer = llm.query_with_rag(rag_result)
print(answer['answer'])
```

## Configuration

### Environment Variables

#### MinerU Online API
```bash
export MINERU_API_TOKEN="your-mineru-api-token"
```

#### LLM API
```bash
export OPENAI_API_KEY="your-openai-api-key"
export OPENAI_BASE_URL="http://your-api-server/v1/"
export OPENAI_MODEL="gpt-3.5-turbo"
export OPENAI_TEMPERATURE="0.7"
```
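
This README does not state whether the clients read these variables automatically, so a cautious approach is to read them explicitly and pass the values to the constructor. A minimal sketch, where `load_llm_settings` is a hypothetical helper and the fallback values are assumptions taken from the examples above:

```python
import os
from typing import Dict, Mapping, Optional, Union


def load_llm_settings(
    env: Optional[Mapping[str, str]] = None,
) -> Dict[str, Union[str, float]]:
    """Collect LLM settings from environment variables.

    Hypothetical helper: the variable names come from this README, the
    fallback values are assumptions for illustration only.
    """
    env = os.environ if env is None else env
    api_key = env.get("OPENAI_API_KEY")
    if not api_key:
        raise RuntimeError("OPENAI_API_KEY is not set")
    return {
        "api_key": api_key,
        "base_url": env.get("OPENAI_BASE_URL", "http://your-api-server/v1/"),
        "model": env.get("OPENAI_MODEL", "gpt-3.5-turbo"),
        "temperature": float(env.get("OPENAI_TEMPERATURE", "0.7")),
    }
```

The `api_key`, `base_url`, and `model` entries match the `LLMClient(...)` keyword arguments used in the Quick Start; whether the constructor also accepts `temperature` is not confirmed here.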

### Local MinerU vLLM Backend

To use a local MinerU vLLM backend:

1. Install MinerU and start the vLLM backend:
```bash
# Install MinerU (follow MinerU documentation)
# Start vLLM backend on port 30000
```

2. Use local mode:
```python
client = MinerUClient(use_local=True, local_url="http://127.0.0.1:30000")
```
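
Before switching to local mode, it can help to confirm that something is listening on the backend port. The sketch below only checks TCP reachability with the standard library; it does not verify that the service is actually MinerU. `is_backend_reachable` is a hypothetical helper and not part of `mineru_rag`.

```python
import socket
from urllib.parse import urlparse


def is_backend_reachable(url: str, timeout: float = 2.0) -> bool:
    """Return True if a TCP connection to the URL's host and port succeeds."""
    parsed = urlparse(url)
    host = parsed.hostname
    if host is None:
        return False
    # Fall back to the scheme's default port when the URL omits one.
    port = parsed.port or (443 if parsed.scheme == "https" else 80)
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False
```

For example, `is_backend_reachable("http://127.0.0.1:30000")` could be called before constructing `MinerUClient(use_local=True, ...)` to fail early with a clear message.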

## Command Line Usage

### Process Documents

```bash
# Online mode
mineru-rag process document.pdf -o ./output --api-token your-token

# Local mode
mineru-rag process document.pdf -o ./output --local --local-url http://127.0.0.1:30000
```
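
To script the CLI from Python, one option is to assemble the argument list and hand it to `subprocess.run`. The sketch below uses only the flags shown above; `build_process_command` and `run_process` are hypothetical helpers, not part of the package.

```python
import subprocess
from typing import List, Optional


def build_process_command(
    input_path: str,
    output_dir: str,
    api_token: Optional[str] = None,
    local_url: Optional[str] = None,
) -> List[str]:
    """Build the argv list for `mineru-rag process` in online or local mode."""
    cmd = ["mineru-rag", "process", input_path, "-o", output_dir]
    if local_url is not None:
        cmd += ["--local", "--local-url", local_url]
    elif api_token is not None:
        cmd += ["--api-token", api_token]
    else:
        raise ValueError("provide either api_token or local_url")
    return cmd


def run_process(cmd: List[str]) -> int:
    """Run the command and return its exit code (requires mineru-rag on PATH)."""
    return subprocess.run(cmd, check=False).returncode
```

Passing the arguments as a list avoids shell quoting problems with paths that contain spaces.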

### Build RAG Knowledge Base

```bash
mineru-rag build doc1.md doc2.md -l my_library
```

### Query RAG

```bash
mineru-rag query "What is the main contribution?" -l my_library -k 4
```

## API Reference

### MinerUClient

- `process_file(input_path, output_path, ...)`: Process a single file
- `process_files_batch(file_paths, output_dir, ...)`: Process multiple files

### RAGBuilder

- `build_from_files(file_paths, library_id, ...)`: Build a vector database from files
- `load_vector_store(library_id)`: Load an existing vector database
- `query(question, k, file_id)`: Query the knowledge base

### LLMClient

- `query(question, context)`: Query LLM with context
- `query_with_rag(rag_result)`: Query LLM with RAG result

## 📚 Documentation

- **[USER_GUIDE.md](USER_GUIDE.md)** - Complete user guide (recommended reading)
- **[QUICKSTART.md](QUICKSTART.md)** - Quick start guide
- **[examples/](examples/)** - Example code

## License

MIT License

## Contributing

Contributions are welcome! Please feel free to submit a Pull Request.

