Metadata-Version: 2.4
Name: doc-index-mcp
Version: 0.1.0.dev0
Summary: MCP server for semantic document search with boundary-aware chunking
Project-URL: Homepage, https://github.com/mike-anderson/doc-index-mcp
Project-URL: Repository, https://github.com/mike-anderson/doc-index-mcp
Project-URL: Issues, https://github.com/mike-anderson/doc-index-mcp/issues
License: MIT License
        
        Copyright (c) 2026 Mike
        
        Permission is hereby granted, free of charge, to any person obtaining a copy
        of this software and associated documentation files (the "Software"), to deal
        in the Software without restriction, including without limitation the rights
        to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
        copies of the Software, and to permit persons to whom the Software is
        furnished to do so, subject to the following conditions:
        
        The above copyright notice and this permission notice shall be included in all
        copies or substantial portions of the Software.
        
        THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
        IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
        FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
        AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
        LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
        OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
        SOFTWARE.
License-File: LICENSE
Keywords: document-indexing,embeddings,mcp,rag,semantic-search
Classifier: Development Status :: 4 - Beta
Classifier: License :: OSI Approved :: MIT License
Classifier: Programming Language :: Python :: 3
Classifier: Programming Language :: Python :: 3.11
Classifier: Programming Language :: Python :: 3.12
Classifier: Programming Language :: Python :: 3.13
Requires-Python: >=3.11
Requires-Dist: fastembed>=0.2.0
Requires-Dist: mcp>=1.0.0
Requires-Dist: numpy>=1.24.0
Requires-Dist: openpyxl>=3.1.0
Requires-Dist: pdfplumber>=0.10.0
Requires-Dist: pydantic>=2.0.0
Requires-Dist: python-docx>=0.8.11
Requires-Dist: python-pptx>=0.6.21
Requires-Dist: tiktoken>=0.5.0
Requires-Dist: usearch>=2.0.0
Provides-Extra: apple
Requires-Dist: onnxruntime>=1.17.0; extra == 'apple'
Provides-Extra: cuda
Requires-Dist: onnxruntime-gpu>=1.17.0; extra == 'cuda'
Provides-Extra: dev
Requires-Dist: pytest-asyncio>=0.21.0; extra == 'dev'
Requires-Dist: pytest>=7.0.0; extra == 'dev'
Description-Content-Type: text/markdown

# Doc Index MCP

## What Is This For?

A local-first semantic search server for your documents. Index PDFs, Word docs, PowerPoints, Excel files, and text/markdown, then search them using natural language via the Model Context Protocol (MCP).

- **Semantic search** - Find relevant content using natural language queries
- **Boundary-aware chunking** - Respects document structure (chapters, sections, headers)
- **Table extraction** - Extract tables from documents as CSV
- **Fully local** - No external APIs, no cloud services, no PyTorch (the embedding model is downloaded once)
- **Lightweight** - ONNX-based embeddings (~50MB vs ~2GB for PyTorch)

## Supported Formats

| Format | Extensions | Notes |
|--------|------------|-------|
| Text | `.txt` | Plain text |
| Markdown | `.md`, `.markdown` | Preserves headers for boundaries |
| PDF | `.pdf` | Text extraction with page markers |
| Word | `.docx` | Paragraphs, headings, tables |
| PowerPoint | `.pptx` | Slides, notes, tables |
| Excel | `.xlsx`, `.xls` | Sheets as tables |
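
The extension column above suggests a simple dispatch step before loading. The following is a minimal sketch of that idea, not the server's actual loader code: the function name `detect_format` and the format labels are assumptions made for illustration, and only the extensions come from the table.

```python
from pathlib import Path

# Extensions are copied from the table above; the labels on the right are
# illustrative names, not identifiers used by the server.
EXTENSION_TO_FORMAT = {
    ".txt": "text",
    ".md": "markdown",
    ".markdown": "markdown",
    ".pdf": "pdf",
    ".docx": "word",
    ".pptx": "powerpoint",
    ".xlsx": "excel",
    ".xls": "excel",
}


def detect_format(file_path: str) -> str:
    """Return a format label for a path, or raise ValueError if unsupported."""
    suffix = Path(file_path).suffix.lower()  # case-insensitive match
    try:
        return EXTENSION_TO_FORMAT[suffix]
    except KeyError:
        raise ValueError(f"Unsupported file type: {suffix or '(none)'}") from None
```

Lowercasing the suffix means `manual.PDF` and `manual.pdf` are treated alike; whether the real loader does the same is not stated here.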

### Why No External Services?

| Component | Traditional RAG | This Server |
|-----------|-----------------|-------------|
| Embeddings | OpenAI API / hosted model | Local ONNX model (fastembed) |
| Vector DB | Pinecone / Weaviate / Qdrant | Local file (usearch) |
| Storage | Cloud / managed DB | Local `.docindex/` directory |
| Dependencies | PyTorch (~2GB) | ONNX Runtime (~50MB) |

## Tools

### `doc_index`
Index a document for semantic search.

```json
{
  "file_path": "docs/manual.pdf",
  "source_name": "manual"
}
```

### `doc_search`
Search indexed documents using natural language.

```json
{
  "query": "how to configure authentication",
  "top_k": 5,
  "expand_to_boundary": "section",
  "max_return_tokens": 4096
}
```

Parameters:
- `query` - Search query
- `sources` - Filter to specific sources (optional)
- `top_k` - Number of results (default: 5)
- `expand_to_boundary` - Expand each result to its full enclosing `"section"` or `"chapter"` (optional)
- `max_return_tokens` - Token budget for results (default: 4096)
- `include_siblings` - Include sibling sections when expanding
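
One plausible way `top_k` and `max_return_tokens` could interact is sketched below: keep the best-scoring hits, then stop adding results once the token budget would be exceeded. This is an assumption about the behavior, not the server's implementation. The `Hit` class, `apply_budget`, and the whitespace-based `count_tokens` are hypothetical stand-ins (the project depends on `tiktoken`, which would give real token counts).

```python
from dataclasses import dataclass


@dataclass
class Hit:
    chunk_id: str
    score: float
    text: str


def count_tokens(text: str) -> int:
    # Stand-in tokenizer: counts whitespace-separated words.
    # A real implementation would likely use a proper tokenizer.
    return len(text.split())


def apply_budget(hits: list[Hit], top_k: int = 5,
                 max_return_tokens: int = 4096) -> list[Hit]:
    """Keep the top_k best hits, stopping when the token budget is used up."""
    ranked = sorted(hits, key=lambda h: h.score, reverse=True)[:top_k]
    kept: list[Hit] = []
    used = 0
    for hit in ranked:
        cost = count_tokens(hit.text)
        if used + cost > max_return_tokens:
            break  # the next hit would exceed the budget
        kept.append(hit)
        used += cost
    return kept
```

Stopping at the first hit that does not fit keeps the result list in strict rank order; skipping it and trying smaller hits would be an equally reasonable policy.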

### `doc_list`
List all indexed sources.

### `doc_chunk`
Retrieve a specific chunk by ID with optional neighbors.

```json
{
  "chunk_id": "manual:42",
  "neighbors": 2
}
```
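
The example ID `manual:42` looks like a source name followed by a chunk index. Assuming that format holds (it is inferred from the example only), a neighbor window could be computed roughly as follows. The function `neighbor_ids` and the `total_chunks` argument are hypothetical.

```python
def neighbor_ids(chunk_id: str, neighbors: int, total_chunks: int) -> list[str]:
    """Return the IDs of a chunk and up to `neighbors` chunks on each side.

    Assumes IDs have the form "<source>:<index>", as in "manual:42".
    """
    source, _, index_text = chunk_id.rpartition(":")
    if not source or not index_text.isdigit():
        raise ValueError(f"Unexpected chunk ID format: {chunk_id!r}")
    index = int(index_text)
    # Clamp the window to the valid index range of the source.
    start = max(0, index - neighbors)
    end = min(total_chunks - 1, index + neighbors)
    return [f"{source}:{i}" for i in range(start, end + 1)]
```

Under these assumptions, `neighbor_ids("manual:42", 2, 100)` covers `manual:40` through `manual:44`, and the window is truncated at the first and last chunk of a source.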

### `read_document`
Read a document without indexing it. Returns the document's content as formatted text.

```json
{
  "file_path": "report.pdf",
  "max_chars": 100000
}
```

### `list_tables`
List all tables in a document.

```json
{
  "file_path": "data.xlsx"
}
```

### `extract_table`
Extract a specific table as CSV.

```json
{
  "file_path": "data.xlsx",
  "table_index": 0,
  "max_rows": 100
}
```
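
As a rough illustration of what CSV output with a row limit might involve, the sketch below serializes a table held as a list of rows using the standard `csv` module. The function `table_to_csv` is hypothetical, and counting the header row toward `max_rows` is an assumption; the server may treat it differently.

```python
import csv
import io


def table_to_csv(rows: list[list[object]], max_rows: int = 100) -> str:
    """Serialize at most max_rows rows (header included) as CSV text."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    for row in rows[:max_rows]:
        # Empty cells (None) become empty strings rather than "None".
        writer.writerow(["" if cell is None else cell for cell in row])
    return buffer.getvalue()
```

Using `csv.writer` rather than joining on commas means cells that contain commas, quotes, or newlines are quoted correctly.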

## Installation

```bash
pip install -r requirements.txt
```

Or with uv:

```bash
uv pip install -r requirements.txt
```

## Configuration

Add the server to your Claude Desktop (or other MCP client) configuration:

```json
{
  "mcpServers": {
    "doc-index": {
      "command": "python",
      "args": ["/path/to/doc-index-mcp/src/server.py"],
      "env": {
        "MCP_WORKING_DIR": "/path/to/your/project",
        "DOC_INDEX_DIR": "/path/to/store/indices"
      }
    }
  }
}
```

### Environment Variables

| Variable | Description | Default |
|----------|-------------|---------|
| `MCP_WORKING_DIR` | Base directory for resolving file paths | Current working directory |
| `DOC_INDEX_DIR` | Directory for storing vector indices | `.docindex` in working dir |

## Architecture

Everything runs locally - no external APIs, databases, or embedding servers required.

```mermaid
flowchart TB
    subgraph Client["MCP Client (Claude Desktop, etc.)"]
        LLM[LLM]
    end

    subgraph MCP["Doc Index MCP Server"]
        Server[server.py]

        subgraph Services["Local Services"]
            Loader[Document Loader<br/>PDF, DOCX, PPTX, XLSX]
            Chunker[Boundary-Aware<br/>Chunker]
            Embedder[Embedder<br/>ONNX Runtime]
            VectorStore[Vector Store<br/>usearch]
        end
    end

    subgraph Storage["Local Filesystem"]
        Docs[(Source<br/>Documents)]
        Index[(".docindex/<br/>├── manifest.json<br/>└── vectors/<br/>    ├── index.usearch<br/>    ├── chunks.jsonl<br/>    └── boundaries.json")]
    end

    subgraph Models["Embedding Model (downloaded once)"]
        ONNX[BAAI/bge-small-en-v1.5<br/>ONNX format ~50MB]
    end

    LLM <-->|MCP Protocol| Server
    Server --> Loader
    Server --> Chunker
    Server --> Embedder
    Server --> VectorStore

    Loader -->|read| Docs
    VectorStore <-->|read/write| Index
    Embedder -->|load once| ONNX

    style Client fill:#e1f5fe
    style Storage fill:#fff3e0
    style Models fill:#f3e5f5
    style MCP fill:#e8f5e9
```

### Data Flow

```mermaid
flowchart LR
    subgraph Index["Indexing"]
        direction TB
        A[Document] --> B[Load & Extract Text]
        B --> C[Detect Boundaries]
        C --> D[Chunk ~256 tokens]
        D --> E[Generate Embeddings]
        E --> F[Save to Disk]
    end

    subgraph Search["Searching"]
        direction TB
        G[Query] --> H[Embed Query]
        H --> I[Vector Similarity Search]
        I --> J[Expand to Boundaries]
        J --> K[Return Results]
    end

    Index -.->|stored in .docindex/| Search
```
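
The "Detect Boundaries" and "Chunk ~256 tokens" steps in the indexing flow can be illustrated with a toy chunker for Markdown input. This is a simplified sketch, not the project's chunker: it treats Markdown headers as the only boundaries and counts whitespace-separated words instead of real tokens. The names `split_sections` and `chunk_text` are hypothetical.

```python
import re

# A Markdown ATX header: one to six '#' characters, a space, then text.
HEADER = re.compile(r"^#{1,6}\s+\S")


def split_sections(text: str) -> list[str]:
    """Split text into sections, starting a new one at each header line."""
    sections: list[str] = []
    current: list[str] = []
    for line in text.splitlines():
        if HEADER.match(line) and current:
            sections.append("\n".join(current))
            current = []
        current.append(line)
    if current:
        sections.append("\n".join(current))
    return sections


def chunk_text(text: str, max_tokens: int = 256) -> list[str]:
    """Chunk each section separately so no chunk crosses a header boundary."""
    chunks: list[str] = []
    for section in split_sections(text):
        words = section.split()  # stand-in for real tokenization
        for start in range(0, len(words), max_tokens):
            chunks.append(" ".join(words[start:start + max_tokens]))
    return chunks
```

Because sections are chunked independently, a chunk may be shorter than `max_tokens` at the end of a section, which is the trade-off that keeps every chunk inside a single structural unit.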

## License

MIT
