Metadata-Version: 2.4
Name: netintel-ocr
Version: 0.1.16.1
Summary: Enterprise OCR Platform with Flow Diagram Support and Customizable Prompts - Network & Process Intelligence
Home-page: https://github.com/VisionMLNet/NetIntelOCR
Author: VisionML
Project-URL: Homepage, https://github.com/VisionMLNet/NetIntelOCR
Project-URL: Issues, https://github.com/VisionMLNet/NetIntelOCR/issues
Project-URL: Changelog, https://github.com/VisionMLNet/NetIntelOCR/changelog.md
Requires-Python: >=3.10
Description-Content-Type: text/markdown
Requires-Dist: pymupdf>=1.26.1
Requires-Dist: pdfplumber>=0.10.0
Requires-Dist: pillow>=11.2.1
Requires-Dist: opencv-python-headless>=4.5.0
Requires-Dist: fastapi>=0.100.0
Requires-Dist: uvicorn[standard]>=0.23.0
Requires-Dist: pydantic>=2.0.0
Requires-Dist: fastmcp>=0.1.0
Requires-Dist: pymilvus>=2.3.0
Requires-Dist: ollama>=0.1.0
Requires-Dist: aioboto3>=12.0.0
Requires-Dist: redis[hiredis]>=5.0.0
Requires-Dist: numpy>=1.24.0
Requires-Dist: click>=8.1.0
Requires-Dist: tqdm>=4.67.1
Requires-Dist: pandas>=2.0.0
Requires-Dist: jsonschema>=4.0.0
Requires-Dist: pyyaml>=6.0.0
Requires-Dist: psutil>=5.9.0
Requires-Dist: pyjwt>=2.8.0
Requires-Dist: requests>=2.31.0
Provides-Extra: cpp
Requires-Dist: cmake>=3.12; extra == "cpp"
Requires-Dist: pybind11>=2.6.0; extra == "cpp"
Requires-Dist: scikit-build>=0.11.0; extra == "cpp"
Provides-Extra: dev
Requires-Dist: pytest>=6.0.0; extra == "dev"
Requires-Dist: pytest-cov>=2.10.0; extra == "dev"
Requires-Dist: black>=20.8b1; extra == "dev"
Requires-Dist: flake8>=3.8.0; extra == "dev"
Provides-Extra: all
Requires-Dist: netintel-ocr[cpp]; extra == "all"
Dynamic: author
Dynamic: home-page
Dynamic: requires-python

# NetIntel-OCR (Network Intelligence OCR) v0.1.16

🚀 **Enterprise OCR Platform with Flow Diagram Support and Customizable Prompts!**

[![Version](https://img.shields.io/badge/version-0.1.16-blue)]() [![Python](https://img.shields.io/badge/python-3.10+-green)]() [![Milvus](https://img.shields.io/badge/Milvus-Powered-purple)]() [![Flow Diagrams](https://img.shields.io/badge/Flow-Diagrams-orange)]() [![Prompts](https://img.shields.io/badge/Prompts-Customizable-green)]() [![Docker](https://img.shields.io/badge/docker-ready-blue)]() [![Kubernetes](https://img.shields.io/badge/kubernetes-ready-purple)]() [![API](https://img.shields.io/badge/API-REST-orange)]() [![MCP](https://img.shields.io/badge/MCP-LLM_Ready-purple)]()

NetIntel-OCR is an enterprise-grade platform for extracting intelligence from technical documents. It automatically detects and processes network diagrams, flow diagrams, tables, and text, converting them into structured, searchable formats. As of v0.1.16, it also supports process flows, workflows, and decision trees, plus full prompt customization for industry-specific needs.

**🎉 Version 0.1.16 adds Flow Diagram support and Prompt Management! Extract process flows, workflows, and decision trees. Customize all prompts for your specific industry or use case.**

## 🎯 Key Capabilities

### Network Intelligence Extraction
- **Automatic Network Detection**: AI-powered identification of network diagrams in documents
- **Component Recognition**: Identifies routers, switches, firewalls, servers, and other network elements
- **Connection Mapping**: Traces and documents network paths and relationships
- **Security Architecture Analysis**: Extracts security zones, DMZs, and trust boundaries

## ✨ Features

### 🆕 New in v0.1.16 - Flow Diagrams and Prompt Customization
- 📊 **Flow Diagram Support**: Full extraction and analysis of process flows, workflows, and decision trees
- 🔄 **Unified Diagram Detection**: Automatically identifies network, flow, or hybrid diagrams
- 🎯 **Process Intelligence**: Identifies bottlenecks, optimization opportunities, and critical paths
- 📝 **Customizable Prompts**: Export, modify, and import all prompts for industry-specific needs
- 🎨 **Prompt Templates**: Pre-built templates for security, compliance, cloud, and process optimization
- 🔧 **Runtime Overrides**: Change prompts on the fly without editing files
- 📈 **Flow Mermaid Generation**: Automatic conversion to flowchart TD/LR format
- 🧠 **Context-Aware Analysis**: Reads 2 paragraphs before/after diagrams for accurate interpretation
- 🔍 **Type-Specific Processing**: Different analysis for network vs flow diagrams
- 🌐 **Hybrid Diagram Support**: Handles diagrams with both network and flow elements

### Previous v0.1.15 - Milvus Vector Database Integration
- 🚀 **20-60x Faster Search**: Sub-100ms query response with Milvus distributed architecture
- 💾 **70% Memory Reduction**: Process 10x more documents with the same hardware
- 🎯 **Enterprise Scale**: From standalone to distributed deployment without code changes
- 🤖 **Qwen3-8B Embeddings**: Advanced 4096-dimensional embeddings via Ollama
- 🔄 **IVF_SQ8 Index**: CPU-optimized scalar quantization for standard hardware
- 📦 **One-Command Setup**: Automatic configuration with `netintel-ocr --init`
- 🐳 **Docker Compose Ready**: Pre-configured stack with etcd, MinIO, and Milvus
- ☸️ **Kubernetes Support**: Production-ready Helm charts for enterprise deployment
- 🔧 **OLLAMA_HOST Detection**: Automatic discovery of Ollama embedding service

### Previous v0.1.14 - High-Performance Deduplication with C++ Core
- ⚡ **50-100x Performance Boost**: C++ core with AVX2 SIMD and OpenMP parallelization
- 🎯 **Three-Level Deduplication**: MD5 (exact), SimHash (fuzzy), CDC (content-level)
- 📦 **Zero-Compilation Install**: Pre-compiled binary wheels for Linux/macOS/Windows
- 🔍 **Near-Duplicate Detection**: SimHash with configurable Hamming distance threshold
- 📊 **Content-Defined Chunking**: Remove repetitive blocks with 30-50% storage reduction
- 🎨 **Version Information**: `netintel-ocr --version` shows C++ core status
- 🔧 **Automatic Fallback**: Python implementation when C++ unavailable

### Previous v0.1.13 - Service-Oriented Architecture
- 🌐 **REST API Server**: FastAPI-based server with full OpenAPI/Swagger documentation
- 🤖 **MCP Server**: Model Context Protocol server for LLM integration
- 📦 **Multi-Scale Deployments**: From single container to enterprise Kubernetes
- 🚀 **Flexible Worker Architecture**: Embedded workers or Kubernetes Jobs

### Previous v0.1.12 - Advanced Database Management
- 🗄️ **Centralized Database Management**: Unified LanceDB with deduplication and MD5 checksums
- 🔍 **Advanced Query Engine**: Vector similarity search with multi-field filtering and reranking
- 📊 **Multiple Output Formats**: JSON, Markdown, and CSV output for queries
- 🚀 **Batch Processing Pipeline**: Parallel PDF processing with progress tracking

### Core Features
- 🚀 **Vector Database Integration (v0.1.7)**: Automatic generation of LanceDB-ready chunks and vector-optimized content
- 🎯 **Intelligent Hybrid Processing**: Automatically detects and processes network diagrams as Mermaid.js, tables as JSON, text as markdown
- 📄 **PDF to Text Conversion**: Convert PDFs to markdown files locally, no token costs
- 🤖 **Multi-Model Support (v0.1.4)**: Use different models for text and network processing for optimal performance
- 📊 **Table Extraction (v0.1.6-v0.1.10)**: Automatic detection and extraction of tables with smart ToC exclusion
- 🖼️ **Visual Understanding**: Turn images and diagrams into detailed text descriptions
- 🔌 **Automatic Network Detection**: No flags needed - network diagrams are detected and converted automatically
- 🎨 **Icons by Default**: Font Awesome icons automatically added to network diagrams for better visualization
- ⏱️ **Smart Timeouts**: Operations timeout gracefully with fallback to simpler methods
- 📊 **Diagram Types Supported**: Network topology, architecture diagrams, data flow diagrams, security diagrams
- 📁 **MD5-Based Organization (v0.1.4)**: Each document stored in unique folder using MD5 checksum
- 📝 **Document Index (v0.1.4)**: Automatic index.md tracking all processed documents
- 📈 **Enhanced Metrics (v0.1.4)**: Comprehensive footer with processing details, errors, and configuration
- ⚡ **Optimized Processing**: Processes up to 100 pages per run with detailed progress tracking
- 🔧 **Flexible Output**: Unified markdown format with seamlessly embedded Mermaid diagrams and tables
- 🔄 **Checkpoint/Resume (v0.1.5)**: Resume interrupted processing from exact stopping point
- 🔍 **Vector Search Ready (v0.1.7)**: Pre-chunked content with minimal metadata for optimal vector search performance
- 🔁 **Vector Regeneration (v0.1.10)**: Regenerate vector files from existing markdown without reprocessing PDFs

## 💼 Use Cases

### Network Documentation
- Convert legacy network diagrams to modern formats
- Extract network topology from vendor documentation
- Audit and inventory network architectures

### Security Analysis
- Map security architecture from compliance documents
- Extract firewall rules and network segmentation
- Document data flow and trust boundaries

### Infrastructure Planning
- Analyze existing network designs
- Extract capacity and redundancy information
- Document interconnections and dependencies

## 📦 Requirements

- Python 3.10+
- Ollama installed and running locally or on a remote server

### Installing Ollama and the Default Model

1. Install [Ollama](https://ollama.com/)
2. Pull the default model:
```bash
ollama pull nanonets-ocr-s:latest
```

### Using a Remote Ollama Server

By default, netintel-ocr connects to Ollama running on localhost. To use a remote Ollama server, set the `OLLAMA_HOST` environment variable:

```bash
# Connect to a remote Ollama server
export OLLAMA_HOST="http://192.168.1.100:11434"
netintel-ocr document.pdf

# Or run with the environment variable inline
OLLAMA_HOST="http://remote-server:11434" netintel-ocr document.pdf
```
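
As an illustration of the lookup described above, here is a minimal Python sketch of resolving the Ollama endpoint from the environment. The function name `resolve_ollama_host` and the scheme-normalization step are assumptions for illustration, not the tool's actual implementation; the only facts taken from this README are the `OLLAMA_HOST` variable and the localhost default (port 11434 is Ollama's standard port).

```python
import os

# Ollama listens on port 11434 by default.
DEFAULT_OLLAMA_HOST = "http://localhost:11434"


def resolve_ollama_host(env=None) -> str:
    """Return the Ollama base URL, falling back to localhost.

    `env` is any mapping; it defaults to the process environment.
    """
    env = os.environ if env is None else env
    host = env.get("OLLAMA_HOST", "").strip()
    if not host:
        return DEFAULT_OLLAMA_HOST
    # Assumption: accept bare "host:port" values by adding a scheme.
    if "://" not in host:
        host = "http://" + host
    return host.rstrip("/")


if __name__ == "__main__":
    print(resolve_ollama_host())
```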

## Installation

### From PyPI

Install the published version using pip:
```bash
pip install netintel-ocr
```

or uv:
```bash
uv tool install netintel-ocr
```

The package uses Ollama for embeddings (default: `qwen3-embedding:8b`, 4096 dimensions) together with the Milvus integration.

### 🚀 Quick Start - Choose Your Deployment Scale (NEW v0.1.15!)

#### Development Scale (1-50 users, up to 1M documents)
```bash
# Initialize development deployment (default)
netintel-ocr --init
# Automatically detects OLLAMA_HOST
# Generates Docker Compose with Milvus Standalone

# Start the stack
cd ~/.netintel-ocr
docker-compose up -d
# Milvus: http://localhost:19530
# API: http://localhost:8000
# MCP: http://localhost:8001
```

#### Production Scale (100+ users, 100M+ documents)
```bash
# Initialize production deployment
netintel-ocr --init --scale production

# Deploy with Kubernetes
helm install netintel-ocr ./helm \
  --namespace netintel-ocr \
  --create-namespace

# Or use Docker with full monitoring
docker-compose -f docker/docker-compose.large.yml up -d
# Grafana: http://localhost:3000
```

## Usage

### 🆕 Server Modes (v0.1.13) - API & MCP Services

#### Start API Server
```bash
# Start REST API server for document processing
netintel-ocr --api
# Access Swagger UI: http://localhost:8000/docs

# With embedded workers (for small deployments)
netintel-ocr --api --embedded-workers --max-workers 2
```

#### Start MCP Server
```bash
# Start Model Context Protocol server for LLM integration
netintel-ocr --mcp
# Connect your LLM to http://localhost:8001
```

#### All-in-One Mode
```bash
# Run everything in a single process (personal use)
netintel-ocr --all-in-one --local-storage --sqlite-queue
# API: http://localhost:8000
# MCP: http://localhost:8001
```

### Traditional CLI Usage (Document Processing)
By default, netintel-ocr automatically:
1. Detects and converts network diagrams to Mermaid.js
2. Extracts tables to structured JSON
3. Generates vector database files (LanceDB-ready chunks)

```bash
# Automatic hybrid mode with vector generation
netintel-ocr path/to/your/file.pdf

# This creates:
# - Human-friendly markdown files
# - document-vector.md (filtered for embeddings)
# - chunks.jsonl (ready for LanceDB ingestion)
# - Complete metadata and schema files
```

### Text-Only Mode
For faster processing when you know the document contains only text:
```bash
netintel-ocr document.pdf --text-only
```

### 🆕 v0.1.15 Commands - Milvus Integration & Vector Search

```bash
# Initialize with Milvus (auto-detects OLLAMA_HOST)
netintel-ocr --init

# Check version and capabilities
netintel-ocr --version
netintel-ocr --version-json  # JSON output with Milvus status

# Process with Milvus vector storage (20-60x faster search)
netintel-ocr document.pdf --vector-db milvus

# Vector similarity search in Milvus
netintel-ocr --search "network topology" \
  --collection netintel_vectors \
  --limit 10

# Process with full deduplication (enhanced with Milvus)
netintel-ocr document.pdf --dedup-mode full

# Find near-duplicates using Milvus binary vectors
netintel-ocr --find-duplicates document.pdf \
  --hamming-threshold 5 \
  --use-milvus

# Show Milvus collection statistics
netintel-ocr --milvus-stats

# Configure advanced processing
netintel-ocr document.pdf \
  --embedding-model qwen3-embedding:8b \
  --index-type IVF_SQ8 \
  --dedup-mode full
```
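
To make the `--hamming-threshold` option more concrete, the following Python sketch shows the general SimHash idea: each document is reduced to a fixed-width fingerprint, and two documents count as near-duplicates when their fingerprints differ in at most a given number of bits. This is a simplified illustration, not the tool's C++ core; the word tokenization, the 64-bit width, and the function names are assumptions.

```python
import hashlib
import re


def simhash(text: str, bits: int = 64) -> int:
    """Compute a SimHash fingerprint from lowercase word tokens."""
    weights = [0] * bits
    for token in re.findall(r"\w+", text.lower()):
        # MD5 is used here only as a stable bit mixer for each token.
        digest = hashlib.md5(token.encode("utf-8")).digest()
        h = int.from_bytes(digest[:8], "big")
        for i in range(bits):
            weights[i] += 1 if (h >> i) & 1 else -1
    fingerprint = 0
    for i, weight in enumerate(weights):
        if weight > 0:
            fingerprint |= 1 << i
    return fingerprint


def hamming_distance(a: int, b: int) -> int:
    """Count the bit positions where two fingerprints differ."""
    return bin(a ^ b).count("1")


def is_near_duplicate(a: str, b: str, threshold: int = 5) -> bool:
    """Treat two texts as near-duplicates if their fingerprints are close."""
    return hamming_distance(simhash(a), simhash(b)) <= threshold


if __name__ == "__main__":
    doc_a = "The core switch connects to the firewall in the DMZ."
    doc_b = "The core switch connects to the firewall inside the DMZ."
    print(hamming_distance(simhash(doc_a), simhash(doc_b)))
```

A lower threshold makes matching stricter; a threshold of 0 accepts only identical fingerprints.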

### v0.1.12 Commands - Database Management

```bash
# Query centralized database with advanced filtering
netintel-ocr --query "network security" \
  --centralized-db ./centralized.lancedb \
  --filters '{"source_type": "network_diagram"}' \
  --output-format json \
  --limit 10

# Merge documents to centralized database
netintel-ocr --merge-to-centralized \
  --output ./output \
  --centralized-db ./unified.lancedb \
  --dedup-strategy md5

# Batch process multiple PDFs with parallel processing
netintel-ocr --batch-ingest ./pdf_directory \
  --output ./batch_output \
  --parallel-workers 4 \
  --auto-merge

# Database management commands
netintel-ocr --db-stats ./centralized.lancedb
netintel-ocr --db-optimize ./centralized.lancedb --vacuum
netintel-ocr --db-export ./centralized.lancedb --format json
```

**Cloud Workflow with S3/MinIO:**
```bash
# Configure S3/MinIO storage
export S3_ENDPOINT=https://s3.amazonaws.com
export S3_BUCKET=netintel-documents
export AWS_ACCESS_KEY_ID=your-key
export AWS_SECRET_ACCESS_KEY=your-secret

# Process with cloud storage
netintel-ocr document.pdf --s3-sync --s3-bucket netintel-documents

# Batch process from cloud storage
netintel-ocr --batch-ingest s3://netintel-documents/pdfs/ \
  --output s3://netintel-documents/output/ \
  --parallel-workers 8
```

### Multi-Model Processing (NEW v0.1.4!)

Use different Ollama models optimized for specific tasks:
```bash
# Use fast OCR model for text, powerful model for diagrams
netintel-ocr document.pdf --model nanonets-ocr-s --network-model qwen2.5vl

# Fast processing with lightweight models
netintel-ocr document.pdf --model moondream --network-model bakllava

# Heavy processing for complex network diagrams
netintel-ocr document.pdf --network-model cogvlm --timeout 120
```

**Multi-Model Benefits:**
- 30-50% faster text extraction with OCR-optimized models
- Better diagram understanding with vision-language models
- Resource efficiency by using appropriate model sizes
- Flexibility to experiment with different combinations

**Recommended Model Combinations:**
| Purpose | Text Model | Network Model | Speed |
|---------|------------|---------------|-------|
| Balanced (Default) | nanonets-ocr-s | qwen2.5vl | Medium |
| Fast Processing | moondream | bakllava | Fast |
| Maximum Accuracy | qwen2.5vl | cogvlm | Slow |
| Resource Limited | moondream | llava-phi3 | Fast |

### Table Extraction (NEW v0.1.6!)

NetIntel-OCR now automatically detects and extracts tables from PDFs:
```bash
# Tables are extracted by default in hybrid mode
netintel-ocr document.pdf

# Use library-first extraction for faster processing
netintel-ocr document.pdf --table-method pdfplumber

# Use LLM for complex tables with merged cells
netintel-ocr document.pdf --table-method llm

# Save tables as separate JSON files
netintel-ocr document.pdf --save-table-json

# Disable table extraction for faster processing
netintel-ocr document.pdf --no-tables
```

**Table Extraction Features:**
- **Automatic Detection**: Tables identified alongside network diagrams
- **Multiple Methods**: Library-first (pdfplumber), LLM-enhanced, or hybrid
- **Complex Table Support**: Handles merged cells, multi-row fields, nested headers
- **Structured Output**: Tables converted to JSON with validation
- **Markdown Integration**: Tables embedded in markdown with both rendered and JSON views

### Vector Database Integration (NEW v0.1.7!)

NetIntel-OCR now **automatically generates** vector database files optimized for RAG applications:

```bash
# Vector generation is ON by default - creates LanceDB-ready chunks
netintel-ocr document.pdf

# Disable vector generation (v0.1.6 behavior)
netintel-ocr document.pdf --no-vector

# Customize chunking strategy
netintel-ocr document.pdf --chunk-size 512 --chunk-overlap 50

# Use semantic chunking (default) vs fixed-size
netintel-ocr document.pdf --chunk-strategy semantic
```

**Vector Features:**
- **Automatic Generation**: Creates `document-vector.md` and `chunks.jsonl` by default
- **Content Filtering**: Removes processing artifacts, keeps only source content
- **Minimal Metadata**: Only source filename, page numbers, and indexed date
- **LanceDB Optimized**: Pre-chunked JSONL ready for direct ingestion
- **Smart Chunking**: Semantic boundaries respect document structure

**Using with LanceDB:**
```python
import lancedb
import json

# Load chunks generated by NetIntel-OCR
with open("output/<md5>/lancedb/chunks.jsonl") as f:
    chunks = [json.loads(line) for line in f]

# Create LanceDB table - ready to use!
db = lancedb.connect("./my_lancedb")
table = db.create_table("documents", chunks)

# Search your documents
results = table.search("network configuration").limit(5).to_list()
```

### Performance Optimization

For faster processing of network diagrams, use the `--fast-extraction` flag:
```bash
# Fast extraction mode - reduces extraction time by 50-70%
netintel-ocr document.pdf --fast-extraction

# Combine with multi-model and timeout for best performance
netintel-ocr document.pdf --model nanonets-ocr-s --network-model bakllava --fast-extraction --timeout 30
```

**Fast extraction benefits:**
- Detection: ~15 seconds (vs 30-60s standard)
- Extraction: ~20 seconds (vs 30-60s standard)
- Uses simplified prompts for quicker LLM responses
- Automatic fallback if fast extraction fails

### Command Line Options

#### Basic Options
- `--output`, `-o`: Base output directory (default: "output", documents stored in `output/<md5_checksum>/`)
- `--model`, `-m`: Ollama model for text extraction (default: "nanonets-ocr-s:latest")
- `--network-model`: Separate model for network diagram processing (NEW v0.1.4)
- `--keep-images`, `-k`: Keep the intermediate image files (default: False)
- `--width`, `-w`: Width to resize images to, 0 to skip resizing (default: 0)
- `--start`, `-s`: Start page number (default: 0, processes from beginning)
- `--end`, `-e`: End page number (default: 0, processes to end)
- `--resume`: Resume processing from checkpoint if available (NEW v0.1.5)

#### Processing Mode Options
- `--text-only`, `-t`: Skip network diagram detection for faster text-only processing
- `--network-only`: Process only network diagrams, skip regular text pages

#### Network Diagram Options (applies to default mode)
- `--confidence`, `-c`: Minimum confidence threshold for network diagram detection (0.0-1.0, default: 0.7)
- `--no-icons`: Disable Font Awesome icons in Mermaid diagrams (icons are enabled by default)
- `--diagram-only`: Only extract network diagrams without page text (by default, both are extracted)
- `--timeout`: Timeout in seconds for each LLM operation (default: 60s, increase for complex diagrams)

#### Vector Database Options (NEW v0.1.7)
- `--no-vector`: Disable vector generation (default: enabled)
- `--vector-format`: Target vector DB format (default: lancedb, options: pinecone, weaviate, qdrant, chroma)
- `--chunk-size`: Chunk size in tokens (default: 1000)
- `--chunk-overlap`: Overlap between chunks (default: 100)
- `--chunk-strategy`: Chunking strategy (default: semantic, options: fixed, sentence)
- `--embedding-metadata`: Include extended metadata (reduces content space)
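
The interaction between `--chunk-size` and `--chunk-overlap` can be sketched with a fixed-size chunker. This is an illustrative Python sketch only: it corresponds to a `fixed` style strategy, it operates on a pre-tokenized list, and the function name `chunk_tokens` is hypothetical. How NetIntel-OCR tokenizes text or finds semantic boundaries is not shown here.

```python
def chunk_tokens(tokens, chunk_size=1000, chunk_overlap=100):
    """Split a token list into windows of `chunk_size` tokens.

    Consecutive windows share `chunk_overlap` tokens, so each new window
    starts `chunk_size - chunk_overlap` tokens after the previous one.
    """
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be in [0, chunk_size)")

    step = chunk_size - chunk_overlap
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + chunk_size])
        # Stop once a window has reached the end of the input.
        if start + chunk_size >= len(tokens):
            break
    return chunks


if __name__ == "__main__":
    words = "the firewall sits between the router and the core switch".split()
    for chunk in chunk_tokens(words, chunk_size=4, chunk_overlap=1):
        print(chunk)
```

A larger overlap preserves more context across chunk boundaries at the cost of more chunks to embed.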

### Examples

#### Basic Usage (with automatic network detection)
```bash
# DEFAULT: Automatic network diagram detection (with icons)
netintel-ocr document.pdf

# Process with custom settings
netintel-ocr document.pdf --confidence 0.8

# Increase timeout for complex diagrams
netintel-ocr document.pdf --timeout 120

# Text-only mode (faster, no detection)
netintel-ocr document.pdf --text-only

# Process specific pages
netintel-ocr document.pdf --start 1 --end 5

# Use a different Ollama model
netintel-ocr document.pdf --model qwen2.5vl:latest
```

#### Specialized Processing
```bash
# Process ONLY network diagrams (skip text pages)
netintel-ocr network-architecture.pdf --network-only

# Higher confidence threshold (stricter detection)
netintel-ocr document.pdf --confidence 0.9

# Disable icons if not needed
netintel-ocr document.pdf --no-icons

# Extract only diagrams without text (faster)
netintel-ocr document.pdf --diagram-only

# Faster text-only processing
netintel-ocr text-document.pdf --text-only
```

Process large documents in sections (max 100 pages per run):
```bash
# Process first 100 pages
netintel-ocr large-document.pdf --start 1 --end 100

# Process next section
netintel-ocr large-document.pdf --start 101 --end 200

# Process specific chapter (e.g., pages 50-100)
netintel-ocr large-document.pdf --start 50 --end 100
```

## Checkpoint/Resume Capability (NEW v0.1.5)

The tool now supports automatic checkpoint saving and resume functionality for long documents:

### How It Works
- **Automatic Saving**: Processing state is saved after each page
- **Checkpoint Location**: Stored in `output/<md5>/.checkpoint/`
- **Resume on Interruption**: Use `--resume` to continue from where you left off
- **Page-Level Tracking**: Each page is tracked individually
- **Smart Skip**: Already processed pages are skipped when resuming
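
The behavior listed above can be sketched in Python as a page-level checkpoint that is written after every page and consulted on resume. This is a minimal sketch under stated assumptions: the file name `state.json`, its JSON layout, and all function names are hypothetical and are not taken from the tool's source.

```python
import json
from pathlib import Path


def load_completed(checkpoint_dir: Path) -> set:
    """Return the set of page numbers recorded as completed."""
    state_file = checkpoint_dir / "state.json"  # hypothetical file name
    if not state_file.exists():
        return set()
    return set(json.loads(state_file.read_text())["completed_pages"])


def mark_completed(checkpoint_dir: Path, completed: set, page: int) -> None:
    """Record one finished page, replacing the state file atomically."""
    completed.add(page)
    checkpoint_dir.mkdir(parents=True, exist_ok=True)
    tmp = checkpoint_dir / "state.json.tmp"
    tmp.write_text(json.dumps({"completed_pages": sorted(completed)}))
    tmp.replace(checkpoint_dir / "state.json")


def process_document(pages, checkpoint_dir: Path, process_page, resume=False):
    """Process pages in order, skipping completed ones when resuming."""
    completed = load_completed(checkpoint_dir) if resume else set()
    processed_now = []
    for page in pages:
        if page in completed:
            continue  # already done in a previous run
        process_page(page)
        mark_completed(checkpoint_dir, completed, page)
        processed_now.append(page)
    return processed_now
```

Writing to a temporary file and then renaming it means an interruption during the save leaves the previous checkpoint intact.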

### Usage Examples
```bash
# Start processing a large document
netintel-ocr large-document.pdf

# If interrupted (Ctrl+C, power failure, etc.), resume processing
netintel-ocr large-document.pdf --resume

# Resume with different settings (completed pages are kept)
netintel-ocr large-document.pdf --resume --timeout 120 --network-model qwen2.5vl
```

### Resume Information
When resuming, you'll see a summary like:
```
╔════════════════════════════════════════════════════════════╗
║                  RESUME CHECKPOINT FOUND                   ║
╠════════════════════════════════════════════════════════════╣
║ Previous Processing:                                        ║
║   • Pages completed: 45/100                                ║
║   • Network diagrams found: 5                              ║
║   • Regular pages: 40                                      ║
║   • Failed pages: 0                                        ║
║                                                            ║
║ Resume Information:                                        ║
║   • Will skip 45 already processed pages                   ║
║   • Will process 55 remaining pages                        ║
║   • Starting from page 46                                  ║
╚════════════════════════════════════════════════════════════╝
```

### Benefits
- **No Lost Work**: Never lose progress on long documents
- **Resource Efficient**: Don't reprocess completed pages

## Vector Regeneration (v0.1.10)

### Regenerate Vector Files Without Reprocessing
Use `--vector-regenerate` to regenerate vector database files from existing markdown output:

```bash
# First time processing
netintel-ocr document.pdf

# Regenerate vectors with different chunk settings
netintel-ocr document.pdf --vector-regenerate --chunk-size 500 --chunk-overlap 100

# Change vector database format
netintel-ocr document.pdf --vector-regenerate --vector-format pinecone

# Use different chunking strategy
netintel-ocr document.pdf --vector-regenerate --chunk-strategy sentence
```

### When to Use Vector Regeneration
- **Optimize chunk size**: Adjust for better embedding performance
- **Change vector format**: Switch between LanceDB, Pinecone, Weaviate, etc.
- **Update metadata**: Add or remove extended metadata
- **Fix errors**: Regenerate after fixing vector generation issues
- **Experiment**: Try different strategies without re-OCR

### Benefits
- **Flexible**: Change chunking and format settings without reprocessing the PDF
- **Automatic**: No manual intervention needed

## Processing Guidelines

### Document Size Recommendations

| Document Size | Processing Strategy | Example |
|--------------|-------------------|---------|
| 1-50 pages | Single run | `netintel-ocr doc.pdf` |
| 51-100 pages | Single run or split | `netintel-ocr doc.pdf` |
| 101-300 pages | Process in 100-page sections | See examples below |
| 300+ pages | Process key sections only | Use specific page ranges |

### Processing Large Documents

For a 250-page document:
```bash
# Section 1: Pages 1-100
netintel-ocr document.pdf --start 1 --end 100 -o output_section1

# Section 2: Pages 101-200
netintel-ocr document.pdf --start 101 --end 200 -o output_section2

# Section 3: Pages 201-250
netintel-ocr document.pdf --start 201 --end 250 -o output_section3
```
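
For documents of arbitrary length, the section boundaries can be computed rather than typed by hand. The Python sketch below generates 1-based, inclusive page ranges of at most 100 pages and prints the matching commands. The helper name `page_ranges` is hypothetical, and the script only prints commands; it does not run the tool.

```python
def page_ranges(total_pages, max_per_run=100):
    """Return inclusive (start, end) page ranges covering the document."""
    if total_pages < 0 or max_per_run <= 0:
        raise ValueError("total_pages must be >= 0 and max_per_run > 0")
    return [
        (start, min(start + max_per_run - 1, total_pages))
        for start in range(1, total_pages + 1, max_per_run)
    ]


if __name__ == "__main__":
    for index, (start, end) in enumerate(page_ranges(250), start=1):
        print(
            f"netintel-ocr document.pdf --start {start} --end {end} "
            f"-o output_section{index}"
        )
```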

## Network Diagram Detection (Now Default!)

**NEW**: Network diagram detection is now enabled by default! No flags needed.

netintel-ocr automatically (in order):

1. **Transcribes** text content FIRST (guaranteed capture)
2. **Detects** network diagrams in PDF pages
3. **Identifies** components (routers, switches, firewalls, servers, databases, etc.)
4. **Extracts** connections and relationships
5. **Converts** to Mermaid.js format
6. **Combines** BOTH the diagram AND the page's text content
7. **Embeds** everything in unified markdown output
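
The conversion step (components and connections to Mermaid.js) can be illustrated with a short Python sketch. It assumes components arrive as `(node_id, kind, label)` tuples and connections as `(source, target)` pairs; that data model, the `SHAPES` mapping, and the function name `to_mermaid` are assumptions for illustration. The node shapes mirror the sample output shown in this README.

```python
# Mermaid node delimiters per component kind (assumed mapping).
SHAPES = {
    "router": ("([", "])"),
    "switch": ("[", "]"),
    "firewall": ("{{", "}}"),
    "server": ("[(", ")]"),
}


def to_mermaid(components, connections, direction="TB"):
    """Render components and connections as Mermaid graph source."""
    lines = [f"graph {direction}"]
    for node_id, kind, label in components:
        # Unknown kinds fall back to a plain rectangle.
        left, right = SHAPES.get(kind, ("[", "]"))
        lines.append(f"    {node_id}{left}{label}{right}")
    lines.append("")
    for source, target in connections:
        lines.append(f"    {source} --> {target}")
    return "\n".join(lines)


if __name__ == "__main__":
    components = [
        ("Router", "router", "Main Router"),
        ("FW", "firewall", "Firewall"),
        ("Switch", "switch", "Core Switch"),
    ]
    print(to_mermaid(components, [("Router", "FW"), ("FW", "Switch")]))
```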

### Supported Network Components
- 🔀 Routers and Switches
- 🛡️ Firewalls
- 🖥️ Servers and Workstations
- 💾 Databases
- ⚖️ Load Balancers
- ☁️ Cloud Services
- 📡 Wireless Access Points

### Output Format

Network diagrams are saved as markdown with embedded Mermaid code:

````markdown
# Page 5 - Network Diagram

**Type**: topology
**Detection Confidence**: 0.95
**Components**: 8 detected
**Connections**: 12 detected

## Diagram

```mermaid
graph TB
    Router([Main Router])
    Switch[Core Switch]
    FW{{Firewall}}
    Server1[(Web Server)]

    Router --> FW
    FW --> Switch
    Switch --> Server1
```

## Page Text Content

This section describes the SD-WAN architecture with multiple branch offices
connecting to headquarters through various transport methods including MPLS,
broadband, and LTE connections. The solution provides path selection,
application-aware routing, and centralized management...
````

## Output Structure (Enhanced v0.1.4)

All output is organized using MD5 checksums for unique document identification:

```
output/                                    # Base directory (configurable with --output)
├── index.md                              # Master index tracking all processed documents
├── 6c928950e6b73fffe316e0ad6bba3a67/    # MD5 checksum as folder name
│   ├── markdown/                         # All transcribed content
│   │   ├── page_001.md                  # Individual page (text or diagram)
│   │   ├── page_002.md    
│   │   └── document.md                  # Complete merged document with footer metrics
│   ├── images/                          # Original page images (if --keep-images)
│   └── summary.md                       # Processing summary and statistics
└── 0611ca05dab284e943e3b00d3993d424/    # Another document's folder
    └── ...
```

**Benefits:**
- Same document won't be processed twice (deduplication)
- Easy to find previous processing results
- index.md provides overview of all processed documents
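
The MD5-based folder naming can be reproduced outside the tool, for example to locate the output of a given PDF. The Python sketch below hashes the file contents and derives the folder path. The function names are hypothetical, and using the presence of `markdown/document.md` as the "already processed" signal is an assumption based on the layout shown above.

```python
import hashlib
from pathlib import Path


def md5_checksum(path, block_size=1 << 20) -> str:
    """Return the hex MD5 digest of a file, read in blocks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for block in iter(lambda: handle.read(block_size), b""):
            digest.update(block)
    return digest.hexdigest()


def output_dir_for(pdf_path, base="output") -> Path:
    """Return the per-document folder: <base>/<md5_checksum>/."""
    return Path(base) / md5_checksum(pdf_path)


def already_processed(pdf_path, base="output") -> bool:
    """Assumption: a merged document.md marks a finished document."""
    return (output_dir_for(pdf_path, base) / "markdown" / "document.md").exists()
```

Because the folder name depends only on file contents, renaming a PDF does not change where its output lives.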

### Index File (output/index.md)
Automatically tracks all processed documents:
```markdown
| Filename | Timestamp | MD5 Checksum | Folder | Processing Time |
|----------|-----------|--------------|--------|----------------|
| network.pdf | 2025-08-20 14:30:15 | `6c9289...` | [📁 6c9289...](./6c9289.../) | 2m 30s |
| manual.pdf | 2025-08-20 14:35:22 | `0611ca...` | [📁 0611ca...](./0611ca.../) | 1m 45s |
```

### Enhanced Footer Metrics (NEW v0.1.4)
Every merged document includes comprehensive processing metrics:
- **Document Info**: Source file, size, MD5 checksum, pages processed
- **Processing Details**: Date/time, models used, processing time, mode
- **Quality Report**: Errors, warnings, success metrics
- **Configuration**: Settings used during processing

## Processing Modes

### Default: Hybrid Mode (Text-First)
- **Text-First Approach**: ALWAYS transcribes text before attempting diagram detection
- **Guaranteed Content**: Text is captured even if diagram processing fails
- **Automatic Detection**: Every page is analyzed for network diagrams
- **Dual Content Extraction**: Pages with diagrams include BOTH Mermaid diagram AND text content
- **Intelligent Processing**: Network diagrams → Mermaid (with icons), Text → Markdown
- **Progress Tracking**: Detailed step-by-step progress messages
- **Smart Timeouts**: Operations timeout after 60s with automatic fallback
- **Processing Time**: 30-60 seconds per page
- **Best For**: Most documents (mixed content)

### Text-Only Mode (`--text-only`)
- **No Detection**: Skip diagram detection for speed
- **Processing Time**: 15-30 seconds per page
- **Best For**: Documents with only text

### Network-Only Mode (`--network-only`)
- **Diagram Focus**: Process only network diagrams
- **Processing Time**: 30-60 seconds per diagram
- **Best For**: Network architecture documents

## Performance & Troubleshooting

### If Processing is Slow or Stuck

The tool now includes detailed progress messages showing what's happening and which models are being used:
```
  Page 3: Processing...
    Transcribing page text (nanonets-ocr-s)... Done (12.3s)  <-- Text captured first!
    Checking for network diagram (qwen2.5vl)... Done (2.1s)
    Network diagram detected (confidence: 0.90)
    Type: topology
    Extracting components (qwen2.5vl)... Done (5.1s)
    Generating Mermaid diagram (qwen2.5vl)... Done (8.2s)
    Validating Mermaid syntax... Valid (0.1s)
    Writing to file... Done (0.1s)
    Total processing time: 27.9s
```

**Important**: Text is ALWAYS transcribed first, so even if diagram processing times out or fails, you'll still have the page content.

If an operation takes too long:
- **Default timeout**: 60 seconds per operation
- **Adjust timeout**: Use `--timeout 120` for complex diagrams
- **Automatic fallback**: If LLM times out, falls back to simpler methods
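
The timeout-with-fallback pattern described above can be sketched in Python using the standard library. This is a generic illustration, not the tool's code: `run_with_fallback` is a hypothetical helper, and `primary` and `fallback` stand in for an LLM call and a simpler method.

```python
from concurrent.futures import ThreadPoolExecutor
from concurrent.futures import TimeoutError as FutureTimeout


def run_with_fallback(primary, fallback, timeout_s=60.0):
    """Run `primary`; if it exceeds `timeout_s`, return `fallback()` instead."""
    pool = ThreadPoolExecutor(max_workers=1)
    future = pool.submit(primary)
    try:
        return future.result(timeout=timeout_s)
    except FutureTimeout:
        # The slow call keeps running in its worker thread; we simply
        # stop waiting for it and use the simpler method.
        return fallback()
    finally:
        pool.shutdown(wait=False)


if __name__ == "__main__":
    import time

    def slow_llm_call():
        time.sleep(0.5)
        return "detailed result"

    print(run_with_fallback(slow_llm_call, lambda: "simple result", timeout_s=0.1))
```

Note that a thread cannot be forcibly cancelled this way, so the timed-out call still consumes resources until it returns.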

### Common Issues and Fixes

#### Mermaid Syntax Errors (Robust Auto-Fix)
The tool uses a comprehensive validator to automatically fix Mermaid syntax issues:

**Phase 1 - Basic Cleanup:**
- C-style comments (`//`) → Removed or converted to Mermaid comments (`%%`)
- Curly braces in graph declarations → Removed
- Invalid syntax elements → Cleaned

**Phase 2 - Node ID Fixing:**
- Spaces in node IDs → Converted to underscores (e.g., `Data Center` → `Data_Center`)
- Special characters → Replaced with safe alternatives
- Duplicate node IDs → Automatically numbered (e.g., `Server`, `Server2`, `Server3`)

**Phase 3 - Connection Fixing:**
- Updates all connections to use fixed node IDs
- Preserves connection types and labels
- Maintains directional flow

**Phase 4 - Style Application:**
- Fixes class applications to use corrected node IDs
- Preserves styling and visual attributes

**Examples of Auto-Fixes:**
- `subgraph_DMZ` → `subgraph DMZ`
- `Data Center (HQ)` → `Data_Center_HQ` (as node ID)
- Parentheses in labels → Automatically quoted
- Multiple `Secure SD-WAN` nodes → `Secure_SD_WAN`, `Secure_SD_WAN2`, etc.
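
The node ID fixes in Phase 2 can be sketched in a few lines of Python. This is an illustrative approximation, not the tool's validator: the function names are hypothetical, and the sketch does not guard against a generated ID such as `Server2` colliding with a label that is literally `Server2`.

```python
import re


def sanitize_node_id(label: str) -> str:
    """Replace spaces and special characters with underscores."""
    cleaned = re.sub(r"[^0-9A-Za-z]+", "_", label).strip("_")
    return cleaned or "node"


def assign_node_ids(labels):
    """Return one unique ID per label, numbering repeated IDs."""
    seen = {}
    ids = []
    for label in labels:
        base = sanitize_node_id(label)
        seen[base] = seen.get(base, 0) + 1
        # The first occurrence keeps the bare ID; later ones get 2, 3, ...
        ids.append(base if seen[base] == 1 else f"{base}{seen[base]}")
    return ids


if __name__ == "__main__":
    print(assign_node_ids(["Data Center (HQ)", "Secure SD-WAN", "Secure SD-WAN"]))
```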

### Centralized Database Management (NEW v0.1.12!)

NetIntel-OCR now supports unified database management with advanced query capabilities:

```bash
# Create unified database from per-document databases
netintel-ocr --merge-to-centralized --output ./documents --centralized-db ./unified.lancedb

# Query with advanced filtering and ranking
netintel-ocr --query "firewall configuration" \
  --centralized-db ./unified.lancedb \
  --filters '{"document_type": "network_diagram", "confidence": {"$gte": 0.8}}' \
  --rerank-strategy semantic \
  --output-format json \
  --limit 20

# Get database statistics and health
netintel-ocr --db-stats ./unified.lancedb
netintel-ocr --db-optimize ./unified.lancedb --vacuum --reindex
```
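
The `--filters` value in the query example is a JSON object that mixes plain equality (`document_type`) with an operator form (`{"$gte": 0.8}`). The sketch below shows one plausible way such a filter could be evaluated against a record. It is an assumption about the semantics, not the tool's query engine: `matches` is a hypothetical name, and only `$gte` appears in the example above, so the other operators are guesses.

```python
import operator

# Only "$gte" is shown in the README; the rest are assumed for illustration.
_OPS = {
    "$gte": operator.ge,
    "$gt": operator.gt,
    "$lte": operator.le,
    "$lt": operator.lt,
}


def matches(record, filters):
    """Return True if `record` satisfies every condition in `filters`."""
    for field, cond in filters.items():
        if field not in record:
            return False
        value = record[field]
        if isinstance(cond, dict):
            # Operator form, e.g. {"$gte": 0.8}. An unknown operator
            # raises KeyError rather than being silently ignored.
            if not all(_OPS[op](value, bound) for op, bound in cond.items()):
                return False
        elif value != cond:
            # Plain value: exact equality.
            return False
    return True
```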

**Key Features:**
- **Deduplication**: Automatic MD5-based duplicate detection
- **Multi-field Filtering**: Query by source, type, confidence, date ranges
- **Reranking**: Semantic, hybrid, and temporal reranking strategies
- **Export Formats**: JSON, Markdown, CSV with customizable fields
- **Validation**: Automatic schema validation and integrity checks
- **Statistics**: Comprehensive database metrics and health monitoring
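
MD5-based duplicate detection can be illustrated with the standard library alone. This sketch assumes the checksum is taken over the raw file bytes, which the feature list does not spell out; `md5_of_file` and `deduplicate` are hypothetical helper names.

```python
import hashlib


def md5_of_file(path, chunk_size=1 << 20):
    """Return the hex MD5 of a file, read in chunks to bound memory use."""
    # The hash is a content fingerprint here, not a security control.
    digest = hashlib.md5(usedforsecurity=False)
    with open(path, "rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def deduplicate(paths):
    """Split paths into first-seen files and duplicates of them.

    Returns (unique, duplicates): `unique` maps checksum -> first path,
    `duplicates` lists (duplicate_path, original_path) pairs.
    """
    unique, duplicates = {}, []
    for path in paths:
        checksum = md5_of_file(path)
        if checksum in unique:
            duplicates.append((path, unique[checksum]))
        else:
            unique[checksum] = path
    return unique, duplicates
```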

### Enhanced Batch Processing (NEW v0.1.12!)

Process multiple PDFs efficiently with parallel processing and automatic merging:

```bash
# Batch process directory with parallel workers
netintel-ocr --batch-ingest ./pdf_directory \
  --output ./batch_output \
  --parallel-workers 6 \
  --checkpoint-interval 5 \
  --auto-merge \
  --s3-sync

# Resume interrupted batch processing
netintel-ocr --batch-ingest ./pdf_directory \
  --output ./batch_output \
  --resume-batch \
  --skip-existing
```

**Performance Benefits:**
- **Parallel Processing**: Up to 8x faster with multiple workers
- **Progress Tracking**: Real-time progress with ETA and throughput
- **Checkpoint Resume**: Resume from interruption point
- **Memory Management**: Intelligent worker allocation based on system resources
- **Auto-merge**: Automatic centralized database updates
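
Checkpoint resume can be sketched as a loop that records completed files and skips them on the next run. This is an illustration, not the tool's implementation: all names are hypothetical, and it assumes `--checkpoint-interval 5` means "save after every 5 completed files", which the README does not state.

```python
import json
import os
import tempfile


def save_checkpoint(path, done):
    """Write the set of completed items atomically."""
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp = tempfile.mkstemp(dir=directory, suffix=".tmp")
    with os.fdopen(fd, "w") as fh:
        json.dump(sorted(done), fh)
    # Rename within one directory, so readers never see a partial file.
    os.replace(tmp, path)


def load_checkpoint(path):
    """Return the set of completed items, or an empty set if none saved."""
    try:
        with open(path) as fh:
            return set(json.load(fh))
    except FileNotFoundError:
        return set()


def process_batch(items, process, checkpoint_path, interval=5):
    """Process items in order, skipping those already checkpointed."""
    done = load_checkpoint(checkpoint_path)
    since_save = 0
    for item in items:
        if item in done:
            continue
        process(item)
        done.add(item)
        since_save += 1
        if since_save >= interval:
            save_checkpoint(checkpoint_path, done)
            since_save = 0
    save_checkpoint(checkpoint_path, done)
    return done
```

With this design, an interruption loses at most the work done since the last save: files processed after the most recent checkpoint are simply processed again on resume.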

### S3/MinIO Cloud Storage (NEW v0.1.12!)

Full cloud storage integration for distributed deployments:

```bash
# Configure cloud storage
export S3_ENDPOINT=https://minio.company.com
export S3_BUCKET=netintel-docs
export AWS_ACCESS_KEY_ID=admin
export AWS_SECRET_ACCESS_KEY=password123

# Process with cloud sync
netintel-ocr document.pdf --s3-sync --s3-backup

# Batch process from cloud
netintel-ocr --batch-ingest s3://netintel-docs/input/ \
  --output s3://netintel-docs/output/ \
  --centralized-db s3://netintel-docs/unified.lancedb
```

**Cloud Features:**
- **Bi-directional Sync**: Upload/download with versioning
- **Backup/Restore**: Automatic backup with retention policies
- **Distributed Access**: Multiple workers can access shared storage
- **Credentials Management**: Support for AWS IAM, MinIO admin, environment variables
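
The batch example passes `s3://` locations for input, output, and the centralized database. As a small illustration of how such a location breaks down into a bucket and an object key, the sketch below uses only the standard library. `split_s3_uri` is a hypothetical helper, not part of the NetIntel-OCR API.

```python
from urllib.parse import urlsplit


def split_s3_uri(uri):
    """Split 's3://bucket/key' into (bucket, key)."""
    parts = urlsplit(uri)
    if parts.scheme != "s3" or not parts.netloc:
        raise ValueError(f"not an s3:// URI: {uri!r}")
    # The bucket is the authority part; the key is the path without
    # its leading slash.
    return parts.netloc, parts.path.lstrip("/")
```

For `s3://netintel-docs/input/` this yields the bucket `netintel-docs` and the key prefix `input/`.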

### Advanced Embedding Management (NEW v0.1.12!)

Enhanced embedding generation with multiple providers and caching:

```bash
# Configure multiple embedding providers
netintel-ocr document.pdf \
  --embedding-provider openai \
  --embedding-model text-embedding-3-large \
  --embedding-cache-ttl 7200 \
  --batch-size 50

# Use local Ollama embeddings
netintel-ocr document.pdf \
  --embedding-provider ollama \
  --embedding-model mxbai-embed-large \
  --embedding-cache ./embeddings_cache
```

**Embedding Features:**
- **Multiple Providers**: OpenAI, Ollama, HuggingFace support
- **Caching with TTL**: Intelligent caching to avoid recomputation
- **Batch Processing**: Efficient batch embedding generation
- **Model Management**: Automatic model configuration and validation
- **Cost Optimization**: Cache hits reduce API costs by up to 90%
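
Caching with a TTL can be sketched as an in-memory map from a text fingerprint to a vector and its expiry time. This is a simplified assumption about how such a cache behaves: `TTLEmbeddingCache` and `embed_fn` are hypothetical names, and the real cache (for instance the on-disk one behind `--embedding-cache`) may be organized differently.

```python
import hashlib
import time


class TTLEmbeddingCache:
    """Cache embeddings per (model, text) for a limited time."""

    def __init__(self, embed_fn, ttl_s=7200.0, clock=time.monotonic):
        self._embed_fn = embed_fn  # callable: text -> vector
        self._ttl_s = ttl_s
        self._clock = clock        # injectable for testing
        self._store = {}           # key -> (expires_at, vector)
        self.hits = 0
        self.misses = 0

    def get(self, model, text):
        """Return a cached vector, or compute and cache a fresh one."""
        # Key on both model and text so models never share vectors.
        key = hashlib.sha256(f"{model}\x00{text}".encode("utf-8")).hexdigest()
        now = self._clock()
        entry = self._store.get(key)
        if entry is not None and entry[0] > now:
            self.hits += 1
            return entry[1]
        # Missing or expired: call the provider and store the result.
        self.misses += 1
        vector = self._embed_fn(text)
        self._store[key] = (now + self._ttl_s, vector)
        return vector
```

Every hit avoids one provider call, which is where the cost savings for paid APIs come from.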

## Recent Improvements

### Version 0.1.12 (2025-08-21)
- ✅ **Centralized Database Management**: Unified LanceDB with MD5 deduplication
- ✅ **Advanced Query Engine**: Vector search with filtering, reranking, and multiple output formats
- ✅ **Batch Processing Pipeline**: Parallel PDF processing with progress tracking and checkpoints
- ✅ **S3/MinIO Storage Backend**: Cloud storage integration with bi-directional sync
- ✅ **Enhanced CLI Commands**: `--query`, `--merge-to-centralized`, `--batch-ingest`, `--db-stats`, `--db-optimize`
- ✅ **Embedding Management**: Multiple provider support with caching and TTL
- ✅ **Database Optimization**: Validation, statistics, export, and backup capabilities

### Version 0.1.11 (2025-08-21)
- ✅ **Docker Support**: Complete Docker containerization with MinIO integration
- ✅ **Kubernetes Ready**: Full Helm chart for production deployments
- ✅ **Project Initialization**: `--init` command creates complete containerized environment
- ✅ **Configuration Management**: YAML-based configuration with environment variable overrides
- ✅ **Query Interface Foundation**: Query vector databases (enhanced in v0.1.12)
- ✅ **Centralized DB Foundation**: Merge per-document databases (enhanced in v0.1.12)

### Version 0.1.10 (2025-08-20)
- ✅ **Checkpoint/Resume**: Automatic saving and resume capability for long documents
- ✅ **Page-Level Tracking**: Individual page checkpoint tracking
- ✅ **Resume Summary**: Clear display of resume status and remaining work
- ✅ **Atomic Saves**: Checkpoint integrity with atomic file operations
- ✅ **Automatic Cleanup**: Checkpoints removed after successful completion

### Version 0.1.4 (2025-08-20)
- ✅ **Multi-Model Support**: Use different models for text and network processing
- ✅ **MD5-Based Output**: Unique folders per document using MD5 checksums
- ✅ **Document Index**: Automatic index.md tracking all processed documents
- ✅ **Enhanced Footer**: Comprehensive metrics in merged documents
- ✅ **Simplified Defaults**: Output to `output/` instead of timestamped folders
- ✅ **Model Progress Display**: Shows which model is being used for each operation
- ✅ **Deduplication**: Same document uses same output folder

### Version 0.1.3
- ✅ **Hybrid Mode by Default**: Automatic network diagram detection
- ✅ **Text-First Processing**: Guarantees content capture before diagram extraction
- ✅ **Fast Extraction Mode**: 50-70% faster processing option
- ✅ **Enhanced Error Recovery**: Graceful fallbacks and timeout management

### Version 0.1.0
- ✅ **Initial PyPI release**
- ✅ **Fixed Mermaid syntax issues**: Automatically handles parentheses in node labels
- ✅ **Improved component detection**: Fixed issue with multiple types being listed
- ✅ **Enhanced error handling**: Better fallback for malformed LLM responses
- ✅ **Automatic syntax correction**: C-style comments and invalid syntax auto-fixed
- ✅ **Better type selection**: Ensures components have single, specific types

## Limitations

- **Maximum 100 pages per processing run**: This limit ensures optimal processing time and prevents memory issues. For larger documents, use the `--start` and `--end` flags to process specific sections.
- **Network Detection Accuracy**: Detection confidence varies based on diagram complexity and clarity. Adjust the `--confidence` threshold as needed.
- **Model Requirements**: Network detection requires vision-capable models (e.g., nanonets-ocr-s, qwen2.5vl, llava).
- **Timeout Behavior**: Operations that exceed the timeout will fall back to simpler processing methods.
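
For documents over the 100-page limit, the page ranges to pass to `--start` and `--end` can be computed as below. This sketch assumes both flags take 1-based, inclusive page numbers; `page_ranges` is a hypothetical helper.

```python
def page_ranges(total_pages, max_per_run=100):
    """Split pages 1..total_pages into inclusive (start, end) chunks."""
    if total_pages < 0 or max_per_run < 1:
        raise ValueError("total_pages must be >= 0 and max_per_run >= 1")
    return [
        (start, min(start + max_per_run - 1, total_pages))
        for start in range(1, total_pages + 1, max_per_run)
    ]


# A 250-page document needs three runs.
print(page_ranges(250))  # [(1, 100), (101, 200), (201, 250)]
```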
