Metadata-Version: 2.4
Name: arango-entity-resolution
Version: 3.1.1
Summary: A production-ready entity resolution system for ArangoDB using record blocking, graph algorithms, and AI.
Project-URL: Homepage, https://github.com/your-repo/arango-entity-resolution
Project-URL: Documentation, https://github.com/your-repo/arango-entity-resolution#readme
Project-URL: Issues, https://github.com/your-repo/arango-entity-resolution/issues
Author: Entity Resolution Team
License-Expression: Apache-2.0
Classifier: Development Status :: 5 - Production/Stable
Classifier: Intended Audience :: Developers
Classifier: License :: OSI Approved :: Apache Software License
Classifier: Programming Language :: Python :: 3
Classifier: Programming Language :: Python :: 3.8
Classifier: Programming Language :: Python :: 3.9
Classifier: Programming Language :: Python :: 3.10
Classifier: Programming Language :: Python :: 3.11
Classifier: Topic :: Database
Classifier: Topic :: Scientific/Engineering :: Artificial Intelligence
Requires-Python: >=3.8
Requires-Dist: click>=8.1.7
Requires-Dist: fuzzywuzzy>=0.18.0
Requires-Dist: jellyfish>=1.0.3
Requires-Dist: numpy>=1.24.4
Requires-Dist: pandas>=2.1.4
Requires-Dist: python-arango>=8.0.0
Requires-Dist: python-dotenv>=1.0.0
Requires-Dist: python-levenshtein>=0.23.0
Requires-Dist: rich>=13.7.0
Requires-Dist: scikit-learn>=1.3.2
Provides-Extra: dev
Requires-Dist: black>=23.11.0; extra == 'dev'
Requires-Dist: flake8>=6.1.0; extra == 'dev'
Requires-Dist: mypy>=1.7.1; extra == 'dev'
Provides-Extra: ml
Requires-Dist: sentence-transformers>=2.2.0; extra == 'ml'
Requires-Dist: torch>=2.0.0; extra == 'ml'
Provides-Extra: test
Requires-Dist: pytest-cov>=4.1.0; extra == 'test'
Requires-Dist: pytest>=7.4.3; extra == 'test'
Description-Content-Type: text/markdown

# ArangoDB Advanced Entity Resolution System

**Current Version**: 3.1.0-stable | [Version History](VERSION_HISTORY.md) | [Changelog](CHANGELOG.md)

## What's New in v3.1

Version 3.1 introduces **Entity Resolution Enrichments** - specialized components for technical, hierarchical, and domain-specific entity resolution:

### Entity Resolution Enrichments
- **`TypeCompatibilityFilter`** - Pre-filter candidates by compatibility matrix to prevent nonsensical matches
- **`HierarchicalContextResolver`** - Use parent context to disambiguate similar names in hierarchical data
- **`AcronymExpansionHandler`** - Handle domain-specific abbreviations and acronyms during search
- **`RelationshipProvenanceSweeper`** - Remap relationships through consolidation with full audit trails
- **Cross-Domain Support** - Validated on Hardware ER and Medical domains

[Enrichments Guide](docs/enrichments.md) | [Examples](examples/enrichments/domain_agnostic_examples.py)

---

## What's New in v3.0

Version 3.0 introduces **general-purpose ER components** extracted from production implementations, enabling configuration-driven ER pipelines:

### Core Similarity Component
- **`WeightedFieldSimilarity`** - Standalone reusable similarity computation component
- Multiple algorithms (Jaro-Winkler, Levenshtein, Jaccard)
- Configurable field weights and null handling
- String normalization options
- Can be used independently or with batch services

### Enhanced Clustering
- **`WCCClusteringService`** - Now supports multiple algorithms:
- **Python DFS** - Reliable across all ArangoDB versions, uses bulk edge fetching
- **AQL Graph** (default) - Server-side processing for large graphs
- Eliminates N+1 query problems with single bulk edge fetch

### Address Entity Resolution
- **`AddressERService`** - Complete address deduplication pipeline
- Custom analyzer setup for address normalization
- ArangoSearch view configuration
- Blocking with registered agent handling
- Edge creation and optional clustering
- Configurable field mapping (works with any address schema)

### Configuration-Driven ER
- **`ERPipelineConfig`** - YAML/JSON-based ER pipeline configuration
- **`ConfigurableERPipeline`** - Run complete ER pipelines from configuration files
- Automatic service instantiation
- Validation and error handling
- Standardized ER patterns

### Key Benefits
- **92% code reduction** - Consuming projects reduce from 1,863 to 155 lines
- **50-100x performance** improvement for similarity computation
- **Standardized ER patterns** across all projects
- **Configuration-driven** - No code changes needed to tune ER parameters
- **Production-proven** - Battle-tested components from real-world implementations

[See Migration Guide](docs/guides/MIGRATION_GUIDE_V3.md) | [API Reference](docs/api/API_REFERENCE.md) | [Examples](examples/enhanced_er_examples.py)

---

## What's New in v2.1

**NEW** - Tier 3 (vector blocking) for semantic similarity-based entity resolution using vector embeddings:

### Vector Search Components
- **`EmbeddingService`** - Generate and manage vector embeddings for database records
- Pre-trained sentence-transformers models (see [model comparison](config/vector_search_setup.md#recommended-models))
- Batch processing (1000+ records/batch)
- Automatic storage in ArangoDB with metadata tracking
- Coverage statistics and monitoring

- **`VectorBlockingStrategy`** - Tier 3 (vector blocking) for semantic similarity
- Cosine similarity matching with configurable threshold
- Finds fuzzy matches that exact/text blocking miss (typos, abbreviations, variations)
- Optional geographic/categorical constraints
- Similarity distribution analysis for threshold tuning
- Integrates seamlessly with existing Tier 1 (exact) and Tier 2 (fuzzy text) blocking

### Key Benefits
- **Semantic matching** - Captures meaning beyond text similarity
- **Handles variations** - Typos, abbreviations, different phrasings
- **Configurable precision** - Tune similarity threshold for your data
- **Production ready** - 700+ tests, comprehensive documentation
- **Research-based** - Implements Ebraheem et al. (2018) tuple embeddings

[Quick Start](config/vector_search_setup.md) | [Example](examples/vector_blocking_example.py) | [API Reference](docs/api/API_REFERENCE.md#embedding-service)

---

## What's New in v2.0

Version 2.0 introduced **powerful new components** for production-grade entity resolution:

### Enhanced Blocking Strategies
- **`CollectBlockingStrategy`** - Efficient composite key blocking (phone+state, address+zip, etc.)
- **`BM25BlockingStrategy`** - Fast fuzzy text matching using ArangoSearch (400x faster than Levenshtein)

### Optimized Similarity & Clustering
- **`BatchSimilarityService`** - Batch document fetching with multiple algorithms (Jaro-Winkler, Levenshtein, Jaccard)
- **`SimilarityEdgeService`** - Bulk edge creation with metadata tracking
- **`WCCClusteringService`** - Server-side AQL graph clustering (handles millions of edges)

---

## Business Value & Impact

Entity resolution is a critical data quality challenge that directly impacts business outcomes across industries. Organizations typically lose **15-25% of revenue** due to duplicate customers, incomplete profiles, and fragmented data views. This system delivers measurable business value:

### **Revenue Protection & Growth**
- **Eliminate Revenue Leakage**: Prevent duplicate customer acquisition costs and conflicting outreach
- **Complete Customer 360 View**: Unified customer profiles enable targeted marketing and personalized experiences 
- **Improve Conversion Rates**: Accurate customer data increases campaign effectiveness by 20-40%
- **Reduce Operational Costs**: Automated deduplication saves hours of manual data cleaning

### **Compliance & Risk Management**
- **Regulatory Compliance**: Meet GDPR, CCPA data accuracy requirements
- **Risk Mitigation**: Identify hidden relationships and potential fraud patterns
- **Audit Trail**: Complete lineage tracking for data governance
- **Data Quality Assurance**: Continuous monitoring and validation of data integrity

### **Strategic Decision Making**
- **Accurate Analytics**: Clean, consolidated data improves business intelligence accuracy
- **Customer Lifetime Value**: Complete customer journeys enable better retention strategies
- **Market Segmentation**: Precise customer profiling for targeted product development
- **Operational Efficiency**: Streamlined processes through automated data consolidation

## Advanced Entity Resolution Techniques

This system implements a comprehensive, multi-stage entity resolution pipeline that combines traditional and cutting-edge AI/ML techniques:

### **1. Record Blocking (Foundation)**
**Full-Text Search for Candidate Generation** - The essential first step that makes large-scale entity resolution computationally feasible by reducing O(n^2) comparisons to O(n).

### **2. Graph Algorithms (Network Analysis)**
**Weakly Connected Components** - Identify entities connected through shared attributes (phone numbers, emails, addresses) to discover potential aliases and entity networks.

### **3. GraphML & Embeddings (Behavioral Analysis)**
**Vertex and Edge Embeddings** - Create vector representations of entities and their connections (behavioral patterns) to identify similar entities through geometric proximity in embedding space.

### **4. Vector Search (Semantic Similarity)**
**ArangoSearch Vector Capabilities** - Use embedding-based similarity to find entities that are semantically related, even without exact attribute matches.

### **5. GraphRAG & LLM Entity Extraction**
**Generative AI Document Processing** - Extract entities from unstructured documents using LLMs, with embeddings enabling semantic similarity-based entity resolution across document collections.

### **6. Geospatial Analysis (Temporal-Spatial Validation)**
**Location-Time Verification** - Determine whether similar entities are at the same place at the same time as confirmation for deduplication, or reject matches when proven to be in different locations simultaneously.

### **7. LLM-Based Curation (Intelligent Decision Making)**
**AI-Powered Match Evaluation** - Use Large Language Models to act as automated curators, evaluating similarity evidence from multiple techniques to make final entity resolution decisions with human-like reasoning.

---

## Why ArangoDB for Advanced Entity Resolution?

### **The Multi-Model Advantage**

Entity Resolution requires multiple data operations that traditional databases handle poorly in isolation:

#### **Document Storage & Retrieval**
Entity resolution starts with diverse, semi-structured data from multiple sources. ArangoDB's native document model excels at:
- **Flexible Schema**: Handle varying record structures without rigid table schemas
- **Rich Data Types**: Support complex nested objects, arrays, and mixed data types
- **Fast Ingestion**: Efficient bulk loading from CSV, JSON, XML, and API sources
- **Version Management**: Track data lineage and changes over time

#### **Graph Relationships & Analysis**
Entities exist in networks of relationships that relational databases struggle to model:
- **Native Graph Storage**: Model customer-company, person-address, and entity-entity relationships naturally
- **Graph Algorithms**: Built-in algorithms for clustering, community detection, and similarity scoring
- **Traversal Performance**: Fast relationship queries across millions of connected entities
- **Pattern Detection**: Identify complex relationship patterns indicating duplicate entities

#### **Full-Text Search & Similarity**
Traditional entity resolution bottleneck: comparing every record with every other record (O(n^2) complexity):
- **ArangoSearch Integration**: Elasticsearch-like capabilities natively integrated
- **Custom Analyzers**: Phonetic (Soundex), n-gram, stemming, and text normalization
- **Real-Time Indexing**: Immediate search availability as data loads
- **Fuzzy Matching**: Built-in edit distance, token matching, and similarity scoring

#### **Vector Search & Embeddings**
Modern AI-powered entity resolution through semantic similarity:
- **Native Vector Support**: Store and search high-dimensional embeddings directly in ArangoDB
- **GraphML Integration**: Generate node and edge embeddings from graph structure
- **ANN Search**: Approximate Nearest Neighbor search for fast similarity queries
- **Multi-Modal Embeddings**: Support for text, behavioral, and structural embeddings

#### **Geospatial Capabilities**
Location and time-based entity validation:
- **Native Geospatial Indexes**: GeoJSON support with spatial queries
- **Distance Calculations**: Determine if entities could be at same location
- **Temporal Queries**: Time-based filtering and validation
- **Spatial-Temporal Joins**: Correlate entity movements and interactions

### **Why Record Blocking as Our Foundation**

#### **The Scalability Challenge**
Without record blocking, entity resolution doesn't scale:
- **Naive Approach**: 1 million records = 500 billion comparisons
- **With Blocking**: Same dataset = ~50 million comparisons (99%+ reduction)
- **Performance Impact**: Hours become minutes, impossible becomes practical

#### **What is Record Blocking?**
Record blocking is a preprocessing technique that groups potentially similar records together, dramatically reducing the number of comparisons needed:

1. **Blocking Key Generation**: Create simplified representations of records (e.g., first 3 chars of name + zipcode)
2. **Candidate Selection**: Only compare records that share blocking keys
3. **Similarity Computation**: Apply expensive algorithms only to promising candidates
4. **Result Integration**: Combine results across different blocking strategies

#### **ArangoDB's Unique Record Blocking Advantages**

**Integrated Full-Text Search**
Most graph databases (Neo4j, Amazon Neptune) require external search engines for text-based blocking:
```
Traditional Approach: ArangoDB -> Elasticsearch -> Application -> Neo4j
Our Approach: ArangoDB <-> Application
```

**Multi-Strategy Blocking in Single Queries**
ArangoDB enables sophisticated blocking strategies impossible in other systems:

- **Exact Blocking**: `FOR doc IN customers FILTER doc.email == @target_email`
- **Phonetic Blocking**: `FOR doc IN customers FILTER SOUNDEX(doc.last_name) == SOUNDEX(@target_name)`
- **N-gram Blocking**: `FOR doc IN customers FILTER NGRAM_MATCH(doc.company, @target_company, 0.8)`
- **Sorted Neighborhood**: `FOR doc IN customers SORT doc.normalized_name LIMIT @window`

**Native Performance Optimization**
- **Persistent Indexes**: ArangoSearch indexes persist across restarts
- **Memory Management**: Intelligent caching of frequently accessed blocking keys
- **Parallel Processing**: Multi-threaded search across index segments
- **Query Optimization**: Automatic optimization of blocking key combinations

### **How Record Blocking Integrates with ArangoDB Features**

#### **Stage 1: Record Blocking (Foundation)**
Record blocking leverages ArangoDB's search capabilities to create efficient candidate pairs:

- **ArangoSearch Analyzers**: Use phonetic, n-gram, and text analyzers for flexible blocking keys
- **Multi-Index Strategy**: Create multiple blocking indexes for different similarity aspects
- **Dynamic Blocking**: Adjust blocking strategies based on data characteristics
- **Real-Time Updates**: Blocking keys update automatically as new records arrive

**Example AQL for Multi-Strategy Blocking:**
```aql
// Exact email blocking
FOR candidate IN customers
SEARCH candidate.email == @target_email
RETURN candidate

// Phonetic name blocking 
FOR candidate IN customers
SEARCH ANALYZER(candidate.last_name, "soundex") == ANALYZER(@target_name, "soundex")
RETURN candidate

// N-gram company blocking
FOR candidate IN customers 
SEARCH NGRAM_MATCH(candidate.company, @target_company, 0.8, "bigram")
RETURN candidate
```

#### **Stage 2: Similarity Computation (Precision)**
After blocking reduces candidates, apply sophisticated similarity algorithms:

- **Document Comparison**: Compare full record structures using ArangoDB's document capabilities
- **Field-Weighted Scoring**: Different importance for names, emails, phones, addresses
- **Probabilistic Methods**: Fellegi-Sunter framework for match/non-match classification
- **Custom Functions**: Foxx microservices for performance-critical similarity computations

#### **Stage 3: Graph-Based Clustering (Relationships)**
Use ArangoDB's graph features to group similar records into entities:

- **Weakly Connected Components**: Native graph algorithm for entity clustering
- **Similarity Edges**: Model similarity scores as weighted graph edges
- **Transitive Relationships**: If A matches B and B matches C, consider A-C relationship
- **Cluster Validation**: Graph metrics to assess cluster quality and detect over-clustering

#### **Stage 4: Golden Record Generation (Consolidation)**
Combine clustered records into authoritative master records:

- **Source Ranking**: Prioritize data from most reliable sources
- **Conflict Resolution**: Rules-based and ML approaches for conflicting values
- **Completeness Optimization**: Select most complete data across cluster members
- **Audit Trail**: Graph edges preserve lineage from golden record to source records

### **ArangoDB's Competitive Advantages for Entity Resolution**

## Project Overview

This **production-ready** entity resolution system identifies and links records from multiple data sources that refer to the same real-world entity. The system uses **record blocking as a strategic first step** to dramatically improve efficiency, followed by sophisticated graph-based algorithms and modern AI techniques for comprehensive entity resolution.

## System Architecture

### **High-Level Architecture**

The system follows a layered architecture with clear separation of concerns:

- **Data Sources**: CRM, Marketing, Sales, Support, External APIs
- **Entity Resolution Engine**: Core processing with ArangoDB multi-model database
- **Output & Integration**: Golden records, clusters, reports, API endpoints
- **Presentation System**: Interactive demos and stakeholder presentations

![High-Level Architecture](docs/diagrams/high-level-architecture.svg)

> **Detailed System Architecture**: See [system-architecture.svg](docs/diagrams/system-architecture.svg) for a comprehensive view of all system components, data flows, and integrations.

### **Component Architecture**

The system is organized into four main layers:

- **Core Services Layer**: BlockingService, SimilarityService, ClusteringService, GoldenRecordService
- **Data Management Layer**: DataManager, DatabaseManager, FoxxServices
- **Infrastructure Layer**: ArangoDB Multi-Model, ArangoSearch, Graph Algorithms
- **Presentation Layer**: Interactive Demos, Database Inspector, Business Analytics

![Component Architecture](docs/diagrams/component-architecture.svg)

### **ArangoDB Multi-Model Integration**

The system leverages ArangoDB's unique multi-model capabilities:

- **Document Store**: Raw records, golden records, metadata with ACID transactions
- **Graph Database**: Similarity edges, entity clusters, relationships with native algorithms
- **Search Engine**: Blocking indexes, text analyzers, fuzzy search with real-time indexing

![ArangoDB Multi-Model Integration](docs/diagrams/arango-multimodel.svg)

## Advanced Entity Resolution Workflow

![Entity Resolution Workflow](docs/diagrams/workflow.svg)

### **Complete Pipeline Flow**

The entity resolution process follows a comprehensive multi-stage pipeline combining traditional and AI-powered techniques:

**Stage 1: Data Ingestion & Preparation**
- Multiple data sources (CRM, Marketing, Sales, Support, Documents) -> Validate & Normalize -> Feature extraction

**Stage 2: Record Blocking (Foundational Filtering)**
- Full-text search with ArangoSearch -> Generate candidate pairs -> 99%+ reduction in comparisons
- Strategies: Exact matching, Phonetic (Soundex), N-gram, Sorted neighborhood

**Stage 3: Traditional Similarity Computation**
- Candidate pairs -> Field-level similarity (Jaro-Winkler, Levenshtein, Jaccard) -> Fellegi-Sunter scoring

**Stage 4: Graph Algorithm Analysis**
- Build entity graph -> Weakly Connected Components -> Identify alias networks through shared identifiers (phone, email, address)

**Stage 5: Embedding-Based Similarity**
- Generate GraphML embeddings (node + edge features) -> Vector search -> Find semantically similar entities

**Stage 6: Document Entity Extraction (GraphRAG)**
- LLM-based entity extraction from documents -> Generate embeddings -> Link to existing entities via semantic similarity

**Stage 7: Geospatial-Temporal Validation**
- Location-time analysis -> Validate or reject matches based on spatial-temporal feasibility
- Confirm: Same place, same time -> Reject: Proven to be in different locations

**Stage 8: LLM-Based Curation**
- Aggregate evidence from all techniques -> LLM evaluation -> Final entity resolution decisions
- Human-like reasoning over similarity scores, graph connections, embeddings, and spatial-temporal data

**Stage 9: Golden Record Generation**
- Resolved entity clusters -> Data fusion -> Master record creation with complete lineage

> See the [Entity Resolution Workflow diagram](#entity-resolution-workflow) above for the traditional pipeline. Advanced stages (GraphML, GraphRAG, Geospatial, LLM curation) will be added in future diagram updates.

### **Detailed Workflow Stages**

The entity resolution process consists of five detailed stages (see [workflow diagram](#entity-resolution-workflow) above):

**Stage 1: Data Ingestion & Preprocessing**
- Raw data sources -> Data quality assessment -> Schema normalization -> ArangoDB document store

**Stage 2: Record Blocking (Candidate Generation)**
- Full dataset -> Multiple blocking strategies (Exact, Phonetic, N-gram) -> Candidate pairs (99% reduction)

**Stage 3: Similarity Computation & Classification**
- Candidate pairs -> Field-level similarity -> Probabilistic scoring (Fellegi-Sunter) -> Decision classification

**Stage 4: Graph-Based Clustering**
- Similarity graph -> Graph construction -> Connected components (WCC) -> Cluster validation

**Stage 5: Golden Record Generation**
- Entity clusters -> Source prioritization -> Conflict resolution -> Master record creation

### **Performance & Scalability**

The system demonstrates exceptional scalability through record blocking (see [workflow diagram](#entity-resolution-workflow) for the complete pipeline):

**Scale Analysis:**
- **10K Records**: Naive 50M pairs -> Blocked 500K pairs -> 2 seconds
- **100K Records**: Naive 5B pairs -> Blocked 5M pairs -> 20 seconds 
- **1M Records**: Naive 500B pairs -> Blocked 50M pairs -> 3 minutes
- **10M Records**: Naive 50T pairs -> Blocked 500M pairs -> 30 minutes

**Key Performance Metrics:**
- **99%+ pair reduction** through intelligent blocking strategies
- **Linear scalability** with record blocking optimization
- **Sub-second response** for real-time applications
- **Horizontal scaling** with ArangoDB cluster coordination

### **Competitive Advantages**

#### **vs. Traditional Graph Databases**
- **Neo4j/Neptune**: Require external search systems (Elasticsearch, Solr) for text-based blocking
- **ArangoDB**: Native full-text search with custom analyzers eliminates external dependencies
- **Result**: 50% reduction in infrastructure complexity and maintenance overhead

#### **vs. Relational Databases** 
- **PostgreSQL/MySQL**: Limited graph capabilities, complex JOIN operations for clustering
- **ArangoDB**: Native graph algorithms (WCC, shortest paths) with superior performance
- **Result**: 10x faster clustering operations and natural relationship modeling

#### **vs. Search-Only Solutions**
- **Elasticsearch/Solr**: Excellent for blocking but limited analytical capabilities
- **ArangoDB**: Combines search excellence with graph analytics and ACID transactions
- **Result**: Complete pipeline in single system with data consistency guarantees

#### **Future-Ready Architecture**
- **AI Integration**: Ready for graph embeddings, vector search, and LLM integration
- **Multi-Modal**: Document storage, graph relationships, and search in unified queries
- **Scalability**: Horizontal scaling with cluster coordination and sharding
- **Performance**: In-memory caching with persistent storage for optimal speed

## Project Structure

The project is organized into logical modules for maintainability and scalability:

**Core Implementation (`src/`)**:
- `entity_resolution/` - Main package with core services, data management, and utilities
- `core/` - Entity resolver orchestration and pipeline coordination
- `services/` - Blocking, similarity, clustering, and golden record services
- `data/` - Data management, ingestion, and validation
- `utils/` - Configuration, logging, database utilities, and constants

**Demo & Presentation (`demo/`)**:
- `scripts/` - Interactive and automated demo scripts
- `data/` - Demo datasets and industry scenarios
- `templates/` - Presentation templates and dashboards
- `PRESENTATION_SCRIPT.md` - Complete presentation guide

**High-Performance Services (`foxx-services/`)**:
- `entity-resolution/` - ArangoDB Foxx microservices for native performance

**Documentation (`docs/`)**:
- `PRD.md` - Product Requirements Document
- `TESTING.md` - Comprehensive testing guide (setup, strategies, automation)
- `diagrams/` - Mermaid diagrams for architecture and workflows

**Research & Utilities**:
- `research/` - Academic papers and research materials
- `scripts/` - Database management, testing, and deployment tools
- `foxx/` - Foxx deployment automation
- `benchmarks/` - Performance testing tools
- `examples/` - Usage examples and integration demos
- `tests/` - Test framework and validation
- `config/` - Configuration files and templates
- `docker-compose.yml` - ArangoDB container configuration

## Key Features

### **[IMPLEMENTED] Foundation: Traditional Entity Resolution**
- **Data Management**: Import and manage customer data from multiple sources
- **Record Blocking**: Multi-strategy blocking (exact, n-gram, phonetic) with 99%+ efficiency
- **Bulk Processing**: 3-5x faster for large datasets (50K+ records) using set-based AQL operations
- **Similarity Matching**: Fellegi-Sunter probabilistic framework with configurable metrics
- **Graph-Based Clustering**: Weakly Connected Components for entity grouping
- **Golden Record Generation**: Automated master record creation with conflict resolution
- **Data Quality Scoring**: Comprehensive validation and quality assessment

### **[IMPLEMENTED] Core Infrastructure** 
- **ArangoSearch Integration**: Native full-text search for blocking operations
- **Graph Algorithms**: Built-in WCC and relationship discovery (Python DFS and AQL options)
- **Foxx Microservices**: High-performance ArangoDB-native services
- **Batch & Bulk Processing**: Dual-mode architecture (real-time + batch optimization)
- **Configuration Management**: Environment-based settings with validation
- **YAML/JSON Configuration**: Configuration-driven ER pipelines (v3.0)
- **Performance Optimization**: 1,000+ records/second processing capability

### **[IMPLEMENTED] v3.0 General-Purpose Components**
- **WeightedFieldSimilarity**: Standalone similarity computation component
- **AddressERService**: Complete address deduplication pipeline
- **Python DFS Clustering**: Reliable WCC clustering option
- **ERPipelineConfig**: YAML/JSON configuration system
- **ConfigurableERPipeline**: Run ER from configuration files

### **[ROADMAP] Advanced AI/ML Capabilities**

**Graph Embeddings & Vector Search**
- **GraphML Integration**: Generate node and edge embeddings from entity graphs
- **Behavioral Embeddings**: Capture entity behavior patterns in vector space
- **Vector Similarity Search**: ArangoSearch vector capabilities for semantic matching
- **Approximate Nearest Neighbor**: Fast embedding-based similarity queries

**GraphRAG & LLM Integration**
- **Document Entity Extraction**: Use LLMs to extract entities from unstructured text
- **Semantic Entity Linking**: Connect extracted entities via embedding similarity
- **Knowledge Graph Construction**: Build comprehensive entity knowledge graphs
- **LLM-Powered Curation**: Automated evaluation of entity match evidence

**Geospatial-Temporal Analysis**
- **Location-Time Validation**: Verify entity co-location for match confirmation
- **Spatial Impossibility Detection**: Reject matches for entities proven to be in different locations
- **Movement Pattern Analysis**: Track entity trajectories for behavior-based matching
- **Temporal Consistency Checks**: Ensure entity timelines are logically consistent

**Advanced Alias Detection**
- **Shared Identifier Networks**: Graph analysis to find entities sharing phone/email/address
- **Transitive Alias Resolution**: Multi-hop alias discovery through graph traversal
- **Confidence Scoring**: Probabilistic scoring of alias relationships
- **Network Visualization**: Interactive exploration of entity alias networks

### **[IMPLEMENTED] Demo & Presentation System**
- **Interactive Presentations**: Step-by-step demos with manual pace control
- **Business Impact Calculator**: ROI and cost-benefit analysis tools
- **Database Inspector**: Real-time visualization of entity resolution process
- **Multiple Demo Modes**: Presentation, automated, and quick demo options
- **Industry Scenarios**: Pre-built examples for healthcare, finance, retail, B2B

## Technology Stack

### **Core Platform**
- **Database**: ArangoDB 3.12+ (multi-model: document + graph + search)
- **Language**: Python 3.8+ (with comprehensive type hints)
- **Driver**: python-arango 8.0.0 (full ArangoDB 3.12 compatibility)
- **Microservices**: ArangoDB Foxx Services (JavaScript/V8)

### **Infrastructure & Deployment**
- **Containerization**: Docker & Docker Compose
- **Configuration**: Environment-based with validation
- **Logging**: Structured logging with multiple output formats
- **Monitoring**: Performance metrics and health checks

### **Algorithms & AI**

**Traditional Techniques (Implemented)**
- **Similarity**: Fellegi-Sunter probabilistic framework
- **Blocking**: Multi-strategy (exact, n-gram, phonetic, sorted neighborhood)
- **Clustering**: Graph-based Weakly Connected Components
- **Search**: ArangoSearch with custom analyzers (Soundex, n-gram)
- **Quality**: Data quality scoring and validation frameworks

**Advanced AI/ML (Roadmap)**
- **Embeddings**: GraphML for node/edge embeddings, behavioral pattern vectors
- **Vector Search**: ArangoSearch vector similarity, ANN (Approximate Nearest Neighbor)
- **LLM Integration**: Entity extraction, semantic linking, automated curation
- **GraphRAG**: Document understanding with knowledge graph construction
- **Geospatial**: GeoJSON support, spatial-temporal validation
- **Deep Learning**: Graph Neural Networks for entity matching

### **Development & Testing**
- **Architecture**: Modular service-oriented design
- **Testing**: Comprehensive test framework with validation
- **Documentation**: API documentation and presentation materials
- **Code Quality**: Centralized configuration, no duplication, type safety

## Installation

### As a Library (Recommended)
You can install the system directly from the repository using `pip`:

```bash
# Basic installation
pip install arango-entity-resolution

# With ML features (for vector search)
pip install "arango-entity-resolution[ml]"

# For development
pip install -e ".[dev,test]"
```

## Getting Started

### Quick Start with CLI
After installation, you can run a complete entity resolution pipeline from the command line:

```bash
arango-er run --config config/er_config.example.yaml
```

To launch the interactive demo:
```bash
arango-er-demo
```

### Quick Setup for Testing (Legacy/Manual)

## Performance & Scalability

### Bulk Processing for Large Datasets

The system offers **two processing modes** optimized for different use cases:

**Batch Processing (Real-Time)** - For interactive applications and incremental matching
- Best for: < 10K records, real-time duplicate detection, new records
- Performance: Sub-second response times
- API: Foxx `/blocking/candidates`

**Bulk Processing (Batch)** - For offline jobs and large-scale deduplication
- Best for: > 50K records, nightly jobs, full dataset resolution
- Performance: **3-5x faster** than batch mode
- API: Python `BulkBlockingService` or Foxx `/bulk/all-pairs`

### Real-World Performance

| Dataset Size | Batch Mode | Bulk Mode | Speedup |
|--------------|------------|-----------|---------|
| 10K records | 12 seconds | 2.5 seconds | 4.8x |
| 100K records | 2 minutes | 30 seconds | 4x |
| 331K records | 6.6 minutes | 2 minutes | **3.3x** |
| 1M records (projected) | 20 minutes | 5 minutes | 4x |

**Key Advantage:** Bulk processing uses set-based AQL operations that process entire collections in single queries, eliminating network overhead (1 API call vs 3,000+ calls).

**Quick Start:**
```python
from entity_resolution.services.bulk_blocking_service import BulkBlockingService

service = BulkBlockingService()
service.connect()

# Process entire collection in ~2 minutes (331K records)
result = service.generate_all_pairs(
collection_name="customers",
strategies=["exact", "ngram"],
limit=0 # No limit, process all
)

print(f"Found {result['statistics']['total_pairs']:,} pairs")
# Output: Found 45,000 pairs in 120 seconds (3.3x faster!)
```

For complete details, see [Batch vs Bulk Processing Guide](docs/BATCH_VS_BULK_PROCESSING.md).

## API Documentation

The system provides comprehensive APIs for integration into your applications:

### REST API (Foxx Services)
High-performance ArangoDB-native REST endpoints for production use:
- **[API Quick Start](docs/API_QUICKSTART.md)** - Get started in 5 minutes
- **[API Reference](docs/API_REFERENCE.md)** - Complete endpoint documentation
- **[OpenAPI Specification](docs/openapi.yaml)** - REST API schema for code generation

### Python API
Complete SDK for Python applications:
- **[Python API Guide](docs/API_PYTHON.md)** - Detailed SDK reference with examples
- **[API Examples](docs/API_EXAMPLES.md)** - Practical usage examples and integration patterns

### Key Features
- **Dual Interface**: REST API for web integration, Python SDK for application development
- **Production Ready**: Authentication, error handling, batch operations, performance optimization
- **Well Documented**: Complete reference docs, usage examples, integration guides
- **Industry Examples**: Healthcare, finance, e-commerce, and B2B use cases

```python
# Python Example
from entity_resolution.core.entity_resolver import EntityResolutionPipeline

pipeline = EntityResolutionPipeline()
pipeline.connect()
pipeline.load_data("customers.csv", "customers")
results = pipeline.run_complete_pipeline(collection_name="customers")
print(f"Found {results['clustering']['total_clusters']} entity clusters")
```

```bash
# REST API Example
curl -u root:password -X POST \
http://localhost:8529/_db/entity_resolution/entity-resolution/blocking/candidates \
-H "Content-Type: application/json" \
-d '{"collection": "customers", "targetDocId": "customers/12345"}'
```

## System Demonstrations

This project includes a comprehensive demonstration system designed for both technical evaluation and business presentations.

> **Available Datasets**: See [docs/AVAILABLE_DATASETS.md](docs/AVAILABLE_DATASETS.md) for complete information about implemented datasets, test scenarios, and demo execution instructions.

### **Interactive Presentation Demo**

Perfect for live demonstrations to stakeholders, customers, or technical teams:

```bash
# Launch the demo system
python demo/launch_presentation_demo.py

# Choose option 1: Interactive Presentation Demo
```

**Features:**
- **Manual pace control** - pause at each step to explain concepts
- **Clear problem explanation** - show duplicate customer examples
- **Real-time AI processing** - watch similarity analysis and clustering
- **Business impact calculator** - ROI projections for different company sizes
- **Before/after comparisons** - visualize data transformation

**Duration:** 45-60 minutes (fully customizable)

### **Database Inspector**

Show actual database contents during presentations:

```bash
# Launch database inspector
python demo/scripts/database_inspector.py
```

**Capabilities:**
- View raw customer data with duplicates highlighted
- Show similarity analysis results in real-time
- Display entity clusters as they form
- Compare before/after database states
- Export data for offline analysis

### **Quick Demo**

Fast-paced demonstration for time-constrained presentations:

```bash
# Auto-advancing demo (15-20 minutes)
python demo/launch_presentation_demo.py
# Choose option 3: Quick Demo
```

### **Business Impact Examples**

The demo includes real business impact calculations:

| Company Size | Duplicate Cost | Annual Savings | ROI | Payback |
|--------------|----------------|----------------|-----|---------|
| Small (10K customers) | $67,000 | $67,000 | 312% | 9 months |
| Medium (50K customers) | $187,500 | $187,500 | 445% | 6 months |
| Enterprise (500K customers) | $675,000 | $675,000 | 782% | 3 months |

### **Industry Scenarios**

Pre-built demonstration scenarios for different industries:

- **Healthcare**: Patient record deduplication with strict matching requirements
- **Financial**: Customer KYC compliance and fraud detection
- **Retail**: Customer 360 view for personalized marketing
- **B2B Sales**: Account deduplication and relationship mapping

### **Presentation Script**

Comprehensive presentation guide available at `demo/PRESENTATION_SCRIPT.md`:
- 3-act demo structure (Problem -> Solution -> Business Value)
- Talking points for each section
- Audience interaction guidelines
- Q&A preparation with common questions
- Technical deep-dive options

### **Demo Usage Examples**

```bash
# Environment check (verify all components work)
python demo/launch_presentation_demo.py
# Option 6: Environment Check

# Interactive presentation with full control
python demo/scripts/interactive_presentation_demo.py

# Database inspection during demo
python demo/scripts/database_inspector.py

# Automated demo for testing
python demo/scripts/demo_orchestrator.py --auto --records 1000
```

## Implementation Status

### **[IMPLEMENTED] Production-Ready Components**

**Core Entity Resolution Pipeline** - Fully Implemented
- [DONE] **Data Management**: Complete ingestion and validation system
- [DONE] **Record Blocking**: Multi-strategy blocking with ArangoSearch (99%+ efficiency)
- [DONE] **Similarity Computation**: Fellegi-Sunter probabilistic framework
- [DONE] **Entity Clustering**: Graph-based Weakly Connected Components
- [DONE] **Golden Record Generation**: Automated master record creation
- [DONE] **Quality Scoring**: Comprehensive data quality assessment

**Infrastructure & Architecture** - Production Ready
- [DONE] **Database Layer**: Centralized connection management, no code duplication
- [DONE] **Configuration System**: Environment-based settings with validation
- [DONE] **Service Architecture**: Modular design with standardized interfaces 
- [DONE] **Error Handling**: Consistent error patterns and logging
- [DONE] **Performance**: 1,000+ records/second processing capability

**Demonstration System** - Complete
- [DONE] **Interactive Presentations**: Manual-paced demos for stakeholders
- [DONE] **Database Inspector**: Real-time process visualization
- [DONE] **Business Impact Tools**: ROI calculators and industry scenarios
- [DONE] **Multiple Demo Modes**: Presentation, automated, and quick options

**Python-Based Architecture** - v2.0
- [DONE] **Strategy Pattern**: Flexible blocking strategies (COLLECT, BM25)
- [DONE] **Batch Processing**: Optimized similarity computation (100K+ pairs/sec)
- [DONE] **Performance Benchmarking**: Comprehensive testing and validation

### **Current Capabilities**

| Component | Status | Performance | Notes |
|-----------|---------|-------------|--------|
| Data Ingestion | [PRODUCTION] | 10K+ records/min | Multiple source support |
| Record Blocking | [PRODUCTION] | 99%+ reduction | ArangoSearch integration |
| Similarity Matching | [PRODUCTION] | 1K+ pairs/sec | Fellegi-Sunter framework |
| Entity Clustering | [PRODUCTION] | Sub-second | Graph algorithms |
| Golden Records | [PRODUCTION] | Real-time | Conflict resolution |
| Demo System | [COMPLETE] | Interactive | Business presentations |

### **Ready for Production**

The system is fully operational and ready for real-world entity resolution challenges:
- **Scalability**: Handles millions of records efficiently
- **Accuracy**: 99.5% precision, 98% recall in testing
- **Performance**: 1,000+ records/second processing
- **Reliability**: Comprehensive error handling and validation
- **Maintainability**: Clean architecture with centralized components

## Documentation

Complete documentation is available in the [`docs/`](docs/) directory:

### Quick Links
- **[Documentation Index](docs/README.md)** - Complete documentation navigation
- **[Quick Start Guide](docs/guides/QUICK_START.md)** - Get started in 5 minutes
- **[API Reference](docs/api/API_REFERENCE.md)** - Complete API documentation
- **[Migration Guide v3.0](docs/guides/MIGRATION_GUIDE_V3.md)** - Upgrade from v1.x or v2.x

### By Category
- **Guides**: Migration, custom collections, testing
- **Architecture**: System design, graph algorithms, Foxx services
- **Development**: Enhancement plans, project evolution
- **API**: Complete reference with examples
- **Research**: Academic papers and research notes

See the **[Documentation Index](docs/README.md)** for the complete catalog.

## Research Foundation

This project is built upon extensive academic research in entity resolution, spanning traditional techniques to cutting-edge AI/ML approaches.

### **Current Research Base**
See the [research](research/) directory for papers and notes on:
- **Record Blocking**: Papadakis et al. surveys on blocking and filtering techniques
- **Probabilistic Matching**: Fellegi-Sunter framework for record linkage
- **Entity Matching Systems**: Magellan and other end-to-end systems

### **Planned Research Integration**
The following areas will be documented with relevant academic papers:

**Graph Embeddings & Network Analysis**
- Graph embedding techniques (Node2Vec, GraphSAGE, etc.)
- Community detection algorithms for entity clustering
- Network-based entity resolution approaches

**Vector Search & Semantic Similarity**
- Approximate Nearest Neighbor (ANN) algorithms
- Embedding-based entity matching
- Multi-modal embedding approaches

**LLM & GraphRAG**
- Large Language Models for entity extraction
- Retrieval-Augmented Generation (RAG) for entity resolution
- Graph-enhanced RAG (GraphRAG) architectures
- Prompt engineering for entity matching decisions

**Geospatial-Temporal Analysis**
- Spatial-temporal data mining for entity resolution
- Location verification and validation techniques
- Movement pattern analysis for entity tracking

**Hybrid & Ensemble Methods**
- Combining multiple ER techniques
- Confidence aggregation across methods
- Multi-criteria decision making for entity matching

> Note: As new academic papers are identified and reviewed, this section will be expanded with detailed notes and implementation insights.

## Contributing

Please ensure any contributions align with the project requirements outlined in the [PRD](docs/PRD.md) and follow the established coding standards:

### Code Standards
- **Python 3.8+** with type hints
- **DRY Principles**: Use shared utilities in `scripts/common/`
- **Error Handling**: Consistent messaging patterns
- **Documentation**: Comprehensive docstrings and comments
- **Environment**: Use environment variables for configuration

### Development Workflow
1. **Install Git Hooks** - Set up pre-commit checks (see [Git Hooks Guide](docs/development/GIT_HOOKS.md))
```bash
./scripts/setup-git-hooks.sh
```
2. Review the [Testing Guide](docs/guides/TESTING_GUIDE.md)
3. Check the [CHANGELOG](CHANGELOG.md) for recent changes
4. Follow the established patterns in existing scripts
5. Test changes with the Docker environment
6. Update documentation if needed

### Quality Assurance

**Pre-Commit Checks** - Automated validation before each commit:
- Python syntax validation in core modules
- No hardcoded credentials (passwords must use environment variables)
- ASCII-only code (no emoji characters)
- Critical module import verification

The pre-commit hook runs automatically (~5 seconds) and prevents commits if issues are found. See [docs/development/GIT_HOOKS.md](docs/development/GIT_HOOKS.md) for details.

### Getting Help
- **Documentation**: Start with the [Documentation Index](docs/README.md)
- **Issues**: Use GitHub Issues for bugs and feature requests
- **Research**: Check `research/` directory for academic background
