Metadata-Version: 2.4
Name: ddn-metadata-bootstrap
Version: 1.0.14
Summary: AI-powered metadata enhancement for Hasura DDN schema files
Author-email: Kenneth Stott <kenneth@hasura.io>
Maintainer-email: Kenneth Stott <kenneth@hasura.io>
License: MIT
Project-URL: Homepage, https://github.com/hasura/ddn-metadata-bootstrap
Project-URL: Documentation, https://github.com/hasura/ddn-metadata-bootstrap#readme
Project-URL: Repository, https://github.com/hasura/ddn-metadata-bootstrap.git
Project-URL: Bug Reports, https://github.com/hasura/ddn-metadata-bootstrap/issues
Project-URL: Changelog, https://github.com/hasura/ddn-metadata-bootstrap/blob/main/CHANGELOG.md
Keywords: hasura,ddn,graphql,schema,metadata,ai,anthropic,descriptions,relationships
Classifier: Development Status :: 4 - Beta
Classifier: Intended Audience :: Developers
Classifier: License :: OSI Approved :: MIT License
Classifier: Operating System :: OS Independent
Classifier: Programming Language :: Python :: 3
Classifier: Programming Language :: Python :: 3.8
Classifier: Programming Language :: Python :: 3.9
Classifier: Programming Language :: Python :: 3.10
Classifier: Programming Language :: Python :: 3.11
Classifier: Programming Language :: Python :: 3.12
Classifier: Topic :: Software Development :: Libraries :: Python Modules
Classifier: Topic :: Database
Classifier: Topic :: Internet :: WWW/HTTP :: Dynamic Content
Classifier: Topic :: Software Development :: Code Generators
Requires-Python: >=3.8
Description-Content-Type: text/markdown
License-File: LICENSE
Requires-Dist: anthropic>=0.3.0
Requires-Dist: PyYAML>=6.0
Requires-Dist: python-dotenv>=0.19.0
Requires-Dist: inflection>=0.5.1
Requires-Dist: nltk>=3.9.1
Requires-Dist: openai>=1.97.0
Requires-Dist: google-generativeai>=0.8.5
Provides-Extra: dev
Requires-Dist: pytest>=7.0.0; extra == "dev"
Requires-Dist: pytest-cov>=4.0.0; extra == "dev"
Requires-Dist: black>=22.0.0; extra == "dev"
Requires-Dist: flake8>=5.0.0; extra == "dev"
Requires-Dist: mypy>=1.0.0; extra == "dev"
Requires-Dist: twine>=4.0.0; extra == "dev"
Requires-Dist: build>=0.10.0; extra == "dev"
Requires-Dist: bump2version>=1.0.0; extra == "dev"
Provides-Extra: test
Requires-Dist: pytest>=7.0.0; extra == "test"
Requires-Dist: pytest-cov>=4.0.0; extra == "test"
Dynamic: license-file

# DDN Metadata Bootstrap

[![PyPI version](https://badge.fury.io/py/ddn-metadata-bootstrap.svg)](https://badge.fury.io/py/ddn-metadata-bootstrap)
[![Python versions](https://img.shields.io/pypi/pyversions/ddn-metadata-bootstrap.svg)](https://pypi.org/project/ddn-metadata-bootstrap/)
[![License: MIT](https://img.shields.io/badge/License-MIT-yellow.svg)](https://opensource.org/licenses/MIT)

AI-powered metadata enhancement for Hasura DDN (Data Delivery Network) schema files. Automatically generate high-quality descriptions and detect sophisticated relationships in your YAML/HML schema definitions using advanced AI with comprehensive configuration management.

## 🚀 Features

### 🤖 **Multi-Provider AI Support**
- **Anthropic Claude**: Default provider with claude-3-haiku, claude-3-sonnet, and claude-3-opus models
- **OpenAI GPT**: Support for gpt-3.5-turbo, gpt-4, gpt-4o-mini, and latest models
- **Google Gemini**: Support for gemini-pro, gemini-1.5-pro, and gemini-1.5-flash models
- **Automatic Fallback**: Graceful degradation between providers with configurable priorities
- **Provider-Specific Optimization**: Model-specific prompting and parameter tuning

### 🧠 **Advanced AI Generation**
- **Quality Assessment with Retry Logic**: Multi-attempt generation with configurable scoring thresholds
- **Context-Aware Business Descriptions**: Domain-specific system prompts with industry context
- **Smart Field Analysis**: Automatically detects and skips self-explanatory, generic, or cryptic fields
- **Configurable Length Controls**: Precise control over description length and token usage

### 🧠 **Intelligent Caching System** 
- **Similarity-Based Matching**: Reuses descriptions for similar fields across entities (85% similarity threshold)
- **Performance Optimization**: Reduces API calls by up to 70% on large schemas through intelligent caching
- **Cache Statistics**: Real-time performance monitoring with hit rates and API cost savings tracking
- **Type-Aware Matching**: Considers field types and entity context for better cache accuracy

### 🔍 **WordNet-Based Linguistic Analysis**
- **Generic Term Detection**: Uses NLTK and WordNet for sophisticated term analysis to skip meaningless fields
- **Semantic Density Analysis**: Evaluates conceptual richness and specificity of field names
- **Definition Quality Scoring**: Ensures meaningful, non-circular descriptions through linguistic validation
- **Abstraction Level Calculation**: Determines appropriate description depth based on semantic analysis

### 📝 **Enhanced Acronym Expansion**
- **Comprehensive Mappings**: 200+ pre-configured acronyms for technology, finance, and business domains
- **Context-Aware Expansion**: Industry-specific acronym interpretation based on domain context
- **Pre-Generation Enhancement**: Expands acronyms BEFORE AI generation for better context
- **Custom Domain Support**: Fully configurable acronym mappings via YAML configuration

### 🔗 **Advanced Relationship Detection**
- **Template-Based FK Detection**: Sophisticated foreign key detection with confidence scoring and semantic validation
- **Shared Business Key Relationships**: Many-to-many relationships via shared field analysis with FK-aware filtering
- **Cross-Subgraph Intelligence**: Smart entity matching across different subgraphs
- **Configurable Templates**: Flexible FK template patterns with placeholders for complex naming conventions
- **Advanced Relationship Blocking**: Precision rule-based system to prevent inappropriate cross-connector relationships

### ⚙️ **Comprehensive Configuration System**
- **YAML-First Configuration**: Central `config.yaml` file for all settings with full documentation
- **Waterfall Precedence**: CLI args > Environment variables > config.yaml > defaults
- **Configuration Validation**: Comprehensive validation with helpful error messages and source tracking
- **Feature Toggles**: Granular control over processing features (descriptions vs relationships)

### 🎯 **Advanced Quality Controls**
- **Buzzword Detection**: Avoids corporate jargon and meaningless generic terms
- **Pattern-Based Filtering**: Regex-based rejection of poor description formats
- **Technical Language Translation**: Converts technical terms to business-friendly language
- **Length Optimization**: Multiple validation layers with hard limits and target lengths

### 🔍 **Intelligent Field Selection**
- **Generic Field Detection**: Skips overly common fields that don't benefit from descriptions
- **Cryptic Abbreviation Handling**: Configurable handling of unclear field names with vowel analysis
- **Self-Explanatory Pattern Recognition**: Automatically identifies fields that don't need descriptions
- **Value Assessment**: Only generates descriptions that add meaningful business value

## 📦 Installation

### From PyPI (Recommended)

```bash
pip install ddn-metadata-bootstrap
```

### Provider-Specific Dependencies

The tool supports multiple AI providers. Install the dependencies for your chosen provider:

```bash
# For Anthropic Claude (default)
pip install ddn-metadata-bootstrap[anthropic]
# or separately:
pip install anthropic

# For OpenAI GPT  
pip install ddn-metadata-bootstrap[openai]
# or separately:
pip install openai

# For Google Gemini
pip install ddn-metadata-bootstrap[gemini]
# or separately: 
pip install google-generativeai

# Install all providers
pip install ddn-metadata-bootstrap[all]
```

### From Source

```bash
git clone https://github.com/hasura/ddn-metadata-bootstrap.git
cd ddn-metadata-bootstrap
pip install -e .
```

## 🏃 Quick Start

### 1. Choose Your AI Provider

#### Option A: Anthropic Claude (Default - Recommended)
```bash
export ANTHROPIC_API_KEY="your-anthropic-api-key"
export METADATA_BOOTSTRAP_AI_PROVIDER="anthropic"  # Optional (default)
export METADATA_BOOTSTRAP_ANTHROPIC_MODEL="claude-3-haiku-20240307"  # Optional
```

#### Option B: OpenAI GPT
```bash
export OPENAI_API_KEY="your-openai-api-key"  
export METADATA_BOOTSTRAP_AI_PROVIDER="openai"
export METADATA_BOOTSTRAP_OPENAI_MODEL="gpt-3.5-turbo"  # Optional
```

#### Option C: Google Gemini
```bash
export GEMINI_API_KEY="your-gemini-api-key"
# or alternatively:
export GOOGLE_API_KEY="your-gemini-api-key"
export METADATA_BOOTSTRAP_AI_PROVIDER="gemini"
export METADATA_BOOTSTRAP_GEMINI_MODEL="gemini-pro"  # Optional
```

### 2. Set up your directories

```bash
export METADATA_BOOTSTRAP_INPUT_DIR="./app/metadata"
export METADATA_BOOTSTRAP_OUTPUT_DIR="./enhanced_metadata"
```

### 3. Create a configuration file (Recommended)

Create a `config.yaml` file in your project directory:

```yaml
# config.yaml - DDN Metadata Bootstrap Configuration

# =============================================================================
# AI PROVIDER CONFIGURATION
# =============================================================================
ai_provider: "anthropic"  # Choose: anthropic, openai, gemini

# Provider-specific API keys (alternatively set via environment variables)
# anthropic_api_key: "your-anthropic-key"
# openai_api_key: "your-openai-key" 
# gemini_api_key: "your-gemini-key"

# Provider-specific models
anthropic_model: "claude-3-haiku-20240307"  # claude-3-sonnet-20240229, claude-3-opus-20240229
openai_model: "gpt-3.5-turbo"               # gpt-4, gpt-4o-mini, gpt-4-turbo-preview
gemini_model: "gemini-pro"                  # gemini-1.5-pro-latest, gemini-1.5-flash

# =============================================================================
# FEATURE CONTROL
# =============================================================================
relationships_only: false          # Set to true to only generate relationships, skip descriptions
enable_quality_assessment: true    # Enable AI quality scoring and retry logic

# =============================================================================
# AI GENERATION SETTINGS
# =============================================================================
# Domain-specific system prompt for your organization
system_prompt: |
  You generate concise field descriptions for database schema metadata at a global financial services firm.
  
  DOMAIN CONTEXT:
  - Organization: Global bank
  - Department: Cybersecurity operations  
  - Use case: Risk management and security compliance
  - Regulatory environment: Financial services (SOX, Basel III, GDPR, etc.)
  
  Think: "What would a cybersecurity analyst at a bank need to know about this field?"

# Token and length limits
field_tokens: 25                    # Max tokens AI can generate for field descriptions
kind_tokens: 50                     # Max tokens AI can generate for kind descriptions
field_desc_max_length: 120          # Maximum total characters for field descriptions
kind_desc_max_length: 250           # Maximum total characters for entity descriptions

# Quality thresholds
minimum_description_score: 70       # Minimum score (0-100) to accept a description
max_description_retry_attempts: 3   # How many times to retry for better quality

# =============================================================================
# ENHANCED ACRONYM EXPANSION
# =============================================================================
acronym_mappings:
  # Technology & Computing
  api: "Application Programming Interface"
  ui: "User Interface"
  db: "Database"
  
  # Security & Access Management
  mfa: "Multi-Factor Authentication"
  sso: "Single Sign-On"
  iam: "Identity and Access Management"
  siem: "Security Information and Event Management"
  
  # Financial Services & Compliance
  pci: "Payment Card Industry"
  sox: "Sarbanes-Oxley Act"
  kyc: "Know-Your-Customer"
  aml: "Anti-Money Laundering"
  # ... 200+ total mappings available

# =============================================================================
# INTELLIGENT FIELD SELECTION
# =============================================================================
# Fields to skip entirely - these will not get descriptions at all
skip_field_patterns:
  - "^id$"
  - "^_id$"
  - "^uuid$"
  - "^created_at$"
  - "^updated_at$"
  - "^debug_.*"
  - "^test_.*"
  - "^temp_.*"

# Generic fields - won't get unique descriptions (too common)
generic_fields:
  - "id"
  - "key"
  - "uid"
  - "guid"
  - "name"

# Self-explanatory fields - simple patterns that don't need descriptions
self_explanatory_patterns:
  - '^id$'
  - '^_id$'
  - '^guid$'
  - '^uuid$'
  - '^key$'

# Cryptic Field Handling
skip_cryptic_abbreviations: true   # Skip fields with unclear abbreviations
skip_ultra_short_fields: true      # Skip very short field names that are likely abbreviations
max_cryptic_field_length: 4        # Field names this length or shorter are considered cryptic

# Content quality controls
buzzwords: [
  'synergy', 'leverage', 'paradigm', 'ecosystem',
  'contains', 'stores', 'holds', 'represents'
]

forbidden_patterns: [
  'this\\s+field\\s+represents',
  'used\\s+to\\s+(track|manage|identify)',
  'business.*information'
]

# =============================================================================
# RELATIONSHIP DETECTION
# =============================================================================
# FK Template Patterns for relationship detection
# Format: "{pk_pattern}|{fk_pattern}"
# Placeholders: {gi}=generic_id, {pt}=primary_table, {ps}=primary_subgraph, {pm}=prefix_modifier
fk_templates:
  - "{gi}|{pm}_{pt}_{gi}"           # active_service_name → Services.name
  - "{gi}|{pt}_{gi}"                # user_id → Users.id
  - "{pt}_{gi}|{pm}_{pt}_{gi}"      # user_id → ActiveUsers.active_user_id

# =============================================================================
# ADVANCED RELATIONSHIP BLOCKING
# =============================================================================
# Precision rule-based system to prevent inappropriate relationships
# Uses bidirectional validation with data_connector + entity + field pattern matching
fk_key_blacklist:
  # Block cross-cloud provider connections with infrastructure fields
  - entity_pattern_a:
      data_connector: "^(gcp|arg|various)$"  # Google Cloud Platform, Azure Resource Graph, Various
      entity: "^(gcp_|google_).*"            # Google/GCP entities
      field: ".*"                            # Any field
    entity_pattern_b:
      data_connector: "^(gcp|arg|various)$"  
      entity: "^(az_|azure_).*"              # Azure entities
      field: ".*(resource|project|policy|storage|compute).*"  # Infrastructure fields only
    logic: "and"
    reason: "Block google/gcp entities from connecting to azure entities with infrastructure-related fields"
  
  # Complete isolation between major cloud platforms
  - entity_pattern_a:
      data_connector: "^gcp$"                # Google Cloud Platform connector
      entity: ".*"                           # Any entity
      field: ".*"                            # Any field
    entity_pattern_b:
      data_connector: "^arg$"                # Azure Resource Graph connector  
      entity: ".*"                           # Any entity
      field: ".*"                            # Any field
    logic: "and"
    reason: "Block all connections between Google Cloud Platform and Azure Resource Graph connectors"
  
  # Block legacy systems from modern cloud entities
  - entity_pattern_a:
      data_connector: "^legacy_.*"           # Any legacy connector
      entity: ".*"                           # Any entity
      field: ".*"                            # Any field
    entity_pattern_b:
      data_connector: "^(gcp|azure|aws)_.*"  # Modern cloud connectors
      entity: "^(google_|azure_|aws_).*"     # Modern cloud entities
      field: ".*"                            # Any field
    logic: "and"
    reason: "Prevent legacy systems from directly connecting to modern cloud infrastructure"
  
  # Block test/dev environments from production data
  - entity_pattern_a:
      data_connector: "^(test|dev)_.*"       # Test/dev connectors
      entity: ".*"                           # Any entity
      field: ".*"                            # Any field
    entity_pattern_b:
      data_connector: "^prod_.*"             # Production connectors
      entity: ".*"                           # Any entity  
      field: ".*(sensitive|private|confidential|secret).*"  # Sensitive fields
    logic: "and"
    reason: "Prevent test/dev environments from accessing production sensitive data"

# Shared relationship limits
max_shared_relationships: 10000
max_shared_per_entity: 10
min_shared_confidence: 30
```

### 4. Run the tool with your chosen provider

```bash
# Use default provider (Anthropic)
ddn-metadata-bootstrap

# Use OpenAI explicitly
ddn-metadata-bootstrap --ai-provider openai --openai-api-key your-key

# Use Gemini with specific model
ddn-metadata-bootstrap --ai-provider gemini --gemini-model gemini-1.5-pro

# Show configuration including AI provider setup
ddn-metadata-bootstrap --show-config

# Test your AI provider connection
ddn-metadata-bootstrap --test-provider

# Process only relationships (skip descriptions)
ddn-metadata-bootstrap --relationships-only

# Use custom configuration file
ddn-metadata-bootstrap --config custom-config.yaml

# Enable verbose logging to see AI provider selection and caching
ddn-metadata-bootstrap --verbose
```

## 🔗 Advanced Relationship Blocking System

The tool now includes a sophisticated **bidirectional relationship blocking system** that prevents inappropriate foreign key relationships from being generated. This is particularly important in enterprise environments with multiple data connectors, cloud providers, and security boundaries.

### Key Features

#### **Precision Pattern Matching**
Each blocking rule uses three-part patterns for maximum precision:
- **Data Connector**: Regex pattern matching the connector name (e.g., `^gcp$`, `^(test|dev)_.*`)
- **Entity Name**: Regex pattern matching the entity/table name (e.g., `^google_.*`, `^azure_storage.*`)
- **Field Name**: Regex pattern matching the field name (e.g., `.*resource.*`, `.*secret.*`)

#### **Bidirectional Validation**
Rules automatically check both directions of a relationship:
- **Pattern A → Pattern B**: `google_compute` → `azure_storage_resource`
- **Pattern B → Pattern A**: `azure_vm` → `google_analytics_data`

Both directions are blocked by a single rule definition.

#### **Flexible Logic Operators**
- **AND Logic**: All patterns (connector AND entity AND field) must match for both sides
- **OR Logic**: Either side matching its full pattern triggers the block

### Real-World Examples

#### **Cross-Cloud Security Isolation**
```yaml
# Block Google Cloud from Azure Resource Graph
- entity_pattern_a:
    data_connector: "^gcp$"        # Google Cloud Platform
    entity: ".*"                   # Any GCP entity
    field: ".*"                    # Any field
  entity_pattern_b:
    data_connector: "^arg$"        # Azure Resource Graph  
    entity: ".*"                   # Any Azure entity
    field: ".*"                    # Any field
  logic: "and"
  reason: "Complete isolation between cloud providers for security compliance"
```

#### **Environment Separation**
```yaml
# Block test environments from production sensitive data
- entity_pattern_a:
    data_connector: "^(test|dev)_.*"
    entity: ".*"
    field: ".*"
  entity_pattern_b:
    data_connector: "^prod_.*"
    entity: ".*"
    field: ".*(pii|ssn|credit_card|password).*"
  logic: "and"
  reason: "Prevent test/dev access to production sensitive data"
```

#### **Infrastructure Boundaries**
```yaml
# Block legacy systems from modern cloud infrastructure
- entity_pattern_a:
    data_connector: "^legacy_.*"
    entity: "^(mainframe|cobol)_.*"
    field: ".*"
  entity_pattern_b:
    data_connector: "^(gcp|aws|azure)_.*"
    entity: "^(kubernetes|container|serverless)_.*"
    field: ".*"
  logic: "and"
  reason: "Prevent direct legacy-to-cloud connections without proper integration layer"
```

### Validation Process

When the tool evaluates a potential relationship like `gcp_users.user_id → azure_storage.user_reference`:

1. **Rule Matching**: Check if both entities match any blocking rule patterns
2. **Bidirectional Check**: Test both A→B and B→A directions
3. **Pattern Validation**: Verify connector + entity + field patterns all match
4. **Logic Application**: Apply AND/OR logic to determine if rule triggers
5. **Blocking Decision**: Block relationship if any rule matches

### Configuration Validation

The system includes comprehensive validation:

```bash
# Validate your FK blacklist rules
ddn-metadata-bootstrap --validate-config

# Test specific blocking scenarios
ddn-metadata-bootstrap --test-fk-blocking

# Show compiled regex patterns
ddn-metadata-bootstrap --show-config --verbose
```

### Performance Impact

The blocking system is highly optimized:
- **Regex Compilation**: Patterns compiled once at startup
- **Hash-Based Lookups**: Connector names normalized for fast matching
- **Early Exit**: Rules skip quickly when connector patterns don't match
- **Minimal Overhead**: Adds <1ms per relationship evaluation

### Enterprise Security Benefits

#### **Compliance Support**
- **SOX Compliance**: Prevent cross-environment data access
- **GDPR Compliance**: Block PII data from reaching test systems
- **Industry Regulations**: Enforce data residency and access controls

#### **Cloud Security**
- **Multi-Cloud Isolation**: Prevent accidental cross-cloud data references
- **Vendor Lock-in Prevention**: Maintain clear cloud provider boundaries
- **Zero Trust Architecture**: Explicit relationship approval required

#### **Operational Safety**
- **Environment Isolation**: Prevent production data leaks to development
- **Legacy System Protection**: Control access to deprecated systems
- **Change Management**: Explicit approval required for cross-boundary relationships

## 🤖 AI Provider Comparison

### Performance & Cost Comparison

| Provider | Speed | Cost | Quality | Best For |
|----------|-------|------|---------|----------|
| **Anthropic Claude Haiku** | ⚡⚡⚡ Very Fast | 💰 Low | ⭐⭐⭐⭐ High | Development, High Volume |
| **Anthropic Claude Sonnet** | ⚡⚡ Fast | 💰💰 Medium | ⭐⭐⭐⭐⭐ Excellent | Production, Balanced |
| **Anthropic Claude Opus** | ⚡ Medium | 💰💰💰 High | ⭐⭐⭐⭐⭐ Excellent | Critical Schemas |
| **OpenAI GPT-3.5 Turbo** | ⚡⚡⚡ Very Fast | 💰 Very Low | ⭐⭐⭐ Good | Development, Budget |
| **OpenAI GPT-4o Mini** | ⚡⚡⚡ Very Fast | 💰 Low | ⭐⭐⭐⭐ High | Production, Cost-Optimized |
| **OpenAI GPT-4** | ⚡⚡ Fast | 💰💰💰 High | ⭐⭐⭐⭐⭐ Excellent | Premium Quality |
| **Google Gemini Pro** | ⚡⚡ Fast | 💰 Very Low | ⭐⭐⭐⭐ High | Large Scale, Budget |
| **Google Gemini 1.5 Flash** | ⚡⚡⚡ Very Fast | 💰 Low | ⭐⭐⭐ Good | High Throughput |

### Provider-Specific Configuration Examples

#### Anthropic Claude (Recommended)
```yaml
ai_provider: "anthropic"
anthropic_model: "claude-3-haiku-20240307"  # Fast & cost-effective
# anthropic_model: "claude-3-sonnet-20240229"  # Balanced
# anthropic_model: "claude-3-opus-20240229"    # Highest quality

# Anthropic-optimized settings
field_tokens: 30
system_prompt: |
  Generate concise, business-focused field descriptions.
  Focus on practical utility and clear business meaning.
```

#### OpenAI GPT (Cost-Optimized)
```yaml
ai_provider: "openai"
openai_model: "gpt-4o-mini"  # Best balance of cost and quality
# openai_model: "gpt-3.5-turbo"     # Most cost-effective
# openai_model: "gpt-4-turbo-preview"  # Highest quality

# OpenAI-optimized settings
field_tokens: 25
system_prompt: |
  You are a technical writer creating database field descriptions.
  Be concise, specific, and business-focused.
```

#### Google Gemini (High Volume)
```yaml
ai_provider: "gemini"
gemini_model: "gemini-1.5-flash"  # High throughput
# gemini_model: "gemini-pro"           # Balanced
# gemini_model: "gemini-1.5-pro-latest"  # Highest quality

# Gemini-optimized settings
field_tokens: 35
system_prompt: |
  Create clear, professional descriptions for database schema fields.
  Focus on business value and practical understanding.
```

## 📝 Enhanced Examples

### Multi-Provider Description Generation

#### Input Schema (HML)
```yaml
kind: ObjectType
version: v1
definition:
  name: ThreatAssessment
  fields:
    - name: riskId
      type: String!
    - name: mfaEnabled
      type: Boolean!
    - name: ssoConfig
      type: String
    - name: iamPolicy
      type: String
```

#### Output with Different Providers

##### Anthropic Claude (Business-Focused)
```yaml
kind: ObjectType
version: v1
definition:
  name: ThreatAssessment
  description: |
    Security risk evaluation and compliance status tracking for 
    organizational threat management and regulatory oversight.
  fields:
    - name: riskId
      type: String!
      description: Risk assessment identifier for tracking security evaluations.
    - name: mfaEnabled
      type: Boolean!
      description: Multi-Factor Authentication enablement status for security policy compliance.
    - name: ssoConfig
      type: String
      description: Single Sign-On configuration settings for identity management.
    - name: iamPolicy
      type: String
      description: Identity and Access Management policy governing user permissions.
```

##### OpenAI GPT (Technical-Focused)
```yaml
kind: ObjectType
version: v1
definition:
  name: ThreatAssessment
  description: |
    Cybersecurity threat assessment data structure containing risk metrics
    and security configuration parameters for compliance monitoring.
  fields:
    - name: riskId
      type: String!
      description: Unique identifier for security risk assessment records.
    - name: mfaEnabled
      type: Boolean!
      description: Multi-Factor Authentication activation flag for access control.
    - name: ssoConfig
      type: String
      description: Single Sign-On system configuration parameters.
    - name: iamPolicy
      type: String
      description: Identity and Access Management policy specification document.
```

##### Google Gemini (Comprehensive)
```yaml
kind: ObjectType
version: v1
definition:
  name: ThreatAssessment
  description: |
    Comprehensive security threat assessment record containing risk analysis,
    authentication configurations, and access management policies for enterprise security.
  fields:
    - name: riskId
      type: String!
      description: Risk assessment record identifier for security threat tracking.
    - name: mfaEnabled
      type: Boolean!
      description: Multi-Factor Authentication status indicator for enhanced security protocols.
    - name: ssoConfig
      type: String
      description: Single Sign-On integration configuration for unified authentication.
    - name: iamPolicy
      type: String
      description: Identity and Access Management policy definition for authorization control.
```

### Provider Fallback and Testing

```bash
# Test provider connectivity
ddn-metadata-bootstrap --test-provider
# Output:
# 🧪 Testing ANTHROPIC provider connection...
# ✅ ANTHROPIC connection successful
#    Model: claude-3-haiku-20240307
#    Response: Hello

# Test specific provider
ddn-metadata-bootstrap --ai-provider openai --test-provider
# Output:
# 🧪 Testing OPENAI provider connection...
# ✅ OPENAI connection successful
#    Model: gpt-3.5-turbo
#    Response: Hello

# Show detailed provider configuration
ddn-metadata-bootstrap --show-config
# Output:
# 📋 Configuration Sources:
# ai_provider                    = anthropic              [defaults]
# anthropic_api_key              = ***masked***           [env:ANTHROPIC_API_KEY]
# anthropic_model                = claude-3-haiku-20240307 [defaults]
# 
# 🤖 AI Provider Configuration:
#    Provider: anthropic
#    Model: claude-3-haiku-20240307
#    API Key: ***configured***
```

### Performance with Caching Across Providers

```bash
# Provider performance comparison with caching
🔄 Processing with ANTHROPIC (claude-3-haiku-20240307)...
Processing 500 fields across 50 entities...
Cache hits: 298 (70.1% hit rate)
API calls made: 127
Processing time: 2.1 minutes
Provider cost: $0.89

🔄 Processing with OPENAI (gpt-4o-mini)...
Processing 500 fields across 50 entities...
Cache hits: 298 (70.1% hit rate)  # Same cache used!
API calls made: 127
Processing time: 1.8 minutes
Provider cost: $0.52

🔄 Processing with GEMINI (gemini-1.5-flash)...
Processing 500 fields across 50 entities...
Cache hits: 298 (70.1% hit rate)  # Same cache used!
API calls made: 127
Processing time: 2.3 minutes
Provider cost: $0.31
```

## ⚙️ Advanced Multi-Provider Configuration

### Provider-Specific Optimization

```yaml
# Development configuration - prioritize speed and cost
ai_provider: "openai"
openai_model: "gpt-4o-mini"
field_tokens: 20
minimum_description_score: 60
enable_quality_assessment: false

# Production configuration - prioritize quality
ai_provider: "anthropic"  
anthropic_model: "claude-3-sonnet-20240229"
field_tokens: 35
minimum_description_score: 80
max_description_retry_attempts: 5

# High-volume configuration - prioritize throughput
ai_provider: "gemini"
gemini_model: "gemini-1.5-flash"
field_tokens: 25
minimum_description_score: 65
enable_quality_assessment: true
```

### Environment-Based Provider Selection

```bash
# Development environment
export ENVIRONMENT="development"
export METADATA_BOOTSTRAP_AI_PROVIDER="openai"
export OPENAI_API_KEY="your-dev-key"

# Staging environment  
export ENVIRONMENT="staging"
export METADATA_BOOTSTRAP_AI_PROVIDER="anthropic"
export ANTHROPIC_API_KEY="your-staging-key"

# Production environment
export ENVIRONMENT="production"
export METADATA_BOOTSTRAP_AI_PROVIDER="anthropic"
export ANTHROPIC_API_KEY="your-prod-key"
export METADATA_BOOTSTRAP_ANTHROPIC_MODEL="claude-3-sonnet-20240229"
```

## 🐍 Python API with Multi-Provider Support

```python
from ddn_metadata_bootstrap import BootstrapperConfig, MetadataBootstrapper
from ddn_metadata_bootstrap.description_generator import DescriptionGenerator
import logging

# Configure logging to see provider selection and caching
logging.basicConfig(level=logging.INFO)

# Method 1: Use configuration file
config = BootstrapperConfig(config_file="./config.yaml")

# Method 2: Programmatic provider selection
config = BootstrapperConfig()
config.ai_provider = "openai"
config.openai_api_key = "your-openai-key"
config.openai_model = "gpt-4o-mini"

# Method 3: Direct generator creation with provider
generator = DescriptionGenerator(
    api_key="your-api-key",
    model="claude-3-haiku-20240307",
    provider="anthropic"  # or "openai", "gemini"
)

# Create bootstrapper with multi-provider support
bootstrapper = MetadataBootstrapper(config)

# Process directory with provider-optimized settings
results = bootstrapper.process_directory(
    input_dir="./app/metadata",
    output_dir="./enhanced_metadata"
)

# Get provider-specific statistics
stats = bootstrapper.get_statistics()
print(f"AI Provider: {stats['ai_provider']}")
print(f"Model Used: {stats['model_used']}")
print(f"Provider API Calls: {stats['provider_api_calls']}")
print(f"Provider Cost: ${stats['estimated_provider_cost']:.2f}")

# Switch providers dynamically
for provider in ['anthropic', 'openai', 'gemini']:
    try:
        test_generator = DescriptionGenerator(
            api_key=f"your-{provider}-key",
            provider=provider
        )
        print(f"✅ {provider.upper()} available")
    except ImportError as e:
        print(f"❌ {provider.upper()} unavailable: {e}")
```

## 📊 Enhanced Statistics & Monitoring

```python
# Provider-specific performance tracking
stats = bootstrapper.get_statistics()

# AI Provider metrics
print(f"AI Provider: {stats['ai_provider']}")
print(f"Model: {stats['model_used']}")
print(f"Provider API calls: {stats['provider_api_calls']}")
print(f"Average response time: {stats['avg_response_time_ms']}ms")
print(f"Provider cost: ${stats['estimated_provider_cost']:.3f}")

# Quality comparison across providers
print(f"Average quality score: {stats['average_quality_score']}")
print(f"Quality retries: {stats['quality_retries']}")
print(f"Provider-specific quality: {stats['provider_quality_metrics']}")

# Cross-provider caching efficiency
if 'cache_stats' in stats:
    cache_stats = stats['cache_stats']
    print(f"Cache hit rate: {cache_stats['hit_rate']:.1%}")
    print(f"Cross-provider cache reuse: {cache_stats['cross_provider_reuse']}")
    print(f"Provider switching savings: ${cache_stats['switching_savings']:.2f}")

# Relationship blocking statistics
if 'relationship_stats' in stats:
    rel_stats = stats['relationship_stats']
    print(f"Relationships considered: {rel_stats['relationships_considered']}")
    print(f"Relationships blocked: {rel_stats['relationships_blocked']}")
    print(f"FK blacklist hits: {rel_stats['fk_blacklist_hits']}")
    print(f"Cross-connector blocks: {rel_stats['cross_connector_blocks']}")
```

## 🚀 Provider-Specific Performance Improvements

### Real-World Performance by Provider

#### Anthropic Claude
```bash
Provider: Anthropic Claude Haiku
Processing 500 fields...
✅ Strengths:
- Excellent business context understanding
- Consistent quality across attempts
- Good acronym expansion integration
- Fast response times (avg 850ms)

📊 Results:
- API calls: 127 (after caching)
- Processing time: 2.1 minutes  
- Average quality score: 82
- Cost: $0.89
```

#### OpenAI GPT
```bash
Provider: OpenAI GPT-4o Mini
Processing 500 fields...
✅ Strengths:
- Very fast response times (avg 650ms)
- Excellent technical accuracy
- Cost-effective for high volume
- Good structured output

📊 Results:
- API calls: 127 (after caching)
- Processing time: 1.8 minutes
- Average quality score: 78
- Cost: $0.52
```

#### Google Gemini
```bash
Provider: Google Gemini 1.5 Flash
Processing 500 fields...
✅ Strengths:
- Lowest cost per operation
- Good multilingual support
- Generous rate limits
- Comprehensive descriptions

📊 Results:
- API calls: 127 (after caching)
- Processing time: 2.3 minutes
- Average quality score: 76
- Cost: $0.31
```

## 🧪 Testing Multi-Provider Features

```bash
# Test all providers
pytest tests/test_multi_provider.py -v

# Test provider switching
pytest tests/test_provider_switching.py -v

# Test provider-specific optimizations
pytest tests/test_provider_optimization.py -v

# Test configuration validation for all providers
pytest tests/test_provider_config.py -v

# Test FK blacklist functionality
pytest tests/test_fk_blacklist.py -v

# Run performance benchmarks across providers
pytest tests/benchmark_providers.py -v --benchmark-only
```

## 🤝 Contributing

### Multi-Provider Development Areas

1. **Provider Integration**
   - Additional AI provider support (Claude-4, GPT-5, etc.)
   - Provider-specific optimization algorithms
   - Custom model fine-tuning support

2. **Performance Optimization**
   - Provider-specific prompt engineering
   - Dynamic provider selection based on workload
   - Cost optimization strategies

3. **Quality Assessment**
   - Provider-specific quality metrics
   - Cross-provider quality comparison
   - A/B testing frameworks

4. **Caching Enhancements**
   - Provider-aware cache invalidation
   - Cross-provider description comparison
   - Quality-based cache prioritization

5. **Relationship Blocking**
   - Visual rule builder for FK blacklists
   - Rule impact analysis and testing
   - Advanced pattern matching algorithms

## 📄 License

This project is licensed under the MIT License - see the [LICENSE](LICENSE) file for details.

## 🆘 Support

- 📖 [Documentation](https://github.com/hasura/ddn-metadata-bootstrap#readme)
- 🐛 [Bug Reports](https://github.com/hasura/ddn-metadata-bootstrap/issues)
- 💬 [Discussions](https://github.com/hasura/ddn-metadata-bootstrap/discussions)
- 🤖 [AI Provider Issues](https://github.com/hasura/ddn-metadata-bootstrap/issues?q=label%3Aai-provider)
- 🧠 [Caching Issues](https://github.com/hasura/ddn-metadata-bootstrap/issues?q=label%3Acaching)
- 🔍 [Quality Assessment Issues](https://github.com/hasura/ddn-metadata-bootstrap/issues?q=label%3Aquality)
- 🔗 [Relationship Blocking Issues](https://github.com/hasura/ddn-metadata-bootstrap/issues?q=label%3Arelationship-blocking)

## 🏷️ Version History

See [CHANGELOG.md](CHANGELOG.md) for complete version history and breaking changes.

## ⭐ Acknowledgments

- Built for [Hasura DDN](https://hasura.io/ddn)
- Powered by [Anthropic Claude](https://www.anthropic.com/), [OpenAI GPT](https://openai.com/), and [Google Gemini](https://deepmind.google/technologies/gemini/)
- Linguistic analysis powered by [NLTK](https://www.nltk.org/) and [WordNet](https://wordnet.princeton.edu/)
- Inspired by the GraphQL and OpenAPI communities
- Caching algorithms inspired by database query optimization techniques
- Relationship blocking patterns inspired by enterprise security frameworks

---

Made with ❤️ by the Hasura team
