Metadata-Version: 2.4
Name: ddn-metadata-bootstrap
Version: 1.0.16
Summary: AI-powered metadata enhancement for Hasura DDN schema files
Author-email: Kenneth Stott <kenneth@hasura.io>
Maintainer-email: Kenneth Stott <kenneth@hasura.io>
License: MIT
Project-URL: Homepage, https://github.com/hasura/ddn-metadata-bootstrap
Project-URL: Documentation, https://github.com/hasura/ddn-metadata-bootstrap#readme
Project-URL: Repository, https://github.com/hasura/ddn-metadata-bootstrap.git
Project-URL: Bug Reports, https://github.com/hasura/ddn-metadata-bootstrap/issues
Project-URL: Changelog, https://github.com/hasura/ddn-metadata-bootstrap/blob/main/CHANGELOG.md
Keywords: hasura,ddn,graphql,schema,metadata,ai,anthropic,descriptions,relationships
Classifier: Development Status :: 4 - Beta
Classifier: Intended Audience :: Developers
Classifier: License :: OSI Approved :: MIT License
Classifier: Operating System :: OS Independent
Classifier: Programming Language :: Python :: 3
Classifier: Programming Language :: Python :: 3.8
Classifier: Programming Language :: Python :: 3.9
Classifier: Programming Language :: Python :: 3.10
Classifier: Programming Language :: Python :: 3.11
Classifier: Programming Language :: Python :: 3.12
Classifier: Topic :: Software Development :: Libraries :: Python Modules
Classifier: Topic :: Database
Classifier: Topic :: Internet :: WWW/HTTP :: Dynamic Content
Classifier: Topic :: Software Development :: Code Generators
Requires-Python: >=3.8
Description-Content-Type: text/markdown
License-File: LICENSE
Requires-Dist: anthropic>=0.3.0
Requires-Dist: PyYAML>=6.0
Requires-Dist: python-dotenv>=0.19.0
Requires-Dist: inflection>=0.5.1
Requires-Dist: nltk>=3.9.1
Requires-Dist: openai>=1.97.0
Requires-Dist: google-generativeai>=0.8.5
Provides-Extra: dev
Requires-Dist: pytest>=7.0.0; extra == "dev"
Requires-Dist: pytest-cov>=4.0.0; extra == "dev"
Requires-Dist: black>=22.0.0; extra == "dev"
Requires-Dist: flake8>=5.0.0; extra == "dev"
Requires-Dist: mypy>=1.0.0; extra == "dev"
Requires-Dist: twine>=4.0.0; extra == "dev"
Requires-Dist: build>=0.10.0; extra == "dev"
Requires-Dist: bump2version>=1.0.0; extra == "dev"
Provides-Extra: test
Requires-Dist: pytest>=7.0.0; extra == "test"
Requires-Dist: pytest-cov>=4.0.0; extra == "test"
Dynamic: license-file

# DDN Metadata Bootstrap

[![PyPI version](https://badge.fury.io/py/ddn-metadata-bootstrap.svg)](https://badge.fury.io/py/ddn-metadata-bootstrap)
[![Python versions](https://img.shields.io/pypi/pyversions/ddn-metadata-bootstrap.svg)](https://pypi.org/project/ddn-metadata-bootstrap/)
[![License: MIT](https://img.shields.io/badge/License-MIT-yellow.svg)](https://opensource.org/licenses/MIT)

AI-powered metadata enhancement for Hasura DDN (Data Delivery Network) schema files. The tool generates descriptions and detects relationships in your YAML/HML schema definitions using Anthropic, OpenAI, or Google Gemini models, with all settings managed through a central `config.yaml`, environment variables, and CLI flags.

## 🚀 Features

### 🤖 **Multi-Provider AI Support**
- **Anthropic Claude**: Default provider with claude-3-haiku, claude-3-sonnet, and claude-3-opus models
- **OpenAI GPT**: Support for gpt-3.5-turbo, gpt-4, gpt-4o-mini, and latest models
- **Google Gemini**: Support for gemini-pro, gemini-1.5-pro, and gemini-1.5-flash models
- **Automatic Fallback**: Graceful degradation between providers with configurable priorities
- **Provider-Specific Optimization**: Model-specific prompting and parameter tuning
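
The fallback behaviour can be pictured with a minimal sketch. It tries providers in priority order and returns the first successful result. The function and the stand-in provider callables are hypothetical illustrations, not the package's actual API.

```python
from typing import Callable, Dict, List, Tuple


def generate_with_fallback(
    prompt: str,
    providers: Dict[str, Callable[[str], str]],
    priority: List[str],
) -> Tuple[str, str]:
    """Try each provider in priority order; return (provider_name, text)."""
    errors = []
    for name in priority:
        call = providers.get(name)
        if call is None:
            continue  # provider not configured, move on to the next one
        try:
            return name, call(prompt)
        except Exception as exc:  # any provider failure triggers fallback
            errors.append(f"{name}: {exc}")
    raise RuntimeError("All providers failed: " + "; ".join(errors))


# Hypothetical stand-ins for real SDK calls.
def failing_provider(prompt: str) -> str:
    raise TimeoutError("simulated outage")


def working_provider(prompt: str) -> str:
    return f"description for {prompt}"


used, text = generate_with_fallback(
    "riskId",
    {"anthropic": failing_provider, "openai": working_provider},
    priority=["anthropic", "openai", "gemini"],
)
print(used, "->", text)
```

Here `gemini` is listed in the priority order but never configured, so it is simply skipped.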

### 🎯 **Granular Feature Control**
- **Individual Feature Flags**: Control each processing feature independently
- **Flexible Processing Modes**: Choose between all, forward-only, or none for relationships
- **Selective Enhancement**: Process only descriptions, only relationships, or both
- **Rebuild Capabilities**: Rebuild existing relationships from scratch when needed

### 🧠 **Advanced AI Generation**
- **Quality Assessment with Retry Logic**: Multi-attempt generation with configurable scoring thresholds
- **Context-Aware Business Descriptions**: Domain-specific system prompts with industry context
- **Smart Field Analysis**: Automatically detects and skips self-explanatory, generic, or cryptic fields
- **Configurable Length Controls**: Precise control over description length and token usage
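
As a rough sketch of the retry idea, the loop below keeps generating until a candidate reaches the minimum score or the attempts run out. The defaults mirror the `minimum_description_score` and `max_description_retry_attempts` settings; the generator and scorer are toy stand-ins, and the real scoring logic is not shown in this README.

```python
from typing import Callable, Optional, Tuple


def generate_with_retry(
    generate: Callable[[int], str],
    score: Callable[[str], int],
    minimum_score: int = 70,
    max_attempts: int = 3,
) -> Tuple[Optional[str], int]:
    """Return the first candidate scoring >= minimum_score, else (None, best score)."""
    best_score = -1
    for attempt in range(1, max_attempts + 1):
        candidate = generate(attempt)
        candidate_score = score(candidate)
        best_score = max(best_score, candidate_score)
        if candidate_score >= minimum_score:
            return candidate, candidate_score
    return None, best_score


# Toy stand-ins: the first draft is weak, the second is acceptable.
drafts = {
    1: "This field represents data.",
    2: "Risk assessment identifier for tracking security evaluations.",
}


def toy_generate(attempt: int) -> str:
    return drafts.get(attempt, drafts[2])


def toy_score(text: str) -> int:
    return 40 if text.lower().startswith("this field") else 85


description, final_score = generate_with_retry(toy_generate, toy_score)
print(description, final_score)
```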

### 🧠 **Intelligent Caching System** 
- **Similarity-Based Matching**: Reuses descriptions for similar fields across entities (85% similarity threshold)
- **Performance Optimization**: Reduces API calls by up to 70% on large schemas through intelligent caching
- **Cache Statistics**: Real-time performance monitoring with hit rates and API cost savings tracking
- **Type-Aware Matching**: Considers field types and entity context for better cache accuracy
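
To make the similarity threshold concrete, here is a minimal sketch of a type-aware cache built on the standard library's `difflib.SequenceMatcher`. The class name and the choice of similarity measure are assumptions for illustration; the tool's own matching algorithm may differ.

```python
from difflib import SequenceMatcher
from typing import Dict, Optional, Tuple


class SimilarityCache:
    """Reuse descriptions for similarly named fields of the same type."""

    def __init__(self, threshold: float = 0.85) -> None:
        self.threshold = threshold
        self._entries: Dict[Tuple[str, str], str] = {}
        self.hits = 0
        self.misses = 0

    def put(self, field_name: str, field_type: str, description: str) -> None:
        self._entries[(field_name, field_type)] = description

    def get(self, field_name: str, field_type: str) -> Optional[str]:
        best_ratio, best_desc = 0.0, None
        for (name, ftype), desc in self._entries.items():
            if ftype != field_type:
                continue  # type-aware: never reuse across different types
            ratio = SequenceMatcher(None, field_name.lower(), name.lower()).ratio()
            if ratio > best_ratio:
                best_ratio, best_desc = ratio, desc
        if best_desc is not None and best_ratio >= self.threshold:
            self.hits += 1
            return best_desc
        self.misses += 1
        return None


cache = SimilarityCache()
cache.put("customer_email", "String", "Customer contact email address.")

similar = cache.get("customer_emails", "String")   # near-identical name, same type
wrong_type = cache.get("customer_email", "Int")    # same name, different type
unrelated = cache.get("order_total", "String")     # dissimilar name
print(similar, wrong_type, unrelated, cache.hits, cache.misses)
```

Every cache hit is one API call avoided, which is where the hit-rate and cost-savings statistics come from.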

### 🔍 **WordNet-Based Linguistic Analysis**
- **Generic Term Detection**: Uses NLTK and WordNet for sophisticated term analysis to skip meaningless fields
- **Semantic Density Analysis**: Evaluates conceptual richness and specificity of field names
- **Definition Quality Scoring**: Ensures meaningful, non-circular descriptions through linguistic validation
- **Abstraction Level Calculation**: Determines appropriate description depth based on semantic analysis

### 📝 **Enhanced Acronym Expansion**
- **Comprehensive Mappings**: 200+ pre-configured acronyms for technology, finance, and business domains
- **Context-Aware Expansion**: Industry-specific acronym interpretation based on domain context
- **Pre-Generation Enhancement**: Expands acronyms BEFORE AI generation for better context
- **Custom Domain Support**: Fully configurable acronym mappings via YAML configuration
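
Pre-generation expansion can be sketched as a simple word-level substitution on the field name before it is placed in the prompt. The helper names below are hypothetical; the three mappings are taken from the sample `acronym_mappings` configuration.

```python
import re
from typing import Dict, List

ACRONYMS: Dict[str, str] = {
    "mfa": "Multi-Factor Authentication",
    "sso": "Single Sign-On",
    "iam": "Identity and Access Management",
}


def split_field_name(name: str) -> List[str]:
    """Split snake_case or camelCase names into lowercase words."""
    spaced = re.sub(r"(?<=[a-z0-9])(?=[A-Z])", " ", name)
    return [word.lower() for word in re.split(r"[\s_]+", spaced) if word]


def expand_acronyms(name: str, mappings: Dict[str, str]) -> str:
    """Replace each word that is a known acronym with its expansion."""
    return " ".join(mappings.get(word, word) for word in split_field_name(name))


for field in ("mfaEnabled", "sso_config", "iamPolicy"):
    print(field, "->", expand_acronyms(field, ACRONYMS))
```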

### 🔗 **Advanced Relationship Detection**
- **Template-Based FK Detection**: Sophisticated foreign key detection with confidence scoring and semantic validation
- **Shared Business Key Relationships**: Many-to-many relationships via shared field analysis with FK-aware filtering
- **Cross-Subgraph Intelligence**: Smart entity matching across different subgraphs
- **Configurable Templates**: Flexible FK template patterns with placeholders for complex naming conventions
- **Advanced Relationship Blocking**: Precision rule-based system to prevent inappropriate cross-connector relationships
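
A simplified sketch of template-based FK matching follows. It handles only the `{pt}_{gi}` pattern (for example `user_id` → `Users.id`), uses an assumed list of generic identifiers, and applies naive singularization. Confidence scoring and semantic validation are left out, and none of these helpers are the package's real functions.

```python
from typing import Dict, List, Optional, Tuple

GENERIC_IDS = ("id", "key", "name")  # assumed values for the {gi} placeholder


def singular_snake(entity: str) -> str:
    """Users -> user (naive: lowercases and strips one trailing 's')."""
    lowered = entity.lower()
    return lowered[:-1] if lowered.endswith("s") else lowered


def match_fk(field: str, entities: Dict[str, List[str]]) -> Optional[Tuple[str, str]]:
    """Match a field against '{pt}_{gi}'; return (entity, primary_key_field)."""
    for entity, fields in entities.items():
        for gi in GENERIC_IDS:
            # The candidate key must actually exist on the target entity.
            if field == f"{singular_snake(entity)}_{gi}" and gi in fields:
                return entity, gi
    return None


entities = {"Users": ["id", "email"], "Services": ["name", "owner"]}
print(match_fk("user_id", entities))
print(match_fk("service_name", entities))
print(match_fk("status", entities))
```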

### ⚙️ **Comprehensive Configuration System**
- **YAML-First Configuration**: Central `config.yaml` file for all settings with full documentation
- **Waterfall Precedence**: CLI args > Environment variables > config.yaml > defaults
- **Configuration Validation**: Comprehensive validation with helpful error messages and source tracking
- **Feature Toggles**: Granular control over processing features with clear flag names
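
The waterfall precedence can be illustrated with a small resolver that also reports where each value came from. The `resolve` function and its `DEFAULTS` table are hypothetical; the `METADATA_BOOTSTRAP_` prefix matches the environment variables used elsewhere in this README.

```python
from typing import Any, Dict, Mapping, Tuple

DEFAULTS: Dict[str, Any] = {"ai_provider": "anthropic", "create_fk": "all"}
ENV_PREFIX = "METADATA_BOOTSTRAP_"


def resolve(
    key: str,
    cli_args: Mapping[str, Any],
    env: Mapping[str, str],
    file_config: Mapping[str, Any],
) -> Tuple[Any, str]:
    """Return (value, source) using CLI > environment > config.yaml > defaults."""
    if cli_args.get(key) is not None:
        return cli_args[key], "cli"
    env_key = ENV_PREFIX + key.upper()
    if env_key in env:
        return env[env_key], "env"
    if key in file_config:
        return file_config[key], "config.yaml"
    return DEFAULTS[key], "default"


# Plain dicts stand in for parsed CLI args, os.environ, and the loaded YAML file.
cli = {"ai_provider": None}
env = {"METADATA_BOOTSTRAP_AI_PROVIDER": "openai"}
file_config = {"ai_provider": "gemini", "create_fk": "forward"}

print(resolve("ai_provider", cli, env, file_config))
print(resolve("create_fk", cli, env, file_config))
```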

### 🎯 **Advanced Quality Controls**
- **Buzzword Detection**: Avoids corporate jargon and meaningless generic terms
- **Pattern-Based Filtering**: Regex-based rejection of poor description formats
- **Technical Language Translation**: Converts technical terms to business-friendly language
- **Length Optimization**: Multiple validation layers with hard limits and target lengths
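
As an illustration of how these checks might combine, the sketch below rejects a description that contains a buzzword, matches a forbidden pattern, or exceeds the length limit. The word lists and the limit are copied from the sample configuration; the `rejection_reasons` function itself is a hypothetical stand-in.

```python
import re
from typing import List

BUZZWORDS = ["synergy", "leverage", "paradigm", "ecosystem"]
FORBIDDEN_PATTERNS = [
    r"this\s+field\s+represents",
    r"used\s+to\s+(track|manage|identify)",
]
MAX_LENGTH = 120  # mirrors field_desc_max_length


def rejection_reasons(description: str) -> List[str]:
    """Return every reason a description fails; an empty list means it passes."""
    reasons = []
    lowered = description.lower()
    for word in BUZZWORDS:
        if re.search(rf"\b{re.escape(word)}\b", lowered):
            reasons.append(f"buzzword: {word}")
    for pattern in FORBIDDEN_PATTERNS:
        if re.search(pattern, lowered):
            reasons.append(f"forbidden pattern: {pattern}")
    if len(description) > MAX_LENGTH:
        reasons.append("too long")
    return reasons


bad = rejection_reasons("This field represents the synergy score.")
good = rejection_reasons("Risk assessment identifier for tracking security evaluations.")
print(bad)
print(good)
```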

### 🔍 **Intelligent Field Selection**
- **Generic Field Detection**: Skips overly common fields that don't benefit from descriptions
- **Cryptic Abbreviation Handling**: Configurable handling of unclear field names with vowel analysis
- **Self-Explanatory Pattern Recognition**: Automatically identifies fields that don't need descriptions
- **Value Assessment**: Only generates descriptions that add meaningful business value
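
One way to picture the vowel analysis is a ratio test combined with a length cutoff. How the tool actually combines these signals is not documented here, so the rule and the `min_vowel_ratio` parameter below are assumptions; only the length cutoff corresponds to a documented setting (`max_cryptic_field_length`).

```python
VOWELS = set("aeiou")


def vowel_ratio(name: str) -> float:
    """Fraction of alphabetic characters in the name that are vowels."""
    letters = [c for c in name.lower() if c.isalpha()]
    if not letters:
        return 0.0
    return sum(c in VOWELS for c in letters) / len(letters)


def looks_cryptic(
    name: str,
    max_cryptic_length: int = 4,
    min_vowel_ratio: float = 0.2,  # hypothetical threshold
) -> bool:
    """Flag names that are very short or contain too few vowels."""
    return len(name) <= max_cryptic_length or vowel_ratio(name) < min_vowel_ratio


for field in ("cstmr_nm", "customer_name", "xq"):
    print(field, looks_cryptic(field))
```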

## 📦 Installation

### From PyPI (Recommended)

```bash
pip install ddn-metadata-bootstrap
```

### Provider-Specific Dependencies

The tool supports multiple AI providers. The `anthropic`, `openai`, and `google-generativeai` SDKs are declared as required dependencies, so the base install already includes all three providers. The package metadata defines only the `dev` and `test` extras; there are no per-provider extras.

```bash
# Base install already pulls in anthropic, openai, and google-generativeai
pip install ddn-metadata-bootstrap

# Optional extras declared in the package metadata
pip install "ddn-metadata-bootstrap[dev]"   # pytest, pytest-cov, black, flake8, mypy, twine, build, bump2version
pip install "ddn-metadata-bootstrap[test]"  # pytest, pytest-cov
```

### From Source

```bash
git clone https://github.com/hasura/ddn-metadata-bootstrap.git
cd ddn-metadata-bootstrap
pip install -e .
```

## 🏃 Quick Start

### 1. Choose Your AI Provider

#### Option A: Anthropic Claude (Default - Recommended)
```bash
export ANTHROPIC_API_KEY="your-anthropic-api-key"
export METADATA_BOOTSTRAP_AI_PROVIDER="anthropic"  # Optional (default)
export METADATA_BOOTSTRAP_ANTHROPIC_MODEL="claude-3-haiku-20240307"  # Optional
```

#### Option B: OpenAI GPT
```bash
export OPENAI_API_KEY="your-openai-api-key"  
export METADATA_BOOTSTRAP_AI_PROVIDER="openai"
export METADATA_BOOTSTRAP_OPENAI_MODEL="gpt-3.5-turbo"  # Optional
```

#### Option C: Google Gemini
```bash
export GEMINI_API_KEY="your-gemini-api-key"
# or alternatively:
export GOOGLE_API_KEY="your-gemini-api-key"
export METADATA_BOOTSTRAP_AI_PROVIDER="gemini"
export METADATA_BOOTSTRAP_GEMINI_MODEL="gemini-pro"  # Optional
```

### 2. Set up your directories

```bash
export METADATA_BOOTSTRAP_INPUT_DIR="./app/metadata"
export METADATA_BOOTSTRAP_OUTPUT_DIR="./enhanced_metadata"
```

### 3. Create a configuration file (Recommended)

Create a `config.yaml` file in your project directory:

```yaml
# config.yaml - DDN Metadata Bootstrap Configuration

# =============================================================================
# GLOBAL PROCESSING CONFIGURATION
# =============================================================================
# Controls which features are enabled and basic processing behavior

# Feature Flags - what processing to perform
create_fk: all                           # all|forward|none - FK relationships
create_shared_keys: all                  # all|forward|none - Shared key relationships
create_command_relationship_hints: true  # true|false - Command relationship hints
create_descriptions: true                # true|false - AI-generated descriptions
rebuild_relationships: false             # true|false - Rebuild existing relationships from scratch

enable_quality_assessment: true         # Enable AI to score and improve its own descriptions

# AI Provider Configuration
ai_provider: "anthropic"  # Choose: anthropic, openai, gemini

# Provider-specific API keys (alternatively set via environment variables)
# anthropic_api_key: "your-anthropic-key"
# openai_api_key: "your-openai-key" 
# gemini_api_key: "your-gemini-key"

# Provider-specific models
anthropic_model: "claude-3-haiku-20240307"  # claude-3-sonnet-20240229, claude-3-opus-20240229
openai_model: "gpt-3.5-turbo"               # gpt-4, gpt-4o-mini, gpt-4-turbo-preview
gemini_model: "gemini-pro"                  # gemini-1.5-pro-latest, gemini-1.5-flash

# =============================================================================
# DESCRIPTION GENERATION CONFIGURATION
# =============================================================================

# Domain-specific system prompt for your organization
system_prompt: |
  You generate concise field descriptions for database schema metadata at a global financial services firm.
  
  DOMAIN CONTEXT:
  - Organization: Global bank
  - Department: Cybersecurity operations  
  - Use case: Risk management and security compliance
  - Regulatory environment: Financial services (SOX, Basel III, GDPR, etc.)
  
  Think: "What would a cybersecurity analyst at a bank need to know about this field?"

# Token and length limits
field_tokens: 25                    # Max tokens AI can generate for field descriptions
kind_tokens: 50                     # Max tokens AI can generate for kind descriptions
field_desc_max_length: 120          # Maximum total characters for field descriptions
kind_desc_max_length: 250           # Maximum total characters for entity descriptions

# Quality thresholds
minimum_description_score: 70       # Minimum score (0-100) to accept a description
max_description_retry_attempts: 3   # How many times to retry for better quality

# =============================================================================
# ENHANCED ACRONYM EXPANSION
# =============================================================================
acronym_mappings:
  # Technology & Computing
  api: "Application Programming Interface"
  ui: "User Interface"
  db: "Database"
  
  # Security & Access Management
  mfa: "Multi-Factor Authentication"
  sso: "Single Sign-On"
  iam: "Identity and Access Management"
  siem: "Security Information and Event Management"
  
  # Financial Services & Compliance
  pci: "Payment Card Industry"
  sox: "Sarbanes-Oxley Act"
  kyc: "Know-Your-Customer"
  aml: "Anti-Money Laundering"
  # ... 200+ total mappings available

# =============================================================================
# INTELLIGENT FIELD SELECTION
# =============================================================================
# Fields to skip entirely - these will not get descriptions at all
skip_field_patterns:
  - "^id$"
  - "^_id$"
  - "^uuid$"
  - "^created_at$"
  - "^updated_at$"
  - "^debug_.*"
  - "^test_.*"
  - "^temp_.*"

# Generic fields - won't get unique descriptions (too common)
generic_fields:
  - "id"
  - "key"
  - "uid"
  - "guid"
  - "name"

# Self-explanatory fields - simple patterns that don't need descriptions
self_explanatory_patterns:
  - '^id$'
  - '^_id$'
  - '^guid$'
  - '^uuid$'
  - '^key$'

# Cryptic Field Handling
skip_cryptic_abbreviations: true   # Skip fields with unclear abbreviations
skip_ultra_short_fields: true      # Skip very short field names that are likely abbreviations
max_cryptic_field_length: 4        # Field names this length or shorter are considered cryptic

# Content quality controls
buzzwords: [
  'synergy', 'leverage', 'paradigm', 'ecosystem',
  'contains', 'stores', 'holds', 'represents'
]

# Single-quoted YAML strings keep backslashes literally, so use a single backslash
forbidden_patterns: [
  'this\s+field\s+represents',
  'used\s+to\s+(track|manage|identify)',
  'business.*information'
]

# =============================================================================
# RELATIONSHIP DETECTION
# =============================================================================
# FK Template Patterns for relationship detection
# Format: "{pk_pattern}|{fk_pattern}"
# Placeholders: {gi}=generic_id, {pt}=primary_table, {ps}=primary_subgraph, {pm}=prefix_modifier
fk_templates:
  - "{gi}|{pm}_{pt}_{gi}"           # active_service_name → Services.name
  - "{gi}|{pt}_{gi}"                # user_id → Users.id
  - "{pt}_{gi}|{pm}_{pt}_{gi}"      # active_user_id → Users.user_id

# =============================================================================
# ADVANCED RELATIONSHIP BLOCKING
# =============================================================================
# Precision rule-based system to prevent inappropriate relationships
# Uses bidirectional validation with data_connector + entity + field pattern matching
fk_key_blacklist:
  # Block cross-cloud provider connections with infrastructure fields
  - entity_pattern_a:
      data_connector: "^(gcp|arg|various)$"  # Google Cloud Platform, Azure Resource Graph, Various
      entity: "^(gcp_|google_).*"            # Google/GCP entities
      field: ".*"                            # Any field
    entity_pattern_b:
      data_connector: "^(gcp|arg|various)$"  
      entity: "^(az_|azure_).*"              # Azure entities
      field: ".*(resource|project|policy|storage|compute).*"  # Infrastructure fields only
    logic: "and"
    reason: "Block google/gcp entities from connecting to azure entities with infrastructure-related fields"
  
  # Complete isolation between major cloud platforms
  - entity_pattern_a:
      data_connector: "^gcp$"                # Google Cloud Platform connector
      entity: ".*"                           # Any entity
      field: ".*"                            # Any field
    entity_pattern_b:
      data_connector: "^arg$"                # Azure Resource Graph connector  
      entity: ".*"                           # Any entity
      field: ".*"                            # Any field
    logic: "and"
    reason: "Block all connections between Google Cloud Platform and Azure Resource Graph connectors"

# Shared relationship limits
max_shared_relationships: 10000
max_shared_per_entity: 10
min_shared_confidence: 30

# Shared Key Rejection Patterns - fields matching these won't be used for shared relationships
shared_key_rejection_patterns:
  # Private/Technical Fields (leading underscore indicates internal use)
  - "^_.*$"
  # Primary Identifiers (too generic for meaningful relationships)
  - "^_?(id|key)$"
  # Generic Classification Fields (overly broad categorization)
  - "^(name|type|category|title|code|level|kind)$"
  # State/Status Fields (frequently changing, not structural)
  - "^(status|state|active)$"
  # Audit Fields - Temporal Only (timestamp-based, not relational)
  - "^(created|updated|modified)(_at|_date|_time|_timestamp)?$"
```
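
To show how the rejection patterns shape shared-key detection, the sketch below collects field names that appear on more than one entity and drops any that match a rejection pattern. The patterns are copied from the configuration above; the `shared_key_candidates` helper and the sample entities are hypothetical, and confidence scoring and the per-entity limits are omitted.

```python
import re
from collections import defaultdict
from typing import Dict, List

REJECTION_PATTERNS = [
    r"^_.*$",
    r"^_?(id|key)$",
    r"^(name|type|category|title|code|level|kind)$",
    r"^(status|state|active)$",
    r"^(created|updated|modified)(_at|_date|_time|_timestamp)?$",
]


def shared_key_candidates(entities: Dict[str, List[str]]) -> Dict[str, List[str]]:
    """Map each non-rejected field name to the entities that share it."""
    owners = defaultdict(list)
    for entity, fields in entities.items():
        for field in fields:
            if any(re.search(pattern, field) for pattern in REJECTION_PATTERNS):
                continue  # too generic or technical to imply a relationship
            owners[field].append(entity)
    return {field: sorted(names) for field, names in owners.items() if len(names) > 1}


sample = {
    "Accounts": ["id", "account_number", "status", "created_at"],
    "Transactions": ["id", "account_number", "status", "amount"],
    "AuditLog": ["_id", "account_number", "created"],
}
candidates = shared_key_candidates(sample)
print(candidates)
```

Although `id` and `status` appear on several entities, only `account_number` survives the filters.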

### 4. Run the tool with your chosen provider

```bash
# Use default provider (Anthropic) with default settings
ddn-metadata-bootstrap

# Use OpenAI explicitly
ddn-metadata-bootstrap --ai-provider openai --openai-api-key your-key

# Use Gemini with specific model
ddn-metadata-bootstrap --ai-provider gemini --gemini-model gemini-1.5-pro

# Show configuration including AI provider setup
ddn-metadata-bootstrap --show-config

# Test your AI provider connection
ddn-metadata-bootstrap --test-provider

# Process only relationships (skip descriptions)
ddn-metadata-bootstrap --create-descriptions false

# Process only descriptions (skip relationships)
ddn-metadata-bootstrap --create-fk none --create-shared-keys none

# Rebuild all relationships from scratch
ddn-metadata-bootstrap --rebuild-relationships

# Use custom configuration file
ddn-metadata-bootstrap --config custom-config.yaml

# Enable verbose logging to see AI provider selection and caching
ddn-metadata-bootstrap --verbose
```

## 🎯 Feature Control System

Each processing feature is controlled independently through the configuration keys below:

### Core Processing Features

| Feature | Config Key | Values | Description |
|---------|------------|--------|-------------|
| **FK Relationships** | `create_fk` | `all`, `forward`, `none` | Foreign key relationship detection |
| **Shared Key Relationships** | `create_shared_keys` | `all`, `forward`, `none` | Shared field relationship detection |
| **Command Hints** | `create_command_relationship_hints` | `true`, `false` | Command relationship hints |
| **Descriptions** | `create_descriptions` | `true`, `false` | AI-generated descriptions |
| **Rebuild Mode** | `rebuild_relationships` | `true`, `false` | Rebuild existing relationships |

### Processing Modes

#### **All Mode** (`all`)
- Creates relationships in both directions
- Full bidirectional relationship graph
- Best for comprehensive schema understanding

#### **Forward Mode** (`forward`)
- Creates relationships in forward direction only
- Reduces relationship complexity
- Useful for directed schema analysis

#### **None Mode** (`none`)
- Skips the feature entirely
- Fastest processing
- Use when feature not needed
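
The three modes can be pictured with a short sketch. It assumes that "forward" means the direction from the entity holding the foreign key to the entity it references; the `expand_relationships` helper and the tuple layout are illustrative only.

```python
from typing import List, Tuple

Relationship = Tuple[str, str, str]  # (source_entity, target_entity, via_field)


def expand_relationships(fk_edges: List[Relationship], mode: str) -> List[Relationship]:
    """Apply the all|forward|none mode to a list of detected FK edges."""
    if mode == "none":
        return []
    if mode not in ("all", "forward"):
        raise ValueError(f"mode must be all, forward, or none (got {mode!r})")
    result = list(fk_edges)  # forward direction is always included
    if mode == "all":
        # Add the reverse direction for every detected edge.
        result += [(target, source, field) for source, target, field in fk_edges]
    return result


edges = [("Orders", "Users", "user_id")]
for mode in ("all", "forward", "none"):
    print(mode, expand_relationships(edges, mode))
```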

### Feature Combinations

```yaml
# Only generate descriptions (no relationships)
create_fk: none
create_shared_keys: none
create_descriptions: true

# Only generate FK relationships (no descriptions or shared keys)
create_fk: all
create_shared_keys: none
create_descriptions: false

# Minimal processing (relationships only, forward direction)
create_fk: forward
create_shared_keys: forward
create_descriptions: false

# Full processing with rebuild
create_fk: all
create_shared_keys: all
create_descriptions: true
rebuild_relationships: true
```

## 🔗 Advanced Relationship Blocking System

The tool includes a sophisticated **bidirectional relationship blocking system** that prevents inappropriate foreign key relationships from being generated. This is particularly important in enterprise environments with multiple data connectors, cloud providers, and security boundaries.

### Key Features

#### **Precision Pattern Matching**
Each blocking rule uses three-part patterns for maximum precision:
- **Data Connector**: Regex pattern matching the connector name (e.g., `^gcp$`, `^(test|dev)_.*`)
- **Entity Name**: Regex pattern matching the entity/table name (e.g., `^google_.*`, `^azure_storage.*`)
- **Field Name**: Regex pattern matching the field name (e.g., `.*resource.*`, `.*secret.*`)

#### **Bidirectional Validation**
Rules automatically check both directions of a relationship:
- **Pattern A → Pattern B**: `google_compute` → `azure_storage_resource`
- **Pattern B → Pattern A**: `azure_vm` → `google_analytics_data`

Both directions are blocked by a single rule definition.

#### **Flexible Logic Operators**
- **AND Logic**: All patterns (connector AND entity AND field) must match for both sides
- **OR Logic**: Either side matching its full pattern triggers the block
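
A minimal sketch of how one rule could be evaluated, following the description above: each side matches only when its connector, entity, and field patterns all match, the `logic` value decides how the two sides combine, and the rule is checked in both directions. The function names are hypothetical and the real evaluator may differ in details such as case sensitivity.

```python
import re
from typing import Dict

Side = Dict[str, str]  # keys: data_connector, entity, field


def side_matches(pattern: Side, actual: Side) -> bool:
    """A side matches only if connector, entity, and field all match."""
    return all(
        re.search(pattern[key], actual[key])
        for key in ("data_connector", "entity", "field")
    )


def is_blocked(rule: dict, source: Side, target: Side) -> bool:
    """Check the rule in both directions (A→B and B→A)."""
    a, b = rule["entity_pattern_a"], rule["entity_pattern_b"]
    combine = all if rule.get("logic", "and") == "and" else any
    forward = combine([side_matches(a, source), side_matches(b, target)])
    reverse = combine([side_matches(b, source), side_matches(a, target)])
    return forward or reverse


rule = {
    "entity_pattern_a": {"data_connector": "^gcp$", "entity": ".*", "field": ".*"},
    "entity_pattern_b": {"data_connector": "^arg$", "entity": ".*", "field": ".*"},
    "logic": "and",
}
gcp_side = {"data_connector": "gcp", "entity": "google_compute", "field": "project_id"}
arg_side = {"data_connector": "arg", "entity": "azure_storage", "field": "resource_id"}

print(is_blocked(rule, gcp_side, arg_side))  # GCP → Azure
print(is_blocked(rule, arg_side, gcp_side))  # Azure → GCP (same rule, reversed)
print(is_blocked(rule, gcp_side, gcp_side))  # both sides on the same connector
```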

### Real-World Examples

#### **Cross-Cloud Security Isolation**
```yaml
# Block Google Cloud from Azure Resource Graph
- entity_pattern_a:
    data_connector: "^gcp$"        # Google Cloud Platform
    entity: ".*"                   # Any GCP entity
    field: ".*"                    # Any field
  entity_pattern_b:
    data_connector: "^arg$"        # Azure Resource Graph  
    entity: ".*"                   # Any Azure entity
    field: ".*"                    # Any field
  logic: "and"
  reason: "Complete isolation between cloud providers for security compliance"
```

#### **Environment Separation**
```yaml
# Block test environments from production sensitive data
- entity_pattern_a:
    data_connector: "^(test|dev)_.*"
    entity: ".*"
    field: ".*"
  entity_pattern_b:
    data_connector: "^prod_.*"
    entity: ".*"
    field: ".*(pii|ssn|credit_card|password).*"
  logic: "and"
  reason: "Prevent test/dev access to production sensitive data"
```

#### **Infrastructure Boundaries**
```yaml
# Block legacy systems from modern cloud infrastructure
- entity_pattern_a:
    data_connector: "^legacy_.*"
    entity: "^(mainframe|cobol)_.*"
    field: ".*"
  entity_pattern_b:
    data_connector: "^(gcp|aws|azure)_.*"
    entity: "^(kubernetes|container|serverless)_.*"
    field: ".*"
  logic: "and"
  reason: "Prevent direct legacy-to-cloud connections without proper integration layer"
```

### Configuration Validation

Use the following commands to validate and inspect your blocking rules:

```bash
# Validate your FK blacklist rules
ddn-metadata-bootstrap --validate-config

# Test specific blocking scenarios
ddn-metadata-bootstrap --test-fk-blocking

# Show compiled regex patterns
ddn-metadata-bootstrap --show-config --verbose
```

## 🤖 AI Provider Comparison

### Performance & Cost Comparison

| Provider | Speed | Cost | Quality | Best For |
|----------|-------|------|---------|----------|
| **Anthropic Claude Haiku** | ⚡⚡⚡ Very Fast | 💰 Low | ⭐⭐⭐⭐ High | Development, High Volume |
| **Anthropic Claude Sonnet** | ⚡⚡ Fast | 💰💰 Medium | ⭐⭐⭐⭐⭐ Excellent | Production, Balanced |
| **Anthropic Claude Opus** | ⚡ Medium | 💰💰💰 High | ⭐⭐⭐⭐⭐ Excellent | Critical Schemas |
| **OpenAI GPT-3.5 Turbo** | ⚡⚡⚡ Very Fast | 💰 Very Low | ⭐⭐⭐ Good | Development, Budget |
| **OpenAI GPT-4o Mini** | ⚡⚡⚡ Very Fast | 💰 Low | ⭐⭐⭐⭐ High | Production, Cost-Optimized |
| **OpenAI GPT-4** | ⚡⚡ Fast | 💰💰💰 High | ⭐⭐⭐⭐⭐ Excellent | Premium Quality |
| **Google Gemini Pro** | ⚡⚡ Fast | 💰 Very Low | ⭐⭐⭐⭐ High | Large Scale, Budget |
| **Google Gemini 1.5 Flash** | ⚡⚡⚡ Very Fast | 💰 Low | ⭐⭐⭐ Good | High Throughput |

### Provider-Specific Configuration Examples

#### Anthropic Claude (Recommended)
```yaml
ai_provider: "anthropic"
anthropic_model: "claude-3-haiku-20240307"  # Fast & cost-effective
# anthropic_model: "claude-3-sonnet-20240229"  # Balanced
# anthropic_model: "claude-3-opus-20240229"    # Highest quality

# Anthropic-optimized settings
field_tokens: 30
system_prompt: |
  Generate concise, business-focused field descriptions.
  Focus on practical utility and clear business meaning.
```

#### OpenAI GPT (Cost-Optimized)
```yaml
ai_provider: "openai"
openai_model: "gpt-4o-mini"  # Best balance of cost and quality
# openai_model: "gpt-3.5-turbo"     # Most cost-effective
# openai_model: "gpt-4-turbo-preview"  # Highest quality

# OpenAI-optimized settings
field_tokens: 25
system_prompt: |
  You are a technical writer creating database field descriptions.
  Be concise, specific, and business-focused.
```

#### Google Gemini (High Volume)
```yaml
ai_provider: "gemini"
gemini_model: "gemini-1.5-flash"  # High throughput
# gemini_model: "gemini-pro"           # Balanced
# gemini_model: "gemini-1.5-pro-latest"  # Highest quality

# Gemini-optimized settings
field_tokens: 35
system_prompt: |
  Create clear, professional descriptions for database schema fields.
  Focus on business value and practical understanding.
```

## 📝 Enhanced Examples

### Multi-Provider Description Generation

#### Input Schema (HML)
```yaml
kind: ObjectType
version: v1
definition:
  name: ThreatAssessment
  fields:
    - name: riskId
      type: String!
    - name: mfaEnabled
      type: Boolean!
    - name: ssoConfig
      type: String
    - name: iamPolicy
      type: String
```

#### Output with Different Providers

##### Anthropic Claude (Business-Focused)
```yaml
kind: ObjectType
version: v1
definition:
  name: ThreatAssessment
  description: |
    Security risk evaluation and compliance status tracking for 
    organizational threat management and regulatory oversight.
  fields:
    - name: riskId
      type: String!
      description: Risk assessment identifier for tracking security evaluations.
    - name: mfaEnabled
      type: Boolean!
      description: Multi-Factor Authentication enablement status for security policy compliance.
    - name: ssoConfig
      type: String
      description: Single Sign-On configuration settings for identity management.
    - name: iamPolicy
      type: String
      description: Identity and Access Management policy governing user permissions.
```

### Feature Control Examples

#### Descriptions Only (No Relationships)
```bash
# CLI
ddn-metadata-bootstrap --create-fk none --create-shared-keys none --create-descriptions true
```

```yaml
# Config YAML
create_fk: none
create_shared_keys: none
create_command_relationship_hints: false
create_descriptions: true
```

#### Relationships Only (No Descriptions)
```bash
# CLI
ddn-metadata-bootstrap --create-descriptions false
```

```yaml
# Config YAML
create_fk: all
create_shared_keys: all
create_command_relationship_hints: true
create_descriptions: false
```

#### Forward-Only Relationships (Reduced Complexity)
```yaml
# Config YAML
create_fk: forward
create_shared_keys: forward
create_command_relationship_hints: true
create_descriptions: true
```

#### Rebuild Mode (Start Fresh)
```bash
# CLI
ddn-metadata-bootstrap --rebuild-relationships
```

```yaml
# Config YAML
rebuild_relationships: true
```

## 🐍 Python API with Multi-Provider Support

```python
from ddn_metadata_bootstrap import BootstrapperConfig, MetadataBootstrapper
from ddn_metadata_bootstrap.description_generator import DescriptionGenerator
import logging

# Configure logging to see provider selection and caching
logging.basicConfig(level=logging.INFO)

# Method 1: Use configuration file
config = BootstrapperConfig(config_file="./config.yaml")

# Method 2: Programmatic feature control (an alternative to Method 1; this replaces the config loaded above)
config = BootstrapperConfig()
config.ai_provider = "openai"
config.openai_api_key = "your-openai-key"
config.openai_model = "gpt-4o-mini"

# Feature control
config.create_descriptions = True
config.create_fk = "all"  # all|forward|none
config.create_shared_keys = "forward"  # all|forward|none
config.create_command_relationship_hints = True
config.rebuild_relationships = False

# Method 3: Direct generator creation with provider
generator = DescriptionGenerator(
    api_key="your-api-key",
    model="claude-3-haiku-20240307",
    provider="anthropic"  # or "openai", "gemini"
)

# Create bootstrapper with feature control
bootstrapper = MetadataBootstrapper(config)

# Process directory with configured features
results = bootstrapper.process_directory(
    input_dir="./app/metadata",
    output_dir="./enhanced_metadata"
)

# Check what features were processed
processing_summary = config.get_processing_summary()
print(f"Processed: {processing_summary}")

# Get provider-specific statistics
stats = bootstrapper.get_statistics()
print(f"AI Provider: {stats['ai_provider']}")
print(f"Model Used: {stats['model_used']}")
print(f"Provider API Calls: {stats['provider_api_calls']}")
print(f"Provider Cost: ${stats['estimated_provider_cost']:.2f}")
```

## 📊 Enhanced Statistics & Monitoring

```python
# Feature-specific performance tracking
stats = bootstrapper.get_statistics()

# Feature processing summary
print(f"Processing Summary: {config.get_processing_summary()}")
print("Features Enabled:")
print(f"  - Descriptions: {config.should_create_descriptions()}")
print(f"  - FK Relationships: {config.should_create_fk_relationships()}")
print(f"  - Shared Key Relationships: {config.should_create_shared_key_relationships()}")
print(f"  - Command Hints: {config.should_create_command_relationship_hints()}")
print(f"  - Rebuild Mode: {config.should_rebuild_relationships()}")

# AI Provider metrics
print(f"AI Provider: {stats['ai_provider']}")
print(f"Model: {stats['model_used']}")
print(f"Provider API calls: {stats['provider_api_calls']}")
print(f"Average response time: {stats['avg_response_time_ms']}ms")
print(f"Provider cost: ${stats['estimated_provider_cost']:.3f}")

# Relationship blocking statistics
if 'relationship_stats' in stats:
    rel_stats = stats['relationship_stats']
    print(f"Relationships considered: {rel_stats['relationships_considered']}")
    print(f"Relationships blocked: {rel_stats['relationships_blocked']}")
    print(f"FK blacklist hits: {rel_stats['fk_blacklist_hits']}")
    print(f"Cross-connector blocks: {rel_stats['cross_connector_blocks']}")
```

## 🚀 Provider-Specific Performance Improvements

### Real-World Performance by Provider

#### Anthropic Claude
```text
Provider: Anthropic Claude Haiku
Processing Features: descriptions, FK relationships (all), shared keys (forward)
Processing 500 fields...
✅ Strengths:
- Excellent business context understanding
- Consistent quality across attempts
- Good acronym expansion integration
- Fast response times (avg 850ms)

📊 Results:
- API calls: 127 (after caching)
- Processing time: 2.1 minutes  
- Average quality score: 82
- Cost: $0.89
```

#### Configuration-Based Performance
```text
Feature Set: Descriptions only (relationships disabled)
Provider: OpenAI GPT-4o Mini
Processing 500 fields...
✅ Results:
- API calls: 89 (descriptions only)
- Processing time: 1.2 minutes
- Average quality score: 78
- Cost: $0.31

Feature Set: Relationships only (descriptions disabled)
Provider: Local processing
Processing 500 fields...
✅ Results:
- API calls: 0 (no AI needed)
- Processing time: 0.3 minutes
- Relationships generated: 247
- Cost: $0.00
```

## ⚙️ Advanced Multi-Provider Configuration

### Environment-Based Provider Selection

```bash
# Development environment - fast and cheap
export ENVIRONMENT="development"
export METADATA_BOOTSTRAP_AI_PROVIDER="openai"
export METADATA_BOOTSTRAP_CREATE_DESCRIPTIONS="true"
export METADATA_BOOTSTRAP_CREATE_FK="forward"
export METADATA_BOOTSTRAP_CREATE_SHARED_KEYS="none"

# Staging environment - balanced  
export ENVIRONMENT="staging"
export METADATA_BOOTSTRAP_AI_PROVIDER="anthropic"
export METADATA_BOOTSTRAP_CREATE_FK="all"
export METADATA_BOOTSTRAP_CREATE_SHARED_KEYS="forward"

# Production environment - comprehensive
export ENVIRONMENT="production"
export METADATA_BOOTSTRAP_AI_PROVIDER="anthropic"
export METADATA_BOOTSTRAP_ANTHROPIC_MODEL="claude-3-sonnet-20240229"
export METADATA_BOOTSTRAP_CREATE_FK="all"
export METADATA_BOOTSTRAP_CREATE_SHARED_KEYS="all"
export METADATA_BOOTSTRAP_REBUILD_RELATIONSHIPS="true"
```

## 🧪 Testing Multi-Provider Features

```bash
# Test all providers
pytest tests/test_multi_provider.py -v

# Test feature control system
pytest tests/test_feature_flags.py -v

# Test provider-specific optimizations
pytest tests/test_provider_optimization.py -v

# Test configuration validation for all providers
pytest tests/test_provider_config.py -v

# Test FK blacklist functionality
pytest tests/test_fk_blacklist.py -v

# Run performance benchmarks across providers
pytest tests/benchmark_providers.py -v --benchmark-only
```

## 🤝 Contributing

### Multi-Provider Development Areas

1. **Provider Integration**
   - Additional AI provider support (Claude-4, GPT-5, etc.)
   - Provider-specific optimization algorithms
   - Custom model fine-tuning support

2. **Feature Control Enhancements**
   - Advanced processing pipelines
   - Conditional feature dependencies
   - Performance profiling per feature

3. **Performance Optimization**
   - Provider-specific prompt engineering
   - Dynamic provider selection based on workload
   - Cost optimization strategies

4. **Quality Assessment**
   - Provider-specific quality metrics
   - Cross-provider quality comparison
   - A/B testing frameworks

5. **Relationship Blocking**
   - Visual rule builder for FK blacklists
   - Rule impact analysis and testing
   - Advanced pattern matching algorithms

## 📄 License

This project is licensed under the MIT License - see the [LICENSE](LICENSE) file for details.

## 🆘 Support

- 📖 [Documentation](https://github.com/hasura/ddn-metadata-bootstrap#readme)
- 🐛 [Bug Reports](https://github.com/hasura/ddn-metadata-bootstrap/issues)
- 💬 [Discussions](https://github.com/hasura/ddn-metadata-bootstrap/discussions)
- 🤖 [AI Provider Issues](https://github.com/hasura/ddn-metadata-bootstrap/issues?q=label%3Aai-provider)
- 🎯 [Feature Control Issues](https://github.com/hasura/ddn-metadata-bootstrap/issues?q=label%3Afeature-control)
- 🧠 [Caching Issues](https://github.com/hasura/ddn-metadata-bootstrap/issues?q=label%3Acaching)
- 🔍 [Quality Assessment Issues](https://github.com/hasura/ddn-metadata-bootstrap/issues?q=label%3Aquality)
- 🔗 [Relationship Blocking Issues](https://github.com/hasura/ddn-metadata-bootstrap/issues?q=label%3Arelationship-blocking)

## 🏷️ Version History

See [CHANGELOG.md](CHANGELOG.md) for complete version history and breaking changes.

## ⭐ Acknowledgments

- Built for [Hasura DDN](https://hasura.io/ddn)
- Powered by [Anthropic Claude](https://www.anthropic.com/), [OpenAI GPT](https://openai.com/), and [Google Gemini](https://deepmind.google/technologies/gemini/)
- Linguistic analysis powered by [NLTK](https://www.nltk.org/) and [WordNet](https://wordnet.princeton.edu/)
- Inspired by the GraphQL and OpenAPI communities
- Caching algorithms inspired by database query optimization techniques
- Relationship blocking patterns inspired by enterprise security frameworks

---

Made with ❤️ by the Hasura team
