Metadata-Version: 2.4
Name: nomox-semantic-model
Version: 0.1.0
Summary: Semantic data model for LLM-consumable data catalog
Author-email: nomox <support@tvargl.eu>
License-Expression: MIT
Classifier: Development Status :: 3 - Alpha
Classifier: Intended Audience :: Developers
Classifier: License :: OSI Approved :: MIT License
Classifier: Programming Language :: Python :: 3.11
Classifier: Programming Language :: Python :: 3.12
Requires-Python: >=3.11
Requires-Dist: pydantic-settings>=2.0.0
Requires-Dist: pydantic>=2.0.0
Provides-Extra: dev
Requires-Dist: mypy>=1.0.0; extra == 'dev'
Requires-Dist: pytest-cov>=4.0.0; extra == 'dev'
Requires-Dist: pytest>=7.0.0; extra == 'dev'
Requires-Dist: ruff>=0.1.0; extra == 'dev'
Description-Content-Type: text/markdown

# Nomox LLM Semantic Model

Internal Python package that defines the semantic model used across Nomox.

## Installation

To install the package from anywhere, run:

```bash
pip install git+https://github.com/MiraZzle/nomox-semantics-package.git
```

For development:

```bash
pip install -e .
```
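
To confirm that the install is visible to the current interpreter, you can look up the distribution metadata. This is a minimal sketch using only the standard library. The distribution name `nomox-semantic-model` comes from the package metadata; the helper `installed_version` is a hypothetical name introduced here and is not part of the package.

```python
from importlib.metadata import PackageNotFoundError, version
from typing import Optional


def installed_version(dist_name: str) -> Optional[str]:
    """Return the installed version of a distribution, or None if it is absent."""
    try:
        return version(dist_name)
    except PackageNotFoundError:
        return None


if __name__ == "__main__":
    found = installed_version("nomox-semantic-model")
    print(found if found is not None else "nomox-semantic-model is not installed")
```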

## Architecture

The semantic model is organized into two levels plus a set of shared components:

### Level 1: Source-Scoped Semantics

Produced by the Level 1 Indexer Agent. Contains:

- **DataSource**: Top-level container for a data source (Trino catalog.schema)
- **Table**: Tables and views with semantic roles and temporal information
- **Column**: Columns with semantic types, profiling, and sample values
- **InternalRelationship**: Foreign key relationships within a source

### Level 2: Cross-Source Semantics

Produced by the Level 2 Aggregator Agent. Contains:

- **SemanticEntity**: Canonical business concepts (Customer, Order, Product)
- **EntityManifestation**: Where entities appear across sources
- **UnifiedAttribute**: Logical attributes sourced from multiple places
- **EntityRelationship**: Relationships between entities with join paths
- **IdentityResolution**: How to match entities across sources

### Shared Components

- **GlossaryTerm**: Business terminology definitions
- **ConfidenceScore**: Confidence scoring for all elements
- **ExpertOverride**: Human corrections and enhancements
- **IndexingState**: Tracking of indexing jobs and status

## Quick Start

```python
from semantic_model import (
    SemanticModel,
    DataSource,
    Table,
    Column,
    SemanticType,
    SemanticCategory,
    SourceType,
    create_empty_model,
    save_model,
    load_model,
)

# Create an empty model
model = create_empty_model(
    model_id="my-org-model",
    organization_id="my-org",
)

# Create a data source
source = DataSource(
    id="sales-db",
    name="Sales Database",
    trino_catalog="analytics",
    trino_schema="sales",
    fully_qualified_prefix="analytics.sales",
    source_type=SourceType.ANALYTICAL,
    description="Sales transaction data warehouse",
    domain="Sales",
)

# Create a table
orders_table = Table(
    id="orders",
    name="orders",
    fully_qualified_name="analytics.sales.orders",
    description="Fact table containing one row per order",
    columns=[
        Column(
            id="order_id",
            name="order_id",
            ordinal_position=0,
            data_type="VARCHAR",
            is_primary_key=True,
            semantic_type=SemanticType.identifier(subtype="uuid"),
            description="Unique order identifier",
        ),
        Column(
            id="customer_id",
            name="customer_id",
            ordinal_position=1,
            data_type="VARCHAR",
            is_foreign_key=True,
            semantic_type=SemanticType.identifier(subtype="uuid"),
            description="ID of the customer who placed the order",
        ),
        Column(
            id="total_amount",
            name="total_amount",
            ordinal_position=2,
            data_type="DECIMAL(12,2)",
            semantic_type=SemanticType(
                category=SemanticCategory.CURRENCY,
                confidence=0.95,
            ),
            unit="USD",
            description="Total order value including tax",
        ),
    ],
)

# Add table to source
source = source.add_table(orders_table)

# Add source to model
model = model.add_source(source)

# Save the model
save_model(model, "semantic_model.json")

# Load the model
loaded_model = load_model("semantic_model.json")

# Generate prompt context for LLM
prompt_context = model.to_prompt_format(
    include_sources=True,
    include_entities=True,
    include_glossary=True,
)
print(prompt_context)
```
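
The Quick Start reassigns the results of `add_table` and `add_source`, which suggests that these methods return an updated copy instead of mutating in place. That is an inference from the example, not a documented guarantee. The sketch below shows the general pattern with frozen dataclasses; `SketchTable` and `SketchSource` are hypothetical stand-ins and do not reflect the package's actual implementation.

```python
from dataclasses import dataclass, replace
from typing import Tuple


@dataclass(frozen=True)
class SketchTable:
    """Hypothetical stand-in for Table; not the package's class."""
    id: str
    name: str


@dataclass(frozen=True)
class SketchSource:
    """Hypothetical stand-in for DataSource; not the package's class."""
    id: str
    tables: Tuple[SketchTable, ...] = ()

    def add_table(self, table: SketchTable) -> "SketchSource":
        # Build a new instance; the original object is left untouched.
        return replace(self, tables=self.tables + (table,))


empty = SketchSource(id="sales-db")
with_orders = empty.add_table(SketchTable(id="orders", name="orders"))
print(len(empty.tables), len(with_orders.tables))
```

If the real methods behave this way, forgetting to reassign (`source.add_table(orders_table)` without `source = ...`) would silently drop the table.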

## Working with Confidence Scores

```python
from semantic_model import ConfidenceScore, LowConfidenceItem, ConfidenceObjectType

# Create a confidence score
confidence = ConfidenceScore(
    overall=0.75,
    threshold=0.8,
    schema_understanding=0.9,
    semantic_typing=0.7,
    description_quality=0.65,
    low_confidence_items=[
        LowConfidenceItem(
            object_type=ConfidenceObjectType.COLUMN,
            object_id="status_code",
            object_name="status_code",
            score=0.4,
            reason="Unknown categorical values",
            suggested_clarification="What do status codes 'P', 'A', 'R' mean?",
        ),
    ],
)

# Check if meets threshold
if not confidence.meets_threshold:
    print("Source needs expert review")
    for item in confidence.low_confidence_items:
        print(f"  - {item.object_name}: {item.reason}")
```
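
The example above reads `meets_threshold` as an attribute. One plausible reading is that it compares `overall` against `threshold`. The sketch below illustrates that reading only; `SketchConfidence` is a hypothetical stand-in, and the inclusive comparison (`>=`) is an assumption.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SketchConfidence:
    """Hypothetical stand-in for ConfidenceScore; not the package's class."""
    overall: float
    threshold: float

    @property
    def meets_threshold(self) -> bool:
        # Assumption: a score equal to the threshold counts as passing.
        return self.overall >= self.threshold


score = SketchConfidence(overall=0.75, threshold=0.8)
print(score.meets_threshold)
```

With the values from the example (`overall=0.75`, `threshold=0.8`), this check fails, so the source would be flagged for expert review.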

## Expert Overrides

```python
from semantic_model import ExpertOverride, ReindexScope

# Create an override
override = ExpertOverride(
    id="override-001",
    created_by="domain-expert@company.com",
    field_path="description",
    original_value="Unknown table",
    override_value="Customer master data from CRM system",
    reason="Clarified based on CRM documentation",
    reindex_scope=ReindexScope.THIS_SOURCE,
)

# Apply to an existing Table instance (`table` is assumed to be defined elsewhere)
table.expert_overrides.append(override)
```
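
How the package resolves overrides when reading a field is not shown here. As a rough illustration of the idea, the sketch below applies an override to a plain dictionary. `SketchOverride` and `apply_override` are hypothetical names, and treating `field_path` as a single top-level key is an assumption.

```python
from dataclasses import dataclass
from typing import Any, Dict


@dataclass(frozen=True)
class SketchOverride:
    """Hypothetical stand-in for ExpertOverride; not the package's class."""
    field_path: str
    original_value: Any
    override_value: Any


def apply_override(record: Dict[str, Any], override: SketchOverride) -> Dict[str, Any]:
    """Return a copy of record with the overridden field replaced."""
    # Assumption: field_path names one top-level key (no nested paths).
    updated = dict(record)
    updated[override.field_path] = override.override_value
    return updated


table_record = {"name": "accounts", "description": "Unknown table"}
sketch_override = SketchOverride(
    field_path="description",
    original_value="Unknown table",
    override_value="Customer master data from CRM system",
)
resolved = apply_override(table_record, sketch_override)
print(resolved["description"])
```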

## Semantic Entities (Level 2)

```python
from semantic_model import (
    SemanticEntity,
    EntityManifestation,
    ManifestationRole,
    UnifiedAttribute,
    EntityRelationship,
    JoinPath,
    JoinStep,
)

# Create a semantic entity
customer_entity = SemanticEntity(
    id="customer",
    name="Customer",
    description="A customer is any individual or organization with an account",
    canonical_id_name="customer_id",
    canonical_id_format="UUID",
    domain="Sales",
    manifestations=[
        EntityManifestation(
            source_id="crm-db",
            table_id="accounts",
            fully_qualified_name="crm.public.accounts",
            role=ManifestationRole.PRIMARY,
            key_column_id="account_id",
            usage_guidance="Use for real-time customer master data",
        ),
        EntityManifestation(
            source_id="analytics-db",
            table_id="customer_360",
            fully_qualified_name="analytics.customers.customer_360",
            role=ManifestationRole.DERIVED,
            key_column_id="customer_id",
            usage_guidance="Use for analytics with pre-computed metrics",
        ),
    ],
)

# Add to model
model = model.add_entity(customer_entity)
```

## Serialization

```python
from semantic_model import save_model, load_model
from semantic_model.serialization import ModelExporter, save_model_yaml

# Save as JSON
save_model(model, "model.json")

# Save as YAML (requires PyYAML)
save_model_yaml(model, "model.yaml")

# Export utilities
exporter = ModelExporter(model)

# Get prompt-ready context
context = exporter.to_prompt_context(max_tokens=4000)

# Get source summary
summary = exporter.to_source_summary()
```
