Metadata-Version: 2.4
Name: corp-entity-db
Version: 0.3.0
Summary: Entity database for organizations, people, roles, and locations with embedding search
License-Expression: MIT
Requires-Python: >=3.10
Requires-Dist: click>=8.0.0
Requires-Dist: huggingface-hub>=0.20.0
Requires-Dist: numpy>=1.24.0
Requires-Dist: pycountry>=24.6.1
Requires-Dist: pydantic>=2.0.0
Requires-Dist: sentence-transformers>=2.2.0
Requires-Dist: usearch>=2.0.0
Provides-Extra: all
Requires-Dist: fastapi>=0.100.0; extra == 'all'
Requires-Dist: httpx>=0.25.0; extra == 'all'
Requires-Dist: indexed-bzip2; extra == 'all'
Requires-Dist: orjson; extra == 'all'
Requires-Dist: uvicorn[standard]>=0.20.0; extra == 'all'
Provides-Extra: build
Requires-Dist: indexed-bzip2; extra == 'build'
Requires-Dist: orjson; extra == 'build'
Provides-Extra: client
Requires-Dist: httpx>=0.25.0; extra == 'client'
Provides-Extra: serve
Requires-Dist: fastapi>=0.100.0; extra == 'serve'
Requires-Dist: uvicorn[standard]>=0.20.0; extra == 'serve'
Description-Content-Type: text/markdown

# corp-entity-db

Entity database library and search engine for organizations, people, roles, and locations. It provides embedding-based semantic search over entities imported from GLEIF, SEC EDGAR, Wikidata, and Companies House.

## Installation

```bash
# Default: search and resolve (no build dependencies)
pip install corp-entity-db

# With database build/import support
pip install "corp-entity-db[build]"

# With HTTP server (corp-entity-db serve)
pip install "corp-entity-db[serve]"

# With remote client (EntityDBClient)
pip install "corp-entity-db[client]"

# Everything
pip install "corp-entity-db[all]"
```

The default install includes sentence-transformers, USearch, and huggingface_hub for searching and downloading pre-built databases. The embedding model (`google/embeddinggemma-300m`, 300M params) is downloaded automatically on first use.

## Quick Start

```bash
# Download the lite database + USearch indexes
corp-entity-db download

# Search organizations
corp-entity-db search "Microsoft"
corp-entity-db search "Microsoft" --hybrid

# Search people (composite embeddings: name + role + org)
corp-entity-db search-people "Tim Cook"
corp-entity-db search-people "Tim Cook" --role CEO --org Apple

# Show database statistics
corp-entity-db status
```

## Python API

```python
from corp_entity_db import OrganizationDatabase, CompanyEmbedder, get_database_path

# Search organizations
db = OrganizationDatabase(get_database_path())
embedder = CompanyEmbedder()
matches = db.search(embedder.embed("Microsoft"), top_k=10)
for record, score in matches:
    print(f"{record.name} ({record.entity_type}) - score: {score:.3f}")

# Search people with composite embeddings + identity fallback
from corp_entity_db import PersonDatabase, get_person_database
from corp_entity_db.store import format_person_query
person_db = get_person_database()
query_emb = embedder.embed_composite_person("Tim Cook", role="CEO", org="Apple")
identity_emb = embedder.embed_for_identity_index(format_person_query("Tim Cook", person_type="executive"))
matches = person_db.search(query_emb, top_k=5, identity_query_embedding=identity_emb)
```

## Server Mode

Keep models warm in memory for low-latency repeated searches (requires `[serve]` extra):

```bash
corp-entity-db serve                  # Start on localhost:8222
corp-entity-db serve --port 9000      # Custom port
```

## Data Sources

| Source | Description | Scale |
|--------|-------------|-------|
| Wikidata | Organizations & notable people | ~1.5M orgs, ~13.2M people |
| GLEIF | Legal Entity Identifier records | ~2.6M orgs |
| SEC EDGAR | US public company filers & officers | ~73K orgs |
| Companies House | UK registered companies | ~5.5M orgs |

## Embedding Architecture

**Organizations**: Embeddings are stored as 768-dim float32 BLOBs in the `organizations` table. The full database enforces NOT NULL on the embedding column. Int8 scalar quantization is computed on-the-fly during USearch HNSW index building and is not stored separately.
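
To make the storage format concrete, here is a minimal sketch of decoding a 768-dim float32 BLOB and quantizing it to int8. It assumes the BLOB is the raw little-endian float32 buffer and that vectors are L2-normalized; the helper names `blob_to_vector` and `quantize_int8` are hypothetical, and the scale-by-127 scheme is one common form of scalar quantization, not necessarily what USearch does internally.

```python
import numpy as np

EMBEDDING_DIM = 768  # dimensionality stated above


def blob_to_vector(blob: bytes) -> np.ndarray:
    """Decode a BLOB into a 1-D vector (assumes raw little-endian float32)."""
    vec = np.frombuffer(blob, dtype="<f4")
    if vec.shape[0] != EMBEDDING_DIM:
        raise ValueError(f"expected {EMBEDDING_DIM} dims, got {vec.shape[0]}")
    return vec


def quantize_int8(vec: np.ndarray) -> np.ndarray:
    """Illustrative symmetric scalar quantization of a unit-norm vector."""
    # Components of an L2-normalized vector lie in [-1, 1], so scale by 127.
    return np.clip(np.rint(vec * 127.0), -127, 127).astype(np.int8)


# Round-trip demo with a synthetic vector (not real database content).
rng = np.random.default_rng(0)
original = rng.standard_normal(EMBEDDING_DIM).astype("<f4")
original /= np.linalg.norm(original)

blob = original.tobytes()          # 768 * 4 = 3072 bytes per organization
decoded = blob_to_vector(blob)
quantized = quantize_int8(decoded)  # 768 bytes, 4x smaller than float32
```

Under these assumptions, each int8 component differs from the float32 original by at most half a quantization step (0.5/127).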

**People (Dual-Index Search)**: People use two USearch HNSW indexes. The embeddings for both are generated on the fly during index building and are never stored in SQLite:

- **Primary composite index** (`people_usearch.bin`, 768-dim): Name, role, and organization are embedded as separate 256-dim vectors using Matryoshka truncation, independently L2-normalized, weighted (name=8, role=1, org=4), and concatenated. This gives AND-style matching: a poor match on organization cannot be compensated by a good match on name, enabling precise queries like "find the CEO named Tim Cook at Apple." Built by `build_people_composite_index()`.
- **Secondary identity index** (`people_identity_usearch.bin`, 256-dim): Natural language descriptions (e.g. "Taylor Swift, an artist", "Tim Cook, a CEO of Apple") embedded with Matryoshka truncation to 256 dims. Consulted as a fallback when the composite scores fall below the 0.75 threshold. This improves accuracy for identity-defined people (artists, athletes, media, activists) who lack role/org context and would otherwise leave 512 of the 768 composite dims as zeros. Built by `build_people_identity_index()`.
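
The composite layout described above can be sketched with NumPy. This is an illustration under stated assumptions, not the body of `build_people_composite_index()`: the function names are hypothetical, the weights are applied as plain multipliers on each normalized segment, and a missing field is assumed to contribute an all-zero segment.

```python
import numpy as np

SUB_DIM = 256  # per-field dimensionality after Matryoshka truncation
WEIGHTS = {"name": 8.0, "role": 1.0, "org": 4.0}  # weights quoted above


def truncate_and_normalize(full: np.ndarray, dim: int = SUB_DIM) -> np.ndarray:
    """Keep the first `dim` components, then L2-normalize (zero stays zero)."""
    head = np.asarray(full, dtype=np.float32)[:dim]
    norm = np.linalg.norm(head)
    return head / norm if norm > 0 else head


def composite_vector(name, role=None, org=None) -> np.ndarray:
    """Concatenate weighted name/role/org segments into one 768-dim vector."""
    segments = []
    for field, emb in (("name", name), ("role", role), ("org", org)):
        if emb is None:
            # Assumption: a missing field contributes an all-zero segment.
            segments.append(np.zeros(SUB_DIM, dtype=np.float32))
        else:
            segments.append(WEIGHTS[field] * truncate_and_normalize(emb))
    return np.concatenate(segments)


# Synthetic stand-ins for full 768-dim model outputs.
rng = np.random.default_rng(0)
name_emb, role_emb, org_emb = (rng.standard_normal(768) for _ in range(3))

full = composite_vector(name_emb, role_emb, org_emb)
name_only = composite_vector(name_emb)  # role and org segments are all zeros
```

The `name_only` case shows why the identity index exists: with no role or organization, two thirds of the composite vector carry no signal.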

Search accuracy is 82.5% acc@1 and 91.4% acc@20 on 280 queries across 14 person types, at 60-80ms per query after model warmup. The identity fallback improves accuracy for identity-defined types.
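
The identity fallback described above can be sketched as a small control-flow function. The merge strategy here (keep the best score per record across both indexes) is an assumption for illustration; the library's actual merging logic may differ. The name `search_with_fallback`, the `Match` tuple shape, and the callables standing in for the two index lookups are all hypothetical.

```python
from typing import Callable, List, Tuple

Match = Tuple[str, float]  # (record id, similarity score); hypothetical shape

FALLBACK_THRESHOLD = 0.75  # threshold quoted above


def search_with_fallback(
    composite_search: Callable[[int], List[Match]],
    identity_search: Callable[[int], List[Match]],
    top_k: int = 5,
    threshold: float = FALLBACK_THRESHOLD,
) -> List[Match]:
    """Query the composite index first; consult the identity index if weak."""
    primary = composite_search(top_k)
    best = max((score for _, score in primary), default=0.0)
    if best >= threshold:
        return primary
    # Weak composite result: merge both lists, keeping the best score per id.
    merged = dict(primary)
    for rec_id, score in identity_search(top_k):
        merged[rec_id] = max(score, merged.get(rec_id, float("-inf")))
    return sorted(merged.items(), key=lambda m: m[1], reverse=True)[:top_k]


# Toy data standing in for real index lookups.
strong = [("p1", 0.91), ("p2", 0.60)]
weak = [("p3", 0.52)]
identity = [("p4", 0.88), ("p3", 0.40)]

kept = search_with_fallback(lambda k: strong, lambda k: identity)
fell_back = search_with_fallback(lambda k: weak, lambda k: identity)
```

In the toy data, `strong` clears the threshold and is returned unchanged, while `weak` triggers the fallback and the identity hit `p4` ranks first.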

## Database Variants

- **Lite** (default download): The organization embedding column is dropped; search uses the pre-built USearch HNSW indexes
- **Full**: Includes float32 embedding BLOBs in the organizations table

In both variants, people embeddings exist only in USearch index files (`people_usearch.bin` and `people_identity_usearch.bin`), never in SQLite.
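
One way to tell the two variants apart is to inspect the `organizations` table schema with the standard `sqlite3` module. This is a sketch, not a library API: `has_org_embeddings` is a hypothetical helper, and the column name `embedding` is an assumption that should be checked against the actual schema.

```python
import sqlite3


def has_org_embeddings(conn: sqlite3.Connection, column: str = "embedding") -> bool:
    """Return True if the organizations table still has an embedding column."""
    rows = conn.execute("PRAGMA table_info(organizations)").fetchall()
    # Each row is (cid, name, type, notnull, dflt_value, pk); index 1 is the name.
    return any(row[1] == column for row in rows)


# Demo against two throwaway in-memory schemas (not the real database layout).
full_db = sqlite3.connect(":memory:")
full_db.execute(
    "CREATE TABLE organizations (id INTEGER PRIMARY KEY, name TEXT, embedding BLOB NOT NULL)"
)

lite_db = sqlite3.connect(":memory:")
lite_db.execute("CREATE TABLE organizations (id INTEGER PRIMARY KEY, name TEXT)")

is_full = has_org_embeddings(full_db)
is_lite = not has_org_embeddings(lite_db)
```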

## Database Management

```bash
corp-entity-db post-import             # Generate embeddings + build USearch indexes + VACUUM
corp-entity-db build-index             # Rebuild all USearch HNSW indexes
corp-entity-db build-index --identity-only  # Rebuild only the people identity index
corp-entity-db repair-embeddings       # Generate missing embeddings + rebuild indexes
corp-entity-db migrate-embeddings      # Migrate from legacy vec0 tables to embedding column
```

HuggingFace dataset: [Corp-o-Rate-Community/entity-references](https://huggingface.co/datasets/Corp-o-Rate-Community/entity-references)
