Metadata-Version: 2.4
Name: corp-names
Version: 0.3.2
Summary: Person and company name normalization: strip titles, suffixes, legal forms, resolve nicknames to canonical forms
Project-URL: Homepage, https://github.com/corp-o-rate/corp-names
Project-URL: Repository, https://github.com/corp-o-rate/corp-names
Project-URL: Issues, https://github.com/corp-o-rate/corp-names/issues
Author: Neil Ellis
License-Expression: MIT
License-File: LICENSE
Classifier: Development Status :: 4 - Beta
Classifier: License :: OSI Approved :: MIT License
Classifier: Programming Language :: Python :: 3
Classifier: Topic :: Text Processing :: Linguistic
Requires-Python: >=3.11
Requires-Dist: click>=8.0.0
Requires-Dist: pydantic>=2.0.0
Requires-Dist: unidecode>=1.3.0
Description-Content-Type: text/markdown

# corp-names

Name normalization library for people and companies. It strips titles, suffixes, and middle initials, and resolves nicknames to canonical forms. It includes 15,000+ person name entries and 1,000+ company legal suffixes across 200+ jurisdictions.

## Installation

```bash
pip install corp-names
```

## Usage

```python
from corp_names import normalize_name

result = normalize_name("Dr. Bob Smith Jr.")
print(result.normalized)  # "robert smith"
print(result.first)       # "robert"
print(result.last)        # "smith"
print(result.prefix)      # "dr"
print(result.suffix)      # "jr"
```
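A common reason to normalize names is to group records that refer to the same person. The following is a minimal sketch of that idea, not part of the library: `group_by_key` and `toy_key` are hypothetical helpers, and `toy_key` is a stand-in that only lowercases and collapses whitespace so the example runs without `corp-names` installed.

```python
from collections import defaultdict
from typing import Callable, Iterable


def group_by_key(names: Iterable[str], key: Callable[[str], str]) -> dict[str, list[str]]:
    """Group raw name strings by the key that `key` returns for each one."""
    groups: dict[str, list[str]] = defaultdict(list)
    for raw in names:
        groups[key(raw)].append(raw)
    return dict(groups)


def toy_key(raw: str) -> str:
    """Stand-in key: lowercase and collapse whitespace only.

    It does NOT strip titles or resolve nicknames.
    """
    return " ".join(raw.lower().split())


groups = group_by_key(["Bob Smith", "bob  smith", "Alice Jones"], toy_key)
print(groups)
```

With the library installed, passing `lambda raw: normalize_name(raw).normalized` as `key` should additionally merge variants such as "Dr. Bob Smith Jr." and "Robert Smith", given the outputs shown above.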

### Normalization levels

Control how aggressively nicknames are resolved with the `level` parameter:

```python
# "common" (default) — only English nicknames (bob→robert, bill→william)
normalize_name("Bob Smith").first                        # "robert"
normalize_name("Guillaume Dupont").first                 # "guillaume" (unchanged)

# "international" — common + cross-language variants (guillaume→william, giuseppe→joseph)
normalize_name("Guillaume Dupont", level="international").first  # "william"
normalize_name("Elsa Schmidt", level="international").first      # "elsa" (unchanged)

# "all" — common + international + archaic (elsa→elizabeth, fanny→frances)
normalize_name("Elsa Schmidt", level="all").first  # "elizabeth"
```
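The levels are cumulative: each one enables the categories of the levels before it. The sketch below illustrates that behavior with a hypothetical miniature table built from the examples above; `NICKNAMES`, `LEVEL_CATEGORIES`, and `resolve_first` are illustrative names, and the library's internal data layout may differ.

```python
# Hypothetical miniature table: nickname -> (canonical form, category).
# Entries mirror the examples in this README; the real data may be stored differently.
NICKNAMES = {
    "bob": ("robert", "common"),
    "bill": ("william", "common"),
    "guillaume": ("william", "international"),
    "giuseppe": ("joseph", "international"),
    "elsa": ("elizabeth", "archaic"),
    "fanny": ("frances", "archaic"),
}

# Each level includes every category of the levels before it.
LEVEL_CATEGORIES = {
    "common": {"common"},
    "international": {"common", "international"},
    "all": {"common", "international", "archaic"},
}


def resolve_first(first: str, level: str = "common") -> str:
    """Return the canonical form if the nickname's category is enabled at `level`."""
    token = first.lower()
    entry = NICKNAMES.get(token)
    if entry is None:
        return token  # not a known nickname: leave unchanged
    canonical, category = entry
    return canonical if category in LEVEL_CATEGORIES[level] else token


print(resolve_first("Guillaume"))                   # guillaume
print(resolve_first("Guillaume", "international"))  # william
print(resolve_first("Elsa", "all"))                 # elizabeth
```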

### Name lookup API

```python
from corp_names import is_known_name, get_name_category

is_known_name("bob")          # True (nickname)
is_known_name("robert")       # True (canonical)
is_known_name("hiroshi")      # True (japanese)

get_name_category("bob")        # "common"
get_name_category("guillaume")  # "international"
get_name_category("hiroshi")    # "japanese"
```

## Company Name Normalization

```python
from corp_names import normalize_company

result = normalize_company("Apple Inc.")
print(result.normalized)   # "apple"
print(result.suffix)       # "inc"
print(result.entity_type)  # "Corporation"

normalize_company("Siemens AG").normalized                  # "siemens"
normalize_company("Samsung Electronics Co., Ltd.").normalized  # "samsung electronics"
normalize_company("Société Générale SA").normalized         # "societe generale"
normalize_company("The Walt Disney Company").normalized     # "walt disney"
normalize_company("Johnson & Johnson").normalized           # "johnson and johnson"
normalize_company("Unilever N.V.").normalized               # "unilever"
normalize_company("Bayerische Motoren Werke GmbH").normalized  # "bayerische motoren werke"
```
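To make the transformations in these examples concrete, here is a rough sketch of one way such results could be produced. It is an assumption-laden illustration, not the library's implementation: `toy_normalize_company` and `LEGAL_SUFFIXES` are hypothetical, the suffix table is a tiny stand-in for the packaged data, and accents are folded with the standard library rather than `unidecode`.

```python
import unicodedata

# Hypothetical miniature suffix table; the library ships 1,000+ entries.
LEGAL_SUFFIXES = {"inc", "ag", "sa", "nv", "gmbh", "co", "ltd", "company"}


def toy_normalize_company(name: str) -> tuple[str, list[str]]:
    """Return (normalized name, stripped suffixes) for a company string."""
    # Fold accents to ASCII with the stdlib (a stand-in for unidecode).
    ascii_name = (
        unicodedata.normalize("NFKD", name).encode("ascii", "ignore").decode("ascii")
    )
    # Lowercase, spell out "&", drop periods, and treat commas as separators.
    cleaned = ascii_name.lower().replace("&", " and ").replace(".", "").replace(",", " ")
    tokens = cleaned.split()
    # Drop a leading "the".
    if tokens and tokens[0] == "the":
        tokens = tokens[1:]
    suffixes: list[str] = []
    # Pop trailing legal-form tokens, always keeping at least one name token.
    while len(tokens) > 1 and tokens[-1] in LEGAL_SUFFIXES:
        suffixes.insert(0, tokens.pop())
    return " ".join(tokens), suffixes


print(toy_normalize_company("Samsung Electronics Co., Ltd."))
print(toy_normalize_company("Société Générale SA"))
```

This sketch only matches single-token suffixes; multi-word legal forms and entity-type lookup would need a richer table.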

## CLI

### Person names

```bash
corp-names normalize "Dr. Bob Smith Jr."
# Original:   Dr. Bob Smith Jr.
# Normalized: robert smith
# First:      robert
# Last:       smith
# Prefix:     dr
# Suffix:     jr
# Nickname:   resolved to canonical form

corp-names normalize "Sir William H. Gates III" --json

corp-names normalize "Guillaume Dupont" --level common        # unchanged
corp-names normalize "Guillaume Dupont" --level international # resolves to william
```

### Company names

```bash
corp-names normalize-company "Apple Inc."
# Original:   Apple Inc.
# Normalized: apple
# Suffix:     inc
# Type:       Corporation

corp-names normalize-company "Samsung Electronics Co., Ltd." --json
```

## Normalization Pipeline

1. Normalize Unicode (transliterate accented characters with unidecode)
2. Strip punctuation (remove periods, commas; preserve hyphens)
3. Tokenize by whitespace
4. Strip prefixes (titles, honorifics)
5. Strip suffixes (generational, credentials)
6. Remove middle initials (single-letter tokens)
7. Resolve nicknames to canonical forms
8. Build normalized lowercase output
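
The eight steps above can be sketched in plain Python as follows. This is a simplified illustration under stated assumptions, not the library's code: `toy_normalize_name` and the three small tables are hypothetical, accent folding uses the standard library instead of `unidecode`, and tokens are lowercased early to keep the lookups simple.

```python
import unicodedata

# Hypothetical miniature tables standing in for the packaged data.
PREFIXES = {"dr", "sir", "mr", "mrs", "ms"}
SUFFIXES = {"jr", "sr", "ii", "iii", "phd"}
NICKNAMES = {"bob": "robert", "bill": "william"}


def toy_normalize_name(name: str) -> dict[str, str]:
    # 1. Fold accents to ASCII (stdlib stand-in for unidecode).
    text = unicodedata.normalize("NFKD", name).encode("ascii", "ignore").decode("ascii")
    # 2. Remove periods and commas; hyphens are left alone.
    text = text.replace(".", "").replace(",", " ")
    # 3. Tokenize by whitespace (lowercased here so table lookups are simple).
    tokens = text.lower().split()
    # 4. Strip leading prefixes, keeping at least one token.
    prefixes: list[str] = []
    while len(tokens) > 1 and tokens[0] in PREFIXES:
        prefixes.append(tokens.pop(0))
    # 5. Strip trailing suffixes, keeping at least one token.
    suffixes: list[str] = []
    while len(tokens) > 1 and tokens[-1] in SUFFIXES:
        suffixes.insert(0, tokens.pop())
    # 6. Drop single-letter tokens between the first and last token.
    if len(tokens) > 2:
        tokens = [tokens[0]] + [t for t in tokens[1:-1] if len(t) > 1] + [tokens[-1]]
    # 7. Resolve the first token through the nickname table.
    if tokens:
        tokens[0] = NICKNAMES.get(tokens[0], tokens[0])
    # 8. Build the lowercase output.
    return {
        "normalized": " ".join(tokens),
        "first": tokens[0] if tokens else "",
        "last": tokens[-1] if len(tokens) > 1 else "",
        "prefix": " ".join(prefixes),
        "suffix": " ".join(suffixes),
    }


print(toy_normalize_name("Dr. Bob Smith Jr."))
```

The guards in steps 4 and 5 keep at least one token, so a single-word input that happens to match a title or suffix is not stripped to nothing in this sketch.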

## NormalizedName Model

```python
class NormalizedName(BaseModel):
    original: str            # Original input
    normalized: str          # Full normalized lowercase name
    first: str               # First/given name (after nickname resolution)
    last: str                # Last/family name
    prefix: str              # Detected prefix(es), e.g. "dr"
    suffix: str              # Detected suffix(es), e.g. "jr"
    nickname_resolved: bool  # Whether first name was a nickname
```

## NormalizedCompany Model

```python
class NormalizedCompany(BaseModel):
    original: str            # Original input
    normalized: str          # Cleaned lowercase name without suffixes
    suffix: str              # Detected suffix, e.g. "inc"
    entity_type: str         # Deduced type, e.g. "Corporation"
    the_prefix_stripped: bool # Whether leading "The" was removed
```

## Data

### Company suffixes

1,000+ legal suffixes across 200+ jurisdictions, merged from [cleanco](https://github.com/psolin/cleanco) (MIT) and the [GLEIF ELF Code List](https://www.gleif.org/en/about-lei/code-lists/iso-20275-entity-legal-forms-code-list) (ISO 20275). Entity types include Corporation, Limited, LLC, Partnership, Sole Proprietorship, Non-Profit, and more.

### Person names

All name data is stored in `nicknames.json` with categorized entries:

- **Nickname mappings** (~1,350): resolve informal/international names to canonical forms
  - `common`: English nicknames (bob→robert, liz→elizabeth)
  - `international`: cross-language variants (guillaume→william, giuseppe→joseph)
  - `archaic`: historical nicknames (elsa→elizabeth, fanny→frances)
- **Standalone canonical names** (~13,800): known first names with no mapping needed
  - `canonical`: Western names (mark, laura, emma, scott)
  - `cjk`: Chinese names (li, wang, zhang)
  - `japanese`: Japanese names (hiroshi, takashi, akira)
  - `korean`: Korean names (kim, park, choi)
  - `arabic`: Arabic/Islamic names (ali, omar, mustafa)
  - `french_compound`: French compound names (jean-pierre, marie-claire)
- **Prefixes**: ~150 titles and honorifics (military, religious, nobility, academic, political)
- **Suffixes**: ~600 credentials, generational markers, and honors (PhD, Jr, III, OBE, CPA)

## License

MIT
