Metadata-Version: 2.4
Name: corp-names
Version: 0.2.1
Summary: Person name normalization: strip titles, suffixes, resolve nicknames to canonical forms
Author: Neil Ellis
License-Expression: MIT
License-File: LICENSE
Classifier: Development Status :: 4 - Beta
Classifier: License :: OSI Approved :: MIT License
Classifier: Programming Language :: Python :: 3
Classifier: Topic :: Text Processing :: Linguistic
Requires-Python: >=3.11
Requires-Dist: click>=8.0.0
Requires-Dist: pydantic>=2.0.0
Requires-Dist: unidecode>=1.3.0
Description-Content-Type: text/markdown

# corp-names

A library for normalizing person names. It strips titles, suffixes, and middle initials, and resolves nicknames to canonical forms. It includes 15,000+ name entries covering English nicknames, international variants, and Chinese, Japanese, Korean, and Arabic names.

## Installation

```bash
pip install corp-names
```

## Usage

```python
from corp_names import normalize_name

result = normalize_name("Dr. Bob Smith Jr.")
print(result.normalized)  # "robert smith"
print(result.first)       # "robert"
print(result.last)        # "smith"
print(result.prefix)      # "dr"
print(result.suffix)      # "jr"
```

### International name resolution

```python
from corp_names import normalize_name

normalize_name("Guillaume de la Fontaine").normalized  # "william de la fontaine"
normalize_name("Prof. Klaus Müller").normalized        # "nicholas muller"
normalize_name("Giuseppe Rossi").normalized            # "joseph rossi"
```

### Name lookup API

```python
from corp_names import is_known_name, get_name_category

is_known_name("bob")          # True (nickname)
is_known_name("robert")       # True (canonical)
is_known_name("hiroshi")      # True (japanese)

get_name_category("bob")        # "common"
get_name_category("guillaume")  # "international"
get_name_category("hiroshi")    # "japanese"
```

## CLI

```bash
corp-names normalize "Dr. Bob Smith Jr."
# Original:   Dr. Bob Smith Jr.
# Normalized: robert smith
# First:      robert
# Last:       smith
# Prefix:     dr
# Suffix:     jr
# Nickname:   resolved to canonical form

corp-names normalize "Sir William H. Gates III" --json
```

## Normalization Pipeline

1. Normalize Unicode (transliterate accented characters to ASCII with unidecode)
2. Strip punctuation (remove periods, commas; preserve hyphens)
3. Tokenize by whitespace
4. Strip prefixes (titles, honorifics)
5. Strip suffixes (generational, credentials)
6. Remove middle initials (single-letter tokens)
7. Resolve nicknames to canonical forms
8. Build normalized lowercase output
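
The steps above can be sketched in a few lines of plain Python. This is an illustrative approximation, not the package's actual implementation. The tables `PREFIXES`, `SUFFIXES`, and `NICKNAMES`, the `Result` dataclass, and the functions `to_ascii` and `sketch_normalize` are all hypothetical names made up for this example. It also swaps in the standard-library `unicodedata` module for `unidecode` so that it runs without dependencies.

```python
import unicodedata
from dataclasses import dataclass

# Hypothetical, tiny lookup tables for illustration only; the real package
# ships far larger tables in nicknames.json.
PREFIXES = {"dr", "prof", "sir", "mr", "mrs", "ms"}
SUFFIXES = {"jr", "sr", "ii", "iii", "phd"}
NICKNAMES = {"bob": "robert", "liz": "elizabeth", "klaus": "nicholas"}


@dataclass
class Result:
    normalized: str
    first: str
    last: str
    prefix: str
    suffix: str
    nickname_resolved: bool


def to_ascii(text: str) -> str:
    """Step 1 stand-in: decompose and drop combining marks (not unidecode)."""
    decomposed = unicodedata.normalize("NFKD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


def sketch_normalize(name: str) -> Result:
    text = to_ascii(name).lower()
    # Step 2: remove periods and commas, keep hyphens.
    text = text.replace(".", "").replace(",", "")
    # Step 3: tokenize by whitespace.
    tokens = text.split()
    # Step 4: strip leading prefixes (titles, honorifics).
    prefixes = []
    while tokens and tokens[0] in PREFIXES:
        prefixes.append(tokens.pop(0))
    # Step 5: strip trailing suffixes (generational, credentials).
    suffixes = []
    while tokens and tokens[-1] in SUFFIXES:
        suffixes.insert(0, tokens.pop())
    # Step 6: drop single-letter tokens between the first and last token.
    if len(tokens) > 2:
        middle = [t for t in tokens[1:-1] if len(t) > 1]
        tokens = [tokens[0], *middle, tokens[-1]]
    # Step 7: resolve a nickname in the first position.
    resolved = False
    if tokens and tokens[0] in NICKNAMES:
        tokens[0] = NICKNAMES[tokens[0]]
        resolved = True
    # Step 8: build the lowercase output.
    first = tokens[0] if tokens else ""
    last = tokens[-1] if len(tokens) > 1 else ""
    return Result(" ".join(tokens), first, last,
                  " ".join(prefixes), " ".join(suffixes), resolved)


print(sketch_normalize("Dr. Bob Smith Jr.").normalized)         # robert smith
print(sketch_normalize("Sir William H. Gates III").normalized)  # william gates
print(sketch_normalize("Prof. Klaus Müller").normalized)        # nicholas muller
```

The sketch treats the final token as the last name, which is a simplification; how the package itself splits multi-word family names is not covered here.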

## NormalizedName Model

```python
class NormalizedName(BaseModel):
    original: str            # Original input
    normalized: str          # Full normalized lowercase name
    first: str               # First/given name (after nickname resolution)
    last: str                # Last/family name
    prefix: str              # Detected prefix(es), e.g. "dr"
    suffix: str              # Detected suffix(es), e.g. "jr"
    nickname_resolved: bool  # Whether the first name was resolved from a nickname
```

## Data

All name data is stored in `nicknames.json` with categorized entries:

- **Nickname mappings** (~1,350): resolve informal/international names to canonical forms
  - `common`: English nicknames (bob→robert, liz→elizabeth)
  - `international`: cross-language variants (guillaume→william, giuseppe→joseph)
  - `archaic`: historical nicknames (elsa→elizabeth, fanny→frances)
- **Standalone canonical names** (~13,800): known first names with no mapping needed
  - `canonical`: Western names (mark, laura, emma, scott)
  - `cjk`: Chinese names (li, wang, zhang)
  - `japanese`: Japanese names (hiroshi, takashi, akira)
  - `korean`: Korean names (kim, park, choi)
  - `arabic`: Arabic/Islamic names (ali, omar, mustafa)
  - `french_compound`: French compound names (jean-pierre, marie-claire)
- **Prefixes**: ~150 titles and honorifics (military, religious, nobility, academic, political)
- **Suffixes**: ~600 credentials, generational markers, and honors (PhD, Jr, III, OBE, CPA)

## License

MIT
