Metadata-Version: 2.4
Name: llm-ner
Version: 0.5.1
Summary: Schema-driven Named Entity Recognition powered by local LLMs via Ollama
Project-URL: Homepage, https://github.com/ManuelMunozBer/llm-ner
Project-URL: Documentation, https://github.com/ManuelMunozBer/llm-ner#readme
Project-URL: Bug Tracker, https://github.com/ManuelMunozBer/llm-ner/issues
License: MIT
License-File: LICENSE
Keywords: information-extraction,llm,named-entity-recognition,ner,nlp,ollama,pydantic
Classifier: Development Status :: 3 - Alpha
Classifier: Intended Audience :: Developers
Classifier: Intended Audience :: Science/Research
Classifier: License :: OSI Approved :: MIT License
Classifier: Programming Language :: Python :: 3
Classifier: Programming Language :: Python :: 3.10
Classifier: Programming Language :: Python :: 3.11
Classifier: Programming Language :: Python :: 3.12
Classifier: Topic :: Scientific/Engineering :: Artificial Intelligence
Classifier: Topic :: Text Processing :: Linguistic
Classifier: Typing :: Typed
Requires-Python: >=3.10
Requires-Dist: pydantic>=2.0
Requires-Dist: requests>=2.28
Provides-Extra: dev
Requires-Dist: mypy>=1.8; extra == 'dev'
Requires-Dist: numpy>=1.24; extra == 'dev'
Requires-Dist: pandas>=2.0; extra == 'dev'
Requires-Dist: pytest>=7.0; extra == 'dev'
Requires-Dist: ruff>=0.4; extra == 'dev'
Requires-Dist: scipy>=1.10; extra == 'dev'
Requires-Dist: tqdm>=4.65; extra == 'dev'
Requires-Dist: types-requests; extra == 'dev'
Provides-Extra: test
Requires-Dist: numpy>=1.24; extra == 'test'
Requires-Dist: pandas>=2.0; extra == 'test'
Requires-Dist: pytest>=7.0; extra == 'test'
Requires-Dist: scipy>=1.10; extra == 'test'
Requires-Dist: tqdm>=4.65; extra == 'test'
Description-Content-Type: text/markdown

# llm-ner

**Schema-driven Named Entity Recognition powered by local LLMs via Ollama.**

`llm-ner` lets you define arbitrary extraction schemas as plain Pydantic models
and extract structured entities from free text – without training a custom model.
Every extracted value is paired with a short verbatim *evidence* quote from the
source, making results auditable and explainable.

---

## Features

- **Schema-first** – define what to extract with pure Python + Pydantic; the
  library builds the LLM prompt automatically.
- **Evidence tracking** – every field carries an `evidence` quote that must
  appear verbatim in the source text.
- **Smart retries** – automatically re-runs extraction and merges results when
  fields are missing.
- **Tolerant parsing** – invalid enum values, malformed numbers, bad dates, etc.
  become `None` instead of crashing.
- **Fully typed** – ships with a `py.typed` marker and complete type
  annotations.
- **No cloud required** – runs entirely on a local [Ollama](https://ollama.com)
  instance.

---

## Installation

### With `uv` (recommended)

```bash
# Install uv if you don't have it
pip install uv

# Clone the repository
git clone https://github.com/ManuelMunozBer/llm-ner.git
cd llm-ner

# Create a virtual environment and install the package
uv venv
uv pip install -e .

# With test dependencies
uv pip install -e ".[test]"

# With all development dependencies
uv pip install -e ".[dev]"
```

### With pip

```bash
pip install llm-ner
```

### Prerequisites

A running [Ollama](https://ollama.com) instance with your chosen model:

```bash
ollama serve                      # start the server and leave it running
# In a second terminal:
ollama pull qwen2.5:7b-instruct   # or any instruction-following model
```

---

## Quick Start

```python
from llmner import NERBaseModel, NERExtractor, SchemaRegistry

# 1. Create a registry – one per schema
registry = SchemaRegistry()

# 2. Define typed field annotations
GenderType = registry.categorical(
    "gender",
    options=["male", "female"],
    instruction="Extract the subject's gender.",
)
AgeType = registry.int_range(
    "age",
    "Extract the subject's age as an integer or range (e.g. '25-30').",
)
NameType = registry.generic(
    "name",
    "Extract the subject's full name.",
)

# 3. Define your Pydantic extraction schema
class PersonSchema(NERBaseModel):
    name:   NameType   | None = None  # type: ignore[valid-type]
    gender: GenderType | None = None  # type: ignore[valid-type]
    age:    AgeType    | None = None  # type: ignore[valid-type]

# 4. Create the extractor
extractor = NERExtractor(
    schema_class=PersonSchema,
    system_role="You are an expert information extractor.",
    system_task=(
        "Extract the requested fields from the text. "
        "Return null for any field not mentioned."
    ),
    rules_registry=registry.rules,
)

# 5. Extract
result = extractor.extract_one(
    "Detective John Smith, 42, was assigned to the case."
)

print(result.name.value)     # "John Smith"
print(result.name.evidence)  # "John Smith"
print(result.age.value)      # "42"
print(result.gender.value)   # "male"
```

---

## Concepts

### `SchemaRegistry`

A `SchemaRegistry` instance is used to create self-documenting Pydantic field
types.  Each factory call registers a rule that will be injected into the LLM
prompt.

```python
registry = SchemaRegistry()

# Categorical field – only values from the allowed list are accepted
StatusType = registry.categorical(
    "status",
    options={"active": "currently employed", "inactive": "no longer employed"},
    instruction="Extract the person's employment status.",
)

# Integer / range field
SalaryType = registry.int_range(
    "salary",
    "Extract the annual salary in thousands of euros.",
)

# Free-text field
AddressType = registry.generic(
    "address",
    "Extract the full postal address.",
)

# Datetime field – normalised to YYYY-MM-DD HH:MM:SS
DateType = registry.datetime_format(
    "date",
    "Extract the contract signing date.",
)
```
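
To make the categorical behaviour concrete, here is a minimal standalone sketch of option matching. It is not the library's implementation: `match_option` and its `replacements` argument are hypothetical names, and the lowercase-then-match steps are assumed from the pipeline description later in this README.

```python
from __future__ import annotations

from collections.abc import Iterable, Mapping


def match_option(
    raw: str | None,
    options: Iterable[str],
    replacements: Mapping[str, str] | None = None,
) -> str | None:
    """Return the allowed option matching ``raw``, or None if there is none."""
    if raw is None:
        return None
    text = raw.strip().lower()
    # Optional synonym table, e.g. {"employed": "active"} (hypothetical).
    text = (replacements or {}).get(text, text)
    allowed = set(options)  # iterating a dict yields its keys
    return text if text in allowed else None


options = {"active": "currently employed", "inactive": "no longer employed"}
print(match_option(" Active ", options))                          # active
print(match_option("retired", options))                           # None
print(match_option("employed", options, {"employed": "active"}))  # active
```

Returning `None` for an unknown option, rather than raising, mirrors the "tolerant parsing" behaviour listed under Features.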

### `EvidenceField`

Every factory produces `Annotated[EvidenceField, ...]` types.  An
`EvidenceField` has two attributes:

| Attribute  | Type           | Description                                                  |
|------------|----------------|--------------------------------------------------------------|
| `value`    | `str \| None`  | The normalised extracted value.                              |
| `evidence` | `str \| None`  | Verbatim quote from the source text that justifies `value`.  |

```python
field: EvidenceField = result.name
print(field.value)     # "John Smith"
print(field.evidence)  # "John Smith"
```

Evidence is validated against the source text: the comparison is
case-insensitive and whitespace-normalised, and a quote that is not found is
set to `None`.
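
As an illustration of this check, the sketch below treats validation as a containment test on normalised text. It is a standalone example rather than the library's code: `normalise` and `validate_evidence` are hypothetical names, and the case-insensitive, whitespace-normalised comparison is taken from step 4c of the pipeline diagram.

```python
from __future__ import annotations

import re


def normalise(text: str) -> str:
    """Lowercase and collapse runs of whitespace to single spaces."""
    return re.sub(r"\s+", " ", text).strip().lower()


def validate_evidence(evidence: str | None, source: str) -> str | None:
    """Return the quote if it occurs in the source, otherwise None."""
    if not evidence or not normalise(evidence):
        return None
    return evidence if normalise(evidence) in normalise(source) else None


source = "Detective John  Smith, 42,\nwas assigned to the case."
print(validate_evidence("john smith, 42", source))  # kept: found after normalisation
print(validate_evidence("Jane Smith", source))      # None: not in the source
```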

**Automatic evidence resolution** – after validation, evidence is resolved
using a priority chain:

1. **Full value match**: if the extracted value appears verbatim in the source
   text (≥ 3 characters), it becomes the evidence – even if the model provided
   a different quote.  The canonical value is the most precise anchor.
   - **1a. Full value match (descored)**: if the value contains underscores
     (e.g. `"physical_assault"`) and the underscore-to-space form
     (`"physical assault"`) appears in the text, that form is returned as
     evidence.
   - **1b. Full raw value match**: if the normalised value is not found but the
     pre-transformation form (e.g. `"1.75"` before metre→cm conversion)
     appears in the text, the raw value is returned as evidence.
   - **1c. Full raw value match (descored)**: same as 1a but applied to the raw
     (pre-transformation) value.
2. **Model evidence**: if neither the value nor the raw value is found but the
   model provided a valid evidence quote, it is kept unchanged.
3. **Partial prefix fallback**: when no model evidence exists, the longest
   token-prefix of the value that appears in the text is used (minimum 3
   characters).  Both the original and descored (underscore→space) forms are
   tried.  Single-token prefixes count too, so `"2024-01-01"` from a datetime
   `"2024-01-01 08:13:00"` can serve as evidence.
   - **3b. Partial raw value prefix**: same as rule 3 but applied to the raw
     (pre-transformation) value (and its descored form).
4. **No evidence**: `None` – no usable evidence could be determined.
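
The priority chain can be sketched in a few lines of plain Python. This is an illustration of the rules listed above, not the library's internal function: the names `found`, `descore`, `longest_prefix` and `resolve_evidence` are hypothetical, tokens are assumed to be whitespace-separated, matching is simplified to a plain substring test, and the model evidence is assumed to have been validated already.

```python
from __future__ import annotations

MIN_LEN = 3  # minimum anchor length from the rules above


def found(candidate: str | None, source: str) -> bool:
    """Plain substring test; the exact matching rules are an assumption."""
    return candidate is not None and len(candidate) >= MIN_LEN and candidate in source


def descore(text: str) -> str:
    return text.replace("_", " ")


def longest_prefix(text: str, source: str) -> str | None:
    """Longest whitespace-token prefix of ``text`` that occurs in ``source``."""
    tokens = text.split()
    for end in range(len(tokens), 0, -1):
        prefix = " ".join(tokens[:end])
        if found(prefix, source):
            return prefix
    return None


def resolve_evidence(value, model_evidence, source, raw_value=None):
    if value is None:
        return None
    bases = [value] + ([raw_value] if raw_value else [])
    # Rules 1, 1a, 1b, 1c: full match of value, descored value, raw, descored raw.
    for base in bases:
        for form in (base, descore(base)):
            if found(form, source):
                return form
    # Rule 2: keep the (already validated) model quote.
    if model_evidence:
        return model_evidence
    # Rules 3, 3b: longest token-prefix of the value, then of the raw value.
    for base in bases:
        for form in (base, descore(base)):
            prefix = longest_prefix(form, source)
            if prefix:
                return prefix
    return None  # Rule 4


source = "He is 1.75 m tall and reported a physical assault on 2024-01-01."
print(resolve_evidence("physical_assault", None, source))              # physical assault
print(resolve_evidence("175", "1.75 m tall", source, raw_value="1.75"))  # 1.75
print(resolve_evidence("2024-01-01 08:13:00", None, source))           # 2024-01-01
```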

### `NERBaseModel`

Your extraction schemas must subclass `NERBaseModel`.  It adds four
reflection-based utilities:

| Method             | Description                                                    |
|--------------------|----------------------------------------------------------------|
| `prompt_schema()`  | Generate the JSON skeleton injected into the LLM prompt.       |
| `has_missing_fields()` | Return `True` if any nested `EvidenceField.value` is `None`. |
| `merge(e1, e2, *, input_text="")` | Fill `None` values in `e1` with values from `e2`; resolve conflicts using the source text when provided. |
| `safe_parse(data)` | Tolerantly parse LLM output, isolating per-field errors.       |
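
To illustrate the semantics of `has_missing_fields()` and `merge()`, the following standalone sketch works on plain dictionaries instead of models. It assumes each leaf field is a `{"value", "evidence"}` pair and models only the "fill `None` from the second result" behaviour; conflict resolution against the source text is not shown. `is_field`, `has_missing` and `fill_missing` are hypothetical helper names.

```python
from __future__ import annotations

from typing import Any


def is_field(node: Any) -> bool:
    """A leaf field is modelled here as {"value": ..., "evidence": ...}."""
    return isinstance(node, dict) and set(node) == {"value", "evidence"}


def has_missing(node: Any) -> bool:
    """True if any nested field has value None."""
    if is_field(node):
        return node["value"] is None
    if isinstance(node, dict):
        return any(has_missing(child) for child in node.values())
    if isinstance(node, list):
        return any(has_missing(child) for child in node)
    return False


def fill_missing(first: Any, second: Any) -> Any:
    """Keep values from ``first``; take ``second`` only where ``first`` is empty."""
    if first is None:
        return second
    if second is None:
        return first
    if is_field(first) and is_field(second):
        return first if first["value"] is not None else second
    if isinstance(first, dict) and isinstance(second, dict):
        return {key: fill_missing(first.get(key), second.get(key)) for key in {**first, **second}}
    if isinstance(first, list) and isinstance(second, list):
        # Lists are merged by index position.
        longest = max(len(first), len(second))
        padded_a = first + [None] * (longest - len(first))
        padded_b = second + [None] * (longest - len(second))
        return [fill_missing(a, b) for a, b in zip(padded_a, padded_b)]
    return first


first = {"name": {"value": "John Smith", "evidence": "John Smith"},
         "age": {"value": None, "evidence": None}}
second = {"name": {"value": "J. Smith", "evidence": None},
          "age": {"value": "42", "evidence": "42"}}
merged = fill_missing(first, second)
print(has_missing(first), has_missing(merged))  # True False
```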

### `NERExtractor`

The main orchestrator.  Key parameters:

| Parameter         | Default                    | Description                                   |
|-------------------|----------------------------|-----------------------------------------------|
| `schema_class`    | –                          | Your `NERBaseModel` subclass.                 |
| `system_role`     | –                          | LLM persona / expertise description.          |
| `system_task`     | –                          | Extraction task and constraints.              |
| `rules_registry`  | –                          | `registry.rules` from your `SchemaRegistry`.  |
| `llm_model`       | `"qwen2.5:7b-instruct"`    | Ollama model tag.                             |
| `llm_base_url`    | `"http://localhost:11434"` | Ollama server URL.                            |
| `llm_temperature` | `1.0`                      | Sampling temperature (`0.0` = deterministic). |
| `max_retries`     | `1`                        | Extra calls on incomplete extraction.         |

---

## Nested Schemas

```python
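# Reuses `registry` and `NameType` from the Quick Start.
class Address(NERBaseModel):
    street: registry.generic("street", "Street name and number.") | None = None  # type: ignore[valid-type]
    city:   registry.generic("city",   "City name.")              | None = None  # type: ignore[valid-type]

# Sub-model used as the item type of the list field below
class SuspectSchema(NERBaseModel):
    name: NameType | None = None  # type: ignore[valid-type]

class PersonSchema(NERBaseModel):
    name:     NameType | None = None  # type: ignore[valid-type]
    address:  Address  | None = None
    suspects: list[SuspectSchema] = []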

`prompt_schema()` and `safe_parse()` handle arbitrary nesting and lists of
sub-models automatically.

---

## Advanced Usage

### Custom LLM client

Implement `BaseLLMClient` to use a different inference backend:

```python
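from llmner import NERExtractor
from llmner.llm_client import BaseLLMClient

class MyClient(BaseLLMClient):
    def generate(self, prompt: str) -> dict | None:
        # Call your backend here and return the parsed JSON object as a dict,
        # or None when the call or the parsing fails.
        ...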

extractor = NERExtractor(
    ...,
    llm_client=MyClient(),
)
```
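
A custom `generate` usually has to turn the backend's text reply into a dictionary. The helper below is a hedged sketch of that step using only the standard library; `parse_json_reply` is a hypothetical name, and returning `None` on failure follows the "Returns None on network/parse errors" behaviour described in the pipeline diagram.

```python
from __future__ import annotations

import json


def parse_json_reply(raw: str | None) -> dict | None:
    """Parse a backend reply into a dict; return None if it is unusable."""
    if not raw:
        return None
    try:
        data = json.loads(raw)
    except json.JSONDecodeError:
        return None
    # Only a JSON object can be mapped onto a schema.
    return data if isinstance(data, dict) else None


print(parse_json_reply('{"name": {"value": "John Smith", "evidence": "John Smith"}}'))
print(parse_json_reply("Sorry, I cannot help with that."))  # None
```

Inside a client, `generate` could call the backend, pass the returned text through such a helper, and return the result unchanged.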

### Custom prompt template

```python
from llmner import DEFAULT_PROMPT_TEMPLATE  # built-in template, imported for reference

MY_TEMPLATE = """\
[INST] {system_role}

{system_task}

Rules:
{rules_text}

Schema:
{schema_json}

Text: {input_text} [/INST]
"""

extractor = NERExtractor(
    ...,
    prompt_template=MY_TEMPLATE,
)
```
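
To preview what a template produces, the placeholders can be filled by hand. The sketch below assumes the five placeholders are substituted by name in the style of `str.format`; that mechanism, the rules text and the schema skeleton shown here are illustrative assumptions, not output captured from the library.

```python
import json

TEMPLATE = """\
[INST] {system_role}

{system_task}

Rules:
{rules_text}

Schema:
{schema_json}

Text: {input_text} [/INST]
"""

# Hypothetical skeleton: one field with a value/evidence pair.
schema = {"name": {"value": None, "evidence": None}}

prompt = TEMPLATE.format(
    system_role="You are an expert information extractor.",
    system_task="Extract the requested fields from the text.",
    rules_text="- name: Extract the subject's full name.",
    schema_json=json.dumps(schema, indent=2),
    input_text="Detective John Smith, 42, was assigned to the case.",
)
print(prompt)
```

If substitution does work this way, any literal braces in a custom template would need to be doubled (`{{` and `}}`), while braces inside the substituted values are left untouched.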

### Fallback parsers

When the LLM returns a `null` value but provides a non-null evidence quote, a
**fallback parser** can attempt to recover the value from the evidence string.
Every factory method accepts an optional `fallback_parser` callback:

```python
import re

# Recover age from evidence like "aged 34"
AgeType = registry.int_range(
    "age",
    "Extract the subject's age.",
    fallback_parser=lambda ev: m.group() if (m := re.search(r"\d+", ev)) else None,
)

# Recover gender from contextual clues in evidence
GenderType = registry.categorical(
    "gender",
    options=["male", "female"],
    instruction="Extract the subject's gender.",
    # \b word boundaries keep "woman" from matching "man"
    fallback_parser=lambda ev: "male" if re.search(r"\bman\b", ev.lower()) else None,
)
```

The callback signature is `(evidence: str) -> str | None`.  When it returns a
non-`None` value, that value is fed through the factory's normal validation
pipeline (option matching, range parsing, date normalisation, etc.).
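
Because a fallback parser is just a function from `str` to `str | None`, it can be written and tested without a model. The two parsers below are illustrative sketches: the function names, the word lists and the `MIN-MAX` range format (borrowed from the `'25-30'` hint in the Quick Start) are assumptions.

```python
from __future__ import annotations

import re


def age_from_evidence(evidence: str) -> str | None:
    """Return 'MIN-MAX' for a range, a single number, or None."""
    span = re.search(r"\b(\d{1,3})\s*(?:-|–|to)\s*(\d{1,3})\b", evidence)
    if span:
        return f"{span.group(1)}-{span.group(2)}"
    single = re.search(r"\b\d{1,3}\b", evidence)  # ignores 4-digit years
    return single.group() if single else None


def gender_from_evidence(evidence: str) -> str | None:
    """Match whole words only, so 'woman' never counts as 'man'."""
    words = set(re.findall(r"[a-z]+", evidence.lower()))
    if words & {"man", "he", "him", "his", "mr"}:
        return "male"
    if words & {"woman", "she", "her", "hers", "mrs", "ms"}:
        return "female"
    return None


print(age_from_evidence("aged 34"))                     # 34
print(age_from_evidence("aged 25 to 30"))               # 25-30
print(gender_from_evidence("a woman in her thirties"))  # female
```

Either function can be passed as `fallback_parser=` in place of the lambdas above.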

### Extra datetime formats

`datetime_format` accepts an `extra_formats` tuple of additional
`strptime` format strings appended after the built-in ones:

```python
DateType = registry.datetime_format(
    "date",
    "Extract the event date.",
    extra_formats=("%B %d, %Y", "%d %b %Y"),  # "March 15, 2024", "15 Mar 2024"
)
```
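
The effect of `extra_formats` can be pictured as trying a list of `strptime` patterns in order and normalising the first match. The sketch below is a standalone illustration: `BUILTIN_FORMATS` is a hypothetical stand-in (the library's actual built-in list is not documented here) and `normalise_datetime` is a hypothetical name.

```python
from __future__ import annotations

from datetime import datetime

# Hypothetical stand-ins for the built-in formats.
BUILTIN_FORMATS = ("%Y-%m-%d %H:%M:%S", "%Y-%m-%d", "%d/%m/%Y")


def normalise_datetime(text: str, extra_formats: tuple[str, ...] = ()) -> str | None:
    """Try each format in order; return 'YYYY-MM-DD HH:MM:SS' or None."""
    for fmt in BUILTIN_FORMATS + tuple(extra_formats):
        try:
            parsed = datetime.strptime(text.strip(), fmt)
        except ValueError:
            continue  # this format does not fit, try the next one
        return parsed.strftime("%Y-%m-%d %H:%M:%S")
    return None


extra = ("%B %d, %Y", "%d %b %Y")
print(normalise_datetime("March 15, 2024", extra))  # 2024-03-15 00:00:00
print(normalise_datetime("15 Mar 2024", extra))     # 2024-03-15 00:00:00
print(normalise_datetime("March 15, 2024"))         # None
```

Note that `%B` and `%b` match month names in the current locale, so English month names are only recognised under an English (or the default C) locale.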

### Per-field evidence requirement

By default, extracted values are kept even when no supporting evidence can be
found in the source text.  To enforce grounding on a per-field basis, pass
`evidence_required=True` to any factory method:

```python
# This field will be set to None if no evidence is found in the source text
NameType = registry.generic(
    "name",
    "Full name of the person.",
    evidence_required=True,
)

# This field keeps its value even without evidence (default behaviour)
AgeType = registry.int_range("age", "Age in years.")
```
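
The rule itself is small. The sketch below restates step 4g of the pipeline diagram in standalone form; the `Field` dataclass and `apply_evidence_required` function are hypothetical stand-ins for the library's `EvidenceField` and its internal check.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass
class Field:
    """Stand-in for an evidence-carrying field."""
    value: str | None = None
    evidence: str | None = None


def apply_evidence_required(field: Field, evidence_required: bool, input_text: str) -> Field:
    """Discard a value that has no evidence, but only when grounding is required."""
    ungrounded = field.value is not None and field.evidence is None
    if evidence_required and input_text and ungrounded:
        return Field()  # value and evidence both None
    return field


text = "Detective John Smith was assigned to the case."
guess = Field(value="42", evidence=None)           # value with no supporting quote
print(apply_evidence_required(guess, True, text))   # Field(value=None, evidence=None)
print(apply_evidence_required(guess, False, text))  # Field(value='42', evidence=None)
```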

### Controlling retries

```python
# Disable retries
result = extractor.extract_one(text, retry_on_null=False)

# Configure at extractor level
extractor = NERExtractor(..., max_retries=3)
```
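
How `max_retries` and `retry_on_null` interact can be sketched with a fake LLM call. The loop below is an illustration under stated assumptions, not the extractor's code: results are flat dictionaries, `extract_with_retries` and `missing` are hypothetical names, and later attempts only fill fields that are still `None`.

```python
from __future__ import annotations

from typing import Callable, Optional


def missing(result: dict, fields: tuple[str, ...]) -> list[str]:
    return [name for name in fields if result.get(name) is None]


def extract_with_retries(
    call_llm: Callable[[], Optional[dict]],
    fields: tuple[str, ...],
    max_retries: int = 1,
    retry_on_null: bool = True,
) -> dict:
    """Call once, then retry while fields are missing and retries remain."""
    result = dict(call_llm() or {})
    retries = 0
    while retry_on_null and retries < max_retries and missing(result, fields):
        retries += 1
        again = call_llm() or {}
        for name in missing(result, fields):
            result[name] = again.get(name)  # earlier values keep priority
    return result


# Two canned replies stand in for two LLM calls.
replies = iter([
    {"name": "John Smith", "age": None},
    {"name": "J. Smith", "age": "42"},
])
print(extract_with_retries(lambda: next(replies), ("name", "age")))
# {'name': 'John Smith', 'age': '42'}
```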

---

## Extraction Pipeline Flow

Below is the complete data flow from input text to validated output.

```
 INPUT TEXT
    │
    ▼
┌─────────────────────────────────────────────────────────┐
│  1. PROMPT BUILDING  (PromptBuilder)                    │
│  ─────────────────────────────────────                  │
│  • system_role + system_task                            │
│  • Per-field extraction rules (from SchemaRegistry)     │
│  • JSON schema skeleton (from NERBaseModel.prompt_schema)│
│  • The input text itself                                │
│  ⇒ Assembled into a single prompt string                │
└────────────────────────┬────────────────────────────────┘
                         │
                         ▼
┌─────────────────────────────────────────────────────────┐
│  2. LLM CALL  (BaseLLMClient / OllamaClient)           │
│  ─────────────────────────────────────                  │
│  • POST prompt to Ollama /api/generate (JSON mode)      │
│  • Parse the raw JSON response into a Python dict       │
│  • Returns None on network/parse errors                 │
└────────────────────────┬────────────────────────────────┘
                         │  raw dict
                         ▼
┌─────────────────────────────────────────────────────────┐
│  3. SAFE PARSE  (NERBaseModel.safe_parse)               │
│  ─────────────────────────────────────                  │
│  Three-phase tolerant validation:                       │
│                                                         │
│  Phase 1 — List items: validate each item in list-of-   │
│  model fields individually. Bad items get per-field     │
│  fallback (good fields kept, bad → None).               │
│                                                         │
│  Phase 2 — Full model: attempt model_validate() with    │
│  context={input_text}. This triggers all BeforeValidator │
│  pipelines (step 4 below). If it succeeds → done.       │
│                                                         │
│  Phase 3 — Field-by-field fallback: validate each       │
│  field in isolation. Fields that fail → None.            │
│  Nested models are validated field-by-field too.        │
└────────────────────────┬────────────────────────────────┘
                         │  for each field (during Phase 2/3)
                         ▼
┌─────────────────────────────────────────────────────────┐
│  4. FIELD VALIDATION  (BeforeValidator in each factory) │
│  ─────────────────────────────────────                  │
│  For every EvidenceField-type field, the validator runs │
│  this pipeline:                                         │
│                                                         │
│  a) _extract_ev(v) — unpack {value, evidence} from      │
│     the raw dict or EvidenceField object                │
│                                                         │
│  b) _coerce_null(raw) — convert "null"/"none"/"" → None │
│                                                         │
│  c) _validate_evidence(evidence, info) — check that the │
│     model's evidence quote exists in the source text    │
│     (case-insensitive, whitespace-normalised).          │
│     Invalid quotes → None.                              │
│                                                         │
│  d) _try_fallback(raw, evidence, fallback_parser) —     │
│     if raw is None but evidence exists, attempt to      │
│     recover a value from the evidence string            │
│                                                         │
│  e) TYPE-SPECIFIC NORMALISATION:                        │
│     • categorical: lowercase, apply replacements,       │
│       match against allowed options list                │
│     • int_range: strip units, convert m→cm, parse       │
│       integers or MIN-MAX ranges                        │
│     • generic: coerce to python_type                    │
│     • datetime_format: parse with strptime, normalise   │
│       to "YYYY-MM-DD HH:MM:SS"                         │
│                                                         │
│  f) _resolve_evidence(value, evidence, info,            │
│     raw_value=raw_str):                                 │
│     Rule 1:  value in text        → value as evidence   │
│     Rule 1a: descored value in text → descored as ev.   │
│     Rule 1b: raw_value in text    → raw_value as ev.    │
│     Rule 1c: descored raw in text → descored as ev.     │
│     Rule 2:  model evidence valid → keep it             │
│     Rule 3:  partial token-prefix of value in text      │
│              (also tries descored form)                  │
│     Rule 3b: partial token-prefix of raw_value          │
│              (also tries descored form)                  │
│     Rule 4:  → None                                     │
│                                                         │
│  g) _apply_evidence_required(ef, evidence_required,     │
│     info):                                              │
│     if evidence_required=True (per-field)               │
│     AND input_text available                            │
│     AND value ≠ None AND evidence = None                │
│     → discard value (set to None)                       │
│                                                         │
│  ⇒ Returns EvidenceField(value=..., evidence=...)       │
└────────────────────────┬────────────────────────────────┘
                         │
                         ▼
┌─────────────────────────────────────────────────────────┐
│  5. RETRY & MERGE  (NERExtractor.extract_one)           │
│  ─────────────────────────────────────                  │
│  • If has_missing_fields() → True and retry_on_null:    │
│    - Call LLM again (up to max_retries times)           │
│    - Merge results: first.value takes priority;         │
│      None values filled from second extraction          │
│    - Lists merged by index position                     │
└────────────────────────┬────────────────────────────────┘
                         │
                         ▼
              VALIDATED OUTPUT
         (NERBaseModel instance)
         Values are paired with evidence
         from the source text when found
```
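
The field-by-field fallback of step 3 (Phase 3) is the part that keeps one bad field from discarding the whole result. A minimal standalone sketch of that idea follows; the validators and the `tolerant_parse` function are hypothetical and stand in for the Pydantic-based validation the library performs.

```python
from __future__ import annotations

from typing import Any, Callable


def parse_age(raw: Any) -> int:
    return int(raw)  # raises ValueError or TypeError on bad input


def parse_gender(raw: Any) -> str:
    text = str(raw).lower()
    if text not in {"male", "female"}:
        raise ValueError(f"unknown option: {raw!r}")
    return text


def tolerant_parse(
    data: dict[str, Any],
    validators: dict[str, Callable[[Any], Any]],
) -> dict[str, Any]:
    """Validate each field in isolation; a failing field becomes None."""
    parsed: dict[str, Any] = {}
    for name, validate in validators.items():
        raw = data.get(name)
        try:
            parsed[name] = validate(raw) if raw is not None else None
        except (ValueError, TypeError):
            parsed[name] = None
    return parsed


raw = {"age": "forty-two", "gender": "Male", "unexpected": 1}
print(tolerant_parse(raw, {"age": parse_age, "gender": parse_gender}))
# {'age': None, 'gender': 'male'}
```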

### Key Guarantee

When `evidence_required=True` is set on a factory (opt-in, per-field),
**every non-null `value` for that field has a non-null `evidence`** found in
the source text.  Values that the LLM extracted but that cannot be grounded in
the text are discarded (`value → None`), even if they happen to be correct.
This removes ungrounded values at the cost of potentially lower recall; it
does not by itself guarantee that a grounded value is correct.  The default is
`evidence_required=False`, so values are kept even when no supporting evidence
can be found.

---

## Running the Examples

```bash
# Make sure Ollama is running and the model is available
ollama pull qwen2.5:7b-instruct

# Run the crime extraction example
python examples/crime_extraction/run.py
```

---

## Running the Tests

Integration tests require a live Ollama instance and are selected with the
`integration` pytest marker:

```bash
# Run only unit tests (no Ollama needed)
pytest tests/ -m "not integration"

# Run only the integration tests (requires a live Ollama instance)
pytest tests/ -m integration -v
```

---

## Project Structure

```
llm-ner/
├── src/
│   └── llmner/
│       ├── __init__.py        # Public API
│       ├── base_model.py      # NERBaseModel
│       ├── factories.py       # SchemaRegistry + EvidenceField
│       ├── extractor.py       # NERExtractor
│       ├── llm_client.py      # BaseLLMClient + OllamaClient
│       └── prompt.py          # PromptBuilder
├── tests/
│   ├── conftest.py            # Pytest configuration
│   ├── schema/
│   │   └── crime_schema.py    # Crime-specific schema (integration test)
│   ├── data/
│   │   ├── complaints.csv
│   │   ├── crimes_perceived_detailed.csv
│   │   └── perceived_suspects.csv
│   └── test_ner_accuracy.py   # End-to-end accuracy test
├── examples/
│   └── crime_extraction/
│       ├── schema.py          # English crime schema example
│       └── run.py             # Runnable example script
├── pyproject.toml
├── LICENSE
└── README.md
```

---

## Contributing

1. Fork the repository and create a feature branch.
2. Install development dependencies: `uv pip install -e ".[dev]"`.
3. Run linting: `ruff check src/`.
4. Run type checking: `mypy src/llmner`.
5. Open a pull request with a clear description of your changes.

---

## License

MIT – see [LICENSE](LICENSE).
