Metadata-Version: 2.4
Name: llm-ner
Version: 0.1.0
Summary: Schema-driven Named Entity Recognition powered by local LLMs via Ollama
Project-URL: Homepage, https://github.com/ManuelMunozBer/llm-ner
Project-URL: Documentation, https://github.com/ManuelMunozBer/llm-ner#readme
Project-URL: Bug Tracker, https://github.com/ManuelMunozBer/llm-ner/issues
License: MIT
License-File: LICENSE
Keywords: information-extraction,llm,named-entity-recognition,ner,nlp,ollama,pydantic
Classifier: Development Status :: 3 - Alpha
Classifier: Intended Audience :: Developers
Classifier: Intended Audience :: Science/Research
Classifier: License :: OSI Approved :: MIT License
Classifier: Programming Language :: Python :: 3
Classifier: Programming Language :: Python :: 3.10
Classifier: Programming Language :: Python :: 3.11
Classifier: Programming Language :: Python :: 3.12
Classifier: Topic :: Scientific/Engineering :: Artificial Intelligence
Classifier: Topic :: Text Processing :: Linguistic
Classifier: Typing :: Typed
Requires-Python: >=3.10
Requires-Dist: pydantic>=2.0
Requires-Dist: requests>=2.28
Provides-Extra: dev
Requires-Dist: mypy>=1.8; extra == 'dev'
Requires-Dist: numpy>=1.24; extra == 'dev'
Requires-Dist: pandas>=2.0; extra == 'dev'
Requires-Dist: pytest>=7.0; extra == 'dev'
Requires-Dist: ruff>=0.4; extra == 'dev'
Requires-Dist: scipy>=1.10; extra == 'dev'
Requires-Dist: tqdm>=4.65; extra == 'dev'
Requires-Dist: types-requests; extra == 'dev'
Provides-Extra: test
Requires-Dist: numpy>=1.24; extra == 'test'
Requires-Dist: pandas>=2.0; extra == 'test'
Requires-Dist: pytest>=7.0; extra == 'test'
Requires-Dist: scipy>=1.10; extra == 'test'
Requires-Dist: tqdm>=4.65; extra == 'test'
Description-Content-Type: text/markdown

# llm-ner

**Schema-driven Named Entity Recognition powered by local LLMs via Ollama.**

`llm-ner` lets you define arbitrary extraction schemas as plain Pydantic models
and extract structured entities from free text – without training a custom model.
Every extracted value is paired with a short verbatim *evidence* quote from the
source, making results auditable and explainable.

---

## Features

- **Schema-first** – define what to extract with pure Python + Pydantic; the
  library builds the LLM prompt automatically.
- **Evidence tracking** – every field carries an `evidence` quote that must
  appear verbatim in the source text.
- **Smart retries** – automatically re-runs extraction and merges results when
  fields are missing.
- **Tolerant parsing** – invalid enum values, malformed numbers, bad dates, etc.
  become `None` instead of crashing.
- **Fully typed** – ships with a `py.typed` marker and complete type
  annotations.
- **No cloud required** – runs entirely on a local [Ollama](https://ollama.com)
  instance.

---

## Installation

### With `uv` (recommended)

```bash
# Install uv if you don't have it
pip install uv

# Clone the repository
git clone https://github.com/ManuelMunozBer/llm-ner.git
cd llm-ner

# Create a virtual environment and install the package
uv venv
uv pip install -e .

# With test dependencies
uv pip install -e ".[test]"

# With all development dependencies
uv pip install -e ".[dev]"
```

### With `pip`

```bash
pip install llm-ner
```

### Prerequisites

A running [Ollama](https://ollama.com) instance with your chosen model:

```bash
ollama serve                      # runs in the foreground; use a second terminal for the pull
ollama pull qwen2.5:7b-instruct   # or any instruction-following model
```

---

## Quick Start

```python
from llmner import NERBaseModel, NERExtractor, SchemaRegistry

# 1. Create a registry – one per schema
registry = SchemaRegistry()

# 2. Define typed field annotations
GenderType = registry.categorical(
    "gender",
    options=["male", "female"],
    instruction="Extract the subject's gender.",
)
AgeType = registry.int_range(
    "age",
    "Extract the subject's age as an integer or range (e.g. '25-30').",
)
NameType = registry.generic(
    "name",
    "Extract the subject's full name.",
)

# 3. Define your Pydantic extraction schema
class PersonSchema(NERBaseModel):
    name:   NameType   | None = None  # type: ignore[valid-type]
    gender: GenderType | None = None  # type: ignore[valid-type]
    age:    AgeType    | None = None  # type: ignore[valid-type]

# 4. Create the extractor
extractor = NERExtractor(
    schema_class=PersonSchema,
    system_role="You are an expert information extractor.",
    system_task=(
        "Extract the requested fields from the text. "
        "Return null for any field not mentioned."
    ),
    rules_registry=registry.rules,
)

# 5. Extract
result = extractor.extract_one(
    "Detective John Smith, 42, was assigned to the case."
)

# Typical output; the exact values depend on the model.
print(result.name.value)     # "John Smith"
print(result.name.evidence)  # "John Smith"
print(result.age.value)      # "42"
print(result.gender.value)   # "male"
```

---

## Concepts

### `SchemaRegistry`

A `SchemaRegistry` instance is used to create self-documenting Pydantic field
types.  Each factory call registers a rule that will be injected into the LLM
prompt.

```python
registry = SchemaRegistry()

# Categorical field – only values from the allowed list are accepted
StatusType = registry.categorical(
    "status",
    options={"active": "currently employed", "inactive": "no longer employed"},
    instruction="Extract the person's employment status.",
)

# Integer / range field
SalaryType = registry.int_range(
    "salary",
    "Extract the annual salary in thousands of euros.",
)

# Free-text field
AddressType = registry.generic(
    "address",
    "Extract the full postal address.",
)

# Datetime field – normalised to YYYY-MM-DD HH:MM:SS
DateType = registry.datetime_format(
    "date",
    "Extract the contract signing date.",
)
```
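
To make the `YYYY-MM-DD HH:MM:SS` target concrete, here is a minimal standalone sketch of how such normalisation could work. It is not the library's implementation: `normalise_datetime` and the list of accepted input formats are assumptions chosen for this example. Unparseable input becomes `None`, in line with the tolerant-parsing behaviour described under Features.

```python
from datetime import datetime
from typing import Optional

# Accepted input formats: an assumption for this sketch, not llm-ner's actual list.
_FORMATS = ("%Y-%m-%d %H:%M:%S", "%Y-%m-%d", "%d/%m/%Y")


def normalise_datetime(raw: str) -> Optional[str]:
    """Return `raw` as YYYY-MM-DD HH:MM:SS, or None if no format matches."""
    for fmt in _FORMATS:
        try:
            parsed = datetime.strptime(raw.strip(), fmt)
        except ValueError:
            continue  # try the next candidate format
        return parsed.strftime("%Y-%m-%d %H:%M:%S")
    return None


print(normalise_datetime("2024-03-05"))    # 2024-03-05 00:00:00
print(normalise_datetime("05/03/2024"))    # 2024-03-05 00:00:00
print(normalise_datetime("next Tuesday"))  # None
```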

### `EvidenceField`

Every factory produces `Annotated[EvidenceField, ...]` types.  An
`EvidenceField` has two attributes:

| Attribute  | Type           | Description                                                  |
|------------|----------------|--------------------------------------------------------------|
| `value`    | `str \| None`  | The normalised extracted value.                              |
| `evidence` | `str \| None`  | Verbatim quote from the source text that justifies `value`.  |

```python
from llmner.factories import EvidenceField

field: EvidenceField = result.name
print(field.value)     # "John Smith"
print(field.evidence)  # "John Smith"
```

Evidence is validated: if the quote does not appear verbatim in the source text,
it is set to `None`.
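
The check can be pictured as a plain substring test. The sketch below is only an illustration: `validate_evidence` is a hypothetical helper, and the library may handle whitespace or casing differently.

```python
from typing import Optional


def validate_evidence(evidence: Optional[str], source: str) -> Optional[str]:
    """Keep `evidence` only if it occurs verbatim in `source`."""
    if evidence and evidence in source:
        return evidence
    return None  # missing, empty, or not a verbatim quote


source = "Detective John Smith, 42, was assigned to the case."
print(validate_evidence("John Smith, 42", source))  # John Smith, 42
print(validate_evidence("Mr. Smith", source))       # None
```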

### `NERBaseModel`

Your extraction schemas must subclass `NERBaseModel`, which adds four
reflection-based utilities:

| Method                 | Description                                                  |
|------------------------|--------------------------------------------------------------|
| `prompt_schema()`      | Generate the JSON skeleton injected into the LLM prompt.     |
| `has_missing_fields()` | Return `True` if any nested `EvidenceField.value` is `None`. |
| `merge(e1, e2)`        | Fill `None` values in `e1` with values from `e2`.            |
| `safe_parse(data)`     | Tolerantly parse LLM output, isolating per-field errors.     |
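
The semantics of `has_missing_fields()` and `merge()` can be sketched on plain nested dicts. This is an illustration under assumptions, not the library's code: `has_missing` and `merge` below are hypothetical standalone functions, and the real methods operate on `NERBaseModel` instances.

```python
from typing import Any


def has_missing(record: dict[str, Any]) -> bool:
    """True if any leaf value in the nested dict is None."""
    return any(
        has_missing(v) if isinstance(v, dict) else v is None
        for v in record.values()
    )


def merge(first: dict[str, Any], second: dict[str, Any]) -> dict[str, Any]:
    """Return a copy of `first` with None leaves filled from `second`."""
    merged: dict[str, Any] = {}
    for key, value in first.items():
        other = second.get(key)
        if isinstance(value, dict) and isinstance(other, dict):
            merged[key] = merge(value, other)  # recurse into sub-records
        elif value is None:
            merged[key] = other  # gap in `first`: take the other value
        else:
            merged[key] = value  # `first` wins when it has a value
    return merged


run1 = {"name": {"value": "John Smith"}, "age": {"value": None}}
run2 = {"name": {"value": "J. Smith"}, "age": {"value": "42"}}
combined = merge(run1, run2)
print(combined)               # {'name': {'value': 'John Smith'}, 'age': {'value': '42'}}
print(has_missing(run1))      # True
print(has_missing(combined))  # False
```

Values already present in the first result are never overwritten, so a later attempt can only fill gaps.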

### `NERExtractor`

The main orchestrator.  Key parameters:

| Parameter         | Default                    | Description                                   |
|-------------------|----------------------------|-----------------------------------------------|
| `schema_class`    | –                          | Your `NERBaseModel` subclass.                 |
| `system_role`     | –                          | LLM persona / expertise description.          |
| `system_task`     | –                          | Extraction task and constraints.              |
| `rules_registry`  | –                          | `registry.rules` from your `SchemaRegistry`.  |
| `llm_model`       | `"qwen2.5:7b-instruct"`    | Ollama model tag.                             |
| `llm_base_url`    | `"http://localhost:11434"` | Ollama server URL.                            |
| `max_retries`     | `1`                        | Extra calls on incomplete extraction.         |

---

## Nested Schemas

```python
class Address(NERBaseModel):
    street: registry.generic("street", "Street name and number.") | None = None  # type: ignore[valid-type]
    city:   registry.generic("city",   "City name.")              | None = None  # type: ignore[valid-type]

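class PersonSchema(NERBaseModel):
    name:     NameType | None = None   # type: ignore[valid-type]
    address:  Address  | None = None   # nested sub-model
    # List of sub-models; SuspectSchema is another NERBaseModel subclass (not shown).
    suspects: list[SuspectSchema] = []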
```

`prompt_schema()` and `safe_parse()` handle arbitrary nesting and lists of
sub-models automatically.
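
As a rough picture of what recursive skeleton generation involves, the sketch below walks nested type hints and emits a null-filled structure. It uses standard-library dataclasses as stand-ins for `NERBaseModel` and `EvidenceField`; `Leaf`, `Person`, `aliases`, and `skeleton` are hypothetical names, and the output shape is an assumption rather than the exact JSON that `prompt_schema()` produces.

```python
from dataclasses import dataclass, field, fields, is_dataclass
from typing import List, Optional, Union, get_args, get_origin, get_type_hints


@dataclass
class Leaf:
    """Stand-in for an evidence-carrying field (not llm-ner's class)."""
    value: Optional[str] = None
    evidence: Optional[str] = None


@dataclass
class Address:
    street: Optional[Leaf] = None
    city: Optional[Leaf] = None


@dataclass
class Person:
    name: Optional[Leaf] = None
    address: Optional[Address] = None
    aliases: List[Leaf] = field(default_factory=list)


def skeleton(tp):
    """Recursively build a null-filled skeleton for a (possibly nested) type."""
    origin = get_origin(tp)
    if origin is Union:  # Optional[X]: describe X itself
        inner = [a for a in get_args(tp) if a is not type(None)]
        return skeleton(inner[0])
    if origin is list:  # list of sub-models: show one example element
        return [skeleton(get_args(tp)[0])]
    if is_dataclass(tp):  # sub-model: recurse into each field
        hints = get_type_hints(tp)
        return {f.name: skeleton(hints[f.name]) for f in fields(tp)}
    return None  # plain leaf such as str


print(skeleton(Person))
```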

---

## Advanced Usage

### Custom LLM client

Implement `BaseLLMClient` to use a different inference backend:

```python
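from llmner import NERExtractor
from llmner.llm_client import BaseLLMClient

class MyClient(BaseLLMClient):
    def generate(self, prompt: str) -> dict | None:
        # Call your backend here and return its reply as a dict (or None)
        ...

extractor = NERExtractor(
    ...,  # schema_class, system_role, system_task, rules_registry as in Quick Start
    llm_client=MyClient(),
)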
```
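
Since `generate` returns `dict | None`, a backend that replies with raw text needs a parsing step. The helper below is a hedged sketch of one way to do that: `parse_reply` is a hypothetical function, not part of `llmner`, and it assumes the reply contains at most one JSON object, possibly surrounded by chatter.

```python
import json
from typing import Optional


def parse_reply(raw: str) -> Optional[dict]:
    """Parse the outermost {...} span of a model reply, or return None."""
    start, end = raw.find("{"), raw.rfind("}")
    if start == -1 or end <= start:
        return None  # no JSON object in the reply
    try:
        parsed = json.loads(raw[start : end + 1])
    except json.JSONDecodeError:
        return None  # malformed JSON: report failure instead of raising
    return parsed if isinstance(parsed, dict) else None


print(parse_reply('Sure! {"name": {"value": "John Smith"}} Hope this helps.'))
# {'name': {'value': 'John Smith'}}
print(parse_reply("no json here"))  # None
```

A custom `generate` could call such a helper on the backend's text output before returning.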

### Custom prompt template

```python
from llmner import DEFAULT_PROMPT_TEMPLATE

MY_TEMPLATE = """\
[INST] {system_role}

{system_task}

Rules:
{rules_text}

Schema:
{schema_json}

Text: {input_text} [/INST]
"""

extractor = NERExtractor(
    ...,
    prompt_template=MY_TEMPLATE,
)
```
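
To preview what a template expands to, you can fill the placeholders by hand. The sketch below assumes the placeholders are substituted with `str.format`-style formatting, which the README does not state explicitly; the rule text and schema JSON are made-up sample values.

```python
import json

TEMPLATE = """\
[INST] {system_role}

{system_task}

Rules:
{rules_text}

Schema:
{schema_json}

Text: {input_text} [/INST]
"""

# Sample values for illustration only; the library builds these itself.
prompt = TEMPLATE.format(
    system_role="You are an expert information extractor.",
    system_task="Extract the requested fields from the text.",
    rules_text="- name: Extract the subject's full name.",
    schema_json=json.dumps({"name": {"value": None, "evidence": None}}, indent=2),
    input_text="Detective John Smith, 42, was assigned to the case.",
)
print(prompt)
```

If substitution does work this way, any literal braces in the template itself would need to be doubled (`{{` and `}}`), while braces inside the substituted values, such as the schema JSON, are left untouched.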

### Controlling retries

```python
# Disable retries
result = extractor.extract_one(text, retry_on_null=False)

# Configure at extractor level
extractor = NERExtractor(..., max_retries=3)
```
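
The retry behaviour described under Features (re-run, then merge when fields are missing) can be sketched as a small loop. This is an illustration under assumptions, not the extractor's code: `extract_with_retries` and the fake backend are hypothetical, and records are simplified to flat dicts.

```python
from typing import Callable

Record = dict  # flat mapping of field name -> extracted value or None


def extract_with_retries(call_model: Callable[[], Record], max_retries: int = 1) -> Record:
    """Call the model once, then up to `max_retries` more times while fields are None."""
    result = call_model()
    for _ in range(max_retries):
        if all(v is not None for v in result.values()):
            break  # nothing missing: stop early
        retry = call_model()
        # Keep values already found; fill only the gaps from the new attempt.
        result = {k: v if v is not None else retry.get(k) for k, v in result.items()}
    return result


# Fake backend: the first reply misses "age", the second one supplies it.
replies = iter([
    {"name": "John Smith", "age": None},
    {"name": "J. Smith", "age": "42"},
])
print(extract_with_retries(lambda: next(replies)))
# {'name': 'John Smith', 'age': '42'}
```

With `max_retries=0` the loop body never runs, which corresponds to a single extraction call.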

---

## Running the Examples

```bash
# Make sure Ollama is running and the model is available
ollama pull qwen2.5:7b-instruct

# Run the crime extraction example
python examples/crime_extraction/run.py
```

---

## Running the Tests

Integration tests require a live Ollama instance and are selected with the
`integration` marker, so you can include or skip them with `-m`:

```bash
# Run only unit tests (no Ollama needed)
pytest tests/ -m "not integration"

# Run only the integration tests (requires a live Ollama instance)
pytest tests/ -m integration -v
```

---

## Project Structure

```
llm-ner/
├── src/
│   └── llmner/
│       ├── __init__.py        # Public API
│       ├── base_model.py      # NERBaseModel
│       ├── factories.py       # SchemaRegistry + EvidenceField
│       ├── extractor.py       # NERExtractor
│       ├── llm_client.py      # BaseLLMClient + OllamaClient
│       └── prompt.py          # PromptBuilder
├── tests/
│   ├── conftest.py            # Pytest configuration
│   ├── schema/
│   │   └── crime_schema.py    # Crime-specific schema (integration test)
│   ├── data/
│   │   ├── complaints.csv
│   │   ├── crimes_perceived_detailed.csv
│   │   └── perceived_suspects.csv
│   └── test_ner_accuracy.py   # End-to-end accuracy test
├── examples/
│   └── crime_extraction/
│       ├── schema.py          # English crime schema example
│       └── run.py             # Runnable example script
├── pyproject.toml
├── LICENSE
└── README.md
```

---

## Contributing

1. Fork the repository and create a feature branch.
2. Install development dependencies: `uv pip install -e ".[dev]"`.
3. Run linting: `ruff check src/`.
4. Run type checking: `mypy src/llmner`.
5. Open a pull request with a clear description of your changes.

---

## License

MIT – see [LICENSE](LICENSE).
