Metadata-Version: 2.4
Name: earlymodernner
Version: 0.2.0
Summary: Named Entity Recognition for Early Modern English documents (1500-1800)
Author: Jacob Polay
License: MIT
Project-URL: Homepage, https://github.com/polayj/earlymodernner
Project-URL: Repository, https://github.com/polayj/earlymodernner
Keywords: NER,named entity recognition,early modern,historical,digital humanities
Classifier: Development Status :: 3 - Alpha
Classifier: Intended Audience :: Science/Research
Classifier: License :: OSI Approved :: MIT License
Classifier: Programming Language :: Python :: 3
Classifier: Programming Language :: Python :: 3.9
Classifier: Programming Language :: Python :: 3.10
Classifier: Programming Language :: Python :: 3.11
Classifier: Topic :: Scientific/Engineering :: Artificial Intelligence
Classifier: Topic :: Text Processing :: Linguistic
Requires-Python: >=3.9
Description-Content-Type: text/markdown
License-File: LICENSE
Requires-Dist: torch>=2.0.0
Requires-Dist: transformers>=4.40.0
Requires-Dist: peft>=0.10.0
Requires-Dist: bitsandbytes>=0.43.0
Requires-Dist: accelerate>=0.30.0
Requires-Dist: huggingface_hub>=0.20.0
Provides-Extra: train
Requires-Dist: datasets>=2.19.0; extra == "train"
Requires-Dist: pyyaml>=6.0; extra == "train"
Provides-Extra: dev
Requires-Dist: pytest>=7.0.0; extra == "dev"
Requires-Dist: black>=24.0.0; extra == "dev"
Requires-Dist: ruff>=0.4.0; extra == "dev"
Dynamic: license-file

# EarlyModernNER

Named Entity Recognition for Early Modern English documents (1500-1800).

## Overview

EarlyModernNER extracts four types of entities from historical texts:

| Entity Type | Description | Examples |
|-------------|-------------|----------|
| **TOPONYM** | Place names | London, Jamaica, West Indies |
| **PERSON** | Individual people | Oliver Cromwell, Governor Modyford |
| **ORGANIZATION** | Institutions | East India Company, Parliament |
| **COMMODITY** | Trade goods & materials | sugar, tobacco, silk |

## Performance

Evaluated on 100 gold-standard annotated documents:

| Entity Type | Precision | Recall | F1 |
|-------------|-----------|--------|-----|
| TOPONYM | 0.93 | 0.82 | 0.87 |
| PERSON | 0.93 | 0.69 | 0.80 |
| ORGANIZATION | 0.93 | 0.46 | 0.62 |
| COMMODITY | 0.85 | 0.80 | 0.83 |
| **Overall** | **0.89** | **0.77** | **0.83** |

## Quick Start

### Installation

```bash
pip install earlymodernner
```

Or install from source:
```bash
git clone https://github.com/polayj/earlymodernner.git
cd earlymodernner
pip install -e .
```

Model adapters (~680MB total) are automatically downloaded from Hugging Face Hub on first use.

### Usage

```bash
# Process a single file
python -m earlymodernner --input document.txt --output results.jsonl

# Process a directory
python -m earlymodernner --input /path/to/docs/ --output results.jsonl

# Output as CSV
python -m earlymodernner --input docs/ --output results.csv --csv

# Pre-download adapters (optional, for offline use)
python -m earlymodernner --download
```
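
The commands above can also be driven from a Python script, for example when the same run is repeated over several folders. The following is a minimal sketch that only wraps the documented flags (`--input`, `--output`, `--csv`) with the standard-library `subprocess` module. The helper names `build_command` and `run_ner` are hypothetical and not part of the package.

```python
import subprocess
import sys


def build_command(input_path, output_path, csv=False):
    """Assemble the documented CLI invocation as an argument list."""
    cmd = [
        sys.executable, "-m", "earlymodernner",
        "--input", str(input_path),
        "--output", str(output_path),
    ]
    if csv:
        cmd.append("--csv")  # switch from the default JSONL output to CSV
    return cmd


def run_ner(input_path, output_path, csv=False):
    """Run the CLI; raises CalledProcessError if it exits with a non-zero status."""
    subprocess.run(build_command(input_path, output_path, csv), check=True)
```

Calling `run_ner("docs/", "results.csv", csv=True)` would then be equivalent to the "Output as CSV" command above. Using `sys.executable` keeps the call inside the Python environment where the package is installed.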

### Output Format

**JSONL** (default):
```json
{
  "doc_id": "document_name",
  "text": "The sugar trade between Jamaica and Bristol...",
  "entities": [
    {"text": "Jamaica", "type": "TOPONYM"},
    {"text": "Bristol", "type": "TOPONYM"},
    {"text": "sugar", "type": "COMMODITY"}
  ]
}
```

**CSV** (with `--csv`):
```csv
doc_id,entity_text,entity_type
document_name,Jamaica,TOPONYM
document_name,Bristol,TOPONYM
document_name,sugar,COMMODITY
```
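
As an illustration of consuming the JSONL output, the sketch below reads a results file and counts entity mentions per type. It assumes each line of the file holds one complete JSON object with the `entities` field shown above (the example record is pretty-printed here for readability). The function names `load_records` and `count_by_type` are hypothetical helpers, not part of the package.

```python
import json
from collections import Counter


def load_records(path):
    """Yield one dict per non-empty line of a JSONL results file."""
    with open(path, encoding="utf-8") as handle:
        for line in handle:
            line = line.strip()
            if line:
                yield json.loads(line)


def count_by_type(records):
    """Count entity mentions per entity type across all records."""
    counts = Counter()
    for record in records:
        for entity in record.get("entities", []):
            counts[entity["type"]] += 1
    return counts
```

For the example record above, `count_by_type` would report two `TOPONYM` mentions and one `COMMODITY` mention.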

## Requirements

- Python 3.9+
- CUDA-compatible GPU with 8GB+ VRAM
- See `requirements.txt` for dependencies
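
To check the GPU requirement before a long run, a short sketch like the following may help. It reports the total memory of the first CUDA device using PyTorch, which is already a dependency. The helper name `gpu_memory_gb` is hypothetical, and the value is given in decimal gigabytes, so a nominal 8GB card is expected to report a figure of roughly 8 or slightly above.

```python
def gpu_memory_gb(device_index=0):
    """Return total memory of a CUDA device in decimal GB, or None if unavailable."""
    try:
        import torch
    except ImportError:
        return None  # PyTorch is not installed in this environment
    if not torch.cuda.is_available():
        return None  # no CUDA-compatible GPU is visible to PyTorch
    total_bytes = torch.cuda.get_device_properties(device_index).total_memory
    return total_bytes / 1e9
```

A return value of `None` means either PyTorch is missing or no CUDA device was detected.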

## Project Structure

```
earlymodernner/
├── earlymodernner/          # Main package
│   ├── __main__.py          # CLI entry point
│   ├── pipeline.py          # Inference pipeline
│   ├── constants.py         # Entity types & prompts
│   └── adapters/            # Trained LoRA adapters
├── dev/                     # Training & development tools
│   ├── train_lora.py        # Training script
│   ├── evaluate.py          # Evaluation script
│   ├── training.md          # Training documentation
│   └── config/              # Training configurations
├── docs/                    # Documentation
│   ├── usage.md             # Detailed usage guide
│   └── corpus.md            # Training corpus details
└── results/                 # Default output directory
```

## Documentation

- **[Usage Guide](docs/usage.md)** - Detailed usage instructions, input/output formats
- **[Training Corpus](docs/corpus.md)** - Data sources and annotation process
- **[Training Guide](dev/training.md)** - How to train your own adapters

## How It Works

EarlyModernNER uses an **ensemble approach** with four specialized models:

1. Each entity type has its own fine-tuned LoRA adapter
2. Documents are processed by all four adapters
3. Results are merged using a priority-based cascade (TOPONYM → COMMODITY → PERSON → ORGANIZATION)
4. Overlapping entities are resolved in favor of the adapter that comes earlier in the cascade, which follows the F1 ranking in the Performance table

**Technical details:**
- Base model: Qwen3-4B-Instruct
- Fine-tuning: QLoRA (4-bit quantization)
- Training: Silver-standard annotations + synthetic hard negatives
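
The cascade in steps 3 and 4 can be sketched as follows. This is an illustration under stated assumptions, not the code in `pipeline.py`: the README does not say how overlap is detected, so the sketch treats two predictions as overlapping when their text is identical ignoring case. The names `PRIORITY` and `merge_cascade` are hypothetical.

```python
# Cascade order as listed in step 3 (earlier types win conflicts).
PRIORITY = ["TOPONYM", "COMMODITY", "PERSON", "ORGANIZATION"]


def merge_cascade(predictions):
    """Merge per-adapter predictions, keeping the highest-priority type per text.

    `predictions` maps an entity type to the list of strings its adapter returned.
    Assumption: identical text (case-insensitive) counts as an overlap.
    """
    merged = []
    claimed = set()
    for entity_type in PRIORITY:
        for text in predictions.get(entity_type, []):
            key = text.casefold()
            if key in claimed:
                continue  # already claimed by an earlier type in the cascade
            claimed.add(key)
            merged.append({"text": text, "type": entity_type})
    return merged
```

With this sketch, a string tagged as both `TOPONYM` and `PERSON` (for example "Jamaica") would be kept once, as a `TOPONYM`.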

## Citation

```bibtex
@software{earlymodernner,
  title = {EarlyModernNER: Named Entity Recognition for Early Modern English},
  author = {Polay, Jacob},
  year = {2026},
  url = {https://github.com/polayj/earlymodernner}
}
```

## License

MIT License

## Author

Jacob Polay, MA Student, University of Saskatchewan

## Acknowledgments

- Built on [Qwen](https://github.com/QwenLM/Qwen) models
- Uses [PEFT](https://github.com/huggingface/peft) for efficient fine-tuning
- Training data from Old Bailey Online, PCEEC2, Royal Society Corpus, EEBO, and Archive.org
