Metadata-Version: 2.4
Name: information_extractor
Version: 0.2.0
Summary: Information extractor using spaCy + SpanBERT
Home-page: https://github.com/rajatasusual/information_extractor
Author: Rajatasusual
Author-email: krajat4@gmail.com
License: MIT
Classifier: Programming Language :: Python :: 3
Classifier: License :: OSI Approved :: MIT License
Classifier: Operating System :: OS Independent
Requires-Python: >=3.6
Description-Content-Type: text/markdown
Requires-Dist: annotated-types==0.7.0
Requires-Dist: blis==0.7.11
Requires-Dist: catalogue==2.0.10
Requires-Dist: certifi==2025.1.31
Requires-Dist: charset-normalizer==3.4.1
Requires-Dist: click==8.1.8
Requires-Dist: cloudpathlib==0.21.0
Requires-Dist: confection==0.1.5
Requires-Dist: coreferee==1.4.1
Requires-Dist: cymem==2.0.11
Requires-Dist: filelock==3.13.1
Requires-Dist: fsspec==2024.6.1
Requires-Dist: idna==3.10
Requires-Dist: Jinja2==3.1.6
Requires-Dist: langcodes==3.5.0
Requires-Dist: language_data==1.3.0
Requires-Dist: marisa-trie==1.2.1
Requires-Dist: markdown-it-py==3.0.0
Requires-Dist: MarkupSafe==3.0.2
Requires-Dist: mdurl==0.1.2
Requires-Dist: mpmath==1.3.0
Requires-Dist: murmurhash==1.0.12
Requires-Dist: networkx==3.3
Requires-Dist: numpy==1.26.4
Requires-Dist: packaging==24.2
Requires-Dist: pathlib_abc==0.1.1
Requires-Dist: pathy==0.11.0
Requires-Dist: pillow==11.0.0
Requires-Dist: preshed==3.0.9
Requires-Dist: psutil==7.0.0
Requires-Dist: pydantic==1.10.21
Requires-Dist: pydantic_core==2.33.1
Requires-Dist: Pygments==2.19.1
Requires-Dist: requests==2.32.3
Requires-Dist: rich==14.0.0
Requires-Dist: scipy==1.15.2
Requires-Dist: shellingham==1.5.4
Requires-Dist: smart-open==6.4.0
Requires-Dist: spacy==3.5.4
Requires-Dist: spacy-legacy==3.0.12
Requires-Dist: spacy-loggers==1.0.5
Requires-Dist: srsly==2.5.1
Requires-Dist: sympy==1.13.1
Requires-Dist: thinc==8.1.12
Requires-Dist: tqdm==4.67.1
Requires-Dist: typer==0.9.4
Requires-Dist: typing-inspection==0.4.0
Requires-Dist: typing_extensions==4.13.2
Requires-Dist: urllib3==2.4.0
Requires-Dist: wasabi==1.1.3
Requires-Dist: weasel==0.4.1
Requires-Dist: wrapt==1.17.2

# information_extractor  

## Overview  
[![CI](https://github.com/rajatasusual/information_extractor/actions/workflows/ci.yml/badge.svg)](https://github.com/rajatasusual/information_extractor/actions/workflows/ci.yml)  
[![License: MIT](https://img.shields.io/badge/License-MIT-yellow.svg)](https://opensource.org/licenses/MIT)  
[![Python 3.8+](https://img.shields.io/badge/python-3.8+-blue.svg)](https://www.python.org/downloads/)  
[![Code style: black](https://img.shields.io/badge/code%20style-black-000000.svg)](https://github.com/psf/black)  

**information_extractor** is a Python package that combines **spaCy**, **coreferee**, and **SpanBERT** to extract structured relationships between entities in natural language text. It is built for anyone who wants named entity recognition (NER), coreference resolution, and relation extraction in a single pipeline.

## Features

### ✅ Entity Linking & Coreference Resolution
- Uses `spaCy` with `coreferee` to resolve pronouns and link entity mentions.
- Flexible support for multiple entity types: `PERSON`, `ORG`, `LOC`, `DATE`, etc.
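
To make the idea of pronoun resolution concrete, here is a minimal pure-Python sketch. It does not use coreferee's API: the `resolve_pronouns` helper and the `antecedents` mapping are hypothetical stand-ins for whatever a coreference resolver returns, and the token indices are written by hand for illustration.

```python
from typing import Dict, List


def resolve_pronouns(tokens: List[str], antecedents: Dict[int, str]) -> List[str]:
    """Return a copy of `tokens` where each index listed in `antecedents`
    is replaced by the text of the mention it refers to."""
    return [antecedents.get(i, tok) for i, tok in enumerate(tokens)]


tokens = ["Sundar", "Pichai", "is", "the", "CEO", "of", "Google", ".",
          "He", "lives", "in", "California", "."]

# Hypothetical resolver output: token 8 ("He") refers to "Sundar Pichai".
antecedents = {8: "Sundar Pichai"}

resolved = resolve_pronouns(tokens, antecedents)
print(" ".join(resolved))
```

After substitution, the second sentence names its subject explicitly, so a relation extractor can pair "Sundar Pichai" with "California" without seeing the first sentence.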

### ✅ Relation Extraction with SpanBERT
- Uses a SpanBERT model fine-tuned on TACRED.
- Handles subject/object marking and context-aware classification.
- Confidence scoring and de-duplication of extracted relations.
- GPU acceleration supported out of the box.
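
Subject/object marking can be done in several ways. The sketch below shows one common scheme, replacing each entity span with a typed placeholder token before classification. It is an illustration only: the `mark_entities` helper, the placeholder format, and the span convention (inclusive start, exclusive end) are assumptions, not this package's actual implementation.

```python
from typing import List, Tuple


def mark_entities(
    tokens: List[str],
    subj: Tuple[int, int],
    obj: Tuple[int, int],
    subj_type: str,
    obj_type: str,
) -> List[str]:
    """Replace the subject and object spans with typed placeholder tokens.

    Spans are (start, end) token indices, start inclusive and end exclusive.
    The placeholder format is hypothetical.
    """
    marked: List[str] = []
    for i, tok in enumerate(tokens):
        if subj[0] <= i < subj[1]:
            # Emit one placeholder for the whole subject span.
            if i == subj[0]:
                marked.append(f"[SUBJ-{subj_type}]")
        elif obj[0] <= i < obj[1]:
            # Emit one placeholder for the whole object span.
            if i == obj[0]:
                marked.append(f"[OBJ-{obj_type}]")
        else:
            marked.append(tok)
    return marked


tokens = ["Sundar", "Pichai", "is", "the", "CEO", "of", "Google", "."]
marked = mark_entities(tokens, subj=(0, 2), obj=(6, 7),
                       subj_type="PERSON", obj_type="ORGANIZATION")
print(marked)
```

Masking the surface form this way pushes the classifier to rely on context and entity type rather than on memorized names. Wrapping the span in start and end markers instead is an equally valid choice.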

### ✅ CLI Interface
```bash
ie --text "Barack Obama was born in Hawaii." [--deps]
```
- `--deps`: Downloads and installs required pretrained models if not present.

## Installation

```bash
pip install information_extractor
```

### Optional: Download model dependencies
Run the following once to download the SpanBERT weights, the spaCy model, and the coreferee model:
```bash
ie --deps
```

Alternatively, you can import and run the dependency script directly:
```python
from information_extractor.dependency import setup_dependencies
setup_dependencies()
```

## Example Usage

```python
from information_extractor.pipeline import RelationExtractor

text = "Sundar Pichai is the CEO of Google. He lives in California."

extractor = RelationExtractor()
results = extractor.extract(text)

for relation in results:
    print(relation)
```

### Sample Output
```json
[
  {
    "subject": "Sundar Pichai",
    "object": "Google",
    "relation": "per:employee_of",
    "confidence": 0.92
  },
  ...
]
```
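
If the results are a list of dictionaries shaped like the sample above (an assumption based on that sample), post-processing is straightforward. The sketch below filters by a confidence threshold and keeps only the highest-scoring copy of each (subject, relation, object) triple. The `best_relations` helper, the `0.5` default threshold, and the records in `sample` are made up for illustration.

```python
from typing import Dict, List, Tuple


def best_relations(relations: List[dict], min_confidence: float = 0.5) -> List[dict]:
    """Drop low-confidence relations and de-duplicate identical triples,
    keeping the highest-confidence copy of each."""
    best: Dict[Tuple[str, str, str], dict] = {}
    for rel in relations:
        if rel["confidence"] < min_confidence:
            continue
        key = (rel["subject"], rel["relation"], rel["object"])
        if key not in best or rel["confidence"] > best[key]["confidence"]:
            best[key] = rel
    # Most confident relations first.
    return sorted(best.values(), key=lambda r: r["confidence"], reverse=True)


# Illustrative records only; not real model output.
sample = [
    {"subject": "Sundar Pichai", "object": "Google",
     "relation": "per:employee_of", "confidence": 0.92},
    {"subject": "Sundar Pichai", "object": "Google",
     "relation": "per:employee_of", "confidence": 0.81},
    {"subject": "Sundar Pichai", "object": "California",
     "relation": "per:stateorprovinces_of_residence", "confidence": 0.40},
]

kept = best_relations(sample)
for rel in kept:
    print(rel["subject"], rel["relation"], rel["object"], rel["confidence"])
```

Raising `min_confidence` trades recall for precision. A suitable value depends on the text domain and is best chosen by inspecting real output.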

## Project Structure
```
information_extractor/
├── assets/
│   └── pretrained_spanbert/
├── dependency.py         # Downloads all model dependencies
├── pipeline.py           # Core logic for NLP + SpanBERT
├── main.py               # CLI entrypoint
```

## Pretrained Assets
Models are downloaded from hosted GitHub release assets:
- ✅ `SpanBERT` weights & config
- ✅ `en_core_web_md` spaCy model
- ✅ `coreferee_model_en` for coreference resolution
- ✅ `torch` wheel for reproducibility

## Citation  

This project builds on the work of Facebook Research. If you use **SpanBERT**, please cite:

```
@article{joshi2019spanbert,
  title={{SpanBERT}: Improving Pre-training by Representing and Predicting Spans},
  author={Mandar Joshi and Danqi Chen and Yinhan Liu and Daniel S. Weld and Luke Zettlemoyer and Omer Levy},
  journal={arXiv preprint arXiv:1907.10529},
  year={2019}
}
```

## License

MIT. See [LICENSE](./LICENSE) for full terms.  
Note: This project redistributes pretrained model weights for convenience under fair use for research.
