Metadata-Version: 2.4
Name: corp-extractor
Version: 0.2.1
Summary: Extract structured statements from text using T5-Gemma 2 and Diverse Beam Search
Project-URL: Homepage, https://github.com/corp-o-rate/statement-extractor
Project-URL: Documentation, https://github.com/corp-o-rate/statement-extractor#readme
Project-URL: Repository, https://github.com/corp-o-rate/statement-extractor
Project-URL: Issues, https://github.com/corp-o-rate/statement-extractor/issues
Author-email: Corp-o-Rate <neil@corp-o-rate.com>
Maintainer-email: Corp-o-Rate <neil@corp-o-rate.com>
License: MIT
Keywords: diverse-beam-search,embeddings,gemma,information-extraction,knowledge-graph,nlp,statement-extraction,subject-predicate-object,t5,transformers,triples
Classifier: Development Status :: 4 - Beta
Classifier: Intended Audience :: Developers
Classifier: Intended Audience :: Science/Research
Classifier: License :: OSI Approved :: MIT License
Classifier: Operating System :: OS Independent
Classifier: Programming Language :: Python :: 3
Classifier: Programming Language :: Python :: 3.10
Classifier: Programming Language :: Python :: 3.11
Classifier: Programming Language :: Python :: 3.12
Classifier: Topic :: Scientific/Engineering :: Artificial Intelligence
Classifier: Topic :: Scientific/Engineering :: Information Analysis
Classifier: Topic :: Text Processing :: Linguistic
Requires-Python: >=3.10
Requires-Dist: numpy>=1.24.0
Requires-Dist: pydantic>=2.0.0
Requires-Dist: torch>=2.0.0
Requires-Dist: transformers>=4.35.0
Provides-Extra: all
Requires-Dist: sentence-transformers>=2.2.0; extra == 'all'
Provides-Extra: dev
Requires-Dist: mypy>=1.0.0; extra == 'dev'
Requires-Dist: pytest-cov>=4.0.0; extra == 'dev'
Requires-Dist: pytest>=7.0.0; extra == 'dev'
Requires-Dist: ruff>=0.1.0; extra == 'dev'
Provides-Extra: embeddings
Requires-Dist: sentence-transformers>=2.2.0; extra == 'embeddings'
Description-Content-Type: text/markdown

# Corp Extractor

Extract structured subject-predicate-object statements from unstructured text using the T5-Gemma 2 model.

[![PyPI version](https://badge.fury.io/py/corp-extractor.svg)](https://badge.fury.io/py/corp-extractor)
[![Python 3.10+](https://img.shields.io/badge/python-3.10+-blue.svg)](https://www.python.org/downloads/)
[![License: MIT](https://img.shields.io/badge/License-MIT-yellow.svg)](https://opensource.org/licenses/MIT)

## Features

- **Structured Extraction**: Converts unstructured text into subject-predicate-object triples
- **Entity Type Recognition**: Identifies 12 entity types (ORG, PERSON, GPE, LOC, PRODUCT, EVENT, etc.)
- **Quality Scoring** *(v0.2.0)*: Each triple scored for groundedness (0-1) based on source text
- **Beam Merging** *(v0.2.0)*: Combines top beams for better coverage instead of picking one
- **Embedding-based Dedup** *(v0.2.0)*: Uses semantic similarity to detect near-duplicate predicates
- **Predicate Taxonomies** *(v0.2.0)*: Normalize predicates to canonical forms via embeddings
- **Multiple Output Formats**: Get results as Pydantic models, JSON, XML, or dictionaries

## Installation

```bash
# Recommended: include embedding support for smart deduplication
pip install corp-extractor[embeddings]

# Minimal installation (no embedding features)
pip install corp-extractor
```

**Note**: For GPU support, install PyTorch with CUDA first:
```bash
pip install torch --index-url https://download.pytorch.org/whl/cu121
pip install corp-extractor[embeddings]
```

## Quick Start

```python
from statement_extractor import extract_statements

result = extract_statements("""
    Apple Inc. announced the iPhone 15 at their September event.
    Tim Cook presented the new features to customers worldwide.
""")

for stmt in result:
    print(f"{stmt.subject.text} ({stmt.subject.type})")
    print(f"  --[{stmt.predicate}]--> {stmt.object.text}")
    print(f"  Confidence: {stmt.confidence_score:.2f}")  # NEW in v0.2.0
```

## New in v0.2.0: Quality Scoring & Beam Merging

By default, the library now:
- **Scores each triple** for groundedness based on whether entities appear in source text
- **Merges top beams** instead of selecting one, improving coverage
- **Uses embeddings** to detect semantically similar predicates ("bought" ≈ "acquired")

```python
from statement_extractor import ExtractionOptions, ScoringConfig

# Precision mode - filter low-confidence triples
scoring = ScoringConfig(min_confidence=0.7)
options = ExtractionOptions(scoring_config=scoring)
result = extract_statements(text, options)

# Access confidence scores
for stmt in result:
    print(f"{stmt} (confidence: {stmt.confidence_score:.2f})")
```

## New in v0.2.0: Predicate Taxonomies

Normalize predicates to canonical forms using embedding similarity:

```python
from statement_extractor import PredicateTaxonomy, ExtractionOptions

taxonomy = PredicateTaxonomy(predicates=[
    "acquired", "founded", "works_for", "announced",
    "invested_in", "partnered_with"
])

options = ExtractionOptions(predicate_taxonomy=taxonomy)
result = extract_statements(text, options)

# "bought" -> "acquired" via embedding similarity
for stmt in result:
    if stmt.canonical_predicate:
        print(f"{stmt.predicate} -> {stmt.canonical_predicate}")
```

## Disable Embeddings (Faster, No Extra Dependencies)

```python
options = ExtractionOptions(
    embedding_dedup=False,  # Use exact text matching
    merge_beams=False,      # Select single best beam
)
result = extract_statements(text, options)
```

## Output Formats

```python
from statement_extractor import (
    extract_statements,
    extract_statements_as_json,
    extract_statements_as_xml,
    extract_statements_as_dict,
)

# Pydantic models (default)
result = extract_statements(text)

# JSON string
json_output = extract_statements_as_json(text)

# Raw XML (model's native format)
xml_output = extract_statements_as_xml(text)

# Python dictionary
dict_output = extract_statements_as_dict(text)
```

## Batch Processing

```python
from statement_extractor import StatementExtractor

extractor = StatementExtractor(device="cuda")  # or "cpu"

texts = ["Text 1...", "Text 2...", "Text 3..."]
for text in texts:
    result = extractor.extract(text)
    print(f"Found {len(result)} statements")
```

## Entity Types

| Type | Description | Example |
|------|-------------|---------|
| `ORG` | Organizations | Apple Inc., United Nations |
| `PERSON` | People | Tim Cook, Elon Musk |
| `GPE` | Geopolitical entities | USA, California, Paris |
| `LOC` | Non-GPE locations | Mount Everest, Pacific Ocean |
| `PRODUCT` | Products | iPhone, Model S |
| `EVENT` | Events | World Cup, CES 2024 |
| `WORK_OF_ART` | Creative works | Mona Lisa, Game of Thrones |
| `LAW` | Legal documents | GDPR, Clean Air Act |
| `DATE` | Dates | 2024, January 15 |
| `MONEY` | Monetary values | $50 million, €100 |
| `PERCENT` | Percentages | 25%, 0.5% |
| `QUANTITY` | Quantities | 500 employees, 1.5 tons |
| `UNKNOWN` | Unrecognized | (fallback) |

## How It Works

This library uses the T5-Gemma 2 statement extraction model with **Diverse Beam Search** ([Vijayakumar et al., 2016](https://arxiv.org/abs/1610.02424)):

1. **Diverse Beam Search**: Generates 4+ candidate outputs using beam groups with diversity penalty
2. **Quality Scoring** *(v0.2.0)*: Each triple scored for groundedness in source text
3. **Beam Merging** *(v0.2.0)*: Top beams combined for better coverage
4. **Embedding Dedup** *(v0.2.0)*: Semantic similarity removes near-duplicate predicates
5. **Predicate Normalization** *(v0.2.0)*: Optional taxonomy matching via embeddings

## Requirements

- Python 3.10+
- PyTorch 2.0+
- Transformers 4.35+
- Pydantic 2.0+
- sentence-transformers 2.2+ *(optional, for embedding features)*
- ~2GB VRAM (GPU) or ~4GB RAM (CPU)

## Links

- [Model on HuggingFace](https://huggingface.co/Corp-o-Rate-Community/statement-extractor)
- [Web Demo](https://statement-extractor.corp-o-rate.com)
- [Diverse Beam Search Paper](https://arxiv.org/abs/1610.02424)
- [Corp-o-Rate](https://corp-o-rate.com)

## License

MIT License - see LICENSE file for details.
