Metadata-Version: 2.4
Name: semanticwer
Version: 0.1.0
Summary: SemanticWER: Meaning-Aware ASR Evaluation Toolkit for speech-to-LLM systems
Author: SemanticWER Contributors
License: MIT
Project-URL: Homepage, https://github.com/semanticwer/semanticwer
Project-URL: Documentation, https://semanticwer.readthedocs.io
Project-URL: Repository, https://github.com/semanticwer/semanticwer
Project-URL: Issues, https://github.com/semanticwer/semanticwer/issues
Keywords: asr,speech recognition,wer,word error rate,nlp,evaluation,metrics,llm,semantic similarity,named entity recognition,torchmetrics
Classifier: Development Status :: 4 - Beta
Classifier: Intended Audience :: Science/Research
Classifier: Intended Audience :: Developers
Classifier: License :: OSI Approved :: MIT License
Classifier: Programming Language :: Python :: 3
Classifier: Programming Language :: Python :: 3.9
Classifier: Programming Language :: Python :: 3.10
Classifier: Programming Language :: Python :: 3.11
Classifier: Programming Language :: Python :: 3.12
Classifier: Topic :: Scientific/Engineering :: Artificial Intelligence
Classifier: Topic :: Text Processing :: Linguistic
Requires-Python: >=3.9
Description-Content-Type: text/markdown
License-File: LICENSE
Requires-Dist: torch>=2.0
Provides-Extra: full
Requires-Dist: sentence-transformers>=2.7; extra == "full"
Requires-Dist: spacy>=3.7; extra == "full"
Requires-Dist: rouge-score>=0.1.2; extra == "full"
Provides-Extra: semantic
Requires-Dist: sentence-transformers>=2.7; extra == "semantic"
Provides-Extra: spacy
Requires-Dist: spacy>=3.7; extra == "spacy"
Provides-Extra: hf
Requires-Dist: transformers>=4.35; extra == "hf"
Requires-Dist: torch>=2.0; extra == "hf"
Provides-Extra: rouge
Requires-Dist: rouge-score>=0.1.2; extra == "rouge"
Provides-Extra: dev
Requires-Dist: pytest>=7.0; extra == "dev"
Requires-Dist: pytest-cov>=4.0; extra == "dev"
Requires-Dist: sentence-transformers>=2.7; extra == "dev"
Requires-Dist: spacy>=3.7; extra == "dev"
Requires-Dist: rouge-score>=0.1.2; extra == "dev"
Requires-Dist: ruff>=0.1; extra == "dev"
Requires-Dist: mypy>=1.0; extra == "dev"
Requires-Dist: build; extra == "dev"
Requires-Dist: twine; extra == "dev"
Dynamic: license-file

# 🔥 SemanticWER

**Evaluation framework for speech-to-LLM systems.**

[![PyPI version](https://badge.fury.io/py/semanticwer.svg)](https://badge.fury.io/py/semanticwer)
[![Python 3.9+](https://img.shields.io/badge/python-3.9+-blue.svg)](https://www.python.org/downloads/)
[![License: MIT](https://img.shields.io/badge/License-MIT-yellow.svg)](https://opensource.org/licenses/MIT)

---

Classic Word Error Rate (WER) measures *token accuracy*. But modern pipelines look like this:

```
Speech → ASR → LLM → Task (QA, summarization, agents, RAG)
```

A 20% WER transcript can **preserve meaning** — or **completely break downstream reasoning**. WER cannot tell the difference.
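
To make this concrete, here is a minimal sketch of word-level WER: Levenshtein distance over whitespace tokens, divided by the reference length. The helper `word_error_rate` and the example sentences are assumptions made for illustration; they are not part of the library.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level edit distance divided by the number of reference words."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        return 0.0 if not hyp else 1.0
    # prev[j] holds the edit distance between the processed reference prefix and hyp[:j]
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(
                prev[j] + 1,         # deletion
                curr[j - 1] + 1,     # insertion
                prev[j - 1] + cost,  # substitution or match
            ))
        prev = curr
    return prev[-1] / len(ref)


reference = "send fifty dollars to john"
harmless = "send fifty dollars to jon"    # misspelled name, same intent
harmful = "send fifteen dollars to john"  # different amount

print(word_error_rate(reference, harmless))  # 0.2
print(word_error_rate(reference, harmful))   # 0.2
```

Both hypotheses score 0.2, although only the second changes what the sentence asks for.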

**SemanticWER** fixes this with a four-component composite score:

```
SemanticWER = w₁·L + w₂·E + w₃·S + w₄·T
```

| Component | What it measures |
|-----------|-----------------|
| **L** — Lexical | Standard WER + CER (NIST-compatible) |
| **E** — Entity | Named entity preservation (PERSON, ORG, DATE, …) |
| **S** — Semantic | Embedding cosine similarity (SBERT) |
| **T** — Task | Downstream task success delta |

Lower score = better transcript quality.
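
As a rough illustration of how four components could combine into one lower-is-better number, the sketch below makes two assumptions that this README does not confirm: each component enters as an error in [0, 1] (so entity F1, semantic similarity, and task score are used as `1 - value`), and a missing task score causes the remaining weights to be renormalized. `composite_score` is a hypothetical helper, not the library's implementation.

```python
from typing import Optional, Sequence


def _clamp(value: float) -> float:
    """Restrict a value to the [0, 1] interval."""
    return min(max(value, 0.0), 1.0)


def composite_score(
    wer: float,
    entity_f1: float,
    semantic_sim: float,
    task_score: Optional[float] = None,
    weights: Sequence[float] = (0.3, 0.2, 0.3, 0.2),
) -> float:
    """Weighted sum of per-component errors (assumed form, for illustration)."""
    errors = [_clamp(wer), 1.0 - _clamp(entity_f1), 1.0 - _clamp(semantic_sim)]
    active = list(weights[:3])
    if task_score is not None:
        errors.append(1.0 - _clamp(task_score))
        active.append(weights[3])
    total = sum(active)
    if total == 0:
        return 0.0
    # Renormalize so that a skipped task component does not shrink the score
    return sum(w * e for w, e in zip(active, errors)) / total


print(composite_score(wer=0.0, entity_f1=1.0, semantic_sim=1.0, task_score=1.0))  # 0.0
```

Under these assumptions, a perfect transcript scores 0.0 and a transcript that fails every component scores 1.0.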

---

## Installation

```bash
# Minimal (WER/CER + regex NER + Jaccard semantic fallback)
pip install semanticwer

# Recommended (full features)
pip install "semanticwer[full]"
python -m spacy download en_core_web_sm
```
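
The minimal install mentions a Jaccard fallback for the semantic component. A Jaccard similarity over token sets can be sketched as follows; the lowercase whitespace tokenization and the name `jaccard_similarity` are assumptions for this example, and the library's own normalization may differ.

```python
def jaccard_similarity(reference: str, hypothesis: str) -> float:
    """Size of the token-set intersection divided by the size of the union."""
    ref_tokens = set(reference.lower().split())
    hyp_tokens = set(hypothesis.lower().split())
    if not ref_tokens and not hyp_tokens:
        return 1.0  # two empty strings are treated as identical
    return len(ref_tokens & hyp_tokens) / len(ref_tokens | hyp_tokens)


print(jaccard_similarity("a b c", "a b d"))  # 0.5
```

Because it only counts shared tokens, this kind of fallback cannot recognize paraphrases, which is the main reason to install the embedding-based extra when meaning matters.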

---

## Quick Start

```python
from semanticwer import SemanticWER

metric = SemanticWER()  # defaults: weights=(0.3, 0.2, 0.3, 0.2)

result = metric(
    reference="The patient was prescribed 50mg of metformin twice daily",
    hypothesis="The patient was prescribed 15mg of metformin twice daily",
)

print(result.summary())
# Illustrative output; entity, semantic, and composite values depend on the installed backends.
# ====================================================
#   SemanticWER Result
# ====================================================
#   Composite Score  : 0.3241  (lower = better)
# ----------------------------------------------------
#   [L] Lexical      : WER=0.1111  CER=0.0541  (w=0.30)
#   [E] Entity       : F1=0.8000  Recall=0.6667  (w=0.20)
#   [S] Semantic     : Sim=0.8923  (w=0.30)
#   [T] Task         : N/A  (w=0.20)
# ====================================================

print(result.wer)           # 0.1111 (1 substitution / 9 reference words)
print(result.semantic_sim)  # 0.8923
print(result.entity_f1)     # 0.8000
print(result.score)         # 0.3241
```

---

## torchmetrics-Style API

```python
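from semanticwer import SemanticWER

metric = SemanticWER(weights=(0.3, 0.2, 0.3, 0.2))

# Any iterable of (reference, hypothesis) string pairs
dataset = [
    ("John Smith called at 3pm", "Tom Jones called at 9am"),
]

# Accumulate samples
for ref, hyp in dataset:
    metric.update(ref, hyp)

# Compute over full corpus
result = metric.aggregate()
print(f"Corpus SemanticWER: {result.score:.4f}")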
```
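
The update/aggregate pattern matters because a corpus-level error rate is usually not the mean of per-sentence rates. The sketch below shows the idea for WER alone, accumulating edit counts and reference lengths; `CorpusWER` is a hypothetical class, and this README does not state how `aggregate()` actually pools its components.

```python
class CorpusWER:
    """Accumulates word errors across samples (illustrative, not the library class)."""

    def __init__(self) -> None:
        self.errors = 0
        self.ref_words = 0

    @staticmethod
    def _edit_distance(ref: list, hyp: list) -> int:
        # Standard Levenshtein distance over word lists, one row at a time
        prev = list(range(len(hyp) + 1))
        for i, ref_word in enumerate(ref, start=1):
            curr = [i]
            for j, hyp_word in enumerate(hyp, start=1):
                cost = 0 if ref_word == hyp_word else 1
                curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + cost))
            prev = curr
        return prev[-1]

    def update(self, reference: str, hypothesis: str) -> None:
        ref, hyp = reference.split(), hypothesis.split()
        self.errors += self._edit_distance(ref, hyp)
        self.ref_words += len(ref)

    def aggregate(self) -> float:
        # Total errors over total reference words, so long sentences weigh more
        return self.errors / self.ref_words if self.ref_words else 0.0


corpus = CorpusWER()
corpus.update("a b c d", "a b c d")  # sentence WER 0.0
corpus.update("x", "y")              # sentence WER 1.0
print(corpus.aggregate())            # 0.2
```

Here the pooled rate is 0.2, while averaging the two sentence-level rates would give 0.5.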

---

## HuggingFace evaluate-Style API

```python
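from semanticwer import SemanticWER

metric = SemanticWER()
references = ["John Smith called at 3pm"]
hypotheses = ["Tom Jones called at 9am"]

result = metric.compute(predictions=hypotheses, references=references)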
```

---

## Task Utility: The Game-Changer

Connect SemanticWER to your actual downstream task:

### Built-in: ROUGE

```python
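# Requires the ROUGE extra: pip install "semanticwer[rouge]"
from semanticwer import SemanticWER
from semanticwer.modules.task import TaskModule

ref = "John Smith called at 3pm"
hyp = "Tom Jones called at 9am"

metric = SemanticWER(
    weights=(0.25, 0.25, 0.25, 0.25),
    task_fn=TaskModule.rouge_adapter("rougeL"),
)
result = metric(ref, hyp)
print(result.task_score)  # 0.0–1.0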
```

### Built-in: Token F1 (SQuAD-style QA)

```python
from semanticwer import SemanticWER
from semanticwer.modules.task import TaskModule

metric = SemanticWER(
    task_fn=TaskModule.f1_token_adapter(),
    weights=(0.25, 0.25, 0.25, 0.25),
)
```
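
For readers unfamiliar with the metric, SQuAD-style token F1 is the harmonic mean of token precision and recall, with overlap counted per token occurrence. The sketch below uses lowercase whitespace tokens; the function name `token_f1` and that tokenization are assumptions, and `f1_token_adapter()` may normalize text differently.

```python
from collections import Counter


def token_f1(reference: str, hypothesis: str) -> float:
    """Harmonic mean of token-level precision and recall."""
    ref_tokens = reference.lower().split()
    hyp_tokens = hypothesis.lower().split()
    if not ref_tokens or not hyp_tokens:
        # Both empty counts as a match; one empty counts as a miss
        return float(ref_tokens == hyp_tokens)
    # Multiset intersection keeps the minimum count of each shared token
    overlap = sum((Counter(ref_tokens) & Counter(hyp_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(hyp_tokens)
    recall = overlap / len(ref_tokens)
    return 2 * precision * recall / (precision + recall)


score = token_f1("the cat sat", "the cat")
print(round(score, 4))  # 0.8
```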

### Custom: Any callable

```python
from semanticwer import SemanticWER

# qa_model is a placeholder for your own question-answering callable.
def my_qa_eval(reference: str, hypothesis: str) -> float:
    """Return 1.0 if hypothesis preserves the answer to our question."""
    ref_answer = qa_model(question="Who was mentioned?", context=reference)
    hyp_answer = qa_model(question="Who was mentioned?", context=hypothesis)
    return 1.0 if ref_answer == hyp_answer else 0.0

metric = SemanticWER(
    task_fn=my_qa_eval,
    weights=(0.2, 0.2, 0.3, 0.3),
)
```

### Custom: LLM-as-judge

```python
import anthropic

from semanticwer import SemanticWER
from semanticwer.modules.task import TaskModule

# Reads the API key from the ANTHROPIC_API_KEY environment variable
client = anthropic.Anthropic()

def llm_judge(reference: str, hypothesis: str) -> float:
    response = client.messages.create(
        model="claude-3-5-haiku-20241022",
        max_tokens=10,
        messages=[{
            "role": "user",
            "content": (
                "Score semantic equivalence 0.0–1.0 (1.0 = identical meaning).\n"
                f"REF: {reference}\nHYP: {hypothesis}\n"
                "Respond with only a float."
            ),
        }],
    )
    return float(response.content[0].text.strip())

metric = SemanticWER(
    task_fn=TaskModule.llm_judge_adapter(llm_judge),
    weights=(0.2, 0.2, 0.3, 0.3),
)
```
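
A judge model does not always reply with a bare float, so calling `float()` on the raw text can raise. One defensive option is to pull the first number out of the reply and clamp it to [0, 1]. `parse_score` below is a hypothetical helper written for this example; whether `llm_judge_adapter` already does something similar is not stated here.

```python
import re


def parse_score(text: str, default: float = 0.0) -> float:
    """Extract the first number from a model reply and clamp it to [0, 1]."""
    match = re.search(r"[-+]?\d*\.?\d+", text)
    if match is None:
        return default  # no number in the reply
    return min(max(float(match.group()), 0.0), 1.0)


print(parse_score("0.85"))        # 0.85
print(parse_score("Score: 1.7"))  # 1.0
print(parse_score("n/a"))         # 0.0
```

Inside `llm_judge`, the last line could then return `parse_score(response.content[0].text)` instead of calling `float()` directly.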

---

## NER Backend Selection

```python
from semanticwer import SemanticWER

# spaCy (default, best accuracy for English); needs the en_core_web_sm model
metric = SemanticWER(ner_backend="spacy")

# HuggingFace transformers pipeline; install with "semanticwer[hf]"
metric = SemanticWER(ner_backend="hf")

# Lightweight regex (no extra deps)
metric = SemanticWER(ner_backend="regex")

# Disable entity scoring
metric = SemanticWER(ner_backend="none")
```
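
To give a feel for what a dependency-free backend can and cannot catch, here is a small sketch of regex entity extraction followed by set-based entity F1. The two patterns (quantities or clock times, and multi-word capitalized names) and the helper names are assumptions for illustration; the patterns used by the `regex` backend are not documented in this README.

```python
import re

# Hypothetical patterns: numbers with a unit or am/pm, and capitalized name sequences
ENTITY_PATTERN = re.compile(
    r"\b\d+(?:\.\d+)?\s?(?:mg|ml|kg|am|pm)\b"
    r"|\b[A-Z][a-z]+(?:\s[A-Z][a-z]+)+\b"
)


def extract_entities(text: str) -> set:
    """Return the set of lowercased entity strings found in the text."""
    return {match.lower() for match in ENTITY_PATTERN.findall(text)}


def entity_f1(reference: str, hypothesis: str) -> float:
    """F1 between the entity sets of the reference and the hypothesis."""
    ref_entities = extract_entities(reference)
    hyp_entities = extract_entities(hypothesis)
    if not ref_entities and not hyp_entities:
        return 1.0  # nothing to preserve, nothing hallucinated
    true_positives = len(ref_entities & hyp_entities)
    if true_positives == 0:
        return 0.0
    precision = true_positives / len(hyp_entities)
    recall = true_positives / len(ref_entities)
    return 2 * precision * recall / (precision + recall)


print(entity_f1("John Smith called at 3pm", "John Smith called at 9am"))  # 0.5
print(entity_f1("John Smith called at 3pm", "Tom Jones called at 9am"))   # 0.0
```

Patterns like these miss single-word names and anything lowercased by the ASR system, which is where a trained NER model earns its extra dependencies.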

---

## CLI

```bash
# Single pair
semanticwer --ref "John Smith called at 3pm" --hyp "Tom Jones called at 9am"

# Files (one sentence per line)
semanticwer --ref ref.txt --hyp hyp.txt

# With ROUGE task scoring
semanticwer --ref ref.txt --hyp hyp.txt --task rouge

# JSON output (for pipelines)
semanticwer --ref ref.txt --hyp hyp.txt --output json

# Custom weights
semanticwer --ref ref.txt --hyp hyp.txt --weights 0.4 0.2 0.3 0.1

# CSV output
semanticwer --ref ref.txt --hyp hyp.txt --output csv
```
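
File mode expects parallel files with one sentence per line, so a mismatch in line counts silently misaligns every later pair if nothing checks for it. The sketch below shows one way to load and validate such files before scoring; `load_pairs` is a hypothetical helper, and how the CLI itself handles mismatched files is not documented here.

```python
from pathlib import Path
from typing import List, Tuple


def load_pairs(ref_path: str, hyp_path: str) -> List[Tuple[str, str]]:
    """Read two parallel text files into (reference, hypothesis) pairs."""
    refs = Path(ref_path).read_text(encoding="utf-8").splitlines()
    hyps = Path(hyp_path).read_text(encoding="utf-8").splitlines()
    if len(refs) != len(hyps):
        raise ValueError(
            f"line count mismatch: {len(refs)} references vs {len(hyps)} hypotheses"
        )
    # Strip surrounding whitespace so stray spaces do not affect scoring
    return [(ref.strip(), hyp.strip()) for ref, hyp in zip(refs, hyps)]
```

The returned pairs can be fed straight into a loop that calls `metric.update(ref, hyp)`.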

---

## Result Object

```python
result = metric(ref, hyp)

result.score            # Composite SemanticWER [0, 1]
result.wer              # Classic WER
result.cer              # Character Error Rate
result.entity_f1        # Entity F1 score
result.entity_recall    # Entity recall
result.semantic_sim     # Cosine similarity [0, 1]
result.task_score       # Task utility score (or None)

result.to_dict()        # Full breakdown as dict
result.to_json()        # Full breakdown as JSON string
result.summary()        # Human-readable table
```

---

## Reproducibility / Custom Weights

Weights must sum to 1.0. Recommended presets:

| Use case | Weights (L, E, S, T) |
|----------|---------------------|
| General ASR evaluation | `(0.3, 0.2, 0.3, 0.2)` |
| Medical / legal (entity-critical) | `(0.2, 0.4, 0.2, 0.2)` |
| LLM pipeline (task-first) | `(0.15, 0.15, 0.3, 0.4)` |
| Backward-compatible WER | `(1.0, 0.0, 0.0, 0.0)` |
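
Since weights must sum to 1.0, it can help to check a custom tuple before running a long evaluation. The sketch below is one way to do that with a floating-point tolerance; `validate_weights` is a hypothetical helper, and the library may perform its own validation with different rules or error messages.

```python
import math
from typing import Sequence, Tuple


def validate_weights(weights: Sequence[float]) -> Tuple[float, ...]:
    """Check that four non-negative weights (L, E, S, T) sum to 1.0."""
    if len(weights) != 4:
        raise ValueError("expected four weights: (L, E, S, T)")
    if any(w < 0 for w in weights):
        raise ValueError("weights must be non-negative")
    # Compare with a tolerance because sums like 0.15 + 0.15 + 0.3 + 0.4 are inexact
    if not math.isclose(sum(weights), 1.0, abs_tol=1e-6):
        raise ValueError(f"weights must sum to 1.0, got {sum(weights):.4f}")
    return tuple(float(w) for w in weights)


print(validate_weights((0.15, 0.15, 0.3, 0.4)))  # (0.15, 0.15, 0.3, 0.4)
```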

---

## Citation

If you use SemanticWER in research, please cite:

```bibtex
@software{semanticwer2024,
  title     = {SemanticWER: Meaning-Aware ASR Evaluation Toolkit},
  author    = {{SemanticWER Contributors}},
  year      = {2024},
  url       = {https://github.com/semanticwer/semanticwer},
}
```

---

## License

MIT
