Metadata-Version: 2.4
Name: diarizer-lite
Version: 0.1.0
Summary: Lightweight stereo-based diarization for timestamped STT segments
Author: Ashwin B
License-Expression: MIT
Requires-Python: >=3.8
Description-Content-Type: text/markdown
License-File: LICENSE
Requires-Dist: pydub>=0.25.1
Dynamic: license-file

# diarizer-lite

A lightweight speaker diarization library that uses stereo channel energy to assign speakers to timestamped speech-to-text (STT) transcripts.

## What this package does

Most modern STT systems (Whisper, Faster-Whisper, Deepgram, AssemblyAI, AWS Transcribe, Google STT, and others)
can output timestamped segments of text, but they either:

- don’t do speaker diarization at all, or
- require cloud billing for diarization, or
- produce overly granular segments (short “speaker turns”), or
- don’t merge segments into coherent speaker turns for downstream use

`diarizer-lite` fills this gap by performing **lightweight speaker assignment** using stereo audio
channel energy (left/right RMS) to determine who spoke when.
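
The channel-energy idea can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the package's actual implementation: the helper names `rms` and `assign_speaker`, the left-to-`Speaker0` mapping, and the tie-breaking rule are all hypothetical.

```python
import math
from typing import Sequence


def rms(samples: Sequence[float]) -> float:
    """Root-mean-square energy of a run of samples (0.0 for an empty run)."""
    if not samples:
        return 0.0
    return math.sqrt(sum(s * s for s in samples) / len(samples))


def assign_speaker(left, right, sample_rate, start, end):
    """Label a segment by whichever channel is louder between start and end (seconds)."""
    lo, hi = int(start * sample_rate), int(end * sample_rate)
    left_rms, right_rms = rms(left[lo:hi]), rms(right[lo:hi])
    # Assumption: left channel -> "Speaker0", right channel -> "Speaker1"; ties go to Speaker0
    return "Speaker0" if left_rms >= right_rms else "Speaker1"


# Toy 8 Hz "recording": the left channel is active in the first second, the right in the second
left = [0.5, -0.5] * 4 + [0.0] * 8
right = [0.0] * 8 + [0.4, -0.4] * 4
first = assign_speaker(left, right, 8, 0.0, 1.0)
second = assign_speaker(left, right, 8, 1.0, 2.0)
```

With pydub, which the package depends on, the per-channel values could come from `AudioSegment.split_to_mono()` and the `rms` property of each slice; whether `diarizer-lite` does exactly this is not stated here.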

Unlike ML-based diarization (e.g. Pyannote, embeddings, VAD models), this method:

- does **not** require GPUs
- does **not** require machine learning
- runs offline
- works with raw stereo call recordings in common audio formats (mp3, wav, etc.)
- merges consecutive segments into human-readable speaker turns

This makes it ideal for:

- call center analytics
- customer ↔ agent conversations
- post-processing Whisper transcripts
- LLM conversation input formatting (context engineering)
- summarization pipelines
- sentiment/emotion analysis
- compliance/regulatory auditing

## Why this matters

Large Language Models (LLMs) can summarize, analyze sentiment, and extract insights from transcripts,
but only if speaker turns are well structured.

Example:
> “Okay. Okay. Okay.”

Split across three separate segments, this gives a summarizer little to work with; merged into one speaker turn, it becomes useful context.

`diarizer-lite` converts fragmented timestamped segments into proper conversational turns, enabling
LLMs to better understand **who said what** and **when**.
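
Merging fragments into turns can be sketched as below. It assumes each segment already carries a `speaker` label; `merge_turns` is a hypothetical helper for illustration, not part of the package's documented API.

```python
def merge_turns(labeled):
    """Merge adjacent segments that share a speaker into one turn."""
    turns = []
    for seg in labeled:
        if turns and turns[-1]["speaker"] == seg["speaker"]:
            # Same speaker as the previous turn: extend it instead of starting a new one
            turns[-1]["end"] = seg["end"]
            turns[-1]["text"] += " " + seg["text"]
        else:
            turns.append(dict(seg))  # copy so the input list is left untouched
    return turns


fragments = [
    {"start": 0.0, "end": 0.5, "speaker": "Speaker1", "text": "Okay."},
    {"start": 0.6, "end": 1.0, "speaker": "Speaker1", "text": "Okay."},
    {"start": 1.1, "end": 1.5, "speaker": "Speaker1", "text": "Okay."},
    {"start": 1.5, "end": 3.0, "speaker": "Speaker0", "text": "So what happens next?"},
]
turns = merge_turns(fragments)
```

Here the three "Okay." fragments collapse into a single `Speaker1` turn spanning 0.0 to 1.5 seconds, followed by the `Speaker0` turn.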

## Supported Inputs

- Audio: `.wav`, `.mp3`, `.flac` (stereo required for speaker assignment; see Mono Fallback)
- Segments: list of `{start, end, text}` dicts from STT systems

Compatible with outputs from:
- Whisper / Faster-Whisper
- Deepgram
- AssemblyAI
- AWS Transcribe (post-word-grouping)
- Google Speech-to-Text (post-word-grouping)
- Riva / Vosk / Coqui
- Any STT that outputs timestamps
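
For services that return word-level timestamps, the "post-word-grouping" step might look like the sketch below. The word dict keys (`word`, `start`, `end`), the helper name `group_words`, and the `max_gap` pause threshold are assumptions for illustration; real provider responses use their own field names and need to be normalized first.

```python
def group_words(words, max_gap=0.6):
    """Group word-level timestamps into {start, end, text} segments.

    A new segment starts whenever the pause before a word exceeds max_gap seconds.
    """
    segments = []
    for w in words:
        if segments and w["start"] - segments[-1]["end"] <= max_gap:
            # Short pause: append the word to the current segment
            segments[-1]["end"] = w["end"]
            segments[-1]["text"] += " " + w["word"]
        else:
            segments.append({"start": w["start"], "end": w["end"], "text": w["word"]})
    return segments


# Hypothetical normalized word list
words = [
    {"start": 0.0, "end": 0.3, "word": "Hi"},
    {"start": 0.35, "end": 0.8, "word": "there."},
    {"start": 2.0, "end": 2.4, "word": "Hello."},
]
segments = group_words(words)
```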

## Installation

```bash
pip install diarizer-lite
```

## Usage

```python
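from diarizer_lite import Diarizer

# `segments` is a list of {"start", "end", "text"} dicts from your STT system
# (see the Before / After example below)
d = Diarizer()

# Returns speaker-labeled turns for the stereo recording
diarized = d.diarize_segments(
    audio_file="call.wav",
    segments=segments,
)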
```

## Before / After Example

### Input (timestamped STT segments)

```python
segments = [
    {"start": 0.0, "end": 4.0, "text": "Hi, I wanted to report an issue with my ride this morning."},
    {"start": 4.0, "end": 7.0, "text": "Sure, could you tell me what happened?"},
    {"start": 7.0, "end": 13.0, "text": "Yeah, driver was polite but car wasn't clean and smelled weird."},
    {"start": 13.0, "end": 17.0, "text": "I'm sorry to hear that. Anything else?"},
    {"start": 17.0, "end": 22.0, "text": "Also he took a longer route even after I gave the correct address."},
    {"start": 22.0, "end": 25.0, "text": "Got it. We will look into this and report back."}
]
```

### Output (diarized speaker turns)

```python
[
  {"start": 0.0, "end": 4.0, "speaker": "Speaker0",
   "text": "Hi, I wanted to report an issue with my ride this morning."},

  {"start": 4.0, "end": 7.0, "speaker": "Speaker1",
   "text": "Sure, could you tell me what happened?"},

  {"start": 7.0, "end": 13.0, "speaker": "Speaker0",
   "text": "Yeah, driver was polite but car wasn't clean and smelled weird."},

  {"start": 13.0, "end": 17.0, "speaker": "Speaker1",
   "text": "I'm sorry to hear that. Anything else?"},

  {"start": 17.0, "end": 25.0, "speaker": "Speaker0",
   "text": "Also he took a longer route even after I gave the correct address. Got it. We will look into this and report back."}
]
```
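
For LLM input formatting, turns in this shape can be flattened into plain text. The sketch below is one possible layout, not something the package provides; `format_turns` and the timestamp format are hypothetical.

```python
def format_turns(turns):
    """Render diarized turns as one 'timestamp speaker: text' line per turn."""
    lines = []
    for t in turns:
        lines.append(f"[{t['start']:.1f}s-{t['end']:.1f}s] {t['speaker']}: {t['text']}")
    return "\n".join(lines)


turns = [
    {"start": 0.0, "end": 4.0, "speaker": "Speaker0",
     "text": "Hi, I wanted to report an issue with my ride this morning."},
    {"start": 4.0, "end": 7.0, "speaker": "Speaker1",
     "text": "Sure, could you tell me what happened?"},
]
prompt_text = format_turns(turns)
print(prompt_text)
```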

## Mono Fallback

If the audio is mono, `diarizer-lite` cannot infer speakers from channel energy and labels the speaker as:

```text
speaker="Unknown"
```
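
To check for mono audio before diarizing, a channel count is enough. The sketch below uses only the standard-library `wave` module, so it covers WAV input only; `channel_count`, `make_silent_wav`, and `label_unknown` are hypothetical helpers, and the package's own detection may work differently.

```python
import io
import wave


def channel_count(source):
    """Return the number of channels in a WAV file (path or binary file-like object)."""
    with wave.open(source, "rb") as wav:
        return wav.getnchannels()


def label_unknown(segments):
    """Mirror the mono fallback: tag every segment with speaker 'Unknown'."""
    return [{**seg, "speaker": "Unknown"} for seg in segments]


def make_silent_wav(channels, frames=8):
    """Build a tiny in-memory 16-bit WAV, used here only to exercise channel_count."""
    buf = io.BytesIO()
    with wave.open(buf, "wb") as wav:
        wav.setnchannels(channels)
        wav.setsampwidth(2)
        wav.setframerate(8000)
        wav.writeframes(b"\x00\x00" * channels * frames)
    buf.seek(0)
    return buf


is_mono = channel_count(make_silent_wav(1)) == 1
fallback = label_unknown([{"start": 0.0, "end": 1.0, "text": "Hi"}]) if is_mono else []
```

For compressed formats such as mp3 or flac, a decoder is needed first; with pydub, the `channels` attribute of an `AudioSegment` reports the same information.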

## License

MIT License
