Metadata-Version: 2.4
Name: lansonai-vadtools
Version: 0.3.0
Summary: Voice Activity Detection (VAD) and Transcription package for audio/video processing
Project-URL: Homepage, https://github.com/lansonai/vadtools
Project-URL: Documentation, https://github.com/lansonai/vadtools
Project-URL: Repository, https://github.com/lansonai/vadtools
Author-email: LansonAI <info@lansonai.com>
License: MIT
Keywords: audio,speech,vad,video,voice-activity-detection
Classifier: Development Status :: 4 - Beta
Classifier: Intended Audience :: Developers
Classifier: License :: OSI Approved :: MIT License
Classifier: Programming Language :: Python :: 3
Classifier: Programming Language :: Python :: 3.12
Classifier: Topic :: Multimedia :: Sound/Audio :: Analysis
Requires-Python: >=3.12
Requires-Dist: librosa>=0.10.0
Requires-Dist: numpy>=1.24.0
Requires-Dist: soundfile>=0.12.0
Requires-Dist: torch>=2.0.0
Requires-Dist: torchaudio>=2.0.0
Provides-Extra: dev
Requires-Dist: aiofiles; extra == 'dev'
Requires-Dist: fastapi==0.116.1; extra == 'dev'
Requires-Dist: httpx; extra == 'dev'
Requires-Dist: psycopg2-binary==2.9.9; extra == 'dev'
Requires-Dist: pydantic==2.11.7; extra == 'dev'
Requires-Dist: python-multipart==0.0.20; extra == 'dev'
Requires-Dist: uvicorn==0.35.0; extra == 'dev'
Provides-Extra: test
Requires-Dist: pytest-cov>=4.0.0; extra == 'test'
Requires-Dist: pytest>=7.0.0; extra == 'test'
Description-Content-Type: text/markdown

> **Status**: Active  
> **Last Updated**: 2025-12-27  
> **Version**: Current

# VAD Service Documentation

This document describes the Voice Activity Detection (VAD) service, which runs as a serverless HTTP API on Modal.com.

## Overview

The VAD service is a serverless HTTP API that analyzes audio/video files to detect speech segments. It uses the `lansonai-vadtools` Python package (published on PyPI) and runs on Modal.com for automatic scaling and isolation.

**Key Features:**
- Serverless architecture (Modal.com)
- Automatic scaling (up to 10 concurrent requests)
- Supports audio and video formats
- Exports detected speech segments
- Returns detailed analysis results

## Architecture

The service consists of:
- **Modal API**: `modal_api.py` - Serverless HTTP API endpoint
- **Python Package**: `lansonai-vadtools` - Core VAD processing logic (published on PyPI)
- **Integration**: Main API calls Modal endpoint via HTTP

## Quick Start

### For API Users

The VAD service is already deployed and accessible via Modal. The main API automatically uses it when processing audio tasks.

**Production Endpoint**: `https://deth--analyze.modal.run`

### For Developers

#### Prerequisites

- Python 3.12+
- Modal CLI installed globally: `pip install modal`
- Modal account (free tier available)

#### Local Development

1. **Install Modal CLI** (if not already installed):
   ```bash
   pip install modal
   ```

2. **Login to Modal**:
   ```bash
   modal token new
   ```

3. **Run locally for testing**:
   ```bash
   cd scripts/python/vad
   modal serve modal_api.py
   ```

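4. **Test the development endpoint** (`modal serve` prints a temporary `*.modal.run` URL for each endpoint rather than listening on localhost; use the URL printed for `analyze`):
   ```bash
   # Replace the placeholder with the analyze URL printed by `modal serve`
   VAD_DEV_URL="https://<your-dev-endpoint>.modal.run"
   curl -X POST "$VAD_DEV_URL" \
     -H "Content-Type: application/json" \
     -d '{
       "file_url": "https://example.com/audio.wav",
       "threshold": 0.3
     }'
   ```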

5. **Deploy to production** (after testing):
   ```bash
   modal deploy modal_api.py
   ```

## API Documentation

### POST `/analyze`

Analyzes an audio/video file for voice activity.

**Request Body (JSON):**
```json
{
  "file_url": "https://example.com/audio.wav",
  "threshold": 0.3,
  "min_segment_duration": 0.5,
  "max_merge_gap": 0.2,
  "export_segments": false,
  "output_format": "wav",
  "request_id": "custom-id"
}
```

**Fields:**
- `file_url` (required): Public URL to audio/video file
- `threshold` (optional): VAD threshold (0.0-1.0), default 0.3
- `min_segment_duration` (optional): Minimum segment duration (seconds), default 0.5
- `max_merge_gap` (optional): Maximum merge gap (seconds), default 0.2
- `export_segments` (optional): Export audio segments, default false
- `output_format` (optional): Output format ("wav" or "flac"), default "wav"
- `request_id` (optional): Custom request ID
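
The request above can also be built and sent from Python. The following is a minimal sketch using only the standard library; the helper names `build_analyze_payload` and `post_analyze` are hypothetical and not part of `lansonai-vadtools`, and the client-side checks simply mirror the documented ranges and defaults.

```python
import json
import urllib.request

# Production endpoint as listed in this document.
ANALYZE_URL = "https://deth--analyze.modal.run"


def build_analyze_payload(file_url, threshold=0.3, min_segment_duration=0.5,
                          max_merge_gap=0.2, export_segments=False,
                          output_format="wav", request_id=None):
    """Validate parameters client-side and return the JSON-serializable body."""
    if not file_url.startswith(("http://", "https://")):
        raise ValueError("file_url must be an http(s) URL")
    if not 0.0 <= threshold <= 1.0:
        raise ValueError("threshold must be within 0.0-1.0")
    if min_segment_duration < 0 or max_merge_gap < 0:
        raise ValueError("durations must be non-negative")
    if output_format not in ("wav", "flac"):
        raise ValueError('output_format must be "wav" or "flac"')
    payload = {
        "file_url": file_url,
        "threshold": threshold,
        "min_segment_duration": min_segment_duration,
        "max_merge_gap": max_merge_gap,
        "export_segments": export_segments,
        "output_format": output_format,
    }
    # request_id is optional, so it is only sent when the caller provides one.
    if request_id is not None:
        payload["request_id"] = request_id
    return payload


def post_analyze(payload, url=ANALYZE_URL, timeout=300):
    """POST the payload as JSON and return the decoded JSON response."""
    request = urllib.request.Request(
        url,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.loads(response.read().decode("utf-8"))
```

Typical use would be `post_analyze(build_analyze_payload("https://example.com/audio.wav"))`. Validating before sending avoids spending a remote call on a request the service would likely reject.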

**Response (Success):**
```json
{
  "request_id": "abc123...",
  "total_segments": 42,
  "total_duration": 120.5,
  "overall_speech_ratio": 0.85,
  "segments": [
    {
      "id": 1,
      "start_time": 0.0,
      "end_time": 2.5,
      "duration": 2.5,
      "speech_confidence": 0.95
    }
  ],
  "summary": {
    "total_duration": 120.5,
    "total_speech_duration": 102.4,
    "overall_speech_ratio": 0.85,
    "num_segments": 42
  },
  "performance": {
    "total_processing_time": 15.2,
    "speed_ratio": 7.9
  }
}
```
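
The `summary` and `performance` fields appear to be derivable from the segment list. The sketch below recomputes them under two assumptions that the example values are consistent with but that this document does not state explicitly: the speech ratio is total speech time divided by total duration, and `speed_ratio` is audio duration divided by processing time. The function names are hypothetical.

```python
def summarize_segments(segments, total_duration):
    """Recompute summary statistics from a list of response segments."""
    speech = sum(seg["end_time"] - seg["start_time"] for seg in segments)
    ratio = speech / total_duration if total_duration > 0 else 0.0
    return {
        "total_duration": total_duration,
        "total_speech_duration": speech,
        "overall_speech_ratio": ratio,
        "num_segments": len(segments),
    }


def speed_ratio(total_duration, total_processing_time):
    """Seconds of audio processed per second of wall-clock time (assumed definition)."""
    return total_duration / total_processing_time
```

A client could use this to sanity-check a response, for example by comparing the recomputed ratio with `overall_speech_ratio`.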

### GET `/health`

Health check endpoint.

**Response:**
```json
{
  "status": "ok",
  "service": "vad-api",
  "version": "0.2.0"
}
```

## Using the Python Package

The VAD service is built on top of the `lansonai-vadtools` Python package, which can also be used directly:

### Installation

```bash
pip install lansonai-vadtools
```

### Basic Usage

```python
from lansonai.vadtools import analyze

result = analyze(
    input_path="audio.wav",
    output_dir="./output",
    threshold=0.3,
    min_segment_duration=0.5,
    max_merge_gap=0.2,
    export_segments=True,
    output_format="wav"
)

print(f"Detected {result['total_segments']} speech segments")
print(f"Speech ratio: {result['overall_speech_ratio'] * 100:.1f}%")
```
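
To build intuition for `min_segment_duration` and `max_merge_gap`, the sketch below shows one common way such post-processing is done: merge segments separated by a short gap, then drop segments that are still too short. This is an illustration under assumed semantics, not the package's actual implementation, and the order of the two steps inside `lansonai-vadtools` is not confirmed here.

```python
def merge_and_filter(segments, min_segment_duration=0.5, max_merge_gap=0.2):
    """Merge close segments, then drop short ones.

    segments: iterable of (start, end) tuples in seconds.
    """
    merged = []
    for start, end in sorted(segments):
        # Extend the previous segment when the silence between them is short.
        if merged and start - merged[-1][1] <= max_merge_gap:
            merged[-1] = (merged[-1][0], max(merged[-1][1], end))
        else:
            merged.append((start, end))
    # Keep only segments that meet the minimum duration after merging.
    return [(s, e) for s, e in merged if e - s >= min_segment_duration]


raw = [(0.0, 1.0), (1.1, 2.0), (3.0, 3.2), (5.0, 6.0)]
print(merge_and_filter(raw))
```

In this example the first two segments are merged because the gap between them is below `max_merge_gap`, and the short segment starting at 3.0 is dropped because it is shorter than `min_segment_duration`.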

### Return Value Structure

```python
# Shape of the dict returned by analyze()
{
    "request_id": str,
    "input_file": str,
    "output_dir": str,
    "json_path": str,             # Path to timestamps.json
    "segments_dir": str | None,   # Path to segments directory (if exported)
    "segments": list[dict],       # VAD segment list
    "summary": dict,              # Statistics
    "performance": dict,          # Performance metrics
    "metadata": dict,             # Metadata
    "total_segments": int,
    "total_duration": float,
    "overall_speech_ratio": float
}
```
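
Once a result is available, the segment list can be post-processed directly. The sketch below assumes each entry in `segments` carries the same `start_time`, `end_time`, and `speech_confidence` keys shown in the HTTP response example; the helper names are hypothetical.

```python
def silence_gaps(segments, total_duration):
    """Return (start, end) tuples for the non-speech stretches of the file."""
    gaps = []
    cursor = 0.0
    for seg in sorted(segments, key=lambda s: s["start_time"]):
        if seg["start_time"] > cursor:
            gaps.append((cursor, seg["start_time"]))
        cursor = max(cursor, seg["end_time"])
    # Trailing silence after the last speech segment.
    if cursor < total_duration:
        gaps.append((cursor, total_duration))
    return gaps


def confident_segments(segments, min_confidence=0.8):
    """Keep only segments whose speech_confidence meets the cutoff."""
    return [s for s in segments if s["speech_confidence"] >= min_confidence]
```

With a result from `analyze`, this could be called as `silence_gaps(result["segments"], result["total_duration"])`.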

For detailed package usage examples, see [USAGE.md](./USAGE.md).

## Deployment

### Modal Configuration

The service uses Modal secrets for environment variables:

```bash
# Create secret with Supabase credentials (for segment uploads)
modal secret create vad-secrets \
  SUPABASE_URL=https://your-project.supabase.co \
  SUPABASE_ANON_KEY=your-anon-key
```
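
Modal exposes the keys of an attached secret as environment variables inside the running function. The sketch below shows one way the service code might read and validate them; how `modal_api.py` actually does this is not shown in this document, and `load_supabase_config` is a hypothetical helper.

```python
import os


def load_supabase_config(env=None):
    """Read Supabase credentials from the environment and fail early if any are missing."""
    env = os.environ if env is None else env
    required = ("SUPABASE_URL", "SUPABASE_ANON_KEY")
    missing = [key for key in required if not env.get(key)]
    if missing:
        raise RuntimeError("Missing environment variables: " + ", ".join(missing))
    return {"url": env["SUPABASE_URL"], "anon_key": env["SUPABASE_ANON_KEY"]}
```

Failing at startup with a clear message is usually easier to diagnose than a later upload error caused by an empty credential.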

### Resource Allocation

Current configuration (in `modal_api.py`):
- **CPU**: 2.0 cores
- **Memory**: 4096 MB (4 GB)
- **Timeout**: 300 seconds (5 minutes)
- **Concurrency**: 10 requests

### Cost Estimation

Based on current configuration:
- **Per request**: ~$0.002-0.004 (1-2 minutes processing)
- **Free tier**: $30/month covers ~7,500-15,000 requests
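
The request range follows directly from dividing the monthly credit by the per-request cost. A short worked check, using only the figures quoted above (the function name is illustrative):

```python
FREE_CREDIT_USD = 30.0
COST_PER_REQUEST_USD = (0.002, 0.004)  # low and high estimates from this section


def requests_covered(credit_usd, cost_per_request_usd):
    """Number of requests a credit covers at a given per-request cost."""
    return round(credit_usd / cost_per_request_usd)


low_cost, high_cost = COST_PER_REQUEST_USD
print(requests_covered(FREE_CREDIT_USD, high_cost))  # lower bound on requests
print(requests_covered(FREE_CREDIT_USD, low_cost))   # upper bound on requests
```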

For detailed deployment instructions, see [MODAL_DEPLOY.md](./MODAL_DEPLOY.md).

## Testing

### Test the Deployed Service

```bash
# Health check
curl https://deth--health.modal.run

# Analyze audio
curl -X POST https://deth--analyze.modal.run \
  -H "Content-Type: application/json" \
  -d '{
    "file_url": "https://r2.deth.us/audio/example.mp3",
    "threshold": 0.3
  }'
```

### Local Testing

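```bash
# Start a development server (prints a temporary *.modal.run URL per endpoint)
cd scripts/python/vad
modal serve modal_api.py

# In another terminal, test with the analyze URL printed by `modal serve`
VAD_DEV_URL="https://<your-dev-endpoint>.modal.run"
curl -X POST "$VAD_DEV_URL" \
  -H "Content-Type: application/json" \
  -d '{"file_url": "https://example.com/audio.wav"}'
```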

For a comprehensive testing guide, see [TESTING.md](./TESTING.md).

## Package Publishing

The `lansonai-vadtools` package is published on PyPI. To publish updates:

### Prerequisites

1. Get PyPI API token from https://pypi.org/manage/account/token/
2. Set environment variable:
   ```bash
   export UV_PUBLISH_TOKEN="pypi-your-token-here"
   ```

### Publish Process

```bash
cd scripts/python/vad

# Build package
uv build

# Publish to PyPI
uv publish
```

For detailed publishing instructions, see [README_PACKAGE.md](./README_PACKAGE.md).

## Supported Formats

### Input Formats
- **Audio**: WAV, MP3, M4A, FLAC, OGG
- **Video**: MP4, AVI, MOV, MKV, FLV, WMV, WEBM, M4V (requires ffmpeg)

### Output Formats
- WAV
- FLAC
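
A client can check a URL against the lists above before submitting it. The sketch below classifies by file extension only, which is an assumption made for illustration; the service itself may detect formats differently, and `classify_media_url` is a hypothetical helper.

```python
from pathlib import PurePosixPath
from urllib.parse import urlparse

AUDIO_EXTENSIONS = {".wav", ".mp3", ".m4a", ".flac", ".ogg"}
VIDEO_EXTENSIONS = {".mp4", ".avi", ".mov", ".mkv", ".flv", ".wmv", ".webm", ".m4v"}


def classify_media_url(file_url):
    """Return "audio", "video", or None based on the URL path's extension."""
    # urlparse separates the path from any query string before the suffix is read.
    suffix = PurePosixPath(urlparse(file_url).path).suffix.lower()
    if suffix in AUDIO_EXTENSIONS:
        return "audio"
    if suffix in VIDEO_EXTENSIONS:
        return "video"
    return None
```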

## Environment Setup

For Python environment setup (pyenv, modal CLI, etc.), see [SETUP_PYTHON_ENV.md](./SETUP_PYTHON_ENV.md).

## Troubleshooting

### Common Issues

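1. **Download failures**: Ensure the `file_url` is publicly accessible without authentication
2. **Timeout errors**: Requests are limited to 300 seconds; for large files, increase the timeout in `modal_api.py` or adjust the analysis parameters
3. **Format not supported**: Check the file against the Supported Formats list; video input requires ffmpeg
4. **Modal deployment errors**: Check Modal logs with `modal app logs vad-api`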
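
For timeout errors, a rough estimate can indicate whether a file is likely to fit in the 300-second limit. The sketch below assumes `speed_ratio` means seconds of audio processed per second of wall-clock time, and it ignores download time; the safety margin and function names are illustrative choices, not service settings.

```python
TIMEOUT_SECONDS = 300  # timeout listed under Resource Allocation


def estimated_processing_time(audio_duration, speed_ratio):
    """Estimated wall-clock seconds needed to process the audio."""
    return audio_duration / speed_ratio


def fits_timeout(audio_duration, speed_ratio, timeout=TIMEOUT_SECONDS, safety_margin=0.8):
    """True when the estimate stays within a fraction of the timeout."""
    return estimated_processing_time(audio_duration, speed_ratio) <= timeout * safety_margin


# speed_ratio of 7.9 is taken from the example response in this document.
print(fits_timeout(30 * 60, 7.9))
print(fits_timeout(60 * 60, 7.9))
```

Actual throughput depends on the file and parameters, so a `speed_ratio` measured on representative files gives a better estimate than the example value.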

### Viewing Logs

```bash
# View Modal app logs
modal app logs vad-api
```

## Related Documentation

- [MODAL_DEPLOY.md](./MODAL_DEPLOY.md) - Detailed deployment guide
- [USAGE.md](./USAGE.md) - Python package usage examples
- [TESTING.md](./TESTING.md) - Testing guide
- [SETUP_PYTHON_ENV.md](./SETUP_PYTHON_ENV.md) - Environment setup
- [README_PACKAGE.md](./README_PACKAGE.md) - Package publishing guide
- [CHANGELOG.md](./CHANGELOG.md) - Package version history

## Integration with Main API

The main API integrates with the VAD service via `src/services/vadCliService.ts`, which:
1. Calls the Modal endpoint with audio URL
2. Adapts the response format to match internal types
3. Handles errors and retries

The service is automatically used when creating audio tasks via the main API.
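
The call, adapt, and retry steps can be sketched as follows. The real integration is written in TypeScript and is not reproduced here; this Python version is only an illustration, and the retry policy (three attempts, exponential backoff, retrying on `OSError`) and the internal segment shape are assumptions.

```python
import time


def call_with_retries(call, max_attempts=3, base_delay=1.0, sleep=time.sleep,
                      retry_on=(OSError,)):
    """Invoke call() and retry with exponential backoff on network-style errors."""
    last_error = None
    for attempt in range(1, max_attempts + 1):
        try:
            return call()
        except retry_on as error:
            last_error = error
            # Wait 1x, 2x, 4x ... the base delay, but not after the final attempt.
            if attempt < max_attempts:
                sleep(base_delay * 2 ** (attempt - 1))
    raise RuntimeError(f"VAD request failed after {max_attempts} attempts") from last_error


def adapt_response(raw):
    """Map the service response to a hypothetical internal segment shape."""
    return [
        {"start": seg["start_time"], "end": seg["end_time"]}
        for seg in raw.get("segments", [])
    ]
```

Passing `sleep` as a parameter keeps the backoff testable without real delays.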
