Metadata-Version: 2.1
Name: whisperer-ml
Version: 0.1.2
Summary: Go from raw audio to a text-audio dataset with OpenAI's Whisper
Author: miguelvalente
Author-email: miguelvalente@protonmail.com
Requires-Python: >=3.10,<3.11
Classifier: Programming Language :: Python :: 3
Classifier: Programming Language :: Python :: 3.10
Requires-Dist: ffmpeg-python (>=0.2.0,<0.3.0)
Requires-Dist: jupyter (>=1.0.0,<2.0.0)
Requires-Dist: librosa (==0.8.0)
Requires-Dist: numpy (==1.22.3)
Requires-Dist: openai-whisper (>=20230124,<20230125)
Requires-Dist: pyannote-audio (>=2.1.1,<3.0.0)
Requires-Dist: pydub (==0.25.1)
Requires-Dist: scipy (==1.8.0)
Requires-Dist: torch (==1.13.0)
Requires-Dist: torchaudio (==0.13.0)
Requires-Dist: tqdm (>=4.64.1,<5.0.0)
Requires-Dist: transformers (>=4.25.1,<5.0.0)
Requires-Dist: typer (>=0.7.0,<0.8.0)
Description-Content-Type: text/markdown


# whisperer

Go from raw audio files to a speaker-separated text-audio dataset automatically.

![plot](https://github.com/miguelvalente/whisperer/blob/master/logo.png?raw=true)


## Table of Contents

- [Summary](#summary)
- [Key Features](#key-features)
- [Installation](#installation)
- [How to use:](#how-to-use)
   - [Using Multiple GPUs](#using-multiple-gpus)
   - [Configuration](#configuration)
- [To Do](#to-do)
- [Acknowledgements](#acknowledgements)

## Summary

This repo takes a directory of audio files and converts them to a text-audio dataset with a normalized distribution of audio lengths. *See ```AnalyseDataset.ipynb``` for examples of the dataset distributions across audio and text length.*

The output is a text-audio dataset that can be used to train a speech-to-text or text-to-speech model.
The dataset structure is as follows:
```
│── /dataset
│   ├── metadata.txt
│   └── wavs/
│      ├── audio1.wav
│      └── audio2.wav
```

`metadata.txt` pairs each clip's file name with its transcript, separated by `|`:
```
peters_0.wav|Beautiful is better than ugly.
peters_1.wav|Explicit is better than implicit.

```
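
Reading this file back is straightforward. The following is a minimal sketch, assuming the `filename|transcript` layout shown above; `read_metadata` is a hypothetical helper and is not part of the package.

```python
from pathlib import Path


def read_metadata(metadata_path):
    """Return a list of (wav_filename, transcript) pairs from metadata.txt."""
    pairs = []
    for line in Path(metadata_path).read_text(encoding="utf-8").splitlines():
        line = line.strip()
        if not line:
            continue  # skip blank lines, such as a trailing empty line
        # Split on the first "|" only, so a "|" inside the transcript survives.
        filename, transcript = line.split("|", 1)
        pairs.append((filename, transcript))
    return pairs
```

Calling `read_metadata("dataset/metadata.txt")` would then give one tuple per clip, which can be joined with the files under `wavs/`.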

## Key Features

* Audio files are automatically split by speakers
* Speakers are auto-labeled across the files
* Audio splits on silences
* Audio splitting is configurable
* Dataset creation aims for a Gaussian-like distribution of clip lengths, which in turn can lead to Gaussian-like distributions in the rest of the dataset statistics. This is highly dependent on your audio sources.
* Leverages the GPUs available on your machine. GPUs can also be set explicitly if you only want to use some of them.
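
For a quick look at the clip-length distribution outside the notebooks, a sketch like the one below may help. It assumes the clips are PCM WAV files that Python's standard `wave` module can read. `clip_durations` and `summarize` are hypothetical helpers, not functions shipped with the package.

```python
import statistics
import wave
from pathlib import Path


def clip_durations(wavs_dir):
    """Return the duration in seconds of every .wav file in wavs_dir."""
    durations = []
    for wav_path in sorted(Path(wavs_dir).glob("*.wav")):
        with wave.open(str(wav_path), "rb") as wav_file:
            durations.append(wav_file.getnframes() / wav_file.getframerate())
    return durations


def summarize(durations):
    """Return basic statistics for a non-empty list of durations."""
    return {
        "clips": len(durations),
        "mean_s": statistics.mean(durations),
        "median_s": statistics.median(durations),
        # stdev needs at least two data points
        "stdev_s": statistics.stdev(durations) if len(durations) > 1 else 0.0,
    }
```

A mean close to the median suggests a roughly symmetric distribution. For a fuller picture, plot a histogram of the durations.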


## Installation

Install from PyPI with pip:
```
pip install whisperer-ml
```

## How to use:


1. Create a data folder and move your audio files into `data/raw_files`
```
mkdir data data/raw_files
```
2. There are four commands
   1. Convert
      ```
      whisperer-ml convert path/to/data/raw_files
      ```
   2. Diarize 
      ```
      whisperer-ml diarize path/to/data/raw_files
      ```
   3. Auto-Label 
      ```
      whisperer-ml auto-label path/to/data/raw_files number_speakers
      ```
   4. Transcribe 
      ```
      whisperer-ml transcribe path/to/data/raw_files your_dataset_name
      ```


3. Use the ```AnalyseDataset.ipynb``` notebook to visualize the distribution of the dataset
4. Use the ```AnalyseSilence.ipynb``` notebook to experiment with silence detection configuration
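
The four commands from step 2 can also be chained from a script. This is a minimal sketch, assuming the commands are meant to run in the order listed and that `whisperer-ml` is on your `PATH`. `build_commands` and `run_pipeline` are hypothetical helper names, not part of the package.

```python
import subprocess


def build_commands(raw_dir, number_speakers, dataset_name):
    """Return the four CLI invocations in the order listed above."""
    return [
        ["whisperer-ml", "convert", raw_dir],
        ["whisperer-ml", "diarize", raw_dir],
        ["whisperer-ml", "auto-label", raw_dir, str(number_speakers)],
        ["whisperer-ml", "transcribe", raw_dir, dataset_name],
    ]


def run_pipeline(raw_dir, number_speakers, dataset_name):
    """Run each step in turn, stopping at the first failure."""
    for command in build_commands(raw_dir, number_speakers, dataset_name):
        print("Running:", " ".join(command))
        # check=True raises CalledProcessError on a non-zero exit code
        subprocess.run(command, check=True)
```

For example, `run_pipeline("data/raw_files", 2, "my_dataset")` would run all four steps for a two-speaker recording set.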

### Using Multiple GPUs

The code automatically detects how many GPUs are available and distributes the audio files in ```data/wav_files``` evenly across them.
The automatic detection is done through ```nvidia-smi```.

You can make the available GPUs explicit by setting the environment variable ```CUDA_AVAILABLE_DEVICES```.
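
To illustrate what an even split across GPUs can look like, here is a small sketch. It is not the package's actual implementation: round-robin assignment is just one simple way to keep the per-GPU file counts within one of each other. `parse_device_ids` and `distribute` are hypothetical names, and the comma-separated device list format is an assumption.

```python
def parse_device_ids(value):
    """Turn a comma-separated string such as "0,2" into [0, 2]."""
    return [int(part) for part in value.split(",") if part.strip()]


def distribute(files, device_ids):
    """Assign files to devices round-robin so bucket sizes differ by at most one."""
    if not device_ids:
        raise ValueError("at least one device id is required")
    buckets = {device_id: [] for device_id in device_ids}
    for index, file_name in enumerate(files):
        buckets[device_ids[index % len(device_ids)]].append(file_name)
    return buckets
```

With five files and devices `0` and `2`, the first device receives three files and the second receives two.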

### Configuration

Modify the `config.py` file to change the dataset creation parameters, including silence detection.

## To Do

- [x] Speech Diarization
- [x] Replace click with typer


## Acknowledgements


 - [AnalyseDataset.ipynb adapted from coqui-ai example](https://github.com/coqui-ai)
 - [OpenAI Whisper](https://github.com/openai/whisper)
 - [PyAnnote](https://github.com/pyannote/pyannote-audio)
 - [SpeechBrain](https://github.com/speechbrain/speechbrain)

