Metadata-Version: 2.4
Name: nemoasr2pytorch
Version: 0.1.5
Summary: Pure-PyTorch inference port of NeMo ASR (VAD + Parakeet-TDT) for Windows/WSL
Author: baicai1145
License: MIT License
        
        Copyright (c) 2025 白菜工厂1145号员工
        
        Permission is hereby granted, free of charge, to any person obtaining a copy
        of this software and associated documentation files (the "Software"), to deal
        in the Software without restriction, including without limitation the rights
        to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
        copies of the Software, and to permit persons to whom the Software is
        furnished to do so, subject to the following conditions:
        
        The above copyright notice and this permission notice shall be included in all
        copies or substantial portions of the Software.
        
        THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
        IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
        FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
        AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
        LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
        OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
        SOFTWARE.
        
Project-URL: Homepage, https://github.com/baicai1145/nemoasr2pytorch
Requires-Python: >=3.9
Description-Content-Type: text/markdown
License-File: LICENSE
Requires-Dist: numpy>=1.23
Requires-Dist: librosa>=0.10.0
Requires-Dist: sentencepiece>=0.1.99
Requires-Dist: PyYAML>=6.0
Dynamic: license-file

# nemoasr2pytorch

Pure‑PyTorch inference port of several NeMo ASR models, with a focus on **Windows / WSL** support and **no NeMo runtime dependency**.

Currently supported:

- Frame‑level VAD: `Frame_VAD_Multilingual_MarbleNet_v2.0`
- ASR (RNNT‑TDT):
  - `parakeet-tdt-0.6b-v2` – English
  - `parakeet-tdt-0.6b-v3` – Multilingual

The project targets **inference** only (no training or data pipelines) and closely mirrors NeMo’s architecture so that its outputs stay as close to NeMo’s as possible.

## Installation

1. Install a suitable **PyTorch + torchaudio** build first (GPU or CPU), following the official instructions.  
   For example, on CUDA 12.6:

   ```bash
   pip install torch==2.7.1 torchvision==0.22.1 torchaudio==2.7.1 \
     --index-url https://download.pytorch.org/whl/cu126
   ```

2. Then install this package:

   ```bash
   pip install nemoasr2pytorch
   ```

> Torch is **not** pinned as a dependency on purpose – you stay in control of the exact CUDA / CPU build.
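
Because PyTorch is not installed automatically, it can help to confirm that `torch` and `torchaudio` are importable before loading a model. The following is a minimal sketch using only the standard library; the helper name `missing_modules` is hypothetical and not part of this package:

```python
import importlib.util


def missing_modules(names):
    """Return the top-level module names that cannot be found."""
    return [name for name in names if importlib.util.find_spec(name) is None]


if __name__ == "__main__":
    missing = missing_modules(["torch", "torchaudio"])
    if missing:
        print("Install these first:", ", ".join(missing))
    else:
        print("PyTorch and torchaudio are available.")
```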

## Quick ASR usage (Parakeet‑TDT)

The simplest way to run ASR on a single WAV file:

```python
from nemoasr2pytorch.asr.api import load_default_parakeet_tdt_model, transcribe

# lang="EN" -> parakeet-tdt-0.6b-v2 (English)
# lang="EU" -> parakeet-tdt-0.6b-v3 (multilingual)
model = load_default_parakeet_tdt_model(lang="EU")

text = transcribe(model, "your_audio.wav")
print(text)
```

Details:

- On first use, the corresponding `.pt` weights are automatically downloaded from ModelScope
  and cached under `exports/parakeet_tdt_0.6b_v{2,3}.pt` in your working directory.
- Subsequent runs reuse the local `.pt` directly.
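
To check whether the weights are already cached (for example, before running offline), the path pattern above can be reconstructed by hand. This is a sketch based only on the file names stated in this README; `expected_weight_path` is a hypothetical helper, not a function exported by the package:

```python
from pathlib import Path

# Mapping taken from this README: EN -> v2 (English), EU -> v3 (multilingual).
_VERSION_BY_LANG = {"EN": 2, "EU": 3}


def expected_weight_path(lang, root="."):
    """Return the path where this README says the Parakeet weights are cached."""
    version = _VERSION_BY_LANG[lang.upper()]
    return Path(root) / "exports" / f"parakeet_tdt_0.6b_v{version}.pt"


if __name__ == "__main__":
    for lang in ("EN", "EU"):
        path = expected_weight_path(lang)
        state = "cached" if path.is_file() else "not downloaded yet"
        print(f"{lang}: {path} ({state})")
```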

### Low‑precision inference (FP16 / BF16)

On GPU you can load the model directly in low precision to save memory:

```python
from nemoasr2pytorch.asr.api import (
    load_parakeet_tdt_fp16,
    load_parakeet_tdt_bf16,
    transcribe_amp,
)

# FP16 model (GPU only)
model_fp16 = load_parakeet_tdt_fp16(lang="EU")
print("FP16:", transcribe_amp(model_fp16, "your_audio.wav"))

# BF16 model (if hardware supports it)
model_bf16 = load_parakeet_tdt_bf16(lang="EU")
print("BF16:", transcribe_amp(model_bf16, "your_audio.wav"))
```

`transcribe_amp` uses PyTorch AMP (`torch.amp.autocast`) on CUDA to run the model in mixed precision.
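
Which loader to call depends on the hardware. One possible policy, sketched below, prefers BF16 when the GPU supports it (its wider exponent range makes overflow less likely than with FP16), falls back to FP16 on other GPUs, and stays in FP32 on CPU. The helper `pick_precision` is hypothetical; only `torch.cuda.is_available()` and `torch.cuda.is_bf16_supported()` are real PyTorch calls:

```python
def pick_precision(cuda_available, bf16_supported):
    """Pick "bf16", "fp16", or "fp32" from two capability flags."""
    if not cuda_available:
        return "fp32"  # the low-precision loaders are described as GPU-oriented
    return "bf16" if bf16_supported else "fp16"


if __name__ == "__main__":
    try:
        import torch  # imported lazily so the helper itself does not need torch
    except ImportError:
        print("PyTorch not installed; suggested:", pick_precision(False, False))
    else:
        has_cuda = torch.cuda.is_available()
        has_bf16 = has_cuda and torch.cuda.is_bf16_supported()
        print("Suggested precision:", pick_precision(has_cuda, has_bf16))
```

The returned string can then be mapped to `load_parakeet_tdt_bf16`, `load_parakeet_tdt_fp16`, or `load_default_parakeet_tdt_model` respectively.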

## VAD (MarbleNet) for pre‑segmentation

Frame‑level VAD API:

```python
from nemoasr2pytorch.vad.api import load_default_frame_vad_model, run_vad_on_waveform
import torchaudio

# Loads MarbleNet VAD; if the .pt is missing, it is auto-downloaded
# from ModelScope to ./exports/frame_vad_multilingual_marblenet_v2.0.pt
vad_model = load_default_frame_vad_model()

waveform, sr = torchaudio.load("your_audio.wav")
if sr != vad_model.preprocessor.sample_rate:
    waveform = torchaudio.functional.resample(
        waveform, sr, vad_model.preprocessor.sample_rate
    )

probs, segments = run_vad_on_waveform(vad_model, waveform.squeeze(0))
print("Segments:", segments)
```
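
The post-processing inside `run_vad_on_waveform` is not described in this README. As a rough illustration of how per-frame speech probabilities can become segments, the sketch below applies a fixed threshold and groups consecutive speech frames. The function name, the 0.5 threshold, and the 20 ms frame duration are assumptions for illustration, not values taken from this package:

```python
def probs_to_segments(probs, threshold=0.5, frame_sec=0.02):
    """Convert per-frame speech probabilities to (start_sec, end_sec) tuples."""
    segments = []
    start = None  # index of the first frame of the current speech run
    for i, p in enumerate(probs):
        if p >= threshold and start is None:
            start = i  # a speech run begins
        elif p < threshold and start is not None:
            segments.append((start * frame_sec, i * frame_sec))
            start = None  # the speech run ended at frame i
    if start is not None:  # speech continues through the last frame
        segments.append((start * frame_sec, len(probs) * frame_sec))
    return segments


if __name__ == "__main__":
    print(probs_to_segments([0.1, 0.9, 0.8, 0.2, 0.7]))
```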

## Long‑audio inference (concept)

The GitHub repository ships a reference script, `inference.py`, that:

- loads a Parakeet model (v2/v3, chosen by `lang`);
- optionally runs MarbleNet VAD to detect speech regions;
- merges VAD segments into chunks based on `min_seg` / `max_seg` length;
- runs Parakeet on each chunk and concatenates the results.

The core logic is implemented via the public APIs:

- `nemoasr2pytorch.vad.api` – VAD model + `run_vad_on_waveform`
- `nemoasr2pytorch.asr.api` – Parakeet model + `transcribe` / `transcribe_amp`

You can either:

- copy `inference.py` from the GitHub repo and adapt it to your own CLI; or
- re‑implement a similar pipeline in your application using the two APIs above.
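
For the second option, the chunking step can be sketched in plain Python. This is not the logic of `inference.py`; it is one possible greedy strategy, assuming the VAD segments are sorted `(start_sec, end_sec)` tuples and that `min_seg` / `max_seg` are durations in seconds. The function name and default values are hypothetical:

```python
def merge_segments(segments, min_seg=10.0, max_seg=30.0):
    """Greedily merge sorted (start, end) segments into chunks.

    A chunk keeps absorbing the next segment while it is shorter than
    min_seg and the merged span would not exceed max_seg. Merged chunks
    include the silence between the segments they cover.
    """
    chunks = []
    current = None
    for start, end in segments:
        # Split a single segment longer than max_seg into fixed-size pieces.
        pieces = []
        while end - start > max_seg:
            pieces.append((start, start + max_seg))
            start += max_seg
        pieces.append((start, end))

        for s, e in pieces:
            if current is None:
                current = (s, e)
            elif current[1] - current[0] < min_seg and e - current[0] <= max_seg:
                current = (current[0], e)  # extend the current chunk
            else:
                chunks.append(current)  # close the chunk and start a new one
                current = (s, e)
    if current is not None:
        chunks.append(current)
    return chunks


if __name__ == "__main__":
    vad_segments = [(0.0, 4.0), (5.0, 9.0), (10.0, 14.0), (40.0, 45.0)]
    print(merge_segments(vad_segments))
```

Each resulting chunk can then be sliced out of the waveform and passed to `transcribe` or `transcribe_amp`, with the texts concatenated in order.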

## Package APIs

Main public modules:

- `nemoasr2pytorch.asr.api`
  - `load_default_parakeet_tdt_model(lang="EN" | "EU", device=None, dtype=torch.float32)`  
    Load Parakeet‑TDT in FP32; `lang` chooses v2 (EN) vs v3 (EU).  
  - `load_parakeet_tdt_fp16(lang="EN" | "EU", device=None)`  
    Load FP16 model (usually on GPU).  
  - `load_parakeet_tdt_bf16(lang="EN" | "EU", device=None)`  
    Load BF16 model (if supported).  
  - `transcribe(model, audio)`  
    Greedy TDT decoding in full precision (CPU or GPU).  
  - `transcribe_amp(model, audio)`  
    Greedy TDT decoding with AMP on CUDA for low‑precision models.

- `nemoasr2pytorch.vad.api`
  - `load_default_frame_vad_model(device=None, dtype=torch.float32)`  
    Load the MarbleNet VAD model from a local `.pt`.  
  - `run_vad_on_waveform(model, audio, ...)`  
    Compute per‑frame speech probabilities and return merged speech segments.

## Notes / Limitations

- This package focuses on **inference only**; training and NeMo’s full config stack (Hydra/Lightning) are intentionally omitted.
- Parakeet weights (`.pt`) are auto‑downloaded from ModelScope on first use; VAD `.pt` is currently expected to be provided by the user (converted from NeMo).
- For best performance and lower memory usage, a CUDA‑enabled PyTorch build is recommended; CPU‑only inference also works but will be slower on long audio.
