Metadata-Version: 2.3
Name: llama-jarvis
Version: 0.1.0
Summary: Train a speech-to-speech model using your own language model
Project-URL: Documentation, https://github.com/johnsutor/llama-jarvis#readme
Project-URL: Issues, https://github.com/johnsutor/llama-jarvis/issues
Project-URL: Source, https://github.com/johnsutor/llama-jarvis
Author-email: John Sutor <johnsutor3@gmail.com>
License-Expression: MIT
License-File: LICENSE.txt
Keywords: llama,llm,speech-to-speech,transformers
Classifier: Development Status :: 4 - Beta
Classifier: License :: OSI Approved :: MIT License
Classifier: Operating System :: OS Independent
Classifier: Programming Language :: Python
Classifier: Programming Language :: Python :: 3
Classifier: Programming Language :: Python :: 3.8
Classifier: Programming Language :: Python :: 3.9
Classifier: Programming Language :: Python :: 3.10
Classifier: Programming Language :: Python :: 3.11
Classifier: Programming Language :: Python :: 3.12
Classifier: Programming Language :: Python :: Implementation :: CPython
Classifier: Programming Language :: Python :: Implementation :: PyPy
Requires-Python: >=3.8
Requires-Dist: torch
Requires-Dist: transformers
Provides-Extra: dev
Requires-Dist: hatch; extra == 'dev'
Requires-Dist: mypy; extra == 'dev'
Requires-Dist: pytest; extra == 'dev'
Requires-Dist: pytest-cov; extra == 'dev'
Requires-Dist: ruff; extra == 'dev'
Description-Content-Type: text/markdown

# 🦙🎤 Llama-Jarvis
![Lint Status](https://github.com/johnsutor/llama-jarvis/workflows/Lint/badge.svg)
![Tests Status](https://github.com/johnsutor/llama-jarvis/workflows/Test/badge.svg)
![contributions welcome](https://img.shields.io/badge/contributions-welcome-blue.svg?style=flat)

![alt text](./assets/llama.webp)
Train a speech-to-speech model using your own language model. It is currently based on the [Seamless Model](https://huggingface.co/collections/facebook/seamless-communication-6568d486ef451c6ba62c7724), with support for more models planned for the future.

This model is based on speech-to-speech models such as [Llama-Omni](https://github.com/ictnlp/LLaMA-Omni). However, it aims to take advantage of the joint speech-text embeddings of the Seamless Model.

This code is very much a work in progress. Any and all contributions are welcome!  

## Why this Library? 
This library aims to make speech-to-speech models more compatible with the HuggingFace ecosystem, rather than requiring you to modify your models and datasets to work with a new library. This allows us to take advantage of things like the [HuggingFace Trainer](https://huggingface.co/docs/transformers/en/main_classes/trainer).

## Getting Started
**NOTE:** For some of the examples below, you may first have to [log in to HuggingFace](https://huggingface.co/docs/huggingface_hub/main/package_reference/authentication) to gain access to the gated models (especially Llama models).
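
Before downloading a gated model, it can help to confirm that a HuggingFace token is configured at all. The snippet below is a minimal, standard-library-only sketch of such a check. `find_hf_token` is a hypothetical helper and not part of this library. It assumes the usual lookup locations (the `HF_TOKEN` environment variable, then a `token` file under `HF_HOME`, which defaults to `~/.cache/huggingface`) and ignores other overrides such as `HF_TOKEN_PATH`, so treat a negative result as a hint rather than a definitive answer.

```python
import os
from pathlib import Path
from typing import Mapping, Optional


def find_hf_token(env: Optional[Mapping[str, str]] = None) -> Optional[str]:
    """Return a HuggingFace token if one appears to be configured, else None.

    Hypothetical helper: the lookup order here is an assumption and does not
    cover every override that huggingface_hub itself supports.
    """
    env = os.environ if env is None else env

    # 1. A token supplied directly through the environment.
    token = env.get("HF_TOKEN", "").strip()
    if token:
        return token

    # 2. A token saved by a previous login. HF_HOME, if set, replaces the
    #    default cache root of ~/.cache/huggingface.
    hf_home = Path(env.get("HF_HOME") or Path.home() / ".cache" / "huggingface")
    token_file = hf_home.expanduser() / "token"
    if token_file.is_file():
        return token_file.read_text().strip() or None

    return None


if __name__ == "__main__":
    if find_hf_token() is None:
        print("No HuggingFace token found; gated models may fail to download.")
    else:
        print("HuggingFace token found.")
```

If no token is found, logging in as described in the linked authentication guide should create one.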

### Running Locally 
This code is not yet available via PyPI (I am hesitant to release it without thoroughly testing the code). To try it locally, run:
```shell 
git clone https://github.com/johnsutor/llama-jarvis
cd llama-jarvis 
pip install -e . 
```
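
After installing, a quick way to confirm that the editable install is visible to your interpreter is to ask Python whether it can locate the package. This is a small sketch using only the standard library. `is_installed` is a hypothetical helper name, and `llama_jarvis` is the import name used in the examples in this README.

```python
import importlib.util


def is_installed(package: str) -> bool:
    """Return True if a top-level package can be located by the import system."""
    # find_spec returns None when no top-level module of that name is found.
    return importlib.util.find_spec(package) is not None


if __name__ == "__main__":
    if is_installed("llama_jarvis"):
        print("llama_jarvis is importable.")
    else:
        print("llama_jarvis was not found; check that `pip install -e .` succeeded.")
```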

### Phase One Loss
The following example computes the phase one loss (i.e., the loss used when training the first phase of Llama-Omni):
```py 
from llama_jarvis.model import JarvisModel, JarvisConfig, JarvisProcessor

BASE_LLM = "meta-llama/Llama-3.2-1B"
SEAMLESS_MODEL = "facebook/hf-seamless-m4t-medium"
LANGUAGE = "eng"

jarvis_config = JarvisConfig(
    BASE_LLM,
    SEAMLESS_MODEL
)
jarvis_model = JarvisModel(jarvis_config)
jarvis_processor = JarvisProcessor(
    BASE_LLM,
    SEAMLESS_MODEL
)

# Prepare the instruction, input text, and target label as model inputs
inputs = jarvis_processor(
    instruction=["You are a language model who should respond to my speech"],
    text=["What is two plus two?"],
    label=["Two plus two is four"],
    src_lang=LANGUAGE,
    return_tensors="pt",
    padding=True
)

# Run the forward pass
outputs = jarvis_model.forward(
    **inputs,
    tgt_lang=LANGUAGE
)

print(outputs.loss)
```

### Phase Two Loss
The following example computes the phase two loss (i.e., the loss used when training the second phase of Llama-Omni):
```py 
from llama_jarvis.model import JarvisModel, JarvisConfig, JarvisProcessor

BASE_LLM = "meta-llama/Llama-3.2-1B"
SEAMLESS_MODEL = "facebook/hf-seamless-m4t-medium"
LANGUAGE = "eng"

jarvis_config = JarvisConfig(
    BASE_LLM,
    SEAMLESS_MODEL
)
jarvis_model = JarvisModel(jarvis_config)
jarvis_processor = JarvisProcessor(
    BASE_LLM,
    SEAMLESS_MODEL
)

# Prepare the instruction, input text, and target label as model inputs
inputs = jarvis_processor(
    instruction=["You are a language model who should respond to my speech"],
    text=["What is two plus two?"],
    label=["Two plus two is four"],
    src_lang=LANGUAGE,
    return_tensors="pt",
    padding=True
)

# Run the forward pass, selecting the second training phase
outputs = jarvis_model.forward(
    **inputs,
    tgt_lang=LANGUAGE,
    train_phase=2
)

print(outputs.loss)
```

## Roadmap
- [x] Release the code on PyPi 
- [ ] Train a baseline model using Llama 3.2 1B and Seamless Medium
- [ ] Provide training example code 
- [ ] Fully document the code 
- [ ] Create an inference script for the model
- [ ] Write thorough tests for the code, and test with a multitude of open-source models 

## Other Cool Libraries 
We take a lot of inspiration from some other nice open-source libraries out there. Shoutout to 
- [SLAM-LLM](https://github.com/X-LANCE/SLAM-LLM?tab=readme-ov-file)
- [CosyVoice](https://github.com/FunAudioLLM/CosyVoice)
- [Llama-Omni](https://github.com/ictnlp/LLaMA-Omni?tab=readme-ov-file)