Metadata-Version: 2.4
Name: conette
Version: 0.4.0
Summary: CoNeTTE is an audio captioning system that generates a short textual description of the sound events in any audio file.
Author-email: "Etienne Labbé (Labbeti)" <labbeti.pub@gmail.com>
Maintainer-email: "Etienne Labbé (Labbeti)" <labbeti.pub@gmail.com>
Project-URL: Repository, https://github.com/Labbeti/conette-audio-captioning.git
Project-URL: Changelog, https://github.com/Labbeti/conette-audio-captioning/blob/main/CHANGELOG.md
Keywords: audio,deep-learning,pytorch,captioning,audio-captioning
Classifier: Intended Audience :: Science/Research
Classifier: Programming Language :: Python :: 3
Classifier: Programming Language :: Python :: 3.10
Classifier: Topic :: Scientific/Engineering
Classifier: Topic :: Scientific/Engineering :: Artificial Intelligence
Requires-Python: <3.11,>=3.10
Description-Content-Type: text/markdown
License-File: LICENSE
Requires-Dist: nltk==3.8.1
Requires-Dist: omegaconf==2.3.0
Requires-Dist: pip>=25.3
Requires-Dist: pytorch-lightning==1.9.5
Requires-Dist: pyyaml==6.0.1
Requires-Dist: setuptools==67.7.2
Requires-Dist: spacy==3.7.2
Requires-Dist: tensorboard==2.15.1
Requires-Dist: torch==1.13.1
Requires-Dist: torchaudio==0.13.1
Requires-Dist: torchlibrosa==0.1.0
Requires-Dist: torchoutil[extras]~=0.3.0
Requires-Dist: transformers==4.30.2
Provides-Extra: dev
Requires-Dist: black==23.12.1; extra == "dev"
Requires-Dist: flake8==6.1.0; extra == "dev"
Requires-Dist: ipykernel==6.27.1; extra == "dev"
Requires-Dist: pre-commit==3.7.0; extra == "dev"
Requires-Dist: pytest==7.4.3; extra == "dev"
Requires-Dist: ruff>=0.14.3; extra == "dev"
Requires-Dist: twine>=4.0.1; extra == "dev"
Provides-Extra: train
Requires-Dist: aac-datasets==0.4.1; extra == "train"
Requires-Dist: aac-metrics==0.5.4; extra == "train"
Requires-Dist: absl-py==2.0.0; extra == "train"
Requires-Dist: aiohttp==3.9.1; extra == "train"
Requires-Dist: aiosignal==1.3.1; extra == "train"
Requires-Dist: alembic==1.13.0; extra == "train"
Requires-Dist: antlr4-python3-runtime==4.9.3; extra == "train"
Requires-Dist: anyio==4.1.0; extra == "train"
Requires-Dist: argon2-cffi==23.1.0; extra == "train"
Requires-Dist: argon2-cffi-bindings==21.2.0; extra == "train"
Requires-Dist: arrow==1.3.0; extra == "train"
Requires-Dist: asttokens==2.4.1; extra == "train"
Requires-Dist: async-lru==2.0.4; extra == "train"
Requires-Dist: async-timeout==4.0.3; extra == "train"
Requires-Dist: attrs==23.1.0; extra == "train"
Requires-Dist: audiomentations==0.34.1; extra == "train"
Requires-Dist: audioread==3.0.1; extra == "train"
Requires-Dist: autopage==0.5.2; extra == "train"
Requires-Dist: babel==2.13.1; extra == "train"
Requires-Dist: beautifulsoup4==4.12.2; extra == "train"
Requires-Dist: bert-score==0.3.13; extra == "train"
Requires-Dist: black==23.12.1; extra == "train"
Requires-Dist: bleach==6.1.0; extra == "train"
Requires-Dist: blis==0.7.11; extra == "train"
Requires-Dist: bokeh==3.3.2; extra == "train"
Requires-Dist: brotli==1.1.0; extra == "train"
Requires-Dist: cachetools==5.3.2; extra == "train"
Requires-Dist: catalogue==2.0.10; extra == "train"
Requires-Dist: certifi==2023.11.17; extra == "train"
Requires-Dist: cffi==1.16.0; extra == "train"
Requires-Dist: charset-normalizer==3.3.2; extra == "train"
Requires-Dist: click==8.1.7; extra == "train"
Requires-Dist: cliff==4.4.0; extra == "train"
Requires-Dist: cloudpathlib==0.16.0; extra == "train"
Requires-Dist: cmaes==0.10.0; extra == "train"
Requires-Dist: cmd2==2.4.3; extra == "train"
Requires-Dist: colorlog==6.8.0; extra == "train"
Requires-Dist: comm==0.2.0; extra == "train"
Requires-Dist: confection==0.1.4; extra == "train"
Requires-Dist: contourpy==1.2.0; extra == "train"
Requires-Dist: cycler==0.12.1; extra == "train"
Requires-Dist: cymem==2.0.8; extra == "train"
Requires-Dist: daal==2024.0.1; extra == "train"
Requires-Dist: daal4py==2024.0.1; extra == "train"
Requires-Dist: debugpy==1.8.0; extra == "train"
Requires-Dist: decorator==5.1.1; extra == "train"
Requires-Dist: deepspeed==0.9.5; extra == "train"
Requires-Dist: defusedxml==0.7.1; extra == "train"
Requires-Dist: exceptiongroup==1.2.0; extra == "train"
Requires-Dist: executing==2.0.1; extra == "train"
Requires-Dist: fastjsonschema==2.19.0; extra == "train"
Requires-Dist: filelock==3.13.1; extra == "train"
Requires-Dist: flake8==6.1.0; extra == "train"
Requires-Dist: fonttools==4.46.0; extra == "train"
Requires-Dist: fqdn==1.5.1; extra == "train"
Requires-Dist: frozenlist==1.4.0; extra == "train"
Requires-Dist: fsspec==2023.12.2; extra == "train"
Requires-Dist: gensim==4.3.2; extra == "train"
Requires-Dist: google-auth==2.25.2; extra == "train"
Requires-Dist: google-auth-oauthlib==1.1.0; extra == "train"
Requires-Dist: greenlet==3.0.2; extra == "train"
Requires-Dist: grpcio==1.60.0; extra == "train"
Requires-Dist: h5py==3.10.0; extra == "train"
Requires-Dist: hjson==3.1.0; extra == "train"
Requires-Dist: huggingface-hub==0.19.4; extra == "train"
Requires-Dist: hydra-colorlog==1.2.0; extra == "train"
Requires-Dist: hydra-core==1.3.2; extra == "train"
Requires-Dist: hydra-optuna-sweeper==1.2.0; extra == "train"
Requires-Dist: idna==3.6; extra == "train"
Requires-Dist: imageio==2.33.1; extra == "train"
Requires-Dist: importlib-metadata==7.0.0; extra == "train"
Requires-Dist: inflate64==1.0.0; extra == "train"
Requires-Dist: iniconfig==2.0.0; extra == "train"
Requires-Dist: intel-extension-for-pytorch==2.1.0; extra == "train"
Requires-Dist: ipykernel==6.27.1; extra == "train"
Requires-Dist: ipython==8.18.1; extra == "train"
Requires-Dist: ipywidgets==8.1.1; extra == "train"
Requires-Dist: isoduration==20.11.0; extra == "train"
Requires-Dist: jedi==0.19.1; extra == "train"
Requires-Dist: jinja2==3.1.2; extra == "train"
Requires-Dist: joblib==1.3.2; extra == "train"
Requires-Dist: json5==0.9.14; extra == "train"
Requires-Dist: jsonpointer==2.4; extra == "train"
Requires-Dist: jsonschema==4.20.0; extra == "train"
Requires-Dist: jsonschema-specifications==2023.11.2; extra == "train"
Requires-Dist: julius==0.2.7; extra == "train"
Requires-Dist: jupyter==1.0.0; extra == "train"
Requires-Dist: jupyter-client==8.6.0; extra == "train"
Requires-Dist: jupyter-console==6.6.3; extra == "train"
Requires-Dist: jupyter-core==5.5.0; extra == "train"
Requires-Dist: jupyter-events==0.9.0; extra == "train"
Requires-Dist: jupyter-lsp==2.2.1; extra == "train"
Requires-Dist: jupyter-server==2.12.1; extra == "train"
Requires-Dist: jupyter-server-terminals==0.5.0; extra == "train"
Requires-Dist: jupyterlab==4.0.9; extra == "train"
Requires-Dist: jupyterlab-pygments==0.3.0; extra == "train"
Requires-Dist: jupyterlab-server==2.25.2; extra == "train"
Requires-Dist: jupyterlab-widgets==3.0.9; extra == "train"
Requires-Dist: kiwisolver==1.4.5; extra == "train"
Requires-Dist: langcodes==3.3.0; extra == "train"
Requires-Dist: language-tool-python==2.7.1; extra == "train"
Requires-Dist: lazy-loader==0.3; extra == "train"
Requires-Dist: librosa==0.10.1; extra == "train"
Requires-Dist: lightning-utilities==0.10.0; extra == "train"
Requires-Dist: llvmlite==0.41.1; extra == "train"
Requires-Dist: mako==1.3.0; extra == "train"
Requires-Dist: markdown==3.5.1; extra == "train"
Requires-Dist: markupsafe==2.1.3; extra == "train"
Requires-Dist: matplotlib==3.8.2; extra == "train"
Requires-Dist: matplotlib-inline==0.1.6; extra == "train"
Requires-Dist: mccabe==0.7.0; extra == "train"
Requires-Dist: mistune==3.0.2; extra == "train"
Requires-Dist: msgpack==1.0.7; extra == "train"
Requires-Dist: multidict==6.0.4; extra == "train"
Requires-Dist: multivolumefile==0.2.3; extra == "train"
Requires-Dist: murmurhash==1.0.10; extra == "train"
Requires-Dist: mypy-extensions==1.0.0; extra == "train"
Requires-Dist: nbclient==0.9.0; extra == "train"
Requires-Dist: nbconvert==7.12.0; extra == "train"
Requires-Dist: nbformat==5.9.2; extra == "train"
Requires-Dist: nest-asyncio==1.5.8; extra == "train"
Requires-Dist: networkx==3.2.1; extra == "train"
Requires-Dist: ninja==1.11.1.1; extra == "train"
Requires-Dist: nltk==3.8.1; extra == "train"
Requires-Dist: nnaudio==0.3.2; extra == "train"
Requires-Dist: notebook==7.0.6; extra == "train"
Requires-Dist: notebook-shim==0.2.3; extra == "train"
Requires-Dist: numba==0.58.1; extra == "train"
Requires-Dist: numpy==1.26.2; extra == "train"
Requires-Dist: nvidia-cublas-cu11==11.10.3.66; extra == "train"
Requires-Dist: nvidia-cuda-nvrtc-cu11==11.7.99; extra == "train"
Requires-Dist: nvidia-cuda-runtime-cu11==11.7.99; extra == "train"
Requires-Dist: nvidia-cudnn-cu11==8.5.0.96; extra == "train"
Requires-Dist: oauthlib==3.2.2; extra == "train"
Requires-Dist: omegaconf==2.3.0; extra == "train"
Requires-Dist: optuna==2.10.1; extra == "train"
Requires-Dist: overrides==7.4.0; extra == "train"
Requires-Dist: packaging==23.2; extra == "train"
Requires-Dist: pandas==2.1.4; extra == "train"
Requires-Dist: pandocfilters==1.5.0; extra == "train"
Requires-Dist: parso==0.8.3; extra == "train"
Requires-Dist: pathspec==0.12.1; extra == "train"
Requires-Dist: pbr==6.0.0; extra == "train"
Requires-Dist: pexpect==4.9.0; extra == "train"
Requires-Dist: pillow==10.1.0; extra == "train"
Requires-Dist: platformdirs==4.1.0; extra == "train"
Requires-Dist: pluggy==1.3.0; extra == "train"
Requires-Dist: pooch==1.8.0; extra == "train"
Requires-Dist: preshed==3.0.9; extra == "train"
Requires-Dist: prettytable==3.9.0; extra == "train"
Requires-Dist: prometheus-client==0.19.0; extra == "train"
Requires-Dist: prompt-toolkit==3.0.41; extra == "train"
Requires-Dist: protobuf==4.23.4; extra == "train"
Requires-Dist: psutil==5.9.6; extra == "train"
Requires-Dist: ptyprocess==0.7.0; extra == "train"
Requires-Dist: pure-eval==0.2.2; extra == "train"
Requires-Dist: py-cpuinfo==9.0.0; extra == "train"
Requires-Dist: py7zr==0.20.8; extra == "train"
Requires-Dist: pyasn1==0.5.1; extra == "train"
Requires-Dist: pyasn1-modules==0.3.0; extra == "train"
Requires-Dist: pybcj==1.0.2; extra == "train"
Requires-Dist: pycodestyle==2.11.1; extra == "train"
Requires-Dist: pycparser==2.21; extra == "train"
Requires-Dist: pycryptodomex==3.19.0; extra == "train"
Requires-Dist: pydantic==1.10.13; extra == "train"
Requires-Dist: pyemd==1.0.0; extra == "train"
Requires-Dist: pyflakes==3.1.0; extra == "train"
Requires-Dist: pygments==2.17.2; extra == "train"
Requires-Dist: pyparsing==3.1.1; extra == "train"
Requires-Dist: pyperclip==1.8.2; extra == "train"
Requires-Dist: pyppmd==1.1.0; extra == "train"
Requires-Dist: pytest==7.4.3; extra == "train"
Requires-Dist: python-dateutil==2.8.2; extra == "train"
Requires-Dist: python-json-logger==2.0.7; extra == "train"
Requires-Dist: pytorch-lightning==1.9.5; extra == "train"
Requires-Dist: pytorch-ranger==0.1.1; extra == "train"
Requires-Dist: pytz==2023.3.post1; extra == "train"
Requires-Dist: pyyaml==6.0.1; extra == "train"
Requires-Dist: pyzmq==25.1.2; extra == "train"
Requires-Dist: pyzstd==0.15.9; extra == "train"
Requires-Dist: qtconsole==5.5.1; extra == "train"
Requires-Dist: qtpy==2.4.1; extra == "train"
Requires-Dist: referencing==0.32.0; extra == "train"
Requires-Dist: regex==2023.10.3; extra == "train"
Requires-Dist: requests==2.31.0; extra == "train"
Requires-Dist: requests-oauthlib==1.3.1; extra == "train"
Requires-Dist: resampy==0.2.2; extra == "train"
Requires-Dist: rfc3339-validator==0.1.4; extra == "train"
Requires-Dist: rfc3986-validator==0.1.1; extra == "train"
Requires-Dist: rpds-py==0.13.2; extra == "train"
Requires-Dist: rsa==4.9; extra == "train"
Requires-Dist: safetensors==0.4.1; extra == "train"
Requires-Dist: scikit-image==0.22.0; extra == "train"
Requires-Dist: scikit-learn==1.3.2; extra == "train"
Requires-Dist: scikit-learn-intelex==2024.0.1; extra == "train"
Requires-Dist: scipy==1.11.4; extra == "train"
Requires-Dist: send2trash==1.8.2; extra == "train"
Requires-Dist: sentence-transformers==2.2.2; extra == "train"
Requires-Dist: sentencepiece==0.1.99; extra == "train"
Requires-Dist: six==1.16.0; extra == "train"
Requires-Dist: smart-open==6.4.0; extra == "train"
Requires-Dist: sniffio==1.3.0; extra == "train"
Requires-Dist: soundfile==0.12.1; extra == "train"
Requires-Dist: soupsieve==2.5; extra == "train"
Requires-Dist: soxr==0.3.7; extra == "train"
Requires-Dist: spacy==3.7.2; extra == "train"
Requires-Dist: spacy-legacy==3.0.12; extra == "train"
Requires-Dist: spacy-loggers==1.0.5; extra == "train"
Requires-Dist: sqlalchemy==2.0.23; extra == "train"
Requires-Dist: srsly==2.4.8; extra == "train"
Requires-Dist: stack-data==0.6.3; extra == "train"
Requires-Dist: stevedore==5.1.0; extra == "train"
Requires-Dist: tbb==2021.11.0; extra == "train"
Requires-Dist: tensorboard==2.15.1; extra == "train"
Requires-Dist: tensorboard-data-server==0.7.2; extra == "train"
Requires-Dist: terminado==0.18.0; extra == "train"
Requires-Dist: texttable==1.7.0; extra == "train"
Requires-Dist: thinc==8.2.1; extra == "train"
Requires-Dist: threadpoolctl==3.2.0; extra == "train"
Requires-Dist: tifffile==2023.12.9; extra == "train"
Requires-Dist: timm==0.9.12; extra == "train"
Requires-Dist: tinycss2==1.2.1; extra == "train"
Requires-Dist: tokenizers==0.13.3; extra == "train"
Requires-Dist: tomli==2.0.1; extra == "train"
Requires-Dist: torch==1.13.1; extra == "train"
Requires-Dist: torch-optimizer==0.3.0; extra == "train"
Requires-Dist: torchaudio==0.13.1; extra == "train"
Requires-Dist: torchlibrosa==0.1.0; extra == "train"
Requires-Dist: torchmetrics==1.2.1; extra == "train"
Requires-Dist: torchopenl3==1.0.1; extra == "train"
Requires-Dist: torchtext==0.14.1; extra == "train"
Requires-Dist: torchvision==0.14.1; extra == "train"
Requires-Dist: tornado==6.4; extra == "train"
Requires-Dist: tqdm==4.66.1; extra == "train"
Requires-Dist: traitlets==5.14.0; extra == "train"
Requires-Dist: transformers==4.30.2; extra == "train"
Requires-Dist: typer==0.9.0; extra == "train"
Requires-Dist: types-python-dateutil==2.8.19.14; extra == "train"
Requires-Dist: typing-extensions==4.9.0; extra == "train"
Requires-Dist: tzdata==2023.3; extra == "train"
Requires-Dist: uri-template==1.3.0; extra == "train"
Requires-Dist: urllib3==2.1.0; extra == "train"
Requires-Dist: wasabi==1.1.2; extra == "train"
Requires-Dist: wcwidth==0.2.12; extra == "train"
Requires-Dist: weasel==0.3.4; extra == "train"
Requires-Dist: webcolors==1.13; extra == "train"
Requires-Dist: webencodings==0.5.1; extra == "train"
Requires-Dist: websocket-client==1.7.0; extra == "train"
Requires-Dist: werkzeug==3.0.1; extra == "train"
Requires-Dist: widgetsnbextension==4.0.9; extra == "train"
Requires-Dist: xyzservices==2023.10.1; extra == "train"
Requires-Dist: yarl==1.9.4; extra == "train"
Requires-Dist: zipp==3.17.0; extra == "train"
Provides-Extra: test
Dynamic: license-file

<div align="center">

# CoNeTTE model for Audio Captioning

[![](<https://img.shields.io/badge/-Python 3.10-blue?style=for-the-badge&logo=python&logoColor=white>)](https://www.python.org/)
[![](<https://img.shields.io/badge/-PyTorch 1.10.1+-ee4c2c?style=for-the-badge&logo=pytorch&logoColor=white>)](https://pytorch.org/get-started/locally/)
[![](https://img.shields.io/badge/code%20style-black-black.svg?style=for-the-badge&labelColor=gray)](https://black.readthedocs.io/en/stable/)
[![](https://img.shields.io/github/actions/workflow/status/Labbeti/conette-audio-captioning/inference.yaml?branch=main&style=for-the-badge&logo=github)](https://github.com/Labbeti/conette-audio-captioning/actions)

</div>

CoNeTTE is an audio captioning system that generates a short textual description of the sound events in any audio file. The architecture and training are explained in the [corresponding paper on IEEE](https://ieeexplore.ieee.org/document/10603439) (an older pre-print version is also available on [arXiv](https://arxiv.org/pdf/2309.00454.pdf)). The model was developed by me ([Étienne Labbé](https://labbeti.github.io/) :) ) during my PhD. A simple interface to test CoNeTTE is available on the [HuggingFace website](https://huggingface.co/spaces/Labbeti/conette).

## Training
### Requirements
- Intended for Ubuntu 20.04 only. Requires **java** < 1.13, **ffmpeg**, **yt-dlp**, and **zip** commands.
- Recommended GPU: NVIDIA V100 with 32GB VRAM.
- The WavCaps dataset might require more than 2 TB of disk storage. The other datasets require less than 50 GB.

### Installation
By default, **only the pip inference requirements are installed for conette**. To install training requirements you need to use the following command:
```bash
python -m pip install conette[train]
```
If you have already installed conette for inference, it is **highly recommended to create another environment** before installing conette for training.

### Download external models and data
These steps might take a while (a few hours to download and prepare everything, depending on your CPU, GPU and SSD/HDD).

First, download the ConvNeXt, NLTK and spaCy models:
```bash
conette-prepare data=none default=true pack_to_hdf=false csum_in_hdf_name=false pann=false
```

Then download the 4 datasets used to train CoNeTTE:
```bash
common_args="data.download=true pack_to_hdf=true audio_t=resample_mean_convnext audio_t.pretrain_path=cnext_bl_75 post_hdf_name=bl pretag=cnext_bl_75"

conette-prepare data=audiocaps audio_t.src_sr=32000 ${common_args}
conette-prepare data=clotho audio_t.src_sr=44100 ${common_args}
conette-prepare data=macs audio_t.src_sr=48000 ${common_args}
conette-prepare data=wavcaps audio_t.src_sr=32000 ${common_args} datafilter.min_audio_size=0.1 datafilter.max_audio_size=30.0 datafilter.sr=32000
```
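
The four commands above differ only in the dataset name, the source sample rate, and the extra WavCaps filters. As a rough sketch, the same commands can be generated from a small table in Python. The helper name `build_prepare_commands` is hypothetical (it is not part of conette), and the script only prints the commands rather than running them.

```python
import shlex

# Arguments shared by every dataset, copied from the shell snippet above.
COMMON_ARGS = [
    "data.download=true",
    "pack_to_hdf=true",
    "audio_t=resample_mean_convnext",
    "audio_t.pretrain_path=cnext_bl_75",
    "post_hdf_name=bl",
    "pretag=cnext_bl_75",
]

# Dataset name -> (source sample rate in Hz, extra dataset-specific arguments).
DATASETS = {
    "audiocaps": (32000, []),
    "clotho": (44100, []),
    "macs": (48000, []),
    "wavcaps": (
        32000,
        [
            "datafilter.min_audio_size=0.1",
            "datafilter.max_audio_size=30.0",
            "datafilter.sr=32000",
        ],
    ),
}


def build_prepare_commands():
    """Return one conette-prepare argument list per dataset (hypothetical helper)."""
    commands = []
    for name, (src_sr, extra) in DATASETS.items():
        cmd = ["conette-prepare", f"data={name}", f"audio_t.src_sr={src_sr}"]
        commands.append(cmd + COMMON_ARGS + extra)
    return commands


if __name__ == "__main__":
    # Print the commands so they can be reviewed before being run by hand.
    for cmd in build_prepare_commands():
        print(shlex.join(cmd))
```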

### Train a model
CNext-trans (baseline) on CL only (~3 hours on 1 GPU V100-32G)
```bash
conette-train expt=[clotho_cnext_bl] pl=baseline
```

CoNeTTE on AC+CL+MA+WC, specialized for CL (~4 hours on 1 GPU V100-32G)
```bash
conette-train expt=[camw_cnext_bl_for_c,task_ds_src_camw] pl=conette
```

CoNeTTE on AC+CL+MA+WC, specialized for AC (~3 hours on 1 GPU V100-32G)
```bash
conette-train expt=[camw_cnext_bl_for_a,task_ds_src_camw] pl=conette
```

- Note 1: any training using AC data cannot be exactly reproduced, because part of this data has been deleted from the YouTube source and I cannot share my own audio files.
- Note 2: paper results are scores averaged over 5 seeds (1234-1238). The default training only uses seed 1234.
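
To compare your own runs with the paper, scores can be averaged over the five seeds. The snippet below is only a minimal sketch of that averaging step: the function name `average_over_seeds` and the metric key `"spider"` are assumptions chosen for illustration, and how each training run is given its seed is not covered here.

```python
from statistics import mean

SEEDS = range(1234, 1239)  # seeds 1234 to 1238, as stated in Note 2


def average_over_seeds(scores_by_seed):
    """Average each metric over all seeds.

    scores_by_seed maps a seed to a dict of metric name -> score.
    The metric names are taken from the first seed's dict.
    """
    metric_names = next(iter(scores_by_seed.values())).keys()
    return {
        name: mean(scores[name] for scores in scores_by_seed.values())
        for name in metric_names
    }


if __name__ == "__main__":
    # Placeholder values only, NOT real results: replace with your own scores.
    placeholder = {seed: {"spider": 0.0} for seed in SEEDS}
    print(average_over_seeds(placeholder))
```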

## Inference only (without training)

### Installation
```bash
python -m pip install conette[test]
```

### Usage with command line
Simply use the `conette-predict` command with the `--audio PATH1 PATH2 ...` option. You can also export the results to a CSV file using `--csv_export PATH`.

```bash
conette-predict --audio "/your/path/to/audio.wav"
```
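
If the command needs to be launched from a Python script, the argument list can be assembled as in the sketch below. It only uses the two options mentioned above (`--audio` and `--csv_export`); the helper name `build_predict_command` and the file names are hypothetical.

```python
import shlex


def build_predict_command(audio_paths, csv_path=None):
    """Build the argument list for conette-predict (hypothetical helper)."""
    cmd = ["conette-predict", "--audio", *audio_paths]
    if csv_path is not None:
        cmd += ["--csv_export", csv_path]
    return cmd


if __name__ == "__main__":
    cmd = build_predict_command(["audio_1.wav", "audio_2.wav"], "captions.csv")
    print(shlex.join(cmd))
    # To actually run it (requires conette to be installed):
    # import subprocess
    # subprocess.run(cmd, check=True)
```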

### Usage with python
```py
from conette import CoNeTTEConfig, CoNeTTEModel

config = CoNeTTEConfig.from_pretrained("Labbeti/conette")
model = CoNeTTEModel.from_pretrained("Labbeti/conette", config=config)

path = "/your/path/to/audio.wav"
outputs = model(path)
candidate = outputs["cands"][0]
print(candidate)
```

The model can also accept several audio files at the same time (list[str]), or a list of pre-loaded audio files (list[Tensor]). In the second case, you also need to provide the sampling rate of these files:

```py
import torchaudio

path_1 = "/your/path/to/audio_1.wav"
path_2 = "/your/path/to/audio_2.wav"

audio_1, sr_1 = torchaudio.load(path_1)
audio_2, sr_2 = torchaudio.load(path_2)

outputs = model([audio_1, audio_2], sr=[sr_1, sr_2])
candidates = outputs["cands"]
print(candidates)
```

The model can also produce different captions using a Task Embedding input, which indicates the dataset caption style. The default task is "clotho".

```py
outputs = model(path, task="clotho")
candidate = outputs["cands"][0]
print(candidate)

outputs = model(path, task="audiocaps")
candidate = outputs["cands"][0]
print(candidate)
```
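
To compare caption styles side by side, the two calls above can be wrapped in a small loop. This is a sketch rather than part of the conette API: `caption_per_task` is a hypothetical helper, and it only assumes that `model` is called as `model(path, task=...)` and returns a dict with a `"cands"` list, as in the examples above.

```python
def caption_per_task(model, path, tasks=("clotho", "audiocaps")):
    """Return a dict mapping each task name to the first candidate caption.

    `model` is expected to behave like the CoNeTTEModel shown above:
    model(path, task=...) returns a dict containing a "cands" list.
    """
    return {task: model(path, task=task)["cands"][0] for task in tasks}


if __name__ == "__main__":
    # Stand-in model so the sketch runs without downloading any weights.
    # Replace it with the CoNeTTEModel instance loaded earlier.
    def fake_model(path, task):
        return {"cands": [f"<caption of {path} in {task} style>"]}

    for task, caption in caption_per_task(fake_model, "audio.wav").items():
        print(f"{task}: {caption}")
```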

### Performance
The model has been trained on AudioCaps (AC), Clotho (CL), MACS (MA) and WavCaps (WC). The performance on the test subsets is:

| Test data | SPIDEr (%) | SPIDEr-FL (%) | FENSE (%) | Vocab | Outputs | Scores |
| ------------- | ------------- | ------------- | ------------- | ------------- | ------------- | ------------- |
| AC-test | 44.14 | 43.98 | 60.81 | 309 | [Link](https://github.com/Labbeti/conette-audio-captioning/blob/main/results/detailed_outputs/outputs_audiocaps_test.csv) | [Link](https://github.com/Labbeti/conette-audio-captioning/blob/main/results/detailed_outputs/scores_audiocaps_test.yaml) |
| CL-eval | 30.97 | 30.87 | 51.72 | 636 | [Link](https://github.com/Labbeti/conette-audio-captioning/blob/main/results/detailed_outputs/outputs_clotho_eval.csv) | [Link](https://github.com/Labbeti/conette-audio-captioning/blob/main/results/detailed_outputs/scores_clotho_eval.yaml) |

This model checkpoint has been trained with a focus on the Clotho dataset, but it can also reach good performance on AudioCaps with the "audiocaps" task.

### Limitations
- The model expects audio sampled at **32 kHz**. It automatically resamples input audio files up or down to this rate. However, this might give worse results, especially when using audio with lower sampling rates.
- The model has been trained on audio lasting from **1 to 30 seconds**. It can handle longer audio files, but it might require more memory and give worse results.
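
One possible workaround for long recordings is to caption them in chunks of at most 30 seconds. The sketch below only computes the chunk boundaries in samples; the helper `chunk_bounds` is hypothetical, and chunked captioning is an assumption made here for illustration, not a procedure described or evaluated in the paper.

```python
def chunk_bounds(num_samples, sr=32000, max_seconds=30.0):
    """Return (start, end) sample indices of consecutive chunks.

    Each chunk lasts at most max_seconds; the last one may be shorter.
    """
    if num_samples < 0 or sr <= 0 or max_seconds <= 0:
        raise ValueError("num_samples must be >= 0, sr and max_seconds must be > 0")
    # Chunk length in samples, at least one sample to keep the loop finite.
    chunk_len = max(1, int(sr * max_seconds))
    return [
        (start, min(start + chunk_len, num_samples))
        for start in range(0, num_samples, chunk_len)
    ]


if __name__ == "__main__":
    # A 70-second recording at 32 kHz is split into 30 s + 30 s + 10 s.
    for start, end in chunk_bounds(70 * 32000):
        print(start, end)
```

With a waveform loaded by `torchaudio.load` (shape `(channels, time)`), each chunk would be `audio[:, start:end]`, passed to the model together with its sampling rate. Each caption then describes only its own chunk.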

## Citation
The final version of the paper describing CoNeTTE is available on IEEE Xplore: https://ieeexplore.ieee.org/document/10603439. A preprint version of the paper is also available on arXiv: https://arxiv.org/pdf/2309.00454.pdf.

**Final version recommended for citation (IEEE):**
```bibtex
@article{labbe2023conetteieee,
	title        = {CoNeTTE: An Efficient Audio Captioning System Leveraging Multiple Datasets With Task Embedding},
	author       = {Labbé, Étienne and Pellegrini, Thomas and Pinquier, Julien},
	year         = 2024,
	journal      = {IEEE/ACM Transactions on Audio, Speech, and Language Processing},
	volume       = 32,
	number       = {},
	pages        = {3785--3794},
	doi          = {10.1109/TASLP.2024.3430813},
	url          = {https://ieeexplore.ieee.org/document/10603439},
	keywords     = {Decoding;Task analysis;Transformers;Training;Convolutional neural networks;Speech processing;Tagging;Audio-language task;automated audio captioning;dataset biases;task embedding;deep learning}
}
```

**Preprint version (arXiv):**
```bibtex
@misc{labbe2023conettearxiv,
	title        = {CoNeTTE: An efficient Audio Captioning system leveraging multiple datasets with Task Embedding},
	author       = {Étienne Labbé and Thomas Pellegrini and Julien Pinquier},
	year         = 2023,
	journal      = {arXiv preprint arXiv:2309.00454},
	url          = {https://arxiv.org/pdf/2309.00454.pdf},
	eprint       = {2309.00454},
	archiveprefix = {arXiv},
	primaryclass = {cs.SD}
}
```

## Additional information
- CoNeTTE stands for **Co**nv**Ne**Xt-**T**ransformer with **T**ask **E**mbedding.
- Raw model weights are available on HuggingFace: https://huggingface.co/Labbeti/conette
- The weights of the encoder part of the architecture are based on a ConvNeXt model for audio classification, available here: https://zenodo.org/records/10987498 under the filename "convnext_tiny_465mAP_BL_AC_75kit.pth".

## Contact
Maintainer:
- [Étienne Labbé](https://labbeti.github.io/) "Labbeti": labbeti.pub@gmail.com
