Metadata-Version: 2.1
Name: ml_determination
Version: 0.1.0
Summary: A package for determining the matrix language in bilingual sentences.
Author: Olga Iakovenko
Classifier: Programming Language :: Python :: 3
Description-Content-Type: text/markdown
License-File: LICENSE
Requires-Dist: spacy
Requires-Dist: numpy
Requires-Dist: torch
Requires-Dist: transformers

# ml_determination
A package for determining the matrix language in bilingual sentences. It implements the algorithms presented in the paper "Methods of Automatic Matrix Language Determination for Code-Switched Speech".
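
For readers new to the term, the following toy sketch illustrates what "matrix language" means: the language that supplies the grammatical frame of a code-switched sentence. It is not the package's API and not the algorithm from the paper. The function name `toy_matrix_language` and the tiny function-word lists are hypothetical, chosen only for illustration, and the heuristic simply counts which language contributes more function words.

```python
# Toy illustration only: NOT the ml_determination API or the paper's method.
# The word lists below are hypothetical and deliberately tiny.
FUNCTION_WORDS = {
    "en": {"the", "is", "to", "of", "and", "my"},
    "es": {"el", "la", "es", "de", "y", "que", "mi"},
}


def toy_matrix_language(sentence):
    """Guess the matrix language by counting function words per language.

    Returns a language code, or None when there is no clear winner.
    """
    tokens = [tok.strip(".,!?¿¡").lower() for tok in sentence.split()]
    counts = {
        lang: sum(1 for tok in tokens if tok in words)
        for lang, words in FUNCTION_WORDS.items()
    }
    best = max(counts.values())
    winners = [lang for lang, n in counts.items() if n == best]
    # A tie (including zero matches everywhere) is treated as undecided.
    if best == 0 or len(winners) != 1:
        return None
    return winners[0]


if __name__ == "__main__":
    print(toy_matrix_language("Ayer fui a la tienda y compré el new phone"))
    print(toy_matrix_language("I went to the mercado"))
```

In the first sentence the Spanish function words ("la", "y", "el") frame the English insertion "new phone", so the sketch picks Spanish even though the sentence contains English content words.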

## Installation

The package can be installed into your Python environment with pip:

```
pip install ml_determination
```
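
To confirm that the installation succeeded, one option is to query the installed distribution's version with the standard library. This is a minimal sketch: the helper name `installed_version` is hypothetical, and it assumes the distribution is registered under the name `ml_determination`, as in the pip command above.

```python
from importlib.metadata import PackageNotFoundError, version


def installed_version(dist_name):
    """Return the installed version string of a distribution, or None if absent."""
    try:
        return version(dist_name)
    except PackageNotFoundError:
        return None


if __name__ == "__main__":
    # Prints the version string if the package is installed, otherwise None.
    print(installed_version("ml_determination"))
```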

## Usage

To predict the matrix language, import the library and the matrix language determination classes for text:

```

```

## Citation
If you use ml_determination in your projects, please cite the original EMNLP 2024 paper:

```
@inproceedings{iakovenko-hain-2024-methods,
    title = "Methods of Automatic Matrix Language Determination for Code-Switched Speech",
    author = "Iakovenko, Olga  and
      Hain, Thomas",
    editor = "Al-Onaizan, Yaser  and
      Bansal, Mohit  and
      Chen, Yun-Nung",
    booktitle = "Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing",
    month = nov,
    year = "2024",
    address = "Miami, Florida, USA",
    publisher = "Association for Computational Linguistics",
    url = "https://aclanthology.org/2024.emnlp-main.330/",
    doi = "10.18653/v1/2024.emnlp-main.330",
    pages = "5791--5800",
    abstract = "Code-switching (CS) is the process of speakers interchanging between two or more languages which in the modern world becomes increasingly common. In order to better describe CS speech the Matrix Language Frame (MLF) theory introduces the concept of a Matrix Language, which is the language that provides the grammatical structure for a CS utterance. In this work the MLF theory was used to develop systems for Matrix Language Identity (MLID) determination. The MLID of English/Mandarin and English/Spanish CS text and speech was compared to acoustic language identity (LID), which is a typical way to identify a language in monolingual utterances. MLID predictors from audio show higher correlation with the textual principles than LID in all cases while also outperforming LID in an MLID recognition task based on F1 macro (60{\%}) and correlation score (0.38). This novel approach has identified that non-English languages (Mandarin and Spanish) are preferred over the English language as the ML contrary to the monolingual choice of LID."
}
```
