Metadata-Version: 2.1
Name: geoLid
Version: 1.0
Summary: Geographically-informed language identification
Home-page: https://github.com/jonathandunn/geoLid
Author: Jonathan Dunn
Author-email: jedunn@illinois.edu
License: GNU GENERAL PUBLIC LICENSE v3
Keywords: lid,language identification,geographic,geography
Description-Content-Type: text/markdown
License-File: LICENSE
Requires-Dist: clean-text
Requires-Dist: fasttext

# geoLid
Geographically-informed language identification

This Python package carries out language identification with geographic priors to increase performance for low-resource and under-represented languages.

A description and evaluation of this approach can be found here: https://jdunn.name/2024/03/13/geographically-informed-language-identification/

A complete list of language codes and names per regional model can be found in the *language_names* directory.

**Downloading models**

geoLid contains a baseline non-geographic model as well as models for 16 specific regions, as shown below:

    baseline (916 languages)
    africa_north (44 languages)
    africa_southern (58 languages)
    africa_sub (166 languages)
    america_brazil (88 languages)
    america_central (188 languages)
    america_north (68 languages)
    america_south (129 languages)
    asia_central (54 languages)
    asia_east (46 languages)
    asia_south (60 languages)
    asia_southeast (325 languages)
    europe_east (65 languages)
    europe_russia (65 languages)
    europe_west (108 languages)
    middle_east (53 languages)
    oceania (49 languages)

To download a model, call *download_model* from Python with the model name:

    from geoLid import download_model
    download_model("baseline")

The model name "all" will download all region-specific models.
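
When several regional models are needed, it can help to check the names before downloading. The sketch below is illustrative only: the table is copied from the list above, while *check_model_name* and *download_models* are hypothetical helpers that are not part of geoLid. It assumes *download_model* accepts a single model name, as in the example above.

```python
# Model names and language counts, copied from the list above.
MODEL_LANGUAGE_COUNTS = {
    "baseline": 916,
    "africa_north": 44,
    "africa_southern": 58,
    "africa_sub": 166,
    "america_brazil": 88,
    "america_central": 188,
    "america_north": 68,
    "america_south": 129,
    "asia_central": 54,
    "asia_east": 46,
    "asia_south": 60,
    "asia_southeast": 325,
    "europe_east": 65,
    "europe_russia": 65,
    "europe_west": 108,
    "middle_east": 53,
    "oceania": 49,
}


def check_model_name(name):
    """Return the name unchanged if it is listed above or is "all".

    Hypothetical helper, not part of geoLid.
    """
    if name != "all" and name not in MODEL_LANGUAGE_COUNTS:
        known = ", ".join(sorted(MODEL_LANGUAGE_COUNTS))
        raise ValueError(f"Unknown model {name!r}; expected 'all' or one of: {known}")
    return name


def download_models(names):
    """Validate every name first, then download each model in turn.

    Hypothetical helper, not part of geoLid.
    """
    checked = [check_model_name(name) for name in names]
    # Imported here so the table above can be used without geoLid installed.
    from geoLid import download_model
    for name in checked:
        download_model(name)
```

For example, `download_models(["baseline", "oceania"])` would fetch the baseline model and the Oceania model, and a misspelled name raises an error before anything is downloaded.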

**Usage**

To identify languages, initialize *geoLid* and call *predict*:

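    from geoLid import geoLid
    data = ["This is an example sentence."]  # one or more texts to classify
    lid = geoLid(model_location = "models")  # directory containing the LID models
    labels = lid.predict(data = data, region = "baseline")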

The *model_location* argument, given at initialization, points to the directory containing the LID models.

The *data* argument is a list of one or more strings, each of which is a text to make predictions about.

The *region* argument selects which region-specific model to use. The default is the non-geographic baseline model.
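
Because *data* must be a list, passing a bare string is an easy mistake. The wrapper below is a minimal sketch, not part of geoLid: *identify* is a hypothetical helper that normalizes the input and forwards it to *predict* using only the *data* and *region* arguments described above. It returns whatever *predict* returns and makes no assumption about the format of the labels.

```python
def identify(lid, texts, region="baseline"):
    """Call lid.predict on one or more texts.

    Hypothetical helper, not part of geoLid. `lid` is expected to be an
    initialized geoLid object; `texts` may be a single string or an
    iterable of strings.
    """
    # Wrap a bare string so that predict always receives a list.
    if isinstance(texts, str):
        texts = [texts]
    texts = list(texts)
    # The data list must contain at least one text.
    if not texts:
        raise ValueError("data must contain at least one string")
    return lid.predict(data=texts, region=region)
```

With `lid = geoLid(model_location = "models")`, a call such as `identify(lid, "Kia ora koutou", region = "oceania")` would use the Oceania model, assuming that model is present in the *models* directory.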
