Metadata-Version: 2.1
Name: mahaNLP
Version: 0.9
Summary: An NLP Library for Marathi Language
Home-page: https://github.com/l3cube-pune/MarathiNLP.git
Author: L3Cube
Author-email: ravirajoshi@gmail.com
License: MIT
Classifier: Programming Language :: Python :: 3
Classifier: License :: OSI Approved :: MIT License
Classifier: Operating System :: OS Independent
Description-Content-Type: text/markdown
License-File: LICENSE
Requires-Dist: importlib-resources
Requires-Dist: huggingface-hub (==0.11.1)
Requires-Dist: tqdm
Requires-Dist: pandas
Requires-Dist: sentence-transformers
Requires-Dist: transformers
Requires-Dist: numpy
Requires-Dist: torch
Requires-Dist: IPython

# **mahaNLP**

- **mahaNLP** is a Python-based natural language processing library focused on the Indian language **Marathi**. It provides an easy interface to NLP features such as sentiment analysis, named entity recognition, and hate speech detection, exclusively for Marathi text.

- **L3Cube**, the author of this library, aims to bring Marathi to the forefront of IndicNLP. Our vision is to make Marathi a resource-rich language and promote AI for Maharashtra!

- [Github Repo](https://github.com/l3cube-pune/MarathiNLP)
- [Demonstration with examples](https://cutt.ly/f1FYQak)

## **Features:**

##### **This library is designed for use by both basic programmers and ML practitioners.**

---

#### **1. Basic Usage:**

This mode of access is designed from a basic programmer's point of view and offers a simpler way to perform the desired tasks. It provides the following features:

- **Datasets:** Provides the functionality to load the dataset

- **Autocomplete:** Text prediction

- **Preprocess:** Data cleaning

- **Tokenizer:** Tokenizes text

- **Tagger:** Named entity recognition

- **MaskFill:** Predicts the masked tokens

- **Hate:** Detects hate speech

- **Sentiment:** Sentiment analysis

- **Similarity:** Detects similarity

#### **2. Advanced Usage:**

This way of accessing the library is designed from an ML Practitioner's point of view and has more flexibility to choose a model for the desired task.

- **MaskFill Model:** Predicts the masked tokens

- **GPT Model:** Text prediction

- **Hate Model:** Detects hate speech

- **NER Model:** Named entity recognition

- **Sentiment Model:** Sentiment analysis

- **Similarity Model:** Detects similarity

Some of the mentioned models have sub-models within them, which can be listed using the **list_models()** function.

## **Installation:**

- **pip install mahaNLP==[version]**
  _E.g.: pip install mahaNLP==0.6_

- or, to install the latest release, simply use:
  **_pip install mahaNLP_**
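
After installing, it can be useful to confirm which version is actually present in the environment. The sketch below uses only the Python standard library; the helper name `installed_version` is hypothetical and is not part of mahaNLP.

```python
from importlib.metadata import PackageNotFoundError, version


def installed_version(dist_name):
    """Return the installed version string of a distribution, or None if absent."""
    try:
        return version(dist_name)
    except PackageNotFoundError:
        return None


if __name__ == "__main__":
    found = installed_version("mahaNLP")
    if found is None:
        print("mahaNLP is not installed; run: pip install mahaNLP")
    else:
        print(f"mahaNLP {found} is installed")
```

Returning `None` instead of raising keeps the check easy to use in setup scripts that need to decide whether to install the package.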

---

## **A Few Examples:**

### **1. MaskFill (from basic usage point of view)**

Stepwise execution:

- import
  from mahaNLP.mask_fill import MaskPredictor

- create an object
  model = MaskPredictor()

It provides one function:

- **predict_mask:** Predicts the masked token

* **Example:**

- _pass the string with the word to be predicted replaced with '[MASK]':_
  **text = 'मी महाराष्ट्रात [MASK].'**
  _English Translation:
  'I in Maharashtra [MASK]'_
- **model.predict_mask(text)**

- The output will contain some predictions like:

  - मी महाराष्ट्रात **आहे**.
  - मी महाराष्ट्रात **राहणार**.
  - मी महाराष्ट्रात **नाही**.
  - मी महाराष्ट्रात**च**.
  - मी महाराष्ट्रात **राहतो**.

- There are some optional parameters:

  - **details** (string: 'minimum', 'medium', 'all') - Default: 'minimum'
    - Sets how much detail the output includes
  - **as_dict** (boolean: True, False) - Default: False
    - Selects the output format; True gives a list of dictionaries

- Example:
  - model.predict_mask(text, 'all', True)
  - Output:
    [{'score': 0.46560075879096985, 'token': 1155, 'token_str': 'आहे', 'sequence': 'मी महाराष्ट्रात आहे.'},
    {'score': 0.07969045639038086, 'token': 92222, 'token_str': 'राहणार', 'sequence': 'मी महाराष्ट्रात राहणार.'},
    {'score': 0.07400081306695938, 'token': 1826, 'token_str': 'नाही', 'sequence': 'मी महाराष्ट्रात नाही.'},
    {'score': 0.050422605127096176, 'token': 1617, 'token_str': '##च', 'sequence': 'मी महाराष्ट्रातच.'},
    {'score': 0.04373728483915329, 'token': 62560, 'token_str': 'राहतो', 'sequence': 'मी महाराष्ट्रात राहतो.'}]
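
With `as_dict` set to True, the result shown above is a list of dictionaries, which is convenient to post-process. The following is a minimal sketch that works on a hard-coded copy of the first three entries of that sample output rather than calling the library, and it assumes the list is available as a Python object. The helper names `top_prediction` and `filter_by_score` are illustrative and are not part of mahaNLP.

```python
# Sample copied from the output above (first three entries only).
predictions = [
    {'score': 0.46560075879096985, 'token': 1155, 'token_str': 'आहे', 'sequence': 'मी महाराष्ट्रात आहे.'},
    {'score': 0.07969045639038086, 'token': 92222, 'token_str': 'राहणार', 'sequence': 'मी महाराष्ट्रात राहणार.'},
    {'score': 0.07400081306695938, 'token': 1826, 'token_str': 'नाही', 'sequence': 'मी महाराष्ट्रात नाही.'},
]


def top_prediction(preds):
    """Return the entry with the highest score, or None for an empty list."""
    return max(preds, key=lambda p: p['score'], default=None)


def filter_by_score(preds, min_score):
    """Keep only the entries whose score is at least min_score."""
    return [p for p in preds if p['score'] >= min_score]


best = top_prediction(predictions)
print(best['sequence'])

# Show only the reasonably confident candidates.
for p in filter_by_score(predictions, 0.075):
    print(p['token_str'], round(p['score'], 3))
```

Selecting by the `score` key rather than by list position avoids relying on the order in which the predictions are listed.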

### **2. Sentiment (from advanced usage point of view)**

Stepwise execution:

- import
  from mahaNLP.model_repo import SentimentModel

- create an object
  modelSentiment = SentimentModel()

- list the available models
  - modelSentiment.list_models()
  - Output:
    - sentiment models:
      - MarathiSentiment : l3cube-pune/MarathiSentiment
    - tagger models:
      - marathi-ner : l3cube-pune/marathi-ner
    - autocomplete models:
      - marathi-gpt : l3cube-pune/marathi-gpt
    - similarity models:
      - marathi-sentence-similarity-sbert : l3cube-pune/marathi-sentence-similarity-sbert
      - marathi-sentence-bert-nli : l3cube-pune/marathi-sentence-bert-nli
    - mask_fill models:
      - marathi-bert-v2 : l3cube-pune/marathi-bert-v2
      - marathi-roberta : l3cube-pune/marathi-roberta
      - marathi-albert : l3cube-pune/marathi-albert
    - hate models:
      - mahahate-bert : l3cube-pune/mahahate-bert
      - mahahate-multi-roberta : l3cube-pune/mahahate-multi-roberta

The library lists the models available for every task, not only for sentiment. The default model can be changed by the user.

**To change the default model:**
Pass the name of the model as the argument:
modelSentiment = SentimentModel('name of model')
Eg.: modelSentiment = SentimentModel('MarathiSentiment')
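
It can help to check a model name against the listing before passing it to the constructor. The sketch below hard-codes the sentiment and hate entries from the listing above into a plain dictionary. `KNOWN_MODELS` and `resolve_model` are hypothetical helpers written for illustration; they are not part of mahaNLP, and the library's own lookup may behave differently.

```python
# Short name -> Hugging Face repository id, copied from the listing above.
# This table is an illustration only, not the library's internal registry.
KNOWN_MODELS = {
    "sentiment": {
        "MarathiSentiment": "l3cube-pune/MarathiSentiment",
    },
    "hate": {
        "mahahate-bert": "l3cube-pune/mahahate-bert",
        "mahahate-multi-roberta": "l3cube-pune/mahahate-multi-roberta",
    },
}


def resolve_model(task, name):
    """Return the repository id for a listed model, or raise ValueError."""
    try:
        return KNOWN_MODELS[task][name]
    except KeyError:
        options = ", ".join(sorted(KNOWN_MODELS.get(task, {}))) or "none"
        raise ValueError(f"unknown {task} model {name!r}; listed: {options}") from None


print(resolve_model("sentiment", "MarathiSentiment"))
```

A failed lookup reports the names that are listed for the task, so a typo in the model name is easy to spot.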

- Sentiment provides one function:
  - **get_polarity_score:** Gives the polarity score of a sentence along with its label (Neutral, Positive, Negative)
  - Example:
    text = 'दिवाळीच्या सणादरम्यान सगळे आनंदी असतात.'
    _English Translation:
    'Everyone is happy during Diwali festival.'_
  - modelSentiment.get_polarity_score(text)
  - Output:
    label: Positive
    score: 0.995338
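
If the result is only available as printed text in the two-line form shown above, it can be turned into Python values for downstream use. This is a minimal sketch under that assumption: `parse_polarity` is a hypothetical helper, not part of mahaNLP, and whether `get_polarity_score` prints or returns its result is not specified here.

```python
def parse_polarity(output_text):
    """Parse 'label: X' and 'score: Y' lines into a (label, score) tuple."""
    fields = {}
    for line in output_text.splitlines():
        key, sep, value = line.partition(":")
        if sep:  # skip lines that do not have a 'key: value' shape
            fields[key.strip().lower()] = value.strip()
    return fields["label"], float(fields["score"])


# Text copied from the sample output above.
sample = "label: Positive\nscore: 0.995338"
label, score = parse_polarity(sample)
print(label, score)
```

The tuple can then be compared against a confidence threshold or stored alongside the input sentence.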

---

**The entire working of mahaNLP is explained in this [demo file](https://cutt.ly/f1FYQak). Please have a look at it to get a better idea!**

## Citing

```
@article{joshi2022l3cube_mahanlp,
  title={L3Cube-MahaNLP: Marathi Natural Language Processing Datasets, Models, and Library},
  author={Joshi, Raviraj},
  journal={arXiv preprint arXiv:2205.14728},
  year={2022}
}
```

Thank you<br>
Team L3Cube

---
