Metadata-Version: 2.1
Name: lingualytics
Version: 0.1.2
Summary: A multilingual text analytics package.
Home-page: https://github.com/lingualytics/py-lingualytics
Author: Rohan Rajpal
Author-email: rohan46000@gmail.com
License: MIT
Project-URL: Documentation, https://github.com/lingualytics/py-lingualytics/blob/master/README.md
Project-URL: Source Code, https://github.com/lingualytics/py-lingualytics
Project-URL: Bug Tracker, https://github.com/lingualytics/py-lingualytics/issues
Keywords: text mining,text preprocessing,text representation,text visualization,codemix analytics
Platform: UNKNOWN
Classifier: Development Status :: 3 - Alpha
Classifier: License :: OSI Approved :: MIT License
Classifier: Intended Audience :: Developers
Classifier: Programming Language :: Python
Classifier: Natural Language :: English
Classifier: Topic :: Scientific/Engineering
Requires-Python: >=3.6.1
Description-Content-Type: text/markdown
Requires-Dist: torch (>=1.6)
Requires-Dist: numpy (>=1.17)
Requires-Dist: tabulate (>=0.8.7)
Requires-Dist: pandas (>=1.0.2)
Requires-Dist: transformers (>=3.0)
Requires-Dist: tqdm (>=4.48.2)
Requires-Dist: scikit-learn (>=0.22)
Requires-Dist: texthero (>=1.0)

# Lingualytics : Easy codemixed analytics

Lingualytics is a Python library for dealing with code mixed text.  
Lingualytics is powered by powerful libraries like [Pytorch](https://pytorch.org/), [Transformers](https://huggingface.co/transformers), [Texthero](https://texthero.org/), [NLTK](http://www.nltk.org/) and [Scikit-learn](https://scikit-learn.org/).

## Features

1. Preprocessing
    - Remove stopwords
    - Remove punctuations, with an option to add punctuations of your own language
    - Remove words less than a character limit

2. Representation
    - Find n-grams from given text

3. NLP
    - Classification using PyTorch
        - Train a classifier on your data to perform tasks like Sentiment Analysis
        - Evaluate the classifier with metrics like accuracy, f1 score, precision and recall
        - Use the trained tokenizer to tokenize text
    - Some pretrained Huggingface models trained on codemixed datasets you can use
        - [bert-base-multilingual-codemixed-cased-sentiment](https://huggingface.co/rohanrajpal/bert-base-multilingual-codemixed-cased-sentiment)

## Installation

Use the package manager [pip](https://pip.pypa.io/en/stable/) to install lingualytics.

```bash
pip install lingualytics
```

## Usage

### Preprocessing

```python
from lingualytics.preprocessing import remove_lessthan, remove_punctuation, remove_stopwords
from lingualytics.stopwords import hi_stopwords,en_stopwords
from texthero.preprocessing import remove_digits
import pandas as pd
df = pd.read_csv(
   "https://github.com/lingualytics/py-lingualytics/raw/master/datasets/SAIL_2017/Processed_Data/Devanagari/validation.txt", header=None, sep='\t', names=['text','label']
)
# pd.set_option('display.max_colwidth', None)
df['clean_text'] = df['text'].pipe(remove_digits) \
                                    .pipe(remove_punctuation) \
                                    .pipe(remove_lessthan,length=3) \
                                    .pipe(remove_stopwords,stopwords=en_stopwords.union(hi_stopwords))
print(df)
```

### Classification

The train data path should have 4 files
    - train.txt
    - validation.txt
    - test.txt

You can just download `datasets/SAIL_2017/Processed Data/Devanagari` from the Github repository to try this out.

```python
from lingualytics.learner import Learner

learner = Learner(data_dir='<path-to-train-data>',
                output_dir='<path-to-output-predictions-and-save-the-model>')
learner.fit()
```

### Find topmost n-grams

```python
from lingualytics.representation import get_ngrams
import pandas as pd
df = pd.read_csv(
   "https://github.com/jbesomi/texthero/raw/master/dataset/bbcsport.csv"
)

ngrams = get_ngrams(df['text'],n=2)

print(ngrams[:10])
```

## Contributing

Pull requests are welcome. For major changes, please open an issue first to discuss what you would like to change.

## License

[MIT](https://choosealicense.com/licenses/mit/)


