Metadata-Version: 2.1
Name: giganticode-langmodels
Version: 0.0.4a0
Summary: A toolkit for applying machine learning to large source code corpora
Home-page: http://github.com/giganticode/langmodels
Author: giganticode
Author-email: hlibbabii@gmail.com
License: MIT
Keywords: big large data source code corpus machine learning nlp pytorch torch fastai language modeling
Platform: UNKNOWN
Classifier: Development Status :: 3 - Alpha
Classifier: Environment :: Console
Classifier: Intended Audience :: Science/Research
Classifier: License :: OSI Approved :: MIT License
Classifier: Natural Language :: English
Classifier: Programming Language :: Python :: 3.7
Classifier: Operating System :: POSIX :: Linux
Classifier: Operating System :: MacOS :: MacOS X
Classifier: Operating System :: Microsoft :: Windows
Classifier: Topic :: Scientific/Engineering :: Artificial Intelligence
Requires-Python: >=3.6
Description-Content-Type: text/markdown
Requires-Dist: fastai (==1.0.57)
Requires-Dist: giganticode-dataprep (==1.0.0-alpha.12)
Requires-Dist: future (==0.18.2)
Requires-Dist: comet-ml (==3.0.2)
Requires-Dist: flatdict (==3.4.0)
Requires-Dist: retrying (==1.3.3)
Requires-Dist: psutil (==5.6.7)
Requires-Dist: tqdm (==4.39.0)
Requires-Dist: jsons (==1.0.0)
Requires-Dist: numpy (==1.17.4)
Requires-Dist: appdirs (==1.4.3)
Requires-Dist: Columnar (==1.3.1)
Requires-Dist: requests (==2.22.0)
Requires-Dist: pysftp (==0.2.9)
Requires-Dist: semver (==2.9.0)
Requires-Dist: jq (==0.1.6)

## **langmodels**

[![Build Status](https://travis-ci.org/giganticode/langmodels.svg?branch=master)](https://travis-ci.org/giganticode/langmodels)

This is a repository for **neural language models (LMs)** trained on a large corpus of source code 
and a toolkit to work with such models. 

You could be interested in using this library if you want to:
* Use existing pre-trained models for tasks such as autocompletion and bug prediction;
* Use the pre-trained models for transfer learning or further fine-tuning;
* Train a model from scratch, choosing from a wide range of corpus preprocessing options, 
 neural network (NN) architectures, and training options.

This project uses the [fastai](https://www.fast.ai) and 
[PyTorch](https://pytorch.org) libraries for NN training and inference. 
Corpus preprocessing is done with [giganticode-dataprep](https://github.com/giganticode/dataprep).

## Quick start

### Prerequisites

* Python 3.6 or later is required.

### Installation

```shell script
pip install giganticode-langmodels
```

OR to build from source:

```
git clone https://github.com/giganticode/langmodels
cd langmodels
python -m venv langmodels-venv
source langmodels-venv/bin/activate
pip install -r requirements.txt
```

## Using existing pre-trained models
### Loading a default pre-trained model
```python
>>> import langmodels.repository as repo
>>> trained_model = repo.load_default_model()

[langmodels.repository] INFO: Model is not found in cache. Downloading from https://www.inf.unibz.it/~hbabii/pretrained_models/langmodel-large-split_10k_2_1024_191007.112241_-_langmodel-large-split_10k_2_1024_191022.141344 ...
[langmodels.model] DEBUG: Loading model from: /home/hlib/.local/share/langmodels/0.0.1/modelzoo/langmodel-large-split_10k_2_1024_191007.112241_-_langmodel-large-split_10k_2_1024_191022.141344/best.pth ...
[langmodels.model] DEBUG: Using GPU for inference
```

### Other model loading options

**To see which models are available, call the `list_pretrained_models` function.**

Set the `cached` parameter to `True` (the default is `False`) to display only cached LMs (e.g. when offline).
```python
>>> import langmodels.repository as repo
>>> repo.list_pretrained_models(cached=False)

  ID                                                                    BPE_MERGES  LAYERS_CONFIG  ARCH      BIN_ENTROPY    TRAINING_TIME_MINUTES_PER_EPOCH  N_EPOCHS  BEST_EPOCH  TAGS                 

  langmodel-large-split_10k_2_1024_191007.112241_-_langmodel-large-spl  10k         1024/2/1024    AWD_LSTM  2.1455788479   1429                             6         5           ['BEST', 'DEFAULT']  
  it_10k_2_1024_191022.141344                                                                                                                                                                           
  langmodel-large-split_10k_3_1024_191007.112257_-_langmodel-large-spl  10k         512/3/1024     AWD_LSTM  2.14730056622  1432                             6         5           []                   
  it_10k_3_1024_191022.134822                                                                                                                                                                           
  langmodel-large-split_10k_2_2048_191007.112249_-_langmodel-large-spl  10k         512/2/2048     GRU       2.19923468325  1429                             6         5           []                   
  it_10k_2_2048_191022.141335                                                                                                                                                                           
  langmodel-large-split_10k_1_512_190926.120146                         10k         512/1/512      AWD_LSTM  2.69019493253  479                              9         8           ['MEDIUM']           
  langmodel-small-split_10k_1_512_190906.154943                         10k         512/1/512      AWD_LSTM  4.73768141172  4                                19        18          ['TINY']             
  dev_10k_1_10_190923.132328                                            10k         10/1/10        AWD_LSTM  9.15688191092  0                                0         -1          ['RANDOM']
```

Use the `query_all_models` function to get a list of `ModelDescription` objects:
```python
>>> import langmodels.repository as repo
>>> repo.query_all_models()[0]
ModelDescription(id='langmodel-large-split_10k_2_1024_191007.112241_-_langmodel-large-split_10k_2_1024_191022.141344', bpe_merges='10k', layers_config='1024/2/1024', arch='AWD_LSTM', bin_entropy=2.1455788479, training_time_minutes_per_epoch=1429, n_epochs=6, best_epoch=5, tags=['BEST', 'DEFAULT'])
```
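
Since each model is described by a plain object, the list can be filtered with ordinary Python. The sketch below picks the model with the lowest binary entropy. To stay runnable without the library, it uses a stand-in `ModelDescription` named tuple with a subset of the fields shown in the repr above; `pick_lowest_entropy` is a hypothetical helper and not part of `langmodels`.

```python
from typing import List, NamedTuple, Optional


class ModelDescription(NamedTuple):
    # Stand-in for the library's class; only a subset of its fields (assumption).
    id: str
    arch: str
    bin_entropy: float
    tags: List[str]


def pick_lowest_entropy(models: List[ModelDescription],
                        arch: Optional[str] = None) -> Optional[ModelDescription]:
    """Return the model with the lowest binary entropy, optionally restricted to one architecture."""
    candidates = [m for m in models if arch is None or m.arch == arch]
    return min(candidates, key=lambda m: m.bin_entropy) if candidates else None


# Values copied from the table printed by list_pretrained_models above.
models = [
    ModelDescription('langmodel-large-split_10k_1_512_190926.120146', 'AWD_LSTM', 2.69019493253, ['MEDIUM']),
    ModelDescription('langmodel-small-split_10k_1_512_190906.154943', 'AWD_LSTM', 4.73768141172, ['TINY']),
    ModelDescription('dev_10k_1_10_190923.132328', 'AWD_LSTM', 9.15688191092, ['RANDOM']),
]

best = pick_lowest_entropy(models)
print(best.id)  # langmodel-large-split_10k_1_512_190926.120146
```

With the real library, the same helper could presumably be applied to the result of `repo.query_all_models()`, provided the attribute names match the repr.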

**A model can be loaded by tag or by id.**

Use the `force_use_cpu` parameter (defaults to `False`) to load a model onto the CPU even when a CUDA-capable GPU 
is available. If no CUDA-capable GPU is available, this parameter is ignored.
```python
>>> trained_model = repo.load_model_with_tag('BEST')

>>> trained_model = repo.load_model_by_id('dev_10k_1_10_190923.132328_new', force_use_cpu=True)
```

You can also use a lower-level API to load a model by path:
```python
trained_model = repo.load_from_path('/home/hlib/.local/share/langmodels/0.0.1/modelzoo/dev_10k_1_10_190923.132328_new')
```

## Inference
### Autocompletion

Example

```python
>>> import langmodels.repository as repo
>>> trained_model = repo.load_default_model()
>>> trained_model.feed_text('public static main() { if', extension='java')

# this does not change the state of the model:
>>> trained_model.predict_next_full_token(n_suggestions=5)
[('(', 0.9334765834402862), ('.', 0.01540983953864937), ('=', 0.008939018331858162), (',', 0.005372771784601065), ('the', 0.00309070517292041)]

# adding more context:
>>> trained_model.feed_text('(', extension='java')
>>> trained_model.predict_next_full_token(n_suggestions=3)
[('(', 0.14554535082422237), ('c', 0.018005003646104294), ('!', 0.01614662429123089)]


# resetting the state of the model (make it forget the context)
>>> trained_model.reset()
>>> trained_model.predict_next_full_token(n_suggestions=5)
[('/', 0.7209196484717589), ('package', 0.27093282656897594), ('import', 0.0007366385365522241), ('.', 0.0005714365190590807), ('public', 0.0003926736567296)]

```
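
The predictions come back as a list of `(token, probability)` pairs, so they are easy to post-process. As a minimal sketch, the snippet below keeps only suggestions above a probability cut-off. `confident_suggestions` and the `min_prob` threshold are hypothetical names chosen for illustration; the input is the first prediction list from the example above.

```python
from typing import List, Tuple


def confident_suggestions(predictions: List[Tuple[str, float]],
                          min_prob: float = 0.01) -> List[str]:
    """Keep the tokens whose predicted probability is at least min_prob, preserving order."""
    return [token for token, prob in predictions if prob >= min_prob]


# Output of predict_next_full_token(n_suggestions=5) copied from the example above.
predictions = [('(', 0.9334765834402862), ('.', 0.01540983953864937),
               ('=', 0.008939018331858162), (',', 0.005372771784601065),
               ('the', 0.00309070517292041)]

print(confident_suggestions(predictions))  # ['(', '.']
```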


### Bug prediction based on per-line entropies evaluation

An LM can be used to calculate cross-entropies for each line of a file. High values can give an idea about 
unusual/suspicious chunks of code [[1]](#1).

See the [LM Evaluation](#lm-evaluation) section to learn how to calculate 
cross-entropy for a project/file/string.

Check our [VS Code plugin](https://github.com/giganticode/vsc-extension) for highlighting suspicious code.
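
To illustrate the idea, the sketch below flags lines whose entropy lies well above the file average. It assumes the per-line entropies have already been computed; the values, the `suspicious_lines` helper, and the mean-plus-`n_sigmas` rule are illustrative assumptions, not the method used by the library or by [[1]](#1).

```python
from statistics import mean, pstdev
from typing import List


def suspicious_lines(line_entropies: List[float], n_sigmas: float = 1.5) -> List[int]:
    """Return 1-based numbers of lines whose entropy exceeds mean + n_sigmas * std."""
    threshold = mean(line_entropies) + n_sigmas * pstdev(line_entropies)
    return [number for number, entropy in enumerate(line_entropies, start=1)
            if entropy > threshold]


# Hypothetical per-line entropies for a six-line file.
entropies = [1.2, 0.9, 1.1, 6.5, 1.0, 1.3]
print(suspicious_lines(entropies))  # [4]
```

A fixed threshold or a percentile cut-off would work just as well; the right choice depends on the corpus and the model.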

## Fine-tuning and Transfer learning

**TBD**

## Training from scratch (not supported on macOS)

### Python API

```python
>>> from langmodels.training.training import train
>>> from langmodels.lmconfig.datamodel import *

>>> train(LMTrainingConfig(corpus=Corpus(path='/path/to/the/dataset')))
```

More parameters to customize corpus pre-processing, NN architecture, and the training process can be specified:

```python
>>> from langmodels.training.training import train
>>> from langmodels.lmconfig.datamodel import *

>>> train(LMTrainingConfig(corpus=Corpus(path='/path/to/the/dataset'), 
                            prep_function=PrepFunction(options=PrepFunctionOptions(no_com=False, no_unicode=True)),
                            arch=GRUArch(n_layers=2),
                            training=Training(weight_decay=5e-6)
))
```

Below you can see all the default parameters specified explicitly:

```python
>>> import os
>>> import sys
>>> from langmodels.lmconfig.datamodel import *
>>> from langmodels.training.training import train

>>> train(LMTrainingConfig(base_model=None, 
                       bs=32, 
                       corpus=Corpus(path=os.path.join(HOME, 'dataset'), extensions="java"), 
                       prep_function=PrepFunction(corpus_api.bpe, ['10k'], 
                                                  PrepFunctionOptions(no_com=False, no_unicode=True, 
                                                                    no_spaces=True, max_str_length=sys.maxsize)), 
                       arch=LstmArch(
                           bidir=False, qrnn=False, emb_sz=1024, n_hid=1024, n_layers=3, 
                           drop=Dropouts(multiplier=0.5, oute=0.02, outi=0.25, outh=0.15, w=0.2, out=0.1), 
                           tie_weights=True, out_bias=True), 
                       bptt=200, 
                       training=Training(
                            optimizer=Adam(betas=(0.9, 0.99)),
                            files_per_epoch=50000,
                            gradient_clip=0.3,
                            activation_regularization=ActivationRegularization(alpha=2., beta=1.), 
                            schedule=RafaelsTrainingSchedule(init_lr=1e-4, mult_coeff=0.5, patience=0,
                                                            max_epochs=50, max_lr_reduction_times=6), 
                            weight_decay=1e-6)
                       )
      )
```

### CLI API

Training can be run from the command line with the `train` command, passing the path to a JSON config file 
via the `--config` param. To override values in the JSON file (or the default values if `--config` is not specified), 
use the `--patch` param.
```shell script
>> langmodels train --config="/path/to/json/config.json" --patch="bs=64,arch.drop.multiplier=3.0"
```
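
The `--patch` value pairs dotted keys with new values. One way such overrides could be applied to a nested config is sketched below. `apply_patch` is a hypothetical helper written for illustration; it is not the library's own implementation, and the parsing rules (comma-separated `key=value` pairs, values parsed as JSON literals) are assumptions.

```python
import copy
import json
from typing import Any, Dict


def apply_patch(config: Dict[str, Any], patch: str) -> Dict[str, Any]:
    """Return a copy of config with comma-separated 'a.b.c=value' overrides applied."""
    patched = copy.deepcopy(config)
    for assignment in patch.split(','):
        dotted_key, raw_value = assignment.split('=', 1)
        *parents, leaf = dotted_key.strip().split('.')
        node = patched
        for key in parents:  # walk down to the dict holding the leaf
            node = node[key]
        if leaf not in node:
            raise KeyError(f'Unknown config key: {dotted_key}')
        try:
            node[leaf] = json.loads(raw_value)  # numbers, true/false, null
        except json.JSONDecodeError:
            node[leaf] = raw_value  # fall back to a plain string
    return patched


config = {'bs': 32, 'arch': {'drop': {'multiplier': 0.5}}}
patched = apply_patch(config, 'bs=64,arch.drop.multiplier=3.0')
print(patched)  # {'bs': 64, 'arch': {'drop': {'multiplier': 3.0}}}
```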

If neither `--config` nor `--patch` is specified, training runs with the default parameters.
The JSON with the default parameters looks as follows:

```json
{"arch": {"bidir": false,
          "drop": {"multiplier": 0.5,
                   "out": 0.1,
                   "oute": 0.02,
                   "outh": 0.15,
                   "outi": 0.25,
                   "w": 0.2},
          "emb_sz": 1024,
          "n_hid": 1024,
          "n_layers": 3,
          "name": "lstm",
          "out_bias": true,
          "qrnn": false,
          "tie_weights": true},
 "base_model": null,
 "bptt": 200,
 "bs": 32,
 "config_version": "0.0.3-alpha.0",
 "corpus": {"extensions": "java", "path": "/Users/hlib/dataset"},
 "prep_function": {"callable": "bpe",
                   "options": {"max_str_length": 9223372036854775807,
                               "no_com": false,
                               "no_spaces": true,
                               "no_str": false,
                               "no_unicode": true},
                   "params": ["10k"]},
 "training": {"activation_regularization": {"alpha": 2.0, "beta": 1.0},
              "files_per_epoch": 50000,
              "gradient_clip": 0.3,
              "optimizer": {"betas": [0.9, 0.99], "name": "Adam"},
              "schedule": {"init_lr": 0.0001,
                           "max_epochs": 50,
                           "max_lr_reduction_times": 6,
                           "mult_coeff": 0.5,
                           "name": "rafael",
                           "patience": 0},
              "weight_decay": 1e-06}}
```

You will most likely need to override at least the `corpus.path` value.

For more options, run:
```shell script
>> langmodels train --help
```

## LM Evaluation

When training a language model, it is important to be able to evaluate the LM's performance.
In this section we describe different ways to do this using the `langmodels` library. 
You can also use our [tool](https://github.com/giganticode/lm-powered) to visualize the evaluation.

### Evaluation on a string / file

First, a model can be evaluated on a string with the `evaluate_model_on_string` function. Note that the result may differ 
a lot depending on the state of the model. Use the `reset` and `feed_text` methods to reset the model 
to its initial state and to change the context of the model, respectively.

```python

>>> import langmodels.repository as repo 
>>> from langmodels.evaluation import evaluate_model_on_string    

>>> model = repo.load_default_model()
>>> evaluate_model_on_string(model, 'public class MyClass {')

{full_token_entropy/ParsedToken: EvaluationResult(
    tokens=['public</t>', 'class</t>', 'MyClass</t>', '{</t>'],
    token_types=['KeyWord', 'KeyWord', 'SplitContainer', 'OpeningCurlyBracket'],
    values=[1.8144783973693848, 3.668722629547119, 0.5620064437389374, 0.2571456730365753], 
    aggregated_value=1.5755882859230042
)}

```

Similarly, `evaluate_model_on_file` returns a list of `Evaluation` objects (one per line).
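
In the `evaluate_model_on_string` output above, `aggregated_value` equals the arithmetic mean of the per-token `values`. The check below reproduces it from the printed numbers. Whether the library aggregates this way for every metric is an assumption; the snippet only verifies this one example.

```python
# Per-token entropies copied from the EvaluationResult printed above.
values = [1.8144783973693848, 3.668722629547119, 0.5620064437389374, 0.2571456730365753]
reported_aggregated_value = 1.5755882859230042

# Arithmetic mean of the per-token values.
mean_entropy = sum(values) / len(values)

print(abs(mean_entropy - reported_aggregated_value) < 1e-9)  # True
```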

### Evaluation on a corpus

Evaluation can be run on a set of files with the `evaluate_model_on_path` function:

```python
>>> import langmodels.repository as repo 
>>> from langmodels.evaluation import evaluate_model_on_path

>>> model = repo.load_default_model()
>>> evaluate_model_on_path(model, '/path/to/file')

100%|████████████████████████████████████████████████████████████████████████████| 28/28 [00:11<00:00,  2.35it/s]
{full_token_entropy/ParsedToken: (5.859160765187885, 5745)}
```

In `full_token_entropy/ParsedToken`, `full_token_entropy` is the metric used to evaluate the performance, and 
`ParsedToken` means that all tokens were considered during evaluation (see the next section for more details).
Thus, the average full-token entropy is ~5.86, evaluated on 5,745 tokens.
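
An average entropy can be turned into a perplexity, which some readers find easier to interpret. The sketch below assumes the reported entropy is measured in bits (the release notes state that training reports binary entropy; that evaluation uses the same unit is an assumption). `perplexity_from_bits` is a hypothetical helper, not a library function.

```python
def perplexity_from_bits(entropy_bits: float) -> float:
    """Convert an average per-token entropy in bits to perplexity (2 ** entropy)."""
    return 2.0 ** entropy_bits


# Average full-token entropy copied from the evaluate_model_on_path output above.
ppl = perplexity_from_bits(5.859160765187885)
print(f'perplexity: {ppl:.1f}')
```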

### Specifying metrics

You can specify the metrics on which the model is evaluated:

```python
>>> import langmodels.repository as repo 
>>> from langmodels.evaluation import evaluate_model_on_path

>>> model = repo.load_default_model()
>>> evaluate_model_on_path(model, '/path/to/file', metrics={'full_token_entropy', 'mrr'})
```

Possible metric values are `full_token_entropy`, `subtoken_entropy`, and `mrr`. The default metric set is `{full_token_entropy}`.
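
For readers unfamiliar with `mrr` (mean reciprocal rank), the sketch below shows the standard definition: each prediction scores `1 / rank` of the correct token in the ranked suggestion list, or 0 if the token is absent, and the scores are averaged. This is a generic illustration with made-up data; the library's exact computation (for example, how many suggestions it considers) may differ.

```python
from typing import List


def mean_reciprocal_rank(ranked_suggestions: List[List[str]],
                         actual_tokens: List[str]) -> float:
    """Average of 1/rank of the actual token in each suggestion list (0 when it is missing)."""
    if not actual_tokens:
        return 0.0
    total = 0.0
    for suggestions, actual in zip(ranked_suggestions, actual_tokens):
        if actual in suggestions:
            total += 1.0 / (suggestions.index(actual) + 1)
    return total / len(actual_tokens)


# Hypothetical ranked suggestions for three positions and the tokens that actually followed.
ranked = [['(', '.', '='], ['x', 'i', 'c'], ['{', ';']]
actual = ['(', 'i', ')']

# Ranks are 1, 2, and "not found": (1 + 0.5 + 0) / 3
print(mean_reciprocal_rank(ranked, actual))  # 0.5
```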


## Release Notes

### 0.0.4-alpha.0 (NOT backward-compatible with 0.0.1-alpha.2)

- Config datamodel improvements: 
    - Add possibility to specify SGD optimizer; 
    - Add patience param to training schedule;
    - Add converters between versions of configs;
- Training:
    - Report binary entropy instead of log-base-e entropy;
    - Save more model metrics (size on disk, trainable params, training time per epoch);
    - Do not save model after every epoch by default;
- Evaluation improvements:
    - Return token types in `EvaluationResult`;
    - Add possibility to specify token types to be considered when running evaluation;
    - Trained_model.predict_next_token(): return 1 suggestion by default;
- Add script for new models upload.

### 0.0.1-alpha.2 (NOT backward-compatible with 0.0.1-alpha.1)

- Make downloading model from the repository thread-safe
- Force to specify the extension which corresponds to the type of the code fed into
the `TrainedModel`. **API change**: `trained_model.feed_text(text: str)` -> `trained_model.feed_text(text: str, extension: str)`

### 0.0.1-alpha.1

Make methods of `TrainedModel` that change underlying PyTorch model thread-safe

### 0.0.1-alpha.0

Initial PyPI release

## References

<a id="1">[1]</a> Ray, B., Hellendoorn, V., Godhane, S., Tu, Z., Bacchelli, A., & Devanbu, P. (2016, May). 
On the "naturalness" of buggy code. In 2016 IEEE/ACM 38th International Conference on Software Engineering (ICSE) 
(pp. 428-439). IEEE.

