Metadata-Version: 2.1
Name: nagisa-bert
Version: 0.0.1
Summary: A BERT model for nagisa: It is created to be robust against typos and colloquial expressions for Japanese.
Home-page: https://github.com/taishi-i/nagisa_bert
Author: taishi-i
Author-email: taishi.ikeda.0323@gmail.com
Maintainer: taishi-i
Maintainer-email: taishi.ikeda.0323@gmail.com
License: MIT
Keywords: NLP,BERT,Transformers,Japanese
Classifier: License :: OSI Approved :: MIT License
Classifier: Programming Language :: Python :: 3
Classifier: Programming Language :: Python :: 3.7
Classifier: Programming Language :: Python :: 3.8
Classifier: Programming Language :: Python :: 3.9
Classifier: Programming Language :: Python :: 3.10
Classifier: Topic :: Software Development :: Libraries :: Python Modules
Classifier: Natural Language :: Japanese
Requires-Python: >=3.7
Description-Content-Type: text/markdown
License-File: LICENSE

# nagisa_bert

This library provides a tokenizer for using [the Japanese BERT model](https://huggingface.co/taishi-i/nagisa_bert) with [nagisa](https://github.com/taishi-i/nagisa).
The nagisa BERT model is designed to be **robust against typos and colloquial expressions in Japanese**.

The model is trained on both character and word units with Hugging Face's Transformers. Unknown words are split into characters and trained at the character level.
The model is available in [Transformers](https://github.com/huggingface/transformers) 🤗.
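
To illustrate the idea of mixing word and character units, the sketch below keeps known words whole and splits any word missing from a vocabulary into single characters. It is a toy illustration only, not the actual `NagisaBertTokenizer` implementation; the function name `fallback_to_chars`, the tiny vocabulary, and the pre-segmented input are all hypothetical.

```python
def fallback_to_chars(words, vocab):
    """Keep in-vocabulary words whole; split the rest into characters."""
    tokens = []
    for word in words:
        if word in vocab:
            tokens.append(word)
        else:
            # A str iterates character by character, so extend() adds each one.
            tokens.extend(word)
    return tokens


# Hypothetical vocabulary and pre-segmented input, for illustration only.
vocab = {"で", "できる", "モデル", "です"}
words = ["nagisa", "で", "できる", "モデル", "です"]
print(fallback_to_chars(words, vocab))
# ['n', 'a', 'g', 'i', 's', 'a', 'で', 'できる', 'モデル', 'です']
```

In this sketch, a misspelled or colloquial word that is not in the vocabulary still maps to known character tokens instead of a single unknown token.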

## Install

Python 3.7+ on Linux or macOS is required.
You can install *nagisa_bert* by using the *pip* command.


```bash
$ pip install nagisa_bert
```

## Usage

You can use this model with the `pipeline` function from Transformers.

```python
>>> from transformers import pipeline
>>> from nagisa_bert import NagisaBertTokenizer

>>> text = "nagisaで[MASK]できるモデルです"
>>> tokenizer = NagisaBertTokenizer.from_pretrained("taishi-i/nagisa_bert")
>>> fill_mask = pipeline("fill-mask", model='taishi-i/nagisa_bert', tokenizer=tokenizer)
>>> print(fill_mask(text))
[{'score': 0.1437765508890152,
  'sequence': 'n a g i s a で 使用 できる モデル です',
  'token': 1104,
  'token_str': '使 用'},
 {'score': 0.08369122445583344,
  'sequence': 'n a g i s a で 購入 できる モデル です',
  'token': 1821,
  'token_str': '購 入'},
 {'score': 0.07685843855142593,
  'sequence': 'n a g i s a で 利用 できる モデル です',
  'token': 548,
  'token_str': '利 用'},
 {'score': 0.07316956669092178,
  'sequence': 'n a g i s a で 閲覧 できる モデル です',
  'token': 13270,
  'token_str': '閲 覧'},
 {'score': 0.05647417902946472,
  'sequence': 'n a g i s a で 確認 できる モデル です',
  'token': 1368,
  'token_str': '確 認'}]
```
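
The `fill-mask` pipeline returns a list of dictionaries, and the `token_str` values shown above contain a space between characters. As a minimal sketch, the helper below ranks predictions by score and strips those spaces. The name `top_candidates` and its `k` parameter are hypothetical and not part of nagisa_bert or Transformers; the sample entries are copied from the output above.

```python
def top_candidates(predictions, k=3):
    """Return the k highest-scoring (token, score) pairs from fill-mask output."""
    ranked = sorted(predictions, key=lambda p: p["score"], reverse=True)
    # Remove the spaces that appear inside token_str (e.g. '使 用' -> '使用').
    return [
        (p["token_str"].replace(" ", ""), round(p["score"], 4))
        for p in ranked[:k]
    ]


# Sample entries copied from the pipeline output shown above.
predictions = [
    {"score": 0.1437765508890152, "token": 1104, "token_str": "使 用"},
    {"score": 0.08369122445583344, "token": 1821, "token_str": "購 入"},
    {"score": 0.07685843855142593, "token": 548, "token_str": "利 用"},
]
print(top_candidates(predictions, k=2))
# [('使用', 0.1438), ('購入', 0.0837)]
```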

You can also use the tokenizer with `BertModel` to tokenize text and compute contextual vectors.

```python
>>> from transformers import BertModel
>>> from nagisa_bert import NagisaBertTokenizer

>>> text = "nagisaで[MASK]できるモデルです"
>>> tokenizer = NagisaBertTokenizer.from_pretrained("taishi-i/nagisa_bert")
>>> tokens = tokenizer.tokenize(text)
>>> print(tokens)
['n', 'a', 'g', 'i', 's', 'a', 'で', '[MASK]', 'できる', 'モデル', 'です']

>>> model = BertModel.from_pretrained("taishi-i/nagisa_bert")
>>> h = model(**tokenizer(text, return_tensors="pt")).last_hidden_state
>>> print(h)
tensor([[[-1.1636, -0.5645,  0.4484,  ..., -0.2207, -0.1540,  0.1051],
         [-1.0394,  0.8815, -0.8070,  ...,  1.0930,  0.2069,  0.9613],
         [-0.2068, -0.1445, -0.6113,  ..., -1.2920,  0.0725, -0.2164],
         ...,
         [-1.2590,  0.0118,  0.4998,  ..., -0.5212, -0.8015, -0.1050],
         [ 0.7925, -0.7628,  0.1016,  ...,  0.2233,  0.0164,  0.0102],
         [-0.7847, -0.1375,  0.4475,  ..., -0.4014,  0.0346,  0.3157]]],
       grad_fn=<NativeLayerNormBackward0>)
```
