Metadata-Version: 2.1
Name: meguru-tokenizer
Version: 0.1.0
Summary: simple tokenizer for tensorflow 2.x and PyTorch
Home-page: https://github.com/MokkeMeguru/meguru_tokenizer
Author: MokkeMeguru
Author-email: meguru.mokke@gmail.com
License: MIT license
Description: # meguru tokenizer
        
        # installation and initialization
        
            pip install meguru_tokenizer
            sudachipy link -t full
        
        # Abstraction of Usage
        
        1.  Preprocess with the tokenizer of your choice,
            e.g. SentencePiece preprocessing or Sudachi preprocessing.
        2.  Tokenize in your code with that tokenizer.
            - basics:
              see the [official docs](https://mokkemeguru.github.io/meguru_tokenizer/index.html)
            - TODO: Tensorflow
            - TODO: PyTorch
        
        # Real-World Example
        
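            import pprint
            from pathlib import Path
        
            from meguru_tokenizer.whitespace_tokenizer import WhitespaceTokenizer
            # assumption: Vocab is importable from meguru_tokenizer.vocab
            from meguru_tokenizer.vocab import Vocab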
        
            sentences = [
                "Hello, I don't know how to use it?",
                "Tensorflow is awesome!",
                "it is good framework.",
            ]
        
            # define tokenizer and vocabulary
            tokenizer = WhitespaceTokenizer(lower=True)
            vocab = Vocab()
        
            # build vocabulary
            for sentence in sentences:
                vocab.add_vocabs(tokenizer.tokenize(sentence))
            vocab.build_vocab()
        
            # attach the vocabulary to the tokenizer to enable encoding
            tokenizer.vocab = vocab
        
            # save vocabulary information
            vocab.dump_vocab(Path("vocab.txt"))
            print("vocabs:")
            pprint.pprint(vocab.i2w)
        
            # tokenize
            print("tokenized sentence")
            pprint.pprint(tokenizer.tokenize_list(sentences))
        
            # [['hello', ',', 'i', 'do', "n't", 'know', 'how', 'to', 'use', 'it', '?'],
            #  ['tensorflow', 'is', 'awesome', '!'],
            #  ['it', 'is', 'good', 'framework', '.']]
        
            # encode
            print("encoded sentence")
            encodes = [tokenizer.encode(sentence) for sentence in sentences]
            pprint.pprint(encodes)
        
            # [[7, 8, 9, 10, 11, 12, 13, 14, 15, 5, 16], [17, 6, 18, 19], [5, 6, 20, 21, 22]]
        
            # decode
            print("decoded sentence")
            pprint.pprint([tokenizer.decode(tokens) for tokens in encodes])
            # ["hello , i do n't know how to use it ?",
            #  'tensorflow is awesome !',
            #  'it is good framework .']
        
            vocab_size = len(vocab)
        
            # restore the vocabulary from the dumped file
            print("reload from dump file")
            vocab = Vocab()
            vocab.load_vocab(Path("vocab.txt"))
            assert vocab_size == len(vocab)
        
            tokenizer = WhitespaceTokenizer(vocab=vocab)
            pprint.pprint([tokenizer.encode(sentence) for sentence in sentences])
        
            # [[7, 8, 9, 10, 11, 12, 13, 14, 15, 5, 16], [17, 6, 18, 19], [5, 6, 20, 21, 22]]
        
            # vocabulary with a minimum frequency limit
            vocab = Vocab()
            for sentence in sentences:
                vocab.add_vocabs(tokenizer.tokenize(sentence))
            vocab.build_vocab(min_freq=2)
            assert vocab_size != len(vocab)
        
            # vocabulary with a maximum vocabulary size
            vocab = Vocab()
            for sentence in sentences:
                vocab.add_vocabs(tokenizer.tokenize(sentence))
            vocab.build_vocab(vocab_size=10)
            assert 10 == len(vocab)
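        
        The ids in the outputs above start at 5, and the two most frequent tokens (`it`, `is`) come first, which suggests frequency-ordered ids after five reserved entries. The following is a minimal pure-Python sketch of that idea, not the library's implementation: `build_vocab_sketch` and the `<reserved-N>` names are hypothetical, and counting the reserved entries towards `vocab_size` is an assumption.
        
        ```python
        from collections import Counter
        
        # Hypothetical placeholders: the sample output only shows that ids 0-4 are taken.
        RESERVED = ["<reserved-{}>".format(i) for i in range(5)]
        
        
        def build_vocab_sketch(token_lists, min_freq=1, vocab_size=None):
            """Assign ids by descending frequency, after the reserved entries."""
            counts = Counter(tok for toks in token_lists for tok in toks)
            # most_common() sorts by count; ties keep first-seen order (Python 3.7+).
            words = [w for w, c in counts.most_common() if c >= min_freq]
            if vocab_size is not None:
                # Assumption: vocab_size includes the reserved entries.
                words = words[: max(vocab_size - len(RESERVED), 0)]
            i2w = RESERVED + words
            w2i = {w: i for i, w in enumerate(i2w)}
            return i2w, w2i
        
        
        # Tokenized sentences copied from the example output above.
        tokenized = [
            ["hello", ",", "i", "do", "n't", "know", "how", "to", "use", "it", "?"],
            ["tensorflow", "is", "awesome", "!"],
            ["it", "is", "good", "framework", "."],
        ]
        
        i2w, w2i = build_vocab_sketch(tokenized)
        ids = [[w2i[tok] for tok in toks] for toks in tokenized]
        print(ids)
        
        # Only tokens seen at least twice survive next to the reserved entries.
        i2w_min, _ = build_vocab_sketch(tokenized, min_freq=2)
        print(i2w_min[len(RESERVED):])
        
        # Truncate to a fixed total size.
        i2w_max, _ = build_vocab_sketch(tokenized, vocab_size=10)
        print(len(i2w_max))
        ```
        
        Under these assumptions the sketch reproduces the encoded ids shown above, and `min_freq=2` keeps only `it` and `is` next to the reserved entries.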
        
Keywords: tensorflow,pytorch,tokenizer,nlp
Platform: UNKNOWN
Classifier: Development Status :: 2 - Pre-Alpha
Classifier: Intended Audience :: Developers
Classifier: License :: OSI Approved :: MIT License
Classifier: Natural Language :: English
Classifier: Programming Language :: Python :: 3
Classifier: Programming Language :: Python :: 3.5
Classifier: Programming Language :: Python :: 3.6
Classifier: Programming Language :: Python :: 3.7
Classifier: Programming Language :: Python :: 3.8
Requires-Python: >=3.5
Description-Content-Type: text/markdown
