Metadata-Version: 2.1
Name: chariot
Version: 0.5.2
Summary: Deliver the ready-to-train data to your NLP model.
Home-page: https://github.com/chakki-works/chariot
Author: icoxfog417
Author-email: icoxfog417@yahoo.co.jp
License: Apache License 2.0
Description: # chariot
        
        [![PyPI version](https://badge.fury.io/py/chariot.svg)](https://badge.fury.io/py/chariot)
        [![Build Status](https://travis-ci.org/chakki-works/chariot.svg?branch=master)](https://travis-ci.org/chakki-works/chariot)
        [![codecov](https://codecov.io/gh/chakki-works/chariot/branch/master/graph/badge.svg)](https://codecov.io/gh/chakki-works/chariot)
        
        **Deliver the ready-to-train data to your NLP model.**
        
        
        * Prepare Dataset
          * You can prepare typical NLP datasets through the [chazutsu](https://github.com/chakki-works/chazutsu).
        * Build & Run Preprocess
          * You can build the preprocess pipeline like [scikit-learn Pipeline](http://scikit-learn.org/stable/modules/generated/sklearn.pipeline.Pipeline.html).
          * Preprocesses for each dataset column are executed in parallel by [Joblib](https://pythonhosted.org/joblib/index.html).
          * Multi-language text tokenization is supported by [spaCy](https://spacy.io/).
        * Format Batch
          * Sampling a batch from preprocessed dataset and format it to train the model (padding etc).
          * You can use pre-trained word vectors through the [chakin](https://github.com/chakki-works/chakin).
        
        **chariot** enables you to concentrate on training your model!
        
        ![chariot flow](./docs/images/chariot_feature.gif)
        
        ## Install
        
        ```
        pip install chariot
        ```
        
        ## Prepare dataset
        
        You can download various dataset by using [chazutsu](https://github.com/chakki-works/chazutsu).  
        
        ```py
        import chazutsu
        from chariot.storage import Storage
        
        
        storage = Storage("your/data/root")
        r = chazutsu.datasets.MovieReview.polarity().download(storage.data_path("raw"))
        
        df = storage.chazutsu(r.root).data()
        df.head(5)
        ```
        
        Then
        
        ```
        	polarity	review
        0	0	synopsis : an aging master art thief , his sup...
        1	0	plot : a separated , glamorous , hollywood cou...
        2	0	a friend invites you to a movie . this film wo...
        ```
        
        `Storage` class manage the directory structure that follows [cookie-cutter datascience](https://drivendata.github.io/cookiecutter-data-science/).
        
        ```
        Project root
          └── data
               ├── external     <- Data from third party sources (ex. word vectors).
               ├── interim      <- Intermediate data that has been transformed.
               ├── processed    <- The final, canonical datasets for modeling.
               └── raw          <- The original, immutable data dump.
        ```
        
        ## Build & Run Preprocess
        
        ### Build a preprocess pipeline
        
        All preprocessors are defined at `chariot.transformer`.  
        Transformers are implemented by extending [scikit-learn `Transformer`](https://scikit-learn.org/stable/modules/generated/sklearn.base.TransformerMixin.html).  
        Because of this, the API of Transformer is familiar to you. And you can mix [scikit-learn's preprocessors](https://scikit-learn.org/stable/modules/preprocessing.html).
        
        ```py
        import chariot.transformer as ct
        from chariot.preprocessor import Preprocessor
        
        
        preprocessor = Preprocessor()
        preprocessor\
            .stack(ct.text.UnicodeNormalizer())\
            .stack(ct.Tokenizer("en"))\
            .stack(ct.token.StopwordFilter("en"))\
            .stack(ct.Vocabulary(min_df=5, max_df=0.5))\
            .fit(train_data)
        
        preprocessor.save("my_preprocessor.pkl")
        
        loaded = Preprocessor.load("my_preprocessor.pkl")
        ```
        
        There is 6 type of transformers are prepared in chariot.
        
        * TextPreprocessor
          * Preprocess the text before tokenization.
          * `TextNormalizer`: Normalize text (replace some character etc).
          * `TextFilter`: Filter the text (delete some span in text stc).
        * Tokenizer
          * Tokenize the texts.
          * It powered by [spaCy](https://spacy.io/) and you can choose [MeCab](https://github.com/taku910/mecab) or [Janome](https://github.com/mocobeta/janome) for Japanese.
        * TokenPreprocessor
          * Normalize/Filter the tokens after tokenization.
          * `TokenNormalizer`: Normalize tokens (to lower, to original form etc).
          * `TokenFilter`: Filter tokens (extract only noun etc).
        * Vocabulary
          * Make vocabulary and convert tokens to indices.
        * Formatter
          * Format (preprocessed) data for training your model.
        * Generator
          * Genrate target data to train your (language) model.
        
        ### Build a preprocess for dataset
        
        When you want to make preprocess to each of your dataset column, you can use `DatasetPreprocessor`.
        
        ```py
        from chariot.dataset_preprocessor import DatasetPreprocessor
        from chariot.transformer.formatter import Padding
        
        
        dp = DatasetPreprocessor()
        dp.process("review")\
            .by(ct.text.UnicodeNormalizer())\
            .by(ct.Tokenizer("en"))\
            .by(ct.token.StopwordFilter("en"))\
            .by(ct.Vocabulary(min_df=5, max_df=0.5))\
            .by(Padding(length=pad_length))\
            .fit(train_data["review"])
        dp.process("polarity")\
            .by(ct.formatter.CategoricalLabel(num_class=3))
        
        
        preprocessed = dp.preprocess(data)
        
        # DatasetPreprocessor has multiple preprocessor.
        # Because of this, save file format is `tar.gz`.
        dp.save("my_dataset_preprocessor.tar.gz")
        
        loaded = DatasetPreprocessor.load("my_dataset_preprocessor.tar.gz")
        ```
        
        ## Train your model with chariot
        
        `chariot` has feature to traing your model.
        
        ```py
        formatted = dp(train_data).preprocess().format().processed
        
        model.fit(formatted["review"], formatted["polarity"], batch_size=32,
                  validation_split=0.2, epochs=15, verbose=2)
        
        ```
        
        ```py
        for batch in dp(train_data.preprocess().iterate(batch_size=32, epoch=10):
            model.train_on_batch(batch["review"], batch["polarity"])
        
        ```
        
        You can use pre-trained word vectors by [chakin](https://github.com/chakki-works/chakin).  
        
        
        ```py
        from chariot.storage import Storage
        from chariot.transformer.vocabulary import Vocabulary
        
        # Download word vector
        storage = Storage("your/data/root")
        storage.chakin(name="GloVe.6B.50d")
        
        # Make embedding matrix
        vocab = Vocabulary()
        vocab.set(["you", "loaded", "word", "vector", "now"])
        embed = vocab.make_embedding(storage.data_path("external/glove.6B.50d.txt"))
        print(embed.shape)  # (len(vocab.count), 50)
        ```
        
Keywords: machine learning,nlp,natural language processing
Platform: UNKNOWN
Description-Content-Type: text/markdown
