Metadata-Version: 2.1
Name: gcgc
Version: 0.9.2.dev1
Summary: GCGC is a preprocessing library for biological sequence model development.
Home-page: https://github.com/tshauck/gcgc
Author: Trent Hauck
Author-email: trent@trenthauck.com
License: MIT
Description: # GCGC
        
        > GCGC is a Python package for feature processing on biological sequences.
        
        [![](https://img.shields.io/pypi/v/gcgc.svg)](https://pypi.python.org/pypi/gcgc)
        [![DOI](https://zenodo.org/badge/DOI/10.5281/zenodo.2329966.svg)](https://doi.org/10.5281/zenodo.2329966)
        
        ## Installation
        
        Install GCGC via pip:
        
        ```sh
        $ pip install gcgc
        ```
        
        ## Documentation
        
        The GCGC documentation is at [gcgc.trenthauck.com](http://gcgc.trenthauck.com);
        see it for a usage example.
        
        ## Citing GCGC
        
        If you use GCGC in your research, cite it with the following:
        
        ```
        @misc{trent_hauck_2018_2329966,
          author       = {Trent Hauck},
          title        = {GCGC},
          month        = dec,
          year         = 2018,
          doi          = {10.5281/zenodo.2329966},
          url          = {https://doi.org/10.5281/zenodo.2329966}
        }
        ```
        
        
        # Changelog
        
        ## 0.10.0 (2019-11-09)
        
        `gcgc` has been substantially revamped to better support existing NLP
        processing pipelines without trying to do too much. See the docs for more
        information about how this works.
        
        ## 0.9.0 (2019-08-05)
        
        ### Added
        
        - Parser now outputs the length of the tensor, not including padding. This is
          useful for packing and length-based iteration.
        - The `parse_record` method can now generate masked output.
        - Alphabet can include an optional mask token.
        
        ### Changed
        
        - A kmer step size can now be specified when supplying a kmer value.
        - Renamed `EncodedSeq.integer_encoded` to `EncodedSeq.get_integer_encoding`,
          which takes a `kmer_step_size` that sets how far to advance between kmers
          when encoding.
        - Added `parsed_seq_len` to the `SequenceParser` object to control how much
          padding to apply to the end of the integer-encoded sequence. This is useful
          because a batch of tensors is expected to have the same size.
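        
        As a rough illustration of the kmer step size and padding ideas in this
        release, here is a minimal pure-Python sketch. The names `encode_kmers`,
        `pad_to_length`, and `vocab` are hypothetical and are not part of the GCGC API;
        the actual behavior of `EncodedSeq.get_integer_encoding` and `SequenceParser`
        may differ.
        
        ```python
        from itertools import product
        
        
        def encode_kmers(seq, kmer_size, kmer_step_size, vocab):
            """Integer-encode the kmers of seq, advancing kmer_step_size letters each time."""
            last_start = len(seq) - kmer_size
            return [
                vocab[seq[i:i + kmer_size]]
                for i in range(0, last_start + 1, kmer_step_size)
            ]
        
        
        def pad_to_length(encoded, parsed_seq_len, padding_index):
            """Right-pad encoded to parsed_seq_len; also return the unpadded length."""
            n_padding = max(0, parsed_seq_len - len(encoded))
            return encoded + [padding_index] * n_padding, len(encoded)
        
        
        # Hypothetical vocabulary: index 0 is reserved for padding, 2-mers start at 1.
        vocab = {"".join(p): i for i, p in enumerate(product("ACGT", repeat=2), start=1)}
        
        # Non-overlapping 2-mers of "ACGTAC" are "AC", "GT", "AC".
        encoded = encode_kmers("ACGTAC", kmer_size=2, kmer_step_size=2, vocab=vocab)
        padded, length = pad_to_length(encoded, parsed_seq_len=5, padding_index=0)
        ```
        
        Returning the unpadded length alongside the padded sequence is what makes
        packing and length-based iteration possible downstream.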
        
        ## 0.8.0 (2019-07-04)
        
        ### Fixed
        
        - Broken test due to platform differences in `Path.glob` sorting.
        
        ### Added
        
        - Users can optionally enable start or end tokens.
        
        ### Removed
        
        - Removed `one_hot_encoding`. Users can do this fairly easily themselves if
          needed, e.g. with `scatter` in PyTorch.
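        
        For readers who relied on the removed method, one way to one-hot encode an
        integer-encoded sequence with `scatter_` is sketched below. The helper name
        `one_hot` and the example values are assumptions for illustration, not part of
        GCGC.
        
        ```python
        import torch
        
        
        def one_hot(integer_encoded, vocab_size):
            """One-hot encode a 1-D tensor of integer indices using Tensor.scatter_."""
            index = integer_encoded.long().unsqueeze(1)    # shape: (seq_len, 1)
            out = torch.zeros(index.size(0), vocab_size)   # shape: (seq_len, vocab_size)
            return out.scatter_(1, index, 1.0)             # write 1.0 at each index
        
        
        # Hypothetical integer-encoded sequence over a 4-letter alphabet.
        encoded = torch.tensor([0, 2, 1, 3])
        result = one_hot(encoded, vocab_size=4)
        ```
        
        PyTorch also provides `torch.nn.functional.one_hot`, which may be simpler when
        an integer tensor result is acceptable.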
        
        ## 0.7.0 (2019-06-22)
        
        ### Added
        
        - Properties to access the integer encodings of special tokens. (35cae2a)
          - `Alphabet.encoded_start`
          - `Alphabet.encoded_end`
          - `Alphabet.encoded_padding`
        - Remove uniprot dataset creation. (e233162)
        - Simplify index handling for GenomicDataset. (3213a9e)
        
        ## 0.6.1 (2019-06-10)
        
        ### Added
        
        - Updated package management so gcgc is easier to use with other versions of
          torch.
        
        ## 0.6.0 (2019-04-04)
        
        ### Added
        
        - Ability for kmer size to be passed to an alphabet.
        
        ## 0.5.2 (2019-03-21)
        
        ### Added
        
        - Add Dockerfile and docker-compose.yml for development.
        - `EncodedSeq.shift`, which will shift sequence by an offset integer.
        - `EncodedSeq.from_integer_encoded_seq` will take a list of integers and an
          alphabet and return an EncodedSeq object.
        - Add the ability to apply a function to the values yielded by `rollout_kmers`.
        
        ### Changed
        
        - Alphabet special characters are now located at the start, rather than the end,
          of the letters and token sequence.
        
        ## 0.5.1 (2019-01-09)
        
        ### Added
        
        - Add extra CSS to underline links in articles.
        - Exit if the download directory doesn't exist in the call to download organism.
        - Wording improvements in docs.
        
        ## 0.5.0 (2018-12-31)
        
        ### Added
        
        - Include `seq_tensor_one_hot` in the PyTorch Parser.
        - Added a `GCGCRecord.encoded_seq` property.
        - New `gcgc.random` module to start holding sequence data.
        - New `gcgc.rollout` module to handle working through chunks of sequences.
          - `rollout_kmers` will roll out [kmers][1].
          - `rollout_seq_features` will roll out the `SeqFeatures` from a `SeqRecord`.
        - `EncodingAlphabet` now can optionally take a `gap_characters` set of characters to add to the
          alphabet letters. It also takes `add_lower_case_for_inserts` which will duplicate the alphabet,
          but convert the letters to lowercase.
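        
        The effect of `gap_characters` and `add_lower_case_for_inserts` on an alphabet
        can be pictured with the sketch below. The function `build_letters` is
        hypothetical, and the ordering of the resulting letters is an assumption; it is
        not taken from the GCGC source.
        
        ```python
        def build_letters(letters, gap_characters=None, add_lower_case_for_inserts=False):
            """Extend an alphabet with lowercase duplicates and gap characters."""
            result = letters
            if add_lower_case_for_inserts:
                # Duplicate the alphabet in lowercase to represent inserted residues.
                result += letters.lower()
            if gap_characters:
                # Sort the set so the resulting order is deterministic.
                result += "".join(sorted(gap_characters))
            return result
        
        
        letters = build_letters(
            "ACGT", gap_characters={"-", "."}, add_lower_case_for_inserts=True
        )
        # Map each letter to an integer index for encoding.
        index_of = {letter: i for i, letter in enumerate(letters)}
        ```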
        
        ### Changed
        
        ### Fixed
        
        - Fixed bug in `GenomicDataset.from_path` where it still referred to `init_from_path_generator`.
        
        ## 0.4.0
        
        ### Added
        
        - `EncodedSeq` now supports iterating through kmers, see `EncodedSeq.rollout_kmers` for options.
        - GCGC is citable.
        - GCGC now has a CHANGELOG.md.
        
        [1]: https://en.wikipedia.org/wiki/K-mer
        
Platform: UNKNOWN
Classifier: License :: OSI Approved :: MIT License
Requires-Python: >=3.6
Description-Content-Type: text/markdown
