Metadata-Version: 2.1
Name: kcollections
Version: 1.0.2
Summary: A BloomFilterTrie implementation to be generally applicable for genomic applications.
Home-page: https://github.com/masakistan/kcollections
Author: M. Stanley Fujimoto and Cole A. Lyman
Author-email: sfujimoto@gmail.com
License: GPLv3
Description: # kcollections
        [![Build Status](https://travis-ci.com/masakistan/kcollections.svg?token=oruFeF6Jkw9aGsjG6xUW&branch=master)](https://travis-ci.com/masakistan/kcollections)
        
        A fast and efficient library for storing k-mers in python.
        
        ## About
        `kcollections` is a python-like `dict` and `set` that efficiently store kmers as keys or elements of the collection.
        It is designed for building/prototyping Bioinformatics tools that rely on kmers but where the included `dict` and `set` consume too much memory for use.
        
        It implements the [Bloom Filter Trie](https://github.com/GuillaumeHolley/BloomFilterTrie) algorithm.
        This implementation differs from Guillaume et al. by allowing kmers of arbitrary size and by providing a generic dictionary/map data structure for associating arbitrary values with kmers.
        
        `kcollections` is currently only available for Python version 2.7.
        
        ## Installation
        We provide some pre-compiled binaries for python 2/3 and Linux and MacOS:
        
        ```bash
        pip install kcollections
        ```
        
        Alternatively, you can build from source.
        
        ### Build from source
        Prerequisites include:
        
          - [CMake](https://cmake.org/), must be installed manually in order to build the code.
          - [uint256_t](https://github.com/calccrypto/uint256_t), included in the repository.
          - [pybind11](https://github.com/pybind/pybind11), included in the repository.
        
        To build and install the python module from source:
        
        ```bash
        git clone https://github.com/masakistan/kcollections.git
        cd kcollections
        
        # python 3
        python3 setup.py install
        # or
        pip3 install .
        
        # python 2
        python2 setup.py install
        # or
        pip2 install .
        ```
        
        Additionally, if you would like to access the functions from a C++ program, you can build static libraries with the following steps:
        
        ``` bash
        git clone https://github.com/masakistan/kcollections.git
        cd kcollections
        
        mkdir build
        cd build
        
        cmake ..
        make
        ```
        
        ## Example Usage
        
        ### Using Kset
        
        #### Serial Insertion
        Kmers can be added one at a time with `add`, but the fastest way to add kmers to a set is
        to add a DNA sequence using `add_seq`.
        
        ```python
        import kcollections
        ks = kcollections.Kset(27)
        
        # add single kmer
        ks.add('AAACTGTCTTCCTTTATTTGTTCAGGG')
        
        # sequence insertion
        seq = 'AAACTGTCTTCCTTTATTTGTTCAGGGATCGTGTCAGTA'
        ks.add_seq(seq, len(seq))
        
        assert 'AAACTGTCTTCCTTTATTTGTTCAGGG' in ks
        # iteration
        for kmer in ks:
            print kmer
        ```
        
        #### Parallel Insertion
        The fastest way to use `Kset` is to use multithreaded insertion.
        Multithreaded approach is best used when all kmers are loaded upfront.
        Kmers not accessible until the threads have been joined using
        `parallel_add_join`.
        See the example below on how parallel and serial insertions can be used.
        
        ```python
        import kcollections
        ks = kcollections.Kset(27)
        
        # multithreaded sequence insertion
        # nthreads must be a power of 2.
        # nthreads = 4 or 16 work well
        ks.parallel_add_init(16)
        
        # insert a sequence of kmers
        ks.parallel_add_seq(seq)
        
        # insert a single kmer
        ks.parallel_add('AAACTGTCTTCCTTTATTTGTTCACAG')
        
        # merge threads together
        # no parallel add methods can be used after joining
        ks.parallel_add_join()
        
        # serial add is permissible after threads have joined
        ks.add('AAACTGTCTTCCTTTATTTGTTCACAG')
        
        # iteration
        for kmer in ks:
            print(kmer)
        print len(ks)
        ```
        
        ### Using Kdict
        
        #### Serial Insertion
        ```python
        import kcollections
        kd = kcollections.Kdict(27)
        
        # insertion and value assignment
        kd['AAACTGTCTTCCTTTATTTGTTCAGGG'] = 'banana'
        kd['AAACTGTCTTCCTTTATTTGTTCAGGT'] = 'phone'
        assert kd['AAACTGTCTTCCTTTATTTGTTCAGGG'] == 'banana'
        assert kd['AAACTGTCTTCCTTTATTTGTTCAGGT'] == 'phone'
        
        # iteration
        for kmer, val in kd.items():
            print(kmer, val)
        
        # removal
        del kd['AAACTGTCTTCCTTTATTTGTTCAGGT']
        ```
        
        #### Parallel Insertion
        Parallel insertion for `Kdict` requires the inclusion of a merging function to resolve different values for the same key.
        The following snippet adds 27mers from a string of DNA using a provided lambda function to merge value conflicts.
        This merge function simply keeps the newest value associated with the kmer.
        More examples of merging functions with `Kdict` can be found [here](kdict_merging.md).
        
        ```python
        kd = kcollections.Kdict(27)
        kd.parallel_add_init(4)
        kd.set_merge_func(lambda prev_val, new_val: new_val)
        kd.parallel_add_seq(dna, generate_idx(len(dna)))
        kd.parallel_add_join()
        ```
        
        ### Using Kcounter
        
        `Kcounter` is an implementation of the Python collection's
        [Counter](https://docs.python.org/2/library/collections.html#collections.Counter),
        but the keys must be kmers, of course!
        Like `Kdict`, kmers can be added to `Kcounter` one at a time, but the
        fastest ways to add kmers to a set is to add an DNA sequence using `add_seq` (or
        `parallel_add_seq` for multithreaded inserts).
        
        #### Serial Insertion
        ```python
        from kcollections import Kcounter
        kc = Kcounter(27)
        
        # add single kmer
        kc['AAACTGTCTTCCTTTATTTGTTCAGGG'] += 1
        
        # sequence insertion
        seq = 'AAACTGTCTTCCTTTATTTGTTCAGGGATCGTGTCAGTA'
        kc.add_seq(seq)
        
        assert kc['AAACTGTCTTCCTTTATTTGTTCAGGG'] == 2
        
        # iteration
        for kmer, count in kc.items():
            print(kmer, count)
        ```
        
        #### Parallel Insertion
        
        ```python
        from kcollections import Kcounter
        kc = Kcounter(27)
        
        # multithreaded sequence insertion
        # nthreads must be a power of 2.
        # nthreads = 4 or 16 work well
        kc.parallel_add_init(16)
        
        # insert a sequence of kmers
        kc.parallel_add_seq(seq)
        
        # insert a single kmer
        kc.parallel_add('AAACTGTCTTCCTTTATTTGTTCACAG')
        
        # merge threads together
        # no parallel add methods can be used after joining
        kc.parallel_add_join()
        
        # updates can be made after the join
        kc['AAACTGTCTTCCTTTATTTGTTCAGGG'] += 1
        
        # iteration
        for kmer, count in kc.items():
            print(kmer, count)
        ```
        
        ## Performance
        `kcollections` is quite a bit slower than the `dict` or `set` but is much more memory-efficient.
        We measured memory usage and running time using `/usr/bin/time -v` on a`Intel(R)
        Xeon(R) E5-2650v4 @2.20GHz` with 256 GB RAM.
        27mers used for testing were taken from the human genome, no repetitive kmers appear in our dataset providing a worst case scenario where no insertions or queries are pruned before traversing the entire data structure.
        
        ### Memory Usage
        
        |# kmers|`kset`|`set`|
        |-------|------|-----|
        |1 million|25.32 MB|105.82 MB|
        |10 million|63.74 MB|906.96 MB|
        |100 million|0.56 GB|11.98 GB|
        |500 million|2.42 GB|48.54 GB|
        |1 billion|4.43 GB|97.07 GB|
        |1.5 billion|6.44 GB|191.61 GB|
        |2 billion|8.44 GB|194.14 GB|
        |2.4 billion|10.08 GB|220.06 GB|
        
        ![Figure of memory usage](./memory_fig.png)
        
        ### Insertion Time
        Insertion time comparisons using built-in Python set, `kcollections` serial and parallel insert.
        
        ![Figure of insertion time](./insert_time.png)
        
        ### Read Mapper and Assembler
        An example read mapping algorithm and assembler are provided using `kcollections` in the `applications` directory.
        
        ## Citation
        
        ## Acknowledgements
        `kcollections` was built at the Computational Science Laboratory at Brigham Young University by Stanley Fujimoto (@masakistan) and Cole Lyman (@colelyman).
        
        Funding was provided by the Utah NASA Space Grant Consortium and the BYU Graduate Research Fellowship.
        
Platform: UNKNOWN
Description-Content-Type: text/markdown
