Metadata-Version: 1.1
Name: genomelake
Version: 0.1.3
Summary: Simple and efficient random access to genomic data for deep learning models.
Home-page: https://github.com/kundajelab/genomelake
Author: Chuan-Sheng Foo
Author-email: csfoo@cs.stanford.edu
License: BSD-3
Description-Content-Type: UNKNOWN
Description: # genomelake
        Efficient random access to genomic data for deep learning models.
        
        Supports the following types of input data:
        
        - bigwig
        - DNA sequence
        
        genomelake extracts signal from genomic inputs in provided BED intervals.
        
        ## Requirements
        - python 2.7 or 3.5
        - bcolz
        - cython
        - numpy
        - pybedtools
        - pysam
        
        The installer will check (and install) dependencies except cython and pyBigWig.
        
        ## Installation
        1. Install cython with Anaconda by running:
        
        `conda install cython`
        
        2. Install pyBigWig with pip by running:
        
        `pip install pyBigWig`
        
        3. Clone the repository and then do:
        
        `python setup.py install`
        
        ## Getting started: training a protein-DNA binding model
        Extract genome-wide sequence data into a genomelake data source:
        ```
        from genomelake.backend import extract_fasta_to_file
        
        genome_fasta = "/mnt/data/annotations/by_release/hg19.GRCh37/hg19.genome.fa"
        genome_data_directory = "./hg19_data_directory"
        extract_fasta_to_file(genome_fasta, genome_data_directory)
        ```
        
        Using a BED intervals file with labels, a genome data source, and genomelake's `ArrayExtractor`, generate input DNA sequences and labels:
        ```
        import pybedtools
        from genomelake.extractors import ArrayExtractor
        import numpy as np
        
        def batch_iter(iterable, batch_size):
            it = iter(iterable)
            try:
                while True:
                    values = []
                    for n in range(batch_size):
                        values += (next(it),)
                    yield values
            except StopIteration:
                yield values
        
        def generate_inputs_and_labels(intervals_file, data_source, batch_size=128):
            bt = pybedtools.BedTool(intervals_file)
            extractor = ArrayExtractor(data_source)
            intervals_generator = batch_iter(bt, batch_size)
            for intervals_batch in intervals_generator:
            	inputs = extractor(intervals_batch)
        	labels = []
        	for interval in intervals_batch:
        	    labels.append(float(interval.name))
                labels = np.array(labels)
                yield inputs, labels
        ```
        
        Train a keras model of JUND binding to DNA using 101 base pair intervals and labels in `./examples/JUND.HepG2.chr22.101bp_intervals.tsv.gz`:
        ```
        from keras.models import Sequential
        from keras.layers import Conv1D, Flatten, Dense
        
        intervals_file = "./examples/JUND.HepG2.chr22.101bp_intervals.tsv.gz"
        inputs_labels_generator = generate_inputs_and_labels(intervals_file, genome_data_directory)
        
        model = Sequential()
        model.add(Conv1D(15, 25, input_shape=(101, 4)))
        model.add(Flatten())
        model.add(Dense(1, activation='sigmoid'))
        
        model.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])
        model.fit_generator(inputs_labels_generator, steps_per_epoch=100)
        ```
        
        Here is the expected result:
        ```
        100/100 [==============================] - 7s - loss: 0.0584 - acc: 0.9905 
        ```
        
        ## License
        genomelake is released under the BSD-3 license. See ``LICENSE`` for details.
        
Keywords: deeplearning neuralnets genomics
Platform: UNKNOWN
Classifier: Development Status :: 3 - Alpha
Classifier: Intended Audience :: Science/Research
Classifier: Intended Audience :: Developers
Classifier: Topic :: Scientific/Engineering
Classifier: License :: OSI Approved :: BSD License
Classifier: Programming Language :: Python :: 2.7
Classifier: Programming Language :: Python :: 3.5
Classifier: Programming Language :: Python :: 3.6
Classifier: Operating System :: POSIX :: Linux
