Metadata-Version: 2.1
Name: seekr
Version: 1.1.1
Summary: A library for counting small kmer frequencies in nucleotide sequences.
Home-page: https://github.com/CalabreseLab/seekr
Author: Jessime Kirk
Author-email: jessime.kirk@gmail.com
License: MIT
Description: # SEEKR
        
        A library for counting small kmer frequencies in nucleotide sequences.
        
        A web portal is available at [seekr.org](http://seekr.org).
        
        ## Installation
        
         * To use this library, you need Python 3.x on your computer. If you don't have it installed, the easiest way to get it is from the [Anaconda distribution](https://www.continuum.io/downloads). Anaconda also provides most of the dependencies you need to use SEEKR.
         * Either download this repository as a .zip file, or use git to `clone` it.
         * Install any missing dependencies by running: `pip install -r requirements.txt`.
        
        ## Usage
        
        You can use SEEKR either from the command line or as a Python module. In either case, you will use `kmer_counts.py` to generate a kmer count matrix of m rows by n columns,
        where m is the number of transcripts in a fasta file and n is 4^k, with k being the kmer length. Then `pearson.py` can be used to calculate how well correlated all pairwise combinations of sequences are.
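        
        To make the shape of that matrix concrete, here is a minimal sketch in plain Python. It is not the code in `kmer_counts.py`: the function name `count_kmers`, the column order, and the choice to skip windows containing non-ACGT symbols are all assumptions made for illustration, and it returns raw counts without any normalization.
        
        ```python
        from itertools import product
        
        
        def count_kmers(sequences, k):
            """Return (kmers, matrix), where matrix has one row per sequence and 4**k columns.
        
            Hypothetical helper for illustration only; not part of SEEKR.
            """
            # Enumerate all 4**k kmers in a fixed order so every row shares the same columns.
            kmers = ["".join(p) for p in product("ACGT", repeat=k)]
            column = {kmer: i for i, kmer in enumerate(kmers)}
            matrix = []
            for seq in sequences:
                row = [0] * len(kmers)
                seq = seq.upper()
                # Slide a window of width k across the sequence.
                for start in range(len(seq) - k + 1):
                    index = column.get(seq[start:start + k])
                    if index is not None:  # assumption: skip windows with N or other symbols
                        row[index] += 1
                matrix.append(row)
            return kmers, matrix
        
        
        kmers, counts = count_kmers(["ACGTAC", "TTTT"], k=2)
        print(len(counts), len(counts[0]))  # rows = sequences, columns = 4**k
        ```
        
        With two sequences and k=2, the result has 2 rows and 16 columns.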
        
        **Note:** Some advanced usages are currently not available from the command-line and require that you import the module.
        
        Here are some quick-start examples if you just want to get going:
        
        ### Examples
        
        #### kmer_counts
        
        The default settings produce a binary, normalized numpy file:
        
        ```
        $ python kmer_counts.py /path/to/rnas.fa -o /path/to/out.npy
        ```
        
        To get a human-readable csv file, set the non-binary flag:
        
        ```
        $ python kmer_counts.py /path/to/rnas.fa -o /path/to/out.csv -nb
        ```
        
        If you want to add default labels, also set the label flag:
        
        ```
        $ python kmer_counts.py /path/to/rnas.fa -o /path/to/out.csv -nb -lb
        ```
        
        You can also change the size of the kmer you're using, and prevent normalization:
        
        ```
        $ python kmer_counts.py /path/to/rnas.fa -o /path/to/out.npy -k 4 -nc -ns
        ```
        
        #### pearson
        
        To find Pearson correlations between kmer count profiles, run `pearson.py`. Running the program is similar to running `kmer_counts.py`, and the options are similar too. Input files for `pearson.py` will always be the output files from one or more runs of `kmer_counts.py`. The default settings accept two numpy files and output a third numpy file.
        
        ```
        $ python pearson.py /path/to/kc_out.npy /path/to/kc_out.npy -o /path/to/out.npy
        ```
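        
        As a rough illustration of what a pairwise Pearson comparison between two sets of kmer profiles computes, here is a sketch in plain Python. The helper names `pearson_r` and `pairwise_pearson` are hypothetical and are not SEEKR functions; the library itself works on numpy arrays and may compute the same quantity differently.
        
        ```python
        from math import sqrt
        
        
        def pearson_r(x, y):
            """Pearson's r between two equal-length profiles (hypothetical helper)."""
            n = len(x)
            mean_x = sum(x) / n
            mean_y = sum(y) / n
            cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
            var_x = sum((a - mean_x) ** 2 for a in x)
            var_y = sum((b - mean_y) ** 2 for b in y)
            # Note: a constant profile has zero variance and would raise ZeroDivisionError.
            return cov / sqrt(var_x * var_y)
        
        
        def pairwise_pearson(rows_a, rows_b):
            """Return a len(rows_a) x len(rows_b) matrix of r values."""
            return [[pearson_r(a, b) for b in rows_b] for a in rows_a]
        
        
        # Toy profiles standing in for rows of two kmer count matrices.
        profiles_a = [[1.0, 2.0, 3.0, 4.0]]
        profiles_b = [[2.0, 4.0, 6.0, 8.0], [4.0, 3.0, 2.0, 1.0]]
        sim = pairwise_pearson(profiles_a, profiles_b)
        print(sim)
        ```
        
        Each entry `sim[i][j]` compares row i of the first input with row j of the second, so one profile against two profiles gives a 1 by 2 matrix.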
        
        The only other options besides the `-o` flag control binary versus .csv input and output. If you have non-binary input files (i.e. .csv files) and also want a non-binary output file, you can run:
        
        ```
        $ python pearson.py /path/to/kc_out.csv /path/to/kc_out.csv -o /path/to/out.csv -nbi -nbo
        ```
        
        ### Advanced usage
        
        A common task is to use the normalization vectors from a large .fa file to analyze specific lncRNAs of interest. Currently, this cannot be done from the command line, but it is fairly straightforward from within Python. This example serves as a walkthrough of using SEEKR as a module, from creating the normalization vectors (using 4mers) to creating the Pearson's similarity matrix.
        
        ```python
        import numpy as np
        from kmer_counts import BasicCounter
        from pearson import pearson
        
        v22_gencode = 'v22_gencode.fa'
        gencode_counter = BasicCounter(v22_gencode, k=4)
        gencode_counter.get_counts()
        np.save('mean.npy', gencode_counter.mean)
        np.save('std.npy', gencode_counter.std)
        
        xist = 'xist.fa'
        lncRNAs = 'other_lncs.fa'
        xist_counter = BasicCounter(xist, '4mers_xist.npy', mean='mean.npy', std='std.npy', k=4)
        lncs_counter = BasicCounter(lncRNAs, '4mers_lncs.npy', mean='mean.npy', std='std.npy', k=4)
        xist_counter.make_count_file()
        lncs_counter.make_count_file()
        
        sim = pearson(xist_counter.counts, lncs_counter.counts, outfile='xist_vs_lncs.npy')
        
        ```
        
        This will write five files to disk:
        
        * `mean.npy`: Mean vector for gencode human lncRNAs. Once this has been saved, the first portion of the code doesn't need to be run again.
        * `std.npy`: Standard deviation vector for gencode human lncRNAs.
        * `4mers_xist.npy`: Normalized kmer profile for Xist.
        * `4mers_lncs.npy`: Normalized kmer profile for other lncRNAs of interest.
        * `xist_vs_lncs.npy`: Pearson's r values for all pairwise comparisons between Xist and the other lncRNAs.
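        
        The idea behind reusing `mean.npy` and `std.npy` can be sketched as follows: per-kmer statistics are computed once on a large reference set and then applied to a smaller set of sequences. This is a plain-Python illustration, not SEEKR's implementation. The helper names are hypothetical, and the use of the population standard deviation is an assumption; the library may scale or transform counts differently.
        
        ```python
        from statistics import mean, pstdev
        
        
        def column_stats(matrix):
            """Per-column mean and standard deviation of a reference count matrix."""
            columns = list(zip(*matrix))
            # Assumption: population standard deviation (pstdev), one value per kmer column.
            return [mean(c) for c in columns], [pstdev(c) for c in columns]
        
        
        def normalize(matrix, means, stds):
            """Z-score each column using statistics taken from a different (reference) matrix."""
            return [
                [(value - m) / s for value, m, s in zip(row, means, stds)]
                for row in matrix
            ]
        
        
        # Toy reference counts (2 transcripts x 2 kmers) and one transcript of interest.
        reference = [[1.0, 4.0], [3.0, 8.0]]
        means, stds = column_stats(reference)
        target = [[2.0, 10.0]]
        normalized = normalize(target, means, stds)
        print(normalized)
        ```
        
        Because the target is scaled by the reference statistics rather than its own, profiles from different small files stay comparable with each other.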
        
        ### Help
        
        For full documentation of the parameters and flags, you can run `kmer_counts.py` or `pearson.py` without any arguments.
        
        ```
        $ python kmer_counts.py
        ```
        
        ## Issues
        
        Any suggestions, questions, or problems can be directed to our [GitHub Issues page](https://github.com/CalabreseLab).
        
Platform: UNKNOWN
Description-Content-Type: text/markdown
