Metadata-Version: 1.1
Name: hicpeaks
Version: 0.2.0
Summary: Identify real loops from Hi-C data.
Home-page: https://github.com/XiaoTaoWang/HiCPeaks/
Author: XiaoTao Wang
Author-email: wangxiaotao686@gmail.com
License: UNKNOWN
Description: HiCPeaks
        ========
        *hicpeaks* provides a CPU-based Python implementation of BH-FDR and HICCUPS, two peak-calling algorithms
        for Hi-C data proposed by Rao et al. [1]_.
        
        Installation
        ============
        *hicpeaks* is developed and tested on UNIX-like operating systems, and requires the following packages
        and software:
        
        Python requirements:
        
        a) Python (2.7, not compatible with 3.x for now)
        b) Multiprocess
        c) Numpy
        d) Scipy
        e) Matplotlib
        f) Pandas
        g) Statsmodels
        h) Scikit-Learn
        i) H5py
        j) Cooler
        
        Other requirements:
        
        - ucsc-fetchchromsizes
        
        *conda*, an excellent package manager, can be used to install all requirements above.
        
        Install Conda
        -------------
        .. note:: If you have the Anaconda Distribution installed, you already have it.
        
        Download the latest `Linux Miniconda installer for Python 2.7 <https://conda.io/miniconda.html>`_,
        then in your terminal window type the following and follow the prompts on the installer screens::
        
            $ bash Miniconda2-latest-Linux-x86_64.sh
        
        After that, update the environment variables to finish the Conda installation::
        
            $ source ~/.bashrc
        
        Install Packages through Conda
        ------------------------------
        Conda organizes packages into separate repositories, or channels. The main *defaults*
        channel provides many common packages, including *numpy*, *scipy*, *matplotlib*, *pandas*, *statsmodels*,
        *scikit-learn*, and *h5py* listed above. To install these packages, run the following
        command::
        
            $ conda install numpy scipy matplotlib pandas statsmodels scikit-learn h5py
        
        The remaining packages are not available in the *defaults* channel: *cooler* and *ucsc-fetchchromsizes*
        are provided by the *bioconda* channel, and *multiprocess* by the *conda-forge* channel.
        To make them accessible, add the *bioconda* channel as well as the other channels it
        depends on (the order matters, because it determines channel priority)::
        
            $ conda config --add channels conda-forge
            $ conda config --add channels defaults
            $ conda config --add channels r
            $ conda config --add channels bioconda
        
        To install these requirements::
        
            $ conda install multiprocess cooler ucsc-fetchchromsizes
        
        Install hicpeaks
        ----------------
        Now download the `hicpeaks source code <https://pypi.org/project/hicpeaks/>`_ from PyPI, extract it, and run
        the setup.py script::
        
            $ python setup.py install
        
        If no exception occurs during the process above, *hicpeaks* has been installed successfully.
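        
        As a quick sanity check after installation, the sketch below tries to import each Python dependency
        and reports any that fail. The import names are assumptions (for example, Scikit-Learn is assumed to be
        importable as ``sklearn``), and ``find_missing_modules`` is a hypothetical helper, not part of *hicpeaks*.
        
        .. code-block:: python
        
            import importlib
        
            # Import names assumed for the Python requirements listed above.
            REQUIRED = ["multiprocess", "numpy", "scipy", "matplotlib", "pandas",
                        "statsmodels", "sklearn", "h5py", "cooler"]
        
            def find_missing_modules(names):
                """Return the subset of module names that cannot be imported."""
                missing = []
                for name in names:
                    try:
                        importlib.import_module(name)
                    except ImportError:
                        missing.append(name)
                return missing
        
            if __name__ == "__main__":
                missing = find_missing_modules(REQUIRED)
                print("Missing modules: {0}".format(", ".join(missing) or "none"))
        
        An empty result only shows that the modules can be imported; it does not check their versions.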
        
        
        Overview
        ========
        *hicpeaks* comes with 4 scripts: *toCooler*, *pyBHFDR*, *pyHICCUPS* and *peak-plot*.
        
        - toCooler
        
          Stores bin-level Hi-C data in TXT/NPZ format in a `cooler <https://github.com/mirnylab/cooler>`_ container.
        
          1. The *hicpeaks* source code includes sample data that illustrates how to prepare your
             data in TXT format. Three rules apply: (a) file names should follow the pattern
             "chrom1_chrom2.txt", with the prefix removed from chromosome labels (e.g. "chr1" becomes "1", and "chrX"
             becomes "X"); (b) each file should contain exactly 3 columns, corresponding to "bin1" of "chrom1", "bin2" of "chrom2",
             and "contact frequency" (**don't** apply any normalization); (c) all files at the same resolution
             should be placed in a single folder.
          2. NPZ is another bin-level Hi-C data container, which can greatly speed up data loading. *hicpeaks*
             supports NPZ files generated by `runHiC <https://github.com/XiaoTaoWang/HiC_pipeline>`_ and
             `TADLib <https://github.com/XiaoTaoWang/TADLib>`_.
        
        - pyBHFDR
        
          A CPU-based Python implementation of the BH-FDR algorithm. Rao et al. state in their supplementary material that
          this algorithm is robust enough to reproduce all the main results of their paper. Compared with HICCUPS, BH-FDR doesn't use
          λ-chunking in multiple hypothesis testing, and only considers the donut background region when calculating the
          expected values. *pyBHFDR* follows the algorithm of [1]_ faithfully, except that it doesn't implement
          the greedy clustering algorithm for the original peak pixels.
        
        - pyHICCUPS
        
          A CPU-based Python implementation of the HICCUPS algorithm. Besides the donut region, HICCUPS also considers the
          lower-left, vertical and horizontal backgrounds when calculating the expected values, and uses λ-chunking to overcome
          several multiple hypothesis testing challenges in Hi-C data. Finally, while BH-FDR has to restrict detection to pixels
          near the diagonal (<2Mb), HICCUPS can in theory be applied at any genomic distance. *pyHICCUPS*
          keeps all the main concepts of the original algorithm except for the following points, which may be addressed in the near future:
        
          1. *pyHICCUPS* doesn't implement additional filtering of peak pixels based on local enrichment thresholds.
          2. *pyHICCUPS* doesn't cluster original nearby peak pixels into a single peak call.
          3. *pyHICCUPS* doesn't yet combine peak annotations from different resolutions.
          4. Because of the computational cost, you should still limit the genomic distance between the two loci (e.g. to 5Mb or 10Mb).
        
          Despite these differences, peaks returned by *pyHICCUPS* are quite consistent with our visual inspection, and
          generally follow the typical loop interaction patterns.
        
        - peak-plot
        
          Visualizes peaks (or loops) detected by *pyBHFDR* or *pyHICCUPS* on a heatmap. Provide a cooler file, a loop
          annotation file, and the region of interest (chrom, start, end), and *peak-plot* will export the figure in PNG
          format.
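        
        To make the multiple-testing step mentioned under *pyBHFDR* more concrete, here is a minimal sketch of the
        Benjamini-Hochberg adjustment applied to a list of p-values. It illustrates the general procedure only and
        is not the code used by *pyBHFDR*; the function name ``benjamini_hochberg`` is hypothetical.
        
        .. code-block:: python
        
            def benjamini_hochberg(pvalues):
                """Return Benjamini-Hochberg adjusted p-values in the input order."""
                m = len(pvalues)
                # Indices of the p-values sorted in ascending order.
                order = sorted(range(m), key=lambda i: pvalues[i])
                qvalues = [0.0] * m
                running_min = 1.0
                # Walk from the largest p-value down so that adjusted values
                # never increase as the rank decreases.
                for rank in range(m, 0, -1):
                    idx = order[rank - 1]
                    running_min = min(running_min, pvalues[idx] * m / float(rank))
                    qvalues[idx] = running_min
                return qvalues
        
            if __name__ == "__main__":
                print(benjamini_hochberg([0.01, 0.04, 0.03, 0.005]))
        
        Pixels whose adjusted value falls below a chosen significance level would then be kept as candidate peaks.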
        
        
        QuickStart
        ==========
        This tutorial will guide you through the basic usage of all scripts distributed with *hicpeaks*.
        
        toCooler
        --------
        If you have already created a cooler file for your Hi-C data, skip to the next section,
        `pyBHFDR and pyHICCUPS <https://github.com/XiaoTaoWang/HiCPeaks/blob/master/README.rst#pybhfdr-and-pyhiccups>`_;
        otherwise, read on.
        
        First, store your TXT/NPZ bin-level Hi-C data in a cooler file using *toCooler*. Let's begin
        with the sample data. Assuming you are still in the *hicpeaks* distribution root folder, change your current
        working directory to the sub-folder *example*::
        
            $ cd example
            $ ls -lh *
        
            -rw-r--r--  1 xtwang  staff    18B Aug 21 19:46 datasets
            -rw-r--r--  1 xtwang  staff   293B Aug 23 20:53 hg38.chromsizes
        
            40K:
            total 11608
            -rw-r--r--  1 xtwang  staff   2.7M Aug 21 19:44 21_21.txt
            -rw-r--r--  1 xtwang  staff   2.9M Aug 21 19:44 22_22.txt
        
        There is one sub-directory called *40K*, which contains Hi-C data for two chromosomes of the K562 cell line at 40K resolution,
        and one metadata file, *datasets*, which can be passed directly to *toCooler*::
        
            $ cd 40K
            $ head -5 21_21.txt
        
            250	251	1
            250	258	1
            250	259	1
            250	260	4
            250	261	2
        
            $ cd ..
            $ cat datasets
        
            res:40000
              ./40K
        
        Your TXT files (no header, no footer) should have 3 columns, which indicate "bin1 of the 1st chromosome",
        "bin2 of the 2nd chromosome" and "contact frequency" respectively. See `Overview <https://github.com/XiaoTaoWang/HiCPeaks#overview>`_
        above.
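        
        If your contacts are held in memory as a matrix, a file in this layout could be produced with a sketch like
        the one below. It assumes tab-separated columns, omits zero entries, and writes only the upper triangle for
        intra-chromosomal data; these choices mirror the sample shown above but are assumptions, and
        ``write_contact_txt`` is a hypothetical helper rather than part of *hicpeaks*.
        
        .. code-block:: python
        
            import os
            import tempfile
        
            def write_contact_txt(matrix, chrom1, chrom2, outdir):
                """Write raw counts as 'bin1<TAB>bin2<TAB>count' lines to chrom1_chrom2.txt."""
                fname = os.path.join(outdir, "{0}_{1}.txt".format(chrom1, chrom2))
                intra = (chrom1 == chrom2)
                with open(fname, "w") as out:
                    for i, row in enumerate(matrix):
                        for j, count in enumerate(row):
                            if count == 0:
                                continue  # skip empty pixels
                            if intra and j < i:
                                continue  # symmetric matrix: keep bin1 <= bin2 only
                            out.write("{0}\t{1}\t{2}\n".format(i, j, count))
                return fname
        
            if __name__ == "__main__":
                demo_dir = tempfile.mkdtemp()
                path = write_contact_txt([[0, 2], [2, 5]], "21", "21", demo_dir)
                with open(path) as handle:
                    print(handle.read())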
        
        To transform this data to *cooler* format, just run the command below::
        
            $ toCooler -O K562-MboI-parts.cool -d datasets --assembly hg38 --nproc 2
        
        By default, *toCooler* fetches the size of each chromosome from UCSC using the provided genome assembly name (here hg38).
        If your reference genome is not hosted at UCSC, you can instead build a file like "hg38.chromsizes" in the
        current working directory, and pass the file path to the argument "--chromsizes-file".
        
        Type ``toCooler`` with no arguments on your terminal to print detailed help information for each parameter.
        
        For this dataset, *toCooler* will create a cooler file named "K562-MboI-parts.cool", and your data will be stored under
        the URI "K562-MboI-parts.cool::40000".
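        
        To inspect the stored data from Python, something like the following sketch may help. It relies on the
        *cooler* package (``cooler.Cooler`` and its ``matrix(...).fetch(...)`` selector); the helper names
        ``split_cooler_uri`` and ``load_raw_matrix`` are hypothetical, and treating the part after ``::`` as a group
        path inside the file is an assumption about how such URIs are interpreted.
        
        .. code-block:: python
        
            def split_cooler_uri(uri):
                """Split 'path::group' into (path, group); the group defaults to '/'."""
                path, sep, group = uri.partition("::")
                if not sep:
                    group = "/"
                elif not group.startswith("/"):
                    group = "/" + group
                return path, group
        
            def load_raw_matrix(uri, chrom):
                """Return the unnormalized contact matrix of one chromosome."""
                import cooler  # imported lazily so split_cooler_uri works without it
                clr = cooler.Cooler(uri)
                return clr.matrix(balance=False).fetch(chrom)
        
            if __name__ == "__main__":
                print(split_cooler_uri("K562-MboI-parts.cool::40000"))
        
        For the sample data, ``load_raw_matrix("K562-MboI-parts.cool::40000", "21")`` would be the corresponding call.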
        
        This tutorial only illustrates a very simple case. In practice, the metadata file may contain a list of resolutions (if you
        have data at different resolutions for the same cell line) and the corresponding folder paths (both relative and absolute
        paths are accepted; if your data are in NPZ format, the path should point to the NPZ file)::
        
            res:10000
              /absolutepath/10K
            
            res:20000
              ../relativepath/20K
            
            res:40000
              /npzfile/anyprefix.npz
        
        Then *toCooler* will generate a single cooler file storing all the specified data under different cooler URIs:
        "specified_cooler_path::10000", "specified_cooler_path::20000" and "specified_cooler_path::40000".
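        
        The layout of the metadata file can be illustrated with a small reader. This is a sketch of how such a file
        could be parsed, not *toCooler*'s actual parser; ``parse_datasets`` is a hypothetical name, and collecting
        the paths under each ``res:`` line into a list is an assumption.
        
        .. code-block:: python
        
            def parse_datasets(text):
                """Map each resolution (int) to the paths listed beneath its 'res:' line."""
                datasets = {}
                current = None
                for line in text.splitlines():
                    stripped = line.strip()
                    if not stripped:
                        continue  # ignore blank separator lines
                    if stripped.startswith("res:"):
                        current = int(stripped[len("res:"):])
                        datasets[current] = []
                    elif current is not None:
                        datasets[current].append(stripped)
                    else:
                        raise ValueError("path listed before any 'res:' line: " + stripped)
                return datasets
        
            if __name__ == "__main__":
                print(parse_datasets("res:40000\n  ./40K\n"))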
        
        pyBHFDR and pyHICCUPS
        ---------------------
        With a cooler URI, you can perform peak annotation with *pyBHFDR* or *pyHICCUPS*::
        
            $ pyBHFDR -O K562-MboI-BHFDR-loops.txt -p K562-MboI-parts.cool::40000 -C 21 22 --pw 1 --ww 3
        
        Or::
        
            $ pyHICCUPS -O K562-MboI-HICCUPS-loops.txt -p K562-MboI-parts.cool::40000 --pw 1 --ww 3
        
        Type ``pyBHFDR`` or ``pyHICCUPS`` on your terminal to print detailed help information for each parameter.
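        
        In [1]_, the donut background around a pixel is defined by a peak width and a window width. Assuming that
        ``--pw`` and ``--ww`` play these two roles (an assumption; check the help output of each script), the
        sketch below enumerates the pixel offsets that would form the donut: those within the window, outside the
        peak region, and not on the pixel's own row or column. ``donut_offsets`` is a hypothetical helper, not the
        *hicpeaks* implementation.
        
        .. code-block:: python
        
            def donut_offsets(pw, ww):
                """List (row, column) offsets of donut pixels relative to the tested pixel."""
                offsets = []
                for di in range(-ww, ww + 1):
                    for dj in range(-ww, ww + 1):
                        if max(abs(di), abs(dj)) <= pw:
                            continue  # inside the peak region
                        if di == 0 or dj == 0:
                            continue  # same row or column as the tested pixel
                        offsets.append((di, dj))
                return offsets
        
            if __name__ == "__main__":
                print(len(donut_offsets(1, 3)))
        
        Under these assumptions, a larger window averages the background over more pixels, at a higher computational cost.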
        
        Before moving on to the next section, let's list the contents of the current working directory again::
        
            $ ls -lh
        
            total 2360
            drwxr-xr-x  5 xtwang  staff   160B Aug 25 23:18 40K
            -rw-r--r--  1 xtwang  staff   3.4K Aug 25 23:19 BHFDR.log
            -rw-r--r--  1 xtwang  staff   7.3K Aug 25 23:20 HICCUPS.log
            -rw-r--r--  1 xtwang  staff   268K Aug 25 23:19 K562-MboI-BHFDR-loops.txt
            -rw-r--r--  1 xtwang  staff    38K Aug 25 23:20 K562-MboI-HICCUPS-loops.txt
            -rw-r--r--  1 xtwang  staff   704K Aug 25 23:19 K562-MboI-parts.cool
            -rw-r--r--  1 xtwang  staff    18B Aug 25 23:18 datasets
            -rw-r--r--  1 xtwang  staff   293B Aug 25 23:18 hg38.chromsizes
            -rw-r--r--  1 xtwang  staff    29K Aug 25 23:19 tocooler.log
        
        Peak Visualization
        ------------------
        Now you can visualize BH-FDR and HICCUPS peak annotations on a heatmap with *peak-plot*.
        
        For BH-FDR peaks::
        
            $ peak-plot -O test-BHFDR.png --dpi 250 -p K562-MboI-parts.cool::40000 -I K562-MboI-BHFDR-loops.txt -C 21 -S 40000000 -E 43000000 --correct --siglevel 0.0001
        
        The output figure should look like this:
        
        .. image:: ./figures/test-BHFDR.png
                :align: center
        
        
        For HICCUPS peaks::
        
            $ peak-plot -O test-HICCUPS.png --dpi 250 -p K562-MboI-parts.cool::40000 -I K562-MboI-HICCUPS-loops.txt -C 21 -S 40000000 -E 43000000 --correct --siglevel 0.1
        
        And the output plot:
        
        .. image:: ./figures/test-HICCUPS.png
                :align: center
        
        
        Notes
        -----
        Although *hicpeaks* currently cannot perform further filtering based on local enrichment thresholds, you can do
        this yourself using the output annotations of *pyBHFDR* and *pyHICCUPS*.
        
        
        Reference
        =========
        .. [1] Rao SS, Huntley MH, Durand NC et al. A 3D Map of the Human Genome at Kilobase Resolution
              Reveals Principles of Chromatin Looping. Cell, 2014, 159(7):1665-80.
        
Keywords: Hi-C interaction contact loop peak
Platform: UNKNOWN
Classifier: Programming Language :: Python :: 2.7
Classifier: License :: OSI Approved :: GNU General Public License v3 or later (GPLv3+)
Classifier: Operating System :: POSIX
Classifier: Development Status :: 3 - Alpha
Classifier: Intended Audience :: Science/Research
Classifier: Topic :: Scientific/Engineering :: Bio-Informatics
