Metadata-Version: 2.1
Name: cmsip
Version: 0.0.0.7
Summary: UNKNOWN
Home-page: https://github.com/lijinbio/cmsip
Author: Jin Li
Author-email: lijin.abc@gmail.com
License: License :: OSI Approved :: MIT License
Platform: UNKNOWN
Classifier: Programming Language :: Python :: 3
Requires-Python: >=3.6
Description-Content-Type: text/markdown
Requires-Dist: pyyaml
Requires-Dist: matplotlib
Requires-Dist: pandas

# CMSIP

Detecting differential 5hmC regions from CMS-IP sequencing data.

Source URL: [https://github.com/lijinbio/cmsip](https://github.com/lijinbio/cmsip)

![Workflow of CMSIP.](cmsip_flowchart.png)

## Installation

### Dependencies

- bsmap

`bsmap` is a component in the MOABS package. See more at MOABS ([https://github.com/sunnyisgalaxy/moabs](https://github.com/sunnyisgalaxy/moabs)).

- samtools: [http://samtools.sourceforge.net](http://samtools.sourceforge.net)

- bedtools: [https://bedtools.readthedocs.io](https://bedtools.readthedocs.io)

- kentUtils: [https://github.com/ENCODE-DCC/kentUtils](https://github.com/ENCODE-DCC/kentUtils)

## Example configuration file and description

```
sampleinfo:
  - sampleid: TKO2PE1b2
    group: tko
    filenames:
      - TKO2PE1b2_R1.fastq.gz
  - sampleid: TKO2PE2m
    group: tko
    filenames:
      - TKO2PE2b1_R1.fastq.gz
      - TKO2PE2b1_R2.fastq.gz
  - sampleid: WTPE1b2
    group: wt
    filenames:
      - WTPE1b2_R1.fastq.gz
  - sampleid: WTPE2b2
    group: wt
    filenames:
      - WTPE2b2_R1.fastq.gz
groupinfo:
  group1: tko
  group2: wt
resultdir: result
aligninfo:
  reference: /data/jin/resource/genome/fasta/hg38/hg38.fa.gz
  spikein: /data/jin/resource/genome/fasta/mm10/mm10.fa.gz
  fastqdir: test_data
  statfile: qcstats.txt
  barplotinfo:
    outfile: qcstats_twsn_barplot.pdf
    height: 5
    width: 5
  numthreads: 20
  verbose: True
genomescaninfo:
  readextension: True
  fragsize: 100
  windowfile: result/hg38_w200.bed
  referencename: hg38
  windowsize: 200
  readscount: False
  counttablefile: counttable.txt.gz
  verbose: True
dhmrinfo:
  method: 4
  mindepth: 5
  testfile: test.txt.gz
  qthr: 1.05
  maxdistance: 0
  dhmrfile: dhmr.txt.gz
  numthreads: 20
  verbose: True
```
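Before a long run, it can help to check that a configuration is internally consistent. The sketch below is a hypothetical helper, not part of cmsip: `check_config` and `REQUIRED_BLOCKS` are names introduced here for illustration. It assumes the configuration has already been parsed into a dictionary, for example with `yaml.safe_load` from pyyaml, and it checks only the top-level blocks shown in the example above and that each group named in `groupinfo` has at least one sample.

```python
# Hypothetical helper, not part of cmsip: sanity-check a parsed configuration.
REQUIRED_BLOCKS = ("sampleinfo", "groupinfo", "resultdir",
                   "aligninfo", "genomescaninfo", "dhmrinfo")


def check_config(config):
    """Return a list of human-readable problems found in a parsed config dict."""
    problems = [f"missing block: {name}"
                for name in REQUIRED_BLOCKS if name not in config]
    # Collect the group labels actually used by the samples.
    groups = {sample.get("group") for sample in config.get("sampleinfo", [])}
    for key in ("group1", "group2"):
        group = config.get("groupinfo", {}).get(key)
        if group not in groups:
            problems.append(f"groupinfo.{key}={group!r} has no samples")
    return problems


# A trimmed-down dict standing in for the result of yaml.safe_load(...).
config = {
    "sampleinfo": [
        {"sampleid": "TKO2PE1b2", "group": "tko",
         "filenames": ["TKO2PE1b2_R1.fastq.gz"]},
        {"sampleid": "WTPE1b2", "group": "wt",
         "filenames": ["WTPE1b2_R1.fastq.gz"]},
    ],
    "groupinfo": {"group1": "tko", "group2": "wt"},
    "resultdir": "result",
    "aligninfo": {},
    "genomescaninfo": {},
    "dhmrinfo": {},
}

problems = check_config(config)
# Same config, but group2 refers to a label that no sample uses.
broken_problems = check_config(dict(config, groupinfo={"group1": "tko", "group2": "ko"}))
print(problems)
print(broken_problems)
```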

### `sampleinfo`

This block stores sample metadata. Each entry gives a `sampleid`, the `group` the sample belongs to, and the FASTQ `filenames` for that sample.

### `groupinfo`

This block specifies the comparison of interest. The alternative hypothesis is that the true difference in means between `group1` and `group2` is less than 0.
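To make the direction of the comparison concrete, here is a minimal sketch of a two-sample t statistic for `mean(group1) - mean(group2)`. It uses Welch's form as an assumption; which test variant cmsip applies is not stated here. The function name `welch_t` and the numbers are made up for illustration. A negative statistic points in the direction of the alternative hypothesis (`group1` lower than `group2`).

```python
from math import sqrt
from statistics import mean, variance


def welch_t(group1, group2):
    """Welch's t statistic for mean(group1) - mean(group2)."""
    n1, n2 = len(group1), len(group2)
    # Standard error of the difference, using per-group sample variances.
    se = sqrt(variance(group1) / n1 + variance(group2) / n2)
    return (mean(group1) - mean(group2)) / se


# Made-up signal values for one window, three replicates per group.
tko = [1.0, 2.0, 3.0]
wt = [4.0, 5.0, 6.0]

t = welch_t(tko, wt)
print(t < 0)  # negative t: group1 mean is below group2 mean
```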

### `aligninfo`

Options and input data required for alignment.

- reference

The FASTA file for the reference genome, such as hg38.fa.gz.

- spikein

The FASTA file for the spike-in genome, such as mm10.fa.gz.

- windowfile: hg38_w100.bed

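The genome partitioned into fixed-size window bins, in BED format. This file can be generated with bedtools, using `fetchChromSizes` from kentUtils to obtain the chromosome sizes. For example: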

```
bedtools makewindows -g <(fetchChromSizes hg38) -w 100 > hg38_w100.bed
```
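For readers unfamiliar with the window file, the following sketch shows what fixed-size windowing produces. It is an illustration only, not a replacement for `bedtools makewindows`: `make_windows` is a hypothetical function, and the chromosome names and sizes are toy values. Coordinates follow the BED convention (0-based start, exclusive end), and the last window on each chromosome is truncated at the chromosome end.

```python
def make_windows(chrom_sizes, window_size):
    """Yield (chrom, start, end) windows in BED coordinates (0-based, half-open)."""
    for chrom, size in chrom_sizes.items():
        for start in range(0, size, window_size):
            # Truncate the final window so it never runs past the chromosome end.
            yield chrom, start, min(start + window_size, size)


# Toy chromosome sizes; real values would come from fetchChromSizes.
toy_sizes = {"chrA": 250, "chrB": 100}
windows = list(make_windows(toy_sizes, 100))

for chrom, start, end in windows:
    print(f"{chrom}\t{start}\t{end}")
```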

- windowsize: 100

Window size for creating bins.

- fastqdir: test_data

Root directory with raw FASTQ files.

- outdir

Root output directory for temporary and final result files.

- statfile

QC statistics file. The default is `outdir/qcstats.txt`. If this file exists, the QC step is skipped and size factors are parsed from the existing QC statistics file. Otherwise, the QC step runs and generates the statistics file.

- cnttablefile

Region count table file. The default is `outdir/meancovtable.txt.gz`. If this file exists, the counting step is skipped and the existing count table is used for downstream statistical testing. Otherwise, the counting step runs and generates the count table file.

- ttestfile

The statistical testing result file. The default is `outdir/t.test.txt`. If this file exists, no further tasks are run. Otherwise, a t-test is run on the count table.
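The skip-if-exists behavior described for the statistics, count table, and testing files follows a common checkpoint pattern. The sketch below illustrates that pattern under stated assumptions; it is not cmsip's actual code. `run_if_missing` and `write_placeholder` are hypothetical names, and the placeholder step stands in for a real QC, counting, or testing step.

```python
import os
import tempfile


def run_if_missing(path, step):
    """Run step(path) only when path does not exist; return True if it ran."""
    if os.path.exists(path):
        return False  # Output already present: reuse it and skip the step.
    step(path)
    return True


def write_placeholder(path):
    """Stand-in for a real pipeline step that writes its result to path."""
    with open(path, "w") as handle:
        handle.write("placeholder\n")


with tempfile.TemporaryDirectory() as outdir:
    statfile = os.path.join(outdir, "qcstats.txt")
    first = run_if_missing(statfile, write_placeholder)   # file absent: step runs
    second = run_if_missing(statfile, write_placeholder)  # file present: skipped

print(first, second)
```

One consequence of this pattern is that a stale output file silently suppresses a rerun, so deleting the file is the way to force a step to execute again.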




