Metadata-Version: 2.1
Name: cmsip
Version: 0.0.0.3
Summary: UNKNOWN
Home-page: https://github.com/lijinbio/cmsip
Author: Jin Li
Author-email: lijin.abc@gmail.com
License: License :: OSI Approved :: MIT License
Platform: UNKNOWN
Classifier: Programming Language :: Python :: 3
Requires-Python: >=3.6
Description-Content-Type: text/markdown
Requires-Dist: pyyaml
Requires-Dist: matplotlib
Requires-Dist: pandas

# CMSIP

Detecting differential 5hmC regions from CMS-IP sequencing data.

## Installation

### Dependencies

- bsmap

`bsmap` is a component of the MOABS package. See [https://github.com/sunnyisgalaxy/moabs](https://github.com/sunnyisgalaxy/moabs) for details.

- samtools: [http://samtools.sourceforge.net](http://samtools.sourceforge.net)

- bedtools: [https://bedtools.readthedocs.io](https://bedtools.readthedocs.io)

- kentUtils: [https://github.com/ENCODE-DCC/kentUtils](https://github.com/ENCODE-DCC/kentUtils)

## Example configuration file and description

```
sampleinfo:
  - sampleid: TKO2PE1b2
    group: tko
    filenames:
      - TKO2PE1b2_R1.fastq.gz
  - sampleid: TKO2PE2m
    group: tko
    filenames:
      - TKO2PE2b1_R1.fastq.gz
      - TKO2PE2b1_R2.fastq.gz
  - sampleid: WTPE1b2
    group: wt
    filenames:
      - WTPE1b2_R1.fastq.gz
  - sampleid: WTPE2b2
    group: wt
    filenames:
      - WTPE2b2_R1.fastq.gz
groupinfo:
  group1: tko
  group2: wt
datainfo:
  reference: hg38.fa.gz
  spikein: mm10.fa.gz
  windowfile: hg38_w100.bed
  windowsize: 100
  fastqdir: test_data
  outdir: outdir
  statfile: outdir/qcstats.txt
  cnttablefile: outdir/meancovtable.txt.gz
  ttestfile: outdir/t.test.txt
numthreads: 20
verbose: True
```

### `sampleinfo`

This block lists the samples. Each entry gives a `sampleid`, the `group` the sample belongs to, and the FASTQ `filenames` for that sample.
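
As a rough illustration of how this block looks once parsed, the sketch below groups sample IDs by their `group` field. It assumes the configuration has already been loaded into a plain dictionary (for example with `yaml.safe_load` from the `pyyaml` dependency). The helper name `samples_by_group` is hypothetical and is not part of cmsip.

```python
from collections import defaultdict

def samples_by_group(config):
    """Map each group name to the list of sample IDs assigned to it."""
    groups = defaultdict(list)
    for sample in config["sampleinfo"]:
        groups[sample["group"]].append(sample["sampleid"])
    return dict(groups)

# A trimmed-down version of the example configuration, as a parsed dictionary.
config = {
    "sampleinfo": [
        {"sampleid": "TKO2PE1b2", "group": "tko", "filenames": ["TKO2PE1b2_R1.fastq.gz"]},
        {"sampleid": "WTPE1b2", "group": "wt", "filenames": ["WTPE1b2_R1.fastq.gz"]},
        {"sampleid": "WTPE2b2", "group": "wt", "filenames": ["WTPE2b2_R1.fastq.gz"]},
    ],
}
print(samples_by_group(config))
```

The group names collected this way are the values that `groupinfo` refers to.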

### `groupinfo`

This block defines the comparison of interest. The alternative hypothesis is that the true difference in means between `group1` and `group2` is less than 0.
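
To make the direction of the test concrete, here is a minimal sketch of a two-sample t statistic computed as mean(`group1`) minus mean(`group2`). The README only says "t-test", so the use of Welch's unequal-variance form is an assumption, as are the function name `welch_t` and the toy values. Under the one-sided alternative described above, a clearly negative statistic is evidence that `group1` has the lower mean.

```python
import math
import statistics

def welch_t(group1, group2):
    """Welch's t statistic for mean(group1) - mean(group2).

    Assumption: the unequal-variance (Welch) form; the README does not say
    which variant of the t-test cmsip uses.
    """
    m1, m2 = statistics.mean(group1), statistics.mean(group2)
    v1, v2 = statistics.variance(group1), statistics.variance(group2)
    se = math.sqrt(v1 / len(group1) + v2 / len(group2))
    return (m1 - m2) / se

# Made-up coverage values for a single window, purely for illustration.
tko = [1.0, 2.0, 3.0]
wt = [4.0, 5.0, 6.0]
t = welch_t(tko, wt)
print(t)  # negative, because the tko mean is below the wt mean
```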

### `datainfo`

- reference

The FASTA file for the reference genome, such as hg38.fa.gz.

- spikein

The FASTA file for the spike-in genome, such as mm10.fa.gz.

- windowfile

The reference genome divided into fixed-size window bins, such as hg38_w100.bed. This file can be generated with bedtools, e.g.:

```
bedtools makewindows -g <(fetchChromSizes hg38) -w 100 > hg38_w100.bed
```
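
For readers who want to see what the resulting file contains, the following sketch reproduces the windowing logic in plain Python: each chromosome is cut into half-open intervals of the given width, and the last window is truncated at the chromosome end. The function name `make_windows` and the toy chromosome sizes are illustrative assumptions, not part of cmsip or bedtools.

```python
def make_windows(chrom_sizes, window_size):
    """Yield (chrom, start, end) tuples tiling each chromosome.

    Intervals are 0-based and half-open, as in BED; the final window of a
    chromosome is shortened so it does not run past the chromosome end.
    """
    for chrom, size in chrom_sizes.items():
        for start in range(0, size, window_size):
            yield chrom, start, min(start + window_size, size)

# Toy chromosome sizes (hypothetical), standing in for fetchChromSizes output.
toy_sizes = {"chrA": 250, "chrB": 100}
windows = list(make_windows(toy_sizes, 100))
for chrom, start, end in windows:
    print(f"{chrom}\t{start}\t{end}")
```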

- windowsize

Window size, in base pairs, used to create the bins, such as 100.

- fastqdir

Root directory containing the raw FASTQ files, such as test_data.

- outdir

Root output directory for temporary and final result files.

- statfile

QC statistics file. The default is outdir/qcstats.txt. If this file exists, the QC step is skipped and the size factors are parsed from the existing file. Otherwise, the QC step runs and generates the statistics file.

- cnttablefile

Region count table file. The default is outdir/meancovtable.txt.gz. If this file exists, the counting step is skipped and the existing count table is used for downstream statistical testing. Otherwise, the counting step runs and generates the count table file.

- ttestfile

The statistical testing result file. The default is outdir/t.test.txt. If this file exists, no further tasks run. Otherwise, a t-test is run on the count table.
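
The skip-if-exists behaviour of `statfile`, `cnttablefile`, and `ttestfile` can be sketched as below. This is not cmsip's actual code: the function name `pending_steps`, the step labels, and the assumption that each output file is checked independently are all illustrative.

```python
import os

def pending_steps(datainfo, exists=os.path.exists):
    """Return the steps whose output file is missing, in pipeline order.

    Assumption: each step is skipped only when its own output file exists.
    """
    outputs = [
        ("qc", "statfile"),
        ("count", "cnttablefile"),
        ("ttest", "ttestfile"),
    ]
    return [step for step, key in outputs if not exists(datainfo[key])]

datainfo = {
    "statfile": "outdir/qcstats.txt",
    "cnttablefile": "outdir/meancovtable.txt.gz",
    "ttestfile": "outdir/t.test.txt",
}
# Pretend only the QC statistics file is already on disk.
present = {"outdir/qcstats.txt"}
steps = pending_steps(datainfo, exists=present.__contains__)
print(steps)
```

Passing the `exists` check as a parameter keeps the sketch testable without touching the file system. In practice, deleting one of the three files is the way to force the corresponding step to run again.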




