Metadata-Version: 2.1
Name: pysumstats
Version: 0.3.1
Summary: Package for working with GWAS summary statistics
Home-page: https://github.com/matthijsz/pysumstats
Author: Matthijs D. van der Zee
Author-email: m.d.vander.zee@vu.nl
License: MIT
Keywords: gwas summary statistics genetics
Platform: UNKNOWN
Classifier: Development Status :: 4 - Beta
Classifier: License :: OSI Approved :: MIT License
Classifier: Programming Language :: Python :: 3.7
Classifier: Topic :: Scientific/Engineering :: Bio-Informatics
Description-Content-Type: text/markdown
Requires-Dist: pandas
Requires-Dist: tables
Requires-Dist: numpy
Requires-Dist: scipy
Requires-Dist: matplotlib

[![Documentation Status](https://readthedocs.org/projects/pysumstats/badge/?version=latest)](https://pysumstats.readthedocs.io/en/latest/?badge=latest)
[![Python 3.7](https://img.shields.io/badge/python-3.7-blue.svg)](https://www.python.org/downloads/release/python-370/)
[![PyPI version](https://badge.fury.io/py/pysumstats.svg)](https://badge.fury.io/py/pysumstats)
[![License: MIT](https://img.shields.io/badge/License-MIT-yellow.svg)](https://opensource.org/licenses/MIT)
[![Build Status](https://travis-ci.org/matthijsz/pysumstats.svg?branch=master)](https://travis-ci.org/matthijsz/pysumstats)

# Patch notes

##### 13-05-2020 (v0.3.1)
 - Fixed an issue where reading data would fail when values in the n, bp, or chr columns were NA. An attempt is now made to impute these values. If too many are missing, a ValueError is raised.

##### 12-05-2020 (v0.3)
 - Added `fig` and `ax` arguments to `pysumstats.plot.qqplot` and `pysumstats.plot.manhattan` to enable plotting to existing figure and axis.
 - Added `pysumstats.plot.pzplot`, to visually compare Z-values from `B/SE` to Z-values calculated from the P-value.
 - Added `pysumstats.plot.afplot`, to plot allele frequency differences between summary statistics.
 - Added `pysumstats.plot.zzplot`, to plot differences in Z-values between summary statistics.
 - Added `qqplot`, `manhattan`, `pzplot`, `afplot`, `zzplot` functions to MergedSumStats object.
 - Added `pzplot` function to SumStats object.
 - Added `plot_all` functions to SumStats and MergedSumStats objects to automatically generate all possible plots for the object.

##### 11-05-2020 (v0.2.3)

 - Added `return` statement to MergedSumStats.merge() when `inplace=False` and merging with other MergedSumstats.
 - Added docstrings to base, mergedsumstats, sumstats and utils.
 - Added [docs](https://pysumstats.readthedocs.io/en/latest/)
 - Fixed import errors and added `manhattan` and `qq` function to `SumStats` class

##### 08-05-2020 (v0.2)

 - Added `plot` subpackage with `qqplot` and `manhattan`, from my initial [Python-QQMan module](https://github.com/matthijsz/qqman).

##### 08-05-2020 (v0.1)

 - Adapted to be a package rather than a module.
 - Added `low_ram` argument to SumStats to read/write data to disk rather than RAM, in case of memory issues.  

# Description

A Python package for working with GWAS summary statistics data. <br/>
This package is designed to make it easy to read summary statistics, perform QC, merge summary statistics and perform meta-analysis.<br/>
Meta-analysis can be performed with `.meta_analyze()` using either the inverse-variance weighted or the sample-size weighted method.<br/>
GWAMA as described in [Baselmans, et al. (2019)](https://www.nature.com/articles/s41588-018-0320-8) can be performed using the `.gwama()` function in merged summary statistics. <br/>
The plotting package uses matplotlib.pyplot for generating figures, so the functions are generally compatible with matplotlib.pyplot colors, and Figure and Axis objects. <br/>
Warning: merging with low_memory enabled is still highly experimental. <br/>
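To make the two weighting schemes concrete, here is a minimal sketch of the textbook formulas for a single SNP, using only the standard library. The function names `ivw_meta` and `samplesize_meta` and the input numbers are made up for illustration; this is not taken from the pysumstats source, which may handle details differently. It assumes the effect sizes are already aligned to the same effect allele.

```python
import math


def two_sided_p(z):
    """Two-sided p-value for a standard-normal Z statistic."""
    # erfc(|z| / sqrt(2)) equals 2 * (1 - Phi(|z|)) and stays accurate for large |z|.
    return math.erfc(abs(z) / math.sqrt(2.0))


def ivw_meta(betas, ses):
    """Fixed-effect inverse-variance weighted estimate for one SNP."""
    weights = [1.0 / se ** 2 for se in ses]          # w_i = 1 / se_i^2
    total = sum(weights)
    b = sum(w * beta for w, beta in zip(weights, betas)) / total
    se = math.sqrt(1.0 / total)
    z = b / se
    return b, se, z, two_sided_p(z)


def samplesize_meta(betas, ses, ns):
    """Sample-size weighted Z for one SNP, with weights sqrt(n_i)."""
    zs = [beta / se for beta, se in zip(betas, ses)]
    weights = [math.sqrt(n) for n in ns]
    z = sum(w * zi for w, zi in zip(weights, zs)) / math.sqrt(sum(w * w for w in weights))
    return z, two_sided_p(z)


# Hypothetical effect sizes, standard errors and sample sizes from two studies
b, se, z, p = ivw_meta([0.10, 0.14], [0.02, 0.04])
z_n, p_n = samplesize_meta([0.10, 0.14], [0.02, 0.04], [50000, 12000])
print(b, se, p)
print(z_n, p_n)
```

The inverse-variance method returns a pooled effect size and standard error, while the sample-size method only combines Z-values, which is why it is the usual fallback when effect sizes are not on a comparable scale.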

# Reference

Using the pysumstats package for a publication, or something similar? That is **awesome**! <br/>
There is no publication attached to this package, and I am not going to force anyone to reference me or make me a co-author; I want this to remain easily accessible. 
But I would greatly appreciate it if you added a link to the GitHub repository, or a reference to it in the acknowledgements. <br/>
If you have any questions, want to help add methods, or want to let me know you are planning a publication with this, you can get in touch via the [PyPI page of this project](https://pypi.org/project/pysumstats/).

# Installation

This package was made for Python 3.7. Clone the package directly from [GitHub](https://github.com/matthijsz/pysumstats), or install it with 

`pip3 install pysumstats`


# Usage

`import pysumstats as sumstats`
###### Reading files
`s1 = sumstats.SumStats("sumstats1.csv.gz", phenotype='GWASsummary1')`
###### Reading data without sample size column: you will manually have to specify gwas sample size
`s2 = sumstats.SumStats("sumstats2.txt.gz", phenotype='GWASsummary2', gwas_n=350492)`
###### Reading data with column names not automatically recognized:
```
s3 = sumstats.SumStats("sumstats3.csv", phenotype='GWASsummary3',
                              column_names={
                                    'rsid': 'weird_name_for_rsid',
                                    'chr': 'weird_name_for_chr',
                                    'bp': 'weird_name_for_bp',
                                    'ea': 'weird_name_for_ea',
                                    'oa': 'weird_name_for_oa',
                                    'maf': 'weird_name_for_maf',
                                    'b': 'weird_name_for_b',
                                    'se': 'weird_name_for_se',
                                    'p': 'weird_name_for_p',
                                    'hwe': 'weird_name_for_p_hwe',
                                    'info': 'weird_name_for_info',
                                    'n': 'weird_name_for_n',
                                    'eaf': 'weird_name_for_eaf',
                                    'oaf': 'weird_name_for_oaf'})
```
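The `column_names` dictionary maps each standard name (the key) to the name used in your file (the value). As a rough sketch of what such a mapping implies, the hypothetical helper `standardize_header` below inverts it and renames a header row. Leaving unrecognized columns unchanged is an assumption made here, not documented pysumstats behaviour.

```python
def standardize_header(header, column_names):
    """Rename file-specific column names to the standard names.

    `column_names` maps standard name -> name used in the file,
    mirroring the dictionary passed to SumStats above.
    """
    file_to_standard = {file_name: std for std, file_name in column_names.items()}
    # Columns that are not in the mapping are kept as they are (assumption).
    return [file_to_standard.get(col, col) for col in header]


header = ['weird_name_for_rsid', 'weird_name_for_b', 'weird_name_for_se', 'extra']
mapping = {'rsid': 'weird_name_for_rsid', 'b': 'weird_name_for_b', 'se': 'weird_name_for_se'}
print(standardize_header(header, mapping))  # ['rsid', 'b', 'se', 'extra']
```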
###### Performing qc
```
s1.qc(maf=.01)
s2.qc(maf=.01, hwe=1e-6, info=.9)
s3.qc()  # MAF .01 is the default
```
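Conceptually, these thresholds act as per-SNP filters. The sketch below shows one plausible reading: drop SNPs with a minor allele frequency below `maf`, a Hardy-Weinberg p-value below `hwe`, or an imputation quality below `info`. The helper `passes_qc`, the direction of each comparison, and the example SNPs are assumptions for illustration only; check the pysumstats docs for the exact behaviour.

```python
def passes_qc(snp, maf=0.01, hwe=None, info=None):
    """Return True if a SNP (a dict of column -> value) survives the filters.

    A threshold of None means that filter is skipped.
    """
    if maf is not None and snp['maf'] < maf:
        return False  # too rare
    if hwe is not None and snp['hwe'] < hwe:
        return False  # deviates from Hardy-Weinberg equilibrium
    if info is not None and snp['info'] < info:
        return False  # poorly imputed
    return True


# Made-up SNPs, one passing and three failing a different filter each
snps = [
    {'rsid': 'snp_a', 'maf': 0.20, 'hwe': 0.5, 'info': 0.99},
    {'rsid': 'snp_b', 'maf': 0.005, 'hwe': 0.5, 'info': 0.99},
    {'rsid': 'snp_c', 'maf': 0.30, 'hwe': 1e-9, 'info': 0.99},
    {'rsid': 'snp_d', 'maf': 0.30, 'hwe': 0.5, 'info': 0.50},
]
kept = [s['rsid'] for s in snps if passes_qc(s, maf=.01, hwe=1e-6, info=.9)]
print(kept)  # ['snp_a']
```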
###### Merging sumstats (the low_memory option is still experimental, so be careful with it)
`merge1 = s1.merge(s2)`

###### Meta analysis
```
n_weighted_meta = merge1.meta_analyze(name='meta1', method='samplesize')  # N-weighted meta analysis
ivw_meta = merge1.meta_analyze(name='meta1', method='ivw')  # Standard inverse-variance weighted meta analysis
gwama = merge1.gwama(name='meta1', method='ivw')  # GWAMA as described in Baselmans, et al. (2019)
```
###### Additionally supports adding SNP heritabilities as weights
`exc_meta = exc.gwama(h2_snp={'ntr_exc': .01, 'ukb_ssoe': .02}, name='exc', method='ivw')`
###### And your own covariance matrix (called cov_Z in most R scripts)
```
# Either read it from a file:
import pandas as pd
# Note: it should be a pandas DataFrame with column names and index names equal to your phenotypes
cov_z = pd.read_csv('my_cov_z.csv', index_col=0)  # index_col=0 assumes the first CSV column holds the phenotype names

# Or generate it from a phenotype file yourself:
phenotypes = pd.read_csv('my_phenotype_file.csv')
cov_z = sumstats.cov_matrix_from_phenotype_file(phenotypes, phenotypes=['GWASsummary1', 'GWASsummary2'])

gwama = merge1.gwama(cov_matrix=cov_z, h2_snp={'GWASsummary1': .01, 'GWASsummary2': .02}, name='meta1', method='ivw')
```
###### See a summary of the result
`gwama.describe()`
###### See head of the data
`gwama.head()`
###### See head of all chromosomes
`gwama.head(n_chromosomes=23)`

###### QQ and Manhattan plots of the result
```
gwama.manhattan(filename='meta_manhattan.png')
gwama.qqplot(filename='meta_qq.png')
``` 
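For reference, a QQ plot compares the observed -log10(p) values against what you would expect if all p-values were uniformly distributed. The sketch below computes those point pairs with the standard library only. The helper `qq_points` and the `(rank - 0.5) / n` plotting position are choices made for this illustration; `pysumstats.plot.qqplot` may compute the expected values differently.

```python
import math


def qq_points(pvalues):
    """Return (expected, observed) -log10(p) pairs under a uniform null.

    P-values must be strictly greater than 0, otherwise log10 fails.
    """
    ps = sorted(pvalues)  # smallest p-value gets rank 1
    n = len(ps)
    expected = [-math.log10((rank - 0.5) / n) for rank in range(1, n + 1)]
    observed = [-math.log10(p) for p in ps]
    return list(zip(expected, observed))


# Made-up p-values; points far above the diagonal suggest inflation or true signal
for exp, obs in qq_points([0.9, 0.01, 0.3]):
    print(round(exp, 3), round(obs, 3))
```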

###### Save the result as csv
`exc.save('exc_sumstats.csv')`
###### Save the result as a pickle file (way faster to save and load back into Python)
`exc.save('exc_sumstats.pickle')`

###### Merge gwama results with another file:
`merged = gwama.merge(s3)`
###### Save prepped files for MR analysis in R:
```
merged.prep_for_mr(exposure='GWASsummary3', outcome='meta1',
                   filename=['GWAS3-Meta.csv', 'Meta-GWAS3.csv'],
                   p_cutoff=5e-8, bidirectional=True, index=False)
```
The resulting files will have the following column names, per specification of the MendelianRandomization package in R:

`rsid	chr	bp	exposure.A1	exposure.A2	outcome.A1	outcome.A2	exposure.se	exposure.b	outcome.se	outcome.b`

###### Some other stuff:
```
# See column names of the file
exc.columns

# SumStats support for standard indexing is growing:
exc[0]  # Get the full output of the first SNP
exc[:10]  # Get the full output of the first 10 SNPs
exc[:10, 'p']  # Get the p value of the first 10 SNPs
exc['p']  # Get the p values of all SNPs
exc['rs78948828']  # Get the full output of 1 specific rsid
exc[['rs78948828', 'rs6057089', 'rs55957973']]  # Get the full output of multiple specific rsids
exc[['rs78948828', 'rs6057089', 'rs55957973'], 'p']  # Get the p-value for specific rsids

# If for whatever reason you want to do stuff with each SNP individually, you can also loop over the entire file
for snp_output in exc:
    if snp_output['p'] < 5e-8:  # test this SNP's p-value, not the whole 'p' column
        print('Yay significant SNP!')
    # do something


# If you only want to loop over some specific columns, you can
for rsid, b, se, p in exc[['rsid', 'b', 'se', 'p']].values:
    if p < 5e-8:
        print('Yay significant SNP!')


```



