Metadata-Version: 2.1
Name: snps
Version: 1.2.2
Summary: tools for reading, writing, merging, and remapping SNPs
Home-page: https://github.com/apriha/snps
Author: Andrew Riha
Author-email: apriha@gmail.com
License: BSD 3-Clause License
Project-URL: Changelog, https://github.com/apriha/snps/releases
Project-URL: Issue Tracker, https://github.com/apriha/snps/issues
Keywords: snps dna chromosomes bioinformatics
Platform: any
Classifier: Development Status :: 5 - Production/Stable
Classifier: Intended Audience :: Developers
Classifier: Intended Audience :: End Users/Desktop
Classifier: Intended Audience :: Healthcare Industry
Classifier: Intended Audience :: Science/Research
Classifier: License :: OSI Approved :: BSD License
Classifier: Natural Language :: English
Classifier: Operating System :: OS Independent
Classifier: Programming Language :: Python :: 3.5
Classifier: Programming Language :: Python :: 3.6
Classifier: Programming Language :: Python :: 3.7
Classifier: Programming Language :: Python :: 3.8
Classifier: Topic :: Scientific/Engineering
Classifier: Topic :: Scientific/Engineering :: Bio-Informatics
Classifier: Topic :: Scientific/Engineering :: Information Analysis
Classifier: Topic :: Utilities
Requires-Python: >=3.5
Requires-Dist: numpy
Requires-Dist: pandas (<=0.25.3)
Requires-Dist: atomicwrites

.. image:: https://raw.githubusercontent.com/apriha/snps/master/docs/images/snps_banner.png

|build| |codecov| |docs| |pypi| |python| |downloads|

snps
====
tools for reading, writing, merging, and remapping SNPs 🧬

Features
--------
Input / Output
``````````````
- Read raw data (genotype) files from a variety of direct-to-consumer (DTC) DNA testing
  sources with a `SNPs <https://snps.readthedocs.io/en/latest/snps.html#snps.snps.SNPs>`_
  object
- Read and write VCF files (e.g., convert `23andMe <https://www.23andme.com>`_ to VCF)
- Merge raw data files from different DNA tests, identifying discrepant SNPs in the
  process with a
  `SNPsCollection <https://snps.readthedocs.io/en/latest/snps.html#snps.snps_collection.SNPsCollection>`_
  object
- Read data in a variety of formats (e.g., files, bytes, compressed with `gzip` or `zip`)
- Handle several variations of file types, validated via
  `openSNP parsing analysis <https://github.com/apriha/snps/tree/master/analysis/parse-opensnp-files>`_

Build / Assembly Detection and Remapping
````````````````````````````````````````
- Detect the build / assembly of SNPs (supports builds 36, 37, and 38)
- Remap SNPs between builds / assemblies

Data Cleaning
`````````````
- Fix several common issues when loading SNPs
- Sort SNPs based on chromosome and position
- Deduplicate RSIDs
- Deduplicate alleles in the non-PAR regions of the X and Y chromosomes for males
- Assign PAR SNPs to the X or Y chromosome

Supported Genotype Files
------------------------
``snps`` supports `VCF <https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3137218/>`_ files and
genotype files from the following DNA testing sources:

- `23andMe <https://www.23andme.com>`_
- `Ancestry <https://www.ancestry.com>`_
- `Código 46 <https://codigo46.com.mx>`_
- `DNA.Land <https://dna.land>`_
- `Family Tree DNA <https://www.familytreedna.com>`_
- `Genes for Good <https://genesforgood.sph.umich.edu>`_
- `LivingDNA <https://livingdna.com>`_
- `Mapmygenome <https://mapmygenome.in>`_
- `MyHeritage <https://www.myheritage.com>`_
- `Sano Genetics <https://sanogenetics.com>`_

Additionally, ``snps`` can read a variety of "generic" CSV and TSV files.

Dependencies
------------
``snps`` requires `Python <https://www.python.org>`_ 3.5+ and the following Python packages:

- `numpy <http://www.numpy.org>`_
- `pandas <http://pandas.pydata.org>`_
- `atomicwrites <https://github.com/untitaker/python-atomicwrites>`_

Installation
------------
``snps`` is `available <https://pypi.org/project/snps/>`_ on the
`Python Package Index <https://pypi.org>`_. Install ``snps`` (and its required
Python dependencies) via ``pip``::

    $ pip install snps

Examples
--------
Download Example Data
`````````````````````
First, let's setup logging to get some helpful output:

>>> import logging, sys
>>> logger = logging.getLogger()
>>> logger.setLevel(logging.INFO)
>>> logger.addHandler(logging.StreamHandler(sys.stdout))

Now we're ready to download some example data from `openSNP <https://opensnp.org>`_:

>>> from snps.resources import Resources
>>> r = Resources()
>>> paths = r.download_example_datasets()
Downloading resources/662.23andme.340.txt.gz
Downloading resources/662.ftdna-illumina.341.csv.gz

Load Raw Data
`````````````
Load a `23andMe <https://www.23andme.com>`_ raw data file:

>>> from snps import SNPs
>>> s = SNPs('resources/662.23andme.340.txt.gz')

The ``SNPs`` class accepts a path to a file or a bytes object. A ``Reader`` class attempts to
infer the data source and load the SNPs. The loaded SNPs are available via a ``pandas.DataFrame``:

>>> df = s.snps
>>> df.columns.values
array(['chrom', 'pos', 'genotype'], dtype=object)
>>> df.index.name
'rsid'
>>> len(df)
991786

``snps`` also attempts to detect the build / assembly of the data:

>>> s.build
37
>>> s.build_detected
True
>>> s.assembly
'GRCh37'

Remap SNPs
``````````
Let's remap the SNPs to change the assembly / build:

>>> s.snps.loc["rs3094315"].pos
752566
>>> chromosomes_remapped, chromosomes_not_remapped = s.remap_snps(38)
Downloading resources/GRCh37_GRCh38.tar.gz
>>> s.build
38
>>> s.assembly
'GRCh38'
>>> s.snps.loc["rs3094315"].pos
817186

SNPs can be remapped between Build 36 (``NCBI36``), Build 37 (``GRCh37``), and Build 38
(``GRCh38``).

Merge Raw Data Files
````````````````````
The dataset consists of raw data files from two different DNA testing sources. Let's combine
these files using a ``SNPsCollection``.

>>> from snps import SNPsCollection
>>> sc = SNPsCollection("resources/662.ftdna-illumina.341.csv.gz", name="User662")
Loading resources/662.ftdna-illumina.341.csv.gz
>>> sc.build
36
>>> chromosomes_remapped, chromosomes_not_remapped = sc.remap_snps(37)
Downloading resources/NCBI36_GRCh37.tar.gz
>>> sc.snp_count
708092

As the data gets added, it's compared to the existing data, and SNP position and genotype
discrepancies are identified. (The discrepancy thresholds can be tuned via parameters.)

>>> sc.load_snps(["resources/662.23andme.340.txt.gz"], discrepant_genotypes_threshold=300)
Loading resources/662.23andme.340.txt.gz
27 SNP positions were discrepant; keeping original positions
151 SNP genotypes were discrepant; marking those as null
>>> len(sc.discrepant_snps)  # SNPs with discrepant positions and genotypes, dropping dups
169
>>> sc.snp_count
1006960

Save SNPs
`````````
Ok, so far we've remapped the SNPs to the same build and merged the SNPs from two files,
identifying discrepancies along the way. Let's save the merged dataset consisting of over 1M+
SNPs to a CSV file:

>>> saved_snps = sc.save_snps()
Saving output/User662_GRCh37.csv

Moreover, let's get the reference sequences for this assembly and save the SNPs as a VCF file:

>>> saved_snps = sc.save_snps(vcf=True)
Downloading resources/fasta/GRCh37/Homo_sapiens.GRCh37.dna.chromosome.1.fa.gz
Downloading resources/fasta/GRCh37/Homo_sapiens.GRCh37.dna.chromosome.2.fa.gz
Downloading resources/fasta/GRCh37/Homo_sapiens.GRCh37.dna.chromosome.3.fa.gz
Downloading resources/fasta/GRCh37/Homo_sapiens.GRCh37.dna.chromosome.4.fa.gz
Downloading resources/fasta/GRCh37/Homo_sapiens.GRCh37.dna.chromosome.5.fa.gz
Downloading resources/fasta/GRCh37/Homo_sapiens.GRCh37.dna.chromosome.6.fa.gz
Downloading resources/fasta/GRCh37/Homo_sapiens.GRCh37.dna.chromosome.7.fa.gz
Downloading resources/fasta/GRCh37/Homo_sapiens.GRCh37.dna.chromosome.8.fa.gz
Downloading resources/fasta/GRCh37/Homo_sapiens.GRCh37.dna.chromosome.9.fa.gz
Downloading resources/fasta/GRCh37/Homo_sapiens.GRCh37.dna.chromosome.10.fa.gz
Downloading resources/fasta/GRCh37/Homo_sapiens.GRCh37.dna.chromosome.11.fa.gz
Downloading resources/fasta/GRCh37/Homo_sapiens.GRCh37.dna.chromosome.12.fa.gz
Downloading resources/fasta/GRCh37/Homo_sapiens.GRCh37.dna.chromosome.13.fa.gz
Downloading resources/fasta/GRCh37/Homo_sapiens.GRCh37.dna.chromosome.14.fa.gz
Downloading resources/fasta/GRCh37/Homo_sapiens.GRCh37.dna.chromosome.15.fa.gz
Downloading resources/fasta/GRCh37/Homo_sapiens.GRCh37.dna.chromosome.16.fa.gz
Downloading resources/fasta/GRCh37/Homo_sapiens.GRCh37.dna.chromosome.17.fa.gz
Downloading resources/fasta/GRCh37/Homo_sapiens.GRCh37.dna.chromosome.18.fa.gz
Downloading resources/fasta/GRCh37/Homo_sapiens.GRCh37.dna.chromosome.19.fa.gz
Downloading resources/fasta/GRCh37/Homo_sapiens.GRCh37.dna.chromosome.20.fa.gz
Downloading resources/fasta/GRCh37/Homo_sapiens.GRCh37.dna.chromosome.21.fa.gz
Downloading resources/fasta/GRCh37/Homo_sapiens.GRCh37.dna.chromosome.22.fa.gz
Downloading resources/fasta/GRCh37/Homo_sapiens.GRCh37.dna.chromosome.X.fa.gz
Downloading resources/fasta/GRCh37/Homo_sapiens.GRCh37.dna.chromosome.Y.fa.gz
Downloading resources/fasta/GRCh37/Homo_sapiens.GRCh37.dna.chromosome.MT.fa.gz
Saving output/User662_GRCh37.vcf

All `output files <https://snps.readthedocs.io/en/latest/output_files.html>`_ are saved to the
output directory.

Documentation
-------------
Documentation is available `here <https://snps.readthedocs.io/>`_.

Acknowledgements
----------------
Thanks to Mike Agostino, Padma Reddy, Kevin Arvai, `openSNP <https://opensnp.org>`_,
`Open Humans <https://www.openhumans.org>`_, and `Sano Genetics <https://sanogenetics.com>`_.

.. https://github.com/rtfd/readthedocs.org/blob/master/docs/badges.rst
.. |build| image:: https://travis-ci.org/apriha/snps.svg?branch=master
   :target: https://travis-ci.org/apriha/snps
.. |codecov| image:: https://codecov.io/gh/apriha/snps/branch/master/graph/badge.svg
   :target: https://codecov.io/gh/apriha/snps
.. |docs| image:: https://readthedocs.org/projects/snps/badge/?version=latest
   :target: https://snps.readthedocs.io/
.. |pypi| image:: https://img.shields.io/pypi/v/snps.svg
   :target: https://pypi.python.org/pypi/snps
.. |python| image:: https://img.shields.io/pypi/pyversions/snps.svg
   :target: https://www.python.org
.. |downloads| image:: https://pepy.tech/badge/snps
   :target: https://pepy.tech/project/snps


