Metadata-Version: 1.2
Name: pysradb
Version: 0.8.0
Summary: Python package for interacting with SRAdb and downloading datasets from SRA
Home-page: https://github.com/saketkc/pysradb
Author: Saket Choudhary
Author-email: saketkc@gmail.com
License: BSD license
Description: #######
        pysradb
        #######
        
        .. image:: https://zenodo.org/badge/159590788.svg
            :target: https://zenodo.org/badge/latestdoi/159590788
        
        .. image:: https://img.shields.io/pypi/v/pysradb.svg?style=flat-square
            :target: https://pypi.python.org/pypi/pysradb
        
        .. image:: https://img.shields.io/travis/saketkc/pysradb.svg?style=flat-square
            :target: https://travis-ci.com/saketkc/pysradb
        
        .. image:: https://img.shields.io/badge/install%20with-bioconda-brightgreen.svg?style=flat-square
            :target: http://bioconda.github.io/recipes/pysradb/README.html
        
        .. image:: https://codecov.io/gh/saketkc/pysradb/branch/master/graph/badge.svg?style=flat-square
            :target: https://codecov.io/gh/saketkc/pysradb
        
        Python package for interacting with SRAdb and downloading datasets from SRA.
        (python3 only!)
        
        .. raw:: html
        
            <a href="https://asciinema.org/a/0C3SjYmPTkkemldprUpdVhiKx?speed=5&autoplay=1" target="_blank"><img src="https://asciinema.org/a/0C3SjYmPTkkemldprUpdVhiKx.svg" /></a>
        
        
        *********
        CLI Usage
        *********
        
        ``pysradb`` supports command line usage. The documentation
        is in progress. See `cmdline <https://github.com/saketkc/pysradb/blob/master/docs/cmdline.rst>`_ for
        some quick usage instructions. See `quickstart <https://www.saket-choudhary.me/pysradb/quickstart.html#the-full-list-of-possible-pysradb-operations>`_ for
        a list of instructions for each sub-command.
        
        
        ::
        
           $ pysradb
        
            Usage: pysradb [OPTIONS] COMMAND [ARGS]...
        
              pysradb: Query NGS metadata and data from NCBI Sequence Read Archive.
        
              Citation: Pending.
        
            Options:
              --version   Show the version and exit.
              -h, --help  Show this message and exit.
        
            Commands:
              download    Download SRA project (SRPnnnn)
              gse-to-gsm  Get GSM for a GSE
              gse-to-srp  Get SRP for a GSE
              gsm-to-gse  Get GSE for a GSM
              gsm-to-srp  Get SRP for a GSM
              gsm-to-srr  Get SRR for a GSM
              gsm-to-srx  Get SRX for a GSM
              metadata    Fetch metadata for SRA project (SRPnnnn)
              metadb      Download SRAmetadb.sqlite
              search      Search SRA for matching text
              srp-to-gse  Get GSE for a SRP
              srp-to-srr  Get SRR for a SRP
              srp-to-srs  Get SRS for a SRP
              srr-to-gsm  Get GSM for a SRR
              srp-to-srx  Get SRX for a SRP
              srr-to-srp  Get SRP for a SRR
              srr-to-srs  Get SRS for a SRR
              srr-to-srx  Get SRX for a SRR
              srs-to-srx  Get SRX for a SRS
              srx-to-srp  Get SRP for a SRX
              srx-to-srr  Get SRR for a SRX
              srx-to-srs  Get SRS for a SRX
        
        
        ************
        Installation
        ************
        
        
        To install the stable version using ``pip``:
        
        .. code-block:: bash
        
           pip install pysradb
        
        Alternatively, if you use conda:
        
        .. code-block:: bash
        
           conda install -c bioconda pysradb
        
        This step will install all the dependencies except aspera-client_ (which is not required, but highly recommended).
        If you have an existing environment with a lot of pre-installed packages, conda might be `slow <https://github.com/bioconda/bioconda-recipes/issues/13774>`_.
        Please consider creating a new environment for ``pysradb``:
        
        .. code-block:: bash
        
           conda create -c bioconda -n pysradb python=3 pysradb
        
        Dependencies
        ============
        
        .. code-block:: bash
        
           pandas>=0.23.4
           tqdm>=4.28
           click>=7.0
           aspera-client
           SRAmetadb.sqlite
        
        Downloading SRAmetadb
        =====================
        
        We need a SQLite database file that has preprocessed metadata made available by the
        `SRAdb <https://bmcbioinformatics.biomedcentral.com/articles/10.1186/1471-2105-14-19>`_ project.
        
        SRAmetadb can be downloaded using:
        
        .. code-block:: bash
        
           wget -c https://starbuck1.s3.amazonaws.com/sradb/SRAmetadb.sqlite.gz && gunzip SRAmetadb.sqlite.gz
        
        Alternatively, you can also download it using ``pysradb``, which by default downloads it into your
        current working directory:
        
        
        ::
        
            $ pysradb metadb
        
        You can also specify an alternate directory for download by supplying the ``--out-dir <OUT_DIR>`` argument.
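        
        After the download, a quick sanity check is to open the file with Python's built-in ``sqlite3`` module and list its tables.
        The sketch below is only an illustration: ``list_tables`` is a hypothetical helper, not part of ``pysradb``, and the path you pass is an assumption about where you saved the file.
        
        .. code-block:: python
        
           import sqlite3
           from pathlib import Path
        
        
           def list_tables(sqlite_path):
               """Return the sorted table names found in a SQLite file."""
               # Open read-only through a URI so that a mistyped path raises an
               # error instead of silently creating a new, empty database.
               uri = Path(sqlite_path).resolve().as_uri() + "?mode=ro"
               conn = sqlite3.connect(uri, uri=True)
               try:
                   rows = conn.execute(
                       "SELECT name FROM sqlite_master "
                       "WHERE type = 'table' ORDER BY name"
                   ).fetchall()
               finally:
                   conn.close()
               return [name for (name,) in rows]
        
        For example, ``list_tables("./SRAmetadb.sqlite")`` should return a non-empty list if the file was downloaded and extracted correctly.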
        
        .. _aspera-client:
        
        
        aspera-client
        =============
        
        We strongly recommend using ``aspera-client`` (which uses UDP), since it typically gives `faster downloads <http://www.skullbox.net/tcpudp.php>`_ than ``ftp``/``http``-based downloads.
        
        PDF instructions are available on IBM's `website <https://downloads.asperasoft.com/connect2/>`_.
        
        Direct download links:
        
        - `Linux <https://download.asperasoft.com/download/sw/connect/3.8.1/ibm-aspera-connect-3.8.1.161274-linux-g2.12-64.tar.gz>`_
        - `MacOS <https://download.asperasoft.com/download/sw/connect/3.8.1/IBMAsperaConnectInstaller-3.8.1.161274.dmg>`_
        - `Windows <https://download.asperasoft.com/download/sw/connect/3.8.1/IBMAsperaConnect-ML-3.8.1.161274.msi>`_
        
        Once you have downloaded the archive for your OS (Linux in this example), follow these steps to install Aspera:
        
        .. code-block:: bash
        
           tar -zxvf ibm-aspera-connect-3.8.1.161274-linux-g2.12-64.tar.gz
           bash ibm-aspera-connect-3.8.1.161274-linux-g2.12-64.sh
           Installing IBM Aspera Connect
           Deploying IBM Aspera Connect (/home/saket/.aspera/connect) for the current user only.
           Install complete.
        
        
        Installing pysradb in development mode
        ======================================
        
        .. code-block:: bash
        
           pip install -U pandas tqdm
           git clone https://github.com/saketkc/pysradb.git
           cd pysradb
           pip install -e .
        
        
        
        *************
        Using pysradb
        *************
        
        Please see `usage_scenarios <https://saket-choudhary.me/pysradb/usage_scenarios.html>`_ for a few usage scenarios.
        Here are a few hand-picked examples.
        
        
        Getting SRA metadata
        ====================
        
        ::
        
            $ pysradb metadata --db ./SRAmetadb.sqlite SRP000941 --assay --desc --expand | head
        
            study_accession experiment_accession sample_accession run_accession library_strategy batch         biomaterial_provider             biomaterial_type cell_type    collection_method differentiation_method                                                                                                                     differentiation_stage                                                                disease                                                          donor_age donor_ethnicity                 donor_health_status                                                                                 donor_id donor_sex line          lineage                                                               medium                                                                                                                                                                                                   molecule     passage                             sample_term_id  sex     source_name              tissue                   tissue_depot tissue_type
            SRP000941       SRX006235            SRS004118        SRR018454     ChIP-Seq         NaN           cellular dynamics international  cell line        NaN          NaN               none                                                                                                                                       none                                                                                 none                                                             NaN       NaN                             NaN                                                                                                 NaN      NaN       h1            embryonic stem cell                                                   mteser                                                                                                                                                                                                   genomic dna  between 30 and 50                   efo_0003042     male    NaN                      NaN                      NaN          NaN
            SRP000941       SRX006236            SRS004118        SRR018456     ChIP-Seq         NaN           cellular dynamics international  cell line        NaN          NaN               none                                                                                                                                       none                                                                                 none                                                             NaN       NaN                             NaN                                                                                                 NaN      NaN       h1            embryonic stem cell                                                   mteser                                                                                                                                                                                                   genomic dna  between 30 and 50                   efo_0003042     male    NaN                      NaN                      NaN          NaN
            SRP000941       SRX006237            SRS004118        SRR018455     ChIP-Seq         NaN           cellular dynamics international  cell line        NaN          NaN               none                                                                                                                                       none                                                                                 none                                                             NaN       NaN                             NaN                                                                                                 NaN      NaN       h1            embryonic stem cell                                                   mteser                                                                                                                                                                                                   genomic dna  between 30 and 50                   efo_0003042     male    NaN                      NaN                      NaN          NaN
            SRP000941       SRX006239            SRS004213        SRR019072     Bisulfite-Seq    #2            thomson laboratory               cell line        NaN          NaN               na                                                                                                                                         embryonic stem cell                                                                  none                                                             NaN       NaN                             NaN                                                                                                 NaN      NaN       h1            na                                                                    tesr                                                                                                                                                                                                     genomic dna  27                                  efo_0003042     male    NaN                      NaN                      NaN          NaN
            SRP000941       SRX006239            SRS004213        SRR019080     Bisulfite-Seq    #2            thomson laboratory               cell line        NaN          NaN               na                                                                                                                                         embryonic stem cell                                                                  none                                                             NaN       NaN                             NaN                                                                                                 NaN      NaN       h1            na                                                                    tesr                                                                                                                                                                                                     genomic dna  27                                  efo_0003042     male    NaN                      NaN                      NaN          NaN
            SRP000941       SRX006239            SRS004213        SRR019081     Bisulfite-Seq    #2            thomson laboratory               cell line        NaN          NaN               na                                                                                                                                         embryonic stem cell                                                                  none                                                             NaN       NaN                             NaN                                                                                                 NaN      NaN       h1            na                                                                    tesr                                                                                                                                                                                                     genomic dna  27                                  efo_0003042     male    NaN                      NaN                      NaN          NaN
            SRP000941       SRX006239            SRS004213        SRR019082     Bisulfite-Seq    #2            thomson laboratory               cell line        NaN          NaN               na                                                                                                                                         embryonic stem cell                                                                  none                                                             NaN       NaN                             NaN                                                                                                 NaN      NaN       h1            na                                                                    tesr                                                                                                                                                                                                     genomic dna  27                                  efo_0003042     male    NaN                      NaN                      NaN          NaN
            SRP000941       SRX006239            SRS004213        SRR019083     Bisulfite-Seq    #2            thomson laboratory               cell line        NaN          NaN               na                                                                                                                                         embryonic stem cell                                                                  none                                                             NaN       NaN                             NaN                                                                                                 NaN      NaN       h1            na                                                                    tesr                                                                                                                                                                                                     genomic dna  27                                  efo_0003042     male    NaN                      NaN                      NaN          NaN
            SRP000941       SRX006239            SRS004213        SRR019084     Bisulfite-Seq    #2            thomson laboratory               cell line        NaN          NaN               na                                                                                                                                         embryonic stem cell                                                                  none                                                             NaN       NaN                             NaN                                                                                                 NaN      NaN       h1            na                                                                    tesr                                                                                                                                                                                                     genomic dna  27                                  efo_0003042     male    NaN                      NaN                      NaN          NaN
        
        
        Getting detailed SRA metadata
        =============================
        
        ::
        
            $ pysradb metadata --db ./SRAmetadb.sqlite SRP075720 --detailed --expand | head
        
            study_accession experiment_accession sample_accession run_accession experiment_title                                  experiment_attribute        taxon_id library_selection library_layout library_strategy library_source  library_name  bases      spots   adapter_spec  avg_read_length developmental_stage retina_id source_name                tissue
            SRP075720       SRX1800089           SRS1467259       SRR3587529    GSM2177186: Kcng4_1Ra_A10; Mus musculus; RNA-Seq  GEO Accession: GSM2177186  10090     cDNA              SINGLE -       RNA-Seq          TRANSCRIPTOMIC  None         79101650   1582033  None         50.0             p17                 1ra       mus musculus retina__ p17  retina
            SRP075720       SRX1800090           SRS1467260       SRR3587530    GSM2177187: Kcng4_1Ra_A11; Mus musculus; RNA-Seq  GEO Accession: GSM2177187  10090     cDNA              SINGLE -       RNA-Seq          TRANSCRIPTOMIC  None         84573650   1691473  None         50.0             p17                 1ra       mus musculus retina__ p17  retina
            SRP075720       SRX1800091           SRS1467261       SRR3587531    GSM2177188: Kcng4_1Ra_A12; Mus musculus; RNA-Seq  GEO Accession: GSM2177188  10090     cDNA              SINGLE -       RNA-Seq          TRANSCRIPTOMIC  None         77835550   1556711  None         50.0             p17                 1ra       mus musculus retina__ p17  retina
            SRP075720       SRX1800092           SRS1467262       SRR3587532    GSM2177189: Kcng4_1Ra_A1; Mus musculus; RNA-Seq   GEO Accession: GSM2177189  10090     cDNA              SINGLE -       RNA-Seq          TRANSCRIPTOMIC  None         73905150   1478103  None         50.0             p17                 1ra       mus musculus retina__ p17  retina
            SRP075720       SRX1800093           SRS1467263       SRR3587533    GSM2177190: Kcng4_1Ra_A2; Mus musculus; RNA-Seq   GEO Accession: GSM2177190  10090     cDNA              SINGLE -       RNA-Seq          TRANSCRIPTOMIC  None         77193150   1543863  None         50.0             p17                 1ra       mus musculus retina__ p17  retina
            SRP075720       SRX1800094           SRS1467264       SRR3587534    GSM2177191: Kcng4_1Ra_A3; Mus musculus; RNA-Seq   GEO Accession: GSM2177191  10090     cDNA              SINGLE -       RNA-Seq          TRANSCRIPTOMIC  None         59205550   1184111  None         50.0             p17                 1ra       mus musculus retina__ p17  retina
            SRP075720       SRX1800095           SRS1467265       SRR3587535    GSM2177192: Kcng4_1Ra_A4; Mus musculus; RNA-Seq   GEO Accession: GSM2177192  10090     cDNA              SINGLE -       RNA-Seq          TRANSCRIPTOMIC  None         61794700   1235894  None         50.0             p17                 1ra       mus musculus retina__ p17  retina
            SRP075720       SRX1800096           SRS1467266       SRR3587536    GSM2177193: Kcng4_1Ra_A5; Mus musculus; RNA-Seq   GEO Accession: GSM2177193  10090     cDNA              SINGLE -       RNA-Seq          TRANSCRIPTOMIC  None         78437650   1568753  None         50.0             p17                 1ra       mus musculus retina__ p17  retina
            SRP075720       SRX1800097           SRS1467267       SRR3587537    GSM2177194: Kcng4_1Ra_A6; Mus musculus; RNA-Seq   GEO Accession: GSM2177194  10090     cDNA              SINGLE -       RNA-Seq          TRANSCRIPTOMIC  None         77392700   1547854  None         50.0             p17                 1ra       mus musculus retina__ p17  retina
        
        
        Converting SRP to GSE
        =====================
        
        ::
        
            $ pysradb srp-to-gse --db ./SRAmetadb.sqlite SRP075720
        
            study_accession study_alias
            SRP075720       GSE81903
        
        
        Converting GSM to SRP
        =====================
        
        ::
        
            $ pysradb gsm-to-srp --db ./SRAmetadb.sqlite GSM2177186
        
            experiment_alias study_accession
            GSM2177186       SRP075720
        
        
        Converting GSM to GSE
        =====================
        
        ::
        
            $ pysradb gsm-to-gse --db ./SRAmetadb.sqlite GSM2177186
        
            experiment_alias study_alias
            GSM2177186       GSE81903
        
        
        Converting GSM to SRX
        =====================
        
        ::
        
            $ pysradb gsm-to-srx --db ./SRAmetadb.sqlite GSM2177186
        
            experiment_alias experiment_accession
            GSM2177186       SRX1800089
        
        
        Converting GSM to SRR
        =====================
        
        ::
        
            $ pysradb gsm-to-srr --db ./SRAmetadb.sqlite GSM2177186
        
            experiment_alias run_accession
            GSM2177186       SRR3587529
        
        
        Complete Metadata for any record
        ================================
        
        Use the ``--detailed`` flag:
        
        ::
        
            $ pysradb gsm-to-srr --db ./SRAmetadb.sqlite GSM2177186 --detailed --desc --expand
        
            experiment_alias run_accession experiment_accession sample_accession study_accession run_alias      sample_alias study_alias developmental_stage retina_id source_name                tissue
            GSM2177186       SRR3587529    SRX1800089           SRS1467259       SRP075720       GSM2177186_r1  GSM2177186   GSE81903    p17                 1ra       mus musculus retina__ p17  retina
        
        
        Getting only the assay type
        ===========================
        
        ::
        
            $ pysradb metadata SRP000941 --db ./SRAmetadb.sqlite --assay  | tr -s '  ' | cut -f5 -d ' ' | sort | uniq -c
        
            999 Bisulfite-Seq
            768 ChIP-Seq
              1 library_strategy
            121 OTHER
            353 RNA-Seq
             28 WGS
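        
        Note that the pipeline above also counts the header row (the ``1 library_strategy`` line).
        The same tally can be sketched in Python by locating the column from the header instead of hard-coding its position.
        ``count_column`` is a hypothetical helper, not part of ``pysradb``; it assumes the table is whitespace-separated and that no value up to and including the target column contains spaces.
        
        .. code-block:: python
        
           from collections import Counter
        
        
           def count_column(lines, column="library_strategy"):
               """Count the values of one column in whitespace-separated table text."""
               # Split every non-empty line into fields.
               rows = [line.split() for line in lines if line.strip()]
               if not rows:
                   return Counter()
               header, body = rows[0], rows[1:]
               # Raises ValueError if the requested column is not in the header.
               index = header.index(column)
               # The header row itself is excluded from the counts.
               return Counter(row[index] for row in body if len(row) > index)
        
        Passing ``sys.stdin`` as ``lines`` would let such a helper sit at the end of a pipe in place of the ``tr``/``cut``/``sort``/``uniq`` chain.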
        
        
        Downloading entire project
        ==========================
        
        ``pysradb`` makes it super easy to download datasets from SRA.
        
        ::
        
            $ pysradb download --db ./SRAmetadb.sqlite --out-dir ./pysradb_downloads -p SRP063852
        
        Downloads are organized by ``SRP/SRX/SRR``, mimicking the hierarchy of SRA projects.
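        
        To inspect what has been fetched so far, the first two levels of that layout can be walked with ``pathlib``.
        This is a minimal sketch under the assumption that each project (SRP) is a directory directly under the output directory and each experiment (SRX) is a subdirectory of it; ``experiments_by_project`` is a hypothetical helper, not part of ``pysradb``.
        
        .. code-block:: python
        
           from pathlib import Path
        
        
           def experiments_by_project(out_dir):
               """Map each SRP directory under out_dir to its SRX subdirectory names."""
               layout = {}
               for srp_dir in sorted(Path(out_dir).iterdir()):
                   # Skip stray files at the top level; only directories are projects.
                   if srp_dir.is_dir():
                       layout[srp_dir.name] = sorted(
                           child.name for child in srp_dir.iterdir() if child.is_dir()
                       )
               return layout
        
        For example, ``experiments_by_project("./pysradb_downloads")`` returns a dictionary keyed by project accession.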
        
        
        Downloading only certain samples of interest
        ============================================
        
        ::
        
            $ pysradb metadata SRP000941 --assay | grep 'study\|RNA-Seq' | pysradb download
        
        This will download all ``RNA-seq`` samples coming from this project using ``aspera-client``, if available.
        Alternatively, it can also use ``wget``.
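        
        The ``grep 'study\|RNA-Seq'`` step keeps the header row (which contains ``study``) together with the matching rows, so that the downstream ``download`` command still sees the column names.
        The same filter can be sketched in Python; ``keep_header_and_matches`` is a hypothetical helper, not part of ``pysradb``, and it assumes the first line of the input is the header.
        
        .. code-block:: python
        
           def keep_header_and_matches(lines, pattern="RNA-Seq"):
               """Yield the header line plus every later line containing pattern."""
               for number, line in enumerate(lines):
                   # Line 0 is the header; keep it unconditionally.
                   if number == 0 or pattern in line:
                       yield line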
        
        **************
        Demo Notebooks
        **************
        
        These notebooks document all the possible features of `pysradb`:
        
        1. `Python API usage <https://nbviewer.jupyter.org/github/saketkc/pysradb/blob/master/notebooks/01.SRAdb-demo.ipynb>`_
        2. `Command line usage <https://nbviewer.jupyter.org/github/saketkc/pysradb/blob/master/notebooks/03.CommandLine-demo.ipynb>`_
        
        
        
        ********
        Citation
        ********
        
        Zenodo archive: https://zenodo.org/badge/latestdoi/159590788
        
        DOI: 10.5281/zenodo.2306881
        
        A lot of functionality in ``pysradb`` is based on ideas from the original `SRAdb package <https://bioconductor.org/packages/release/bioc/html/SRAdb.html>`_. Please cite the original SRAdb publication:
        
            Zhu, Yuelin, Robert M. Stephens, Paul S. Meltzer, and Sean R. Davis. "SRAdb: query and use public next-generation sequencing data from within R." BMC bioinformatics 14, no. 1 (2013): 19.
        
        
        * Free software: BSD license
        * Documentation: https://saketkc.github.io/pysradb
        
        
        #######
        History
        #######
        
        *******************
        0.8.0 (02-26-2019)
        *******************
        
        New methods/functionality
        =========================
        * `srr-to-gsm`: convert SRR to GSM
        * SRAmetadb.sqlite.gz file is deleted by default after extraction
        * When SRAmetadb is not found, confirmation is sought before downloading
        * Confirmation option before SRA downloads
        
        Bugfix
        ======
        * download() works with wget
        
        Others
        ======
        
        * `--out_dir` is now `--out-dir`
        
        
        *******************
        0.7.1 (02-18-2019)
        *******************
        
        Important: Python2 is no longer supported.
        Please consider moving to Python3.
        
        Bugfix
        ======
        
        * Included docs in the index which were missed
          out in the previous release
        
        
        *******************
        0.7.0 (02-08-2019)
        *******************
        
        New methods/functionality
        =========================
        * `gsm-to-srr`: convert GSM to SRR
        * `gsm-to-srx`: convert GSM to SRX
        * `gsm-to-gse`: convert GSM to GSE
        
        
        Renamed methods
        ===============
        
        The following command line options have been renamed
        and the changes are not compatible with 0.6.0
        release:
        
        * `sra-metadata` -> `metadata`.
        * `sra-search` -> `search`.
        * `srametadb` -> `metadb`.
        
        
        
        *******************
        0.6.0 (12-25-2018)
        *******************
        
        Bugfix
        ======
        
        * Fixed bugs introduced in 0.5.0 with API changes where
          multiple redundant columns were output in `sra-metadata`
        
        
        New methods/functionality
        =========================
        * `download` now allows piped inputs
        
        
        
        
        *******************
        0.5.0 (12-24-2018)
        *******************
        
        New methods/functionality
        =========================
        * Support for filtering by SRX Id for SRA downloads.
        * `srr_to_srx`: Convert SRR to SRX/SRP
        * `srp_to_srx`: Convert SRP to SRX
        * Stripped down `sra-metadata` to give minimal information
        * Added `--assay`, `--desc`, `--detailed` flag for `sra-metadata`
        * Improved table printing on terminal
        
        
        *******************
        0.4.2 (12-16-2018)
        *******************
        
        Bugfix
        ======
        
        * Fixed unicode error in tests for Python2
        
        
        *******************
        0.4.0 (12-12-2018)
        *******************
        
        New methods/functionality
        =========================
        
        * Added a new `BASEdb` class to handle common database connections
        * Initial support for GEOmetadb through GEOdb class
        * Initial support for a command line interface:
        
          - download      Download SRA project (SRPnnnn)
          - gse-metadata  Fetch metadata for GEO ID (GSEnnnn)
          - gse-to-gsm    Get GSM(s) for GSE
          - gsm-metadata  Fetch metadata for GSM ID (GSMnnnn)
          - sra-metadata  Fetch metadata for SRA project (SRPnnnn)
        
        * Added three separate notebooks for SRAdb, GEOdb, CLI usage
        
        *******************
        0.3.0 (12-05-2018)
        *******************
        
        New methods/functionality
        =========================
        
        * `sample_attribute` and `experiment_attribute` are now included by default in the df returned by `sra_metadata()`
        * `expand_sample_attribute_columns`: expand metadata dataframe based on attributes in `sample_attribute` column
        *  New methods to guess cell/tissue/strain: `guess_cell_type()`/`guess_tissue_type()`/`guess_strain_type()`
        *  Improved README and usage instructions
        
        
        *******************
        0.2.2 (12-03-2018)
        *******************
        
        New methods/functionality
        =========================
        
        * `search_sra()` allows full text search on SRA metadata.
        
        
        *******************
        0.2.0 (12-03-2018)
        *******************
        
        Renamed methods
        ===============
        
        The following methods have been renamed
        and the changes are not compatible with 0.1.0
        release:
        
        * `get_query()` -> `query()`.
        * `sra_convert()` -> `sra_metadata()`.
        * `get_table_counts()` -> `all_row_counts()`.
        
        
        New methods/functionality
        =========================
        
        * `download_sradb_file()` makes fetching `SRAmetadb.sqlite` file easy; wget is no longer
          required.
        * `ftp` protocol is now supported besides `fasp` and hence `aspera-client` is now optional.
          We, however, strongly recommend `aspera-client` for faster downloads.
        
        Bug fixes
        =========
        * Silenced `SettingWithCopyWarning` by explicitly doing operations on a copy of
          the dataframe instead of the original.
        
        Besides these, all methods now follow a `numpydoc` compatible documentation.
        
        
        ******************
        0.1.0 (12-01-2018)
        ******************
        
        * First release on PyPI.
        
Keywords: pysradb
Platform: UNKNOWN
Classifier: Development Status :: 4 - Beta
Classifier: Intended Audience :: Science/Research
Classifier: Intended Audience :: Healthcare Industry
Classifier: License :: OSI Approved :: BSD License
Classifier: Natural Language :: English
Classifier: Programming Language :: Python :: 2
Classifier: Programming Language :: Python :: 2.7
Classifier: Programming Language :: Python :: 3
Classifier: Programming Language :: Python :: 3.5
Classifier: Programming Language :: Python :: 3.6
Classifier: Programming Language :: Python :: 3.7
Classifier: Topic :: Scientific/Engineering :: Bio-Informatics
Requires-Python: >=3
