Metadata-Version: 1.2
Name: pysradb
Version: 0.7.1
Summary: Python package for interacting with SRAdb and downloading datasets from SRA
Home-page: https://github.com/saketkc/pysradb
Author: Saket Choudhary
Author-email: saketkc@gmail.com
License: BSD license
Description: #######
        pysradb
        #######
        
        .. image:: https://zenodo.org/badge/159590788.svg
           :target: https://zenodo.org/badge/latestdoi/159590788
        
        .. image:: https://img.shields.io/pypi/v/pysradb.svg?style=flat-square
                :target: https://pypi.python.org/pypi/pysradb
        
        .. image:: https://img.shields.io/travis-ci/saketkc/pysradb.svg?style=flat-square
                :target: https://travis-ci.com/saketkc/pysradb
        
        .. image:: https://img.shields.io/badge/install%20with-bioconda-brightgreen.svg?style=flat-square
                :target: http://bioconda.github.io/recipes/pysradb/README.html
        
        .. image:: https://codecov.io/gh/saketkc/pysradb/branch/master/graph/badge.svg?style=flat-square
                :target: https://codecov.io/gh/saketkc/pysradb
        
        Python package for interacting with SRAdb and downloading datasets from SRA.
        (python3 only!)
        
        .. raw:: html
        
            <a href="https://asciinema.org/a/0C3SjYmPTkkemldprUpdVhiKx?speed=5&autoplay=1" target="_blank"><img src="https://asciinema.org/a/0C3SjYmPTkkemldprUpdVhiKx.svg" /></a>
        
        
        *********
        CLI Usage
        *********
        
        `pysradb` supports command line usage (as seen in the terminal demo above). The documentation
        is in progress. See `cmdline <https://github.com/saketkc/pysradb/blob/master/docs/cmdline.rst>`_ for
        some quick usage instructions. See `quickstart <https://www.saket-choudhary.me/pysradb/quickstart.html#the-full-list-of-possible-pysradb-operations>`_ for
        a list of instructions for each sub-command.
        
        
        ::
        
           $ pysradb
        
            Usage: pysradb [OPTIONS] COMMAND [ARGS]...
        
              pysradb: Query NGS metadata and data from NCBI Sequence Read Archive.
        
              Citation: Pending.
        
            Options:
              --version   Show the version and exit.
              -h, --help  Show this message and exit.
        
            Commands:
              download    Download SRA project (SRPnnnn)
              gse-to-gsm  Get GSM for a GSE
              gse-to-srp  Get SRP for a GSE
              gsm-to-gse  Get GSE for a GSM
              gsm-to-srp  Get SRP for a GSM
              gsm-to-srr  Get SRR for a GSM
              gsm-to-srx  Get SRX for a GSM
              metadata    Fetch metadata for SRA project (SRPnnnn)
              metadb      Download SRAmetadb.sqlite
              search      Search SRA for matching text
              srp-to-gse  Get GSE for a SRP
              srp-to-srr  Get SRR for a SRP
              srp-to-srs  Get SRS for a SRP
              srp-to-srx  Get SRX for a SRP
              srr-to-srp  Get SRP for a SRR
              srr-to-srs  Get SRS for a SRR
              srr-to-srx  Get SRX for a SRR
              srs-to-srx  Get SRX for a SRS
              srx-to-srp  Get SRP for a SRX
              srx-to-srr  Get SRR for a SRX
              srx-to-srs  Get SRS for a SRX
        
        **************
        Demo Notebooks
        **************
        
        These notebooks document all the possible features of `pysradb`:
        
        1. `SRAmetadb operations <https://nbviewer.jupyter.org/github/saketkc/pysradb/blob/master/notebooks/01.SRAdb-demo.ipynb>`_
        2. `GEOmetadb operations <https://nbviewer.jupyter.org/github/saketkc/pysradb/blob/master/notebooks/02.GEOmetadb-demo.ipynb>`_
        3. `Command line usage <https://nbviewer.jupyter.org/github/saketkc/pysradb/blob/master/notebooks/03.CommandLine-demo.ipynb>`_
        
        
        ************
        Installation
        ************
        
        
        To install the stable version using `pip`:
        
        .. code-block:: bash
        
           pip install pysradb
        
        Alternatively, if you use conda:
        
        .. code-block:: bash
        
           conda install -c bioconda pysradb
        
        This step will install all the dependencies except aspera-client_ (which is not required, but highly recommended).
        Only Python 3 is supported; Python 2 support was dropped in 0.7.1.
        
        
        Dependencies
        ============
        
        .. code-block:: bash
        
           pandas>=0.23.4
           tqdm>=4.28
           click>=7.0
           aspera-client
           SRAmetadb.sqlite
        
        Downloading SRAmetadb
        =====================
        
        We need a SQLite database file that has preprocessed metadata made available by the
        `SRAdb <https://bmcbioinformatics.biomedcentral.com/articles/10.1186/1471-2105-14-19>`_ project.
        
        SRAmetadb can be downloaded using:
        
        .. code-block:: bash
        
           wget -c https://starbuck1.s3.amazonaws.com/sradb/SRAmetadb.sqlite.gz && gunzip SRAmetadb.sqlite.gz
        
        Alternatively, you can also download it using `pysradb`:
        
        
        .. code-block:: python
        
           from pysradb import download_sradb_file
           download_sradb_file()
        
           # Example progress output:
           # SRAmetadb.sqlite.gz: 2.44GB [01:10, 36.9MB/s]
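        
        As a quick sanity check after the download, the standard-library `sqlite3` module
        can list the tables in the decompressed file. This is only a minimal sketch and not part
        of `pysradb`: `list_tables` is a hypothetical helper, and the path assumes that
        `SRAmetadb.sqlite` sits in the current working directory.
        
        .. code-block:: python
        
           import os
           import sqlite3
        
        
           def list_tables(sqlite_path):
               """Return the sorted names of all tables in a SQLite database file."""
               conn = sqlite3.connect(sqlite_path)
               try:
                   rows = conn.execute(
                       "SELECT name FROM sqlite_master WHERE type = 'table' ORDER BY name"
                   ).fetchall()
               finally:
                   conn.close()
               return [name for (name,) in rows]
        
        
           # Only inspect the file if it has already been downloaded and decompressed
           # (sqlite3.connect would otherwise create an empty file at that path).
           if os.path.exists("SRAmetadb.sqlite"):
               print(list_tables("SRAmetadb.sqlite"))
        
        An empty list, or an error raised by `sqlite3`, would suggest that the download
        or the decompression step did not complete.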
        
        
        .. _aspera-client:
        
        
        aspera-client
        =============
        
        We strongly recommend using `aspera-client` (which uses UDP) since it `generally offers faster downloads <http://www.skullbox.net/tcpudp.php>`_ than `ftp/http` based downloads.
        
        PDF instructions are available on IBM's `website <https://downloads.asperasoft.com/connect2/>`_.
        
        Direct download links:
        
        - Linux: https://download.asperasoft.com/download/sw/connect/3.8.1/ibm-aspera-connect-3.8.1.161274-linux-g2.12-64.tar.gz
        - MacOS: https://download.asperasoft.com/download/sw/connect/3.8.1/IBMAsperaConnectInstaller-3.8.1.161274.dmg
        - Windows: https://download.asperasoft.com/download/sw/connect/3.8.1/IBMAsperaConnect-ML-3.8.1.161274.msi
        
        Once you have downloaded the archive for your OS (Linux in this example), follow these steps to install aspera:
        
        .. code-block:: bash
        
           tar -zxvf ibm-aspera-connect-3.8.1.161274-linux-g2.12-64.tar.gz
           bash ibm-aspera-connect-3.8.1.161274-linux-g2.12-64.sh
           # Expected installer output:
           # Installing IBM Aspera Connect
           # Deploying IBM Aspera Connect (/home/saket/.aspera/connect) for the current user only.
           # Install complete.
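        
        To confirm that the client is usable after installation, you can look for the `ascp`
        binary from Python. The sketch below is not part of `pysradb`: `find_ascp` is a
        hypothetical helper, and the `bin/ascp` location inside the Connect directory is an
        assumption based on the per-user install path printed above.
        
        .. code-block:: python
        
           import os
           import shutil
        
        
           def find_ascp(connect_dir="~/.aspera/connect"):
               """Return the path to the ascp binary, or None if it cannot be found."""
               # Prefer whatever is already on PATH.
               on_path = shutil.which("ascp")
               if on_path:
                   return on_path
               # Assumed location inside the per-user Aspera Connect directory.
               candidate = os.path.join(os.path.expanduser(connect_dir), "bin", "ascp")
               if os.path.isfile(candidate) and os.access(candidate, os.X_OK):
                   return candidate
               return None
        
        
           print(find_ascp() or "ascp not found")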
        
        
        Installing pysradb in development mode
        ======================================
        
        .. code-block:: bash
        
           pip install -U pandas tqdm
           git clone https://github.com/saketkc/pysradb.git
           cd pysradb
           pip install -e .
        
        
        
        
        ********************
        Using the Python API
        ********************
        
        Use Case 1: Fetch the metadata table (SRA-runtable)
        ===================================================
        
        The simplest use case of `pysradb` is when you know the SRA project ID (SRP) a priori
        and simply want to fetch the metadata associated with it. This is generally
        reflected in the `SraRunTable.txt` that you get from NCBI's website.
        See an `example <https://www.ncbi.nlm.nih.gov/Traces/study/?acc=SRP098789>`_ of a SraRunTable.
        
        
        .. code-block:: python
        
           from pysradb import SRAdb
           db = SRAdb('SRAmetadb.sqlite')
           df = db.sra_metadata('SRP098789')
           df.head()
        
        .. table::
        
            ===============  ====================  ======================================================================  =============  ========  =================  ==============  ================  ==============  ============  ==========  ========  ============  ===============
            study_accession  experiment_accession                             experiment_title                             run_accession  taxon_id  library_selection  library_layout  library_strategy  library_source  library_name    bases      spots    adapter_spec  avg_read_length
            ===============  ====================  ======================================================================  =============  ========  =================  ==============  ================  ==============  ============  ==========  ========  ============  ===============
            SRP098789        SRX2536403            GSM2475997: 1.5 µM PF-067446846, 10 min, rep 1; Homo sapiens; OTHER     SRR5227288         9606  other              SINGLE -        OTHER             TRANSCRIPTOMIC                2104142750  42082855                             50
            SRP098789        SRX2536404            GSM2475998: 1.5 µM PF-067446846, 10 min, rep 2; Homo sapiens; OTHER     SRR5227289         9606  other              SINGLE -        OTHER             TRANSCRIPTOMIC                2082873050  41657461                             50
            SRP098789        SRX2536405            GSM2475999: 1.5 µM PF-067446846, 10 min, rep 3; Homo sapiens; OTHER     SRR5227290         9606  other              SINGLE -        OTHER             TRANSCRIPTOMIC                2023148650  40462973                             50
            SRP098789        SRX2536406            GSM2476000: 0.3 µM PF-067446846, 10 min, rep 1; Homo sapiens; OTHER     SRR5227291         9606  other              SINGLE -        OTHER             TRANSCRIPTOMIC                2057165950  41143319                             50
            SRP098789        SRX2536407            GSM2476001: 0.3 µM PF-067446846, 10 min, rep 2; Homo sapiens; OTHER     SRR5227292         9606  other              SINGLE -        OTHER             TRANSCRIPTOMIC                3027621850  60552437                             50
            ===============  ====================  ======================================================================  =============  ========  =================  ==============  ================  ==============  ============  ==========  ========  ============  ===============
        
        The metadata is returned as a `pandas` dataframe and hence allows you to perform
        all regular select/query operations available through `pandas`.
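        
        For instance, rows can be selected with a boolean mask or with `DataFrame.query`.
        The sketch below builds a small dataframe by hand from a few of the rows shown above
        (it does not query SRAmetadb), and the 42 million spot threshold is an arbitrary value
        chosen for illustration.
        
        .. code-block:: python
        
           import pandas as pd
        
           # Hand-copied subset of the rows shown above (not fetched from SRAmetadb).
           df = pd.DataFrame(
               {
                   "experiment_accession": ["SRX2536403", "SRX2536404", "SRX2536407"],
                   "run_accession": ["SRR5227288", "SRR5227289", "SRR5227292"],
                   "spots": [42082855, 41657461, 60552437],
                   "avg_read_length": [50, 50, 50],
               }
           )
        
           # Boolean-mask selection: runs with more than 42 million spots.
           deep_runs = df[df["spots"] > 42_000_000]
        
           # The same filter expressed with DataFrame.query.
           deep_runs_query = df.query("spots > 42000000")
        
           print(deep_runs["run_accession"].tolist())
           # ['SRR5227288', 'SRR5227292']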
        
        
        
        Use Case 2: Downloading an entire project arranged experiment wise
        ==================================================================
        
        Once you have fetched the metadata and made sure this is the project
        you were looking for, you will want to download everything at once.
        NCBI follows this hierarchy: `SRP => SRX => SRR`. Each `SRP` (project) has one or more
        `SRX` (experiments) and each `SRX` in turn has one or more `SRR` (runs) inside it.
        We want to mimic this hierarchy in our downloads. The reason is simple:
        in most cases you care about `SRX` the most, and would want to "merge" your SRRs
        in one way or the other. Having this hierarchy ensures your downstream code
        can handle such cases easily, without worrying about which runs (SRR) need to be merged.
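        
        The grouping idea can be sketched with plain Python. The accessions below are
        placeholders invented for illustration, both helper functions are hypothetical, and the
        `out_dir/SRP/SRX/SRR.sra` layout is an assumption about how such a hierarchy might
        look on disk rather than a description of what `pysradb` writes.
        
        .. code-block:: python
        
           from collections import defaultdict
           from pathlib import PurePosixPath
        
           # Placeholder (study, experiment, run) accessions, for illustration only.
           records = [
               ("SRP000001", "SRX000001", "SRR000001"),
               ("SRP000001", "SRX000001", "SRR000002"),
               ("SRP000001", "SRX000002", "SRR000003"),
           ]
        
        
           def group_runs_by_experiment(records):
               """Collect the runs (SRR) that belong to each (SRP, SRX) pair."""
               grouped = defaultdict(list)
               for srp, srx, srr in records:
                   grouped[(srp, srx)].append(srr)
               return dict(grouped)
        
        
           def planned_paths(records, out_dir="pysradb_downloads"):
               """Map each run to out_dir/SRP/SRX/SRR.sra (assumed layout)."""
               return [
                   str(PurePosixPath(out_dir, srp, srx, srr + ".sra"))
                   for srp, srx, srr in records
               ]
        
        
           # Runs sharing an SRX end up in one list, ready to be merged downstream.
           for (srp, srx), runs in group_runs_by_experiment(records).items():
               print(srp, srx, runs)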
        
        We strongly recommend installing `aspera-client` which uses UDP and is `designed to be faster <http://www.skullbox.net/tcpudp.php>`_.
        
        .. code-block:: python
        
           from pysradb import SRAdb
           db = SRAdb('SRAmetadb.sqlite')
           df = db.sra_metadata('SRP017942')
           db.download(df)
        
        The default download location is `pysradb_downloads/` created inside your current working directory.
        You can specify a location by:
        
        .. code-block:: python
        
           db.download(df=df, out_dir='/pysradb_downloads')
        
        
        
        Use Case 3: Downloading a subset of experiments
        ===============================================
        
        Often, you need to process only a smaller set of samples from a project (SRP).
        Consider this project, which has data spanning several assays.
        
        .. code-block:: python
        
           df = db.sra_metadata('SRP000941')
           print(df.library_strategy.unique())
           # ['ChIP-Seq' 'Bisulfite-Seq' 'RNA-Seq' 'WGS' 'OTHER']
        
        
        However, you might be interested only in the `RNA-Seq` samples and want to download just that subset.
        This is simple using `pysradb`, since the metadata can be subset just as you would subset a dataframe in
        pandas.
        
        
        .. code-block:: python
        
           df_rna = df[df.library_strategy == 'RNA-Seq']
           db.download(df=df_rna, out_dir='/pysradb_downloads')
        
        
        Use Case 4: Getting cell-type/treatment information from sample_attributes
        ==========================================================================
        
        Cell type/tissue information is usually hidden in the `sample_attribute` column,
        which can be expanded:
        
        .. code-block:: python
        
           from pysradb.filter_attrs import expand_sample_attribute_columns
           df = db.sra_metadata('SRP017942')
           expand_sample_attribute_columns(df).head()
        
        
        .. table::
        
            ===============  ====================  =====================================================================  =========================  ========================================================================================================================================================  =============  ========  =================  ==============  ================  ==============  ============  ==========  =========  ============  ===============  ==========  ==========  ===========  ================  ===============================
            study_accession  experiment_accession                            experiment_title                               experiment_attribute                                                                         sample_attribute                                                                      run_accession  taxon_id  library_selection  library_layout  library_strategy  library_source  library_name    bases       spots    adapter_spec  avg_read_length  assay_type  cell_line   source_name  transfected_with             treatment
            ===============  ====================  =====================================================================  =========================  ========================================================================================================================================================  =============  ========  =================  ==============  ================  ==============  ============  ==========  =========  ============  ===============  ==========  ==========  ===========  ================  ===============================
            SRP017942        SRX217028             GSM1063575: 293T_GFP; Homo sapiens; RNA-Seq                            GEO Accession: GSM1063575  source_name: 293T cells || cell line: 293T cells || transfected with: 3XFLAG-GFP || assay type: Riboseq                                                   SRR648667          9606  other              SINGLE -        RNA-Seq           TRANSCRIPTOMIC                1806641316   50184481                             36  riboseq     293t cells  293t cells   3xflag-gfp        NaN
            SRP017942        SRX217029             GSM1063576: 293T_GFP_2hrs_severe_Heat_Shock; Homo sapiens; RNA-Seq     GEO Accession: GSM1063576  source_name: 293T cells || cell line: 293T cells || transfected with: 3XFLAG-GFP || treatment: severe heat shock (44C 2 hours) || assay type: Riboseq     SRR648668          9606  other              SINGLE -        RNA-Seq           TRANSCRIPTOMIC                3436984836   95471801                             36  riboseq     293t cells  293t cells   3xflag-gfp        severe heat shock (44c 2 hours)
            SRP017942        SRX217030             GSM1063577: 293T_Hspa1a; Homo sapiens; RNA-Seq                         GEO Accession: GSM1063577  source_name: 293T cells || cell line: 293T cells || transfected with: 3XFLAG-Hspa1a || assay type: Riboseq                                                SRR648669          9606  other              SINGLE -        RNA-Seq           TRANSCRIPTOMIC                3330909216   92525256                             36  riboseq     293t cells  293t cells   3xflag-hspa1a     NaN
            SRP017942        SRX217031             GSM1063578: 293T_Hspa1a_2hrs_severe_Heat_Shock; Homo sapiens; RNA-Seq  GEO Accession: GSM1063578  source_name: 293T cells || cell line: 293T cells || transfected with: 3XFLAG-Hspa1a || treatment: severe heat shock (44C 2 hours) || assay type: Riboseq  SRR648670          9606  other              SINGLE -        RNA-Seq           TRANSCRIPTOMIC                3622123512  100614542                             36  riboseq     293t cells  293t cells   3xflag-hspa1a     severe heat shock (44c 2 hours)
            SRP017942        SRX217956             GSM794854: 3T3-Control-Riboseq; Mus musculus; RNA-Seq                  GEO Accession: GSM794854   source_name: 3T3 cells || treatment: control || cell line: 3T3 cells || assay type: Riboseq                                                               SRR649752         10090  cDNA               SINGLE -        RNA-Seq           TRANSCRIPTOMIC                 594945396   16526261                             36  riboseq     3t3 cells   3t3 cells    NaN               control
            ===============  ====================  =====================================================================  =========================  ========================================================================================================================================================  =============  ========  =================  ==============  ================  ==============  ============  ==========  =========  ============  ===============  ==========  ==========  ===========  ================  ===============================
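        
        To make the expansion concrete, the sketch below splits one `sample_attribute` string
        from the table above into key/value pairs. It is a simplified illustration of the idea,
        not the implementation used by `expand_sample_attribute_columns`; the helper name
        `parse_sample_attribute` is hypothetical.
        
        .. code-block:: python
        
           def parse_sample_attribute(attribute):
               """Split 'key: value || key: value' text into a dict of lowercase values."""
               parsed = {}
               for field in attribute.split("||"):
                   key, sep, value = field.partition(":")
                   if not sep:
                       continue  # skip fragments without a 'key: value' shape
                   # Normalize keys to column-like names, e.g. 'cell line' -> 'cell_line'.
                   parsed[key.strip().lower().replace(" ", "_")] = value.strip().lower()
               return parsed
        
        
           example = (
               "source_name: 293T cells || cell line: 293T cells || "
               "transfected with: 3XFLAG-GFP || assay type: Riboseq"
           )
           print(parse_sample_attribute(example))
           # {'source_name': '293t cells', 'cell_line': '293t cells',
           #  'transfected_with': '3xflag-gfp', 'assay_type': 'riboseq'}
        
        Attributes that are missing for a given sample (such as `treatment` in this row)
        simply do not appear in the dictionary, which corresponds to the `NaN` entries in the table.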
        
        
        Use Case 5: Searching for datasets
        ==================================
        
        Another common operation on SRA is plain-text search.
        
        
        To look up all projects where `ribosome profiling` appears somewhere
        in the description:
        
        .. code-block:: python
        
        
           df = db.search_sra(search_str='"ribosome profiling"')
           df.head()
        
        .. table::
        
            ===============  ====================  =======================================================  =============  ========  =================  ==============  ================  ==============  ============  ==========  ========
            study_accession  experiment_accession                     experiment_title                      run_accession  taxon_id  library_selection  library_layout  library_strategy  library_source  library_name    bases      spots
            ===============  ====================  =======================================================  =============  ========  =================  ==============  ================  ==============  ============  ==========  ========
            DRP003075        DRX019536             Illumina Genome Analyzer IIx sequencing of SAMD00018584  DRR021383         83333  other              SINGLE -        OTHER             TRANSCRIPTOMIC  GAII05_3       978776480  12234706
            DRP003075        DRX019537             Illumina Genome Analyzer IIx sequencing of SAMD00018585  DRR021384         83333  other              SINGLE -        OTHER             TRANSCRIPTOMIC  GAII05_4       894201680  11177521
            DRP003075        DRX019538             Illumina Genome Analyzer IIx sequencing of SAMD00018586  DRR021385         83333  other              SINGLE -        OTHER             TRANSCRIPTOMIC  GAII05_5       931536720  11644209
            DRP003075        DRX019540             Illumina Genome Analyzer IIx sequencing of SAMD00018588  DRR021387         83333  other              SINGLE -        OTHER             TRANSCRIPTOMIC  GAII07_4      2759398700  27593987
            DRP003075        DRX019541             Illumina Genome Analyzer IIx sequencing of SAMD00018589  DRR021388         83333  other              SINGLE -        OTHER             TRANSCRIPTOMIC  GAII07_5      2386196500  23861965
            ===============  ====================  =======================================================  =============  ========  =================  ==============  ================  ==============  ============  ==========  ========
        
        Again, the results are available as a `pandas` dataframe, so
        you can perform all subset operations after your query. Your query doesn't need
        to be exact.
        
        
        
        
        ********
        Citation
        ********
        
        Zenodo archive: https://zenodo.org/badge/latestdoi/159590788
        
        DOI: 10.5281/zenodo.2306881
        
        A lot of functionality in `pysradb` is based on ideas from the original `SRAdb package <https://bioconductor.org/packages/release/bioc/html/SRAdb.html>`_. Please cite the original SRAdb publication:
        
            Zhu, Yuelin, Robert M. Stephens, Paul S. Meltzer, and Sean R. Davis. "SRAdb: query and use public next-generation sequencing data from within R." BMC bioinformatics 14, no. 1 (2013): 19.
        
        
        
        
        
        * Free software: BSD license
        * Documentation: https://saketkc.github.io/pysradb
        
        
        
        #######
        History
        #######
        
        
        *******************
        0.7.1 (02-18-2019)
        *******************
        
        Important: Python2 is no longer supported.
        Please consider moving to Python3.
        
        Bugfix
        ======
        
        * Included docs in the index which were missed
          out in the previous release
        
        
        *******************
        0.7.0 (02-08-2019)
        *******************
        
        New methods/functionality
        =========================
        * `gsm-to-srr`: convert GSM to SRR
        * `gsm-to-srx`: convert GSM to SRX
        * `gsm-to-gse`: convert GSM to GSE
        
        
        Renamed methods
        ===============
        
        The following command line options have been renamed
        and the changes are not compatible with the 0.6.0
        release:
        
        * `sra-metadata` -> `metadata`.
        * `sra-search` -> `search`.
        * `srametadb` -> `metadb`.
        
        
        
        *******************
        0.6.0 (12-25-2018)
        *******************
        
        Bugfix
        ======
        
        * Fixed bugs introduced in 0.5.0 with API changes where
          multiple redundant columns were output in `sra-metadata`
        
        
        New methods/functionality
        =========================
        * `download` now allows piped inputs
        
        
        
        
        *******************
        0.5.0 (12-24-2018)
        *******************
        
        New methods/functionality
        =========================
        * Support for filtering by SRX Id for SRA downloads.
        * `srr_to_srx`: Convert SRR to SRX/SRP
        * `srp_to_srx`: Convert SRP to SRX
        * Stripped down `sra-metadata` to give minimal information
        * Added `--assay`, `--desc`, `--detailed` flag for `sra-metadata`
        * Improved table printing on terminal
        
        
        *******************
        0.4.2 (12-16-2018)
        *******************
        
        Bugfix
        ======
        
        * Fixed unicode error in tests for Python2
        
        
        *******************
        0.4.0 (12-12-2018)
        *******************
        
        New methods/functionality
        =========================
        
        * Added a new `BASEdb` class to handle common database connections
        * Initial support for GEOmetadb through GEOdb class
        * Initial support for a command line interface:
        
          - download      Download SRA project (SRPnnnn)
          - gse-metadata  Fetch metadata for GEO ID (GSEnnnn)
          - gse-to-gsm    Get GSM(s) for GSE
          - gsm-metadata  Fetch metadata for GSM ID (GSMnnnn)
          - sra-metadata  Fetch metadata for SRA project (SRPnnnn)
        
        * Added three separate notebooks for SRAdb, GEOdb, CLI usage
        
        *******************
        0.3.0 (12-05-2018)
        *******************
        
        New methods/functionality
        =========================
        
        * `sample_attribute` and `experiment_attribute` are now included by default in the df returned by `sra_metadata()`
        * `expand_sample_attribute_columns`: expand metadata dataframe based on attributes in `sample_attribute` column
        *  New methods to guess cell/tissue/strain: `guess_cell_type()`/`guess_tissue_type()`/`guess_strain_type()`
        *  Improved README and usage instructions
        
        
        *******************
        0.2.2 (12-03-2018)
        *******************
        
        New methods/functionality
        =========================
        
        * `search_sra()` allows full text search on SRA metadata.
        
        
        *******************
        0.2.0 (12-03-2018)
        *******************
        
        Renamed methods
        ===============
        
        The following methods have been renamed
        and the changes are not compatible with 0.1.0
        release:
        
        * `get_query()` -> `query()`.
        * `sra_convert()` -> `sra_metadata()`.
        * `get_table_counts()` -> `all_row_counts()`.
        
        
        New methods/functionality
        =========================
        
        * `download_sradb_file()` makes fetching `SRAmetadb.sqlite` file easy; wget is no longer
          required.
        * `ftp` protocol is now supported besides `fasp` and hence `aspera-client` is now optional.
          We, however, strongly recommend `aspera-client` for faster downloads.
        
        Bug fixes
        =========
        * Silenced `SettingWithCopyWarning` by explicitly doing operations on a copy of
          the dataframe instead of the original.
        
        Besides these, all methods now follow a `numpydoc` compatible documentation.
        
        
        ******************
        0.1.0 (12-01-2018)
        ******************
        
        * First release on PyPI.
        
Keywords: pysradb
Platform: UNKNOWN
Classifier: Development Status :: 4 - Beta
Classifier: Intended Audience :: Science/Research
Classifier: Intended Audience :: Healthcare Industry
Classifier: License :: OSI Approved :: BSD License
Classifier: Natural Language :: English
Classifier: Programming Language :: Python :: 2
Classifier: Programming Language :: Python :: 2.7
Classifier: Programming Language :: Python :: 3
Classifier: Programming Language :: Python :: 3.5
Classifier: Programming Language :: Python :: 3.6
Classifier: Programming Language :: Python :: 3.7
Classifier: Topic :: Scientific/Engineering :: Bio-Informatics
Requires-Python: >=3
