Metadata-Version: 2.1
Name: pytd
Version: 1.4.0
Summary: Treasure Data Driver for Python
Home-page: https://github.com/treasure-data/pytd
Author: Treasure Data
Author-email: support@treasure-data.com
Maintainer: Treasure Data
Maintainer-email: support@treasure-data.com
License: Apache License 2.0
Platform: UNKNOWN
Classifier: Development Status :: 5 - Production/Stable
Classifier: Intended Audience :: Developers
Classifier: License :: OSI Approved :: Apache Software License
Classifier: Topic :: Database
Classifier: Programming Language :: Python
Classifier: Programming Language :: Python :: 3
Classifier: Programming Language :: Python :: 3.6
Classifier: Programming Language :: Python :: 3.7
Classifier: Programming Language :: Python :: 3.8
Description-Content-Type: text/x-rst
Requires-Dist: urllib3 (<1.25,>=1.21.1)
Requires-Dist: presto-python-client (>=0.6.0)
Requires-Dist: pandas (>=0.25.0)
Requires-Dist: td-client (>=1.1.0)
Requires-Dist: pytz (>=2018.5)
Provides-Extra: doc
Requires-Dist: sphinx (>=2.2.0); extra == 'doc'
Requires-Dist: sphinx-rtd-theme; extra == 'doc'
Requires-Dist: numpydoc; extra == 'doc'
Requires-Dist: ipython; extra == 'doc'
Provides-Extra: spark
Requires-Dist: td-pyspark (>20.10.0); extra == 'spark'
Requires-Dist: pyspark (>=3.0.0); extra == 'spark'
Requires-Dist: pyarrow (>=0.12.1); extra == 'spark'
Provides-Extra: test
Requires-Dist: pytest; extra == 'test'

pytd
====

|Build status| |PyPI version| |docs status|

**pytd** provides user-friendly interfaces to Treasure Data’s `REST
APIs <https://github.com/treasure-data/td-client-python>`__, `Presto
query
engine <https://tddocs.atlassian.net/wiki/spaces/PD/pages/1083607/Presto+Query+Engine+Introduction>`__,
and `Plazma primary
storage <https://www.slideshare.net/treasure-data/td-techplazma>`__.

This seamless connection lets your Python code efficiently read and
write large volumes of data from and to Treasure Data, which makes
day-to-day data analytics work more productive.

Installation
------------

.. code:: sh

   pip install pytd

Usage
-----

-  `Documentation <https://pytd-doc.readthedocs.io/>`__
-  `Sample usage on Google
   Colaboratory <https://colab.research.google.com/drive/1ps_ChU-H2FvkeNlj1e1fcOebCt4ryN11>`__

Set the environment variables ``TD_API_KEY`` and ``TD_API_SERVER`` to
your `API
key <https://tddocs.atlassian.net/wiki/spaces/PD/pages/1081428/Getting+Your+API+Keys>`__
and
`endpoint <https://tddocs.atlassian.net/wiki/spaces/PD/pages/1085143/Sites+and+Endpoints>`__,
respectively, and create a client instance:

.. code:: py

   import pytd

   client = pytd.Client(database='sample_datasets')
   # or, hard-code your API key, endpoint, and/or query engine:
   # >>> pytd.Client(apikey='1/XXX', endpoint='https://api.treasuredata.com/', database='sample_datasets', default_engine='presto')
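
If you prefer to read the environment variables yourself and pass the
values explicitly, for example to fail early with a clear message, the
lookup could be sketched as follows. The helper name
``client_kwargs_from_env`` is hypothetical and not part of pytd; only
the ``apikey`` and ``endpoint`` keyword arguments come from the example
above, and treating the endpoint as optional is an assumption of this
sketch.

.. code:: py

   import os


   def client_kwargs_from_env(env=None):
       """Collect pytd.Client keyword arguments from environment variables."""
       env = os.environ if env is None else env
       if not env.get("TD_API_KEY"):
           raise KeyError("TD_API_KEY is not set")
       kwargs = {"apikey": env["TD_API_KEY"]}
       # Assumption: the endpoint is optional here; pass it only when set.
       if env.get("TD_API_SERVER"):
           kwargs["endpoint"] = env["TD_API_SERVER"]
       return kwargs


   # Usage (requires pytd and a valid API key):
   # client = pytd.Client(database='sample_datasets', **client_kwargs_from_env())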

Query in Treasure Data
~~~~~~~~~~~~~~~~~~~~~~

Issue a Presto query and retrieve the result:

.. code:: py

   client.query('select symbol, count(1) as cnt from nasdaq group by 1 order by 1')
   # {'columns': ['symbol', 'cnt'], 'data': [['AAIT', 590], ['AAL', 82], ['AAME', 9252], ..., ['ZUMZ', 2364]]}
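
The result is a plain dictionary with ``columns`` and ``data`` keys, so
it can be reshaped with ordinary Python. The following is a minimal
sketch that pairs each row with the column names; ``rows_as_dicts`` is a
hypothetical helper, and the sample value is copied from the first two
rows of the output shown above.

.. code:: py

   # Sample copied from the first two rows of the output shown above.
   result = {"columns": ["symbol", "cnt"], "data": [["AAIT", 590], ["AAL", 82]]}


   def rows_as_dicts(result):
       """Pair each row with the column names."""
       return [dict(zip(result["columns"], row)) for row in result["data"]]


   records = rows_as_dicts(result)
   print(records[0])  # {'symbol': 'AAIT', 'cnt': 590}

With pandas available, ``pd.DataFrame(**result)`` builds a DataFrame
from the same dictionary, because its keys match the ``data`` and
``columns`` parameters of the constructor.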

To run a query on Hive instead, pass ``engine='hive'``:

.. code:: py

   client.query('select hivemall_version()', engine='hive')
   # {'columns': ['_c0'], 'data': [['0.6.0-SNAPSHOT-201901-r01']]} (as of Feb, 2019)

It is also possible to explicitly initialize ``pytd.Client`` for Hive:

.. code:: py

   client_hive = pytd.Client(database='sample_datasets', default_engine='hive')
   client_hive.query('select hivemall_version()')

Write data to Treasure Data
~~~~~~~~~~~~~~~~~~~~~~~~~~~

Data represented as ``pandas.DataFrame`` can be written to Treasure Data
as follows:

.. code:: py

   import pandas as pd

   df = pd.DataFrame(data={'col1': [1, 2], 'col2': [3, 10]})
   client.load_table_from_dataframe(df, 'takuti.foo', writer='bulk_import', if_exists='overwrite')

For the ``writer`` option, pytd supports three ways to ingest data into
Treasure Data:

1. **Bulk Import API**: ``bulk_import`` (default)

   -  Convert the data into a CSV file and upload it in batches.

2. **Presto INSERT INTO query**: ``insert_into``

   -  Insert every row of the ``DataFrame`` by issuing an INSERT INTO
      query through the Presto query engine.
   -  Recommended only for a small volume of data.

3. `td-spark <https://treasure-data.github.io/td-spark/>`__:
   ``spark``

   -  A local, customized Spark instance writes the ``DataFrame``
      directly to Treasure Data’s primary storage system.

Characteristics of each of these methods can be summarized as follows:

+-----------------------------------+------------------+------------------+-----------+
|                                   | ``bulk_import``  | ``insert_into``  | ``spark`` |
+===================================+==================+==================+===========+
| Scalable against data volume      |        ✓         |                  |     ✓     |
+-----------------------------------+------------------+------------------+-----------+
| Write performance for larger data |                  |                  |     ✓     |
+-----------------------------------+------------------+------------------+-----------+
| Memory efficient                  |        ✓         |        ✓         |           |
+-----------------------------------+------------------+------------------+-----------+
| Disk efficient                    |                  |        ✓         |           |
+-----------------------------------+------------------+------------------+-----------+
| Minimal package dependency        |        ✓         |        ✓         |           |
+-----------------------------------+------------------+------------------+-----------+
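
To pick a writer programmatically, the check marks in the table can be
encoded as data. This is only an illustrative sketch: the trait names
and the ``writers_with`` helper are hypothetical and not part of pytd;
the values are transcribed from the table above.

.. code:: py

   # Check marks transcribed from the comparison table above.
   WRITER_TRAITS = {
       "bulk_import": {"scalable", "memory_efficient", "minimal_dependency"},
       "insert_into": {"memory_efficient", "disk_efficient", "minimal_dependency"},
       "spark": {"scalable", "fast_large_writes"},
   }


   def writers_with(*required):
       """Return writer names whose traits include every required trait."""
       return [
           name
           for name, traits in WRITER_TRAITS.items()
           if set(required) <= traits
       ]


   print(writers_with("scalable"))                      # ['bulk_import', 'spark']
   print(writers_with("scalable", "memory_efficient"))  # ['bulk_import']

The returned name can then be passed as the ``writer`` argument of
``load_table_from_dataframe``.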

Enabling Spark Writer
^^^^^^^^^^^^^^^^^^^^^

Because td-spark accesses the primary storage system directly via
`PySpark <https://spark.apache.org/docs/latest/api/python/index.html>`__,
it requires some additional setup:

1. Contact support@treasuredata.com to enable the permission for your
   Treasure Data account.
2. Install pytd with the ``[spark]`` extra, which the ``spark`` writer
   requires: ``pip install pytd[spark]``
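
Before selecting the ``spark`` writer, you may want to confirm that the
optional dependencies are importable. The sketch below uses only the
standard library; the module names ``td_pyspark``, ``pyspark``, and
``pyarrow`` are the assumed import names of the packages listed under
the ``spark`` extra, and ``missing_spark_modules`` is a hypothetical
helper.

.. code:: py

   from importlib.util import find_spec

   # Assumed import names for the packages listed under the `spark` extra.
   SPARK_MODULES = ("td_pyspark", "pyspark", "pyarrow")


   def missing_spark_modules(modules=SPARK_MODULES):
       """Return the names of modules that cannot be found on this system."""
       return [name for name in modules if find_spec(name) is None]


   missing = missing_spark_modules()
   if missing:
       print("Install with `pip install pytd[spark]`; missing:", missing)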

To use an existing td-spark JAR file, create a ``SparkWriter`` with the
``td_spark_path`` option:

.. code:: py

   from pytd.writer import SparkWriter

   writer = SparkWriter(td_spark_path='/path/to/td-spark-assembly.jar')
   client.load_table_from_dataframe(df, 'mydb.bar', writer=writer, if_exists='overwrite')

Comparison between pytd, td-client-python, and pandas-td
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Treasure Data offers three different Python clients on GitHub, and the following list summarizes their characteristics.

1. `td-client-python <https://github.com/treasure-data/td-client-python>`__

   - Basic REST API wrapper.
   - Similar functionalities to td-client-{`ruby <https://github.com/treasure-data/td-client-ruby>`__, `java <https://github.com/treasure-data/td-client-java>`__, `node <https://github.com/treasure-data/td-client-node>`__, `go <https://github.com/treasure-data/td-client-go>`__}.
   - Its capabilities are limited to `what the Treasure Data REST API can do <https://tddocs.atlassian.net/wiki/spaces/PD/pages/1085354/REST+APIs+in+Treasure+Data>`__.

2. **pytd**

   - Access to Plazma via td-spark as introduced above.
   - Efficient connection to Presto based on `presto-python-client <https://github.com/prestodb/presto-python-client>`__.
   - Multiple data ingestion methods and a variety of utility functions.

3. `pandas-td <https://github.com/treasure-data/pandas-td>`__ *(deprecated)*

   - Old tool optimized for `pandas <https://pandas.pydata.org>`__ and `Jupyter Notebook <https://jupyter.org>`__.
   - **pytd** offers a compatible set of functions (see below for details).

An optimal choice of package depends on your specific use case, but common guidelines can be listed as follows:

- Use td-client-python if you want to execute *basic CRUD operations* from Python applications.
- Use **pytd** for (1) *analytical purposes* relying on pandas and Jupyter Notebook, and (2) achieving *more efficient data access* with ease.
- Do not use pandas-td. If you are using pandas-td, replace the code with pytd based on the following guidance as soon as possible.

How to replace pandas-td
^^^^^^^^^^^^^^^^^^^^^^^^

**pytd** offers
`pandas-td <https://github.com/treasure-data/pandas-td>`__-compatible
functions that provide the same functionality more efficiently. If you
are still using pandas-td, we recommend switching to **pytd** as
follows.

First, install the package from PyPI:

.. code:: sh

   pip install pytd
   # or, `pip install pytd[spark]` if you wish to use `to_td`

Next, make the following modifications to the import statements.

*Before:*

.. code:: python

   import pandas_td as td

.. code:: python

   In [1]: %load_ext pandas_td.ipython

*After:*

.. code:: python

   import pytd.pandas_td as td

.. code:: python

   In [1]: %load_ext pytd.pandas_td.ipython

After these changes, existing ``pandas_td`` code should keep running
correctly with ``pytd``. If you notice any incompatible behavior, please
`report an issue <https://github.com/treasure-data/pytd/issues/new>`__.
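
For a larger code base, the import rewrite can be applied mechanically.
The following is a rough sketch rather than a supported migration tool:
``migrate_line`` is a hypothetical helper that only touches lines
starting with ``import``, ``from``, or ``%load_ext``. It assumes the
module is imported under an alias or via ``from ... import``; a bare
``import pandas_td`` would still need manual attention because the bound
name changes.

.. code:: py

   import re

   # Match `pandas_td` only when it is not already qualified
   # (e.g. `pytd.pandas_td`) and not part of a longer identifier.
   _PANDAS_TD = re.compile(r"(?<![\w.])pandas_td\b")


   def migrate_line(line):
       """Rewrite import and %load_ext lines; leave other lines unchanged."""
       stripped = line.lstrip()
       if stripped.startswith(("import ", "from ", "%load_ext ")):
           return _PANDAS_TD.sub("pytd.pandas_td", line)
       return line


   print(migrate_line("import pandas_td as td"))  # import pytd.pandas_td as td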

.. |Build status| image:: https://github.com/treasure-data/pytd/workflows/Build/badge.svg
   :target: https://github.com/treasure-data/pytd/actions/
.. |PyPI version| image:: https://badge.fury.io/py/pytd.svg
   :target: https://badge.fury.io/py/pytd
.. |docs status| image:: https://readthedocs.org/projects/pytd-doc/badge/?version=latest
   :target: https://pytd-doc.readthedocs.io/en/latest/?badge=latest


