Metadata-Version: 2.1
Name: pytd
Version: 1.1.0
Summary: Treasure Data Driver for Python
Home-page: https://github.com/treasure-data/pytd
Author: Arm Treasure Data
Author-email: support@treasure-data.com
Maintainer: Arm Treasure Data
Maintainer-email: support@treasure-data.com
License: Apache License 2.0
Platform: UNKNOWN
Classifier: Development Status :: 5 - Production/Stable
Classifier: Intended Audience :: Developers
Classifier: License :: OSI Approved :: Apache Software License
Classifier: Topic :: Database
Classifier: Programming Language :: Python
Classifier: Programming Language :: Python :: 3
Classifier: Programming Language :: Python :: 3.5
Classifier: Programming Language :: Python :: 3.6
Classifier: Programming Language :: Python :: 3.7
Description-Content-Type: text/x-rst
Provides-Extra: spark
Provides-Extra: test
Provides-Extra: doc
Requires-Dist: urllib3 (<1.25,>=1.21.1)
Requires-Dist: presto-python-client (>=0.6.0)
Requires-Dist: pandas (>=0.24.0)
Requires-Dist: td-client (>=1.1.0)
Requires-Dist: pytz (>=2018.5)
Provides-Extra: doc
Requires-Dist: sphinx (>=2.2.0); extra == 'doc'
Requires-Dist: sphinx-rtd-theme; extra == 'doc'
Requires-Dist: numpydoc; extra == 'doc'
Requires-Dist: ipython; extra == 'doc'
Provides-Extra: spark
Requires-Dist: td-pyspark (>=19.9.0); extra == 'spark'
Requires-Dist: pyspark (>=2.4.0); extra == 'spark'
Requires-Dist: pyarrow (>=0.11.0); extra == 'spark'
Provides-Extra: test
Requires-Dist: pytest; extra == 'test'

pytd
====

|Build Status| |Build status| |PyPI version| |docs status|

**pytd** provides user-friendly interfaces to Treasure Data’s `REST
APIs <https://github.com/treasure-data/td-client-python>`__, `Presto
query
engine <https://support.treasuredata.com/hc/en-us/articles/360001457427-Presto-Query-Engine-Introduction>`__,
and `Plazma primary
storage <https://www.slideshare.net/treasure-data/td-techplazma>`__.

This seamless connection lets your Python code efficiently read and
write large volumes of data from and to Treasure Data, making your
day-to-day data analytics work more productive.

Installation
------------

.. code:: sh

   pip install pytd

Usage
-----

-  `Documentation <https://pytd-doc.readthedocs.io/>`__
-  `Sample usage on Google
   Colaboratory <https://colab.research.google.com/drive/1ps_ChU-H2FvkeNlj1e1fcOebCt4ryN11>`__

Set your `API
key <https://support.treasuredata.com/hc/en-us/articles/360000763288-Get-API-Keys>`__
and
`endpoint <https://support.treasuredata.com/hc/en-us/articles/360001474288-Sites-and-Endpoints>`__
in the environment variables ``TD_API_KEY`` and ``TD_API_SERVER``,
respectively, and create a client instance:

.. code:: py

   import pytd

   client = pytd.Client(database='sample_datasets')
   # or, hard-code your API key, endpoint, and/or query engine:
   # >>> pytd.Client(apikey='1/XXX', endpoint='https://api.treasuredata.com/', database='sample_datasets', default_engine='presto')
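
The precedence between explicit arguments and environment variables can
be sketched in plain Python. The helper ``resolve_td_config`` below is
hypothetical and not part of pytd; it only illustrates one way to combine
``TD_API_KEY`` and ``TD_API_SERVER`` with explicit overrides. The default
endpoint is an assumption taken from the example above.

.. code:: py

   import os

   # Assumption: default endpoint copied from the example above
   DEFAULT_ENDPOINT = 'https://api.treasuredata.com/'


   def resolve_td_config(apikey=None, endpoint=None, env=None):
       """Hypothetical helper: prefer explicit values, then environment variables."""
       env = os.environ if env is None else env
       apikey = apikey or env.get('TD_API_KEY')
       if not apikey:
           raise ValueError('set TD_API_KEY or pass apikey explicitly')
       endpoint = endpoint or env.get('TD_API_SERVER') or DEFAULT_ENDPOINT
       return {'apikey': apikey, 'endpoint': endpoint}


   # A plain dict stands in for os.environ so the sketch runs anywhere
   config = resolve_td_config(env={'TD_API_KEY': '1/XXX'})
   # The result could then be unpacked as keyword arguments:
   # pytd.Client(database='sample_datasets', **config)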

Query in Treasure Data
~~~~~~~~~~~~~~~~~~~~~~

Issue a Presto query and retrieve the result:

.. code:: py

   client.query('select symbol, count(1) as cnt from nasdaq group by 1 order by 1')
   # {'columns': ['symbol', 'cnt'], 'data': [['AAIT', 590], ['AAL', 82], ['AAME', 9252], ..., ['ZUMZ', 2364]]}
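
Because the result is a plain dictionary of column names and rows, it is
easy to reshape. The following is a minimal sketch that pairs each row
with its column names; ``rows_as_dicts`` is a hypothetical helper, and
the sample dictionary is hand-written in the same shape as the output
shown above rather than fetched from Treasure Data.

.. code:: py

   def rows_as_dicts(result):
       """Pair each row in result['data'] with the names in result['columns']."""
       columns = result['columns']
       return [dict(zip(columns, row)) for row in result['data']]


   # Hand-written sample in the same shape as a query result
   result = {'columns': ['symbol', 'cnt'], 'data': [['AAIT', 590], ['AAL', 82]]}
   records = rows_as_dicts(result)
   # records[0] == {'symbol': 'AAIT', 'cnt': 590}

If pandas is available, ``pd.DataFrame(result['data'],
columns=result['columns'])`` gives the same data as a ``DataFrame``.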

To run a query on Hive instead:

.. code:: py

   client.query('select hivemall_version()', engine='hive')
   # {'columns': ['_c0'], 'data': [['0.6.0-SNAPSHOT-201901-r01']]} (as of Feb, 2019)

It is also possible to explicitly initialize ``pytd.Client`` for Hive:

.. code:: py

   client_hive = pytd.Client(database='sample_datasets', default_engine='hive')
   client_hive.query('select hivemall_version()')

Write data to Treasure Data
~~~~~~~~~~~~~~~~~~~~~~~~~~~

Data represented as ``pandas.DataFrame`` can be written to Treasure Data
as follows:

.. code:: py

   import pandas as pd

   df = pd.DataFrame(data={'col1': [1, 2], 'col2': [3, 10]})
   client.load_table_from_dataframe(df, 'takuti.foo', writer='bulk_import', if_exists='overwrite')

For the ``writer`` option, pytd supports three ways to ingest data into
Treasure Data:

1. **Bulk Import API**: ``bulk_import`` (default)

   -  Convert data into a CSV file and upload it in batch.

2. **Presto INSERT INTO query**: ``insert_into``

   -  Insert every single row in ``DataFrame`` by issuing an INSERT INTO
      query through the Presto query engine.
   -  Recommended only for a small volume of data.

3. `td-spark <https://support.treasuredata.com/hc/en-us/articles/360001487167-Apache-Spark-Driver-td-spark-FAQs>`__:
   ``spark``

   -  A local, customized Spark instance writes the ``DataFrame``
      directly to Treasure Data’s primary storage system.
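
To see why ``insert_into`` suits only small data, consider how rows
might be turned into SQL text. The sketch below is an illustration under
stated assumptions, not pytd's actual implementation: ``sql_literal`` and
``build_insert_into`` are hypothetical helpers that handle only numbers,
booleans, strings, and ``None``. The statement grows with every row, so
large inputs produce very long queries.

.. code:: py

   def sql_literal(value):
       """Render a Python value as a SQL literal (limited set of types)."""
       if value is None:
           return 'NULL'
       if isinstance(value, bool):
           return 'true' if value else 'false'
       if isinstance(value, (int, float)):
           return repr(value)
       # Strings: wrap in single quotes and double any embedded quote
       return "'" + str(value).replace("'", "''") + "'"


   def build_insert_into(table, columns, rows):
       """Hypothetical helper: build one INSERT INTO statement listing every row."""
       values = ', '.join(
           '(' + ', '.join(sql_literal(v) for v in row) + ')' for row in rows
       )
       return 'INSERT INTO {} ({}) VALUES {}'.format(table, ', '.join(columns), values)


   # Same data as the DataFrame example above, row by row
   query = build_insert_into('takuti.foo', ['col1', 'col2'], [[1, 3], [2, 10]])
   # query == 'INSERT INTO takuti.foo (col1, col2) VALUES (1, 3), (2, 10)'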

Characteristics of each of these methods can be summarized as follows:

+-----------------------------------+------------------+------------------+-----------+
|                                   | ``bulk_import``  | ``insert_into``  | ``spark`` |
+===================================+==================+==================+===========+
| Scalable against data volume      |        ✓         |                  |     ✓     |
+-----------------------------------+------------------+------------------+-----------+
| Write performance for larger data |                  |                  |     ✓     |
+-----------------------------------+------------------+------------------+-----------+
| Memory efficient                  |        ✓         |        ✓         |           |
+-----------------------------------+------------------+------------------+-----------+
| Disk efficient                    |                  |        ✓         |           |
+-----------------------------------+------------------+------------------+-----------+
| Minimal package dependency        |        ✓         |        ✓         |           |
+-----------------------------------+------------------+------------------+-----------+
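
The table can also be transcribed into a small lookup for choosing a
writer programmatically. This is only a sketch: the trait names and the
``writers_with`` helper are made up for illustration, while the
checkmarks are copied from the table above.

.. code:: py

   # Transcription of the table above: traits each writer satisfies
   WRITER_TRAITS = {
       'bulk_import': {'scalable', 'memory_efficient', 'minimal_dependency'},
       'insert_into': {'memory_efficient', 'disk_efficient', 'minimal_dependency'},
       'spark': {'scalable', 'fast_for_large_data'},
   }


   def writers_with(*traits):
       """Return the writer names that satisfy every requested trait, sorted."""
       wanted = set(traits)
       return sorted(name for name, have in WRITER_TRAITS.items() if wanted <= have)


   candidates = writers_with('scalable', 'minimal_dependency')
   # candidates == ['bulk_import']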

Enabling Spark Writer
^^^^^^^^^^^^^^^^^^^^^

Since td-spark gives special access to the main storage system via
`PySpark <https://spark.apache.org/docs/latest/api/python/index.html>`__,
follow the instructions below:

1. Contact support@treasuredata.com to enable the permission for your
   Treasure Data account.
2. Install pytd with the ``[spark]`` extra to use the ``spark`` writer:
   ``pip install pytd[spark]``
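
Before selecting the ``spark`` writer, it can help to confirm that the
extra dependencies are importable. The check below is a minimal sketch
using only the standard library; ``missing_modules`` is a hypothetical
helper, and the module names are assumed to be the import names of the
packages installed by the ``[spark]`` extra.

.. code:: py

   from importlib.util import find_spec


   def missing_modules(names):
       """Return the top-level module names that cannot be found here."""
       return [name for name in names if find_spec(name) is None]


   # Assumption: import names of the packages behind the [spark] extra
   SPARK_MODULES = ['pyspark', 'td_pyspark', 'pyarrow']

   missing = missing_modules(SPARK_MODULES)
   if missing:
       print('Spark writer unavailable; missing modules:', ', '.join(missing))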

To use an existing td-spark JAR file, create a ``SparkWriter`` with the
``td_spark_path`` option:

.. code:: py

   from pytd.writer import SparkWriter

   writer = SparkWriter(apikey='1/XXX', endpoint='https://api.treasuredata.com/', td_spark_path='/path/to/td-spark-assembly.jar')
   client.load_table_from_dataframe(df, 'mydb.bar', writer=writer, if_exists='overwrite')

How to replace pandas-td
------------------------

**pytd** offers
`pandas-td <https://github.com/treasure-data/pandas-td>`__-compatible
functions that provide the same functionality more efficiently. If you
are still using pandas-td, we recommend switching to **pytd** as
follows.

First, install the package from PyPI:

.. code:: sh

   pip install pytd
   # or, `pip install pytd[spark]` if you wish to use `to_td`

Next, modify the import statements as follows.

*Before:*

.. code:: python

   import pandas_td as td

.. code:: python

   In [1]: %load_ext pandas_td.ipython

*After:*

.. code:: python

   import pytd.pandas_td as td

.. code:: python

   In [1]: %load_ext pytd.pandas_td.ipython

After these changes, all ``pandas_td`` code should keep running
correctly with ``pytd``. If you notice any incompatible behavior, please
`report an issue <https://github.com/treasure-data/pytd/issues/new>`__.
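
For larger code bases, the import changes can be applied mechanically.
The following is a rough sketch rather than an official migration tool:
``migrate_line`` and ``migrate_source`` are hypothetical helpers that
rewrite only import statements and ``%load_ext`` lines, and leave all
other text untouched.

.. code:: py

   import re

   # Match `pandas_td` only when it is not already prefixed with `pytd.`
   _PANDAS_TD = re.compile(r'(?<!pytd\.)\bpandas_td\b')
   _PREFIXES = ('import pandas_td', 'from pandas_td', '%load_ext pandas_td')


   def migrate_line(line):
       """Rewrite import and %load_ext lines; leave every other line as-is."""
       if line.lstrip().startswith(_PREFIXES):
           return _PANDAS_TD.sub('pytd.pandas_td', line)
       return line


   def migrate_source(source):
       """Apply migrate_line to every line of a source string."""
       return '\n'.join(migrate_line(line) for line in source.split('\n'))


   before = 'import pandas_td as td\nprint(td)'
   after = migrate_source(before)
   # after == 'import pytd.pandas_td as td\nprint(td)'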

.. |Build Status| image:: https://travis-ci.org/treasure-data/pytd.svg?branch=master
   :target: https://travis-ci.org/treasure-data/pytd
.. |Build status| image:: https://ci.appveyor.com/api/projects/status/h1os6uvl598o7cau?svg=true
   :target: https://ci.appveyor.com/project/takuti/pytd
.. |PyPI version| image:: https://badge.fury.io/py/pytd.svg
   :target: https://badge.fury.io/py/pytd
.. |docs status| image:: https://readthedocs.org/projects/pytd-doc/badge/?version=latest
   :target: https://pytd-doc.readthedocs.io/en/latest/?badge=latest


