Metadata-Version: 2.1
Name: trifacta
Version: 8.5.1
Summary: Python SDK for Trifacta
Home-page: https://www.trifacta.com
Author: Trifacta Inc
Author-email: support@trifacta.com
License: UNKNOWN
Keywords: dataprep preparation wrangle wrangling wrangler trifacta
Platform: UNKNOWN
Requires-Python: >3.6, < 3.9
Description-Content-Type: text/markdown
License-File: LICENSE.txt
Requires-Dist: pandas (>=1.1.2)
Requires-Dist: numpy (>=1.17.0)
Requires-Dist: requests (>=2.22.0)
Requires-Dist: regex (>=2020.10.11)
Requires-Dist: python-slugify (>=4.0.1)
Requires-Dist: tqdm (>=4.55.1)
Requires-Dist: boto3 (>=1.17.25)
Requires-Dist: ipywidgets (>=7.6.3)
Requires-Dist: ijson (>=3.1.4)
Requires-Dist: simplejson (>=3.16.0)
Requires-Dist: pywebhdfs (>=0.4.1)
Requires-Dist: metaphone (>=0.6)

# Python SDK for Trifacta

Integrate your Python-based environment with Trifacta to rapidly transform your datasets. Please complete the following steps to install and set up the Python SDK for Trifacta.

## Availability

- **Alpha** release: software and supported capabilities may change without notice. Do not deploy in production environments.
- Available for the following product editions:
  - Trifacta Enterprise
  - Trifacta Professional
  - Trifacta Premium

## Limitations

- Some Wrangle functions and transformations are not supported by Python Pandas. Known limitations:
  - NUMFORMAT function
  - String comparison functions
- Transformations that use Array or Map data types are not supported for Python Pandas generation.
- Uploaded files must be in CSV file format.

## Pre-Requisites

### Assumptions

- Listed commands are for macOS.
- Examples below assume that you are using Jupyter Notebooks for Python flow development. 

### Trifacta Requirements

- A valid account in a project or workspace for one of the above product editions.
- A valid access token to the project or workspace. Instructions are provided below.
- To export your Trifacta recipe as Python code, a workspace administrator must enable the `Wrangle to Python Conversion` feature in the application. For more information, please visit [Workspace Settings Page](https://docs.trifacta.com/display/AWS/Workspace+Settings+Page).

### Python Requirements

- Python 3.7, Python 3.8
- For version requirements of specific Python components, please see `requirements.txt` in this package.
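
The version requirement above can be checked before installing. The following is a minimal sketch, not part of the SDK: the helper name `is_supported_python` is hypothetical, and the supported set is taken directly from the list above.

```python
import sys

# Supported minor versions as listed in this README (3.7 and 3.8).
SUPPORTED_VERSIONS = {(3, 7), (3, 8)}


def is_supported_python(version_info=None):
    """Return True if the given (major, minor, ...) version is listed as supported."""
    if version_info is None:
        version_info = sys.version_info
    return tuple(version_info[:2]) in SUPPORTED_VERSIONS


if __name__ == "__main__":
    current = "{}.{}".format(*sys.version_info[:2])
    if is_supported_python():
        print("Python", current, "is supported.")
    else:
        print("Python", current, "is not listed as supported; use 3.7 or 3.8.")
```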

## Install

- Install `trifacta` using pip:
~~~
   pip install trifacta
~~~


## Configure

### Enable access to your Trifacta workspace

- Log in to your Trifacta workspace.
- In the left menu, select **User menu > Preferences > Access tokens**. 
- To create a new access token, click **Generate new token**. Copy the token to the clipboard. You cannot retrieve a token after you exit the modal.
- Paste this token into a text file. Instructions for using it with the SDK are provided later. 

### Configure Trifacta package

Before you can use it to interact with your Trifacta environment, the Python SDK for Trifacta requires the following configuration:

- In your home directory, create the following configuration file: `.trifacta.py.conf`.
- Open the file in a text editor, and insert the following configuration. Replace values as needed:
  ```
  [CONFIGURATION]
  username = <username_for_trifacta_account>  # example: test-user@gmail.com
  endpoint = <uri_for_your_trifacta_workspace>  # example: https://test-workspace.example.com
  token = <copied_token_from_steps_above>
  ```
- Save the file.
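
If you prefer to create the file from Python, the steps above could be scripted roughly as follows. This is a sketch that uses only the standard library and assumes the SDK reads the file as standard INI. The helper names `write_trifacta_config` and `missing_config_keys` are hypothetical and are not part of the SDK.

```python
import configparser
from pathlib import Path

# Default location described above: ~/.trifacta.py.conf
DEFAULT_CONFIG_PATH = Path.home() / ".trifacta.py.conf"
REQUIRED_KEYS = ("username", "endpoint", "token")


def write_trifacta_config(username, endpoint, token, path=DEFAULT_CONFIG_PATH):
    """Write the [CONFIGURATION] section with the three documented keys."""
    # interpolation=None keeps '%' characters in a token from being interpreted.
    config = configparser.ConfigParser(interpolation=None)
    config["CONFIGURATION"] = {
        "username": username,
        "endpoint": endpoint,
        "token": token,
    }
    path = Path(path)
    with path.open("w") as handle:
        config.write(handle)
    # The file holds an access token, so restrict it to the current user
    # (this has limited effect on Windows).
    path.chmod(0o600)
    return path


def missing_config_keys(path=DEFAULT_CONFIG_PATH):
    """Return the documented keys that are absent or empty in the file."""
    config = configparser.ConfigParser(interpolation=None)
    config.read(path)  # a missing file is silently skipped
    if not config.has_section("CONFIGURATION"):
        return list(REQUIRED_KEYS)
    section = config["CONFIGURATION"]
    return [key for key in REQUIRED_KEYS if not section.get(key, fallback="").strip()]
```

For example, `write_trifacta_config('test-user@gmail.com', 'https://test-workspace.example.com', '<copied_token>')` writes the default file, and `missing_config_keys()` returns an empty list when all three values are present.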

## Use

For this release, you can use the Python SDK to upload a CSV file for transformation in a new flow. Additional file formats and workflows may be supported in future releases.

### Upload to new flow

- Create a new Python 3 notebook and import the `trifacta` module:
  ```
  import trifacta as tf
  ```
  `tf` is your handle for interacting with your Trifacta workspace.
- Insert the following code, which uploads a specified CSV for transformation in Trifacta: 
  ```
  import pandas as pd
  df = pd.read_csv(<path_to_csv_dataset>)
  wf = tf.wrangle(df)
  ```
  The `wrangle` function uploads a dataset to Trifacta and creates a flow for it. This flow is then available in the Trifacta application, where you can transform the dataset through the user interface. The function returns `wf`, a handle for the created flow, which you can use to perform other operations on your dataset.

- Run the notebook.

### Launch Trifacta in browser

- After the upload completes, execute the following to open Trifacta in a browser window.
  ```
  wf.open()
  ```
- In the Trifacta window, navigate to the flow that was created. This flow is likely to be named `Untitled` and to be listed in the Flows page at the top when sorted by timestamp. 
- In the created flow, create a recipe connected to your imported dataset.
- Edit the recipe. 
- In the Transformer page, you can identify issues with your dataset and add transformations to your recipe through a point-and-click interface. Click **Add** to add the corresponding transformation step to your recipe. For more information on using Trifacta, please visit [Trifacta Documentation](https://docs.trifacta.com).
- When you have finished defining your recipe steps, return to your Python notebook window.

### Generate Pandas code

- In the Python SDK, use the `get_pandas()` method to export the Wrangle recipe steps as Python code.
- **NOTE**: The `Wrangle to Python Conversion` setting must be enabled in Trifacta by your workspace administrator. See Trifacta Requirements above.
- Use the following to get Pandas code for the recipe that you created in Trifacta. You can apply this code to transform your Pandas DataFrame.
  ```
  wf.get_pandas(add_to_next_cell=True,recipe_name='<my_recipe>')
  ```
  `get_pandas` translates the Trifacta recipe into Pandas code.
  Setting `add_to_next_cell` to `True` adds the generated code to the next cell of the notebook.
  `recipe_name` can be specified to generate Pandas code for a specific recipe. If not specified, code is generated for the default recipe. To retrieve a list of available recipes, use `wf.recipe_names()`.
- Execute the generated code.
- In a new cell, run the following to transform the DataFrame using the Pandas code generated above.
  ```
  wrangled_df = run_transforms(df)
  wrangled_df
  ```
- The above returns your cleansed and transformed Pandas DataFrame.
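
Outside a notebook, `get_pandas` can return the code as a string (`add_to_next_cell=False`). The following sketch shows one possible way to turn such a string into a callable. It assumes the generated code defines a top-level `run_transforms` function, as the usage above implies. The helper `load_run_transforms` and the stand-in code string are hypothetical, and real generated code will differ. Only execute code that you trust.

```python
def load_run_transforms(generated_code):
    """Execute generated code in a private namespace and return run_transforms."""
    namespace = {}
    exec(generated_code, namespace)
    try:
        return namespace["run_transforms"]
    except KeyError:
        raise ValueError("generated code does not define run_transforms") from None


# Stand-in for the string returned by wf.get_pandas(add_to_next_cell=False).
# This placeholder simply returns its input unchanged.
example_code = '''
def run_transforms(df):
    return df
'''

run_transforms = load_run_transforms(example_code)
```

With a real flow, `load_run_transforms(wf.get_pandas(add_to_next_cell=False))` would be the corresponding call, after which `run_transforms(df)` can be used as shown above.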

## Examples

### Wrangle multiple datasets

The following example describes how to wrangle multiple datasets. In this example, `violations` and `violations_actions` are reference names for `violations_df` and `violations_actions_df` respectively. This mapping enables users to correctly map Pandas DataFrames to arguments/variables in generated Pandas code for the Wrangle recipe.

  ```
  import pandas as pd
  import trifacta as tf
  violations_df = pd.read_csv('../test/data/violations.csv')
  violations_actions_df = pd.read_csv('../test/data/violations_actions.csv')
  wrangle_flow = tf.wrangle((violations_df, 'violations'), (violations_actions_df, 'violations_actions'), flow_name='Example Flow')
  ```
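
When many CSV files are involved, reference names could be derived from the file names instead of typed by hand. The sketch below uses only the standard library. The helper `reference_name` is hypothetical, and the requirement that names be valid Python identifiers is an assumption based on their use as variables in the generated code.

```python
from pathlib import Path


def reference_name(csv_path):
    """Derive a reference name from a CSV path: 'violations.csv' -> 'violations'."""
    name = Path(csv_path).stem
    # Assumption: names end up as variables in generated code, so require identifiers.
    if not name.isidentifier():
        raise ValueError("not a valid Python identifier: {!r}".format(name))
    return name


csv_paths = [
    "../test/data/violations.csv",
    "../test/data/violations_actions.csv",
]

# Map each reference name to its path and reject duplicate names.
names_to_paths = {reference_name(p): p for p in csv_paths}
if len(names_to_paths) != len(csv_paths):
    raise ValueError("two CSV files would share the same reference name")

# With pandas and trifacta installed, the tuples could then be built and passed on:
#   datasets = [(pd.read_csv(path), name) for name, path in names_to_paths.items()]
#   wrangle_flow = tf.wrangle(*datasets, flow_name='Example Flow')
```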

### Wrangle existing flow


From your notebook, you can also begin wrangling an existing flow. The following example launches the Trifacta application in a flow whose internal identifier (`flow_id`) is `13`. 

  ```
  import trifacta as tf
  import pandas as pd
  flow_id = 13
  wf = tf.wrangle_existing(flow_id)
  # Following call opens the flow corresponding to 'flow_id' in Flow View. If a 'recipe_name' arg is provided, the recipe is opened in the Transformer page. 
  wf.open()
  ```

For additional examples, please see the `notebooks` directory in this package.

## Wrangle function reference

The following wrangling functions are available through the SDK. 

### Trifacta module functions

`tf` is an alias to the Trifacta module.

| Method           | Description | Arguments |
| -----------      | ----------- | ----------- |
| `tf.wrangle(*datasets)`      | Upload one or more datasets to the Trifacta application and create a flow for them. This flow is then available through the Trifacta application, where you can transform the datasets through the user interface. | **\*datasets**: Pandas DataFrames to be wrangled. Each item can also be a tuple, where the first element is a Pandas DataFrame and the second element is the reference name (string) for the DataFrame. |

### WrangleFlow module functions

The functions below are methods of the `WrangleFlow` object, so you must call them on a `WrangleFlow` instance in your Python environment.

`wf` is a reference to the `WrangleFlow` object.

| Method           | Description | Arguments |
| -----------      | ----------- | ----------- |
| `wf.add_datasets(*datasets)` | Add one or more Pandas DataFrames to a flow. | **\*datasets**: Pandas DataFrames to be added to a flow. Each item can also be a tuple, where the first element is a Pandas DataFrame and the second element is the reference name (string) for the DataFrame. |
| `wf.get_pandas(add_to_next_cell=False, recipe_name=None)` | Generates Python Pandas code for your Wrangle recipe. | **add_to_next_cell:** Set to `True` if you are using Jupyter Notebook and want the generated Pandas code added to the next cell. If `False`, the Pandas code is returned as a string.   **recipe_name:** Recipe for which you want to get the Pandas code. If not specified, the default recipe is used. Use `wf.recipe_names()` to retrieve available recipes. |
| `wf.run_job(pbar=None, execution='photon', recipe_name=None)` | Run a job for a specified recipe. | **pbar**: can be ignored.   **execution**: Running environment in the Trifacta platform where you want to execute the job. Possible values: `photon` or `emrSpark`.   **recipe_name**: Recipe for which you want to execute the job. If set to `None`, input is the default recipe. |
| `wf.profile(recipe_name=None)` | Generate a profile for a specified recipe. | **recipe_name**: Recipe for which you want to generate profile. If set to `None`, input is the default recipe. |
| `wf.recipe_names()` | Lists the names of the recipes present in the Trifacta application. | N/A |
| `wf.open_profile(recipe_name=None)` | Open a profile that you have previously generated for the specified recipe. | **recipe_name**: Recipe for which you want to open the profile. If set to `None`, input is the default recipe. |
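
The methods in the table above can be combined. As an illustration, the sketch below collects the generated Pandas code for every recipe in a flow. It assumes that `recipe_names()` returns an iterable of strings. `pandas_code_by_recipe` and `FakeWrangleFlow` are hypothetical names, and the fake class only stands in for a real `WrangleFlow` so that the helper can be run without a workspace.

```python
def pandas_code_by_recipe(wf):
    """Return {recipe_name: generated code string} for every recipe in the flow."""
    return {
        name: wf.get_pandas(add_to_next_cell=False, recipe_name=name)
        for name in wf.recipe_names()
    }


class FakeWrangleFlow:
    """Hypothetical stand-in for a WrangleFlow, used only to exercise the helper."""

    def recipe_names(self):
        return ["clean_violations", "join_actions"]

    def get_pandas(self, add_to_next_cell=False, recipe_name=None):
        # A real WrangleFlow would return generated Pandas code here.
        return "# pandas code for {}".format(recipe_name)


code_by_recipe = pandas_code_by_recipe(FakeWrangleFlow())
```

With a real flow, passing the `wf` object returned by `tf.wrangle` in place of `FakeWrangleFlow()` would be the corresponding call.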


### Data profiling functions

The SDK enables generation of data profiles based on the output of your Trifacta recipe: 

| Method           | Description | Arguments |
| -----------      | ----------- | ----------- |
| `summary(recipe_name=None)`      | Returns a table of summary statistics per column  | **recipe_name**: Recipe name for which you want to generate the summary. If set to `None`, input is the default recipe. |
| `dq_bars(show_types=True, recipe_name=None)`       | Returns the valid/invalid/missing ratio per column | **show_types**: Show column types information along with data quality bars for the column.   **recipe_name**: Recipe name for which you want to generate the data quality bar. If set to `None`, input is the default recipe. |
| `col_types(recipe_name=None)`     | Lists the inferred data type for each column | **recipe_name**: Recipe name for which you want to infer data types for each column. If set to `None`, input is the default recipe. |
| `bars_df_list(recipe_name)`   | Returns a list of dataframes, one per column, representing a bar-chart for that column | N/A |
| `pdf_profile(filename=None, recipe_name=None)`   | Returns a snazzy PDF report with all the statistics | **filename**: Name of the file to which PDF profile results are written. If set to `None`, results are returned by the function.   **recipe_name**: Recipe for which you want to generate PDF profile results. If set to `None`, results are generated for the default recipe. |


