Metadata-Version: 2.1
Name: datacatalog-util
Version: 0.1.0
Summary: A package to manage Google Cloud Data Catalog helper commands and scripts
Home-page: https://github.com/mesmacosta/datacatalog-util
Author: Marcelo Miranda
Author-email: mesmacosta@gmail.com
License: MIT license
Platform: Posix; MacOS X; Windows
Classifier: Development Status :: 3 - Alpha
Classifier: Natural Language :: English
Classifier: Programming Language :: Python
Classifier: Programming Language :: Python :: 3.6
Classifier: Programming Language :: Python :: 3.7
Requires-Python: >=3.6
Requires-Dist: google-cloud-datacatalog
Requires-Dist: pandas
Requires-Dist: tabulate
Requires-Dist: datacatalog-tag-manager

datacatalog-util
================

A Python package to manage Google Cloud Data Catalog helper commands and
scripts.

**Disclaimer: This is not an officially supported Google product.**

1. Environment setup
--------------------

1.1. Python + virtualenv
~~~~~~~~~~~~~~~~~~~~~~~~

Using `virtualenv <https://virtualenv.pypa.io/en/latest/>`__ is
optional, but strongly recommended unless you use
`Docker <#12-docker>`__.

1.1.1. Install Python 3.6+
^^^^^^^^^^^^^^^^^^^^^^^^^^

1.1.2. Create a folder
^^^^^^^^^^^^^^^^^^^^^^

This keeps all related files in the same place and makes the
instructions below easier to follow.

.. code:: bash

   mkdir ./datacatalog-util
   cd ./datacatalog-util

*All paths starting with ``./`` in the next steps are relative to the
``datacatalog-util`` folder.*

1.1.3. Create and activate an isolated Python environment
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

.. code:: bash

   pip install --upgrade virtualenv
   python3 -m virtualenv --python python3 env
   source ./env/bin/activate

1.1.4. Install the package
^^^^^^^^^^^^^^^^^^^^^^^^^^

.. code:: bash

   pip install --upgrade .

1.2. Docker
~~~~~~~~~~~

Docker may be used as an alternative to run the script. In this case,
please disregard the `Virtualenv <#11-python--virtualenv>`__ setup
instructions.

1.2.1. Get the source code
^^^^^^^^^^^^^^^^^^^^^^^^^^

.. code:: bash

   git clone https://github.com/mesmacosta/datacatalog-util
   cd ./datacatalog-util

1.3. Auth credentials
~~~~~~~~~~~~~~~~~~~~~

1.3.1. Create a service account and grant it below roles
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

-  BigQuery Metadata Viewer
-  Data Catalog Admin
-  A custom role with ``bigquery.datasets.updateTag`` and
   ``bigquery.tables.updateTag`` permissions

1.3.2. Download a JSON key and save it as
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

-  ``./credentials/datacatalog-util.json``

1.3.3. Set the environment variables
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

*This step may be skipped if you’re using*\ `Docker <#12-docker>`__\ *.*

.. code:: bash

   export GOOGLE_APPLICATION_CREDENTIALS=./credentials/datacatalog-util.json
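
Before running the commands in the next sections, it can help to confirm
that the variable points to a readable key file. The check below is only
a sketch and is not part of ``datacatalog-util``: ``check_credentials``
is a hypothetical helper name, and it assumes the file is a standard
service account key, whose JSON contains ``"type": "service_account"``.

.. code:: python

   import json
   import os


   def check_credentials(env=os.environ):
       """Return the key file path if GOOGLE_APPLICATION_CREDENTIALS is usable."""
       path = env.get("GOOGLE_APPLICATION_CREDENTIALS")
       if not path:
           raise RuntimeError("GOOGLE_APPLICATION_CREDENTIALS is not set")
       # The shell normally expands "~", but expand it here as well to be safe.
       path = os.path.expanduser(path)
       if not os.path.isfile(path):
           raise RuntimeError(f"Key file not found: {path}")
       with open(path) as key_file:
           key = json.load(key_file)
       # Assumption: the file is a service account key downloaded as JSON.
       if key.get("type") != "service_account":
           raise RuntimeError(f"Not a service account key: {path}")
       return path

Calling ``check_credentials()`` in the activated environment either
returns the key path or raises an error describing what is missing.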

2. Load Tags from CSV file
--------------------------

2.1. Create a CSV file representing the Tags to be created
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Tags are composed of as many lines as required to represent all of their
fields. The columns are described as follows:

+---------------------+----------------------------------+-----------+
| Column              | Description                      | Mandatory |
+=====================+==================================+===========+
| **linked_resource** | Full name of the asset the Entry | Y         |
|                     | refers to.                       |           |
+---------------------+----------------------------------+-----------+
| **template_name**   | Resource name of the Tag         | Y         |
|                     | Template for the Tag.            |           |
+---------------------+----------------------------------+-----------+
| **column**          | Attach Tags to a column          | N         |
|                     | belonging to the Entry schema.   |           |
+---------------------+----------------------------------+-----------+
| **field_id**        | Id of the Tag field.             | Y         |
+---------------------+----------------------------------+-----------+
| **field_value**     | Value of the Tag field.          | Y         |
+---------------------+----------------------------------+-----------+

*TIPS*

-  `sample-input/create-tags <https://github.com/mesmacosta/datacatalog-util/tree/master/sample-input/create-tags>`__
   for reference;
-  `Data Catalog Sample
   Tags <https://docs.google.com/spreadsheets/d/1bqeAXjLHUq0bydRZj9YBhdlDtuu863nwirx8t4EP_CQ>`__
   (Google Sheets) may help to create/export the CSV.
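
As an illustration, the sketch below writes a small CSV file with the
columns described above, using Python's standard ``csv`` module. The
project, dataset, table, template, and field names are hypothetical, as
is the ``write_tags_csv`` helper. The sketch also assumes that the file
starts with a header row holding the column names and that boolean
values are written as ``true``; check the sample input for the exact
format expected by the script.

.. code:: python

   import csv

   COLUMNS = ["linked_resource", "template_name", "column", "field_id", "field_value"]

   # Hypothetical resource names, used for illustration only.
   TABLE = "//bigquery.googleapis.com/projects/my-project/datasets/my_dataset/tables/my_table"
   TEMPLATE = "projects/my-project/locations/us-central1/tagTemplates/my_template"

   # One line per Tag field: two fields of a Tag attached to the table,
   # and one field of a Tag attached to the "email" column.
   ROWS = [
       [TABLE, TEMPLATE, "", "owner", "data-team"],
       [TABLE, TEMPLATE, "", "has_pii", "true"],
       [TABLE, TEMPLATE, "email", "has_pii", "true"],
   ]


   def write_tags_csv(path, rows=ROWS):
       """Write the header row followed by one line per Tag field."""
       with open(path, "w", newline="") as csv_file:
           writer = csv.writer(csv_file)
           writer.writerow(COLUMNS)
           writer.writerows(rows)

Calling ``write_tags_csv("tags.csv")`` produces a file whose path could
then be passed as ``CSV_FILE_PATH``.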

2.2. Run the datacatalog-util script
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

-  Python + virtualenv

.. code:: bash

   datacatalog-util create-tags --csv-file CSV_FILE_PATH

-  Docker

.. code:: bash

   docker build --rm --tag datacatalog-util .
   docker run --rm --tty \
     --volume CREDENTIALS_FILE_FOLDER:/credentials --volume CSV_FILE_FOLDER:/data \
     datacatalog-util create-tags --csv-file /data/CSV_FILE_NAME

3. Export Tags to CSV file
--------------------------

3.1. A list of CSV files, each representing one Template will be created.
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

A summary file with statistics about each template is also created in
the same directory.

The columns for the summary file are described as follows:

+-----------------------------------+-----------------------------------+
| Column                            | Description                       |
+===================================+===================================+
| **template_name**                 | Resource name of the Tag Template |
|                                   | for the Tag.                      |
+-----------------------------------+-----------------------------------+
| **tags_count**                    | Number of tags found from the     |
|                                   | template.                         |
+-----------------------------------+-----------------------------------+
| **tagged_entries_count**          | Number of tagged entries with the |
|                                   | template.                         |
+-----------------------------------+-----------------------------------+
| **tagged_columns_count**          | Number of tagged columns with the |
|                                   | template.                         |
+-----------------------------------+-----------------------------------+
| **tag_string_fields_count**       | Number of used String fields on   |
|                                   | tags of the template.             |
+-----------------------------------+-----------------------------------+
| **tag_bool_fields_count**         | Number of used Bool fields on     |
|                                   | tags of the template.             |
+-----------------------------------+-----------------------------------+
| **tag_double_fields_count**       | Number of used Double fields on   |
|                                   | tags of the template.             |
+-----------------------------------+-----------------------------------+
| **tag_timestamp_fields_count**    | Number of used Timestamp fields   |
|                                   | on tags of the template.          |
+-----------------------------------+-----------------------------------+
| **tag_enum_fields_count**         | Number of used Enum fields on     |
|                                   | tags of the template.             |
+-----------------------------------+-----------------------------------+
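
The summary columns can be used, for example, to see which templates
are used most. The sketch below is not part of ``datacatalog-util``;
``rank_templates`` is a hypothetical helper, the sample content is made
up, and it assumes the summary file has a header row with the column
names listed above and integer counts.

.. code:: python

   import csv
   import io


   def rank_templates(csv_file):
       """Return (template_name, tags_count) pairs, most used template first."""
       pairs = [
           (row["template_name"], int(row["tags_count"]))
           for row in csv.DictReader(csv_file)
       ]
       return sorted(pairs, key=lambda pair: pair[1], reverse=True)


   # Hypothetical summary content with only two of the columns.
   SAMPLE = (
       "template_name,tags_count\n"
       "projects/my-project/locations/us-central1/tagTemplates/template_a,3\n"
       "projects/my-project/locations/us-central1/tagTemplates/template_b,12\n"
   )

   for name, count in rank_templates(io.StringIO(SAMPLE)):
       print(count, name)

To use it on a real export, pass the opened summary file instead of the
in-memory sample.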

The columns for each template file are described as follows:

+-----------------------------------+-----------------------------------+
| Column                            | Description                       |
+===================================+===================================+
| **relative_resource_name**        | Data Catalog resource name of the |
|                                   | Entry.                            |
+-----------------------------------+-----------------------------------+
| **linked_resource**               | Full name of the asset the Entry  |
|                                   | refers to.                        |
+-----------------------------------+-----------------------------------+
| **template_name**                 | Resource name of the Tag Template |
|                                   | for the Tag.                      |
+-----------------------------------+-----------------------------------+
| **tag_name**                      | Resource name of the Tag.         |
+-----------------------------------+-----------------------------------+
| **column**                        | Attach Tags to a column belonging |
|                                   | to the Entry schema.              |
+-----------------------------------+-----------------------------------+
| **field_id**                      | Id of the Tag field.              |
+-----------------------------------+-----------------------------------+
| **field_type**                    | Type of the Tag field.            |
+-----------------------------------+-----------------------------------+
| **field_value**                   | Value of the Tag field.           |
+-----------------------------------+-----------------------------------+
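
Because each line holds a single Tag field, reading a template file
back usually means regrouping lines by ``tag_name``. The sketch below
shows one way to do that with the standard library. It is not part of
``datacatalog-util``: ``group_fields_by_tag`` is a hypothetical helper,
the sample rows use shortened placeholder values, and a header row with
the column names above is assumed.

.. code:: python

   import csv
   import io
   from collections import defaultdict


   def group_fields_by_tag(csv_file):
       """Map each tag_name to a {field_id: field_value} dictionary."""
       tags = defaultdict(dict)
       for row in csv.DictReader(csv_file):
           tags[row["tag_name"]][row["field_id"]] = row["field_value"]
       return dict(tags)


   # Hypothetical rows with a subset of the columns; real tag names are
   # full resource names rather than "tag-1" and "tag-2".
   SAMPLE = (
       "tag_name,field_id,field_type,field_value\n"
       "tag-1,owner,STRING,data-team\n"
       "tag-1,has_pii,BOOL,True\n"
       "tag-2,owner,STRING,finance\n"
   )

   TAGS = group_fields_by_tag(io.StringIO(SAMPLE))

All values are kept as strings; converting them according to
``field_type`` is left out of this sketch.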

.. _run-the-datacatalog-util-script-1:

3.2. Run the datacatalog-util script
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

-  Python + virtualenv

.. code:: bash

   datacatalog-util export-tags --project-ids my-project --dir-path DIR_PATH

4. Load Templates from CSV file
-------------------------------

4.1. Create a CSV file representing the Templates to be created
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Templates are composed of as many lines as required to represent all of
their fields. The columns are described as follows:

+------------------------+---------------------------+-----------+
| Column                 | Description               | Mandatory |
+========================+===========================+===========+
| **template_name**      | Resource name of the Tag  | Y         |
|                        | Template for the Tag.     |           |
+------------------------+---------------------------+-----------+
| **display_name**       | Display name of the Tag   | Y         |
|                        | Template.                 |           |
+------------------------+---------------------------+-----------+
| **field_id**           | Id of the Tag Template    | Y         |
|                        | field.                    |           |
+------------------------+---------------------------+-----------+
| **field_display_name** | Display name of the Tag   | Y         |
|                        | Template field.           |           |
+------------------------+---------------------------+-----------+
| **field_type**         | Type of the Tag Template  | Y         |
|                        | field.                    |           |
+------------------------+---------------------------+-----------+
| **enum_values**        | Values for the Enum       | N         |
|                        | field.                    |           |
+------------------------+---------------------------+-----------+
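
A quick pre-check of the CSV can catch lines that lack a mandatory
value before the script runs. The sketch below is not part of
``datacatalog-util``: ``find_incomplete_lines`` is a hypothetical
helper, the template and field values are made up (including the
``STRING`` and ``BOOL`` type names), and a header row with the column
names above is assumed.

.. code:: python

   import csv
   import io

   # Columns marked "Y" in the table above.
   MANDATORY = [
       "template_name",
       "display_name",
       "field_id",
       "field_display_name",
       "field_type",
   ]


   def find_incomplete_lines(csv_file):
       """Return (line_number, missing_columns) for lines lacking mandatory values."""
       problems = []
       # Line 1 is the header, so data starts at line 2 (assuming no
       # field spans multiple lines).
       for line_number, row in enumerate(csv.DictReader(csv_file), start=2):
           missing = [name for name in MANDATORY if not (row.get(name) or "").strip()]
           if missing:
               problems.append((line_number, missing))
       return problems


   # Hypothetical content: the second data line has no field_display_name.
   TEMPLATE = "projects/my-project/locations/us-central1/tagTemplates/my_template"
   SAMPLE = (
       "template_name,display_name,field_id,field_display_name,field_type,enum_values\n"
       + TEMPLATE + ",My Template,owner,Owner,STRING,\n"
       + TEMPLATE + ",My Template,has_pii,,BOOL,\n"
   )

   PROBLEMS = find_incomplete_lines(io.StringIO(SAMPLE))

An empty result means every line carries all mandatory columns;
``enum_values`` is optional and is therefore not checked.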

4.2. Run the datacatalog-util script - Create the Tag Templates
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

-  Python + virtualenv

.. code:: bash

   datacatalog-util create-tag-templates --csv-file CSV_FILE_PATH

4.3. Run the datacatalog-util script - Delete the Tag Templates
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

-  Python + virtualenv

.. code:: bash

   datacatalog-util delete-tag-templates --csv-file CSV_FILE_PATH

*TIPS*

-  `sample-input/create-tag-templates <https://github.com/mesmacosta/datacatalog-util/tree/master/sample-input/create-tag-templates>`__
   for reference.

5. Export Templates to CSV file
-------------------------------

5.1. A CSV file representing the Templates will be created
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Templates are composed of as many lines as required to represent all of
their fields. The columns are described as follows:

====================== ==============================================
Column                 Description
====================== ==============================================
**template_name**      Resource name of the Tag Template for the Tag.
**display_name**       Display name of the Tag Template.
**field_id**           Id of the Tag Template field.
**field_display_name** Display name of the Tag Template field.
**field_type**         Type of the Tag Template field.
**enum_values**        Values for the Enum field.
====================== ==============================================

.. _run-the-datacatalog-util-script-2:

5.2. Run the datacatalog-util script
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

-  Python + virtualenv

.. code:: bash

   datacatalog-util export-tag-templates --project-ids my-project --file-path CSV_FILE_PATH


