Metadata-Version: 2.1
Name: stairlight
Version: 0.6.0
Summary: An end-to-end data lineage tool
License: MIT
Author: tosh2230
Author-email: rev.to12@gmail.com
Requires-Python: >=3.7.2,<4.0.0
Classifier: License :: OSI Approved :: MIT License
Classifier: Programming Language :: Python :: 3
Classifier: Programming Language :: Python :: 3.10
Classifier: Programming Language :: Python :: 3.8
Classifier: Programming Language :: Python :: 3.9
Provides-Extra: dbt-bigquery
Provides-Extra: gcs
Provides-Extra: redash
Provides-Extra: s3
Requires-Dist: Jinja2 (>=2.10.3)
Requires-Dist: PyYAML (>=5.0)
Requires-Dist: SQLAlchemy (>=1.4.31,<2.0.0); extra == "redash"
Requires-Dist: boto3 (>=1.24.14,<2.0.0); extra == "s3"
Requires-Dist: boto3-stubs[s3] (>=1.24.17,<2.0.0); extra == "s3"
Requires-Dist: dbt-bigquery (>=1.1,<2.0); extra == "dbt-bigquery"
Requires-Dist: dbt-core (>=1.1,<2.0); extra == "dbt-bigquery"
Requires-Dist: google-cloud-storage (>=1.28.1,<2.0.0); extra == "gcs"
Requires-Dist: psycopg2 (>=2.9.3,<3.0.0); extra == "redash"
Description-Content-Type: text/markdown

<div align="center">
  <img src="https://raw.githubusercontent.com/tosh2230/stairlight/main/img/stairlight_white.png" width="400" alt="Stairlight">
</div>

-----------------

# Stairlight

[![PyPi Version](https://img.shields.io/pypi/v/stairlight.svg?style=flat-square&logo=PyPi)](https://pypi.org/project/stairlight/)
[![PyPi License](https://img.shields.io/pypi/l/stairlight.svg?style=flat-square)](https://pypi.org/project/stairlight/)
[![PyPi Python Versions](https://img.shields.io/pypi/pyversions/stairlight.svg?style=flat-square)](https://pypi.org/project/stairlight/)
[![Code style: black](https://img.shields.io/badge/code%20style-black-000000.svg?style=flat-square)](https://github.com/psf/black)
[![CI](https://github.com/tosh2230/stairlight/actions/workflows/ci.yml/badge.svg)](https://github.com/tosh2230/stairlight/actions/workflows/ci.yml)

An end-to-end data lineage tool, detects table dependencies by SQL SELECT statements.

<div align="left">
  <img src="https://raw.githubusercontent.com/tosh2230/stairlight/main/img/drawio/concepts.drawio.png" width="800" alt="concepts">
</div>

## Supported Data Sources

| Data Source | Remarks |
| --- | --- |
| Local file system | With Python Pathlib module |
| [Amazon S3](https://aws.amazon.com/s3/) | |
| [Google Cloud Storage](https://cloud.google.com/storage) | Also available for [Google Cloud Composer](https://cloud.google.com/composer) |
| [dbt](https://www.getdbt.com/) - [Google BigQuery](https://cloud.google.com/bigquery) | Using `dbt compile` command internally |
| [Redash](https://redash.io/) | |

## Installation

This package is distributed on [PyPI](https://pypi.org/project/stairlight/).

```sh
# The base package is for local file system only.
$ pip install stairlight

# Set extras when detecting from other data sources.
# e.g. Amazon S3 and Google Cloud Storage
$ pip install "stairlight[s3, gcs]"
```

| Data Source | TemplateSourceType | Extra |
| --- | --- | --- |
| Local file system | File | - |
| Amazon S3 | S3 | s3 |
| Google Cloud Storage | GCS | gcs |
| dbt - Google Bigquery | dbt | dbt-bigquery |
| Redash | Redash | redash |

## Getting Started

There are 3 steps to use.

```sh
# Step 1: Initialize and set data location settings
$ stairlight init
'./stairlight.yaml' has created.
Please edit it to set your data sources.

# Step 2: Map your SQL statements and tables, and add metadata
$ stairlight map
'./mapping_yyyyMMddhhmmss.yaml' has created.
Please map undefined tables and parameters, and append to your latest configuration file.

# Step 3: Get a table dependency map
$ stairlight
```

## Description

### Input

- SQL SELECT statements
- Configuration YAML files
    - stairlight.yaml: SQL statements locations and include/exclude conditions.
    - mapping.yaml: Mapping SQL statements and tables.

### Output

- JSON dependency map

    <details>

    <summary>Example</summary>

    ```json
    {
      "PROJECT_d.DATASET_e.TABLE_f": {
        "PROJECT_j.DATASET_k.TABLE_l": {
          "TemplateSourceType": "File",
          "Key": "tests/sql/main/one_line_2.sql",
          "Uri": "/foo/bar/stairlight/tests/sql/main/one_line_2.sql",
          "Lines": [
            {
              "LineNumber": 1,
              "LineString": "SELECT * FROM PROJECT_j.DATASET_k.TABLE_l WHERE 1 = 1"
            }
          ]
        },
        "PROJECT_C.DATASET_C.TABLE_C": {
          "TemplateSourceType": "GCS",
          "Key": "sql/cte/cte_multi_line.sql",
          "Uri": "gs://stairlight/sql/cte/cte_multi_line.sql",
          "Lines": [
            {
              "LineNumber": 6,
              "LineString": "        PROJECT_C.DATASET_C.TABLE_C"
            }
          ],
          "BucketName": "stairlight",
          "Labels": {
            "Source": "gcs",
            "Test": "b"
          }
        }
      },
      "AggregateSales": {
        "PROJECT_e.DATASET_e.TABLE_e": {
          "TemplateSourceType": "Redash",
          "Key": 5,
          "Uri": "AggregateSales",
          "Lines": [
            {
              "LineNumber": 1,
              "LineString": "SELECT service, SUM(total_amount) FROM PROJECT_e.DATASET_e.TABLE_e GROUP BY service"
            }
          ],
          "DataSourceName": "BigQuery",
          "Labels": {
            "Category": "Sales"
          }
        }
      },
      "dummy.dummy.example_b": {
        "PROJECT_t.DATASET_t.TABLE_t": {
          "TemplateSourceType": "dbt",
          "Key": "tests/dbt/project_01/target/compiled/project_01/models/b/example_b.sql",
          "Uri": "/foo/bar/stairlight/tests/dbt/project_01/target/compiled/project_01/models/b/example_b.sql",
          "Lines": [
            {
              "LineNumber": 1,
              "LineString": "select * from PROJECT_t.DATASET_t.TABLE_t where value_a = 0 and value_b = 0"
            }
          ]
        }
      },
      "PROJECT_as.DATASET_bs.TABLE_cs": {
        "PROJECT_A.DATASET_A.TABLE_A": {
          "TemplateSourceType": "S3",
          "Key": "sql/one_line/one_line.sql",
          "Uri": "s3://stairlight/sql/one_line/one_line.sql",
          "Lines": [
            {
              "LineNumber": 1,
              "LineString": "SELECT * FROM PROJECT_A.DATASET_A.TABLE_A WHERE 1 = 1"
            }
          ],
          "BucketName": "stairlight",
          "Labels": {
            "Source": null,
            "Test": "a"
          }
        }
      }
    }
    ```

    </details>

### Collecting patterns

#### Centralization

<div align="left">
  <img src="https://raw.githubusercontent.com/tosh2230/stairlight/main/img/drawio/centralization.drawio.png" width="800" alt="centralization">
</div>

#### Agents

<div align="left">
  <img src="https://raw.githubusercontent.com/tosh2230/stairlight/main/img/drawio/agents.drawio.png" width="800" alt="agents">
</div>

## Configuration

Configuration files can be found [here](https://github.com/tosh2230/stairlight/tree/main/tests/config), used for unit testing in CI.

### stairlight.yaml

'stairlight.yaml' is for setting up Stairlight itself. It is responsible for specifying SQL statements to be read.

`init` command creates a template of stairlight.yaml.

  <details>

  <summary>Example</summary>

  ```yaml
  Include:
    - TemplateSourceType: File
      FileSystemPath: "./tests/sql"
      Regex: ".*/*.sql$"
      DefaultTablePrefix: "PROJECT_A"
    - TemplateSourceType: GCS
      ProjectId: null
      BucketName: stairlight
      Regex: "^sql/.*/*.sql$"
      DefaultTablePrefix: "PROJECT_A"
    - TemplateSourceType: Redash
      DatabaseUrlEnvironmentVariable: REDASH_DATABASE_URL
      DataSourceName: BigQuery
      QueryIds:
        - 1
        - 3
        - 5
    - TemplateSourceType: dbt
      ProjectDir: tests/dbt/project_01
      ProfilesDir: tests/dbt
      Vars:
        key_a: value_a
        key_b: value_b
    - TemplateSourceType: S3
      BucketName: stairlight
      Regex: "^sql/.*/*.sql$"
      DefaultTablePrefix: "PROJECT_A"
  Exclude:
    - TemplateSourceType: File
      Regex: "main/exclude.sql$"
  Settings:
    MappingPrefix: "mapping"
  ```

  </details>

### mapping.yaml

'mapping.yaml' is used to define relationships between input SELECT statements and table names.

`map` command creates a template of mapping.yaml, based on the configuration of stairlight.yaml.

  <details>

  <summary>Example</summary>

  ```yaml
  Global:
    Parameters:
      DESTINATION_PROJECT: stairlight
      params:
        PROJECT: 1234567890
        DATASET: public
        TABLE: taxirides
  Mapping:
    - TemplateSourceType: File
      FileSuffix: "tests/sql/main/union_same_table.sql"
      Tables:
        - TableName: "test_project.beam_streaming.taxirides_aggregation"
          Parameters:
            params:
              source_table: source
              destination_table: destination
          IgnoreParameters:
            - execution_date.add(days=1).isoformat()
    - TemplateSourceType: GCS
      Uri: "gs://stairlight/sql/one_line/one_line.sql"
      Tables:
        - TableName: "PROJECT_a.DATASET_b.TABLE_c"
    - TemplateSourceType: Redash
      QueryId: 5
      DataSourceName: metadata
      Tables:
        - TableName: New Query
          Parameters:
            table: dashboards
          Labels:
            Category: Redash test
    - TemplateSourceType: dbt
      ProjectName: project_01
      FileSuffix: tests/dbt/project_01/target/compiled/project_01/models/example/my_first_dbt_model.sql
      Tables:
        - TableName: dummy.dummy.my_first_dbt_model
    - TemplateSourceType: S3
      Uri: "s3://stairlight/sql/one_line/one_line.sql"
      Tables:
        - TableName: "PROJECT_as.DATASET_bs.TABLE_cs"
  Metadata:
    - TableName: "PROJECT_A.DATASET_A.TABLE_A"
      Labels:
        Source: Null
        Test: a
  ```

  </details>

#### Global Section

This section is for global configurations.

`Parameters` attribute is used to set common parameters. If conflicts has occurred with `Parameters` attributes in mapping section, mapping section's parameters will be used in preference to global.

#### Mapping Section

Mapping section is used to define relationships between input SELECT statements and tables that created as a result of query execution.

`Parameters` attribute allows you to reflect settings in [jinja](https://jinja.palletsprojects.com/) template variables embedded in statements. If multiple settings are applied to a statement using jinja template, the statement will be read as if there were the same number of queries as the number of settings.

In contrast, `IgnoreParameters` attribute handles a list to ignore when rendering queries.

#### Metadata Section

This section is mainly used to set metadata to tables appears only in queries.

## Arguments and Options

```txt
$ stairlight --help
usage: stairlight [-h] [-c CONFIG] [--save SAVE] [--load LOAD] {init,check,up,down} ...

An end-to-end data lineage tool, detects table dependencies by SQL SELECT statements.
Without positional arguments, return a table dependency map as JSON format.

positional arguments:
  {init,map,check,up,down}
    init                create new Stairlight configuration file
    map (check)         create new configuration file about undefined mappings
    up                  return upstairs ( table | SQL file ) list
    down                return downstairs ( table | SQL file ) list

optional arguments:
  -h, --help            show this help message and exit
  -c CONFIG, --config CONFIG
                        set Stairlight configuration directory
  -q, --quiet           keep silence
  --save SAVE           A file path where map results will be saved.
                        You can choose from local file system, GCS, S3.
  --load LOAD           A file path where map results are saved.
                        You can choose from local file system, GCS, S3.
                        It can be specified multiple times.
```

### init

`init` creates a new Stairlight configuration file.

```txt
$ stairlight init --help
usage: stairlight init [-h] [-c CONFIG]

optional arguments:
  -h, --help            show this help message and exit
  -c CONFIG, --config CONFIG
                        set Stairlight configuration directory.
  -q, --quiet           keep silence
```

### map(check)

`map` creates new configuration file about undefined mappings.`check` is an alias.
The option specification is the same as `init`.

### up

`up` outputs a list of tables or SQL files located upstream from the specified table.

- Use table(`-t`, `--table`) or label(`-l`, `--label`) option to specify tables to search.
- Recursive option(`-r`, `--recursive`) is set, Stairlight will find tables recursively and output as a list.
- Verbose option(`-v`, `--verbose`) is set, Stairlight will add detailed information and output it as a dict.

```txt
$ stairlight up --help
usage: stairlight up [-h] [-c CONFIG] [--save SAVE] [--load LOAD] (-t TABLE | -l LABEL) [-o {table,file}]
                     [-v] [-r]

optional arguments:
  -h, --help            show this help message and exit
  -c CONFIG, --config CONFIG
                        set Stairlight configuration directory
  -q, --quiet           keep silence
  --save SAVE           A file path where map results will be saved.
                        You can choose from local file system, GCS, S3.
  --load LOAD           A file path where map results are saved.
                        You can choose from local file system, GCS, S3.
                        It can be specified multiple times.
  -t TABLE, --table TABLE
                        table names that Stairlight searches for, can be specified
                        multiple times. e.g. -t PROJECT_a.DATASET_b.TABLE_c -t
                        PROJECT_d.DATASET_e.TABLE_f
  -l LABEL, --label LABEL
                        labels set for the table in mapping configuration, can be specified multiple times.
                        The separator between key and value should be a colon(:).
                        e.g. -l key_1:value_1 -l key_2:value_2
  -o {table,file}, --output {table,file}
                        output type
  -v, --verbose         return verbose results
  -r, --recursive       search recursively
```

### down

`down` outputs a list of tables or SQL files located downstream from the specified table.
The option specification is the same as `up`.

## Use as a library

Stairlight can also be used as a library.

[tosh2230/stairlight-app](https://github.com/tosh2230/stairlight-app) is a sample web application rendering table dependency graph with Stairlight, using Graphviz, Streamlit and Google Cloud Run.

