Metadata-Version: 2.3
Name: databricks-dbt-factory
Version: 0.0.1
Summary: Databricks dbt factory library for creating Databricks Job definition where individual models are run as separate tasks.
Project-URL: Documentation, https://github.com/mwojtyczka/databricks-dbt-factory#readme
Project-URL: Issues, https://github.com/mwojtyczka/databricks-dbt-factory/issues
Project-URL: Source, https://github.com/mwojtyczka/databricks-dbt-factory
Author-email: Marcin Wojtyczka <marcin.wojtyczka@databricks.com>
License: MIT
Keywords: Databricks,dbt
Classifier: Development Status :: 4 - Beta
Classifier: Environment :: Console
Classifier: Framework :: Pytest
Classifier: Intended Audience :: Developers
Classifier: Intended Audience :: System Administrators
Classifier: Operating System :: MacOS
Classifier: Operating System :: Microsoft :: Windows
Classifier: Programming Language :: Python
Classifier: Programming Language :: Python :: 3.10
Classifier: Programming Language :: Python :: 3.11
Classifier: Programming Language :: Python :: 3.12
Classifier: Programming Language :: Python :: Implementation :: CPython
Classifier: Programming Language :: Python :: Implementation :: PyPy
Classifier: Topic :: Software Development :: Libraries
Classifier: Topic :: Utilities
Requires-Python: >=3.10
Requires-Dist: pyyaml~=6.0.1
Description-Content-Type: text/markdown

Databricks dbt factory
===

Databricks dbt factory is a simple library to generate Databricks Job tasks based on dbt manifest.
The tool can overwrite tasks in the existing Databricks job definition (in-place update, or creating new definition).

[![PyPI - Version](https://img.shields.io/pypi/v/databricks-dbt-factory.svg)](https://pypi.org/project/databricks-dbt-factory)
[![PyPI - Python Version](https://img.shields.io/pypi/pyversions/databricks-dbt-factory.svg)](https://pypi.org/project/databricks-dbt-factory)

-----

**Table of Contents**

- [Installation](#installation)
- [Usage](#usage)
- [Contribution](#contribution)
- [License](#license)

# Motivation

The current integration of dbt with Databricks Workflows treats the entire dbt project as a single execution unit (black box), limiting flexibility and debugging options.

This project breaks down each dbt object (seed/snapshot/model/test) into separate Workflow task offering several key benefits:
* Simplified Troubleshooting: Isolating tasks makes it easier to identify and resolve issues specific to a single model
* Enhanced Logging and Notifications: Provides more detailed logs and precise error alerts, improving debugging efficiency
* Better Retriability: Enables retrying only the failed model tasks, saving time and resources compared to rerunning the entire project
* Seamless Testing: Allows running dbt data tests on tables immediately after a model completes, ensuring faster validation and feedback

### Databricks Workflows run all dbt objects at once:
![before](docs/before.png?)

![dbt_task](docs/dbt_task.png?)

### The tool generates workflows where dbt objects are run as individual Databricks Workflow tasks:
![after](docs/after.png?)

![workflow](docs/workflow.png?)

# Installation

```shell
pip install databricks-dbt-factory
```

# Usage

Update tasks in the existing Databricks job definition and write the results to `job_definition_new.yaml`:
```shell
databricks_dbt_factory  \
  --dbt-manifest-path tests/test_data/manifest.json \
  --input-job-spec-path tests/test_data/job_definition_template.yaml \
  --target-job-spec-path job_definition_new.yaml \
  --source GIT \
  --target dev
```

**Arguments:**
- `--new-job-name` (type: str, optional, default: None): Optional job name. If provided, the existing job name in the job spec is updated.
- `--dbt-manifest-path` (type: str, required): Path to the manifest file.
- `--input-job-spec-path` (type: str, required): Path to the input job spec file.
- `--target-job-spec-path` (type: str, required): Path to the target job spec file.
- `--target` (type: str, required): dbt target to use.
- `--source` (type: str, optional, default: None): Optional project source. If not provided, WORKSPACE will be used.
- `--warehouse_id` (type: str, optional, default: None): Optional SQL Warehouse to run dbt models on.
- `--schema` (type: str, optional, default: None): Optional schema to write to.
- `--catalog` (type: str, optional, default: None): Optional catalog to write to.
- `--profiles-directory` (type: str, optional, default: None): Optional (relative) path to the profiles directory.
- `--project-directory` (type: str, optional, default: None): Optional (relative) path to the project directory.
- `--environment-key` (type: str, optional, default: Default): Optional (relative) key of an environment.
- `--extra-dbt-command-options` (type: str, optional, default: ""): Optional additional dbt command options.
- `--run-tests` (type: bool, optional, default: True): Whether to run data tests after the model. Enabled by default.
- `--dry-run` (type: bool, optional, default: False): Print generated tasks without updating the job spec file. Disabled by default.

You can also check all input arguments by running `databricks_dbt_factory --help`.

Demo of the tool can be found [here](https://github.com/mwojtyczka/dbt-demo).

# Contribution

See contribution guidance [here](CONTRIBUTING.md).

# License

`databricks-dbt-factory` is distributed under the terms of the [MIT](https://spdx.org/licenses/MIT.html) license.
