Metadata-Version: 2.1
Name: databricks-connect
Version: 13.1.0b1
Summary: Databricks Connect Client
Author: Databricks
Author-email: feedback@databricks.com
License: Databricks Proprietary License
Classifier: Development Status :: 5 - Production/Stable
Classifier: License :: Other/Proprietary License
Classifier: Programming Language :: Python :: 3.10
Classifier: Programming Language :: Python :: 3.11
Classifier: Programming Language :: Python :: Implementation :: CPython
Classifier: Programming Language :: Python :: Implementation :: PyPy
Provides: pyspark
Obsoletes: pyspark
Requires-Python: >=3.9
Description-Content-Type: text/markdown
Requires-Dist: py4j (==0.10.9.7)
Requires-Dist: six
Requires-Dist: pandas (>=1.0.5)
Requires-Dist: pyarrow (>=1.0.0)
Requires-Dist: grpcio (>=1.48.1)
Requires-Dist: grpcio-status (>=1.48.1)
Requires-Dist: googleapis-common-protos (>=1.56.4)
Requires-Dist: numpy (>=1.15)
Requires-Dist: databricks-sdk (>=0.1.2)
Provides-Extra: connect
Requires-Dist: pandas (>=1.0.5) ; extra == 'connect'
Requires-Dist: pyarrow (>=1.0.0) ; extra == 'connect'
Requires-Dist: grpcio (>=1.48.1) ; extra == 'connect'
Requires-Dist: grpcio-status (>=1.48.1) ; extra == 'connect'
Requires-Dist: googleapis-common-protos (>=1.56.4) ; extra == 'connect'
Requires-Dist: numpy (>=1.15) ; extra == 'connect'
Provides-Extra: ml
Requires-Dist: numpy (>=1.15) ; extra == 'ml'
Provides-Extra: mllib
Requires-Dist: numpy (>=1.15) ; extra == 'mllib'
Provides-Extra: pandas_on_spark
Requires-Dist: pandas (>=1.0.5) ; extra == 'pandas_on_spark'
Requires-Dist: pyarrow (>=1.0.0) ; extra == 'pandas_on_spark'
Requires-Dist: numpy (>=1.15) ; extra == 'pandas_on_spark'
Provides-Extra: sql
Requires-Dist: pandas (>=1.0.5) ; extra == 'sql'
Requires-Dist: pyarrow (>=1.0.0) ; extra == 'sql'
Requires-Dist: numpy (>=1.15) ; extra == 'sql'

# Databricks Connect

Databricks Connect is a Python library to run PySpark DataFrame queries on a remote Spark cluster.
Databricks Connect leverages the power of [Spark Connect].
An application using Databricks Connect runs locally, and when the results of a DataFrame query
need to be evaluated, the query is run on a configured Databricks cluster.

The following simple Python program uses Databricks Connect to print a number range.
The range query is executed on the Databricks cluster.

```python
from databricks.connect import DatabricksSession

session = DatabricksSession.builder.getOrCreate()

df = session.range(1, 10)
df.show()
```

## Specifying Connection Parameters

`DatabricksSession` offers a few ways to specify the Databricks workspace, cluster and user
credentials, collectively referred to in the rest of this document as connection parameters.
The specified credentials are used to execute the DataFrame queries on the cluster. The user who
owns these credentials must have cluster access permissions and appropriate data access permissions.

*NOTE:* Currently, Databricks Connect only supports credentials based on [Personal Access
Token](https://docs.databricks.com/administration-guide/access-control/tokens.html). Other
authentication mechanisms are coming soon.

When `DatabricksSession` is initialized with no additional parameters as below, connection
parameters are picked up from the environment.

```python
session = DatabricksSession.builder.getOrCreate()
```

First, the `SPARK_REMOTE` environment variable is used if it is configured.

If configured, the `SPARK_REMOTE` environment variable must contain a Spark Connect connection
string. Read more about the Spark Connect [connection string].

```sh
SPARK_REMOTE="sc://<databricks workspace url>:443/;token=<bearer token>;x-databricks-cluster-id=<cluster id>"
```
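As a minimal sketch of how such a connection string is put together, the helper below assembles one from its three parts, following the format shown above. The function name `build_spark_remote` and its parameters are hypothetical and are not part of Databricks Connect.

```python
def build_spark_remote(host: str, token: str, cluster_id: str, port: int = 443) -> str:
    """Assemble a Spark Connect connection string (hypothetical helper)."""
    # Accept either a bare host name or a full workspace URL.
    hostname = host
    if hostname.startswith("https://"):
        hostname = hostname[len("https://"):]
    hostname = hostname.rstrip("/")
    # Same layout as the SPARK_REMOTE example: sc://<host>:<port>/;key=value;key=value
    return (
        f"sc://{hostname}:{port}/"
        f";token={token}"
        f";x-databricks-cluster-id={cluster_id}"
    )
```

The returned string could then be exported as `SPARK_REMOTE` before the application starts.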

If this environment variable is not configured, Databricks Connect looks for connection
parameters using the [Databricks SDK].

The Databricks Python SDK reads these values from two locations. It first checks environment
variables. For any parameter not configured via an environment variable, it falls back to the
'DEFAULT' profile, if set up, in the configuration file `.databrickscfg`. The Databricks Python SDK
also handles OAuth token refreshing and supports Service Principal client credentials on AWS and
Azure. Details on the authentication process, environment variables, and other configuration
options can be found in the [Databricks SDK] documentation.

> Similar to the authentication environment variables, the Databricks SDK reads the cluster
> identifier from the environment variable `DATABRICKS_CLUSTER_ID` or from the `cluster_id` entry
> in the config file.
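When relying on environment variables alone, it can help to check up front which ones are missing. The sketch below does that. `DATABRICKS_CLUSTER_ID` is named above, while `DATABRICKS_HOST` and `DATABRICKS_TOKEN` are assumed here to be the variables used for personal access token authentication. The helper `missing_connection_env` is hypothetical, and it does not look at `.databrickscfg`, so a reported variable may still be supplied by a profile.

```python
import os

# DATABRICKS_CLUSTER_ID comes from the note above; the other two names are
# assumptions for personal access token authentication.
REQUIRED_VARS = ("DATABRICKS_HOST", "DATABRICKS_TOKEN", "DATABRICKS_CLUSTER_ID")


def missing_connection_env(env=None):
    """Return the required variable names that are unset or empty (hypothetical helper)."""
    env = os.environ if env is None else env
    return [name for name in REQUIRED_VARS if not env.get(name)]
```

An empty list means all three variables are present in the environment.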

When the defaults should not be used, the Databricks Connect session can be initialized explicitly
with a `Config` object from the Databricks SDK.
The example below configures `DatabricksSession` to use the `foo-user` profile from the
configuration file.
Read more about profiles in configuration files in the [Databricks SDK] documentation.

```python
from databricks.sdk.core import Config
from databricks.connect import DatabricksSession

config = Config(
    profile="foo-user",
    # ...
)

session = DatabricksSession.builder.sdkConfig(config).getOrCreate()
```

Connection parameters can also be specified directly in code.

```python
session = DatabricksSession.builder.remote(
    host="<databricks workspace url>",
    cluster_id="<databricks cluster id>",
    token="<bearer token>"
).getOrCreate()
```

The Spark Connect [connection string] can also be specified directly in code.

```python
session = DatabricksSession.builder\
    .remote("sc://<databricks workspace url>:443/;token=<bearer token>;x-databricks-cluster-id=<cluster id>")\
    .getOrCreate()
```

In summary, connection parameters are collected in the following order. Evaluation stops as soon
as all connection parameters are available.
1. Specified directly using `remote()`, either as a connection string or as keyword arguments.
2. Specified via the Databricks SDK using `sdkConfig()`.
3. Specified in the `SPARK_REMOTE` environment variable.
4. Specified via the [Databricks SDK]'s default authentication.
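The order above can be illustrated with a small stand-alone sketch. This is a simplified model and not the library's implementation: it picks the first source that is present, whereas Databricks Connect keeps collecting until all connection parameters are available. The function name `pick_connection_source` and its arguments are hypothetical.

```python
def pick_connection_source(remote=None, sdk_config=None, env=None):
    """Name the first available source, in the documented order (illustration only)."""
    env = {} if env is None else env
    if remote is not None:
        # 1. A connection string or keyword arguments passed to remote().
        return "remote()"
    if sdk_config is not None:
        # 2. A Config object passed to sdkConfig().
        return "sdkConfig()"
    if env.get("SPARK_REMOTE"):
        # 3. The SPARK_REMOTE environment variable.
        return "SPARK_REMOTE"
    # 4. Everything else is left to the Databricks SDK defaults.
    return "Databricks SDK default authentication"
```

For example, when both a `remote()` argument and `SPARK_REMOTE` are present, this model reports `remote()` as the source.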

### OAuth

The Databricks Connect module, via the Databricks SDK, supports OAuth authentication.
This can be configured via configuration profiles in the `.databrickscfg` file.
See [TBD: link here] on how to set up and use configuration profiles.

The following configuration profile snippet sets up OAuth integration via the [Azure CLI], and
should be added to the `.databrickscfg` file.

```text
[azure-cli]
host = https://adb-XXX.azuredatabricks.net
auth_type = azure-cli
cluster_id = <databricks cluster id>
```
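Because a profile is plain INI text, it can be sanity-checked with the Python standard library before it is used. The sketch below parses a profile like the one above and returns its entries. The helper `read_profile` and the sample cluster id are hypothetical, and the Databricks SDK performs its own parsing, which may differ in details.

```python
import configparser

# Sample profile text in the same shape as the snippet above; values are placeholders.
SAMPLE_CFG = """
[azure-cli]
host = https://adb-XXX.azuredatabricks.net
auth_type = azure-cli
cluster_id = 0123-456789-abcdefgh
"""


def read_profile(cfg_text: str, profile: str) -> dict:
    """Return the key/value entries of one profile (hypothetical helper)."""
    # Disable interpolation so characters such as '%' in secrets are kept verbatim.
    parser = configparser.ConfigParser(interpolation=None)
    parser.read_string(cfg_text)
    if not parser.has_section(profile):
        raise KeyError(f"profile {profile!r} not found")
    return dict(parser[profile])
```

Once the profile looks right, it can be selected with `Config(profile="azure-cli")` and passed to `sdkConfig()`, as in the earlier `foo-user` example.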

Similarly, the following snippet sets up OAuth integration via Azure Active Directory (AAD) service
principal.

```text
[azure-aad]
host = https://adb-XXX.azuredatabricks.net
azure_tenant_id = 00000000-0000-0000-0000-000000000001
azure_client_id = 00000000-0000-0000-0000-000000000002
azure_client_secret = s0M3p@$$wrd
cluster_id = YYY
```
[Spark Connect]: https://www.databricks.com/blog/2022/07/07/introducing-spark-connect-the-power-of-apache-spark-everywhere.html
[connection string]: https://github.com/apache/spark/blob/master/connector/connect/docs/client-connection-string.md
[Databricks SDK]: https://docs.databricks.com/dev-tools/sdk-python.html
[Azure CLI]: https://learn.microsoft.com/en-us/cli/azure/authenticate-azure-cli
