Commit 486d97e

author
Thomas
committed
Add SSL Databricks-Connect blog post.
1 parent e343a2a commit 486d97e

1 file changed

+43
-61
lines changed

@@ -1,86 +1,68 @@
11
---
22
title: SSL Issues with Databricks Connect
3-
date: 2024-05-15 10:00:00 +0100
4-
categories: [Databricks]
5-
tags: [databricks, spark, databricks-connect, spark-connect, ssl, ]
3+
date: 2024-05-15 17:00:00 +0100
4+
categories: [Databricks, Databricks-Connect, Spark-Connect]
5+
tags: [databricks, spark, databricks-connect, spark-connect, ssl, gRPC, certificates, windows, python, pip, requests, certifi, pip-system-certs, python-certifi-win32]
66
---
77

8-
## The Issue
8+
## Issue
99

10-
If you are trying to access workspace files or files located in a repository folder (as described [here](https://learn.microsoft.com/en-us/azure/databricks/files/workspace-interact)) from a shared cluster in Databricks, you might run into the following error:
10+
If you are running Databricks-Connect locally on a Windows machine in a (company) network that uses custom certificates (root and intermediate), you might run into the following (or a similar) error when executing your code:
1111

12-
`java.lang.SecurityException: User does not have permission SELECT on any file.`
13-
14-
The underlying problem has to do with the [restrictions](https://docs.databricks.com/en/clusters/configure.html#shared-access-mode-limitations) you are facing when using shared clusters in Databricks.
12+
```text
13+
Handshake failed with fatal error SSL ERROR SSL: SSL routines :OPENSSL internal:CERTIFICATE VERIFY FAILED.
14+
```
1515

16-
If you want to access Unity Catalog with a cluster the only two options regarding `Access Mode` currently are `Shared` or `Single User` though. If the latter is out of the question as it often happens in projects that i am participating in, then you will not be able to access workspace files from the only cluster option that is left.
16+
The root cause is that Databricks-Connect uses [gRPC](https://grpc.io/), which (in its current implementation) does not know about any custom/company certificates, even if they are correctly rolled out to your local Windows cert-store.
1717

18-
## The Solution
18+
The problem is very similar to the SSL issues you run into when using pip, requests, or any other module that relies on [certifi](https://pypi.org/project/certifi/). Numerous posts online cover the requests/certifi/pip topic, for example [this one](https://stackoverflow.com/questions/51390968/python-ssl-certificate-verify-error#:~:text=34-,Update,-python%2Dcertifi%2Dwin32). The usual solutions involve installing [pip-system-certs](https://pypi.org/project/pip-system-certs/) or the (now deprecated) [python-certifi-win32](https://pypi.org/project/python-certifi-win32/). These fixes, however, do not solve the gRPC issue in Databricks-Connect.
1919

20-
One way of dealing with this problem is to use [Databricks Python SDK](https://databricks-sdk-py.readthedocs.io/en/latest/).
20+
## Solution
2121

22-
### (1) Install the SDK on your Cluster
22+
### 1. Create a `.pem` file containing company-specific certificates
2323

24-
First you need to install the SDK on your current cluster. There are many ways to install libraries on your cluster, the easiest one is to add this line of code in a notebook cell of its own:
24+
Assuming you have all the relevant company-specific certificates rolled out to your local Windows cert-store[^1], you can use this small Python script to extract them:
2525

2626
```python
27-
%pip install databricks-sdk --upgrade
27+
import ssl
28+
29+
context = ssl.create_default_context()
30+
der_certs = context.get_ca_certs(binary_form=True)
31+
pem_certs = [ssl.DER_cert_to_PEM_cert(der) for der in der_certs]
32+
33+
with open('wincacerts.pem', 'w') as outfile:
34+
    for pem in pem_certs:
35+
        outfile.write(pem + '\n')
2836
```
2937

30-
### (2) Initialize, Connect & Authenticate
38+
This creates a file (`wincacerts.pem`) containing all the certificates that currently reside in your Windows cert-store.
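
Before going further, it can help to confirm that the export actually produced certificates. The following is a minimal sketch using only the standard library; the helper name `count_pem_certs` is hypothetical and not part of any library. It counts the certificate blocks in a PEM file and checks that each block can be decoded. It does not validate the certificates themselves.

```python
import ssl
from pathlib import Path

PEM_HEADER = "-----BEGIN CERTIFICATE-----"
PEM_FOOTER = "-----END CERTIFICATE-----"


def count_pem_certs(pem_path):
    """Count certificate blocks in a PEM file (hypothetical helper).

    Each block is passed through ssl.PEM_cert_to_DER_cert, which raises
    ValueError if the block cannot be decoded.
    """
    text = Path(pem_path).read_text(encoding="utf-8")
    count = 0
    # Everything before the first header (e.g. comments) is ignored.
    for chunk in text.split(PEM_HEADER)[1:]:
        body, footer, _ = chunk.partition(PEM_FOOTER)
        if not footer:
            raise ValueError("certificate block without END marker")
        ssl.PEM_cert_to_DER_cert(PEM_HEADER + body + PEM_FOOTER)
        count += 1
    return count
```

Calling `count_pem_certs('wincacerts.pem')` should return a number greater than zero. If it returns `0`, the Windows cert-store was probably not read as expected.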
3139

32-
Then you need to connect to your workspace and authenticate. In my example i am assuming that you have a PAT stored in a secret scope called `keyvault` and the secret name is `databrickspat`. You can also provide a token in the code, but that is __never__ recommended.
40+
I suggest adding these certificates to the standard certificates that are shipped with [certifi](https://pypi.org/project/certifi)[^2]. To find those, use:
3341

34-
```python
35-
from databricks.sdk import WorkspaceClient
42+
```py
43+
import certifi
44+
45+
print(certifi.where())
3646

37-
w = WorkspaceClient(
38-
host = spark.conf.get("spark.databricks.workspaceUrl"),
39-
token = dbutils.secrets.get(scope="keyvault", key="databrickspat")
40-
)
47+
>>> 'd:\repos\XXX\.venv\lib\site-packages\certifi\cacert.pem'
4148
```
4249

43-
### (3) Interact with Workspace Files
4450

45-
```python
46-
# List DBFS with dbutils
47-
w.dbutils.fs.ls("/")
48-
49-
# ...or with 'dbfs'
50-
for file_ in w.dbfs.list("/"):
51-
print(file_)
52-
53-
# List workspace files of a user
54-
for file_ in w.workspace.list("/Users/<username>", recursive=True):
55-
print(file_.path)
56-
57-
# List repository files of a user
58-
for file_ in w.workspace.list("/Repos/<username>/<repo>"):
59-
print(file_.path)
60-
61-
# Get contents of a yaml file stored in Repos/...
62-
for line in w.workspace.download(path="/Repos/<username>/<repo>/.../catalog.yml"):
63-
print(line.decode("UTF-8").replace("\n", ""))
64-
65-
# Upload a (text) file to user repository folder
66-
import base64
67-
from databricks.sdk.service import workspace
68-
69-
path = "/Repos/<username>/<repo>/.../test.yml"
70-
w.workspace.import_(content=base64.b64encode(("This is the file's content").encode()).decode(),
71-
format=workspace.ImportFormat.AUTO,
72-
overwrite=True,
73-
path=path
74-
)
75-
```
7651

77-
### Some Example Screenshots
52+
Open the file at the returned location (`cacert.pem`) and the file created before (`wincacerts.pem`) in any text editor, then append the contents of `cacert.pem` to `wincacerts.pem` via simple copy/paste. The order of certificates within the file doesn't matter. Afterwards, save `wincacerts.pem` and move it to any location in your Windows user home, for example `C:\Users\<username>\certs\wincacerts.pem`.
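
If you prefer not to copy/paste by hand, the same merge can be scripted. This is a minimal sketch, not the method used in this post: the helper name `merge_pem_files` is hypothetical, and it simply concatenates the given PEM files into one bundle without removing duplicates.

```python
from pathlib import Path


def merge_pem_files(source_paths, target_path):
    """Concatenate several PEM files into one bundle (hypothetical helper)."""
    parts = []
    for path in source_paths:
        text = Path(path).read_text(encoding="utf-8")
        # Make sure each file ends with a newline so blocks do not run together.
        parts.append(text if text.endswith("\n") else text + "\n")

    target = Path(target_path)
    # Create the target folder (e.g. <home>\certs) if it does not exist yet.
    target.parent.mkdir(parents=True, exist_ok=True)
    target.write_text("".join(parts), encoding="utf-8")
    return target
```

Assuming certifi is installed, a call could look like `merge_pem_files(['wincacerts.pem', certifi.where()], Path.home() / 'certs' / 'wincacerts.pem')`.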
53+
54+
### 2. Refer to custom certificates via environment variable(s)
7855

79-
List files in DBFS:
80-
![List DBFS](/assets/img/dbfs.png)
56+
Now we just need to tell `gRPC` where our custom certificate file is located. The easiest approach is to use the environment variable `GRPC_DEFAULT_SSL_ROOTS_FILE_PATH`. You can set it either in VS Code (for a specific project/terminal) or, the method I prefer, in your user's Windows environment variables[^3]:
57+
58+
<img src="../assets/img/ssl_grpc_env.png" width="65%"/><br/>
59+
_Hint: After setting new environment variables, you need to restart all open terminals/shells for the changes to take effect._
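
As a per-process alternative, the variable can also be set from Python itself. The sketch below rests on an assumption: that gRPC only reads `GRPC_DEFAULT_SSL_ROOTS_FILE_PATH` when it first loads its root certificates. To be safe, call it before importing Databricks-Connect and before creating a Spark session. The helper name `use_custom_grpc_roots` is hypothetical.

```python
import os
from pathlib import Path

GRPC_ROOTS_ENV = "GRPC_DEFAULT_SSL_ROOTS_FILE_PATH"


def use_custom_grpc_roots(pem_path):
    """Point gRPC at a custom PEM bundle for this process only (hypothetical helper)."""
    path = Path(pem_path).expanduser().resolve()
    if not path.is_file():
        # Fail early instead of getting an opaque handshake error later.
        raise FileNotFoundError(f"certificate bundle not found: {path}")
    os.environ[GRPC_ROOTS_ENV] = str(path)
    return str(path)
```

For example, `use_custom_grpc_roots('~/certs/wincacerts.pem')` at the very top of your script. This only affects the current Python process and its child processes, so the Windows-wide setting above remains the more convenient option.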
60+
61+
---
62+
:tada: **And that's about it!** :tada:
8163

82-
List files in User/Workspace:
83-
![List User Files](/assets/img/user.png)
64+
Next time you run your Databricks-Connect code, the SSL handshake error should be gone.
8465

85-
Show contents of a yaml file saved in user's repository folder:
86-
![Download File](/assets/img/yaml.png)
66+
[^1]: You can check your Windows cert-store via `Win+R -> certmgr.msc`.
67+
[^2]: In case you are missing [certifi](https://pypi.org/project/certifi) in your Python environment, run `pip install certifi` first.
68+
[^3]: You can set environment variables in Windows via `Win+R -> sysdm.cpl -> Advanced -> Environment Variables...`.
