|
1 | 1 | ---
|
2 | 2 | title: SSL Issues with Databricks Connect
|
3 |
| -date: 2024-05-15 10:00:00 +0100 |
4 |
| -categories: [Databricks] |
5 |
| -tags: [databricks, spark, databricks-connect, spark-connect, ssl, ] |
| 3 | +date: 2024-05-15 17:00:00 +0100 |
| 4 | +categories: [Databricks, Databricks-Connect, Spark-Connect] |
| 5 | +tags: [databricks, spark, databricks-connect, spark-connect, ssl, gRPC, certificates, windows, python, pip, requests, certifi, pip-system-certs, python-certifi-win32] |
6 | 6 | ---
|
7 | 7 |
|
8 |
| -## The Issue |
| 8 | +## Issue |
9 | 9 |
|
10 |
| -If you are trying to access workspace files or files located in a repository folder (as described [here](https://learn.microsoft.com/en-us/azure/databricks/files/workspace-interact)) from a shared cluster in Databricks, you might run into the following error: |
| 10 | +If you are running Databricks-Connect locally on a Windows machine in a (company) network that uses custom certificates (root and intermediate), you might run into the following (or a similar) error when executing your code: |
11 | 11 |
|
12 |
| -`java.lang.SecurityException: User does not have permission SELECT on any file.` |
13 |
| - |
14 |
| -The underlying problem has to do with the [restrictions](https://docs.databricks.com/en/clusters/configure.html#shared-access-mode-limitations) you are facing when using shared clusters in Databricks. |
| 12 | +```text |
| 13 | +Handshake failed with fatal error SSL ERROR SSL: SSL routines :OPENSSL internal:CERTIFICATE VERIFY FAILED. |
| 14 | +``` |
15 | 15 |
|
16 |
| -If you want to access Unity Catalog with a cluster the only two options regarding `Access Mode` currently are `Shared` or `Single User` though. If the latter is out of the question as it often happens in projects that i am participating in, then you will not be able to access workspace files from the only cluster option that is left. |
| 16 | +The root cause is that Databricks-Connect uses [gRPC](https://grpc.io/), which (in its current implementation) does not know about any custom/company certificates, even if they are correctly rolled out to your local Windows cert-store. |
17 | 17 |
|
18 |
| -## The Solution |
| 18 | +The problem is very similar to the SSL issues you run into when using pip/requests or any other module that uses [certifi](https://pypi.org/project/certifi/). On the requests/certifi/pip topic you can find numerous posts online, for example [this one](https://stackoverflow.com/questions/51390968/python-ssl-certificate-verify-error#:~:text=34-,Update,-python%2Dcertifi%2Dwin32). The solutions usually involve installing [pip-system-certs](https://pypi.org/project/pip-system-certs/) or the (now deprecated) [python-certifi-win32](https://pypi.org/project/python-certifi-win32/). These fixes, however, do not solve your gRPC issues in Databricks-Connect. |
19 | 19 |
|
20 |
| -One way of dealing with this problem is to use [Databricks Python SDK](https://databricks-sdk-py.readthedocs.io/en/latest/). |
| 20 | +## Solution |
21 | 21 |
|
22 |
| -### (1) Install the SDK on your Cluster |
| 22 | +### 1. Create a `.pem` file containing company-specific certificates |
23 | 23 |
|
24 |
| -First you need to install the SDK on your current cluster. There are many ways to install libraries on your cluster, the easiest one is to add this line of code in a notebook cell of its own: |
| 24 | +Assuming you have all the relevant company-specific certificates rolled out to your local Windows cert-store[^1], you can use this little Python script to extract them: |
25 | 25 |
|
26 | 26 | ```python
|
27 |
| -%pip install databricks-sdk --upgrade |
| 27 | +import ssl |
| 28 | + |
| 29 | +context = ssl.create_default_context() |
| 30 | +der_certs = context.get_ca_certs(binary_form=True) |
| 31 | +pem_certs = [ssl.DER_cert_to_PEM_cert(der) for der in der_certs] |
| 32 | + |
| 33 | +with open('wincacerts.pem', 'w') as outfile: |
| 34 | + for pem in pem_certs: |
| 35 | + outfile.write(pem + '\n') |
28 | 36 | ```
|
29 | 37 |
|
30 |
| -### (2) Initialize, Connect & Authenticate |
| 38 | +This will create a file (`wincacerts.pem`) with all the certificates that currently reside in your Windows cert-store. |
31 | 39 |
|
32 |
| -Then you need to connect to your workspace and authenticate. In my example i am assuming that you have a PAT stored in a secret scope called `keyvault` and the secret name is `databrickspat`. You can also provide a token in the code, but that is __never__ recommended. |
| 40 | +I suggest adding these certificates to the standard certificates that are shipped with [certifi](https://pypi.org/project/certifi)[^2]. To find them, use: |
33 | 41 |
|
34 |
| -```python |
35 |
| -from databricks.sdk import WorkspaceClient |
| 42 | +```py |
| 43 | +import certifi |
| 44 | + |
| 45 | +print(certifi.where()) |
36 | 46 |
|
37 |
| -w = WorkspaceClient( |
38 |
| - host = spark.conf.get("spark.databricks.workspaceUrl"), |
39 |
| - token = dbutils.secrets.get(scope="keyvault", key="databrickspat") |
40 |
| -) |
| 47 | +>>> 'd:\repos\XXX\.venv\lib\site-packages\certifi\cacert.pem' |
41 | 48 | ```
|
42 | 49 |
|
43 |
| -### (3) Interact with Workspace Files |
44 | 50 |
|
45 |
| -```python |
46 |
| -# List DBFS with dbutils |
47 |
| -w.dbutils.fs.ls("/") |
48 |
| - |
49 |
| -# ...or with 'dbfs' |
50 |
| -for file_ in w.dbfs.list("/"): |
51 |
| - print(file_) |
52 |
| - |
53 |
| -# List workspace files of a user |
54 |
| -for file_ in w.workspace.list("/Users/<username>", recursive=True): |
55 |
| - print(file_.path) |
56 |
| - |
57 |
| -# List repository files of a user |
58 |
| -for file_ in w.workspace.list("/Repos/<username>/<repo>"): |
59 |
| - print(file_.path) |
60 |
| - |
61 |
| -# Get contents of a yaml file stored in Repos/... |
62 |
| -for line in w.workspace.download(path="/Repos/<username>/<repo>/.../catalog.yml"): |
63 |
| - print(line.decode("UTF-8").replace("\n", "")) |
64 |
| - |
65 |
| -# Upload a (text) file to user repository folder |
66 |
| -import base64 |
67 |
| -from databricks.sdk.service import workspace |
68 |
| - |
69 |
| -path = "/Repos/<username>/<repo>/.../test.yml" |
70 |
| -w.workspace.import_(content=base64.b64encode(("This is the file's content").encode()).decode(), |
71 |
| - format=workspace.ImportFormat.AUTO, |
72 |
| - overwrite=True, |
73 |
| - path=path |
74 |
| -) |
75 |
| -``` |
76 | 51 |
|
77 |
| -### Some Example Screenshots |
| 52 | +Open the file at the returned location and the file we created before in any text editor, then append the contents of `cacert.pem` to `wincacerts.pem` via simple copy/paste. The order of certificates within this file doesn't matter. Afterwards, save `wincacerts.pem` and move it to any location in your Windows user home, for example `C:\Users\<username>\certs\wincacerts.pem`. |
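
The manual copy/paste step could also be scripted. The following is a minimal sketch and not part of the original post: `merge_pem_files` is a hypothetical helper name, and it assumes that both inputs are plain-text PEM bundles delimited by the usual `BEGIN CERTIFICATE` / `END CERTIFICATE` markers. Comment lines between certificates (as found in certifi's bundle) are dropped, and certificates that appear in both files are written only once.

```python
from pathlib import Path

BEGIN = "-----BEGIN CERTIFICATE-----"
END = "-----END CERTIFICATE-----"


def merge_pem_files(sources, target):
    """Concatenate PEM bundles into one file, skipping duplicate certificates.

    Hypothetical helper for illustration; returns the number of certificates written.
    """
    seen = set()
    merged = []
    for source in sources:
        text = Path(source).read_text(encoding="utf-8")
        # Everything before the first BEGIN marker is a comment or empty.
        for chunk in text.split(BEGIN)[1:]:
            body = chunk.split(END)[0]
            # Compare certificates by their base64 payload, ignoring whitespace.
            key = "".join(body.split())
            if key and key not in seen:
                seen.add(key)
                merged.append(f"{BEGIN}\n{body.strip()}\n{END}\n")
    Path(target).write_text("".join(merged), encoding="utf-8")
    return len(merged)
```

With certifi installed, a call such as `merge_pem_files(["wincacerts.pem", certifi.where()], r"C:\Users\<username>\certs\wincacerts.pem")` should produce the combined file in one step, provided the target folder already exists.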
| 53 | + |
| 54 | +### 2. Refer to custom certificates via environment variable(s) |
78 | 55 |
|
79 |
| -List files in DBFS: |
80 |
| - |
| 56 | +Now we just need to tell `gRPC` where our custom certificate file is located. The easiest approach is to use the environment variable `GRPC_DEFAULT_SSL_ROOTS_FILE_PATH`. You can set that environment variable either in VS Code (for a specific project/terminal) or, the method I prefer, in your user's Windows environment variables[^3]: |
| 57 | + |
| 58 | +<img src="../assets/img/ssl_grpc_env.png" width="65%"/><br/> |
| 59 | +_Hint: After setting new environment variables you need to restart all open terminals/shells for the changes to take effect._ |
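
If changing the Windows settings is not an option, the variable can presumably also be set from within Python. The sketch below is an illustration, not the method described above: `use_custom_grpc_roots` is a hypothetical helper name, and it rests on the assumption that gRPC reads `GRPC_DEFAULT_SSL_ROOTS_FILE_PATH` only when it sets up its first secure channel, so the call has to happen before the Databricks-Connect session is created.

```python
import os
from pathlib import Path


def use_custom_grpc_roots(pem_path):
    """Point gRPC at a custom root certificate bundle for the current process.

    Hypothetical helper for illustration; call it before any gRPC channel
    (and therefore before the Spark session) is created.
    """
    pem_path = Path(pem_path).expanduser()
    # Fail early with a clear message instead of a later SSL handshake error.
    if not pem_path.is_file():
        raise FileNotFoundError(f"Certificate bundle not found: {pem_path}")
    os.environ["GRPC_DEFAULT_SSL_ROOTS_FILE_PATH"] = str(pem_path)
    return str(pem_path)
```

A call like `use_custom_grpc_roots("~/certs/wincacerts.pem")` at the very top of a script only affects that one Python process, which can be handy for testing the certificate file before making the setting permanent.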
| 60 | + |
| 61 | +--- |
| 62 | +:tada: **And that's about it!** :tada: |
81 | 63 |
|
82 |
| -List files in User/Workspace: |
83 |
| - |
| 64 | +Next time you run your Databricks-Connect code, the SSL handshake error should be gone. |
84 | 65 |
|
85 |
| -Show contents of a yaml file saved in user's repository folder: |
86 |
| - |
| 66 | +[^1]: You can check your Windows cert-store via `Win+R -> certmgr.msc`. |
| 67 | +[^2]: In case you are missing [certifi](https://pypi.org/project/certifi) in your Python environment, run `pip install certifi` first. |
| 68 | +[^3]: You can set environment variables in Windows via `Win+R -> sysdm.cpl -> Advanced -> Environment Variables...`. |