Loss of UC column-level lineage due to use of 'temporary views' in 'ephemeral' materialization #979

Open
w0ut0 opened this issue Mar 26, 2025 · 5 comments
Labels
bug Something isn't working

w0ut0 commented Mar 26, 2025

Describe the bug

Ephemeral materializations create temporary views. However, when a temporary view (an ephemeral model) sits between A and B, the column-level lineage is lost: it is not recorded in UC's system tables or exposed through the API.
While this might be an issue on the Databricks side, it is easily circumvented on the adapter side by not using temporary views and instead adding the code as a CTE in the model definition at compile time.

Steps To Reproduce

Given a source A and a model B, make a dbt project with A --> X (materialization = ephemeral) --> B.
Databricks does not capture column-level lineage.
Make a transformation A --> B (with the logic of X in B's model definition), and Databricks does capture the column-level lineage.
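
One way to compare the two cases is to query the Unity Catalog lineage system table for B after each run. The sketch below only builds the SQL text. It assumes the `system.access.column_lineage` system table is enabled in the workspace, and `main.analytics.b` is a hypothetical table name standing in for model B.

```python
def column_lineage_query(target_table: str, limit: int = 100) -> str:
    """Build a query listing recorded source -> target column edges for one table."""
    # Accept only a plain three-part name so the value is safe to embed in SQL.
    parts = target_table.split(".")
    if len(parts) != 3 or not all(p.replace("_", "").isalnum() for p in parts):
        raise ValueError(f"expected catalog.schema.table, got {target_table!r}")
    return (
        "SELECT source_table_full_name, source_column_name, "
        "target_column_name, event_time\n"
        "FROM system.access.column_lineage\n"
        f"WHERE target_table_full_name = '{target_table}'\n"
        "ORDER BY event_time DESC\n"
        f"LIMIT {int(limit)}"
    )


# Hypothetical target table for model B.
print(column_lineage_query("main.analytics.b"))
```

If the query returns rows for the direct A --> B variant but none for the A --> X --> B variant, that would match the behavior described above.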

Expected behavior

Ephemeral materializations should (optionally?) be inlined into the definition of the model that uses them, instead of creating temporary views.

System information

The output of dbt --version:

Core:
  - installed: 1.8.7
  - latest:    1.9.3 - Update available!

  Your version of dbt-core is out of date!
  You can find instructions for upgrading here:
  https://docs.getdbt.com/docs/installation

Plugins:
  - databricks: 1.8.7 - Update available!
  - spark:      1.8.0 - Update available!

  At least one plugin is out of date or incompatible with dbt-core.
  You can find instructions for upgrading here:
  https://docs.getdbt.com/docs/installation

The operating system you're using:

DISTRIB_ID=Ubuntu
DISTRIB_RELEASE=22.04
DISTRIB_CODENAME=jammy
DISTRIB_DESCRIPTION="Ubuntu 22.04.5 LTS"

The output of python --version:
Python 3.10.12

@w0ut0 w0ut0 added the bug Something isn't working label Mar 26, 2025
Collaborator

benc-db commented Mar 26, 2025

We don't have an ephemeral materialization in dbt-databricks, and every time I've used an ephemeral materialization in my own pipelines, it has been inserted as a CTE into the models that reference it. Can you provide the evidence that leads you to think otherwise?
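
To illustrate what "inserted as a CTE" means, here is a minimal sketch of the idea. It is not dbt's actual compiler: `inline_ephemeral` is a hypothetical helper, the model SQL is made up, and the sketch does not handle a model that already starts with its own `with` clause. It only mimics the `__dbt__cte__` prefix that dbt-core gives to injected CTEs.

```python
def inline_ephemeral(model_sql: str, ephemeral: dict[str, str]) -> str:
    """Prepend each referenced ephemeral model as a CTE and point its ref at the CTE."""
    ctes = []
    for name, sql in ephemeral.items():
        cte_name = f"__dbt__cte__{name}"
        ref = "{{ ref('" + name + "') }}"
        if ref in model_sql:
            # Replace the Jinja ref with the CTE name and remember the CTE body.
            model_sql = model_sql.replace(ref, cte_name)
            ctes.append(f"{cte_name} as (\n{sql}\n)")
    if not ctes:
        return model_sql
    return "with " + ",\n".join(ctes) + "\n" + model_sql


# Hypothetical models: X is ephemeral, B selects from it.
compiled = inline_ephemeral(
    "select id, amount_usd from {{ ref('x') }}",
    {"x": "select id, amount * 1.1 as amount_usd from a"},
)
print(compiled)
```

The point is that the compiled statement for B is a single query reading from A, with no temporary view created in between.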

Collaborator

benc-db commented Mar 26, 2025

Having said this, it is definitely a concern for the new code in 1.10.0, where we use temporary views in several places.

Collaborator

benc-db commented Mar 26, 2025

Is this a new issue, or one that has existed for a while? I'm thinking the issue is not with ephemeral, but perhaps with table materialization itself?

Collaborator

benc-db commented Mar 26, 2025

Looking at the documented lineage limitations, there are many things that could cause column lineage to not be captured, among them temporary views. But I see conflicting evidence in my own pipeline, where I do have column lineage even for cases where a temporary view was used. One thing that stood out from the limitations (https://docs.databricks.com/aws/en/data-governance/unity-catalog/data-lineage#limitations):

"Complete column-level lineage is not captured by default for MERGE operations.

You can turn on lineage capture for MERGE operations by setting the Spark property spark.databricks.dataLineage.mergeIntoV2Enabled to true. Enabling this flag can slow down query performance, particularly in workloads that involve very wide tables."
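
For anyone who wants to try that flag, a minimal sketch follows. `enable_merge_lineage` is a hypothetical helper, and it assumes that setting the property on the active Spark session is sufficient; it may instead need to be set in the cluster's Spark configuration.

```python
MERGE_LINEAGE_FLAG = "spark.databricks.dataLineage.mergeIntoV2Enabled"


def enable_merge_lineage(spark) -> str:
    """Turn on MERGE lineage capture for this session and return the value read back."""
    # `spark` is expected to be a SparkSession; conf.set/conf.get are its runtime config API.
    spark.conf.set(MERGE_LINEAGE_FLAG, "true")
    return spark.conf.get(MERGE_LINEAGE_FLAG)
```

In a Databricks notebook this would be called as `enable_merge_lineage(spark)` before running the MERGE, keeping in mind the performance warning quoted above for very wide tables.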

Collaborator

benc-db commented Mar 26, 2025

Spoke with the UC team. Found out that they have recently fixed an issue where complex CTEs were causing lineage to be lost, which could very well be what is happening in your case! Unfortunately, such fixes sometimes take a little while to trickle out. I will keep this ticket open, and I appreciate any further information you can provide, but my hope is that Databricks fixes this at the source.


2 participants