
Handle concurrency issues with CREATE OR REPLACE TABLE #1011

Open
ShaneMazur opened this issue May 5, 2025 · 4 comments
Labels
enhancement New feature or request

Comments

@ShaneMazur
Contributor

Describe the feature

Currently we are hitting frequent errors where CREATE OR REPLACE TABLE returns either TABLE_OR_VIEW_ALREADY_EXISTS or TABLE_OR_VIEW_NOT_FOUND. This happens when different dbt jobs operate on the same table at the same time. I'm not aware of any way to handle this sort of concurrency with dbt-databricks currently.

Describe alternatives you've considered

I have opened the issue with Databricks support, but they requested that I open an issue here as well. Their support engineers recommended either retrying when these errors are hit or queuing the operation.
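The retry suggestion from support could be sketched roughly like this (a hypothetical helper, not part of dbt-databricks; it matches the error codes by substring, which is an assumption about how they surface in exception messages):

```python
import time

# Error codes that indicate a transient CREATE OR REPLACE race, per the issue.
TRANSIENT_ERRORS = ("TABLE_OR_VIEW_ALREADY_EXISTS", "TABLE_OR_VIEW_NOT_FOUND")


def run_with_retry(operation, max_attempts=4, base_delay=1.0):
    """Call `operation`, retrying with exponential backoff when it raises
    one of the known concurrency errors. Re-raises anything else, and
    re-raises the last error once attempts are exhausted."""
    for attempt in range(1, max_attempts + 1):
        try:
            return operation()
        except Exception as exc:
            transient = any(code in str(exc) for code in TRANSIENT_ERRORS)
            if not transient or attempt == max_attempts:
                raise
            # Back off before retrying the CREATE OR REPLACE.
            time.sleep(base_delay * 2 ** (attempt - 1))
```

Because both jobs produce the same table definition, a retry that lands after the other job's replace completes would simply recreate an identical table.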

Who will this benefit?

Anyone who has to run different dbt jobs with overlapping models.

@ShaneMazur ShaneMazur added the enhancement New feature or request label May 5, 2025
@benc-db
Collaborator

benc-db commented May 6, 2025

Even if Databricks supported this (which we don't do well), dbt does not generally support concurrent modification of the same relation by multiple jobs. Materializations implicitly assume that, after they gather information about a relation, nothing will modify it for the duration of the materialization. Imagine one dbt job dropping a table that another job is in the middle of processing.

@ShaneMazur
Contributor Author

@benc-db I agree that it doesn't make sense for this to be handled by dbt-databricks; I was asked to open the issue here by Databricks support. Is there a best practice for handling this sort of thing? Specifically, we have two different Airflow DAGs running the same dbt model at the same time with the exact same definition, and they hit this concurrency issue.

Our engineers are familiar with Spark (less so with Databricks), and we had assumed REPLACE on Delta tables was built the same way as a Spark table overwrite (which is ACID). It turns out this is not the case, and we are trying to find a workaround that allows these business units to operate independently of each other.

@benc-db
Collaborator

benc-db commented May 6, 2025

To clarify, are the two jobs both modifying the table, or are you hitting an issue when one modifies it and the other uses it as a source? If both are modifying, why are multiple business units modifying the same table? I would expect one to own the data and produce the table, and the other to read it as a source. If that is already what you are doing, then yeah, there should be a way to structure things (maybe with serialization settings) such that readers are not broken by writers.

@ShaneMazur
Contributor Author

It's two Airflow tasks running different dbt selectors that have overlapping models.

For example:

  1. dbt run --select +customer_success_semantics which may have an upstream model accounts_stage
  2. dbt run --select +finance_metrics_semantics which also has an upstream model accounts_stage

Both jobs end up executing CREATE OR REPLACE TABLE production.stage.accounts_stage AS (...) at the same time with the same definition. Both business units need to guarantee data freshness, which is why each must also run the upstreams of its semantics models.
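One way to apply the "queue the operation" idea at the orchestration layer is to serialize the shared upstream build across processes with an advisory file lock, so the second job blocks until the first job's CREATE OR REPLACE finishes. A minimal sketch, assuming both tasks run on the same host (the lock directory and model naming are illustrative, not a dbt-databricks feature):

```python
import fcntl
import os
from contextlib import contextmanager


@contextmanager
def model_lock(model_name, lock_dir="/tmp/dbt_locks"):
    """Hold an exclusive advisory lock named after the shared model.

    A second process entering this context for the same model blocks
    until the first releases it, serializing the overlapping builds.
    Unix-only (uses fcntl); a shared filesystem or external lock service
    would be needed for tasks on different hosts.
    """
    os.makedirs(lock_dir, exist_ok=True)
    path = os.path.join(lock_dir, f"{model_name}.lock")
    with open(path, "w") as handle:
        fcntl.flock(handle, fcntl.LOCK_EX)  # blocks until the lock is free
        try:
            yield
        finally:
            fcntl.flock(handle, fcntl.LOCK_UN)
```

Each Airflow task would then wrap its build of the shared model, e.g. `with model_lock("accounts_stage"): run_dbt_selector(...)`, where `run_dbt_selector` is whatever mechanism the task already uses to invoke dbt.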
