
Handle concurrency issues with CREATE OR REPLACE TABLE #1011

Open
ShaneMazur opened this issue May 5, 2025 · 4 comments
Labels
enhancement New feature or request

Comments

@ShaneMazur
Contributor

Describe the feature

Currently we are hitting frequent errors where CREATE OR REPLACE TABLE returns either TABLE_OR_VIEW_ALREADY_EXISTS or TABLE_OR_VIEW_NOT_FOUND. This happens when different dbt jobs operate on the same table at the same time. I'm not aware of any way to handle this sort of concurrency with dbt-databricks currently.

Describe alternatives you've considered

I have opened the issue with Databricks support, but they requested that I open an issue here as well. Their support engineers recommended either retrying when these errors are hit or queuing the operation.
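The retry suggestion from support could be sketched roughly like this (a hypothetical helper, not part of dbt-databricks; it matches the error codes by substring, which is an assumption about how they surface in exception messages):

```python
import time

# Error codes that indicate a transient CREATE OR REPLACE race, per the issue.
TRANSIENT_ERRORS = ("TABLE_OR_VIEW_ALREADY_EXISTS", "TABLE_OR_VIEW_NOT_FOUND")


def run_with_retry(operation, max_attempts=4, base_delay=1.0):
    """Call `operation`, retrying with exponential backoff when it raises
    one of the known concurrency errors. Re-raises anything else, and
    re-raises the last error once attempts are exhausted."""
    for attempt in range(1, max_attempts + 1):
        try:
            return operation()
        except Exception as exc:
            transient = any(code in str(exc) for code in TRANSIENT_ERRORS)
            if not transient or attempt == max_attempts:
                raise
            # Back off before retrying the CREATE OR REPLACE.
            time.sleep(base_delay * 2 ** (attempt - 1))
```

Because both jobs produce the same table definition, a retry that lands after the other job's replace completes would simply recreate an identical table.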

Who will this benefit?

Anyone who has to run different dbt jobs with overlapping models.

@ShaneMazur ShaneMazur added the enhancement New feature or request label May 5, 2025
@benc-db
Collaborator

benc-db commented May 6, 2025

Even if Databricks supported this (which we don't do well), dbt does not generally support concurrent modification of the same relation by multiple jobs. Materializations implicitly assume that, after they gather information about a relation, nothing will modify it for the duration of the materialization. Imagine one dbt job dropping a table that another job is in the middle of processing.

@ShaneMazur
Contributor Author

@benc-db I agree that it doesn't make sense for this to be handled by dbt-databricks; I was asked to open the issue here by Databricks support. Is there a best practice for handling this sort of thing? Specifically, we have two different Airflow DAGs running the same dbt model at the same time with the exact same definition, and they hit this concurrency issue.

Our engineers are familiar with Spark (less so with Databricks), and we had assumed REPLACE on Delta tables was built the same way as a Spark table overwrite (which is ACID). It turns out this is not the case, and we are trying to find a workaround that allows these business units to operate independently of each other.

@benc-db
Collaborator

benc-db commented May 6, 2025

To clarify, are the two jobs both modifying the table, or are you hitting an issue when one modifies it and the other uses it as a source? If both are modifying, why are multiple business units modifying the same table? I would expect one to own the data and produce the table, and the other to read it as a source. If that is already what you are doing, then yeah, there should be a way to structure things (maybe with serialization settings) such that readers are not broken by writers.

@ShaneMazur
Contributor Author

It's two Airflow tasks running different dbt selectors that have overlapping models.

For example:

  1. dbt run --select +customer_success_semantics which may have an upstream model accounts_stage
  2. dbt run --select +finance_metrics_semantics which also has an upstream model accounts_stage

Both jobs end up executing CREATE OR REPLACE TABLE production.stage.accounts_stage AS (...) at the same time with the same definition. Both business units need to guarantee data freshness, which is why each must also run the upstreams of its semantics models.
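One way to apply the "queue the operation" idea at the orchestration layer is to serialize the shared upstream build across processes with an advisory file lock, so the second job blocks until the first job's CREATE OR REPLACE finishes. A minimal sketch, assuming both tasks run on the same host (the lock directory and model naming are illustrative, not a dbt-databricks feature):

```python
import fcntl
import os
from contextlib import contextmanager


@contextmanager
def model_lock(model_name, lock_dir="/tmp/dbt_locks"):
    """Hold an exclusive advisory lock named after the shared model.

    A second process entering this context for the same model blocks
    until the first releases it, serializing the overlapping builds.
    Unix-only (uses fcntl); a shared filesystem or external lock service
    would be needed for tasks on different hosts.
    """
    os.makedirs(lock_dir, exist_ok=True)
    path = os.path.join(lock_dir, f"{model_name}.lock")
    with open(path, "w") as handle:
        fcntl.flock(handle, fcntl.LOCK_EX)  # blocks until the lock is free
        try:
            yield
        finally:
            fcntl.flock(handle, fcntl.LOCK_UN)
```

Each Airflow task would then wrap its build of the shared model, e.g. `with model_lock("accounts_stage"): run_dbt_selector(...)`, where `run_dbt_selector` is whatever mechanism the task already uses to invoke dbt.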
