Handle concurrency issues with `CREATE OR REPLACE TABLE`
#1011
Comments
Even if Databricks supported this (which we don't do well), dbt does not generally support concurrent modification of the same relation by multiple jobs. Materializations implicitly assume that, after they get information about a relation, nothing will modify it for the duration of the materialization. Imagine one dbt job dropping a table that another job is in the middle of processing.
@benc-db I agree that it doesn't make sense for this to be handled by dbt-databricks; I was asked to open the issue here by Databricks support. Are there any best practices on how to handle this sort of thing? Specifically, we are running into two different Airflow DAGs running the same dbt model at the same time with the exact same definition, but hitting this concurrency issue. Our engineers are familiar with Spark (not so familiar with Databricks), and we had assumed REPLACE on Delta tables was built the same way as a Spark table overwrite (which is ACID). It turns out this is not the case, and we are trying to find a workaround that allows these business units to operate independently of each other.
To clarify, are the two jobs both modifying the table, or are you hitting an issue when one modifies it and the other uses it as a source? If both are modifying, why are multiple business units modifying the same table? I would think that one owns the data and produces the table, and the other reads it as a source. If this is already what you are doing, then yeah, there should be a way to structure things (maybe with serialization settings) such that readers are not broken by writers.
It's two Airflow tasks running different dbt selectors that have overlapping models; both end up executing `CREATE OR REPLACE TABLE` against the same target table.
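A minimal sketch of that situation, with hypothetical DAG, selector, and model names (Airflow 2.x style):

```python
# Two independently scheduled Airflow DAGs that each run a dbt selector
# containing the same shared model, so both can rebuild it at the same time.
from datetime import datetime

from airflow import DAG
from airflow.operators.bash import BashOperator

# DAG owned by business unit A: selector_a includes shared_model
with DAG("unit_a_dbt", start_date=datetime(2024, 1, 1), schedule="@hourly") as dag_a:
    run_a = BashOperator(
        task_id="dbt_run_a",
        bash_command="dbt run --select selector_a",  # selector_a -> shared_model, ...
    )

# DAG owned by business unit B: selector_b also includes shared_model
with DAG("unit_b_dbt", start_date=datetime(2024, 1, 1), schedule="@hourly") as dag_b:
    run_b = BashOperator(
        task_id="dbt_run_b",
        bash_command="dbt run --select selector_b",  # selector_b -> shared_model, ...
    )
```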
Describe the feature
Currently we are hitting frequent errors where `CREATE OR REPLACE TABLE` returns either `TABLE_OR_VIEW_ALREADY_EXISTS` or `TABLE_OR_VIEW_NOT_FOUND`. This is a result of different dbt jobs operating on the same table at the same time. We are not aware of any way of handling this sort of concurrency with dbt-databricks currently.

Describe alternatives you've considered
I have opened the issue with Databricks support, but they requested that I open an issue here as well. Their support engineers recommended either adding a retry when hitting these errors or queuing the operation.
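As a sketch of the retry approach they suggested (not something dbt-databricks provides), assuming the model is built via a shell `dbt run` invocation; the helper name, error-string matching, and backoff values are illustrative:

```python
# Retry a dbt run when it fails with one of the transient concurrency errors
# described above. Helper name and backoff values are hypothetical.
import subprocess
import time

TRANSIENT_ERRORS = ("TABLE_OR_VIEW_ALREADY_EXISTS", "TABLE_OR_VIEW_NOT_FOUND")


def run_dbt_with_retry(select: str, max_attempts: int = 3, backoff_seconds: int = 30) -> None:
    """Run `dbt run --select <select>`, retrying on the concurrency errors."""
    for attempt in range(1, max_attempts + 1):
        result = subprocess.run(
            ["dbt", "run", "--select", select],
            capture_output=True,
            text=True,
        )
        if result.returncode == 0:
            return
        output = result.stdout + result.stderr
        # Only retry when the failure looks like one of the transient errors.
        if attempt < max_attempts and any(err in output for err in TRANSIENT_ERRORS):
            time.sleep(backoff_seconds * attempt)  # simple linear backoff
            continue
        raise RuntimeError(f"dbt run failed after {attempt} attempt(s):\n{output}")


# Usage (hypothetical selector name):
# run_dbt_with_retry("selector_a")
```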
Who will this benefit?
Anyone who has to run different dbt jobs with overlapping models.