dbt-spark: [CT-202] Workaround for some limitations due to `list_relations_without_caching` method
Describe the feature
I am currently facing an issue using dbt with Spark in an AWS/Glue/EMR environment, as already discussed in https://github.com/dbt-labs/dbt-spark/issues/215 (and originally raised in https://github.com/dbt-labs/dbt-spark/issues/93).
The current issue is about the adapter's `list_relations_without_caching` method and its implementation in the Spark adapter.
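For reference, the macro behind it looks roughly like this (paraphrased from the dbt-spark adapter macros; the exact code may differ slightly between versions):

```sql
{% macro spark__list_relations_without_caching(relation) %}
  {# Ask Spark for extended metadata on every relation in the schema #}
  {% call statement('list_relations_without_caching', fetch_result=True) -%}
    show table extended in {{ relation }} like '*'
  {% endcall %}

  {# Return the result table so dbt can populate its relation cache #}
  {% do return(load_result('list_relations_without_caching').table) %}
{% endmacro %}
```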
In this case you can see that the command `show table extended in {{ relation }} like '*'` is executed. It forces Spark to go through all the table info in the schema (as Spark has no Information Schema layer) in a sort of "discovery mode", and this approach produces two main issues:
- Bad performance: some environments can have hundreds or even thousands of tables in the same schema, generated not only by dbt but also by other processes. In that case this operation can be very costly, especially when different dbt processes run updates on a few tables at different times.
- Instability, as I verified in an AWS/Glue/EMR environment: you can have views without an "S3 Location" defined (like an Athena/Presto view), which will crash a dbt process running SparkSQL on EMR with errors like:
show table extended in my_schema like '*'' with 62bb6394-b13a-4b79-92dd-2e0918831cf3
21/09/18 13:00:22 INFO SparkExecuteStatementOperation: Running query with 62bb6394-b13a-4b79-92dd-2e0918831cf3
21/09/18 13:01:03 INFO DAGScheduler: Asked to cancel job group 62bb6394-b13a-4b79-92dd-2e0918831cf3
21/09/18 13:01:03 ERROR SparkExecuteStatementOperation: Error executing query with 62bb6394-b13a-4b79-92dd-2e0918831cf3, currentState RUNNING,
org.apache.spark.sql.AnalysisException: java.lang.IllegalArgumentException: Can not create a Path from an empty string
Describe alternatives you’ve considered
I do not see why a dbt process should care about the "rest of the world", like the Athena views mentioned above or tables created by other processes in the same schema.
So, ideally, I would think of replacing the query:
show table extended in <schema> like '*'
with something like:
show table extended in <schema> like ('<table1>|<table2>|…')
where my <table1>, <table2>, etc. are determined automatically when I run a command like
dbt run --models my_folder
where my_folder contains the files: table1.sql, table2.sql, etc
but from the current method interface, only the schema parameter can be passed.
Two questions here:
1. How can I automatically infer the names of the tables involved when a command like `dbt run --models my_folder` runs?
2. How can I then pass them to `list_relations_without_caching`?
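To make the idea concrete, here is a hedged sketch of what such a macro could look like, assuming the adapter macro could receive an optional list of identifiers (the `relation_names` parameter is hypothetical and does not exist today):

```sql
{% macro spark__list_relations_without_caching(schema_relation, relation_names=none) %}
  {# 'relation_names' is a hypothetical parameter: the list of table identifiers selected for this run #}
  {% call statement('list_relations_without_caching', fetch_result=True) -%}
    {% if relation_names %}
      {# Only enumerate the selected relations, e.g. like 'table1|table2|table3' #}
      show table extended in {{ schema_relation }} like '{{ relation_names | join("|") }}'
    {% else %}
      {# Fall back to the current discovery-mode behaviour #}
      show table extended in {{ schema_relation }} like '*'
    {% endif %}
  {%- endcall %}
  {% do return(load_result('list_relations_without_caching').table) %}
{% endmacro %}
```

How dbt would populate `relation_names` from a `--models` selection is exactly the open question above.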
Additional context
I found this relevant for Spark on an AWS environment, but it can potentially be a similar issue for other implementations.
Who will this benefit?
On dbt's Slack channel I talked to another user "affected" by a similar issue, but probably anyone who uses Spark in a distributed environment (AWS or not) can be affected by this.
Are you interested in contributing this feature?
Sure, both coding and testing.
@jtcohen6 Thank you for the quick adjustments and 2 branch options!
Sorry for the delay, testing took longer than I thought.
I was able to complete testing; below are my findings and some related context around the testing. Additional Context & Final Thoughts follow the Testing/Findings.
TL;DR
Testing/Findings
dbt-spark v1.0.0 Code
- Uses `show table extended in <schema> like '*'`
- `dbt docs generate` can suffer from this single call, especially in the schema/project cases above
- `show table extended in <schema> like '*'` can also be quite lengthy inside schemas that are quite large and/or contain many objects not related to the project/model docs are being generated for/on
- Ran `show table extended in <schema> like '*'` from the Spark cluster as well as the dbt command line, for both dbt compile and dbt run

dbt-spark wip-faster-caching-option1
- Uses `show table extended in <schema> like '<relation object|relation object|relation object|....>'`, or as I will refer to it, `show table extended in <schema> like <list>` (rendered example below)
- Tested `like <list>` enumeration
- `like <list>` "limit" testing: tested with 1,578 models; the `like <list>` was too large
- Ran `show table extended in <schema> like <list>` from the Spark cluster as well as the dbt command line, for both dbt compile and dbt run
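For illustration, a rendered option-1 query for three selected models might look like the following (schema and table names are made up):

```sql
-- hypothetical rendered form of the option-1 cache query for three selected models
show table extended in my_schema like 'table1|table2|table3'
```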
dbt-spark wip-faster-caching-option2
- Uses `show tables in <schema>` and `show views in <schema>`, followed by `describe extended <objects>` for any "refs" inside the model.sql (sketched below)
- `show tables` & `show views in <schema>` is lightning fast
- `describe extended <object>` is quick as well, though these commands run 1 by 1 compared to `like <list>`|`'*'`
- `describe extended <object>` doesn't provide whether the column is nullable or not
- Compared results between `show table extended in <schema> like <list>`|`'*'` and `describe extended <object>`; my `dbt_project.yml` absolutely has `+file_format: delta` set though, maybe that's entirely why this passed my testing
- Ran `show tables in <schema>` & `show views in <schema>` from the Spark cluster as well as the dbt command line, for both dbt compile and dbt run
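To make the option-2 flow concrete, this is a hedged sketch of the statements it issues per schema (object names are illustrative; the actual branch may structure the calls differently):

```sql
-- 1. Enumerate relations cheaply (fast even in very large schemas)
show tables in my_schema;
show views in my_schema;

-- 2. Then fetch details one by one, only for the relations dbt actually needs
--    (e.g. refs inside a model.sql)
describe extended my_schema.table1;
describe extended my_schema.table2;
```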
Additional Context
How we are executing DBT Projects/Models at Slickdeals
I understand this might not be the recommended or normal approach. Below is an example screenshot of our implementation, with notes following.
- We execute `dbt run --select model.sql`, where `model.sql` is every individual model in the graph
- This runs `show table extended in <schema> like '*'` for every model (nearly 900+ times currently)

I totally understand as well that our implementation might not be normal or recommended and is part, if not all, of the reason we are experiencing problems. Of course, none of these are your problems, but I greatly appreciate the work being put into making caching faster overall.
Reasons why we went the above Implementation route
Final Thoughts
- Option 1 replaces `show table extended in <schema> like '*'` with `show table extended in <schema> like <list>`
- Option 2 replaces `like '*'` with `show tables in <schema>` and `show views in <schema>`, followed by `describe extended <objects>` for any "refs" inside the model.sql; this is close to the `describe extended table` approach indicated in your last paragraph here: https://github.com/dbt-labs/dbt-spark/issues/295#issue-1149338024
- `is_delta` seems to need a `describe extended <object>` to get these results, but could those then be cached? Otherwise dbt would have to run `describe extended <object>` again
- I don't want you to have to finagle support for both Option1 and Option2, but with Option 2 the per-object `describe extended` call must be accepted, and that might be a sort of return back to the slower approach that you mention in your last paragraph here: https://github.com/dbt-labs/dbt-spark/issues/295#issue-1149338024

Again, thank you very much for the two options, and to everyone: if you got this far, kudos!
We have the same issue, but our problem is not focused on performance but rather on the loss of information. We are using Iceberg's upsert functionality and, when this error occurs, dbt assumes that the models do not exist and, instead of doing an upsert, it executes a drop and create, resulting in the loss of the previous information. Has anyone experienced the same problem as us? @jtcohen6 do you have any thoughts on this?
Is there any progress on a fix? The solution jtcohen6 suggested really benefits our use case… Are there any estimates?
Very interested in this fix – @jtcohen6 @TalkWIthKeyboard do you think there’s anything I could do to help with #433, or do you think it’s best to let y’all handle it?
We are also facing the same issue with Iceberg. Ideally it should explicitly fail if the `show tables` command fails, instead of retrying. The failure is an intermittent one and gets resolved on rerun. However, there is an explicit loss of data because of the `create or replace` command. We are facing this issue in dbt-spark 1.6.0.
I’m working with a dbt Cloud customer who have an existing, long-lived Spark cluster and are keen to try dbt Core/Cloud with it. They have 10,000+ objects in Spark and are experiencing this issue. They are following along to see its resolution!
Yes, I think that’s right. Let’s aim to include this change in a v1.1 beta, and ask folks to test it out in their own projects to ensure it yields the desired speedups.
cc @VersusFacit: This is another cache-related performance question, similar to https://github.com/dbt-labs/dbt-snowflake/issues/83, where benchmarking (and anticipating edge cases) will be the name of the game.
The change proposed in this issue will significantly help in cases where dbt manages some objects, in large databases that also contain thousands of non-dbt objects. If we want to speed up cases where dbt manages thousands of objects, and a given invocation is only selecting a handful to run, we’ll need to tackle the larger question of selection-based caching: https://github.com/dbt-labs/dbt-core/issues/4688
Then the path forward seems to be the implementation of your original idea: to generate the `like 'table_one|table_two|...'` from DBT's project information, is that right?
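For what it's worth, a hedged sketch of how that pattern might be generated from project information, e.g. via dbt's `graph` Jinja context (the `schema_relation` argument is an assumption, and whether the graph/manifest is available at caching time is precisely what would need to be worked out):

```sql
{# Hypothetical: collect model identifiers for the target schema from the graph context #}
{% set model_names = [] %}
{% for node in graph.nodes.values()
   if node.resource_type == 'model' and node.schema == schema_relation.schema %}
  {% do model_names.append(node.alias or node.name) %}
{% endfor %}

show table extended in {{ schema_relation }} like '{{ model_names | join("|") }}'
```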