dbt-bigquery: [CT-2051] [Bug] New insert_overwrite Bigquery partitioning with integer keys can create huge temporary array variables, exceeding BQ limits
Describe the bug
When doing an incremental update of an integer-partitioned BigQuery table with the new insert_overwrite merge strategy, dbt calculates which partitions it should replace. In the process it can generate a huge BigQuery ARRAY value.
The cause is that dbt does not take the "interval" parameter of the partitioning specification into account. The generated SQL selects array_agg(distinct PARTITION_FIELD) when calculating the partitions to replace. This selects ALL distinct partition-field values of the incremental update, even when those values fall inside the same partition, which can produce a huge array. With enough data in the table this even causes an error, because BigQuery's maximum array size (100 MB) is exceeded.
Note that this bug is not triggered by time partitioned tables because for those all partition fields are dates and the partition size is always one day (i.e. there is only one valid value per partition).
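For illustration, the script dbt generates for a dynamic insert_overwrite run looks roughly like the sketch below (simplified; project, dataset, and the payload column are placeholders). Every distinct partkey value of the new data lands in the array, even when thousands of values map to the same partition:

```sql
-- Simplified sketch of the generated script (placeholder names).
declare dbt_partitions_for_replacement array<int64>;

-- dbt first materializes the new data into a temporary table, then collects
-- every distinct partition-field value from it:
set (dbt_partitions_for_replacement) = (
    select as struct
        array_agg(distinct partkey)   -- one element per distinct VALUE, not per partition
    from `my_project`.`my_dataset`.`my_model__dbt_tmp`
);

merge into `my_project`.`my_dataset`.`my_model` as DBT_INTERNAL_DEST
using `my_project`.`my_dataset`.`my_model__dbt_tmp` as DBT_INTERNAL_SOURCE
on FALSE

when not matched by source
     and DBT_INTERNAL_DEST.partkey in unnest(dbt_partitions_for_replacement)
    then delete

when not matched
    then insert (partkey, payload)
         values (DBT_INTERNAL_SOURCE.partkey, DBT_INTERNAL_SOURCE.payload);
```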
Steps To Reproduce
- Generate a model with the following config (a full model sketch is given after this list):
  config(
      materialized = 'incremental',
      unique_key = 'partkey',
      partition_by = {
          "field": "partkey",
          "data_type": "int64",
          "range": { "start": 0, "end": 100000, "interval": 1000 }
      },
      incremental_strategy = 'insert_overwrite'
  )
- Run the model once with empty source data.
- Add 1000 records to the source data with values for partkey from 0 to 999.
- Generate the incremental update code for the model and look at the values in the dbt_partitions_for_replacement internal variable.
- You will see that it contains all 1000 values for partkey from 0 to 999, even though they all fall inside the same single partition.
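For completeness, a full model along the lines of the first step could look like this (a sketch only; the source reference and the payload column are placeholders, not part of the original report):

```sql
-- models/partkey_model.sql -- minimal reproduction sketch (placeholder names)
{{
    config(
        materialized = 'incremental',
        unique_key = 'partkey',
        partition_by = {
            "field": "partkey",
            "data_type": "int64",
            "range": { "start": 0, "end": 100000, "interval": 1000 }
        },
        incremental_strategy = 'insert_overwrite'
    )
}}

select
    partkey,
    payload
from {{ source('my_source', 'my_table') }}
```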
Expected behavior
dbt should ensure that no huge temporary array variable is generated. The dbt_partitions_for_replacement array should have at most as many elements as the number of partitions being updated. In my opinion the way to go would be to store only the starting value of each partition in the array and then modify the merge clause to use a range for each partition.
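A hedged sketch of what that could look like for the reproduction case above (interval 1000; all names are placeholders, and whether BigQuery still prunes integer-range partitions when a function is applied to the partition column would need to be verified):

```sql
declare dbt_partitions_for_replacement array<int64>;

-- One element per affected PARTITION instead of one per distinct value:
-- div(partkey, 1000) * 1000 maps every key to the start of its 1000-wide range.
set (dbt_partitions_for_replacement) = (
    select as struct
        array_agg(distinct div(partkey, 1000) * 1000)
    from `my_project`.`my_dataset`.`my_model__dbt_tmp`
);

-- The merge's delete condition would then need to match whole ranges rather
-- than single values, e.g. for some array element p:
--   DBT_INTERNAL_DEST.partkey >= p and DBT_INTERNAL_DEST.partkey < p + 1000
-- or, equivalently, by bucketing the destination key the same way:
--   div(DBT_INTERNAL_DEST.partkey, 1000) * 1000 in unnest(dbt_partitions_for_replacement)
```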
Screenshots and log output
N/A
System information
Which database are you using dbt with?
- [ ] postgres
- [ ] redshift
- [x] bigquery
- [ ] snowflake
- [ ] other (specify: ____________)
The output of dbt --version:
installed version: 0.16.0
latest version: 0.16.0
Up to date!
The operating system you’re using:
macOS
The output of python --version:
Python 3.7.4
Additional context
About this issue
- Original URL
- State: open
- Created 4 years ago
- Comments: 26 (12 by maintainers)
I recently refactored many incremental models I maintain to use the insert_overwrite strategy in order to reduce query cost, and this worked out pretty well. However, when dealing with integer-partitioned tables I ran into similar (or identical) limitations to the ones discussed in this issue. The "min_max" strategy discussed here deviates from my approach, so I would like to add something to the discussion. I started by looking at the compiled SQL that dbt had generated.
As already pointed out in this issue, the array in dbt_partitions_for_replacement does not refer directly to the partitions of the table; its elements are merely values that lie within the partitions targeted for an update. Something I found interesting is that, once the array becomes too long, the query cost is as large as if the table had no partitioning at all. I assume that BigQuery's query optimizer prefers speed over cost at a certain point. Anyway, I changed the code to drop the targeted partitions outright and rewrite them, which achieves the cost reduction while keeping the functionality I expected from the insert_overwrite method (a rough sketch of this approach follows this comment).
Note that I had two partitions to replace in my example.
This approach drops the entire partition and writes it again, which is the behavior I assumed from reading the documentation, i.e. the line "[…] because it operates on partitions of data rather than individual rows. […]". Using a replace_partitions flag sounds fine to me as well.
After proposing my view on the topic: @elyobo, @Fleid, is someone working on this issue at the moment? If a change is in the pipeline I would lean back and wait for an update. If not, I could try to follow the process to contribute to dbt myself.
PS: dbt is a great piece of software 😃
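A rough illustration of such a drop-and-rewrite approach, reusing the integer-range example from the issue above (an editorial sketch with placeholder names, not the commenter's original snippets):

```sql
-- Illustration only: replace each affected partition wholesale instead of
-- merging row by row.
declare partitions_to_replace array<int64> default (
    select array_agg(distinct div(partkey, 1000) * 1000)   -- partition start values
    from `my_project`.`my_dataset`.`my_model__dbt_tmp`
);

-- Drop the affected partitions from the target table ...
delete from `my_project`.`my_dataset`.`my_model`
where div(partkey, 1000) * 1000 in unnest(partitions_to_replace);

-- ... and write their contents back from the staging table.
insert into `my_project`.`my_dataset`.`my_model` (partkey, payload)
select partkey, payload
from `my_project`.`my_dataset`.`my_model__dbt_tmp`;
```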
@elyobo: the array length of dbt_partitions_for_replacement was around 30k.
@Fleid, @dbeatty10: I will wait for your response on whether the dbt core team is working on this issue. If not, I would try to open a PR addressing it while (hopefully) following your desired process for contributing 😃
@Fleid happy to close it and have done so.
Just a side note: BigQuery has a maximum limit of 10,000 partitions. So I do not think that a partition list could cause a crash in the BigQuery case (don't know about other databases, though). Also, in my experience you normally just replace a few chosen partitions with insert_overwrite, so the list is likely very short.
Having said that: I like the min/max idea better. It is more generic and should work under all circumstances, even edge cases, and with all backends. But if that does not work out for some reason, then I think listing the partitions would also be fine.
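For comparison, a minimal sketch of the min/max idea under the same placeholder assumptions; the array variable disappears entirely and is replaced by two scalars:

```sql
-- Min/max sketch: two scalar variables, no matter how many distinct keys
-- the incremental batch contains.
declare replace_min, replace_max int64;

set (replace_min, replace_max) = (
    select as struct min(partkey), max(partkey)
    from `my_project`.`my_dataset`.`my_model__dbt_tmp`
);

-- The merge's delete condition then becomes a simple range check, e.g.
--   DBT_INTERNAL_DEST.partkey between replace_min and replace_max
-- at the cost of also rewriting any partitions between min and max that
-- received no new rows (the trade-off discussed in this thread).
```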
That sounds pretty good; enum instead of boolean is sensible.
The min/max approach would also work with date partitioning but is unlikely to be a good choice there: sequential integer IDs can easily exceed the BQ scripting memory limits, whereas the number of unique dates in a set of data is likely to be far smaller and to fit within the limits. I wouldn't prevent such a choice but would definitely not default to it, and would mention this in the docs.
My head is totally out of this space at the moment so will take me some time to dig back in, but would be happy to give it a go once I get some time to wrap my head around it again.
Re-opening that one following @elyobo’s PR (#371).
Trying to sum up the entire situation before diving in:
- incremental strategy: insert_overwrite | merge
- partitioning: integer range, time-unit column, ingestion time
- partition selection: dynamic | static
- copy_partitions: switch to bq cp for better performance
Those are the options currently available to one exploring that space.
Here we are talking about an optimization of the specific case of insert_overwrite + integer range + dynamic, regardless of copy_partitions.
Currently, when generating the list of partitions to process, dbt generates the distinct list of values. Optimally, what we would need instead is the distinct list of partitions (see above for why). This is what @jtcohen6 mentioned up above almost 3 years ago (omg). I'm not sure why we abandoned that plan?
In the meantime, @elyobo, you offer instead to pick the min and max values from that list and process every partition in between, even the ones that don't need it. Not as optimal as targeting specific partitions, but better than the current implementation in some scenarios. Now we would need a knob to turn that on/off, since the behavior could make things worse in some situations.
Did you explore the original idea of generating the distinct list of partitions? I'd rather we make the thing better all the time, and not have to surface another setting for a situation that is already quite complex in my opinion.
@github-christophe-oudar I know you’re knowledgeable about that area. Any wisdom to share? 😉