dbt-bigquery: [CT-2051] [Bug] New insert_overwrite Bigquery partitioning with integer keys can create huge temporary array variables, exceeding BQ limits
Describe the bug
When doing an incremental update of an integer-partitioned BigQuery table with the new insert_overwrite merge strategy, dbt calculates which partitions it should replace. In the process it can generate a huge BigQuery ARRAY value.
The cause is that dbt does not take the "interval" parameter of the partitioning specification into account. The generated SQL selects array_agg(distinct PARTITION_FIELD) when calculating the partitions to replace. This selects ALL distinct partition-field values of the incremental update, even when those values fall inside the same partition, which can produce a huge array. With enough data in the table this even causes an error, because BigQuery's maximum array size (100 MB) is exceeded.
Note that this bug is not triggered by time partitioned tables because for those all partition fields are dates and the partition size is always one day (i.e. there is only one valid value per partition).
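For illustration, the script dbt generates for a dynamic insert_overwrite run looks roughly like the sketch below (simplified; project, dataset, and the payload column are placeholders). Every distinct partkey value of the new data lands in the array, even when thousands of values map to the same partition:

```sql
-- Simplified sketch of the generated script (placeholder names).
declare dbt_partitions_for_replacement array<int64>;

-- dbt first materializes the new data into a temporary table, then collects
-- every distinct partition-field value from it:
set (dbt_partitions_for_replacement) = (
    select as struct
        array_agg(distinct partkey)   -- one element per distinct VALUE, not per partition
    from `my_project`.`my_dataset`.`my_model__dbt_tmp`
);

merge into `my_project`.`my_dataset`.`my_model` as DBT_INTERNAL_DEST
using `my_project`.`my_dataset`.`my_model__dbt_tmp` as DBT_INTERNAL_SOURCE
on FALSE

when not matched by source
     and DBT_INTERNAL_DEST.partkey in unnest(dbt_partitions_for_replacement)
    then delete

when not matched
    then insert (partkey, payload)
         values (DBT_INTERNAL_SOURCE.partkey, DBT_INTERNAL_SOURCE.payload);
```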
Steps To Reproduce
- Generate a model with the following config (a full model sketch is given after this list):
  config(
      materialized = 'incremental',
      unique_key = 'partkey',
      partition_by = {
          "field": "partkey",
          "data_type": "int64",
          "range": { "start": 0, "end": 100000, "interval": 1000 }
      },
      incremental_strategy = 'insert_overwrite'
  )
- Run the model once with empty source data.
- Add 1000 records to the source data with values for partkey from 0 to 999.
- Generate the incremental update code for the model and look at the values in the dbt_partitions_for_replacement internal variable.
- You will see that it contains all 1000 values for partkey from 0 to 999, even though they all fall inside the same single partition.
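For completeness, a full model along the lines of the first step could look like this (a sketch only; the source reference and the payload column are placeholders, not part of the original report):

```sql
-- models/partkey_model.sql -- minimal reproduction sketch (placeholder names)
{{
    config(
        materialized = 'incremental',
        unique_key = 'partkey',
        partition_by = {
            "field": "partkey",
            "data_type": "int64",
            "range": { "start": 0, "end": 100000, "interval": 1000 }
        },
        incremental_strategy = 'insert_overwrite'
    )
}}

select
    partkey,
    payload
from {{ source('my_source', 'my_table') }}
```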
Expected behavior
dbt should ensure that no huge temporary array variable is generated. The dbt_partitions_for_replacement array should have at most as many elements as the number of partitions being updated. In my opinion the way to go would be to store only the starting value of each partition in the array and then modify the merge clause to use a range for each partition.
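A hedged sketch of what that could look like for the reproduction case above (interval 1000; all names are placeholders, and whether BigQuery still prunes integer-range partitions when a function is applied to the partition column would need to be verified):

```sql
declare dbt_partitions_for_replacement array<int64>;

-- One element per affected PARTITION instead of one per distinct value:
-- div(partkey, 1000) * 1000 maps every key to the start of its 1000-wide range.
set (dbt_partitions_for_replacement) = (
    select as struct
        array_agg(distinct div(partkey, 1000) * 1000)
    from `my_project`.`my_dataset`.`my_model__dbt_tmp`
);

-- The merge's delete condition would then need to match whole ranges rather
-- than single values, e.g. for some array element p:
--   DBT_INTERNAL_DEST.partkey >= p and DBT_INTERNAL_DEST.partkey < p + 1000
-- or, equivalently, by bucketing the destination key the same way:
--   div(DBT_INTERNAL_DEST.partkey, 1000) * 1000 in unnest(dbt_partitions_for_replacement)
```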
Screenshots and log output
N/A
System information
Which database are you using dbt with?
- [ ] postgres
- [ ] redshift
- [x] bigquery
- [ ] snowflake
- [ ] other (specify: ____________)
The output of dbt --version:
installed version: 0.16.0
latest version: 0.16.0
Up to date!
The operating system you’re using:
macOS
The output of python --version:
Python 3.7.4
Additional context
About this issue
- Original URL
- State: open
- Created 4 years ago
- Comments: 26 (12 by maintainers)
I recently refactored many incremental models I maintain to use the insert_overwrite strategy in order to reduce query cost, and this worked out pretty well. However, when dealing with integer-partitioned tables I ran into similar (or identical) limitations to the ones discussed in this issue. The "min_max" strategy discussed here deviates from my approach, so I would like to add something to the discussion. I started by looking at the compiled SQL that dbt had generated.
As already pointed out in this issue, the array in dbt_partitions_for_replacement does not refer directly to the partitions of the table; its elements are merely values that lie within the partitions targeted for an update. Something I found interesting is that, once the array becomes too long, the query cost is as large as if the table had no partitioning at all. I assume that BigQuery's query optimizer prefers speed over cost at a certain point. Anyway, I changed the code to drop the targeted partitions outright and rewrite them, which achieves the cost reduction while keeping the functionality I expected from the insert_overwrite method (a rough sketch of this approach follows this comment).
Note that I had two partitions to replace in my example.
This approach drops the entire partition and writes it again, which is the behavior I assumed from reading the documentation, i.e. the line "[…] because it operates on partitions of data rather than individual rows. […]". Using a replace_partitions flag sounds fine to me as well.
After proposing my view on the topic: @elyobo, @Fleid, is someone working on this issue at the moment? If a change is in the pipeline I would lean back and wait for an update. If not, I could try to follow the process to contribute to dbt myself.
PS: dbt is a great piece of software 😃
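A rough illustration of such a drop-and-rewrite approach, reusing the integer-range example from the issue above (an editorial sketch with placeholder names, not the commenter's original snippets):

```sql
-- Illustration only: replace each affected partition wholesale instead of
-- merging row by row.
declare partitions_to_replace array<int64> default (
    select array_agg(distinct div(partkey, 1000) * 1000)   -- partition start values
    from `my_project`.`my_dataset`.`my_model__dbt_tmp`
);

-- Drop the affected partitions from the target table ...
delete from `my_project`.`my_dataset`.`my_model`
where div(partkey, 1000) * 1000 in unnest(partitions_to_replace);

-- ... and write their contents back from the staging table.
insert into `my_project`.`my_dataset`.`my_model` (partkey, payload)
select partkey, payload
from `my_project`.`my_dataset`.`my_model__dbt_tmp`;
```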
@elyobo: the array length of dbt_partitions_for_replacement was around 30k.
@Fleid, @dbeatty10: I will wait for your response on whether the dbt core team is working on this issue. If not, I would try to open a PR addressing it while (hopefully) following your desired process for contributing 😃
@Fleid happy to close it and have done so.
Just a side note: BigQuery has a maximum limit of 10,000 partitions. So I do not think that a partition list could cause a crash in the BigQuery case (don't know about other databases, though). Also, in my experience you normally just replace a few chosen partitions with insert_overwrite, so the list is likely very short.
Having said that: I like the min/max idea better. It is more generic and should work under all circumstances, even edge cases, and with all backends. But if that does not work out for some reason, then I think listing the partitions would also be fine.
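For comparison, a minimal sketch of the min/max idea under the same placeholder assumptions; the array variable disappears entirely and is replaced by two scalars:

```sql
-- Min/max sketch: two scalar variables, no matter how many distinct keys
-- the incremental batch contains.
declare replace_min, replace_max int64;

set (replace_min, replace_max) = (
    select as struct min(partkey), max(partkey)
    from `my_project`.`my_dataset`.`my_model__dbt_tmp`
);

-- The merge's delete condition then becomes a simple range check, e.g.
--   DBT_INTERNAL_DEST.partkey between replace_min and replace_max
-- at the cost of also rewriting any partitions between min and max that
-- received no new rows (the trade-off discussed in this thread).
```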
That sounds pretty good; enum instead of boolean is sensible.
The min/max approach would also work with date partitioning but is unlikely to be a good choice there: sequential integer IDs can easily exceed the BQ scripting memory limits, whereas the number of unique dates in a set of data is likely to be far smaller and to fit within the limits. I wouldn't prevent such a choice but would definitely not default to it, and would mention this in the docs.
My head is totally out of this space at the moment so will take me some time to dig back in, but would be happy to give it a go once I get some time to wrap my head around it again.
Re-opening that one following @elyobo’s PR (#371).
Trying to sum up the entire situation before diving in:
- incremental strategy: insert_overwrite | merge
- partitioning: integer range, time-unit column, ingestion time
- partition selection: dynamic | static
- copy_partitions: switch to bq cp for better performance
Those are the options currently available to one exploring that space.
Here we are talking about an optimization of the specific case of insert_overwrite + integer range + dynamic, regardless of copy_partitions.
Currently, when generating the list of partitions to process, dbt generates the distinct list of values. Optimally, what we would need instead is the distinct list of partitions (see above for why). This is what @jtcohen6 mentioned up above almost 3 years ago (omg). I'm not sure why we abandoned that plan?
In the meantime, @elyobo, you offer instead to pick the min and max values from that list and process every partition in between, even the ones that don't need it. Not as optimal as targeting specific partitions, but better than the current implementation in some scenarios. Now we would need a knob to turn that on/off, since the behavior could make things worse in some situations.
Did you explore the original idea of generating the distinct list of partitions? I'd rather we make the thing better all the time, and not have to surface another setting for a situation that is already quite complex in my opinion.
@github-christophe-oudar I know you’re knowledgeable about that area. Any wisdom to share? 😉