dbt-core: Duplicate Rows detected during snapshot
We have been battling a dbt bug for several months that we had hoped would be solved by the 0.17.0 release.
Consistently, the snapshot of a table we have breaks due to the following error:
Database Error in snapshot user_campaign_audit (snapshots/user_campaign_audit.sql) 100090 (42P18): Duplicate row detected during DML action
Checking our snapshot table, there are indeed multiple rows with identical dbt_scd_ids. The table being snapshotted changes its schema relatively frequently: it is a core table that feeds many downstream tables, so new columns are added fairly often. We also run a production dbt run every time we merge a branch into our master branch (we run dbt in a GitLab CI/CD flow), so the snapshot can run multiple times a day.
Our current approach to fix this is to create a copy of the snapshot table, reduce it to every distinct record, and then use that as the production version of the table. Something like:
create table broken_audit_table as (select distinct * from audit_table);
alter table broken_audit_table swap with audit_table;
grant ownership on table audit_table to role dbt;
Let me know if there is any more detail I can provide. Full stack is Fivetran/Snowflake/dbt
About this issue
- Original URL
- State: closed
- Created 4 years ago
- Comments: 19 (6 by maintainers)
@ddppi first of all, I loved the recursive structure of your troubleshooting steps. Here’s a time lapse of me walking through them making sure I understand everything correctly:
Secondly, some good news!
We have a timely discussion going on in https://github.com/dbt-labs/dbt-core/issues/6089 that contains a proposal to proactively search for duplicates before merging into the final snapshot table. The summary of the current proposal is here. Would welcome you to join that discussion and contribute any feedback you have.
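Until something like that lands, one way to confirm whether a snapshot has hit this problem is a manual check for repeated dbt_scd_id values in the existing snapshot table. A sketch of such a query follows; the database and schema qualifiers are placeholders for wherever your snapshot lives:
-- count snapshot rows that share a dbt_scd_id; a healthy snapshot returns zero rows
select dbt_scd_id, count(*) as row_count
from analytics.snapshots.user_campaign_audit
group by dbt_scd_id
having count(*) > 1
order by row_count desc;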
We recently had the same error, and eventually found it was due to duplicate rows in the underlying table in the RAW database, which were tricky to identify at first. Sharing some of our experience from troubleshooting it.
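For anyone tracking down the same thing, here is a sketch of the kind of query that surfaces such duplicates in a raw source table; the table name and the id / updated_at columns are placeholders for your snapshot's unique_key and updated_at:
-- find key / timestamp combinations that occur more than once in the raw source;
-- under the timestamp strategy these collide into identical dbt_scd_ids
select id, updated_at, count(*) as row_count
from raw.source_schema.source_table
group by id, updated_at
having count(*) > 1;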
I have the same issue on Exasol with the error message "Unable to get a stable set of rows in the source tables", and there are duplicate rows in the temp table before the merge even though the sources are clean. I figured out that a single quote within a varchar column caused the problem; after excluding all rows with single quotes in that column, the duplicates were gone.
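If you suspect the same quoting issue, a quick check along these lines (table and column names are placeholders) lists the source rows whose varchar column contains a single quote:
-- a single quote is escaped as two quotes inside the string literal
select *
from source_schema.source_table
where varchar_col like '%''%';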
I think in my case the issue may have been caused by running two instances of dbt concurrently. We have been migrating Airflow instances and at one point had a dbt DAG running on both. I suspect the snapshot command accidentally ran on both at the same time, and that this is the root cause in my case.
Hey @atvaccaro, it is technically possible to force dbt snapshot to fully copy the table each time. You'd do this by setting the unique_key to something globally unique, e.g. a surrogate key that hashes together your actual primary key with the current_date. Such an approach doesn't use a snapshot strategy to detect changes; it's just append-only. That means the snapshot is not idempotent: running dbt snapshot once or many times returns different results. (Rather than recording only the rows that are different, dbt will append another entire copy of the table.) It's not something I'd ever recommend, but you're welcome to implement it in your own project.
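For reference, a minimal sketch of that append-only configuration. The snapshot name, the primary key column id, and the raw.user_campaign_audit source are placeholders, and the check strategy with check_cols='all' is just one way to satisfy dbt's required snapshot config:
{% snapshot user_campaign_audit_append_only %}
{{
    config(
      target_schema='snapshots',
      strategy='check',
      check_cols='all',
      unique_key="md5(id || '-' || current_date)"
    )
}}
-- because the key changes every day, no incoming row ever matches an existing one,
-- so each run on a new day appends a full copy of the table
select * from {{ source('raw', 'user_campaign_audit') }}
{% endsnapshot %}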
This issue sounds like the crux of the matter:
Which snapshot strategy are you using? If it's the timestamp strategy, dbt creates a hash of the configured unique_key and updated_at. If it's the check strategy, dbt creates a hash of the configured unique_key and the current timestamp.
Therefore, the ways to end up with duplicate dbt_scd_ids are:
- timestamp strategy: there exist multiple rows in your underlying table with the same values of unique key + updated timestamp
- check strategy: multiple dbt snapshot invocations at exactly the same time (unlikely)
If the underlying table is liable to having exact duplicate records, you can add logic to the snapshot query or an intermediate model to remove those duplicates.
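A minimal sketch of that dedup logic inside the snapshot itself, assuming the timestamp strategy, Snowflake's qualify clause, and placeholder column names (id, updated_at, and _fivetran_synced as a tiebreaker):
{% snapshot user_campaign_audit %}
{{
    config(
      target_schema='snapshots',
      strategy='timestamp',
      unique_key='id',
      updated_at='updated_at'
    )
}}
-- keep exactly one row per (id, updated_at) so exact duplicates in the source
-- can no longer produce identical dbt_scd_ids
select *
from {{ source('raw', 'user_campaign_audit') }}
qualify row_number() over (
    partition by id, updated_at
    order by _fivetran_synced desc
) = 1
{% endsnapshot %}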