dbt-core: Record failing rows for tests into an auditable table
Feature
Feature description
dbt should record rows that fail schema or data tests into tables, to be used for auditing. There are two general ways this could work:
- Insert into an errors table (sketched below) with a schema like:
  - run started at
  - invocation id
  - test name
  - failing row primary key / json
- Create a table for each test failure in a scratch schema, matching the schema of the test query that failed
The baked-in schema tests will need to be rewritten to provide debuggable information in these tables.
This should probably be an opt-in feature for tests. It could be configured in schema.yml, though it would also be good to be able to configure whole swaths of tests at once.
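For the first option, a minimal sketch of what that errors table could look like (the audit schema and column names here are assumptions, not an existing dbt feature):

```sql
-- Illustrative sketch only: a possible shape for the proposed errors table.
-- The audit schema and column names are assumptions, not part of dbt today.
create table if not exists audit.test_errors (
    run_started_at  timestamp,      -- when the dbt invocation began
    invocation_id   varchar(64),    -- dbt's unique id for the invocation
    test_name       varchar(512),   -- which schema/data test failed
    failing_row     varchar(65535)  -- primary key of the failing row, or a JSON blob of it
);
```

Each failing test would then insert one row per failing record, keyed by invocation_id so failures can be traced back to a specific run.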
Who will this benefit?
dbt users could use this table to quickly determine why their tests failed. At present, it’s surprisingly difficult to debug a failed test.
About this issue
- Original URL
- State: closed
- Created 6 years ago
- Reactions: 33
- Comments: 17 (5 by maintainers)
I support this and would use it for reporting on failed tests. I am more interested in using these tables for auditing than for debugging: I would like a nice, easy table to query via Looker for a conditional "Go Fix ETL" alert.
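For example, assuming an audit table shaped like the sketch above, that conditional alert could hang off a query as simple as:

```sql
-- Hypothetical alert query over the sketched audit.test_errors table:
-- surface any test that produced failing rows in the last day of runs.
select
    test_name,
    count(*) as failing_rows
from audit.test_errors
where run_started_at > current_timestamp - interval '1 day'
group by test_name
order by failing_rows desc;
```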
When I went looking to build this myself, I naturally reached for hooks, as I'm already using them for audit purposes, much as described in the documentation.
I’m logging the individual model run:
As well as the overall results:
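A rough sketch of that per-model logging, using an assumed audit.model_runs table and an assumed log_model_run macro called from a post-hook (the overall-results half works the same way from an on-run-end hook, which receives the results context):

```sql
-- macros/log_model_run.sql -- sketch; called from a model post-hook, e.g.
--   post-hook: "{{ log_model_run() }}"
-- invocation_id, run_started_at and this are standard dbt context variables;
-- audit.model_runs is an assumed audit table.
{% macro log_model_run() %}
    insert into audit.model_runs (invocation_id, run_started_at, model_name, logged_at)
    values (
        '{{ invocation_id }}',
        '{{ run_started_at }}',
        '{{ this.name }}',
        current_timestamp
    )
{% endmacro %}
```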
(I have a simple macro, results_values, to turn the results into insertable values.) In this case I would generally like the exact same thing for tests. Something exactly analogous to on-run-start/on-run-end, with the end-of-run context results, would suit my needs.
Hi @Aylr and @gordonhwong!
In case it’s still useful for anyone, the following macro is what I use:
This pairs with:
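As a sketch of that kind of setup (the audit.test_results table and macro name here are assumptions, and the attributes available on results vary across dbt versions):

```sql
-- macros/log_test_results.sql -- illustrative sketch, not a canonical implementation.
-- Registered in dbt_project.yml with:  on-run-end: "{{ log_test_results(results) }}"
-- audit.test_results is an assumed table; results is the standard on-run-end context.
{% macro log_test_results(results) %}
    {% if execute %}
        {% for res in results if res.node.resource_type == 'test' %}
            {% set insert_sql %}
                insert into audit.test_results (invocation_id, test_name, status)
                values (
                    '{{ invocation_id }}',
                    '{{ res.node.name }}',
                    '{{ res.status }}'
                )
            {% endset %}
            {% do run_query(insert_sql) %}
        {% endfor %}
    {% endif %}
{% endmacro %}
```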
Hey @bashyroger, thanks for bringing this topic back to the fore. I just opened a new issue to sketch out our latest thinking about how dbt should store tests in the database: https://github.com/fishtown-analytics/dbt/issues/2593
This is exactly what I am looking for! IMO a run / test / error mart is a prerequisite for any mature data pipeline, especially for tracking run duration and data quality over time. Such a model can then be used as a source for firing alerts when data quality starts to deviate from an average, a run's duration deviates from an average, and so on.
Of course one could do this by bringing in other tools that help with this (like https://docs.greatexpectations.io, a tool I have seen suggested a few times). But IMO this should be part of the CORE of dbt; for me it falls under the alerting and logging features displayed here: https://www.getdbt.com/product/
Here’s an easy macro for this.
Wanted to share the approach we're taking for visibility of failing tests. I don't think it'd scale to all 1200+ tests, but right now we don't need it to - we're focused on exposing data quality issues to other teams, in the hope that showing them a chart with the number of errors going down will encourage them to keep the CRM tidy 🤞
We’re using dbt-utils’ get_relations_by_pattern and union_relations macros, and then will snapshot the results of that.
As much as we’d love to get amongst the artifact goodness, I don’t think Redshift doesn’t have a PUT equivalent so we’re probably dependent on a dbt Cloud native solution for the moment.