dbt-core: Record failing rows for tests into an auditable table

Feature

Feature description

dbt should record rows that fail schema or data tests into tables, to be used for auditing. There are two general ways this could work:

  1. Insert into a single errors table, with a schema like the following (a sketch DDL is shown after this list):
  • run started at
  • invocation id
  • test name
  • failing row primary key / json
  2. Create a table for each test failure in a scratch schema, matching the schema of the test query that failed
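
To make option 1 concrete, here is a minimal DDL sketch of what such an errors table could look like; the table name and column types are illustrative only, not an existing dbt object:

create table if not exists {{ target.schema }}.audit_test_failures (
    run_started_at  timestamp,  -- when the dbt invocation began
    invocation_id   varchar,    -- dbt's unique id for the invocation
    test_name       varchar,    -- the schema or data test that failed
    failing_row     varchar     -- primary key of the failing row, or a JSON blob of the row
);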

The baked-in schema tests will need to be rewritten to provide debuggable information in these tables.

This should probably be an opt-in feature for tests. It could be configured per-test in schema.yml, though it would also be good to be able to configure whole swaths of tests at once.
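
Purely as an illustration of what that opt-in might look like, a hypothetical schema.yml sketch (the store_failing_rows key is invented for this example and is not an existing dbt config):

version: 2

models:
  - name: orders
    columns:
      - name: order_id
        tests:
          - not_null
          - unique:
              store_failing_rows: true  # hypothetical opt-in flag, not an existing dbt config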

Who will this benefit?

dbt users could use this table to quickly determine why their tests failed. At present, it’s surprisingly difficult to debug a failed test.

About this issue

  • Original URL
  • State: closed
  • Created 6 years ago
  • Reactions: 33
  • Comments: 17 (5 by maintainers)

Most upvoted comments

I support this use case of reporting on failed tests and would use it myself. I am more interested in using the results for auditing than for debugging; I would like a nice, easy table to query via Looker for a conditional “Go Fix ETL” alert.

When I went looking to do this feature I naturally reached for hooks, as I’m using them already for audit purposes much as described in the documentation.

I’m logging the individual model run:

on-run-start:
    - "create table if not exists {{ target.schema }}.audit_dbt_runs (model text, state text, time timestamp)"

models:
  pre-hook:
    - "insert into {{ target.schema }}.audit_dbt_runs (model, state, time) values ('{{this.name}}', 'start', getdate())"
  post-hook:
    - "insert into {{ target.schema }}.audit_dbt_runs (model, state, time) values ('{{this.name}}', 'end', getdate())"

As well as the overall results:

on-run-start:
    - "create table if not exists {{ target.schema }}.audit_dbt_results (node text, status text, execution_time decimal(10,2), time timestamp)"
on-run-end:
    - "insert into {{ target.schema }}.audit_dbt_results (node, status, execution_time, time) values {{ results_values(results) }}"

(I have a simple macro results_values to turn the results into insertable values.)

In this case I would like exactly the same thing for tests. Something exactly analogous to on-run-start/on-run-end, with the results available in the end-of-run context, would suit my needs:

on-test-start:
    - "create table if not exists {{ target.schema }}.audit_dbt_test_results (node text, status text, execution_time decimal(10,2), time timestamp)"
on-test-end:
    - "insert into {{ target.schema }}.audit_dbt_test_results (node, status, execution_time, time) values {{ results_values(results) }}"

Hi @Aylr and @gordonhwong!

In case it’s still useful for anyone, the following macro is what I use:

{% macro results_values(results) %}
  {% for res in results -%}
    {% if loop.index > 1 %},{% endif %}
    ('{{ res.node.alias }}', '{{ res.status }}', {{ res.execution_time }}, getdate())
  {% endfor %}
{% endmacro %}

This pairs with:

on-run-start:
    - "create table if not exists {{ target.schema }}.audit_dbt_results (node text, status text, execution_time decimal(10,2), time timestamp)"
on-run-end:
    - "insert into {{ target.schema }}.audit_dbt_results (node, status, execution_time, time) values {{ results_values(results) }}"

Hey @bashyroger, thanks for bringing this topic back to the fore. I just opened a new issue to sketch out our latest thinking about how dbt should store tests in the database: https://github.com/fishtown-analytics/dbt/issues/2593

That looks like a great start to a ‘dbt_results_mart’, where you could have models as a dimension, dbt_runs as both a dimension and a fact, and errors as an optional fact.

This is exactly what I am looking for! IMO a run / test / error mart is a prerequisite for any mature data pipeline, especially for tracking run duration and data quality over time. Such a model can then be used as a source for firing alerts when data quality starts to deviate from its average, a run’s duration deviates from its average, and so on.

Of course one could do this by bringing in other tools that help with this (like https://docs.greatexpectations.io, a tool I have seen suggested a few times). But IMO this should be part of the CORE of dbt; for me it falls under the alerting and logging features displayed here: https://www.getdbt.com/product/
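
As a sketch of that kind of alert, the following model flags runs whose execution time jumps well above their trailing average, assuming the audit_dbt_results table populated by the hooks above; the 20-run window and 1.5x threshold are arbitrary choices for illustration:

with recent as (

    select
        node,
        execution_time,
        time,
        avg(execution_time) over (
            partition by node
            order by time
            rows between 20 preceding and 1 preceding  -- trailing average, excluding the current run
        ) as avg_execution_time
    from {{ target.schema }}.audit_dbt_results

)

select node, time, execution_time, avg_execution_time
from recent
where execution_time > 1.5 * avg_execution_time  -- arbitrary threshold; tune per node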

Here’s an easy macro for this.

{% macro unpack_results(results) %}

  {% for res in results -%}
-- This is useful for debugging
--     {{ log(res, info=True) }}
        ('{{ invocation_id }}', '{{run_started_at}}', NOW(), '{{ res.node.unique_id }}', '{{ res.status }}', '{{ res.error if res.error != None else ""}}', '{{ res.skip if res.skip != None else "" }}', '{{ res.fail if res.fail != None else "" }}', {{ res.execution_time }} ){{ "," if not loop.last }}
  {% endfor %}

{% endmacro %}
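
For completeness, here is a sketch of a table and hooks this macro could pair with; the table name and column names are assumptions chosen to match the nine values the macro emits, in order:

on-run-start:
    - "create table if not exists {{ target.schema }}.audit_dbt_results_detail (invocation_id text, run_started_at timestamptz, inserted_at timestamp, node text, status text, error text, skipped text, failed text, execution_time decimal(10,2))"
on-run-end:
    - "insert into {{ target.schema }}.audit_dbt_results_detail values {{ unpack_results(results) }}"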

Wanted to share the approach we’re taking for visibility of failing tests. I don’t think it’d scale to all 1200+ tests, but right now we don’t need it to - we’re focused on exposing data quality issues to other teams, in the hope that showing them a chart with the number of errors going down will encourage them to keep the CRM tidy 🤞

We’re using dbt-utils’ get_relations_by_pattern and union_relations macros, and then will snapshot the results of that.

{% set relations = dbt_utils.get_relations_by_pattern('%', 'hubspot_revenue_test__%') %}

with tests as (
    {{ dbt_utils.union_relations(relations, exclude=['_fivetran_synced']) }}
)

select 
    initcap(translate(regexp_replace(_dbt_source_relation, '["A-z\.]+hubspot_revenue_test__', ''), '_"', ' '))  as failing_test, --icky, but works to make the error friendlier
    pipeline_label as team, 
    count(*) as num_records
from tests
group by 1, 2 
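
And a sketch of what the snapshot on top of that could look like, assuming the query above is saved as a model named hubspot_test_failures; the snapshot name, target schema, and composite unique key are assumptions:

{% snapshot hubspot_test_failures_snapshot %}

{{
    config(
      target_schema='snapshots',
      unique_key="failing_test || '-' || team",
      strategy='check',
      check_cols=['num_records']
    )
}}

-- snapshot the aggregated failure counts so we can chart them over time
select * from {{ ref('hubspot_test_failures') }}

{% endsnapshot %}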

As much as we’d love to get amongst the artifact goodness, I don’t think Redshift has a PUT equivalent, so we’re probably dependent on a dbt Cloud native solution for the moment.