dbt-bigquery: [CT-171] [CT-60] [Bug] It takes too long to generate docs (BigQuery)

Is there an existing issue for this?

  • I have searched the existing issues

Current Behavior

Hi there

I have seen a similar bug here https://github.com/dbt-labs/dbt-core/issues/1576. That issue is closed, but generating docs still takes a long time for me, I believe because of the sharded tables in BigQuery (the ga_sessions_* tables from GA360).

Right now it takes around 30 minutes, and we have a little more than 10K tables.

Expected Behavior

I would expect the dbt docs generate command to finish generating the catalog within a minute

Steps To Reproduce

No response

Relevant log output

No response

Environment

- OS:
- Python: 3.7
- dbt: 1.0.0

What database are you using dbt with?

bigquery

Additional Context

No response

About this issue

  • State: closed
  • Created 2 years ago
  • Reactions: 1
  • Comments: 27 (7 by maintainers)

Most upvoted comments

Thanks both for weighing in! That’s a sizeable chunk of (meta)data. I’m not surprised to hear that dbt’s processes are slower at that scale, though it’s good to know that the bottleneck here may be the BigQuery client’s API.

The current query is filtered to just the datasets that dbt is interested in, but I take it that you might have either/both of:

  • source datasets with many thousands of tables, of which a small subset are actually referenced by dbt models
  • lots of non-dbt objects living in the same datasets as dbt models/seeds/snapshots/etc

If you have any control over those, such as by isolating dbt sources and models into dedicated datasets, that would help significantly.

Otherwise, we could try filtering further in the query, to just the objects (tables/views) that map to sources/models/etc in your dbt project. To do that, we’d need to rework _get_catalog_schemas to return a SchemaSearchMap() that contains a list of specific relation identifiers, and template them into the catalog query, up to X specific identifiers (<1 MB limit for StandardSQL query text).
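To make the idea concrete, here is a minimal sketch of what templating specific identifiers into the catalog query's filter could look like. The function name `build_catalog_filter` and its shape are illustrative only, not actual dbt internals; a real change would live inside `_get_catalog_schemas` and the catalog macro.

```python
# Hypothetical sketch: build a WHERE-clause predicate restricting the
# catalog query to specific relations per schema, capped so the rendered
# query text stays under BigQuery's ~1 MB StandardSQL limit.
MAX_PREDICATE_BYTES = 1_000_000

def build_catalog_filter(schema_to_identifiers):
    """schema_to_identifiers: dict mapping schema name -> set of table names."""
    clauses = []
    for schema, identifiers in sorted(schema_to_identifiers.items()):
        names = ", ".join(f"'{name}'" for name in sorted(identifiers))
        clauses.append(
            f"(table_schema = '{schema}' AND table_name IN ({names}))"
        )
    predicate = "\n  OR ".join(clauses)
    if len(predicate.encode("utf-8")) > MAX_PREDICATE_BYTES:
        # Too many identifiers to template safely; signal the caller to
        # fall back to schema-level filtering only.
        return None
    return predicate
```

With thousands of relations the predicate could still be large, so the byte cap (and the fallback to schema-level filtering) matters in practice.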

The catalog query from logs/dbt.log returns nearly instantaneously for me when I run it in the web interface. The last entry I see in the logs is the templated query; then there are no logs for 10 minutes while dbt builds the catalog. If I attach lldb to the subprocess dbt spawns, the frame names I see as I periodically interrupt it suggest that it is munging dictionaries and tuples and such for those 10 minutes. I haven't figured out how to attach a real Python debugger to the subprocess on my M1 Mac yet. My hunch therefore is that the issue lies in how the catalog is actually built, not in the catalog query.
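Short of attaching a debugger, Python's built-in cProfile can confirm where a long pure-Python phase spends its time. This is a self-contained sketch, not dbt's actual catalog-building code: `build_catalog` is a stand-in that simulates the kind of dict/tuple munging the lldb frames suggested.

```python
# Sketch: profile a simulated "catalog building" phase with cProfile
# instead of interrupting the process with lldb.
import cProfile
import io
import pstats

def build_catalog(rows):
    # Stand-in for the dict/tuple munging seen in the interrupted frames.
    return [dict(zip(("database", "schema", "name"), row)) for row in rows]

rows = [("my-project", "my_dataset", f"ga_sessions_{i}") for i in range(1000)]

profiler = cProfile.Profile()
profiler.enable()
catalog = build_catalog(rows)
profiler.disable()

out = io.StringIO()
pstats.Stats(profiler, stream=out).sort_stats("cumulative").print_stats(5)
report = out.getvalue()  # top 5 entries by cumulative time
```

For a live dbt process, a sampling profiler that attaches by PID (e.g. py-spy) would serve the same purpose without modifying dbt's code.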