dbt-bigquery: [CT-171] [CT-60] [Bug] It takes too long to generate docs (BigQuery)
Is there an existing issue for this?
- I have searched the existing issues
Current Behavior
Hi there
I have seen a similar bug here https://github.com/dbt-labs/dbt-core/issues/1576 and that issue is closed, but generating docs still takes a long time for me, I believe because of the sharded tables in BQ (the ga_sessions_* tables from GA360).
Right now it takes around 30 minutes, and we have a little more than 10K tables.
Expected Behavior
I would expect the dbt docs generate command to finish generating the catalog within a minute
Steps To Reproduce
No response
Relevant log output
No response
Environment
- OS:
- Python: 3.7
- dbt: 1.0.0
What database are you using dbt with?
bigquery
Additional Context
No response
About this issue
- Original URL
- State: closed
- Created 2 years ago
- Reactions: 1
- Comments: 27 (7 by maintainers)
Thanks both for weighing in! That’s a sizeable chunk of (meta)data. I’m not surprised to hear that dbt’s processes are slower at that scale, though it’s good to know that the bottleneck here may be the BigQuery client’s API.
The current query is filtered to just the datasets that dbt is interested in, but I take it that you might have either/both of:
If you have any control over those, such as by isolating dbt sources and models into dedicated datasets, that would help significantly.
Otherwise, we could try filtering further in the query, to just the objects (tables/views) that map to sources/models/etc in your dbt project. To do that, we’d need to rework `_get_catalog_schemas` to return a `SchemaSearchMap()` that contains a list of specific relation identifiers, and template them into the catalog query, up to X specific identifiers (<1 MB limit for StandardSQL query text).
The `logs/dbt.log` catalog query returns nearly instantaneously for me when I run it in the web interface. The last entry I see in the logs is the templated query. Then there are no logs for 10 minutes while it is building the catalog. If I attach lldb to the running subprocess that dbt spawns, the names of the frames I see when I periodically interrupt the process suggest it is munging dictionaries, tuples, and the like for those 10 minutes. I haven’t figured out how to attach a real Python debugger to the subprocess on my M1 Mac yet. My hunch therefore is that the issue may lie in how the catalog is actually built, not in the catalog query.
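One way to test the hunch that the time goes into Python-side dictionary work rather than the query is to profile the suspect step in isolation. The following is a minimal sketch using the standard library's cProfile and pstats modules; `build_catalog` and the row shape are hypothetical stand-ins, not dbt's actual catalog code.

```python
import cProfile
import io
import pstats


def build_catalog(rows):
    """Hypothetical stand-in for catalog assembly: group column rows by table."""
    catalog = {}
    for table, column, dtype in rows:
        catalog.setdefault(table, {})[column] = dtype
    return catalog


def profile_call(func, *args):
    """Run func under cProfile; return its result and a text report of the top calls."""
    profiler = cProfile.Profile()
    result = profiler.runcall(func, *args)
    buffer = io.StringIO()
    stats = pstats.Stats(profiler, stream=buffer)
    stats.sort_stats("cumulative").print_stats(10)
    return result, buffer.getvalue()


# Synthetic metadata: 1,000 tables with 100 columns each (made-up data).
rows = [(f"table_{i % 1000}", f"col_{i}", "STRING") for i in range(100_000)]
catalog, report = profile_call(build_catalog, rows)
print(report)
```

Sorting by cumulative time puts the functions that dominate the run at the top of the report, which is usually enough to tell whether the cost sits in row processing or elsewhere. A sampling profiler that can attach to a running process by PID, such as py-spy, may be an alternative to lldb; whether it works on a given machine is untested here.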
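The suggestion earlier in the thread, templating specific relation identifiers into the catalog query while staying under the query text limit, could be sketched roughly as below. This is only an illustration of the batching idea: `quote_literal`, `batch_identifiers`, `render_filter`, the `table_name` column, and the byte budget are all hypothetical and are not part of dbt or the BigQuery client.

```python
def quote_literal(identifier):
    """Render an identifier as a single-quoted SQL string literal."""
    escaped = identifier.replace("\\", "\\\\").replace("'", "\\'")
    return "'" + escaped + "'"


def batch_identifiers(identifiers, max_bytes):
    """Split identifiers into batches whose rendered list stays within max_bytes.

    A single identifier larger than max_bytes still gets a batch of its own.
    """
    batches, current, size = [], [], 0
    for identifier in identifiers:
        literal = quote_literal(identifier)
        cost = len(literal.encode("utf-8")) + 2  # literal plus ", " separator
        if current and size + cost > max_bytes:
            batches.append(current)
            current, size = [], 0
        current.append(literal)
        size += cost
    if current:
        batches.append(current)
    return batches


def render_filter(batch, column="table_name"):
    """Render one batch as a SQL predicate; the column name is an assumption."""
    return f"{column} in ({', '.join(batch)})"


# Made-up relation names and a tiny budget so the split is visible.
names = ["ga_sessions", "orders", "customers", "it's"]
batches = batch_identifiers(names, max_bytes=30)
for batch in batches:
    print(render_filter(batch))
```

In practice the budget would be set somewhat below the query text limit to leave room for the rest of the templated query, and each batch would become its own catalog query whose results are merged afterwards.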