dbt-core: [CT-1353] [Bug] Performance issues with DBT build when nodes more than 2K in a project

Is this a new bug in dbt-core?

  • I believe this is a new bug in dbt-core
  • I have searched the existing issues, and I could not find an existing issue for this bug

Current Behavior

When a project has more than 2K-3K nodes (including models, tests, etc.), dbt build takes a long time to start its first model. Building even a single model takes 15 minutes. I disabled the analytics tracking event, thinking that could be the cause, but still no luck. dbt run is faster, but I can't use run alone, because tests must pass immediately before proceeding to the next model in the DAG.

Expected Behavior

The node count will vary with project size, but building a single model should not take 15 minutes; that is not workable when we have multiple dbt commands to run.

Steps To Reproduce

version: dbt 1.0 through the latest release, 1.3

total nodes count: 3K+ including models, tests, snapshots, seeds

build a single model: $ dbt build -s test_model

Relevant log output

No response

Environment

- OS: Windows/DBT Cloud
- Python: 3.7.4
- dbt: 1.2

Which database adapter are you using with dbt?

snowflake

Additional Context

No response

About this issue

  • Original URL
  • State: closed
  • Created 2 years ago
  • Comments: 25 (16 by maintainers)

Most upvoted comments

Hi @misteliy,

Awaiting dbt-core team review on the pull request. It's been a while and I don't see any update. Let me follow up and ask for the status.

thanks

Hi Tobie, I've checked your version against our dbt project (11,251 models and 34,993 tests) and indeed I see a significant improvement. The time between launching a single-model build and that model actually being processed decreased from 7 minutes to 3 minutes.

Hi @china-cse, thanks for the bug report and the effort put into researching a solution! Unfortunately, in this case graph.subgraph doesn't do exactly what we need it to. It drops every edge that ran through an unselected node, whereas what we need is to reconstruct the graph, creating edges between selected nodes that were already transitively connected.

As an example, if we applied the logic proposed here like so:

>>> import networkx as nx
>>> G = nx.path_graph(8)
>>> list(G.edges)
[(0, 1), (1, 2), (2, 3), (3, 4), (4, 5), (5, 6), (6, 7)]
>>> H = G.subgraph([1,2,3,5,7])
>>> list(H.edges)
[(1, 2), (2, 3)]

As you can see, nodes 5 and 7 have been disconnected from the rest of the graph even though they were selected!
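To be precise, the selected nodes themselves survive the subgraph call; what is lost is their connectivity, since any edge whose endpoint was unselected is dropped along with that endpoint. A quick check continuing the example above:

```python
import networkx as nx

# The same path graph and selection as above.
G = nx.path_graph(8)
H = G.subgraph([1, 2, 3, 5, 7])

# Nodes 5 and 7 are still present, but they have lost all of
# their edges because 4 and 6 were not selected.
print(sorted(H.nodes))  # [1, 2, 3, 5, 7]
print(sorted(H.edges))  # [(1, 2), (2, 3)]
```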

Here’s what we were expecting to happen:

>>> from dbt.graph.graph import Graph
>>> I = Graph(nx.DiGraph(G))
>>> I
<dbt.graph.graph.Graph object at 0x10f3a9940>
>>> J = I.get_subset_graph([1,2,3,5,7])
>>> list(J.graph.edges)
[(1, 2), (2, 3), (3, 5), (5, 7)]

Now, interestingly enough, that's not what we get today. Instead we get:

[(1, 2), (2, 1), (2, 3), (3, 2), (3, 5), (5, 3), (5, 7), (7, 5)]

As you can see, we have an extra set of edges being generated, pointing in the opposite direction. This definitely represents a bug that I can try to take a closer look at tomorrow.
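One plausible way such reciprocal edges can appear (a guess on my part, not confirmed against the dbt-core source) is a round-trip through an undirected graph: converting a DiGraph to undirected discards direction, so converting back materializes both directions for every edge.

```python
import networkx as nx

# A directed path 1 -> 2 -> 3.
D = nx.DiGraph([(1, 2), (2, 3)])

# Round-tripping through an undirected graph loses edge direction,
# so rebuilding a DiGraph creates both directions for each edge.
D2 = nx.DiGraph(D.to_undirected())
print(sorted(D2.edges))  # [(1, 2), (2, 1), (2, 3), (3, 2)]
```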

Also, as I noted the last time I worked on this code, we might get a better result if we leveraged some DAG-specific algorithms.
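As a sketch of what a DAG-aware approach might look like (the function name and approach here are illustrative, not dbt's actual implementation): connect every selected node to each selected node reachable from it, then prune redundant edges with a transitive reduction.

```python
import networkx as nx

def subset_dag(G, selected):
    """Sketch: build a DAG on only the selected nodes, adding an edge
    u -> v whenever v is reachable from u in the original graph, then
    dropping edges implied by transitivity."""
    selected = set(selected)
    H = nx.DiGraph()
    H.add_nodes_from(selected)
    for u in selected:
        # descendants(G, u) is every node reachable from u.
        for v in nx.descendants(G, u) & selected:
            H.add_edge(u, v)
    return nx.transitive_reduction(H)

# Directed version of the path graph from the example above.
G = nx.path_graph(8, create_using=nx.DiGraph)
J = subset_dag(G, [1, 2, 3, 5, 7])
print(sorted(J.edges))  # [(1, 2), (2, 3), (3, 5), (5, 7)]
```

This reproduces the expected edge set from the earlier example, including the (3, 5) and (5, 7) edges that run through unselected nodes.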

thank you @jtcohen6 for the response and references to ongoing discussions.

I've gone through the code snippet where building the graph takes time. I'm intrigued by the block below, where all unselected nodes and their edges are removed in loops instead of just taking a subgraph via the networkx module. Is there a specific reason it is implemented this way, looping through and checking each node before removing it?

[screenshot of the loop-based node-removal code]
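Since the screenshot doesn't reproduce here, a rough sketch (my own reconstruction, not the actual dbt-core code) of the loop-based pattern described above:

```python
import networkx as nx

def prune_in_place(graph, selected):
    # Hypothetical sketch of the pattern described above: iterate over
    # a snapshot of the node list and drop every unselected node.
    # remove_node also deletes all incident edges.
    keep = set(selected)
    for node in list(graph.nodes):
        if node not in keep:
            graph.remove_node(node)
    return graph

G = nx.path_graph(8, create_using=nx.DiGraph)
pruned = prune_in_place(G, [1, 2, 3, 5, 7])
print(sorted(pruned.edges))  # [(1, 2), (2, 3)]
```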

As I understand the logic, the same result can be achieved by just using subgraph, as below. It is faster and simpler code; I have tested it, and it runs faster and achieves the same result. Could the logic be amended this way, or do you see any issues?

[screenshot of the proposed subgraph-based code]
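The screenshot is missing, but based on the description the proposal was presumably along these lines (a sketch, not the actual submitted patch):

```python
import networkx as nx

# Proposed simplification: let networkx compute the induced subgraph
# directly instead of removing unselected nodes one at a time.
G = nx.path_graph(8, create_using=nx.DiGraph)
H = G.subgraph([1, 2, 3, 5, 7]).copy()
print(sorted(H.edges))  # [(1, 2), (2, 3)]
```

As the maintainer's reply above points out, this induced subgraph drops edges that pass through unselected nodes (there is no (3, 5) or (5, 7) edge here), which is why the simplification doesn't preserve dbt's intended graph semantics.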

Thank you!!!