dbt-core: [CT-697] [Bug] Seeds are considerably slower with dbt >= 1.0

Is there an existing issue for this?

  • I have searched the existing issues

Current Behavior

There appear to be some degraded performance when using dbt seed with dbt >= 1.0 (like 10x slower).

-!         CAVEAT FOR ISSUE READERS           !-

dbt doesn’t recommend the seed functionality for large seed files in the first place.

Expected Behavior

dbt seed should perform as it did prior to 1.0.

Steps To Reproduce

Setup

  1. Install docker + clone the dbt-core repo + follow the initial setup in the contributing docs - this is to setup the postgres db locally (skip if you already have a local postgres db setup).
  2. Generate a csv file with 10_000 rows (I used this gist) and copy it to your seeds/data path in your dbt project.
  3. Add the seeds to your dbt_project.yml file, mine was something like this:
name: "postgres"
version: "1.0.0"
config-version: 2
profile: "postgres"
data-paths: ["seeds"]

target-path: "target"
clean-targets:
  - "target"
  - "dbt_modules"
  - "dbt_packages"

models:
  postgres:
    +materialized: table

seeds:
  postgres:
    +enabled: true
    +quote_columns: false

Install dbt + seed your csv

All the following were installed into separate virtual envs.

Note that the seed my_seed_a already exists in the db so dbt always did a truncate in the steps below. The same degraded performance was experienced if the seed hadn’t already exist in the db, which means dbt would do create instead of truncate.

dbt 0.18.1

pip install dbt==0.18.1
pip install markupsafe==2.0.1
dbt seed
2022-05-31 05:57:41.850908 (MainThread): 17:57:41 | Concurrency: 8 threads (target='dev')
2022-05-31 05:57:41.851093 (MainThread): 17:57:41 | 
2022-05-31 05:57:41.857162 (Thread-1): Began running node seed.postgres.my_seed_a
2022-05-31 05:57:41.858582 (Thread-1): 17:57:41 | 1 of 1 START seed file public.my_seed_a.............................. [RUN]
2022-05-31 05:57:41.859322 (Thread-1): Acquiring new postgres connection "seed.postgres.my_seed_a".
2022-05-31 05:57:41.859650 (Thread-1): finished collecting timing info
2022-05-31 05:57:43.044005 (Thread-1): Using postgres connection "seed.postgres.my_seed_a".
2022-05-31 05:57:43.044181 (Thread-1): On seed.postgres.my_seed_a: BEGIN
2022-05-31 05:57:43.044270 (Thread-1): Opening a new connection, currently in state closed
2022-05-31 05:57:43.056447 (Thread-1): SQL status: BEGIN in 0.01 seconds
2022-05-31 05:57:43.056643 (Thread-1): Using postgres connection "seed.postgres.my_seed_a".
2022-05-31 05:57:43.056732 (Thread-1): On seed.postgres.my_seed_a: /* {"app": "dbt", "dbt_version": "0.18.1", "profile_name": "postgres", "target_name": "dev", "node_id": "seed.postgres.my_seed_a"} */
truncate table "postgres"."public"."my_seed_a"
2022-05-31 05:57:43.062752 (Thread-1): SQL status: TRUNCATE TABLE in 0.01 seconds
2022-05-31 05:57:43.444330 (Thread-1): Using postgres connection "seed.postgres.my_seed_a".
2022-05-31 05:57:43.444525 (Thread-1): On seed.postgres.my_seed_a: 
            insert into "postgres"."public"."my_seed_a" (first_name, last_name, job, company, amount, balance, timestamp_at, updated_at, deleted_at, created_at) values
            (%s,%s,%s,%s,%s,%s,%s,%s,%s,%s),(%s,%s,%s,%s,%s,%s,%s,%s,%s,%s),(%s,%s,%s,%s,%s,%s,%s,%s,%s,%s),(%s,%s,%s,%s,%s,%s,%s,%s,%s,%s),(%s,%s,%s,%s,%s,%s,%s,%s,%s,%s),(%s,%s,%s,%s,%s,%s,%s,%s,%s,%s),(%s,%s,%s,%s,%s,%s,%s,%s,%s,%s),(%s,%s,%s,%s,%s,%s,%s,%s,%s,%s),(%s,%s,%s,%s,%s,%s,%s,%s,%s,%s),(%s,%s,%s,%s,%s,%s,%s,%s,%s,%s),(%s,%s,%s,%...
2022-05-31 05:57:43.787235 (Thread-1): SQL status: INSERT 0 10000 in 0.34 seconds
2022-05-31 05:57:43.792411 (Thread-1): Writing runtime SQL for node "seed.postgres.my_seed_a"
2022-05-31 05:57:43.800577 (Thread-1): On seed.postgres.my_seed_a: COMMIT
2022-05-31 05:57:43.800747 (Thread-1): Using postgres connection "seed.postgres.my_seed_a".
2022-05-31 05:57:43.800831 (Thread-1): On seed.postgres.my_seed_a: COMMIT
2022-05-31 05:57:43.810414 (Thread-1): SQL status: COMMIT in 0.01 seconds
2022-05-31 05:57:43.812294 (Thread-1): finished collecting timing info
2022-05-31 05:57:43.812579 (Thread-1): On seed.postgres.my_seed_a: Close
2022-05-31 05:57:43.814104 (Thread-1): Sending event: {'category': 'dbt', 'action': 'run_model', 'label': '56fff452-6d07-4d50-afd8-95b873e36e9d', 'context': [<snowplow_tracker.self_describing_json.SelfDescribingJson object at 0x10a89ee20>]}
2022-05-31 05:57:43.814415 (Thread-1): 17:57:43 | 1 of 1 OK loaded seed file public.my_seed_a.......................... [INSERT 10000 in 1.95s]
2022-05-31 05:57:43.814566 (Thread-1): Finished running node seed.postgres.my_seed_a
2022-05-31 05:57:43.816776 (MainThread): Acquiring new postgres connection "master".
2022-05-31 05:57:43.817080 (MainThread): Using postgres connection "master".
2022-05-31 05:57:43.817243 (MainThread): On master: BEGIN
2022-05-31 05:57:43.817379 (MainThread): Opening a new connection, currently in state closed
2022-05-31 05:57:43.835475 (MainThread): SQL status: BEGIN in 0.02 seconds
2022-05-31 05:57:43.835730 (MainThread): On master: COMMIT
2022-05-31 05:57:43.835850 (MainThread): Using postgres connection "master".
2022-05-31 05:57:43.835953 (MainThread): On master: COMMIT
2022-05-31 05:57:43.837042 (MainThread): SQL status: COMMIT in 0.00 seconds
2022-05-31 05:57:43.837297 (MainThread): On master: Close
2022-05-31 05:57:43.837763 (MainThread): 17:57:43 | 
2022-05-31 05:57:43.837974 (MainThread): 17:57:43 | Finished running 1 seed in 2.14s.
2022-05-31 05:57:43.838150 (MainThread): Connection 'master' was properly closed.
2022-05-31 05:57:43.838276 (MainThread): Connection 'seed.postgres.my_seed_a' was properly closed.
2022-05-31 05:57:43.847005 (MainThread): 
2022-05-31 05:57:43.847287 (MainThread): Completed successfully
2022-05-31 05:57:43.847494 (MainThread): 
Done. PASS=1 WARN=0 ERROR=0 SKIP=0 TOTAL=1

dbt 1.0.7

pip install dbt-core==1.0.7 dbt-postgres==1.0.7
dbt seed
05:58:27.677062 [info ] [MainThread]: Concurrency: 8 threads (target='dev')
05:58:27.677869 [info ] [MainThread]: 
05:58:27.684727 [debug] [Thread-1  ]: Began running node seed.postgres.my_seed_a
05:58:27.685266 [info ] [Thread-1  ]: 1 of 1 START seed file public.my_seed_a......................................... [RUN]
05:58:27.686616 [debug] [Thread-1  ]: Acquiring new postgres connection "seed.postgres.my_seed_a"
05:58:27.687048 [debug] [Thread-1  ]: Began compiling node seed.postgres.my_seed_a
05:58:27.687441 [debug] [Thread-1  ]: finished collecting timing info
05:58:27.687691 [debug] [Thread-1  ]: Began executing node seed.postgres.my_seed_a
05:58:28.955714 [debug] [Thread-1  ]: Using postgres connection "seed.postgres.my_seed_a"
05:58:28.955965 [debug] [Thread-1  ]: On seed.postgres.my_seed_a: BEGIN
05:58:28.956123 [debug] [Thread-1  ]: Opening a new connection, currently in state closed
05:58:28.967233 [debug] [Thread-1  ]: SQL status: BEGIN in 0.01 seconds
05:58:28.967527 [debug] [Thread-1  ]: Using postgres connection "seed.postgres.my_seed_a"
05:58:28.967703 [debug] [Thread-1  ]: On seed.postgres.my_seed_a: /* {"app": "dbt", "dbt_version": "1.0.7", "profile_name": "postgres", "target_name": "dev", "node_id": "seed.postgres.my_seed_a"} */
truncate table "postgres"."public"."my_seed_a"
05:58:28.971334 [debug] [Thread-1  ]: SQL status: TRUNCATE TABLE in 0.0 seconds
05:58:54.523752 [debug] [Thread-1  ]: Using postgres connection "seed.postgres.my_seed_a"
05:58:54.524875 [debug] [Thread-1  ]: On seed.postgres.my_seed_a: 
          insert into "postgres"."public"."my_seed_a" (first_name, last_name, job, company, amount, balance, timestamp_at, updated_at, deleted_at, created_at) values
          (%s,%s,%s,%s,%s,%s,%s,%s,%s,%s),(%s,%s,%s,%s,%s,%s,%s,%s,%s,%s),(%s,%s,%s,%s,%s,%s,%s,%s,%s,%s),(%s,%s,%s,%s,%s,%s,%s,%s,%s,%s),(%s,%s,%s,%s,%s,%s,%s,%s,%s,%s),(%s,%s,%s,%s,%s,%s,%s,%s,%s,%s),(%s,%s,%s,%s,%s,%s,%s,%s,%s,%s),(%s,%s,%s,%s,%s,%s,%s,%s,%s,%s),(%s,%s,%s,%s,%s,%s,%s,%s,%s,%s),(%s,%s,%s,%s,%s,%s,%s,%s,%s,%s),(%s,%s,%s,%s,%s...
05:58:54.874008 [debug] [Thread-1  ]: SQL status: INSERT 0 10000 in 0.35 seconds
05:58:54.879118 [debug] [Thread-1  ]: Writing runtime SQL for node "seed.postgres.my_seed_a"
05:58:54.886762 [debug] [Thread-1  ]: On seed.postgres.my_seed_a: COMMIT
05:58:54.887028 [debug] [Thread-1  ]: Using postgres connection "seed.postgres.my_seed_a"
05:58:54.887209 [debug] [Thread-1  ]: On seed.postgres.my_seed_a: COMMIT
05:58:54.891714 [debug] [Thread-1  ]: SQL status: COMMIT in 0.0 seconds
05:58:54.892380 [debug] [Thread-1  ]: finished collecting timing info
05:58:54.892604 [debug] [Thread-1  ]: On seed.postgres.my_seed_a: Close
05:58:54.894010 [debug] [Thread-1  ]: Sending event: {'category': 'dbt', 'action': 'run_model', 'label': 'd718dd74-94d7-4dd4-b596-8682f54f3943', 'context': [<snowplow_tracker.self_describing_json.SelfDescribingJson object at 0x109e07ee0>]}
05:58:54.894367 [info ] [Thread-1  ]: 1 of 1 OK loaded seed file public.my_seed_a..................................... [INSERT 10000 in 27.21s]
05:58:54.894741 [debug] [Thread-1  ]: Finished running node seed.postgres.my_seed_a
05:58:54.896836 [debug] [MainThread]: Acquiring new postgres connection "master"
05:58:54.897364 [debug] [MainThread]: Using postgres connection "master"
05:58:54.897733 [debug] [MainThread]: On master: BEGIN
05:58:54.898062 [debug] [MainThread]: Opening a new connection, currently in state closed
05:58:54.931927 [debug] [MainThread]: SQL status: BEGIN in 0.03 seconds
05:58:54.932395 [debug] [MainThread]: On master: COMMIT
05:58:54.932709 [debug] [MainThread]: Using postgres connection "master"
05:58:54.933003 [debug] [MainThread]: On master: COMMIT
05:58:54.934533 [debug] [MainThread]: SQL status: COMMIT in 0.0 seconds
05:58:54.934977 [debug] [MainThread]: On master: Close
05:58:54.935634 [info ] [MainThread]: 
05:58:54.936256 [info ] [MainThread]: Finished running 1 seed in 27.42s.
05:58:54.936907 [debug] [MainThread]: Connection 'master' was properly closed.
05:58:54.937327 [debug] [MainThread]: Connection 'seed.postgres.my_seed_a' was properly closed.
05:58:54.943688 [info ] [MainThread]: 
05:58:54.944539 [info ] [MainThread]: Completed successfully

dbt 1.1.0

pip install dbt-core==1.1.0 dbt-postgres==1.1.0
dbt seed
08:08:51.605497 [info ] [MainThread]: Concurrency: 8 threads (target='dev')
08:08:51.605966 [info ] [MainThread]: 
08:08:51.611963 [debug] [Thread-1  ]: Began running node seed.postgres.my_seed_a
08:08:51.612711 [info ] [Thread-1  ]: 1 of 1 START seed file public.my_seed_a ........................................ [RUN]
08:08:51.614090 [debug] [Thread-1  ]: Acquiring new postgres connection "seed.postgres.my_seed_a"
08:08:51.615405 [debug] [Thread-1  ]: Began compiling node seed.postgres.my_seed_a
08:08:51.616236 [debug] [Thread-1  ]: finished collecting timing info
08:08:51.616654 [debug] [Thread-1  ]: Began executing node seed.postgres.my_seed_a
08:08:52.871344 [debug] [Thread-1  ]: Using postgres connection "seed.postgres.my_seed_a"
08:08:52.871767 [debug] [Thread-1  ]: On seed.postgres.my_seed_a: BEGIN
08:08:52.872066 [debug] [Thread-1  ]: Opening a new connection, currently in state closed
08:08:52.887434 [debug] [Thread-1  ]: SQL status: BEGIN in 0.02 seconds
08:08:52.887853 [debug] [Thread-1  ]: Using postgres connection "seed.postgres.my_seed_a"
08:08:52.888124 [debug] [Thread-1  ]: On seed.postgres.my_seed_a: /* {"app": "dbt", "dbt_version": "1.1.0", "profile_name": "postgres", "target_name": "dev", "node_id": "seed.postgres.my_seed_a"} */
truncate table "postgres"."public"."my_seed_a"
08:08:52.893625 [debug] [Thread-1  ]: SQL status: TRUNCATE TABLE in 0.01 seconds
08:09:18.845472 [debug] [Thread-1  ]: Using postgres connection "seed.postgres.my_seed_a"
08:09:18.845913 [debug] [Thread-1  ]: On seed.postgres.my_seed_a: 
          insert into "postgres"."public"."my_seed_a" (first_name, last_name, job, company, amount, balance, timestamp_at, updated_at, deleted_at, created_at) values
          (%s,%s,%s,%s,%s,%s,%s,%s,%s,%s),(%s,%s,%s,%s,%s,%s,%s,%s,%s,%s),(%s,%s,%s,%s,%s,%s,%s,%s,%s,%s),(%s,%s,%s,%s,%s,%s,%s,%s,%s,%s),(%s,%s,%s,%s,%s,%s,%s,%s,%s,%s),(%s,%s,%s,%s,%s,%s,%s,%s,%s,%s),(%s,%s,%s,%s,%s,%s,%s,%s,%s,%s),(%s,%s,%s,%s,%s,%s,%s,%s,%s,%s),(%s,%s,%s,%s,%s,%s,%s,%s,%s,%s),(%s,%s,%s,%s,%s,%s,%s,%s,%s,%s),(%s,%s,%s,%s,%s...
08:09:19.226980 [debug] [Thread-1  ]: SQL status: INSERT 0 10000 in 0.38 seconds
08:09:19.232895 [debug] [Thread-1  ]: Writing runtime SQL for node "seed.postgres.my_seed_a"
08:09:19.242962 [debug] [Thread-1  ]: On seed.postgres.my_seed_a: COMMIT
08:09:19.243377 [debug] [Thread-1  ]: Using postgres connection "seed.postgres.my_seed_a"
08:09:19.243676 [debug] [Thread-1  ]: On seed.postgres.my_seed_a: COMMIT
08:09:19.256668 [debug] [Thread-1  ]: SQL status: COMMIT in 0.01 seconds
08:09:19.258547 [debug] [Thread-1  ]: finished collecting timing info
08:09:19.259305 [debug] [Thread-1  ]: On seed.postgres.my_seed_a: Close
08:09:19.260948 [debug] [Thread-1  ]: Sending event: {'category': 'dbt', 'action': 'run_model', 'label': 'efdc63ee-7ed7-4aef-899c-b7004b5a3917', 'context': [<snowplow_tracker.self_describing_json.SelfDescribingJson object at 0x10705b1c0>]}
08:09:19.261629 [info ] [Thread-1  ]: 1 of 1 OK loaded seed file public.my_seed_a .................................... [INSERT 10000 in 27.65s]
08:09:19.262245 [debug] [Thread-1  ]: Finished running node seed.postgres.my_seed_a
08:09:19.265492 [debug] [MainThread]: Acquiring new postgres connection "master"
08:09:19.266367 [debug] [MainThread]: Using postgres connection "master"
08:09:19.266922 [debug] [MainThread]: On master: BEGIN
08:09:19.267332 [debug] [MainThread]: Opening a new connection, currently in state closed
08:09:19.298020 [debug] [MainThread]: SQL status: BEGIN in 0.03 seconds
08:09:19.298505 [debug] [MainThread]: On master: COMMIT
08:09:19.298843 [debug] [MainThread]: Using postgres connection "master"
08:09:19.299175 [debug] [MainThread]: On master: COMMIT
08:09:19.301994 [debug] [MainThread]: SQL status: COMMIT in 0.0 seconds
08:09:19.302421 [debug] [MainThread]: On master: Close
08:09:19.302960 [info ] [MainThread]: 
08:09:19.303368 [info ] [MainThread]: Finished running 1 seed in 27.86s.
08:09:19.303783 [debug] [MainThread]: Connection 'master' was properly closed.
08:09:19.304134 [debug] [MainThread]: Connection 'seed.postgres.my_seed_a' was properly closed.
08:09:19.309596 [info ] [MainThread]: 
08:09:19.309948 [info ] [MainThread]: Completed successfully
08:09:19.310287 [info ] [MainThread]: 
08:09:19.310588 [info ] [MainThread]: Done. PASS=1 WARN=0 ERROR=0 SKIP=0 TOTAL=1

Notice we went from about 2 seconds to 20 seconds.

Relevant log output

See above.

Environment

- OS: macOS
- Python: 3.8.7
- dbt: 0.18.1, 1.0.7, 1.1.0

What database are you using dbt with?

postgres

Additional Context

Looks like this is due to the long wait time between the truncate and the insert:

# dbt 0.18.1
2022-05-31 05:57:43.062752 (Thread-1): SQL status: TRUNCATE TABLE in 0.01 seconds
2022-05-31 05:57:43.444330 (Thread-1): Using postgres connection "seed.postgres.my_seed_a".
2022-05-31 05:57:43.444525 (Thread-1): On seed.postgres.my_seed_a: 
            insert into "postgres"."public"."my_seed_a" (first_name, last_name, job, company, amount, balance, timestamp_at, updated_at, deleted_at, created_at) values

# dbt 1.1
08:08:52.893625 [debug] [Thread-1  ]: SQL status: TRUNCATE TABLE in 0.01 seconds
08:09:18.845472 [debug] [Thread-1  ]: Using postgres connection "seed.postgres.my_seed_a"
08:09:18.845913 [debug] [Thread-1  ]: On seed.postgres.my_seed_a: 
          insert into "postgres"."public"."my_seed_a" (first_name, last_name, job, company, amount, balance, timestamp_at, updated_at, deleted_at, created_at) values

For the record, this becomes really bad for a 100k line seed file (27 seconds > 270 seconds) because the inserts are 10k rows at a time and dbt >= 1.0 has a lot of seemingly idle time between each command (insert into ... ).

About this issue

  • Original URL
  • State: closed
  • Created 2 years ago
  • Comments: 17 (12 by maintainers)

Most upvoted comments

60 seconds versus 7 seconds. The latter using the config specified above.

Versions:

Core:
  - installed: 1.4.5
  - latest:    1.4.5 - Up to date!

Plugins:
  - databricks: 1.4.2 - Up to date!
  - spark:      1.4.1 - Up to date!

properties.yml:

version: 2
seeds:
  - name: TargetsYearMonthWeek
    config:
      column_types:
        KPIName: string
        Year: int
        Month: int
        Week: int
        Field1: string
        Field2: string
        Field3: string
        Field4: string
        Field5: string
        TargetValue: int

The new table matches those columns, the old one is different in some ways: image