dbt-core: [CT-697] [Bug] Seeds are considerably slower with dbt >= 1.0
Is there an existing issue for this?
- I have searched the existing issues
Current Behavior
There appear to be some degraded performance when using dbt seed with dbt >= 1.0 (like 10x slower).
-! CAVEAT FOR ISSUE READERS !-
dbt doesn’t recommend the seed functionality for large seed files in the first place.
Expected Behavior
dbt seed should perform as it did prior to 1.0.
Steps To Reproduce
Setup
- Install docker + clone the dbt-core repo + follow the initial setup in the contributing docs - this is to setup the postgres db locally (skip if you already have a local postgres db setup).
- Generate a
csvfile with10_000rows (I used this gist) and copy it to your seeds/data path in your dbt project. - Add the seeds to your
dbt_project.ymlfile, mine was something like this:
name: "postgres"
version: "1.0.0"
config-version: 2
profile: "postgres"
data-paths: ["seeds"]
target-path: "target"
clean-targets:
- "target"
- "dbt_modules"
- "dbt_packages"
models:
postgres:
+materialized: table
seeds:
postgres:
+enabled: true
+quote_columns: false
Install dbt + seed your csv
All the following were installed into separate virtual envs.
Note that the seed
my_seed_aalready exists in the db so dbt always did atruncatein the steps below. The same degraded performance was experienced if the seed hadn’t already exist in the db, which means dbt would docreateinstead oftruncate.
dbt 0.18.1
pip install dbt==0.18.1
pip install markupsafe==2.0.1
dbt seed
2022-05-31 05:57:41.850908 (MainThread): 17:57:41 | Concurrency: 8 threads (target='dev')
2022-05-31 05:57:41.851093 (MainThread): 17:57:41 |
2022-05-31 05:57:41.857162 (Thread-1): Began running node seed.postgres.my_seed_a
2022-05-31 05:57:41.858582 (Thread-1): 17:57:41 | 1 of 1 START seed file public.my_seed_a.............................. [RUN]
2022-05-31 05:57:41.859322 (Thread-1): Acquiring new postgres connection "seed.postgres.my_seed_a".
2022-05-31 05:57:41.859650 (Thread-1): finished collecting timing info
2022-05-31 05:57:43.044005 (Thread-1): Using postgres connection "seed.postgres.my_seed_a".
2022-05-31 05:57:43.044181 (Thread-1): On seed.postgres.my_seed_a: BEGIN
2022-05-31 05:57:43.044270 (Thread-1): Opening a new connection, currently in state closed
2022-05-31 05:57:43.056447 (Thread-1): SQL status: BEGIN in 0.01 seconds
2022-05-31 05:57:43.056643 (Thread-1): Using postgres connection "seed.postgres.my_seed_a".
2022-05-31 05:57:43.056732 (Thread-1): On seed.postgres.my_seed_a: /* {"app": "dbt", "dbt_version": "0.18.1", "profile_name": "postgres", "target_name": "dev", "node_id": "seed.postgres.my_seed_a"} */
truncate table "postgres"."public"."my_seed_a"
2022-05-31 05:57:43.062752 (Thread-1): SQL status: TRUNCATE TABLE in 0.01 seconds
2022-05-31 05:57:43.444330 (Thread-1): Using postgres connection "seed.postgres.my_seed_a".
2022-05-31 05:57:43.444525 (Thread-1): On seed.postgres.my_seed_a:
insert into "postgres"."public"."my_seed_a" (first_name, last_name, job, company, amount, balance, timestamp_at, updated_at, deleted_at, created_at) values
(%s,%s,%s,%s,%s,%s,%s,%s,%s,%s),(%s,%s,%s,%s,%s,%s,%s,%s,%s,%s),(%s,%s,%s,%s,%s,%s,%s,%s,%s,%s),(%s,%s,%s,%s,%s,%s,%s,%s,%s,%s),(%s,%s,%s,%s,%s,%s,%s,%s,%s,%s),(%s,%s,%s,%s,%s,%s,%s,%s,%s,%s),(%s,%s,%s,%s,%s,%s,%s,%s,%s,%s),(%s,%s,%s,%s,%s,%s,%s,%s,%s,%s),(%s,%s,%s,%s,%s,%s,%s,%s,%s,%s),(%s,%s,%s,%s,%s,%s,%s,%s,%s,%s),(%s,%s,%s,%...
2022-05-31 05:57:43.787235 (Thread-1): SQL status: INSERT 0 10000 in 0.34 seconds
2022-05-31 05:57:43.792411 (Thread-1): Writing runtime SQL for node "seed.postgres.my_seed_a"
2022-05-31 05:57:43.800577 (Thread-1): On seed.postgres.my_seed_a: COMMIT
2022-05-31 05:57:43.800747 (Thread-1): Using postgres connection "seed.postgres.my_seed_a".
2022-05-31 05:57:43.800831 (Thread-1): On seed.postgres.my_seed_a: COMMIT
2022-05-31 05:57:43.810414 (Thread-1): SQL status: COMMIT in 0.01 seconds
2022-05-31 05:57:43.812294 (Thread-1): finished collecting timing info
2022-05-31 05:57:43.812579 (Thread-1): On seed.postgres.my_seed_a: Close
2022-05-31 05:57:43.814104 (Thread-1): Sending event: {'category': 'dbt', 'action': 'run_model', 'label': '56fff452-6d07-4d50-afd8-95b873e36e9d', 'context': [<snowplow_tracker.self_describing_json.SelfDescribingJson object at 0x10a89ee20>]}
2022-05-31 05:57:43.814415 (Thread-1): 17:57:43 | 1 of 1 OK loaded seed file public.my_seed_a.......................... [INSERT 10000 in 1.95s]
2022-05-31 05:57:43.814566 (Thread-1): Finished running node seed.postgres.my_seed_a
2022-05-31 05:57:43.816776 (MainThread): Acquiring new postgres connection "master".
2022-05-31 05:57:43.817080 (MainThread): Using postgres connection "master".
2022-05-31 05:57:43.817243 (MainThread): On master: BEGIN
2022-05-31 05:57:43.817379 (MainThread): Opening a new connection, currently in state closed
2022-05-31 05:57:43.835475 (MainThread): SQL status: BEGIN in 0.02 seconds
2022-05-31 05:57:43.835730 (MainThread): On master: COMMIT
2022-05-31 05:57:43.835850 (MainThread): Using postgres connection "master".
2022-05-31 05:57:43.835953 (MainThread): On master: COMMIT
2022-05-31 05:57:43.837042 (MainThread): SQL status: COMMIT in 0.00 seconds
2022-05-31 05:57:43.837297 (MainThread): On master: Close
2022-05-31 05:57:43.837763 (MainThread): 17:57:43 |
2022-05-31 05:57:43.837974 (MainThread): 17:57:43 | Finished running 1 seed in 2.14s.
2022-05-31 05:57:43.838150 (MainThread): Connection 'master' was properly closed.
2022-05-31 05:57:43.838276 (MainThread): Connection 'seed.postgres.my_seed_a' was properly closed.
2022-05-31 05:57:43.847005 (MainThread):
2022-05-31 05:57:43.847287 (MainThread): Completed successfully
2022-05-31 05:57:43.847494 (MainThread):
Done. PASS=1 WARN=0 ERROR=0 SKIP=0 TOTAL=1
dbt 1.0.7
pip install dbt-core==1.0.7 dbt-postgres==1.0.7
dbt seed
05:58:27.677062 [info ] [MainThread]: Concurrency: 8 threads (target='dev')
05:58:27.677869 [info ] [MainThread]:
05:58:27.684727 [debug] [Thread-1 ]: Began running node seed.postgres.my_seed_a
05:58:27.685266 [info ] [Thread-1 ]: 1 of 1 START seed file public.my_seed_a......................................... [RUN]
05:58:27.686616 [debug] [Thread-1 ]: Acquiring new postgres connection "seed.postgres.my_seed_a"
05:58:27.687048 [debug] [Thread-1 ]: Began compiling node seed.postgres.my_seed_a
05:58:27.687441 [debug] [Thread-1 ]: finished collecting timing info
05:58:27.687691 [debug] [Thread-1 ]: Began executing node seed.postgres.my_seed_a
05:58:28.955714 [debug] [Thread-1 ]: Using postgres connection "seed.postgres.my_seed_a"
05:58:28.955965 [debug] [Thread-1 ]: On seed.postgres.my_seed_a: BEGIN
05:58:28.956123 [debug] [Thread-1 ]: Opening a new connection, currently in state closed
05:58:28.967233 [debug] [Thread-1 ]: SQL status: BEGIN in 0.01 seconds
05:58:28.967527 [debug] [Thread-1 ]: Using postgres connection "seed.postgres.my_seed_a"
05:58:28.967703 [debug] [Thread-1 ]: On seed.postgres.my_seed_a: /* {"app": "dbt", "dbt_version": "1.0.7", "profile_name": "postgres", "target_name": "dev", "node_id": "seed.postgres.my_seed_a"} */
truncate table "postgres"."public"."my_seed_a"
05:58:28.971334 [debug] [Thread-1 ]: SQL status: TRUNCATE TABLE in 0.0 seconds
05:58:54.523752 [debug] [Thread-1 ]: Using postgres connection "seed.postgres.my_seed_a"
05:58:54.524875 [debug] [Thread-1 ]: On seed.postgres.my_seed_a:
insert into "postgres"."public"."my_seed_a" (first_name, last_name, job, company, amount, balance, timestamp_at, updated_at, deleted_at, created_at) values
(%s,%s,%s,%s,%s,%s,%s,%s,%s,%s),(%s,%s,%s,%s,%s,%s,%s,%s,%s,%s),(%s,%s,%s,%s,%s,%s,%s,%s,%s,%s),(%s,%s,%s,%s,%s,%s,%s,%s,%s,%s),(%s,%s,%s,%s,%s,%s,%s,%s,%s,%s),(%s,%s,%s,%s,%s,%s,%s,%s,%s,%s),(%s,%s,%s,%s,%s,%s,%s,%s,%s,%s),(%s,%s,%s,%s,%s,%s,%s,%s,%s,%s),(%s,%s,%s,%s,%s,%s,%s,%s,%s,%s),(%s,%s,%s,%s,%s,%s,%s,%s,%s,%s),(%s,%s,%s,%s,%s...
05:58:54.874008 [debug] [Thread-1 ]: SQL status: INSERT 0 10000 in 0.35 seconds
05:58:54.879118 [debug] [Thread-1 ]: Writing runtime SQL for node "seed.postgres.my_seed_a"
05:58:54.886762 [debug] [Thread-1 ]: On seed.postgres.my_seed_a: COMMIT
05:58:54.887028 [debug] [Thread-1 ]: Using postgres connection "seed.postgres.my_seed_a"
05:58:54.887209 [debug] [Thread-1 ]: On seed.postgres.my_seed_a: COMMIT
05:58:54.891714 [debug] [Thread-1 ]: SQL status: COMMIT in 0.0 seconds
05:58:54.892380 [debug] [Thread-1 ]: finished collecting timing info
05:58:54.892604 [debug] [Thread-1 ]: On seed.postgres.my_seed_a: Close
05:58:54.894010 [debug] [Thread-1 ]: Sending event: {'category': 'dbt', 'action': 'run_model', 'label': 'd718dd74-94d7-4dd4-b596-8682f54f3943', 'context': [<snowplow_tracker.self_describing_json.SelfDescribingJson object at 0x109e07ee0>]}
05:58:54.894367 [info ] [Thread-1 ]: 1 of 1 OK loaded seed file public.my_seed_a..................................... [[32mINSERT 10000[0m in 27.21s]
05:58:54.894741 [debug] [Thread-1 ]: Finished running node seed.postgres.my_seed_a
05:58:54.896836 [debug] [MainThread]: Acquiring new postgres connection "master"
05:58:54.897364 [debug] [MainThread]: Using postgres connection "master"
05:58:54.897733 [debug] [MainThread]: On master: BEGIN
05:58:54.898062 [debug] [MainThread]: Opening a new connection, currently in state closed
05:58:54.931927 [debug] [MainThread]: SQL status: BEGIN in 0.03 seconds
05:58:54.932395 [debug] [MainThread]: On master: COMMIT
05:58:54.932709 [debug] [MainThread]: Using postgres connection "master"
05:58:54.933003 [debug] [MainThread]: On master: COMMIT
05:58:54.934533 [debug] [MainThread]: SQL status: COMMIT in 0.0 seconds
05:58:54.934977 [debug] [MainThread]: On master: Close
05:58:54.935634 [info ] [MainThread]:
05:58:54.936256 [info ] [MainThread]: Finished running 1 seed in 27.42s.
05:58:54.936907 [debug] [MainThread]: Connection 'master' was properly closed.
05:58:54.937327 [debug] [MainThread]: Connection 'seed.postgres.my_seed_a' was properly closed.
05:58:54.943688 [info ] [MainThread]:
05:58:54.944539 [info ] [MainThread]: [32mCompleted successfully[0m
dbt 1.1.0
pip install dbt-core==1.1.0 dbt-postgres==1.1.0
dbt seed
08:08:51.605497 [info ] [MainThread]: Concurrency: 8 threads (target='dev')
08:08:51.605966 [info ] [MainThread]:
08:08:51.611963 [debug] [Thread-1 ]: Began running node seed.postgres.my_seed_a
08:08:51.612711 [info ] [Thread-1 ]: 1 of 1 START seed file public.my_seed_a ........................................ [RUN]
08:08:51.614090 [debug] [Thread-1 ]: Acquiring new postgres connection "seed.postgres.my_seed_a"
08:08:51.615405 [debug] [Thread-1 ]: Began compiling node seed.postgres.my_seed_a
08:08:51.616236 [debug] [Thread-1 ]: finished collecting timing info
08:08:51.616654 [debug] [Thread-1 ]: Began executing node seed.postgres.my_seed_a
08:08:52.871344 [debug] [Thread-1 ]: Using postgres connection "seed.postgres.my_seed_a"
08:08:52.871767 [debug] [Thread-1 ]: On seed.postgres.my_seed_a: BEGIN
08:08:52.872066 [debug] [Thread-1 ]: Opening a new connection, currently in state closed
08:08:52.887434 [debug] [Thread-1 ]: SQL status: BEGIN in 0.02 seconds
08:08:52.887853 [debug] [Thread-1 ]: Using postgres connection "seed.postgres.my_seed_a"
08:08:52.888124 [debug] [Thread-1 ]: On seed.postgres.my_seed_a: /* {"app": "dbt", "dbt_version": "1.1.0", "profile_name": "postgres", "target_name": "dev", "node_id": "seed.postgres.my_seed_a"} */
truncate table "postgres"."public"."my_seed_a"
08:08:52.893625 [debug] [Thread-1 ]: SQL status: TRUNCATE TABLE in 0.01 seconds
08:09:18.845472 [debug] [Thread-1 ]: Using postgres connection "seed.postgres.my_seed_a"
08:09:18.845913 [debug] [Thread-1 ]: On seed.postgres.my_seed_a:
insert into "postgres"."public"."my_seed_a" (first_name, last_name, job, company, amount, balance, timestamp_at, updated_at, deleted_at, created_at) values
(%s,%s,%s,%s,%s,%s,%s,%s,%s,%s),(%s,%s,%s,%s,%s,%s,%s,%s,%s,%s),(%s,%s,%s,%s,%s,%s,%s,%s,%s,%s),(%s,%s,%s,%s,%s,%s,%s,%s,%s,%s),(%s,%s,%s,%s,%s,%s,%s,%s,%s,%s),(%s,%s,%s,%s,%s,%s,%s,%s,%s,%s),(%s,%s,%s,%s,%s,%s,%s,%s,%s,%s),(%s,%s,%s,%s,%s,%s,%s,%s,%s,%s),(%s,%s,%s,%s,%s,%s,%s,%s,%s,%s),(%s,%s,%s,%s,%s,%s,%s,%s,%s,%s),(%s,%s,%s,%s,%s...
08:09:19.226980 [debug] [Thread-1 ]: SQL status: INSERT 0 10000 in 0.38 seconds
08:09:19.232895 [debug] [Thread-1 ]: Writing runtime SQL for node "seed.postgres.my_seed_a"
08:09:19.242962 [debug] [Thread-1 ]: On seed.postgres.my_seed_a: COMMIT
08:09:19.243377 [debug] [Thread-1 ]: Using postgres connection "seed.postgres.my_seed_a"
08:09:19.243676 [debug] [Thread-1 ]: On seed.postgres.my_seed_a: COMMIT
08:09:19.256668 [debug] [Thread-1 ]: SQL status: COMMIT in 0.01 seconds
08:09:19.258547 [debug] [Thread-1 ]: finished collecting timing info
08:09:19.259305 [debug] [Thread-1 ]: On seed.postgres.my_seed_a: Close
08:09:19.260948 [debug] [Thread-1 ]: Sending event: {'category': 'dbt', 'action': 'run_model', 'label': 'efdc63ee-7ed7-4aef-899c-b7004b5a3917', 'context': [<snowplow_tracker.self_describing_json.SelfDescribingJson object at 0x10705b1c0>]}
08:09:19.261629 [info ] [Thread-1 ]: 1 of 1 OK loaded seed file public.my_seed_a .................................... [[32mINSERT 10000[0m in 27.65s]
08:09:19.262245 [debug] [Thread-1 ]: Finished running node seed.postgres.my_seed_a
08:09:19.265492 [debug] [MainThread]: Acquiring new postgres connection "master"
08:09:19.266367 [debug] [MainThread]: Using postgres connection "master"
08:09:19.266922 [debug] [MainThread]: On master: BEGIN
08:09:19.267332 [debug] [MainThread]: Opening a new connection, currently in state closed
08:09:19.298020 [debug] [MainThread]: SQL status: BEGIN in 0.03 seconds
08:09:19.298505 [debug] [MainThread]: On master: COMMIT
08:09:19.298843 [debug] [MainThread]: Using postgres connection "master"
08:09:19.299175 [debug] [MainThread]: On master: COMMIT
08:09:19.301994 [debug] [MainThread]: SQL status: COMMIT in 0.0 seconds
08:09:19.302421 [debug] [MainThread]: On master: Close
08:09:19.302960 [info ] [MainThread]:
08:09:19.303368 [info ] [MainThread]: Finished running 1 seed in 27.86s.
08:09:19.303783 [debug] [MainThread]: Connection 'master' was properly closed.
08:09:19.304134 [debug] [MainThread]: Connection 'seed.postgres.my_seed_a' was properly closed.
08:09:19.309596 [info ] [MainThread]:
08:09:19.309948 [info ] [MainThread]: [32mCompleted successfully[0m
08:09:19.310287 [info ] [MainThread]:
08:09:19.310588 [info ] [MainThread]: Done. PASS=1 WARN=0 ERROR=0 SKIP=0 TOTAL=1
Notice we went from about 2 seconds to 20 seconds.
Relevant log output
See above.
Environment
- OS: macOS
- Python: 3.8.7
- dbt: 0.18.1, 1.0.7, 1.1.0
What database are you using dbt with?
postgres
Additional Context
Looks like this is due to the long wait time between the truncate and the insert:
# dbt 0.18.1
2022-05-31 05:57:43.062752 (Thread-1): SQL status: TRUNCATE TABLE in 0.01 seconds
2022-05-31 05:57:43.444330 (Thread-1): Using postgres connection "seed.postgres.my_seed_a".
2022-05-31 05:57:43.444525 (Thread-1): On seed.postgres.my_seed_a:
insert into "postgres"."public"."my_seed_a" (first_name, last_name, job, company, amount, balance, timestamp_at, updated_at, deleted_at, created_at) values
# dbt 1.1
08:08:52.893625 [debug] [Thread-1 ]: SQL status: TRUNCATE TABLE in 0.01 seconds
08:09:18.845472 [debug] [Thread-1 ]: Using postgres connection "seed.postgres.my_seed_a"
08:09:18.845913 [debug] [Thread-1 ]: On seed.postgres.my_seed_a:
insert into "postgres"."public"."my_seed_a" (first_name, last_name, job, company, amount, balance, timestamp_at, updated_at, deleted_at, created_at) values
For the record, this becomes really bad for a 100k line seed file (27 seconds > 270 seconds) because the inserts are 10k rows at a time and dbt >= 1.0 has a lot of seemingly idle time between each command (insert into ... ).
About this issue
- Original URL
- State: closed
- Created 2 years ago
- Comments: 17 (12 by maintainers)
60 seconds versus 7 seconds. The latter using the config specified above.
Versions:
properties.yml:
The new table matches those columns, the old one is different in some ways: