backstage: 🐛 Bug Report: Collating search indexing is slow and fails when inserting data from the documents_to_insert table into the documents table

📜 Description

Collating documents for techdocs and software-catalog keeps failing with a timeout error like the following:

```
{"level":"error","message":"Collating documents for techdocs failed: error: insert into documents (type, document, hash) select \"type\", \"document\", \"hash\" from \"documents_to_insert\" on conflict (\"hash\") do nothing - canceling statement due to statement timeout","plugin":"search","service":"backstage","type":"plugin"}
{"level":"error","message":"Collating documents for software-catalog failed: error: insert into documents (type, document, hash) select \"type\", \"document\", \"hash\" from \"documents_to_insert\" on conflict (\"hash\") do nothing - canceling statement due to statement timeout","plugin":"search","service":"backstage","type":"plugin"}
```

The current statement_timeout is 5s. It works fine for staging, which has less data (the documents table is 250 MB). Production has more data (the documents table is 360 MB), but 5s should still be sufficient.
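For reference, a minimal sketch of how the timeout can be inspected and adjusted in PostgreSQL; the role name below is hypothetical, and where Backstage's connection actually gets its timeout depends on your database configuration:

```sql
-- Check the timeout in effect for the current session.
SHOW statement_timeout;

-- Raise it for the current session only (value is illustrative).
SET statement_timeout = '30s';

-- Or persist it for the role that runs the search indexer
-- (role name "backstage_search" is hypothetical).
ALTER ROLE backstage_search SET statement_timeout = '30s';
```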

👍 Expected behavior

Collating documents for techdocs and software-catalog works fine.

👎 Actual Behavior with Screenshots

Collating documents for software-catalog and techdocs fails at `insert into documents (type, document, hash) select "type", "document", "hash" from "documents_to_insert" on conflict ("hash") do nothing - canceling statement due to statement timeout` when there is more data in production.

For staging, this query eventually succeeds, but only after a long time. I checked pg_stat_statements in staging using `select * from pg_stat_statements where query = 'insert into documents (type, document, hash) select "type", "document", "hash" from "documents_to_insert" on conflict ("hash") do nothing';`

```
calls total_time min_time max_time mean_time stddev_time rows shared_blks_hit shared_blks_read shared_blks_dirtied shared_blks_written local_blks_hit local_blks_read local_blks_dirtied local_blks_written temp_blks_read temp_blks_written blk_read_time blk_write_time
1 4679.825284 4679.825284 4679.825284 4679.825284 0 0 139071 7453 0 0 3 1896 0 1023 0 0 3286.667592 0
1 11371.131656 11371.131656 11371.131656 11371.131656 0 0 221909 8543 0 0 23695 7342 0 1021 0 0 423.288968 0
1 4197.742147 4197.742147 4197.742147 4197.742147 0 1 139107 7488 8 0 1 1899 0 1023 0 0 2875.815587 0
1 10825.297827 10825.297827 10825.297827 10825.297827 0 0 222233 8219 0 0 23695 7342 0 1021 0 0 78.340588 0
1 1286.8494780000000 1286.8494780000000 1286.8494780000000 1286.8494780000000 0 0 142457 4067 0 0 3 1896 0 1023 0 0 30.875488 0
```

👟 Reproduction steps

It happens when there is a lot of data; in our case the documents table is around 360 MB.
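To compare against a reproduction environment, the table size can be checked with standard PostgreSQL functions, e.g.:

```sql
-- Total on-disk size of the documents table, including indexes and TOAST data.
SELECT pg_size_pretty(pg_total_relation_size('documents'));
```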

📃 Provide the context for the Bug.

Now the search indexing keeps failing.

I am wondering whether this issue is related to the query `insert into documents (type, document, hash) select "type", "document", "hash" from "documents_to_insert" on conflict ("hash") do nothing`.

Could it be caused by the lack of an index on documents_to_insert?
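If someone wants to test that hypothesis, a sketch of the experiment (untested; whether the planner actually uses the index depends on the plan it chooses):

```sql
-- Hypothetical experiment: index the temp table's hash column so lookups
-- against documents_to_insert (e.g. an anti-join or NOT IN subquery) can
-- use an index scan instead of a sequential scan. Since documents_to_insert
-- is a temp table, this must run in the same session that created it.
CREATE INDEX ON documents_to_insert (hash);

-- Refresh planner statistics so the new index is considered.
ANALYZE documents_to_insert;
```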

Also, since documents_to_insert is a temp table, I am not sure how to analyze this query's performance.
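One option that should work here is PostgreSQL's auto_explain module, which logs the plan of slow statements from inside the session that runs them, so it also covers temp tables that exist only in the plugin's connection. A sketch (thresholds are illustrative):

```sql
-- Load auto_explain for the current session. To capture the Backstage
-- plugin's own sessions instead, set
-- session_preload_libraries = 'auto_explain' in postgresql.conf.
LOAD 'auto_explain';

-- Log the plan of any statement running longer than 1 second,
-- including actual row counts and timings (EXPLAIN ANALYZE style).
SET auto_explain.log_min_duration = '1s';
SET auto_explain.log_analyze = true;
```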

🖥️ Your Environment

```
$ yarn backstage-cli info
yarn run v1.22.18

OS:   Darwin 21.6.0 - darwin/x64
node: v16.14.2
yarn: 1.22.18
cli:  0.18.1 (installed)
backstage: 1.5.1

Dependencies:
  @backstage/app-defaults 1.0.7
  @backstage/backend-common 0.15.1
  @backstage/backend-plugin-api 0.1.1
  @backstage/backend-tasks 0.3.4
  @backstage/catalog-client 1.1.2
  @backstage/catalog-model 1.1.3, 1.1.4
  @backstage/cli-common 0.1.10
  @backstage/cli 0.18.1
  @backstage/config-loader 1.1.4
  @backstage/config 1.0.4, 1.0.5
  @backstage/core-app-api 1.2.0
  @backstage/core-components 0.11.2, 0.12.0
  @backstage/core-plugin-api 1.1.0
  @backstage/errors 1.1.0, 1.1.3, 1.1.4
  @backstage/integration-react 1.1.5
  @backstage/integration 1.4.0
  @backstage/plugin-analytics-module-ga 0.1.19
  @backstage/plugin-api-docs 0.8.8
  @backstage/plugin-app-backend 0.3.35
  @backstage/plugin-auth-backend 0.15.1
  @backstage/plugin-auth-node 0.2.5
  @backstage/plugin-catalog-backend-module-gitlab 0.1.6
  @backstage/plugin-catalog-backend-module-ldap 0.5.2
  @backstage/plugin-catalog-backend 1.3.1
  @backstage/plugin-catalog-common 1.0.8, 1.0.9
  @backstage/plugin-catalog-graph 0.2.20
  @backstage/plugin-catalog-import 0.8.11
  @backstage/plugin-catalog-node 1.0.1
  @backstage/plugin-catalog-react 1.2.1
  @backstage/plugin-catalog 1.5.0
  @backstage/plugin-explore-react 0.0.20
  @backstage/plugin-explore 0.3.39
  @backstage/plugin-home 0.4.24
  @backstage/plugin-kubernetes-backend 0.7.1
  @backstage/plugin-kubernetes-common 0.4.1
  @backstage/plugin-kubernetes 0.7.1
  @backstage/plugin-org 0.5.8
  @backstage/plugin-pagerduty 0.5.1
  @backstage/plugin-permission-backend 0.5.10
  @backstage/plugin-permission-common 0.6.4, 0.7.1, 0.7.2
  @backstage/plugin-permission-node 0.6.4
  @backstage/plugin-permission-react 0.4.7
  @backstage/plugin-proxy-backend 0.2.29
  @backstage/plugin-scaffolder-backend 1.5.1
  @backstage/plugin-scaffolder-common 1.1.2
  @backstage/plugin-scaffolder 1.5.0
  @backstage/plugin-search-backend-module-pg 0.3.6
  @backstage/plugin-search-backend-node 1.0.1
  @backstage/plugin-search-backend 1.0.1
  @backstage/plugin-search-common 1.1.1, 1.2.0
  @backstage/plugin-search-react 1.0.1
  @backstage/plugin-search 1.0.1
  @backstage/plugin-shortcuts 0.3.0
  @backstage/plugin-stack-overflow 0.1.4
  @backstage/plugin-tech-radar 0.5.15
  @backstage/plugin-techdocs-backend 1.2.1
  @backstage/plugin-techdocs-module-addons-contrib 1.0.3
  @backstage/plugin-techdocs-node 1.3.0
  @backstage/plugin-techdocs-react 1.0.3
  @backstage/plugin-techdocs 1.3.1
  @backstage/plugin-user-settings 0.4.7
  @backstage/release-manifests 0.0.5
  @backstage/test-utils 1.2.2
  @backstage/theme 0.2.16
  @backstage/types 1.0.1, 1.0.2
  @backstage/version-bridge 1.0.2

✨ Done in 2.13s.
```

👀 Have you spent some time to check if this bug has been raised before?

  • I checked and didn't find a similar issue

🏢 Have you read the Code of Conduct?

Are you willing to submit a PR?

Yes I am willing to submit a PR!

About this issue

  • State: closed
  • Created a year ago
  • Comments: 19 (18 by maintainers)

Most upvoted comments

Looks like https://github.com/backstage/backstage/pull/20936, released with v1.20, fixed the immediate performance issue. Thanks @mariia-maksimova!

+1 We've experienced a similar issue, only for us the collating was taking roughly 20 min every hour. After larger refreshes of various entities, it took up to 7 hours to finish (with subsequent runs piling up on each other 😦).

Maybe this will have no impact, but I wonder if the NOT IN subquery is problematic? It could be simplified to a LEFT JOIN with a NULL check on the joined table: https://github.com/backstage/backstage/blob/4b2a9b5/plugins/search-backend-module-pg/src/database/DatabaseDocumentStore.ts#L120-L126
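Roughly what I have in mind, sketched with table and column names taken from the logs above (the actual statement lives in DatabaseDocumentStore.ts, so treat this as illustrative):

```sql
-- NOT IN has to account for NULL semantics and can get a poor plan on
-- large inputs:
SELECT d.hash
FROM documents d
WHERE d.hash NOT IN (SELECT hash FROM documents_to_insert);

-- Equivalent anti-join (assuming hash is NOT NULL on both sides), which
-- PostgreSQL typically executes as an efficient hash anti-join:
SELECT d.hash
FROM documents d
LEFT JOIN documents_to_insert t ON t.hash = d.hash
WHERE t.hash IS NULL;
```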

This option did help a lot, bringing us from 20 min down to 3 min. We will probably look further into indexes as well, but replacing NOT IN definitely helped. It would be great to integrate it into the codebase 😃