backstage: ๐Ÿ› Bug Report: [Tech insights plugin] the expired facts won't be cleared if the source entity is deleted

๐Ÿ“œ Description

The tech insights plugin fact retriever supports setting TTLs for the facts, and the expired records should be cleaned up automatically.

All of the fact retrievers we enabled have lifecycle set to { timeToLive: { weeks: 2 } }. However, when I check the facts table, we see many records are clearly out of the 2 weeks range and not being cleared.

๐Ÿ‘ Expected behavior

All the expired records in the facts table should be cleared.

๐Ÿ‘Ž Actual Behavior with Screenshots

Many expired records are still in the database, and all of them are from entities that are removed from the catalog.

๐Ÿ‘Ÿ Reproduction steps

=> select count(*) from tech_insights.facts where timestamp< now() - INTERVAL '14 DAY';
 count
--------
 237381
(1 row)

๐Ÿ“ƒ Provide the context for the Bug.

The expired items cleanup happens in the insertFacts: https://github.com/backstage/backstage/blob/7b47138d9a655625f17dbb372796a07a18bf1c23/plugins/tech-insights-backend/src/service/persistence/TechInsightsDatabase.ts#L120-L126

deleteExpiredFactsByDate and deleteExpiredFactsByNumber are the two methods that handle the actual deletion of the records, and both of them accept factRows, which are the latest facts from fact retrievers. The factRows restrain the scope of the deletion to only the currently present entities. In other words, if an entity was present and then deleted, all the fact items related to that entity will be left in the database forever.

For deleteExpiredFactsByDate, we can tweak the SQL query to make it delete all the expired records that match the fact retriever id. It would roughly be equivalent to the following statement:

delete from facts where id="entityMetadataFactRetriever" and timestamp < now() - INTERVAL '14 DAY'

We can delete in batch (say 5000 records at a time) so it wonโ€™t choke the database.

However, itโ€™s a bit more complicated for deleteExpiredFactsByNumber. Since it means โ€œdelete all but N latest facts for each id/entity pairโ€, it has to be aware whatโ€™s the entity itโ€™s acting on. We could:

  • Join the table in catalog. But it breaks the assumption that all the plugins should be isolated.
  • Query the catalog backend endpoint to check if the entity is still available, but Iโ€™m not sure if it will incur too much load on the backend.

๐Ÿ–ฅ๏ธ Your Environment

No response

๐Ÿ‘€ Have you spent some time to check if this bug has been raised before?

  • I checked and didnโ€™t find similar issue

๐Ÿข Have you read the Code of Conduct?

Are you willing to submit PR?

Yes I am willing to submit a PR!

About this issue

  • Original URL
  • State: closed
  • Created a year ago
  • Comments: 16 (10 by maintainers)

Most upvoted comments

Thanks @namco1992. Thatโ€™s correct, for the timed queries we donโ€™t care about the entity so the whereIn is not needed. Iโ€™ll create a subsequent PR to take that away.