backstage: [RFC] do not allow rogue Locations
Status: Open for comments
Need
I’d like to introduce a new term: rogue Locations. A rogue Location is an optional Location that does not manage any entities and does not have any errors attached. In the current state of Backstage, the discovery EntityProviders usually create this type of Location, and the old discovery processors worked in a similar way.
As an example, the GithubEntityProvider works this way (a sketch of the emitted Location follows the list):
- It fetches all of the repos of the organization.
- It filters these repos based on the provided configuration.
- It creates an optional Location entity for EVERY repo.
- It puts these Locations into the catalog.
- The catalog processing loop takes care of these and fetches the content of the provided URL.
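For illustration, each emitted Location is roughly shaped like this (the repo name and target URL are made-up examples):

```ts
// Roughly the shape of one Location the provider emits per repo; the name
// and target below are made-up examples.
const location = {
  apiVersion: 'backstage.io/v1alpha1',
  kind: 'Location',
  metadata: {
    name: 'generated-1a2b3c', // provider-generated name
  },
  spec: {
    type: 'url',
    target: 'https://github.com/acme/repo-1/blob/main/foo.yaml',
    presence: 'optional', // a missing target is tolerated, not an error
  },
};
```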
If your organization has 100 repositories and you configure your entity provider to discover a file called foo.yaml (which doesn’t exist in any of your repos), it will create 100 Locations in your catalog. All of these Locations will have spawned no entities and will carry no errors, because the UrlReaderProcessor does not emit 404 errors for optional Locations.
The problem: these Locations will be processed like any other entity and refreshed on the same cadence, but they are essentially useless. They spawned no other entities and have no errors attached. Moreover, when the EntityProvider refreshes, it will create these Locations again and they will be processed again at that point anyway, so I don’t think we should keep them in the catalog. They just exhaust rate limits and fill up the processing queue.
Proposal
I propose to introduce a mechanism in the catalog-backend that is responsible for keeping these entities out of the catalog.
I think DefaultCatalogProcessingEngine.ts could be updated to handle rogue Locations and simply delete them from the refresh_state table when the right condition applies.
Something like this:
```ts
if (
  result.ok &&
  result.completedEntity.kind === 'Location' &&
  result.completedEntity.spec?.presence === 'optional' &&
  // A rogue Location: it spawned no entities and produced no errors
  result.deferredEntities.length === 0 &&
  result.errors.length === 0
) {
  // Drop the rogue Location from refresh_state so it is not refreshed again;
  // the entity provider will simply emit it again on its next run.
  // (entityRef and track come from the surrounding processing loop.)
  await this.processingDatabase.options.database.transaction(async tx => {
    await tx('refresh_state').where({ entity_ref: entityRef }).del();
  });
  track.markSuccessfulWithChanges(1);
  return;
}
```
As another implementation, I can imagine extending processingResult.ts with a “deleted” result type carrying an array of entityRefs; the processing engine would take care of deleting these entities at the end of the loop. In this case, we could add a custom processor that holds the logic for determining which entities should be deleted at the end of the processing loop. A rough sketch follows.
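A minimal sketch of that variant; the names are assumptions for illustration and are not part of the actual processingResult.ts API:

```ts
// Hypothetical 'deleted' result variant; names are assumptions, not the
// real Backstage processingResult API.
type DeletedProcessingResult = {
  type: 'deleted';
  entityRefs: string[]; // refs to remove from refresh_state after the loop
};

// A custom processor could emit it when it detects a rogue Location:
function deleted(entityRefs: string[]): DeletedProcessingResult {
  return { type: 'deleted', entityRefs };
}

// e.g. emit(deleted(['location:default/generated-1a2b3c']));
```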
Alternatives
As an alternative, we could mark these entities somehow, for example with an annotation, and make the UrlReaderProcessor aware of it so it skips them. This would result in fewer API calls, but the Locations would still fill up your catalog. A sketch of the check appears below.
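A minimal sketch of that check, assuming a hypothetical annotation name:

```ts
// Hypothetical annotation; the name is an assumption for illustration.
const ROGUE_LOCATION_ANNOTATION = 'backstage.io/rogue-location';

// Inside UrlReaderProcessor (or a wrapper around it), skip marked Locations:
function isMarkedRogue(entity: {
  metadata: { annotations?: Record<string, string> };
}): boolean {
  return entity.metadata.annotations?.[ROGUE_LOCATION_ANNOTATION] === 'true';
}
```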
Risks
The catalog’s DefaultProcessingEngine would become aware of the kind of entity it processes; that might be knowledge it shouldn’t have. The catalog would also delete from its own database at the end of the processing loop. I can see that this might confuse users who are already familiar with Backstage, but I do think it is a cleaner and better approach overall: the catalog would only contain entries that actually do something or have a reason to be there.
About this issue
- State: closed
- Created 2 years ago
- Reactions: 1
- Comments: 35 (30 by maintainers)
FYI: tomorrow, October 12th, at 10 AM EST, we’re hosting a discussion on this RFC. Feel free to pop into this thread for details https://discord.com/channels/687207715902193673/1029071173893492776
@andrewthauer, your processors will run exactly the same way because all processors run on all entities committed by all entity providers. It just happens faster because the processing loop is not clogged with location reading operations.
@andrewthauer I used the GraphQL query you suggested above to create a POC of an incremental GitHub Discovery Entity Provider. The meat of the discovery mechanism is between lines 103-176.
A few interesting takeaways from this experiment:
- Entities can keep ANNOTATION_ORIGIN_LOCATION, and we can use that to trigger the entity provider to refresh the entity (see the sketch after this list).
- The location mechanism and its processing are already quite opaque. I’ve used them for a few years and designed a bunch of processors, and I still don’t feel confident about how they work. Adding more rules that change the behavior of an already opaque system seems undesirable to me.
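For reference, reading that annotation looks roughly like this (ANNOTATION_ORIGIN_LOCATION is exported by @backstage/catalog-model; the helper name is mine):

```ts
import { ANNOTATION_ORIGIN_LOCATION, Entity } from '@backstage/catalog-model';

// Returns the origin location ref of an entity, which identifies what
// originally ingested it; a provider can use this to find and refresh
// its own entities.
function originOf(entity: Entity): string | undefined {
  return entity.metadata.annotations?.[ANNOTATION_ORIGIN_LOCATION];
}
```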
We could eliminate many of the problems caused by Locations by simply not using Locations. One argument for Locations is that they provide a place to store errors. I believe we can achieve the same result by capturing the ingestion error and committing that error as an IngestionError kind.
Niskso suggested in Discord that we could use URLReader to read locations in Entity Providers. This could work very well with the idea of returning an IngestionError kind: the URLReader would read the URL and return either the parsed entity or an IngestionError kind. The IngestionError entities would end up in the database, where they can be inspected. The result of this approach is that the processing loop would have no additional work to do. The error would automatically get removed and re-created on the next entity provider execution. We could even filter which errors we want to keep or throw away. A sketch of the proposed kind follows.
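A minimal sketch of what such an IngestionError entity could look like; this is a proposed kind, not an existing Backstage API, and all field names below are assumptions:

```ts
// Proposed IngestionError kind; shape and field names are assumptions.
const ingestionError = {
  apiVersion: 'backstage.io/v1alpha1',
  kind: 'IngestionError',
  metadata: {
    name: 'github-acme-repo-1-foo-yaml', // derived from the failed target
  },
  spec: {
    target: 'https://github.com/acme/repo-1/blob/main/foo.yaml',
    error: 'NotFoundError: could not read the target (404)',
  },
};
```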