backstage: [RFC] do not allow rogue Locations
Status: Open for comments
Need
I’d like to introduce a new term: rogue Locations. A rogue Location is an optional Location that does not manage any entities and does not have any errors attached. In the current state of Backstage, the discovery EntityProviders usually create this type of Location, and the old discovery processors worked in a similar way.
As an example, the GithubEntityProvider works this way (a sketch of the emitted Location follows the list):
- It fetches all of the repos of the organization.
- It filters these repos based on the provided configuration.
- It creates an optional Location entity for EVERY repo.
- It puts these Locations into the catalog.
- The catalog processing loop takes care of these and fetches the content of the provided URL.
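For illustration, each emitted Location is roughly shaped like this (the repo name and target URL are made-up examples):

```ts
// Roughly the shape of one Location the provider emits per repo; the name
// and target below are made-up examples.
const location = {
  apiVersion: 'backstage.io/v1alpha1',
  kind: 'Location',
  metadata: {
    name: 'generated-1a2b3c', // provider-generated name
  },
  spec: {
    type: 'url',
    target: 'https://github.com/acme/repo-1/blob/main/foo.yaml',
    presence: 'optional', // a missing target is tolerated, not an error
  },
};
```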
If your organization has 100 repositories and you configure your entity provider to discover a file called foo.yaml (which doesn’t exist in any of your repos), it will create 100 Locations in your catalog. All of these Locations will have spawned no entities and will carry no errors, because the UrlReaderProcessor does not emit 404 errors for optional Locations.
The problem: these Locations will be processed like any other entity and refreshed on the same cadence, but they are essentially useless. They spawned no other entities and have no errors attached. Moreover, when the EntityProvider refreshes, it will create these Locations again and they will be processed again at that point anyway, so I don’t think we should keep them in the catalog. They just exhaust rate limits and fill up the processing queue.
Proposal
I propose to introduce a mechanism in the catalog-backend that is responsible for keeping these entities out of the catalog.
I think DefaultCatalogProcessingEngine.ts could be updated to handle rogue Locations and simply delete them from the refresh_state table when the right condition applies.
Something like this:
```ts
if (
  result.ok &&
  result.completedEntity.kind === 'Location' &&
  result.completedEntity.spec?.presence === 'optional' &&
  // A rogue Location: it spawned no entities and produced no errors
  result.deferredEntities.length === 0 &&
  result.errors.length === 0
) {
  // Drop the rogue Location from refresh_state so it is not refreshed again;
  // the entity provider will simply emit it again on its next run.
  // (entityRef and track come from the surrounding processing loop.)
  await this.processingDatabase.options.database.transaction(async tx => {
    await tx('refresh_state').where({ entity_ref: entityRef }).del();
  });
  track.markSuccessfulWithChanges(1);
  return;
}
```
As another implementation, I can imagine extending processingResult.ts with a “deleted” result type carrying an array of entityRefs; the processing engine would take care of deleting these entities at the end of the loop. In this case, we could add a custom processor that holds the logic for determining which entities should be deleted at the end of the processing loop. A rough sketch follows.
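A minimal sketch of that variant; the names are assumptions for illustration and are not part of the actual processingResult.ts API:

```ts
// Hypothetical 'deleted' result variant; names are assumptions, not the
// real Backstage processingResult API.
type DeletedProcessingResult = {
  type: 'deleted';
  entityRefs: string[]; // refs to remove from refresh_state after the loop
};

// A custom processor could emit it when it detects a rogue Location:
function deleted(entityRefs: string[]): DeletedProcessingResult {
  return { type: 'deleted', entityRefs };
}

// e.g. emit(deleted(['location:default/generated-1a2b3c']));
```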
Alternatives
As an alternative, we could mark these entities somehow, for example with an annotation, and make the UrlReaderProcessor aware of it so it skips them. This would result in fewer API calls, but the Locations would still fill up your catalog. A sketch of the check appears below.
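A minimal sketch of that check, assuming a hypothetical annotation name:

```ts
// Hypothetical annotation; the name is an assumption for illustration.
const ROGUE_LOCATION_ANNOTATION = 'backstage.io/rogue-location';

// Inside UrlReaderProcessor (or a wrapper around it), skip marked Locations:
function isMarkedRogue(entity: {
  metadata: { annotations?: Record<string, string> };
}): boolean {
  return entity.metadata.annotations?.[ROGUE_LOCATION_ANNOTATION] === 'true';
}
```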
Risks
The catalog’s DefaultProcessingEngine would become aware of the kind of entity it processes; that might be knowledge it shouldn’t have. The catalog would also delete from its own database at the end of the processing loop. I can see that this might confuse users who are already familiar with Backstage, but I do think it is a cleaner and better approach overall: the catalog would only contain entries that actually do something or have a reason to be there.
About this issue
- State: closed
- Created 2 years ago
- Reactions: 1
- Comments: 35 (30 by maintainers)
FYI: tomorrow, October 12th, at 10 AM EST, we’re hosting a discussion on this RFC. Feel free to pop into this thread for details https://discord.com/channels/687207715902193673/1029071173893492776
@andrewthauer, your processors will run exactly the same way because all processors run on all entities committed by all entity providers. It just happens faster because the processing loop is not clogged with location reading operations.
@andrewthauer I used the GraphQL query you suggested above to create a POC of an incremental GitHub Discovery Entity Provider. The meat of the discovery mechanism is between lines 103-176.
A few interesting takeaways from this experiment:
- Entities can keep ANNOTATION_ORIGIN_LOCATION, and we can use that to trigger the entity provider to refresh the entity (see the sketch after this list).
- The location mechanism and its processing are already quite opaque. I’ve used them for a few years and designed a bunch of processors, and I still don’t feel confident about how they work. Adding more rules that change the behavior of an already opaque system seems undesirable to me.
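For reference, reading that annotation looks roughly like this (ANNOTATION_ORIGIN_LOCATION is exported by @backstage/catalog-model; the helper name is mine):

```ts
import { ANNOTATION_ORIGIN_LOCATION, Entity } from '@backstage/catalog-model';

// Returns the origin location ref of an entity, which identifies what
// originally ingested it; a provider can use this to find and refresh
// its own entities.
function originOf(entity: Entity): string | undefined {
  return entity.metadata.annotations?.[ANNOTATION_ORIGIN_LOCATION];
}
```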
We could eliminate many of the problems caused by Locations by simply not using Locations. One argument for Locations is that they provide a place to store errors. I believe we can achieve the same result by capturing the ingestion error and committing that error as an IngestionError kind.
Niskso suggested in Discord that we could use URLReader to read locations in Entity Providers. This could work very well with the idea of returning an IngestionError kind: the URLReader would read the URL and return either the parsed entity or an IngestionError kind. The IngestionError entities would end up in the database, where they can be inspected. The result of this approach is that the processing loop would have no additional work to do. The error would automatically get removed and re-created on the next entity provider execution. We could even filter which errors we want to keep or throw away. A sketch of the proposed kind follows.
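A minimal sketch of what such an IngestionError entity could look like; this is a proposed kind, not an existing Backstage API, and all field names below are assumptions:

```ts
// Proposed IngestionError kind; shape and field names are assumptions.
const ingestionError = {
  apiVersion: 'backstage.io/v1alpha1',
  kind: 'IngestionError',
  metadata: {
    name: 'github-acme-repo-1-foo-yaml', // derived from the failed target
  },
  spec: {
    target: 'https://github.com/acme/repo-1/blob/main/foo.yaml',
    error: 'NotFoundError: could not read the target (404)',
  },
};
```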