graphql-engine: Dead letter facility for event-trigger webhooks

We’d like a mechanism by which we can find out that webhooks (for event-triggers) are failing to run, i.e. that the retry process governed by max_retries and retry_interval has been exhausted without a successful delivery.

This “dead-letter queue” mechanism could perhaps be nominated on a global/installation basis.

Implemented how?

  • Brutalism: perhaps it is just a “well documented, semi-official SQL query” we can schedule to run regularly, which “selects” all failed event-triggers? This query would be against the extended Hasura metadata part of the database, I assume. Easy, but exposing implementation details is never a good idea.
  • Better: perhaps this dead-letter-queue should itself be a webhook to which failed events can be POSTed. Then it would be up to me, a developer, to ensure that this dead-letter webhook actually works via an alternative infrastructure to that which is failing elsewhere.
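To make the “Better” option concrete, here is a minimal sketch of what the receiving end of such a dead-letter webhook might look like, using only the Python standard library. The payload fields (`trigger`, `event_id`, `tries`), the port, and the in-memory store are all assumptions for illustration; nothing in this issue defines what Hasura would actually POST.

```python
import json
from http.server import BaseHTTPRequestHandler, HTTPServer

# Hypothetical in-memory store; a real receiver would persist somewhere durable.
DEAD_LETTERS = []


def record_dead_letter(raw_body: bytes, store: list) -> dict:
    """Parse a failed-event notification and append it to the store.

    The field names ("trigger", "event_id", "tries") are assumptions made
    for this sketch, not a payload format defined by Hasura.
    """
    event = json.loads(raw_body)
    entry = {
        "trigger": event.get("trigger"),
        "event_id": event.get("event_id"),
        "tries": event.get("tries", 0),
    }
    store.append(entry)
    return entry


class DeadLetterHandler(BaseHTTPRequestHandler):
    def do_POST(self):
        length = int(self.headers.get("Content-Length", 0))
        try:
            record_dead_letter(self.rfile.read(length), DEAD_LETTERS)
        except (ValueError, AttributeError):
            # Malformed JSON or a non-object body: reject it.
            self.send_response(400)
        else:
            self.send_response(204)
        self.end_headers()


if __name__ == "__main__":
    # Port 8081 is arbitrary; run this on infrastructure separate from
    # the webhooks it is meant to watch.
    HTTPServer(("0.0.0.0", 8081), DeadLetterHandler).serve_forever()
```

The point of keeping the receiver this small is that it should have as few ways to fail as possible, since it is the last line of defence when the primary webhooks are down.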

About this issue

  • State: open
  • Created 5 years ago
  • Reactions: 30
  • Comments: 15 (8 by maintainers)

Most upvoted comments

So, just to be clear: for webhooks to be robust, this feature is essential. Event-triggers are an important part of our architecture, and it is “a bad thing” that they can fail silently (perhaps because we simply misconfigured the URL).

So, I wouldn’t be labelling this issue as “an idea”. I’d say it more strongly and claim that it is a bug that this mechanism isn’t there already. Please forgive my pushy-ness.

It is a well-known pattern/need. For example, AWS’s SQS supports the idea of a dead-letter queue.

Any interest in doing this still? Would be a 🔥 feature

I agree with @mike-thompson-day8’s “Better” solution, i.e. to have a webhook that Hasura calls while marking an event as complete (failed or succeeded). This way users can capture the dead events, run their business logic, and update their database if needed.

@dionjwa We are scoping this feature out this quarter. It would be great to get your input on this. If you are interested, we can set up a time through email. My id is tiru@hasura.io

@mike-thompson-day8 We are beginning to work on this issue. Since a “fire and forget” notification would not work for reliability, we are thinking of providing a “monitoring” webhook instead. The monitoring webhook would periodically receive stats about the event trigger, including the number of pending and failed events, a breakdown by the hour, etc. Would that work for you?

Since the monitoring webhook is called periodically and indefinitely, you are effectively guaranteed to receive the notification at some point, even if individual calls are lost.
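As a rough illustration of how a consumer might act on such periodic stats, here is a small sketch that picks out triggers with failed events from one report. The report shape (`trigger`, `pending`, `failed`) is purely hypothetical; the thread does not specify the payload the monitoring webhook would receive.

```python
from dataclasses import dataclass


@dataclass
class TriggerStats:
    # Hypothetical report shape; the thread does not define the real payload.
    trigger: str
    pending: int
    failed: int


def triggers_needing_attention(report: list, max_failed: int = 0) -> list:
    """Return the names of triggers whose failed count exceeds max_failed."""
    stats = [TriggerStats(**item) for item in report]
    return [s.trigger for s in stats if s.failed > max_failed]
```

Because each report would carry the current counts rather than a one-off event, a missed report costs nothing: the next one contains the same failures, which is what makes this approach more reliable than a single notification.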