egeria: [BUG] Egeria needs to detect Kafka current status and try to reestablish connection or report an error when wkc cluster is rebooted or services started in random/wrong order

Is there an existing issue for this?

  • I have searched the existing issues

Current Behavior

  • After the Glossary pod is restarted before the Kafka pod, the OMRS sync is broken. However, Egeria logs that OMRS startup has completed and that it is running.
  • This issue originated from a customer issue and has been recreated in-house in wkc 4.0 as well as in the upcoming wkc 4.5 release QA build.

The problem scenario(s):

  1. Customers usually reboot their Cloud Pak for Data wkc cluster(s) once every two weeks. Each time, they have observed that the OMRS connection is broken after rebooting the cluster. This is because the services start in the wrong order. The right order would be [Kafka, OMAG, CAMS and BG].
  2. We are able to recreate the same broken-sync issue in-house by bringing down both the Kafka and Glossary service pods, then starting the Glossary service pod first and the Kafka pod second.
  3. It is also possible for Kafka to go down and come back up at any time after the services have been started in the right order.

Expected Behavior

  • In such cases (services started in the wrong order, Glossary before Kafka), I think Egeria needs to detect the situation (the current status of Kafka), re-establish the connection between Kafka and Glossary, and then send OMRS events to Kafka.
  • In the worst case, if it is not possible to detect and re-establish a Kafka connection that is broken for some reason (rebooting the cluster, starting the services in the wrong order, etc.), Egeria should throw an error back to the caller reporting the problem.
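To illustrate the behaviour requested above, here is a minimal sketch of "retry, then report" using only the Java standard library. It is not Egeria code: `ReconnectSketch`, `awaitBroker` and `BrokerUnavailableException` are hypothetical names, and the `probe` argument stands in for whatever check actually determines that Kafka is usable.

```java
import java.time.Duration;
import java.util.function.BooleanSupplier;

/** Hypothetical helper (not part of Egeria): waits for a broker using bounded, backed-off retries. */
final class ReconnectSketch {

    /** Reported to the caller when the broker never becomes reachable. */
    public static final class BrokerUnavailableException extends RuntimeException {
        public BrokerUnavailableException(String message) {
            super(message);
        }
    }

    private ReconnectSketch() {
    }

    /**
     * Calls the probe up to maxAttempts times, doubling the delay between
     * attempts (capped at 30 seconds). Returns the attempt number that
     * succeeded, or throws so the caller sees the failure.
     */
    public static int awaitBroker(BooleanSupplier probe, int maxAttempts, Duration initialDelay) {
        if (maxAttempts < 1) {
            throw new IllegalArgumentException("maxAttempts must be at least 1");
        }
        long delayMillis = initialDelay.toMillis();
        for (int attempt = 1; attempt <= maxAttempts; attempt++) {
            if (probe.getAsBoolean()) {
                return attempt;
            }
            if (attempt < maxAttempts) {
                try {
                    Thread.sleep(delayMillis);
                } catch (InterruptedException e) {
                    // Preserve the interrupt flag and stop waiting.
                    Thread.currentThread().interrupt();
                    throw new BrokerUnavailableException("Interrupted while waiting for the Kafka broker");
                }
                delayMillis = Math.min(delayMillis * 2, 30_000L);
            }
        }
        throw new BrokerUnavailableException(
                "Kafka broker still unreachable after " + maxAttempts + " attempts");
    }
}
```

The point of the sketch is only the shape of the contract: either the connection comes back within a bounded number of attempts, or the caller receives an explicit error instead of a "startup completed" message.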

Steps To Reproduce

  • In the simplest case, stop both Kafka and Glossary, then start the Glossary service before Kafka to reproduce this issue.

Environment

- Egeria: 3.5 (in wkc 4.5) AND 2.11 (in wkc 4.0)
- OS:
- Java:
- Browser (for UI issues):
- Additional connectors and integration:

WKC Assembly: 4.5.0+20220508.231221.684

Any Further Information?

Reference to our glossary internal defect https://github.ibm.com/wdp-gov/tracker/issues/44957

About this issue

  • State: closed
  • Created 2 years ago
  • Comments: 26 (16 by maintainers)

Most upvoted comments

From the above snippet, I presume you’re referring to the fact that the Kafka connector does get launched? The trace is from method calls.

Currently setConfigurationStoreConnection neither returns a failure (it is void) nor raises an exception when we hit the Kafka error I posted above, where no brokers are resolvable. This is a bug, as mentioned above. In fact we have a task to go through and ensure the error condition is checked, up through initializeCohortMember() and activateWithStoredConfig(), so that if any aspect doesn’t initialize properly, and that failure isn’t handled asynchronously, we fail the startup. That way the caller can observe that startup has failed, and other status query calls (also above) will clearly report that the server never became ready. I completely agree with this change and will be looking at it.
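As a rough sketch of the fail-fast startup described here (not Egeria's actual code), the idea is that each initialization step may throw, and the first failure both aborts activation and is recorded so that later status queries report it. `StartupSketch`, `Step`, `Action`, `Status` and `StartupFailedException` are all hypothetical names introduced for illustration.

```java
import java.util.List;

/** Hypothetical sketch: the first failing initialization step fails the whole startup. */
final class StartupSketch {

    public enum Status { NOT_STARTED, RUNNING, FAILED }

    /** An initialization action that is allowed to throw. */
    @FunctionalInterface
    public interface Action {
        void run() throws Exception;
    }

    /** A named initialization step, e.g. "cohort topic connection". */
    public static final class Step {
        final String name;
        final Action action;

        public Step(String name, Action action) {
            this.name = name;
            this.action = action;
        }
    }

    public static final class StartupFailedException extends RuntimeException {
        public StartupFailedException(String message, Throwable cause) {
            super(message, cause);
        }
    }

    private Status status = Status.NOT_STARTED;

    /** Runs the steps in order; any exception marks the server FAILED and is rethrown. */
    public void activate(List<Step> steps) {
        for (Step step : steps) {
            try {
                step.action.run();
            } catch (Exception e) {
                status = Status.FAILED;
                throw new StartupFailedException(
                        "Startup step '" + step.name + "' failed: " + e.getMessage(), e);
            }
        }
        status = Status.RUNNING;
    }

    /** What a status query would report after activation was attempted. */
    public Status status() {
        return status;
    }
}
```

With this shape, a step that cannot resolve any brokers surfaces as a `StartupFailedException` to whoever requested the start, and `status()` returns `FAILED` rather than `RUNNING`.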

I’ll also check with debug how the KafkaAdminClient is managed to see if there are any improvements that can be made there.

However, it’s still necessary, or at least desirable, to ensure the service is available prior to starting the Egeria server. The change would mean you’ll correctly see the failure and can respond, but you’ll need to act upon it.
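One way to check availability before starting the server is a plain TCP connect to the broker address, sketched below with the Java standard library only. `BrokerProbe` and `isReachable` are hypothetical names, and the host and port are whatever your deployment uses. Note that an open port only shows that something is listening; it does not prove the Kafka broker is healthy.

```java
import java.io.IOException;
import java.net.InetSocketAddress;
import java.net.Socket;

/** Hypothetical pre-start check: can a TCP connection to the broker address be opened? */
final class BrokerProbe {

    private BrokerProbe() {
    }

    /** Returns true if a TCP connection to host:port succeeds within the timeout. */
    public static boolean isReachable(String host, int port, int timeoutMillis) {
        try (Socket socket = new Socket()) {
            socket.connect(new InetSocketAddress(host, port), timeoutMillis);
            return true;
        } catch (IOException e) {
            // Covers connection refused, timeouts and unresolvable host names.
            return false;
        }
    }
}
```

A startup script or init container could loop on such a check and only start the Glossary/Egeria service once it returns true.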

Does that capture your point?