sensu-go: Sensu Go WebUI Randomly Crashing

After some (seemingly random) amount of time, the Sensu Go WebUI crashes and stays down until the entire sensu-backend service is restarted.

(Follow-up to https://github.com/sensu/sensu-go/issues/4070)

Expected Behavior

Sensu Go WebUI should be reliable and not randomly crash.

Current Behavior

The Sensu Go WebUI crashes; no process is listening on TCP port 3000 anymore.
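
The symptom can be confirmed with something like the following (assuming the default WebUI port of 3000):

  # Is anything still listening on the WebUI port?
  ss -ltn | grep ':3000'

  # Probe the WebUI directly and print the HTTP status code
  curl -s -o /dev/null -w '%{http_code}\n' http://localhost:3000/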

Steps to Reproduce (for bugs)

  1. Start sensu-backend: systemctl start sensu-backend
  2. Try to access the WebUI: success 😃
  3. Wait a bit (typically 1-2 hours)
  4. Try to access the WebUI again: failure 😦

Context

Attached to this issue is a log file containing entries from around the time the WebUI crashed on my system.

Your Environment

  • Sensu version used (sensuctl, sensu-backend, and/or sensu-agent):
ii  sensu-go-agent                6.2.0-3888                        amd64        Sensu Go Agent
ii  sensu-go-backend              6.2.0-3888                        amd64        Sensu Go Backend
  • Installation method (packages, binaries, docker etc.): Packages (deb)
  • Operating System and version (e.g. Ubuntu 14.04): Ubuntu 20.04
  • Virtualization type: Linux container running on Proxmox (uname -a: Linux sensu-test 5.4.78-2-pve #1 SMP PVE 5.4.78-2 (Thu, 03 Dec 2020 14:26:17 +0100) x86_64 x86_64 x86_64 GNU/Linux)

If you need additional information, feel free to ask 😃

About this issue

  • State: open
  • Created 4 years ago
  • Comments: 28 (6 by maintainers)

Most upvoted comments

@echlebek So far so good

root@sensu-go:~# date
Wed Mar 17 08:54:43 CET 2021
root@sensu-go:~# ps -ef|grep sensu
sensu    20304     1 22 Mar11 ?        1-07:22:50 /usr/sbin/sensu-backend start -c /etc/sensu/backend.yml

Don’t want to celebrate too soon, but looks like your suggestion helped. Thanks!

Hi folks, I wanted to let you all know that we’ve found an issue that is likely linked to what you’ve been experiencing.

The issue is in the commercial distribution of the software and doesn’t impact the OSS project, so unfortunately I don’t have an issue to reference here. I’ll try to explain what can sometimes happen.

When Sensu’s etcd client times out after several retries, it propagates an error to the various services that sensu-backend provides. Since the error is unrecoverable, the backend tells all of its services to stop and start again with a new etcd client. This happens whether etcd runs embedded within sensu-backend or externally, with sensu-backend acting as a client only (--no-embed-etcd).

Unfortunately, we found a bug in one of the services that causes it to hang on shutdown. This leaves sensu-backend only partially running and mostly broken. The most obvious impact is that the web UI is no longer running, but other services will have failed to start up as well.

We’ve fixed this issue and the fix will ship in 6.3.0, but those of you encountering frequent crashes will still need to take some steps to avoid problems; a non-functioning etcd will keep causing trouble even with this bug resolved.

If you are running etcd on network-attached storage, you need to increase the heartbeat timeout.

Due to its design, etcd is extremely sensitive to I/O latency, and its defaults assume that it will be deployed on locally attached, low-latency storage. By default, etcd’s heartbeat timeout (what the etcd documentation calls the election timeout) is only 1 second. This is also why etcd issues warnings for reads that take longer than 100 milliseconds.

Even small installations should deploy to locally attached SSDs, but if you really want to deploy to network-attached storage, you can improve stability by increasing the heartbeat interval and heartbeat timeout. I would suggest increasing the heartbeat interval to 1 second and the timeout to 10 seconds. Note that even with these increased values, etcd will still (erroneously) warn about reads that take longer than 100 ms; this can be considered normal.
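
As a rough sketch, assuming a sensu-backend version that exposes the etcd tuning options (check sensu-backend start --help for the exact names in your version), the suggested values translate to something like:

  # Embedded etcd: set the values in /etc/sensu/backend.yml (both in milliseconds)
  etcd-heartbeat-interval: 1000
  etcd-election-timeout: 10000

  # External etcd (--no-embed-etcd): pass the equivalent flags to etcd itself
  etcd --heartbeat-interval=1000 --election-timeout=10000

Restart sensu-backend (or etcd) after changing these values for them to take effect.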

Increasing the heartbeat timeout does come at a cost: when your etcd cluster experiences an election, it will take longer for service to be restored than with tighter timeouts. Be mindful when adjusting the timeout that this number represents the minimum amount of time it will take the cluster to restore service after an election.

These issues can still occur even if you have a single-node etcd cluster, since a single-node cluster can still experience elections.

You can read more about tuning etcd here, including guidelines for setting the heartbeat timeout: https://etcd.io/docs/v3.4.0/tuning/
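
One way to keep an eye on whether your storage is keeping up, assuming the default embedded etcd client URL of http://localhost:2379, is to watch cluster health and etcd’s disk-latency metrics, for example:

  # Cluster health as seen by Sensu
  sensuctl cluster health

  # etcd disk-latency histograms (Prometheus format); sustained high
  # wal_fsync / backend_commit durations point at slow storage
  curl -s http://localhost:2379/metrics | grep -E 'wal_fsync|backend_commit'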

I genuinely hope this helps those of you who have been struggling with stability. We do hope to reduce our reliance on etcd going forward, as it can be quite finicky about its deployment environment. When deployed and tuned well, it is reliable and resilient, but getting there can be somewhat challenging!