envoy: Hot restart failing often

Hi,

We’re seeing different types of failures during hot restarts of envoy. I’m listing them below -

  1. A “Previous envoy process is still initializing” error. A state file is shared between epochs, and the previous epoch writes a flag into it; a new epoch that comes up as part of a hot restart (i.e. it was passed the --restart-epoch flag) checks this flag during initialization. If it finds the flag set to “INITIALIZING”, it quits and the hot restart fails (a rough, hypothetical sketch of this kind of check is included after the log output below). We see this almost every time. There are two sub-scenarios in which this error is observed -

             a. The previous epoch either crashes or is terminated with a SIGTERM and seems to fail to update the state file, causing the new epoch to quit. We don't see core dumps even if we use debug images and set ulimit -c unlimited, so the previous epoch is probably not crashing.
             b. The previous epoch remains alive and yet fails to update the state file, causing the new epoch to quit.
    
  2. The hot restart fails with a “process failed to bind to port <x>” error. This probably happens because the previous epoch is still exiting and is therefore still bound to the listener port (say 0.0.0.0:80), while the new epoch goes ahead, tries to bind to that port, and fails. The previous epoch eventually exits, but the new epoch doesn’t recover.

  3. The hot restart fails with an assert that is raised when the new epoch fails to send an RPC message to the previous epoch over a unix domain socket. Our conjecture is that the previous epoch quit too early (or crashed) without the new epoch knowing about it, so the new epoch still tried to reach it over the socket (see the second sketch below). The stack trace in the envoy log looks like this -

[2017-08-24 18:58:38.034][323][warning][main] source/server/server.cc:151] initializing epoch 5 (hot restart version=8.2490552)
[2017-08-24 18:58:38.034][323][info][main] source/server/server.cc:166] admin address: 0.0.0.0:8001
[2017-08-24 18:58:38.034][323][critical][assert] source/server/hot_restart_impl.cc:308] assert failure: rc != -1
[2017-08-24 18:58:38.034][323][critical][backtrace] bazel-out/local-opt/bin/source/server/_virtual_includes/backtrace_lib/server/backtrace.h:101] Caught Aborted, suspect faulting address 0x143
[2017-08-24 18:58:38.035][323][critical][backtrace] bazel-out/local-opt/bin/source/server/_virtual_includes/backtrace_lib/server/backtrace.h:85] Backtrace obj</usr/glibc-compat/lib/libc.so.6> thr<0> (use tools/stack_decode.py):
[2017-08-24 18:58:38.035][323][critical][backtrace] bazel-out/local-opt/bin/source/server/_virtual_includes/backtrace_lib/server/backtrace.h:95] thr<0> #0 0x7fdbf1c164e6
[2017-08-24 18:58:38.035][323][critical][backtrace] bazel-out/local-opt/bin/source/server/_virtual_includes/backtrace_lib/server/backtrace.h:95] thr<0> #1 0x7fdbf1c178d9
[2017-08-24 18:58:38.035][323][critical][backtrace] bazel-out/local-opt/bin/source/server/_virtual_includes/backtrace_lib/server/backtrace.h:93] thr<0> obj</usr/local/bin/envoy>
[2017-08-24 18:58:38.035][323][critical][backtrace] bazel-out/local-opt/bin/source/server/_virtual_includes/backtrace_lib/server/backtrace.h:95] thr<0> #2 0x54ca9d
[2017-08-24 18:58:38.035][323][critical][backtrace] bazel-out/local-opt/bin/source/server/_virtual_includes/backtrace_lib/server/backtrace.h:95] thr<0> #3 0x54d70e
[2017-08-24 18:58:38.035][323][critical][backtrace] bazel-out/local-opt/bin/source/server/_virtual_includes/backtrace_lib/server/backtrace.h:95] thr<0> #4 0x485cf0
[2017-08-24 18:58:38.035][323][critical][backtrace] bazel-out/local-opt/bin/source/server/_virtual_includes/backtrace_lib/server/backtrace.h:95] thr<0> #5 0x48775a
[2017-08-24 18:58:38.035][323][critical][backtrace] bazel-out/local-opt/bin/source/server/_virtual_includes/backtrace_lib/server/backtrace.h:95] thr<0> #6 0x42400b
[2017-08-24 18:58:38.035][323][critical][backtrace] bazel-out/local-opt/bin/source/server/_virtual_includes/backtrace_lib/server/backtrace.h:95] thr<0> #7 0x412f65
[2017-08-24 18:58:38.035][323][critical][backtrace] bazel-out/local-opt/bin/source/server/_virtual_includes/backtrace_lib/server/backtrace.h:93] thr<0> obj</usr/glibc-compat/lib/libc.so.6>
[2017-08-24 18:58:38.035][323][critical][backtrace] bazel-out/local-opt/bin/source/server/_virtual_includes/backtrace_lib/server/backtrace.h:95] thr<0> #8 0x7fdbf1c03f34
[2017-08-24 18:58:38.035][323][critical][backtrace] bazel-out/local-opt/bin/source/server/_virtual_includes/backtrace_lib/server/backtrace.h:93] thr<0> obj</usr/local/bin/envoy>
[2017-08-24 18:58:38.035][323][critical][backtrace] bazel-out/local-opt/bin/source/server/_virtual_includes/backtrace_lib/server/backtrace.h:95] thr<0> #9 0x41aa98
[2017-08-24 18:58:38.035][323][critical][backtrace] bazel-out/local-opt/bin/source/server/_virtual_includes/backtrace_lib/server/backtrace.h:97] end backtrace thread 0
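
To make scenario 1 concrete, here is a minimal, self-contained C++ sketch of the kind of check described in item 1. It is not Envoy's actual hot-restart code - the enum, struct, and message text are invented for illustration - it only shows a new epoch reading the previous epoch's state flag and giving up while that flag still reads INITIALIZING.

    // Hypothetical illustration only; not Envoy source. Shows a new epoch
    // inspecting the previous epoch's lifecycle flag in shared state and
    // refusing to continue with the hot restart while it is INITIALIZING.
    #include <cstdint>
    #include <cstdio>
    #include <cstdlib>

    // Made-up states; Envoy's real hot-restart state has its own layout.
    enum class EpochState : std::uint32_t { Initializing, Running, Draining };

    struct SharedState {
      EpochState state; // written by the previous epoch, read by the new one
    };

    // Returns true if the new epoch may proceed with the hot restart.
    bool previousEpochReady(const SharedState& shared) {
      if (shared.state == EpochState::Initializing) {
        // Failure modes 1a/1b: the previous epoch never advanced (or never
        // cleared) this flag, so the new epoch bails out.
        std::fprintf(stderr, "previous envoy process is still initializing\n");
        return false;
      }
      return true;
    }

    int main() {
      SharedState shared{EpochState::Initializing}; // simulate a stuck previous epoch
      return previousEpochReady(shared) ? EXIT_SUCCESS : EXIT_FAILURE;
    }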

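And here is an equally hypothetical sketch for scenario 3: a send over a unix domain socket whose return code is asserted on, which is roughly the shape of the “assert failure: rc != -1” from hot_restart_impl.cc in the trace above. The socket path and payload below are made up; the point is only that if the previous epoch has already exited (or crashed) and its socket is gone, the send returns -1 and the assert fires.

    // Hypothetical illustration only; not Envoy source. The socket path and
    // payload are placeholders.
    #include <cassert>
    #include <cstring>
    #include <sys/socket.h>
    #include <sys/un.h>
    #include <unistd.h>

    int main() {
      int fd = socket(AF_UNIX, SOCK_DGRAM, 0);
      assert(fd != -1);

      sockaddr_un peer{};
      peer.sun_family = AF_UNIX;
      // Placeholder path standing in for the previous epoch's domain socket.
      std::strncpy(peer.sun_path, "/tmp/previous_epoch.sock", sizeof(peer.sun_path) - 1);

      const char msg[] = "hypothetical-rpc";
      ssize_t rc = sendto(fd, msg, sizeof(msg), 0,
                          reinterpret_cast<const sockaddr*>(&peer), sizeof(peer));
      // If the previous epoch has already exited and nothing is bound at that
      // path, sendto() returns -1 (e.g. ENOENT/ECONNREFUSED) and this fires,
      // analogous to the assert in the backtrace above.
      assert(rc != -1);

      close(fd);
      return 0;
    }
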
Are these known issues with hot restarts? Having a reliable hot restart is critical for us, since we depend on it to avoid downtime when we create new L7 rules.

Also, we run envoy in a k8s environment, and we see more of these race/timing-related issues when envoy is deployed as a pod on k8s than when it runs in a plain docker environment.

We have built an image with a rudimentary test script that simply starts new epochs, alternating between two envoy-config.json files. You can follow these steps to reproduce the above issues -

  1. docker run -it vijayendrabvs/envoy:testhotrestart sh

  2. ./envoy_debug_restart.sh

Let the above script run to completion -

/ # + /usr/local/bin/envoy -c /etc/envoy-default.json --restart-epoch 5 -l trace

Then, check the logs generated in /var/log/envoy*

Most upvoted comments

@relistan we use the defaults at Lyft in production and they work fine. The problem is more nuanced. If you have a repro using current master I can take a look.