istio: Envoy crashes when upgrading WASM modules under load

Bug description

I recently developed a c++ WASM module which manipulates response headers based on request metadata. The module seems to work fine under normal operation. However, I found that when upgrading it to a new version, every istio-proxy running it would crash pretty much in unison. The upgrade was performed by changing the uri pointing to the wasm module in the wasm EnvoyFilter. Note that I am using the wasm module distribution documented here: https://istio.io/latest/docs/ops/configuration/extensibility/wasm-module-distribution.

The issue did not reproduce if the cluster was not receiving a large amount of traffic. It appears as though something is being deleted (perhaps the wasm plugin context?) when the configuration changes, but the request instance still holds a handle to it. I have two different backtraces. I’ll paste the shorter one here, and attach the longer one.

2021-05-21T19:58:43.602418Z	critical	envoy backtrace	Caught Segmentation fault, suspect faulting address 0x5645000bcc60
2021-05-21T19:58:43.602462Z	critical	envoy backtrace	Backtrace (use tools/stack_decode.py to get line numbers):
2021-05-21T19:58:43.602469Z	critical	envoy backtrace	Envoy version: 66da2bf864bde982351ee0ca2cae0a4e931f923c/1.17.1/Clean/RELEASE/BoringSSL
2021-05-21T19:58:43.604067Z	critical	envoy backtrace	#0: __restore_rt [0x7fcef9aef980]
2021-05-21T19:58:43.638650Z	critical	envoy backtrace	#1: Envoy::Extensions::Common::Wasm::Context::decodeData() [0x5644fb9e074b]
2021-05-21T19:58:43.673538Z	critical	envoy backtrace	#2: Envoy::Http::FilterManager::decodeData() [0x5644fcbcdcf6]
2021-05-21T19:58:43.703375Z	critical	envoy backtrace	#3: Envoy::Http::ConnectionManagerImpl::ActiveStream::decodeData() [0x5644fcbc0da6]
2021-05-21T19:58:43.730489Z	critical	envoy backtrace	#4: Envoy::Http::Http2::ConnectionImpl::onFrameReceived() [0x5644fcbee96e]
2021-05-21T19:58:43.757338Z	critical	envoy backtrace	#5: Envoy::Http::Http2::ConnectionImpl::Http2Callbacks::Http2Callbacks()::$_18::__invoke() [0x5644fcbf5888]
2021-05-21T19:58:43.784394Z	critical	envoy backtrace	#6: nghttp2_session_on_data_received [0x5644fcd4a5ac]
2021-05-21T19:58:44.198038Z	error	Epoch 0 exited with error: signal: segmentation fault (core dumped)

envoy-crash.txt

[ ] Docs [ ] Installation [ X ] Networking [ ] Performance and Scalability [ ] Extensions and Telemetry [ ] Security [ ] Test and Release [ ] User Experience [ ] Developer Infrastructure [ ] Upgrade

Expected behavior

New WASM plugin loaded without any request loss. It is acceptable for existing requests to continue to run against the old plugin.

Steps to reproduce the bug

  1. Configure istio to use the wasm module distribution (https://istio.io/latest/docs/ops/configuration/extensibility/wasm-module-distribution/) to load a wasm module.
  2. Send requests at a high rate (e.g. while [ 1 ]; do curl https://my-site.example.com/; done).
  3. Change the wasm module to point to a different version.
  4. Watch proxy logs. See it crash and restart. See all in-flight requests drop.

Version (include the output of istioctl version --remote and kubectl version --short and helm version --short if you used Helm):

In my GKE env:

istioctl version --remote
client version: 1.9.5
control plane version: 1.9.5
data plane version: 1.9.5 (50 proxies)

In my microk8s env:

istioctl version
client version: 1.9.5
control plane version: 1.9.5
data plane version: 1.9.5 (21 proxies)

How was Istio installed? Used istioctl install -f <config-file>

Environment where the bug was observed (cloud vendor, OS, etc)

  • GKE
  • independently, microk8s on Ubuntu 20.04

About this issue

  • Original URL
  • State: closed
  • Created 3 years ago
  • Comments: 17 (10 by maintainers)

Commits related to this issue

Most upvoted comments

Update: succeeded to reproduce locally without Istio and for both Envoy’s HEAD and istio/proxy, but couldn’t identify the root cause… stay tuned.

@klarose I’m pretty confident that this is an Envoy specific issue and not of Wasm/VM/Compiler’s.