istio: Envoy crashes when upgrading WASM modules under load
Bug description
I recently developed a C++ WASM module which manipulates response headers based on request metadata. The module seems to work fine under normal operation. However, I found that when upgrading it to a new version, every istio-proxy running it would crash pretty much in unison. The upgrade was performed by changing the URI pointing to the wasm module in the wasm EnvoyFilter. Note that I am using the wasm module distribution documented here: https://istio.io/latest/docs/ops/configuration/extensibility/wasm-module-distribution.
The issue did not reproduce if the cluster was not receiving a large amount of traffic. It appears as though something is being deleted (perhaps the wasm plugin context?) when the configuration changes, but the request instance still holds a handle to it. I have two different backtraces. I’ll paste the shorter one here, and attach the longer one.
2021-05-21T19:58:43.602418Z critical envoy backtrace Caught Segmentation fault, suspect faulting address 0x5645000bcc60
2021-05-21T19:58:43.602462Z critical envoy backtrace Backtrace (use tools/stack_decode.py to get line numbers):
2021-05-21T19:58:43.602469Z critical envoy backtrace Envoy version: 66da2bf864bde982351ee0ca2cae0a4e931f923c/1.17.1/Clean/RELEASE/BoringSSL
2021-05-21T19:58:43.604067Z critical envoy backtrace #0: __restore_rt [0x7fcef9aef980]
2021-05-21T19:58:43.638650Z critical envoy backtrace #1: Envoy::Extensions::Common::Wasm::Context::decodeData() [0x5644fb9e074b]
2021-05-21T19:58:43.673538Z critical envoy backtrace #2: Envoy::Http::FilterManager::decodeData() [0x5644fcbcdcf6]
2021-05-21T19:58:43.703375Z critical envoy backtrace #3: Envoy::Http::ConnectionManagerImpl::ActiveStream::decodeData() [0x5644fcbc0da6]
2021-05-21T19:58:43.730489Z critical envoy backtrace #4: Envoy::Http::Http2::ConnectionImpl::onFrameReceived() [0x5644fcbee96e]
2021-05-21T19:58:43.757338Z critical envoy backtrace #5: Envoy::Http::Http2::ConnectionImpl::Http2Callbacks::Http2Callbacks()::$_18::__invoke() [0x5644fcbf5888]
2021-05-21T19:58:43.784394Z critical envoy backtrace #6: nghttp2_session_on_data_received [0x5644fcd4a5ac]
2021-05-21T19:58:44.198038Z error Epoch 0 exited with error: signal: segmentation fault (core dumped)
[ ] Docs [ ] Installation [ X ] Networking [ ] Performance and Scalability [ ] Extensions and Telemetry [ ] Security [ ] Test and Release [ ] User Experience [ ] Developer Infrastructure [ ] Upgrade
Expected behavior
New WASM plugin loaded without any request loss. It is acceptable for existing requests to continue to run against the old plugin.
Steps to reproduce the bug
- Configure istio to use the wasm module distribution (https://istio.io/latest/docs/ops/configuration/extensibility/wasm-module-distribution/) to load a wasm module.
- Send requests at a high rate (e.g. while [ 1 ]; do curl https://my-site.example.com/; done).
- Change the wasm module to point to a different version.
- Watch proxy logs. See it crash and restart. See all in-flight requests drop.
Version (include the output of istioctl version --remote and kubectl version --short and helm version --short if you used Helm):
In my GKE env:
istioctl version --remote
client version: 1.9.5
control plane version: 1.9.5
data plane version: 1.9.5 (50 proxies)
In my microk8s env:
istioctl version
client version: 1.9.5
control plane version: 1.9.5
data plane version: 1.9.5 (21 proxies)
How was Istio installed? Used istioctl install -f <config-file>
Environment where the bug was observed (cloud vendor, OS, etc)
- GKE
- independently, microk8s on Ubuntu 20.04
About this issue
- State: closed
- Created 3 years ago
- Comments: 17 (10 by maintainers)
Commits related to this issue
- wasm: pass PluginHandle to stream contexts. (#16795) Previously, stream contexts (i.e. filter instances) are not holding shared ptrs to Wasm plugins which take ownership of Wasm VMs, and only holding... — committed to envoyproxy/envoy by mathetake 3 years ago
- wasm: pass PluginHandle to stream contexts. (#16795) Previously, stream contexts (i.e. filter instances) are not holding shared ptrs to Wasm plugins which take ownership of Wasm VMs, and only holding... — committed to chrisxrepo/envoy by mathetake 3 years ago
- wasm: pass PluginHandle to stream contexts. (#16795) Previously, stream contexts (i.e. filter instances) are not holding shared ptrs to Wasm plugins which take ownership of Wasm VMs, and only holding... — committed to leyao-daily/envoy by mathetake 3 years ago
Update: I managed to reproduce this locally without Istio, on both Envoy’s HEAD and istio/proxy, but haven’t identified the root cause yet… stay tuned.
@klarose I’m pretty confident that this is an Envoy-specific issue and not a problem with the Wasm module, the VM, or the compiler.