runtime: Frequent WebSocket Compression Segfault
Description
I run a service that heavily uses WebSockets, with anywhere from 3-8k concurrent WebSocket connections per server. These are long-lived connections; most last from 12 hours to a few days. I enabled WebSocket compression because message size and latency are critical for my application, but I have been getting a very frequent crash because of it.
I'm running Ubuntu Server 22.04 with the most up-to-date Microsoft package repo versions of dotnet 7.0, aspnet, etc. My service uses the WebSocket class directly; I'm not using any middleware for handling the streams. I have around eight servers running, and I see about 1-3 crashes per server per day due to this bug.
Please let me know if I can provide any more helpful debugging information. I’m able to attach lldb and wait for a segfault, so I can gather any other info needed.
Reproduction Steps
Set up an ASP.NET Core server that accepts WebSockets and enable WebSocket compression (a minimal sketch of this setup follows the stack trace below). Run the server as normal, allowing messages to be sent on the WebSockets and sockets to come and go. Eventually, the dotnet process will segfault with this stack:
OS Thread Id: 0x4242a (212)
Child SP IP Call Site
00007F2BDD7F9038 00007f2c4d07f401 [InlinedCallFrame: 00007f2bdd7f9038] Interop+ZLib.Deflate(ZStream*, FlushCode)
00007F2BDD7F9038 00007f6c8109b576 [InlinedCallFrame: 00007f2bdd7f9038] Interop+ZLib.Deflate(ZStream*, FlushCode)
00007F2BDD7F9030 00007F6C8109B576 System.Net.WebSockets.Compression.WebSocketDeflater.Deflate(ZLibStreamHandle, FlushCode)
00007F2BDD7F90C0 00007F6C810F4458 System.Net.WebSockets.Compression.WebSocketDeflater.UnsafeFlush(System.Span`1<Byte>, Boolean ByRef)
00007F2BDD7F90F0 00007F6C810F4268 System.Net.WebSockets.Compression.WebSocketDeflater.DeflatePrivate(System.ReadOnlySpan`1<Byte>, System.Span`1<Byte>, Boolean, Int32 ByRef, Int32 ByRef, Boolean ByRef)
00007F2BDD7F9140 00007F6C810F4012 System.Net.WebSockets.Compression.WebSocketDeflater.Deflate(System.ReadOnlySpan`1<Byte>, Boolean)
00007F2BDD7F91C0 00007F6C80BCBD17 System.Net.WebSockets.ManagedWebSocket.WriteFrameToSendBuffer(MessageOpcode, Boolean, Boolean, System.ReadOnlySpan`1<Byte>)
00007F2BDD7F9210 00007F6C80BCB3C8 System.Net.WebSockets.ManagedWebSocket+<SendFrameFallbackAsync>d__58.MoveNext()
00007F2BDD7F9310 00007F6C80BCB0C7 System.Runtime.CompilerServices.AsyncMethodBuilderCore.Start[[System.Net.WebSockets.ManagedWebSocket+<SendFrameFallbackAsync>d__58, System.Net.WebSockets]](<SendFrameFallbackAsync>d__58 ByRef)
00007F2BDD7F9350 00007F6C80BCAFE4 System.Net.WebSockets.ManagedWebSocket.SendFrameFallbackAsync(MessageOpcode, Boolean, Boolean, System.ReadOnlyMemory`1<Byte>, System.Threading.Tasks.Task, System.Threading.CancellationToken)
00007F2BDD7F9400 00007F6C80BCAB52 System.Net.WebSockets.ManagedWebSocket.SendAsync(System.ReadOnlyMemory`1<Byte>, System.Net.WebSockets.WebSocketMessageType, System.Net.WebSockets.WebSocketMessageFlags, System.Threading.CancellationToken)
...
<service specific code that's writing to the websocket>
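For reference, a minimal sketch of this setup (the endpoint and echo loop are illustrative, not my actual service code):

```csharp
// Program.cs - an ASP.NET Core 7 server accepting WebSockets with per-message deflate.
using System.Net.WebSockets;

var app = WebApplication.CreateBuilder(args).Build();
app.UseWebSockets();

app.Map("/ws", async (HttpContext context) =>
{
    if (!context.WebSockets.IsWebSocketRequest)
    {
        context.Response.StatusCode = StatusCodes.Status400BadRequest;
        return;
    }

    // Compression is opted into per connection via DangerousDeflateOptions.
    using WebSocket socket = await context.WebSockets.AcceptWebSocketAsync(
        new WebSocketAcceptContext
        {
            DangerousDeflateOptions = new WebSocketDeflateOptions()
        });

    // Echo loop standing in for the service-specific code at the bottom of the stack.
    var buffer = new byte[64 * 1024];
    while (socket.State == WebSocketState.Open)
    {
        WebSocketReceiveResult result = await socket.ReceiveAsync(
            new ArraySegment<byte>(buffer), CancellationToken.None);
        if (result.MessageType == WebSocketMessageType.Close)
            break;

        await socket.SendAsync(buffer.AsMemory(0, result.Count), result.MessageType,
                               result.EndOfMessage, CancellationToken.None);
    }
});

app.Run();
```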
Expected behavior
The dotnet process shouldn’t crash; at a minimum, it should throw a managed exception.
Actual behavior
After my service has been running for some time, the process segfaults with the stack above.
Regression?
This has been happening for a while; I'm not sure when it started, but it has been going on for at least a few months.
Known Workarounds
None. 😦
Configuration
Latest dotnet and aspnet packages from the Microsoft package repo for Ubuntu Jammy. I also have the most up-to-date system packages.
dotnet SDK 7.0.302, dotnet runtime 7.0.5, aspnet 7.0.5, Ubuntu 22.04.2 LTS (Jammy Jellyfish), x64.
Other information
No response
About this issue
- Original URL
- State: closed
- Created a year ago
- Comments: 24 (14 by maintainers)
I've privately received another dump (thanks Sarah), and although for some reason it didn't contain enough information about the System.Net.WebSockets assembly, I was able to dump the memory layouts for objects near the faulting instruction. I think I found the culprit, and it isn't the deflate itself. This is the memory layout for the WebSocketDeflater that is causing the segfault:
The f4 ff ff ff indicates that it has been created with window bits = 12. The value before it is a pointer to the buffer we're using internally to compress the data into, 7f734152a8a0, which is 1 MB, but that is irrelevant at the moment. The interesting part is that the pointer which should point to the current zlib stream appears to be missing. This can only happen if the compression has already completed and we're not persisting the stream, or if the deflater has been disposed. We know, since the data pointer is not zeroes, that the object is in the middle of a compression right now. This leaves only the possibility that the stream got disposed in the middle of a compression operation.
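For completeness, here is the window-bits reading spelled out (illustrative snippet; x64 is little-endian, and zlib uses a negative value to request the raw deflate that permessage-deflate needs):

```csharp
// The four bytes from the dump, interpreted as a little-endian Int32.
byte[] raw = { 0xF4, 0xFF, 0xFF, 0xFF };
int windowBits = BitConverter.ToInt32(raw, 0);  // -12 on a little-endian machine
Console.WriteLine(windowBits);                  // negative => raw deflate with a 12-bit window
```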
And indeed, this can happen if the WebSocket is disposed while there is an in-flight Send operation (corrupting the deflater) or an in-flight Receive operation (corrupting the inflater).
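To make the sequence concrete, here is a hedged caller-side sketch (the method and payload are illustrative, not code from the report):

```csharp
using System.Net.WebSockets;

// Illustrative only: a compressed Send is still in flight on one code path while
// another code path disposes the socket.
static async Task TriggerDisposeRaceAsync(WebSocket socket, ReadOnlyMemory<byte> payload)
{
    ValueTask send = socket.SendAsync(payload, WebSocketMessageType.Binary,
                                      WebSocketMessageFlags.EndOfMessage,
                                      CancellationToken.None);

    socket.Dispose();   // can release the zlib stream while Deflate is still running

    await send;         // native deflate may now touch freed state -> SIGSEGV
}
```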
https://github.com/dotnet/runtime/blob/2e764d6eb092fc8f05c42e6f137347c3eab45fd0/src/libraries/System.Net.WebSockets/src/System/Net/WebSockets/ManagedWebSocket.cs#L208-L224
We aren't taking the StateUpdateLock when inflating/deflating data, and since Dispose() or Abort() can be, and are allowed to be, called concurrently, we need to guard against this.
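Something along these lines, purely as a sketch of the idea (not the actual PR; the members below are simplified stand-ins for the real ManagedWebSocket internals):

```csharp
// Sketch: serialize disposal against any in-flight deflate/inflate so the native
// zlib stream cannot be released out from under Interop.ZLib.Deflate.
internal sealed class CompressionDisposeGuard : IDisposable
{
    private readonly object _stateUpdateLock = new();   // stand-in for StateUpdateLock
    private bool _disposed;
    private bool _compressionInProgress;                 // set around Deflate/Inflate calls
    private IDisposable? _deflater;                      // stand-in for WebSocketDeflater
    private IDisposable? _inflater;                      // stand-in for WebSocketInflater

    public void Dispose()
    {
        lock (_stateUpdateLock)
        {
            _disposed = true;

            // If a Send/Receive is mid-compression, defer releasing the streams to that
            // operation (it would check _disposed when it finishes); otherwise free now.
            if (!_compressionInProgress)
            {
                _deflater?.Dispose();
                _inflater?.Dispose();
                _deflater = _inflater = null;
            }
        }
    }
}
```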
@CarnaViire @stephentoub please let me know what you think, and I will submit a PR to address the issue.
Wonderful, I will make a note to re-enable compression when I upgrade to dotnet 8!
It will definitely be in .NET 8 (and I believe it's already there starting from Preview 7). However, while we might backport it to .NET 7, we decided to postpone that until there are more reports.
Great! Sorry, I haven’t been able to help; I have been slammed. If you make a change that might fix the crash, I’m happy to test a private build!
@sossee2 I am 100% sure it is the same issue. The fact that you're seeing access violations with different call stacks, and that one of them originates from Dispose, only confirms it.
I will prepare a PR today; however, it's up to the .NET runtime team to decide whether it will be backported to .NET 7. There really isn't a workaround for this.
Triage: this is something we need to investigate and fix in 8.0.
Thanks a lot for helping out @zlatanov! It would be awesome if you could also keep us in the loop. I'll drop you an email @QuinnDamerell @zlatanov so we can chat offline.