zephyr: tests/net/socket/websocket segfaults on native_posix_64 in CI

Describe the bug

Hi, west build -p -b native_posix_64 tests/net/socket/websocket segfaults when built in CI, which blocks every PR that touches native_posix. The test used to be excluded from the build, but https://github.com/zephyrproject-rtos/zephyr/pull/54959/commits/55ac139aedc5a8517fa72cd7d71ec733260da4da re-enabled it.

This can be reproduced by pushing a PR that touches the test file, such as https://github.com/zephyrproject-rtos/zephyr/pull/56537. The test runs fine locally.

I managed to get a copy of the CI binary by hacking a build-and-upload-artifact step into zephyr-testing. It is available at https://github.com/zephyrproject-rtos/zephyr-testing/actions/runs/4611534899. It can be run locally and crashes here:

(gdb) run
Starting program: /home/fabiobaltieri/zephyrproject/zephyr/zephyr.elf 
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib/x86_64-linux-gnu/libthread_db.so.1".
[New Thread 0x7ffff7cd66c0 (LWP 956391)]
WARNING: Using a test - not safe - entropy source
[New Thread 0x7ffff74d56c0 (LWP 956392)]
[New Thread 0x7ffff6cd46c0 (LWP 956393)]
[New Thread 0x7ffff64d36c0 (LWP 956394)]
[Thread 0x7ffff7cd66c0 (LWP 956391) exited]
[New Thread 0x7ffff7cd66c0 (LWP 956395)]
[New Thread 0x7ffff5cb26c0 (LWP 956396)]
*** Booting Zephyr OS build v2.2.0-rc1-39481-ge379ff3f98d6 ***
Running TESTSUITE net_websocket
===================================================================
START - test_recv_10_byte
[New Thread 0x7ffff54b16c0 (LWP 956397)]

Thread 8 "zephyr.elf" received signal SIGSEGV, Segmentation fault.
[Switching to Thread 0x7ffff54b16c0 (LWP 956397)]
0x0000555555574c0d in websocket_recv_msg (ws_sock=ws_sock@entry=1431873952, buf=buf@entry=0x55555558ac00 <recv_buf> "", buf_len=buf_len@entry=1163, message_type=message_type@entry=0x7ffff54b0d54, remaining=remaining@entry=0x7ffff54b0d58, timeout=timeout@entry=0) at /__w/zephyr-testing/zephyr-testing/subsys/net/lib/websocket/websocket.c:886
886             ctx = test_data->ctx;
(gdb) bt
#0  0x0000555555574c0d in websocket_recv_msg (ws_sock=ws_sock@entry=1431873952, buf=buf@entry=0x55555558ac00 <recv_buf> "", buf_len=buf_len@entry=1163, message_type=message_type@entry=0x7ffff54b0d54, remaining=remaining@entry=0x7ffff54b0d58, timeout=timeout@entry=0) at /__w/zephyr-testing/zephyr-testing/subsys/net/lib/websocket/websocket.c:886
#1  0x0000555555558841 in test_recv_buf (feed_buf=feed_buf@entry=0x55555558a9c0 <feed_buf> "\201\214\341~\216\271\225\033\375\315\301\023\353ʒ\037\351", <incomplete sequence \334>, feed_len=feed_len@entry=10, ctx=ctx@entry=0x7ffff54b0d60, msg_type=msg_type@entry=0x7ffff54b0d54, remaining=remaining@entry=0x7ffff54b0d58, recv_buf=recv_buf@entry=0x55555558ac00 <recv_buf> "", recv_len=1163) at /__w/zephyr-testing/zephyr-testing/tests/net/socket/websocket/src/main.c:71
#2  0x000055555555917a in test_recv (count=10) at /__w/zephyr-testing/zephyr-testing/tests/net/socket/websocket/src/main.c:128
#3  0x00005555555604c5 in run_test_functions (suite=0x555555589e60 <z_ztest_test_node_net_websocket>, test=0x555555589e98 <z_ztest_unit_test.net_websocket.test_recv_10_byte>, data=0x0) at /__w/zephyr-testing/zephyr-testing/subsys/testsuite/ztest/src/ztest_new.c:226
#4  test_cb (a=0x555555589e60 <z_ztest_test_node_net_websocket>, a@entry=<error reading variable: value has been optimized out>, b=0x555555589e98 <z_ztest_unit_test.net_websocket.test_recv_10_byte>, b@entry=<error reading variable: value has been optimized out>, c=0x0, c@entry=<error reading variable: value has been optimized out>) at /__w/zephyr-testing/zephyr-testing/subsys/testsuite/ztest/src/ztest_new.c:551
#5  0x000055555555a30e in z_thread_entry (entry=<optimized out>, p1=<optimized out>, p2=<optimized out>, p3=<optimized out>) at /__w/zephyr-testing/zephyr-testing/lib/os/thread_entry.c:36
#6  0x000055555555e296 in posix_thread_starter (arg=0x5) at /__w/zephyr-testing/zephyr-testing/arch/posix/core/posix_core.c:305
#7  0x00007ffff7d62fd4 in start_thread (arg=<optimized out>) at ./nptl/pthread_create.c:442
#8  0x00007ffff7de366c in clone3 () at ../sysdeps/unix/sysv/linux/x86_64/clone3.S:81

Still digging into the root cause, but at this point I could use some help. 😃

cc @rlubos @cfriedt

To Reproduce

Steps to reproduce the behavior:

  1. https://github.com/zephyrproject-rtos/zephyr-testing/commit/e379ff3f98d6c17bc0e2ddb01aac24044f588897

Expected behavior

Test passes.

Impact

Blocking PRs that touch native_posix.

Environment (please complete the following information):

  • Whatever the CI is using.

About this issue

  • State: closed
  • Created a year ago
  • Comments: 15 (15 by maintainers)

Most upvoted comments

Apart from the discussion on how to reproduce the problem reliably, I’ve submitted https://github.com/zephyrproject-rtos/zephyr/pull/56558 to get rid of the erroneous pointer casts, and use FD objects instead to pass the test context data. Feel free to review.
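
For context on the pointer casts: the ws_sock value in the backtrace, 1431873952, is 0x5558a9a0, which matches the low 32 bits of an address in the same static region as feed_buf (0x55555558a9c0). That is consistent with a 64-bit pointer having been squeezed through the int socket argument, although the exact address 0x55555558a9a0 is an inference from the trace, not something the report confirms. The sketch below, in plain C, models that truncation and then shows the general idea of passing a small integer handle that is looked up in a table instead. The names through_int, handle_alloc, handle_get and struct test_ctx are hypothetical and are not Zephyr's FD API or the code in the PR.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Model of passing a 64-bit address through a 32-bit int argument.
 * Only the low 32 bits survive, so the callee rebuilds a different address.
 */
static uint64_t through_int(uint64_t addr)
{
	uint32_t low = (uint32_t)addr; /* what fits in a 32-bit int */

	return (uint64_t)low;          /* what the callee turns back into a pointer */
}

/* Hypothetical stand-in for a descriptor table: small int -> object pointer. */
#define MAX_HANDLES 4

struct test_ctx {
	int id; /* placeholder for whatever the test needs to pass along */
};

static void *handle_table[MAX_HANDLES];

/* Store obj in the first free slot and return its index, or -1 if full. */
static int handle_alloc(void *obj)
{
	for (int i = 0; i < MAX_HANDLES; i++) {
		if (handle_table[i] == NULL) {
			handle_table[i] = obj;
			return i;
		}
	}

	return -1;
}

/* Return the object behind a handle, or NULL for an out-of-range or free slot. */
static void *handle_get(int handle)
{
	if (handle < 0 || handle >= MAX_HANDLES) {
		return NULL;
	}

	return handle_table[handle];
}
```

With these helpers, through_int(0x55555558a9a0) returns 0x5558a9a0 (1431873952 in decimal), the value gdb shows for ws_sock. A handle returned by handle_alloc() always fits in an int, so handle_get() returns the original pointer regardless of where the object lives in memory. This may also explain why the failure depends on the build environment: the round trip only works when the object happens to sit below 4 GiB.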