egl-wayland: failed to lock pthread mutex
Hi!
I’m hitting a segfault while trying to open Evolution (mail client) under Wayland. I’ve reported it on the Fedora Bugzilla [1], and since it looks like they can’t track it down, we decided to report it here.
Here’s the stack trace:
(gdb) bt full
#0 0x00007ffff3a9c625 in raise () at /lib64/libc.so.6
#1 0x00007ffff3a858d9 in abort () at /lib64/libc.so.6
#2 0x00007ffff3a857a9 in _nl_load_domain.cold () at /lib64/libc.so.6
#3 0x00007ffff3a94a66 in annobin_assert.c_end () at /lib64/libc.so.6
#4 0x00007fffe4109b7d in wlExternalApiLock () at ../src/wayland-thread.c:87
__PRETTY_FUNCTION__ = "wlExternalApiLock"
#5 0x00007fffe410e4ab in wlEglGetInternalHandleExport (dpy=0x5555566dad60, type=13233, handle=0x5555566dad60) at ../src/wayland-eglhandle.c:146
#6 0x00007fffd65574ef in () at /lib64/libEGL_nvidia.so.0
#7 0x00007fffd64deeeb in () at /lib64/libEGL_nvidia.so.0
#8 0x00007fffe410b752 in wl_eglstream_display_bind (data=data@entry=0x5555566cc5c0, wlDisplay=wlDisplay@entry=0x55555649b360, eglDisplay=eglDisplay@entry=0x5555566dad60)
at ../src/wayland-eglstream-server.c:311
wlStreamDpy = 0x555556b69f90
exts = 0x0
env = 0x0
#9 0x00007fffe410a355 in wlEglBindDisplaysHook (data=0x5555566cc5c0, dpy=0x5555566dad60, nativeDpy=0x55555649b360) at ../src/wayland-egldisplay.c:87
res = 0
#10 0x00007fffd65533f3 in () at /lib64/libEGL_nvidia.so.0
#11 0x00007fffd64db775 in () at /lib64/libEGL_nvidia.so.0
#12 0x00007ffff20f5b11 in WS::Instance::initialize(void*) () at /lib64/libWPEBackend-fdo-1.0.so.1
#13 0x00007ffff49c7bf6 in WebKit::WebProcessPool::platformInitializeWebProcess(WebKit::WebProcessProxy const&, WebKit::WebProcessCreationParameters&) (this=this@entry=0x7fffe42ee000, process=
..., parameters=...) at ../Source/WebKit/UIProcess/glib/WebProcessPoolGLib.cpp:119
#14 0x00007ffff489adfa in WebKit::WebProcessPool::initializeNewWebProcess(WebKit::WebProcessProxy&, WebKit::WebsiteDataStore*, WebKit::WebProcessProxy::IsPrewarmed)
(this=<optimized out>, process=..., websiteDataStore=0x7fffe42e4000, isPrewarmed=WebKit::WebProcessProxy::IsPrewarmed::No) at ../Source/WebKit/UIProcess/WebProcessPool.cpp:1044
initializationActivity = {m_ref = std::unique_ptr<WebKit::ProcessThrottler::Activity<(WebKit::ProcessThrottler::ActivityType)0>> = {get() = 0x0}}
parameters = <snip here>
If you need any more information, I’ll gladly help.
About this issue
- State: closed
- Created 4 years ago
- Reactions: 1
- Comments: 21
Can confirm the regression on Fedora 35 + GNOME + Wayland + Nvidia.
Fixed by 582b2d345abaa0e313cf16c902e602084ea59551
Can confirm too on Fedora 35 GNOME Wayland session and Nvidia driver
I have a fix prepared and just got sign-off on internal code review. I’ll upload it to this GitHub repo early next week.
Fix confirmed! Thanks a lot! This is how to install this version of egl-wayland (1.1.9-3) on Fedora 35:
sudo dnf update --enablerepo=updates-testing egl-wayland
I have the same bug on gnome-boxes, Fedora 35.
GDK_BACKEND=x11 also fixes the issue here.
If you disable the assertion and add some debug logic, it looks like the lock / unlock process fails twice after wlEglAcquireDisplay and wlEglReleaseDisplay respectively. After the initial evolution launch, no further mutex issues occur from my limited testing and Wayland functionality is 100%. This is definitely not a “fix”, but rather a simple workaround for Evolution specifically. I fear that there might be some sort of race condition, if possible, causing the mutex to get unlocked by the wrong thread.
log.txt
This is a full log of evolution starting and closing. Sorry about not including anything else useful, I’m not super familiar with C debugging.
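For anyone wanting to repeat the "log instead of assert" experiment, here is a minimal sketch of how an unlock from a non-owning thread could be detected. It assumes an error-checking mutex (PTHREAD_MUTEX_ERRORCHECK); the thread does not say which mutex type egl-wayland uses, and the names logged_unlock, demo_mutex and unlock_by_wrong_thread are made up for illustration and are not part of the library.

```c
#ifndef _POSIX_C_SOURCE
#define _POSIX_C_SOURCE 200809L
#endif
#include <assert.h>
#include <errno.h>
#include <pthread.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical mutex used only for this demonstration. */
static pthread_mutex_t demo_mutex;

/* Hypothetical wrapper: log the failure and return it instead of asserting. */
static int logged_unlock(pthread_mutex_t *m, const char *where)
{
    int err = pthread_mutex_unlock(m);
    if (err != 0)
        fprintf(stderr, "%s: unlock failed: %s\n", where, strerror(err));
    return err;
}

/* Runs in a second thread that never locked demo_mutex. */
static void *unlock_from_other_thread(void *result)
{
    *(int *)result = logged_unlock(&demo_mutex, "other thread");
    return NULL;
}

/* Returns the error code seen by the non-owning thread, or -1 if the
 * second thread could not be started. */
int unlock_by_wrong_thread(void)
{
    pthread_mutexattr_t attr;
    pthread_t tid;
    int result = -1;
    int err;

    /* Assumption: an error-checking mutex, so misuse is reported. */
    err = pthread_mutexattr_init(&attr);
    assert(err == 0);
    err = pthread_mutexattr_settype(&attr, PTHREAD_MUTEX_ERRORCHECK);
    assert(err == 0);
    err = pthread_mutex_init(&demo_mutex, &attr);
    assert(err == 0);
    pthread_mutexattr_destroy(&attr);

    err = pthread_mutex_lock(&demo_mutex);   /* the calling thread owns it */
    assert(err == 0);
    if (pthread_create(&tid, NULL, unlock_from_other_thread, &result) == 0)
        pthread_join(tid, NULL);
    pthread_mutex_unlock(&demo_mutex);
    pthread_mutex_destroy(&demo_mutex);
    return result;
}
```

Built with -pthread, calling unlock_by_wrong_thread() from a main() should give EPERM, since POSIX specifies that error for an error-checking mutex unlocked by a thread that does not hold it. With a default mutex the behavior is undefined, so this kind of logging is only informative if the mutex type reports errors.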
Thanks for catching this. Looking at the backtrace with debug symbols, https://github.com/NVIDIA/egl-wayland/commit/6c12c934f82b0944805b2690390499de3b2fa859 appears to have caused the regression. Re-opening the issue.
On Debian testing and unstable as well as Ubuntu’s development version (coming 22.04), version 1:1.1.9-1.1 with the fix applied is available:
Sorry for the slow response, and thank you very much for reporting the issue. I believe the problem is that we’re calling eglQueryString from wl_eglstream_display_bind while holding the external API lock, which leads to a recursive acquire. However, I’m still trying to figure out why this only seems to affect WebKit. I’ll investigate a bit more and try to get a patch out as soon as possible.
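To illustrate the recursive acquire described above, here is a small sketch rather than the actual egl-wayland code. It assumes an error-checking mutex, and the function names (api_lock_init, query_string_like_call, bind_display_like_call) are hypothetical stand-ins for the real entry points.

```c
#ifndef _POSIX_C_SOURCE
#define _POSIX_C_SOURCE 200809L
#endif
#include <assert.h>
#include <errno.h>
#include <pthread.h>

/* Hypothetical stand-in for the library's external API lock. */
static pthread_mutex_t api_mutex;

/* Assumption: the lock is an error-checking mutex, so a second lock by
 * the owner is reported as an error instead of deadlocking. */
static int api_lock_init(void)
{
    pthread_mutexattr_t attr;
    int err = pthread_mutexattr_init(&attr);
    if (err != 0)
        return err;
    err = pthread_mutexattr_settype(&attr, PTHREAD_MUTEX_ERRORCHECK);
    if (err == 0)
        err = pthread_mutex_init(&api_mutex, &attr);
    pthread_mutexattr_destroy(&attr);
    return err;
}

/* Hypothetical inner entry point: takes the lock like any external call. */
static int query_string_like_call(void)
{
    int err = pthread_mutex_lock(&api_mutex);
    if (err != 0)
        return err;            /* the caller already holds the lock */
    err = pthread_mutex_unlock(&api_mutex);
    assert(err == 0);
    return 0;
}

/* Hypothetical outer entry point: holds the lock while calling back in.
 * Returns whatever the inner call reported. */
int bind_display_like_call(void)
{
    int err = api_lock_init();
    if (err != 0)
        return err;

    err = pthread_mutex_lock(&api_mutex);
    if (err != 0)
        return err;

    int inner = query_string_like_call();   /* second lock, same thread */

    err = pthread_mutex_unlock(&api_mutex);
    assert(err == 0);
    pthread_mutex_destroy(&api_mutex);
    return inner;
}
```

Built with -pthread, bind_display_like_call() should return EDEADLK, the error POSIX specifies when the owner relocks an error-checking mutex. A lock wrapper that asserts on a zero return code would abort at that point, which matches the shape of the backtrace in the report.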