conmon: Conmon hang after wake from sleep

On my two systems, one running Silverblue 35 and one running Silverblue 36, I sporadically get a conmon hang that uses one core fully, right after waking from sleep. It usually happens in the morning.

$ conmon --version
conmon version 2.1.0
commit: 

I'm unsure what info to provide; if debugging is needed, I don't mind trying to get a stack trace with debug symbols.

About this issue

  • State: closed
  • Created 2 years ago
  • Comments: 21 (8 by maintainers)

Most upvoted comments

I managed to get a strace and a gdb backtrace of a conmon pinned at 100% CPU. The issue is pretty straightforward.

strace:

[pid 14804] read(21, 0x7fc32ea5ec90, 16) = -1 EBADF (Bad file descriptor)
[pid 14804] poll([{fd=21, events=POLLIN}], 1, -1) = 1 ([{fd=21, revents=POLLNVAL}])
[pid 14804] read(21, 0x7fc32ea5ec90, 16) = -1 EBADF (Bad file descriptor)
[pid 14804] poll([{fd=21, events=POLLIN}], 1, -1) = 1 ([{fd=21, revents=POLLNVAL}])
[pid 14804] read(21, 0x7fc32ea5ec90, 16) = -1 EBADF (Bad file descriptor)
[pid 14804] poll([{fd=21, events=POLLIN}], 1, -1) = 1 ([{fd=21, revents=POLLNVAL}])
[pid 14804] read(21, 0x7fc32ea5ec90, 16) = -1 EBADF (Bad file descriptor)
[pid 14804] poll([{fd=21, events=POLLIN}], 1, -1) = 1 ([{fd=21, revents=POLLNVAL}])
[pid 14804] read(21, 0x7fc32ea5ec90, 16) = -1 EBADF (Bad file descriptor)
[pid 14804] poll([{fd=21, events=POLLIN}], 1, -1) = 1 ([{fd=21, revents=POLLNVAL}])
[pid 14804] read(21, 0x7fc32ea5ec90, 16) = -1 EBADF (Bad file descriptor)
[pid 14804] poll([{fd=21, events=POLLIN}], 1, -1) = 1 ([{fd=21, revents=POLLNVAL}])
...etc...

gdb t a a bt (thread apply all bt):

Thread 2 (Thread 0x7fc32ea5f640 (LWP 14804) "gmain"):
#0  0x00007fc32ee55baf in __GI___poll (fds=0x55ae87c3eed0, nfds=1, timeout=-1) at ../sysdeps/unix/sysv/linux/poll.c:29
#1  0x00007fc32f0f823d in g_main_context_poll (priority=<optimized out>, n_fds=1, fds=0x55ae87c3eed0, timeout=<optimized out>, context=0x55ae87c3b7b0) at ../glib/gmain.c:4516
#2  g_main_context_iterate.constprop.0 (context=context@entry=0x55ae87c3b7b0, block=block@entry=1, dispatch=dispatch@entry=1, self=<optimized out>) at ../glib/gmain.c:4206
#3  0x00007fc32f0a0940 in g_main_context_iteration (context=0x55ae87c3b7b0, may_block=may_block@entry=1) at ../glib/gmain.c:4276
#4  0x00007fc32f0a0991 in glib_worker_main (data=<optimized out>) at ../glib/gmain.c:6178
#5  0x00007fc32f0cd302 in g_thread_proxy (data=0x55ae87c339e0) at ../glib/gthread.c:827
#6  0x00007fc32eddce1d in start_thread (arg=<optimized out>) at pthread_create.c:442
#7  0x00007fc32ee625e0 in clone3 () at ../sysdeps/unix/sysv/linux/x86_64/clone3.S:81

Thread 1 (Thread 0x7fc32ea607c0 (LWP 14802) "conmon"):
#0  0x00007fc32ee2ceff in __GI___wait4 (pid=pid@entry=-1, stat_loc=stat_loc@entry=0x7ffcdff603b4, options=options@entry=0, usage=usage@entry=0x0) at ../sysdeps/unix/sysv/linux/wait4.c:30
#1  0x00007fc32ee2ce7b in __GI___waitpid (pid=pid@entry=-1, stat_loc=stat_loc@entry=0x7ffcdff603b4, options=options@entry=0) at waitpid.c:38
#2  0x000055ae8757de57 in do_exit_command () at src/ctr_exit.c:174
#3  0x00007fc32ed91085 in __run_exit_handlers (status=status@entry=1, listp=0x7fc32ef47838 <__exit_funcs>, run_list_atexit=run_list_atexit@entry=true, run_dtors=run_dtors@entry=true) at exit.c:113
#4  0x00007fc32ed91200 in __GI_exit (status=status@entry=1) at exit.c:143
#5  0x000055ae8757e28b in write_sync_fd (fd=4, res=0, message=<optimized out>) at src/parent_pipe_fd.c:54
#6  0x000055ae875790ce in main (argc=<optimized out>, argv=<optimized out>) at src/conmon.c:518

So it’s the GLib worker thread that’s spinning. Not many fds get registered in that main context, but every main context has an eventfd used for cross-thread wakeups, and someone closed it behind GLib’s back.
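
To illustrate the failure mode, here is a minimal standalone sketch (not conmon or GLib code): once the fd a poll loop is waiting on has been closed out from under it, poll() returns immediately with POLLNVAL, read() fails with EBADF, and a loop that only handles POLLIN never blocks again, which is exactly the spin visible in the strace above.

/* Minimal sketch of the spin: poll() on a closed fd returns at once
 * with POLLNVAL, so a loop that only waits for POLLIN burns a core. */
#include <poll.h>
#include <stdint.h>
#include <stdio.h>
#include <sys/eventfd.h>
#include <unistd.h>

int main(void)
{
    int efd = eventfd(0, 0);      /* stands in for GLib's wakeup eventfd */
    close(efd);                   /* "someone" closes it behind the loop's back */

    struct pollfd pfd = { .fd = efd, .events = POLLIN };
    for (int i = 0; i < 5; i++) { /* bounded here; the real loop runs forever */
        int n = poll(&pfd, 1, -1);                 /* returns 1 immediately, revents = POLLNVAL */
        uint64_t buf;
        ssize_t r = read(efd, &buf, sizeof(buf));  /* fails with EBADF */
        printf("poll=%d revents=0x%x read=%zd\n", n, (unsigned)pfd.revents, r);
    }
    return 0;
}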

This looks like it was introduced in e2215a1c4c01c25f2fc1206ad4df012d10374b99, which is recent enough to be consistent with the bug only being discovered in the last few months.

I took a very brief look at the (small) subset of GLib’s API that conmon is actually using, and if I had to make an educated guess, I’d say it’s the signal handling code that causes the worker thread to be spun up. That’s beside the point, though: the core issue is that you can’t just indiscriminately close all fds like that.
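
For comparison, a sketch of the safer pattern (not necessarily the fix conmon adopted): only close descriptors your own code opened, e.g. by tracking them and marking them O_CLOEXEC, rather than sweeping the whole fd table and destroying fds owned by libraries, such as GLib's wakeup eventfd. The open_owned/close_owned helpers are hypothetical names used only for illustration.

/* Hypothetical helpers: remember the fds this code opened so cleanup
 * touches only those, never descriptors owned by other libraries. */
#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>

static int owned_fds[64];
static int n_owned;

static int open_owned(const char *path, int flags)
{
    int fd = open(path, flags | O_CLOEXEC);   /* also won't leak across exec */
    if (fd >= 0 && n_owned < 64)
        owned_fds[n_owned++] = fd;
    return fd;
}

static void close_owned(void)
{
    for (int i = 0; i < n_owned; i++)
        close(owned_fds[i]);
    n_owned = 0;
}

int main(void)
{
    int fd = open_owned("/dev/null", O_WRONLY);
    printf("opened fd %d\n", fd);
    close_owned();                             /* library-owned fds stay intact */
    return 0;
}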

Out of interest, since this is still causing CPU/fan overheating (especially during this hot summer): is that fix going to land in Fedora soon, or hasn’t it actually been fixed yet? I still have to pkill -9 conmon every day.