openrc: rc: do not call free/malloc in signal handlers

I’ve spent part of this week chasing after a dead lock in “openrc shutdown” (running shutdown or reboot in a loop hangs once every few days in a malloc() call): https://www.openwall.com/lists/musl/2023/01/23/1

It turns out the problem is almost certainly that we’re doing a free in our SIGCHLD handler:

static void handle_signal(int sig) {
...
                /* Remove that pid from our list */
                if (pid > 0)
                        remove_pid(pid);
...
static void remove_pid(pid_t pid) {
...
                    LIST_REMOVE(p, entries); 
                    free(p);

free/malloc are not signal-safe, in particular musl libc does not take any lock as long as the process does not create threads – if there had been we would likely just have seen a deadlock instead – so if we have a stack like the following (bottom to top):

openrc functions...
malloc()
handle_signal()
free()

the free can be called while the malloc state is inconsistent and further corrupt mallocng’s internal state. (Yes that probably wouldn’t cause problems without rc_parallel, call me a red neck)

Anyway, there are a few more mallocs in the SIGUSR1’s rc_plugin_run but as a first approximation I’d like to make that remove_pid delay its free. The most straight-forward solution I can think of would be to have the signal handler not call remove_pid, but instead move the list link to a “to free later” list that can be checked once in a while during the main thread. That still is quite a bit of overengineering, but I cannot think of anything else that would be safe short of just leaking the memory. Would that be acceptable?

I’ll open a PR early next week unless someone beats me to it.

About this issue

  • Original URL
  • State: open
  • Created a year ago
  • Reactions: 1
  • Comments: 19 (10 by maintainers)

Commits related to this issue

Most upvoted comments

OpenRC needs to respect async-signal-safe settings regardless of difficulty

I agree, merely toggling a flag is certainly the best handler imagineable. Checking after i.e. waitpid() and postponing maintenance till then should usually do the trick.

Just added in my 2 cents, since I opened https://bugs.gentoo.org/434532 a ‘little while’ back.