hammerspoon: Root causing our "weird" crash logs
So, this is not the first time I’ve decided I’m determined to get to the bottom of why we always have some “weird” crash reports, and it probably won’t be the last time, but I’d like to discuss it again anyway.
Tagging @asmagill for his excellent thoughts.
Here are the “weird” crash reports received thus far for 0.9.84:
- https://sentry.io/organizations/hammerspoon/issues/2264596621/?project=5220516&query=is%3Aunresolved+release%3Alatest&statsPeriod=14d
- https://sentry.io/organizations/hammerspoon/issues/2264749660/?project=5220516&query=is%3Aunresolved+release%3Alatest&statsPeriod=14d
- https://sentry.io/organizations/hammerspoon/issues/2264676679/?project=5220516&query=is%3Aunresolved+release%3Alatest&statsPeriod=14d
- https://sentry.io/organizations/hammerspoon/issues/2263890376/?project=5220516&query=is%3Aunresolved+release%3Alatest&statsPeriod=14d
- https://sentry.io/organizations/hammerspoon/issues/2264076080/?project=5220516&query=is%3Aunresolved+release%3Alatest&statsPeriod=14d
- https://sentry.io/organizations/hammerspoon/issues/2263909041/?project=5220516&query=is%3Aunresolved+release%3Alatest&statsPeriod=14d
- https://sentry.io/organizations/hammerspoon/issues/2252842063/?project=5220516&query=is%3Aunresolved+release%3Alatest&statsPeriod=14d
- https://sentry.io/organizations/hammerspoon/issues/2262301597/?project=5220516&query=is%3Aunresolved+release%3Alatest&statsPeriod=14d
- https://sentry.io/organizations/hammerspoon/issues/2260433876/?project=5220516&query=is%3Aunresolved+release%3Alatest&statsPeriod=14d
- https://sentry.io/organizations/hammerspoon/issues/2254695465/?project=5220516&query=is%3Aunresolved+release%3Alatest&statsPeriod=14d
Here are some evidence points I have so far:
- They all start with an OS callback that leads to us calling protectedCallAndTraceback (roughly the entry pattern sketched just after this list). There are a couple of others that happen during the first load of init.lua, but we'll ignore those for now.
- Generally speaking they are all something to do with pointers not being what Lua is expecting (either NULL or garbage), which strongly suggests (imo) that this is some kind of lifecycle issue - ie they're storing a reference to something that's been subsequently deallocated.
- We don’t get very many of these - in the week since 0.9.84 was released these particular crashes only number around 20, out of thousands of user sessions. Sentry doesn’t break down a ton of detail here (in part because we deliberately don’t enable active session tracking), but they report well over 99% of sessions are crash free. Still, it annoys me to know that these crashes all hint at some kind of lifecycle bug we haven’t found yet.
- Looking at the backtraces, they mostly fall into either hs.timer or hs.eventtap callbacks, but there are others, which suggests to me that it’s not a specific bug in either of those modules, but rather a systemic flaw.
- We have protections against (at least):
  - trying to call Lua off the main thread
  - lua_State having been replaced since the callback was set up, at least in hs.timer (via the relatively new checkLuaSkinInstance in 0.9.83)
  - Lua stack overgrowth via the _lua_stackguard macros
- Looking at the breadcrumb logs from the crashes, it's not generally the case that the crash happens immediately after a LuaSkin reload
- Because they all start from C, they call [LuaSkin sharedWithState:NULL], which guarantees them the main Lua State, so they're not coming in on a co-routine state (this kind of crash report predates co-routines anyway AFAIK, but we don't have historical data to back that up). This would also be covered by the LuaSkin UUID check, at least in hs.timer.
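For context, these entry points all have roughly this shape - a simplified sketch rather than any particular module's code, with placeholder struct/field names and the LuaSkin method signatures written from memory:

```objc
#import <LuaSkin/LuaSkin.h>

// Placeholder userdata struct; each module has its own equivalent.
typedef struct {
    int callbackRef;
} module_userdata_t;

static int refTable;  // the module's LuaSkin reference table

static void some_os_callback(void *refcon) {
    // All of these paths fetch the main lua_State, never a co-routine state.
    LuaSkin *skin = [LuaSkin sharedWithState:NULL];
    _lua_stackguard_entry(skin.L);                 // guards against stack overgrowth

    module_userdata_t *userData = (module_userdata_t *)refcon;
    if (userData->callbackRef != LUA_NOREF) {      // is the callback still registered?
        [skin pushLuaRef:refTable ref:userData->callbackRef];
        [skin protectedCallAndTraceback:0 nresults:0];
        // (real modules check the BOOL result and log/pop any error)
    }

    _lua_stackguard_exit(skin.L);
}
```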
So, it seems relatively safe to say that these aren’t happening when Lua is being torn down, and their relative consistency makes me suspect they’re not stack/heap corruption, although I can’t rule that out. I have some hypotheses:
1. They are happening at some transition period between a co-routine lua_State and the main lua_State (I have no idea how this works under the hood, maybe it's not possible)
2. Somewhere we're running the Objective C event loop in one of our modules, causing it to advance and process incoming events while we're in the middle of a Lua->C situation, and the Lua state gets confused (see the sketch after this list)
3. We just have one or more memory corruption bugs in some C code somewhere
4. We have a very subtle Lua API interaction bug
5. We're hitting genuine bugs in Lua (seems unlikely that they would persist for this long without other, much fancier, users of Lua noticing)
My bet is on 2, 3 or 4, but I don’t have any evidence to support that yet.
So… thoughts? 😃
Good people of the Internet, I believe I have finally root-caused at least one variant of these crashes.
I was looking through the 8 hs.audiodevice related instances of checkLuaSkinInstance failing and I noticed that in one of them, while some time had passed between starting HS and the crash, at no point had Lua been reloaded. That ruled out the idea that these objects were leftovers from a previous Lua instance.

However, as I was looking around the hs.audiodevice code, I looked at audiodevice_callback() and started thinking about what our use of dispatch_async() actually means in practice. The code will run on the next iteration of the Objective C runloop, but that won't be until Lua has finished doing whatever it was already doing, and that led me to the realisation that if Lua is working on stuff that happens to trigger one of these C callbacks, it's also possible that, while Lua is still doing other things, it might decide to garbage collect the object.
This leaves you in a situation where Lua has discarded an object, but Objective C is going to do something with that object on its next runloop iteration. Since Lua doesn’t seem to defensively wipe the memory used by a userdata object, it’s still likely to be present enough for the callback to be able to do something with the pointer, but it quickly goes off the rails.
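As a sketch of that shape (not the actual hs.audiodevice code - the struct and field names are invented, and the listener signature is just CoreAudio's standard property-listener one):

```objc
#import <CoreAudio/CoreAudio.h>
#import <LuaSkin/LuaSkin.h>

static int refTable;  // the module's LuaSkin reference table

// Invented stand-in for the module's watcher userdata.
typedef struct {
    int callbackRef;
} watcher_userdata_t;

static OSStatus watcher_callback(AudioObjectID objectID, UInt32 numAddresses,
                                 const AudioObjectPropertyAddress *addresses,
                                 void *clientData) {
    watcher_userdata_t *userData = (watcher_userdata_t *)clientData;

    // The C portion finishes immediately; the Lua portion is deferred...
    dispatch_async(dispatch_get_main_queue(), ^{
        // ...and by the time this block runs, Lua may already have garbage
        // collected the userdata this pointer refers to. The memory usually
        // still looks plausible (Lua doesn't wipe it), so we limp along for a
        // bit and then crash somewhere deeper in the Lua C API.
        LuaSkin *skin = [LuaSkin sharedWithState:NULL];
        _lua_stackguard_entry(skin.L);
        [skin pushLuaRef:refTable ref:userData->callbackRef];
        [skin protectedCallAndTraceback:0 nresults:0];
        _lua_stackguard_exit(skin.L);
    });
    return noErr;
}
```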
I’m not 100% sure if this hs.audiodevice instance is identical to the ones that had been plaguing hs.timer and hs.eventtap, but I can well believe it is, and I guess we’ll find out as 0.9.88 rolls out more widely and the Sentry data starts coming in.
For reference, the reproducer (which only works if your audio volume is something other than 80% to start with) boils down to giving an hs.audiodevice object foo a watcher callback, calling :setVolume() on it, and then forcing a garbage collection.

The :setVolume() triggers the C portion of the watcher callback, which schedules part of itself to run on the main thread via dispatch_async(), but before that happens, foo (ie the hs.audiodevice object) is forcibly garbage collected.

So, now the hard part comes:
Do we switch to dispatch_sync(), or do we stick with async and rely on checkLuaSkinInstance to protect us? @asmagill this is definitely one I'd appreciate your help with!

Ok, so now that we have data from 0.9.89 it's clear that we didn't fix everything here. There are a few crash reports that started out in hs.hotkey and, looking more closely at the code, I understand why and have a rough idea of how to fix it. There are also a couple from people running things in the Console, which I don't understand yet, and one from hs.chooser, which makes some sense because I didn't add GCCanary support there.
Finally, there are the lingering crashes from hs.webview.toolbar that I suspect are related to garbage collection, so we should think about a smart way to avoid crashing in those scenarios.
@latenitefilms #2859 and #2860 will be relevant here - the former just renames all the luaSkinUUID stuff to lsGCCanary, but the latter adds canary checks to most of the modules that we/CP are seeing the crashes in currently.

hs.hotkey does check for LUA_NOREF and somehow was still crashing, so I'm now a little more confused than I was. I'm going to quickly push out 0.9.89, which adds checkLuaSkinInstance to hs.hotkey, and I guess we'll need to wait a week or so until we have a good chunk of the userbase upgraded.

Edit: On second thought, I won't do that, I've changed enough stuff today that I should wait and test some more before releasing again.
that’s a really good point actually, and explains why hs.audiodevice wasn’t showing up in the crash reports before, because it does check the ref. hs.timer doesn’t.
hs.eventtap does though, so I’m digging back through Sentry to try and find some of the crashes there to see where it was going wrong.
Ok, took a moment to look at the watcher code and it is checking, but it still creates the skin instance and issues the _lua_stackguard_entry(skin.L) before checking the callback ref value… do we know specifically which line in the audio callback is triggering the issue? Would checking the callback ref value before doing the stackguard check make a difference?

Don't have the code in front of me at the moment, and I'm about to head out the door, but my first question is: are we checking that the callback ref hasn't been cleared by __gc in the async block? That has been sufficient in most of the cases we've run into this runloop queue vs timing issue before.
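For reference, the guard being asked about would look something like this inside the deferred block (reusing the invented names from the sketch earlier in the thread):

```objc
dispatch_async(dispatch_get_main_queue(), ^{
    // Bail out if __gc has already unregistered the callback by the time the
    // deferred block runs. (This assumes the struct itself is still safe to
    // read after __gc, e.g. because __gc clears the ref rather than freeing
    // the struct straight away.)
    if (userData->callbackRef == LUA_NOREF) {
        return;
    }

    LuaSkin *skin = [LuaSkin sharedWithState:NULL];
    _lua_stackguard_entry(skin.L);
    [skin pushLuaRef:refTable ref:userData->callbackRef];
    [skin protectedCallAndTraceback:0 nresults:0];
    _lua_stackguard_exit(skin.L);
});
```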
In general async will allow the system to be more responsive, but sync is required if we need feedback from the Lua callback function… Switching everything to sync may make things less responsive in some cases.

If my reasoning there is correct, then yes, by accident I wrote a debugging feature that ended up indirectly doing what we needed. If that is the decision we come to, I'll rework it so it doesn't log things to Sentry, because we will have agreed that this isn't really a bug, but an expected side-effect of userdata lifecycles, and then we'll need to roll it out to every C->Lua entrypoint.
It'll need a new name too, maybe something like checkGarbageCanary.

@latenitefilms it's coming up on half-midnight here, I am not sufficiently brained to reason about that right now, but I believe the original thought behind using dispatch_async() was to let the C portion of the callback complete ASAP and do the Lua portion of the callback later. For things like hs.eventtap this actually matters, because the OS will start dropping callbacks that take too long to execute, and there's potentially no upper limit on how long the Lua portion might run for.

I don't have a counter yet, and curiously, a lot of the crashes seem to be happening long after a reload (as much as a couple of hours later, in one instance).
Ah, interesting… does this mean I can double the performance of Hammerspoon/CommandPost, simply by disabling the Address Sanitiser?
I take it as a positive sign - it likely means that people are using our API in ways we didn’t think of and are triggering weirdness.
The challenge now is to figure out what it is. If one/several of them would file GH issues, this would go a lot quicker, I suspect!
I’m going to put out a release now, to get these updated sanitiser/assert things in the wild, and we can see what reports come in over the next week (assuming it doesn’t all explode immediately!)