godot: Occasional `attempting free on address which was not malloc()-ed` crash in CI
Godot version
master CI builds
System information
Linux / Editor with doubles and GCC sanitizers
Issue description
The `Linux / Editor with doubles and GCC sanitizers` build can fail with an `attempting free on address which was not malloc()-ed` crash in the `Open and close editor (Vulkan)` step.
https://github.com/godotengine/godot/actions/runs/5366302204/jobs/9735786296#step:15:138
https://github.com/godotengine/godot/actions/runs/5389199469/jobs/9783070708?pr=78740#step:15:138
The failure doesn’t seem related to either PR (in fact, one of them is just a docs change). I have only seen this issue these two times, but it’s still probably worth looking into.
Steps to reproduce
Open and close https://github.com/godotengine/regression-test-project/archive/4.0.zip using the Vulkan renderer.
Minimal reproduction project
https://github.com/godotengine/regression-test-project/archive/4.0.zip
About this issue
- Original URL
- State: closed
- Created a year ago
- Reactions: 4
- Comments: 20 (20 by maintainers)
A lot of running stuff in GH Actions later:
`scu_build=no`
CC @warriormaster12 @RandomShaper Best I can tell, the multithreaded saving of the shader cache causes memory corruption, possibly of ASan’s tracking data and/or by the Vulkan driver. Calling `_save_pipeline_cache_threaded` synchronously seems to solve the issue, but I don’t really understand what is happening and will now stop looking into this.
I guess we could follow the situation and close the issue if nothing changes, maybe in another couple of weeks?
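To illustrate the suspected failure mode, here is a minimal standalone sketch (not Godot’s actual code; `PipelineCache` and its methods are invented stand-ins): a detached worker thread writing the cache can outlive the object it touches, while a synchronous call is strictly ordered before destruction, matching the workaround described above.

```cpp
#include <cstdio>
#include <string>
#include <thread>

struct PipelineCache {
    std::string blob = "pipeline cache bytes";

    void save() { std::printf("saving %zu bytes\n", blob.size()); }

    void save_threaded_unsafe() {
        // A detached thread may still be running when *this is destroyed:
        // it then touches memory that is no longer valid.
        std::thread([this] { save(); }).detach();
    }

    void save_synchronous() {
        // Running on the caller's thread sequences the save strictly
        // before destruction, which matches the reported workaround.
        save();
    }
};

int main() {
    {
        PipelineCache cache;
        cache.save_synchronous(); // safe ordering
    } // cache destroyed only after the save completed
}
```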
My pet theory is that https://github.com/godotengine/godot/pull/81771 fixed these CI Vulkan pipeline cache crashes. Earlier, I noticed some weirdness with Vulkan when run with `--gpu-validation`: Godot terminated immediately when run with `--gpu-abort`, and the validation layer error message explicitly mentioned invalid pipeline cache use. I was thinking about reporting this, but that PR fixed the errors, and it seemed to happen around the same time the CI errors stopped, too. But this is just a guess.
At the moment the only option is to comment out code related to pipeline caching. Commenting out the function `_save_pipeline_cache_threaded` could be a start. A better option, of course, would be a launch argument and maybe a project setting (see the sketch after the next comment).
Sorry I’ve been out of action for a bit (my new PC arrived yesterday and is almost set up for compiling). We are planning to modify the SCU build slightly soon so the CI can specify a maximum SCU file size, which will limit RAM use; at the moment it will be very high, especially with sanitizers. I can help investigate this next week, but feel free to turn off SCU for this action temporarily, or use it on a simpler (non-sanitizer / non-tests) build.
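Picking up the launch-argument / project-setting suggestion above, here is a minimal standalone sketch of what such a gate could look like. Everything here is hypothetical: the flag name, the functions, and the `--disable-pipeline-cache` argument are invented for illustration, not Godot’s actual API.

```cpp
#include <atomic>
#include <cstdio>
#include <thread>

// Would be populated from a launch argument or project setting (hypothetical).
std::atomic<bool> pipeline_cache_enabled{true};

// Stand-in for the body of Godot's _save_pipeline_cache_threaded().
void save_pipeline_cache() {
    std::printf("saving pipeline cache\n");
}

void update_pipeline_cache() {
    if (!pipeline_cache_enabled.load()) {
        return; // CI could set the flag to false and skip the save entirely.
    }
    // join() makes the save effectively synchronous, like the workaround above.
    std::thread(save_pipeline_cache).join();
}

int main() {
    pipeline_cache_enabled = false; // e.g. parsed from --disable-pipeline-cache (hypothetical)
    update_pipeline_cache();        // no-op when disabled
}
```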
SCU build on sanitizer / test build?
If we want to run SCU in CI, we need to verify that running it on the sanitizer / tests run (because it is the slowest) is actually the best choice. Although in theory SCU should accelerate CI too, when I looked at the actual build times (for CI, anyway) they weren’t necessarily faster than the regular build. This may be due to e.g. the CI caching, the larger translation units, or simply running low on RAM. There’s also no guarantee that the SCU build will behave exactly like the regular build if there are code errors. The gold standard for sanitizer errors should probably be the regular build IMO (although an SCU sanitizer build will also reveal different bugs).
Some ideas for what might be happening
An SCU build has a few large files, whereas a normal build has lots of small files, so they can stress things in different ways. If the build itself is working OK, but the sanitizers are failing on a test, that would seem to indicate e.g. a corrupt build, maybe an order-of-construction/destruction problem, or a timing difference exposing a race condition.
I’m not an expert on GitHub Actions, but here are some things we might try to figure out the culprit:
Order of construction / destruction
(This has only a small chance of being the problem, but it is worth covering in case the reader is not familiar.)
Order-of-construction/destruction bugs (usually involving globals / statics / possibly singletons) can be extremely nasty. They are why many programmers prefer explicit construction / destruction functions for globals rather than relying on constructors / destructors, i.e. the order is defined in code rather than determined by the compiler. Whether or not this is causing the problem, we should probably consider having a test for order of construction / destruction in the CI.
This is an Achilles heel of C++: the order of construction and destruction of globals across translation units is unspecified, and it can change from build to build, resulting in bugs in some builds but not others (see the sketch after the links below).
See:
- https://en.cppreference.com/w/cpp/language/siof
- https://stackoverflow.com/questions/335369/finding-c-static-initialization-order-problems
- https://github.com/google/sanitizers/wiki/AddressSanitizerInitializationOrderFiasco
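As a minimal sketch of the fiasco and the usual construct-on-first-use fix (the file names and types are invented for illustration):

```cpp
// --- registry.h ---
#include <string>
#include <vector>

struct Registry {
    std::vector<std::string> names;
    void add(const std::string &n) { names.push_back(n); }
};

extern Registry registry;      // fragile: a plain global
Registry &registry_instance(); // fix: construct-on-first-use

// --- registry.cpp (includes registry.h) ---
Registry registry; // constructed at some unspecified point before main()

Registry &registry_instance() {
    static Registry r; // constructed on first call; order is guaranteed
    return r;
}

// --- plugin.cpp (includes registry.h) ---
struct Plugin {
    Plugin() {
        // Broken if this translation unit's globals are initialized
        // before registry.cpp's: `registry` is not constructed yet.
        registry.add("plugin");
        // Safe regardless of translation unit order:
        registry_instance().add("plugin");
    }
};
Plugin plugin;
```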
With an SCU build in particular, the order of any globals within the larger translation unit is now determined by where they appear in the “uber file”, so it can expose pre-existing order-of-construction bugs.
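Concretely, an SCU “uber file” is essentially just a generated file that #includes several .cpp files, so (continuing the hypothetical files from the sketch above) the globals are now initialized in this exact textual order, which may differ from whatever order the regular build happened to pick:

```cpp
// scu_chunk_0.cpp (hypothetical generated uber file)
#include "plugin.cpp"   // plugin's global constructor now runs first...
#include "registry.cpp" // ...before registry exists: the latent bug is exposed.
```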
This is failing on the SCU build, and these errors started after enabling it, so it’s probably related.