godot: Occasional `attempting free on address which was not malloc()-ed` crash in CI

Godot version

master CI builds

System information

Linux / Editor with doubles and GCC sanitizers

Issue description

The Linux / Editor with doubles and GCC sanitizers build can fail with an `attempting free on address which was not malloc()-ed` crash in the Open and close editor (Vulkan) step.

https://github.com/godotengine/godot/actions/runs/5366302204/jobs/9735786296#step:15:138
https://github.com/godotengine/godot/actions/runs/5389199469/jobs/9783070708?pr=78740#step:15:138

The failure doesn’t seem related to either PR (in fact, one of them is just a docs change). I have only seen this issue twice, but it’s still probably worth looking into.

Steps to reproduce

Open and close https://github.com/godotengine/regression-test-project/archive/4.0.zip using the Vulkan renderer.

Minimal reproduction project

https://github.com/godotengine/regression-test-project/archive/4.0.zip

About this issue

  • State: closed
  • Created a year ago
  • Reactions: 4
  • Comments: 20 (20 by maintainers)

Most upvoted comments

After a lot of running stuff in GH Actions:

  1. The crash seems to happen somewhat more often with single-precision builds than with double-precision builds (~1/6 instead of ~1/16).
  2. The crash only happens on the first run; if the caches (and/or settings) already exist, it never crashes.
  3. The crash is not specific to a particular executable / particular compiler ordering choices.
  4. The crash is NOT exclusive to SCU; the same crash also happens with scu_build=no.
  5. The crash is exclusive to Vulkan; neither OpenGL nor headless has crashed in 240 open attempts.
  6. I can’t reproduce on my machine; I’m not sure if I’m doing something wrong or if something is just different on my machine so that these crashes don’t happen.
  7. Possibly related: sometimes this ASan CHECK failed message appears instead of the normal crash (this doesn’t fail CI):

==13718==AddressSanitizer CHECK failed: ../../../../src/libsanitizer/asan/asan_allocator.cc:141 "((m->chunk_state)) == ((CHUNK_QUARANTINE))" (0x65, 0x3)
    #0 0x7f13257208be in AsanCheckFailed ../../../../src/libsanitizer/asan/asan_rtl.cc:72
    #1 0x7f132573eeee in __sanitizer::CheckFailed(char const*, int, char const*, unsigned long long, unsigned long long) ../../../../src/libsanitizer/sanitizer_common/sanitizer_termination.cc:77
    #2 0x7f132563350d in __asan::QuarantineCallback::Recycle(__asan::AsanChunk*) ../../../../src/libsanitizer/asan/asan_allocator.cc:141
    #3 0x7f132563389d in __sanitizer::Quarantine<__asan::QuarantineCallback, __asan::AsanChunk>::DoRecycle(__sanitizer::QuarantineCache<__asan::QuarantineCallback>*, __asan::QuarantineCallback) ../../../../src/libsanitizer/sanitizer_common/sanitizer_quarantine.h:191
    #4 0x7f1325633db4 in __sanitizer::Quarantine<__asan::QuarantineCallback, __asan::AsanChunk>::Recycle(unsigned long, __asan::QuarantineCallback) ../../../../src/libsanitizer/sanitizer_common/sanitizer_quarantine.h:179
    #5 0x7f1325635b14 in __sanitizer::Quarantine<__asan::QuarantineCallback, __asan::AsanChunk>::Put(__sanitizer::QuarantineCache<__asan::QuarantineCallback>*, __asan::QuarantineCallback, __asan::AsanChunk*, unsigned long) ../../../../src/libsanitizer/sanitizer_common/sanitizer_quarantine.h:111
    #6 0x7f1325635b14 in __asan::Allocator::QuarantineChunk(__asan::AsanChunk*, void*, __sanitizer::BufferedStackTrace*) ../../../../src/libsanitizer/asan/asan_allocator.cc:598
    #7 0x7f13257163d8 in __interceptor_free ../../../../src/libsanitizer/asan/asan_malloc_linux.cc:127
    #8 0x7f131386430b  (/usr/lib/x86_64-linux-gnu/libvulkan_lvp.so+0x6430b)
    #9 0x55ab3f590934 in VmaAllocator_T::FreeVulkanMemory(unsigned int, unsigned long, VkDeviceMemory_T*) (/home/runner/work/godot/godot/bin/godot.linuxbsd.editor.dev.double.x86_64.san+0x43aa2934)
    #10 0x55ab3f55118f in VmaDeviceMemoryBlock::Destroy(VmaAllocator_T*) (/home/runner/work/godot/godot/bin/godot.linuxbsd.editor.dev.double.x86_64.san+0x43a6318f)
    #11 0x55ab3f559459 in VmaBlockVector::~VmaBlockVector() (/home/runner/work/godot/godot/bin/godot.linuxbsd.editor.dev.double.x86_64.san+0x43a6b459)
    #12 0x55ab3f5a4c25  (/home/runner/work/godot/godot/bin/godot.linuxbsd.editor.dev.double.x86_64.san+0x43ab6c25)
    #13 0x55ab3f57b194 in VmaAllocator_T::~VmaAllocator_T() (/home/runner/work/godot/godot/bin/godot.linuxbsd.editor.dev.double.x86_64.san+0x43a8d194)
    #14 0x55ab3f5a512b  (/home/runner/work/godot/godot/bin/godot.linuxbsd.editor.dev.double.x86_64.san+0x43ab712b)
    #15 0x55ab3f598138 in vmaDestroyAllocator (/home/runner/work/godot/godot/bin/godot.linuxbsd.editor.dev.double.x86_64.san+0x43aaa138)
    #16 0x55ab3ecd90d4 in RenderingDeviceVulkan::finalize() (/home/runner/work/godot/godot/bin/godot.linuxbsd.editor.dev.double.x86_64.san+0x431eb0d4)
    #17 0x55ab36fd4025 in DisplayServerX11::~DisplayServerX11() (/home/runner/work/godot/godot/bin/godot.linuxbsd.editor.dev.double.x86_64.san+0x3b4e6025)
    #18 0x55ab371b3a75 in void memdelete<DisplayServer>(DisplayServer*) (/home/runner/work/godot/godot/bin/godot.linuxbsd.editor.dev.double.x86_64.san+0x3b6c5a75)
    #19 0x55ab37128b12 in finalize_display() (/home/runner/work/godot/godot/bin/godot.linuxbsd.editor.dev.double.x86_64.san+0x3b63ab12)
    #20 0x55ab37190485 in Main::cleanup(bool) (/home/runner/work/godot/godot/bin/godot.linuxbsd.editor.dev.double.x86_64.san+0x3b6a2485)
    #21 0x55ab36eac969 in main (/home/runner/work/godot/godot/bin/godot.linuxbsd.editor.dev.double.x86_64.san+0x3b3be969)
    #22 0x7f1324956082 in __libc_start_main (/lib/x86_64-linux-gnu/libc.so.6+0x24082)
    #23 0x55ab36eac32d in _start (/home/runner/work/godot/godot/bin/godot.linuxbsd.editor.dev.double.x86_64.san+0x3b3be32d)

  8. Branch with my experiments: https://github.com/RedworkDE/godot/tree/scu-stress-test (nothing too interesting here) and the script I used for local testing, shown below (run it in the repo root with the test project zip in there; warning: it (tries to) delete all Godot config and caches):

#!/bin/bash -ex

# Build the editor with SCU and sanitizers enabled, then repeatedly open and
# close the regression test project from a clean state until the sanitizer
# log check fails.
i=0
scons platform=linuxbsd target=editor tests=no verbose=yes warnings=extra werror=yes module_text_server_fb_enabled=yes dev_build=yes scu_build=yes debug_symbols=no use_asan=yes use_ubsan=yes linker=gold
while true; do
    ((i+=1))
    echo "Attempt $i"

    # Start from a completely clean slate: no project, no Godot caches or settings.
    rm -rf test_project
    rm -rf ~/.local/share/godot/
    rm -rf ~/.config/godot/
    rm -rf ~/.cache/godot/
    unzip 4.0.zip
    mv "regression-test-project-4.0" "test_project"

    # Open and immediately quit the editor, capturing all output; stop as soon
    # as the CI log check finds a sanitizer error in it.
    bin/godot.linuxbsd.editor.dev.x86_64.san --audio-driver Dummy --editor --quit --path test_project 2>&1 | tee sanitizers_log.txt || true
    misc/scripts/check_ci_log.py sanitizers_log.txt || break
done

CC @warriormaster12 @RandomShaper Best I can tell, the multithreaded saving of the shader cache causes memory corruption, possibly of ASan’s tracking data and/or by the Vulkan driver. Calling _save_pipeline_cache_threaded synchronously seems to solve the issue, but I don’t really understand what is happening and will now stop looking into this.
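
For reference, a minimal sketch of the two call shapes being contrasted here. Only the idea of a threaded vs. synchronous pipeline cache save and the name _save_pipeline_cache_threaded come from this thread; the structure below is hypothetical, not Godot’s actual code:

#include <thread>

struct PipelineCacheSaver {
    // Stand-in for the real work (serializing the Vulkan pipeline cache to disk).
    void save_pipeline_cache() { /* write cache blob */ }

    // Shape that appears to trigger the corruption: the save runs on a worker
    // thread and can still be touching driver / allocator state while shutdown
    // proceeds on another thread.
    void save_threaded() {
        std::thread([this] { save_pipeline_cache(); }).detach();
    }

    // Shape that reportedly avoids the crash: run the save synchronously, so it
    // has fully completed before any teardown can race with it.
    void save_synchronous() {
        save_pipeline_cache();
    }
};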

  • ~Bisecting is still going to take a while (at 1-3h per commit to test), but in the current range (180a5eb981f09b9e0d967f5e67bec2206a75fa63…5c2295ff538312884115c2b7a3aec1e301b8b954) #76348 looks like a likely candidate for the cause~ Confirmed to be caused by #76348

I guess we could follow the situation and close the issue if nothing changes, maybe in another couple of weeks?

My pet theory is that https://github.com/godotengine/godot/pull/81771 fixed these CI Vulkan pipeline cache crashes. Earlier, I noticed some weirdness with Vulkan when run with --gpu-validation: Godot immediately terminated when run with --gpu-abort, and the validation layer error message explicitly mentioned invalid pipeline cache use. I was thinking about reporting this, but that PR fixed the errors, and it seemed to land around the same time the CI errors stopped, too. But this is just a guess.

At the moment the only option is to comment out code related to pipeline caching; commenting out the function _save_pipeline_cache_threaded could be a start. A better option, of course, would be a launch argument and maybe a project setting? A rough sketch of such a guard follows.
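
As a sketch of what that guard could look like: apart from _save_pipeline_cache_threaded (named above), every name here is invented for illustration:

void _save_pipeline_cache_threaded(); // the existing function discussed above

// Hypothetical flag, e.g. populated from a --disable-pipeline-cache launch
// argument or a rendering project setting.
static bool pipeline_cache_enabled = true;

void save_pipeline_cache_if_enabled() {
    if (!pipeline_cache_enabled) {
        return; // skip caching entirely, i.e. the workaround described above
    }
    _save_pipeline_cache_threaded();
}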

Sorry I’ve been out of action for a bit (my new PC arrived yesterday and is almost set up for compiling). We are planning to modify the SCU build slightly soon so that the CI can specify a maximum SCU file size, which will limit RAM use; at the moment it will be very high, especially with sanitizers. I can help investigate this next week, but feel free to turn off SCU for this action temporarily, or use it on a simpler (non-sanitizer / non-tests) build.

SCU build on sanitizer / test build?

If we want to run SCU in CI, we need to verify that running it on the sanitizer / tests run (because it is the slowest) is actually the best choice. Although in theory SCU should accelerate CI too, when I looked at the actual build times (for CI anyway) they weren’t necessarily faster than the regular build. This may be due to e.g. the CI caching, the larger translation units, or simply running low on RAM. There’s also no guarantee that the SCU build will operate exactly like the regular build if there are code errors. The gold standard for sanitizer errors should probably be the regular build imo (although an SCU sanitizer build will also reveal different bugs).

Some ideas for what might be happening

An SCU build has a few large files where a normal build has lots of small ones, so the two can stress things in different ways. If the build itself is working OK, but the sanitizers are failing on a test, that would seem to indicate e.g. a corrupt build, an order-of-construction / destruction problem, or a timing difference exposing a race condition; a sketch of the latter follows.
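
For the race-condition idea in particular, the generic shape is something like this (an artificial example, not Godot code): one thread still using an allocation while the shutting-down thread frees it. Whether ASan reports a heap-use-after-free, corrupted allocator state, or nothing at all depends on timing, which is exactly why such bugs show up on some runs and not others.

#include <chrono>
#include <thread>

int main() {
    int *shared = new int(0);
    std::thread worker([shared] {
        for (int i = 0; i < 100000; i++) {
            *shared = i; // may still be running after the delete below
        }
    });
    // "Shutdown" proceeds without waiting for the worker to finish.
    std::this_thread::sleep_for(std::chrono::microseconds(10));
    delete shared; // races with the writes above
    worker.join();
}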

I’m not an expert on GitHub Actions, but here are some possible ideas:

  • Running out of RAM on certain runs, causing spurious errors (very likely)
  • Problem with the SCons caching
  • Timing issue with files in the CI and file locks
  • Undefined order of construction / destruction problem, which is revealed by SCU build and the sanitizers

Some things we might try to figure out the culprit:

  • limiting the number of files per SCU (lower RAM)
  • turning off CI caching

Order of construction / destruction

(This has only a small chance of being the problem, but it’s worth considering if the reader is not familiar.)

Order-of-construction / destruction bugs (usually globals / statics / possibly singletons) can be extremely nasty. They are why many programmers prefer explicit construction / destruction functions for globals rather than relying on constructors / destructors, i.e. the order is defined in code rather than determined by the compiler (see the sketch below). Whether or not this is causing the problem, we should probably consider having a test for order of construction / destruction in the CI.
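
The pattern being described is roughly this (a generic sketch, not Godot code); the stack trace above, with frames like finalize_display() and Main::cleanup(), shows this explicit-teardown style in action:

class Renderer { /* ... */ };

static Renderer *g_renderer = nullptr;

// Explicit, code-defined lifetime: the engine entry point decides exactly when
// these run relative to every other subsystem, instead of leaving the order to
// the compiler / linker as a plain global `Renderer g_renderer;` would.
void initialize_renderer() { g_renderer = new Renderer(); }
void finalize_renderer() {
    delete g_renderer;
    g_renderer = nullptr;
}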

This is an Achilles heel of C++: the order of construction and destruction of globals across translation units is undefined, and it can change from build to build, resulting in bugs in some builds but not in others. A minimal illustration follows.
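
A minimal illustration (a generic two-file example, nothing Godot-specific):

// --- log.cpp ------------------------------------------------------------
struct Log {
    Log() { ready = true; }
    bool ready = false;
};
Log g_log; // constructed at some unspecified point before main()

// --- user.cpp (Log would live in a shared header in real code) -----------
extern Log g_log;
struct User {
    User() {
        // If user.cpp's globals happen to be constructed first, this runs
        // against a not-yet-constructed g_log (ready is still false here,
        // and anything less trivial than reading a bool is undefined behavior).
        bool ok = g_log.ready;
        (void)ok;
    }
};
User g_user; // construction order relative to g_log: unspecified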

See:
https://en.cppreference.com/w/cpp/language/siof
https://stackoverflow.com/questions/335369/finding-c-static-initialization-order-problems
https://github.com/google/sanitizers/wiki/AddressSanitizerInitializationOrderFiasco

With an SCU build in particular, the order of any globals within the larger translation unit is now determined by where they appear in the “uber file”, so SCU can expose existing order-of-construction bugs, as the schematic example below shows.
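
Schematically (an artificial, runnable example; an SCU uber file is essentially many real .cpp files #included in sequence):

// What an SCU "uber file" effectively does: paste the contents of
// editor_a.cpp and editor_b.cpp into ONE translation unit, in this order.

struct A {
    A() { value = 42; }
    int value = 0;
};
A a; // originally in "editor_a.cpp"

struct B {
    B() : copied(a.value) {} // silently depends on `a` being constructed first
    int copied;
};
B b; // originally in "editor_b.cpp"

int main() { return b.copied == 42 ? 0 : 1; }

// Within one translation unit, construction order is top to bottom, so here
// `b` always sees a constructed `a`. In a normal build these globals live in
// separate TUs and the order is unspecified: `b` could run first and read 0.
// SCU thus pins down an order that can hide or expose latent bugs.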

This is failing on the SCU build, and these errors started after enabling it, so it’s probably related.