renderdoc: Race condition results in Android replay server behaving like a targeted app

Description

The RenderDoc Android replay server is subject to a race condition on startup that results in its libVkLayer_GLES_RenderDoc.so thinking it’s not running in the replay server. The library thus acts like it’s in an app to be targeted/debugged and opens a control socket (38920) instead of a replay server socket (39920). The most obvious symptom of this is that when you ask RenderDoc to launch an app for debugging, it displays a prompt that says something else is already being debugged (which you know is false). You can then go to File > Attach To Running Instance and you will see the replay server app (e.g., org.renderdoc.renderdoccmd.arm64) listed as something you can attach to, which should never happen.

This doesn’t happen on all devices, but it happens on our devices that have Android 12. A difference between the two situations is the native libraries that are loaded implicitly when the replay server APK starts. In the problematic case (our devices), libhwui.so is loaded immediately when the apk is started. In the working cases (other devices) libhwui.so isn’t loaded. That is, the same exact APK experiences different library loading in the working and failing environments. I’ll provide more details at the end.

[Figure: renderdoc_race_condition]

The reason libhwui being in the mix is problematic is that it launches a dedicated thread (1) to initialize the Vulkan/GLES environments for the process (see http://aospxref.com/android-12.0.0_r3/xref/frameworks/base/libs/hwui/renderthread/RenderThread.cpp#187). This thread starts making Vulkan calls (2), and libvulkan.so will, upon the first significant API call, initiate its layer loading. That logic looks for layer libraries in an APK’s lib directory unconditionally, i.e., no properties need to be set (see http://aospxref.com/android-12.0.0_r3/xref/frameworks/base/core/java/android/os/GraphicsEnvironment.java#331). That means that very shortly after the replay server APK starts running, libvulkan will load libVkLayer_GLES_RenderDoc.so (3).

Meanwhile, back on the main thread…the process is busy doing other initialization after loading libhwui.so, and execution eventually gets to the Loader Java class in the replay server, which does a System.loadLibrary("renderdoccmd") (4). That call loads the replay server logic, all of which is C++ code built into librenderdoccmd.so.

If the main thread doesn’t get to that Java loadLibrary() call before the secondary libhwui thread exercises the Vulkan layer loading, then we have a bad situation. The “DllMain” in libVkLayer_GLES_RenderDoc.so tries to determine if it’s in the replay server by doing a dlsym("renderdoc__replay__marker"). That symbol exists in librenderdoccmd.so. If libVkLayer_GLES_RenderDoc.so was first loaded by the Vulkan layer loading, then the library’s DllMain() ran at that time and the lookup failed, since librenderdoccmd.so had not yet been loaded.
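For illustration, the marker lookup can be sketched roughly as follows. This is a minimal sketch and not the actual RenderDoc code: the function name markerVisible is made up here, and it assumes the lookup searches the global scope via RTLD_DEFAULT.

```cpp
#include <dlfcn.h>

// Hypothetical helper (not a RenderDoc function): reports whether any
// library already loaded into this process exports `symbol`.
bool markerVisible(const char *symbol)
{
  // RTLD_DEFAULT searches the libraries loaded so far. A library that is
  // loaded later is invisible to this call, which is why the result
  // depends on load order.
  return dlsym(RTLD_DEFAULT, symbol) != nullptr;
}
```

Called from the layer's load-time initializer as markerVisible("renderdoc__replay__marker"), this returns false whenever librenderdoccmd.so has not been loaded yet, and the layer then takes the "targeted app" path.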

That dlsym(marker) approach works fine when libVkLayer_GLES_RenderDoc.so is loaded as a result of loading librenderdoccmd.so in Loader.java, which is how things are supposed to happen. But this secondary thread created by libhwui throws a wrench into things by creating the possibility that libVkLayer_GLES_RenderDoc.so gets loaded prematurely. For our devices it happens more often than not: anywhere from 75% to 95% of the time, depending on the device and build.
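The order dependence can be shown with a toy model. Everything in it (ToyProcess and its members) is invented for illustration and does not call the real loader; it only mimics the fact that the layer decides once, at load time, based on what is already loaded.

```cpp
// Toy model of the two load orders. No real dynamic loading happens here.
struct ToyProcess
{
  bool markerLoaded = false;       // librenderdoccmd.so (with the marker) is loaded
  bool layerLoaded = false;        // libVkLayer_GLES_RenderDoc.so is loaded
  bool layerThinksReplay = false;  // decision taken in the layer's initializer

  // Models the layer's one-time load-time initializer ("DllMain").
  void loadLayer()
  {
    if(layerLoaded)
      return;  // already loaded: the initializer does not run again
    layerLoaded = true;
    layerThinksReplay = markerLoaded;  // stands in for the dlsym(marker) check
  }

  // Models System.loadLibrary("renderdoccmd"). Per the description above,
  // the marker is visible when the layer initializes via this path.
  void loadRenderdoccmd()
  {
    markerLoaded = true;
    loadLayer();
  }
};
```

If loadRenderdoccmd() runs first, layerThinksReplay ends up true. If the libhwui thread gets to loadLayer() first, the later loadRenderdoccmd() cannot change the decision, because the initializer has already run.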

I did not find an elegant solution other than to replace the dlsym(marker) call with a check of the process executable, comparing it to the name of the Android replay server. I made this alternate logic conditional on Android only, of course. If someone can think of a better solution, I’m all for it. I’ll submit the change as a pull request. In general though, it seems hazardous/fragile for libVkLayer_GLES_RenderDoc.so to assume the other library in the replay server (librenderdoccmd.so) will be loaded first. Normally (desktop and embedded Linux), the replay server is just an executable and so the assumption is rock solid; the executable will of course be loaded prior to any library the executable dlopens.
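A process-name check along those lines might look like the sketch below. It is not the code from the pull request. The helper names are hypothetical, and it rests on two assumptions: that the process name can be read from /proc/self/cmdline, and that replay server package names start with org.renderdoc.renderdoccmd (as in org.renderdoc.renderdoccmd.arm64).

```cpp
#include <fstream>
#include <string>

// Assumption: every replay server package name starts with this prefix.
static const std::string kReplayPrefix = "org.renderdoc.renderdoccmd";

// Hypothetical helper: returns the first NUL-terminated entry of
// /proc/self/cmdline, or an empty string if the file cannot be read.
std::string currentProcessName()
{
  std::ifstream f("/proc/self/cmdline", std::ios::binary);
  std::string name;
  std::getline(f, name, '\0');
  return name;
}

// Pure string test, kept separate so it can be checked without a device.
bool isReplayServerName(const std::string &name)
{
  return name.compare(0, kReplayPrefix.size(), kReplayPrefix) == 0;
}

// Does not depend on which library was loaded first.
bool inReplayServer()
{
  return isReplayServerName(currentProcessName());
}
```

Unlike the symbol lookup, the answer here is fixed for the lifetime of the process, so it does not matter which thread triggers the layer load.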

Finally, I want to provide more details on why this affects our Android devices but not others I’ve tried (production Android phones). I mentioned earlier that I noticed a stark difference in the activity in the dynamic loader when launching the replay server APK. Here are some details, in case they provide clues.

When I enable tracing in the bionic loader of a production phone (lacks problem), I see that the first dlopen() in the replay server process is for /system/framework/oat/arm64/org.apache.http.legacy.odex, done by /apex/com.android.art/lib64/libart.so. On our devices (has problem), the first dlopen() is for libandroid.so, done by /system/lib64/libhwui.so. To enable loader tracing (to logcat), run:

  adb shell setprop debug.ld.app.org.renderdoc.renderdoccmd.arm64 dlopen

Steps to reproduce

This happens on our devices but not production phones I’ve tried. Also, this has nothing to do with any app. The problem happens entirely within the replay server apk.

For Android devices where the premature loading of libVkLayer_GLES_RenderDoc.so happens:

  1. Launch RenderDoc
  2. Select the Android device in the replay server
  3. Try to launch any app from RenderDoc. You’ll get a dialog saying you are already debugging something (although you’re not)
  4. Open File > Attach To Running Instance and you’ll see the replay server listed as something you can attach to (which makes no sense).

Environment

  • RenderDoc version: 1.20
  • Operating System: Host: All; Device: Android 12
  • Graphics API: N/A since no app is involved.

About this issue

  • State: closed
  • Created 2 years ago
  • Comments: 38 (33 by maintainers)

Most upvoted comments

Baldur, interesting results from trying the fix.

It works, which is good. The curious part is that the fix has altered how the replay server’s dynamic linking plays out. I now see libart.so getting loaded early on, which is something that wasn’t happening without the wrapper script. The end result is that the race condition no longer happens. Consequently, in logcat, I see the following every time

08-18 11:18:05.619  2650  2650 I renderdoc: @62fe663d00000001@ RDOC 002650: [11:18:05] 	android_hook.cpp( 683) - Log 	- Detecting symbol renderdoc__replay__marker by dlsym: yes
08-18 11:18:05.620  2650  2650 I renderdoc: @62fe663d00000002@ RDOC 002650: [11:18:05] 	android_hook.cpp( 684) - Log 	- Detecting symbol renderdoc__replay__marker by getenv: yes

What this means is that setting the environment variable isn’t actually needed, which is a bit ironic. The purpose of the change is to introduce an environment variable that can be used to detect the replay server context, but the mechanism used makes the environment variable unnecessary 😃

Bottom line, though…the change fixes the problem, and that’s what’s important.
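For reference, the environment-variable check mentioned in the comment above can be sketched as below. The helper name envMarkerSet is hypothetical. The log lines suggest the variable shares the marker's name, but that is an assumption, so the name is left as a parameter.

```cpp
#include <cstdlib>

// Hypothetical helper: true if the named environment variable is set to
// any value (including an empty one) in the current process.
bool envMarkerSet(const char *name)
{
  return std::getenv(name) != nullptr;
}
```

Because the environment is in place before any library in the process is loaded, this check does not depend on load order the way the symbol lookup does.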