orca: Hang after too many gl requests in Docker

From the tests documented starting at https://github.com/plotly/streambed/issues/9865#issuecomment-349995119:

When image-exporter is run as an imageserver in Docker, after a small number of gl requests (30-40), the image-exporter hangs completely. The request in progress times out, and the server won’t accept any more connections and must be restarted.

Two examples:

  • If test/image/make_baseline.js is used to run gl*, it hangs at gl3d_chrisp-nan-1.json.
  • If test/image/make_baseline.js is used to run gl3d*, it makes it past gl3d_chrisp-nan-1.json and hangs at gl3d_snowden_altered.json.

This means that the issue is unlikely to be specific to any one plot, but rather some resource becomes exhausted or something builds up to the point where image generation can’t proceed.

@etpinard @monfera FYI

Most upvoted comments

Oh, I should mention: I tried running the gl image tests in docker without the ignore-gpu-blacklist flag yesterday. Without that flag the gl images fail to generate.

@monfera @etpinard I’ve added some debugging stuff to the Docker container in #44.

  • Xvfb runs with X errors sent to STDOUT (instead of being ignored), and auditing is turned on. This means it prints a message for every X connection and disconnection, which shows if Electron is connecting and disconnecting (spoiler alert: it isn’t).
  • Xvfb’s screen dimensions have been doubled.
  • I added a VNC server, window manager, and a wrapper script.

Usage:

  • Build the image (docker build -f deployment/Dockerfile -t isdebug .), or grab it from quay.
  • Run the image as container isdebug and expose the VNC and imageserver ports: docker rm -f isdebug ; docker run -p 9091:9091 -p 5900:5900 --name isdebug -ti isdebug
  • From a second window, connect and run the VNC wrapper: docker exec -ti isdebug /vnc
  • Connect to the VNC display localhost:0 using your favourite client, such as gvncviewer on Ubuntu or Chicken of the VNC on OS X.
  • From a third window, run whatever tests are needed, such as (in the plotly.js repo): node test/image/make_baseline.js gl* (you’ll want to modify testContainerUrl in tasks/util/constants.js, as in https://github.com/plotly/plotly.js/tree/image-exporter-testing; a sketch follows this list)
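For the constants.js tweak mentioned in the last step, a minimal sketch, assuming the module exports a plain object and that the imageserver is reachable on the port mapped above:

```js
// tasks/util/constants.js (plotly.js) -- sketch only, the rest of the module
// is omitted. Point the image tests at the debug container instead of the
// default test container; the URL is an assumption matching -p 9091:9091.
module.exports = {
  // ...existing constants...
  testContainerUrl: 'http://localhost:9091/'
};
```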

In my case the image exporter window where the work was occurring was in fluxbox tab 2. Some errors were printed during normal operation (at this point images were being generated successfully):

[screenshot, 2017-12-11 21:54: X errors printed during normal operation]

When the server hung, a different message was printed:

[screenshot, 2017-12-11 21:54: the different message printed when the server hung]

I have not had the chance to investigate the significance of either of these things.

Self-assigned this based on info from @scjody that @jackparmer suggested I continue with the gl plot leak detection & fix work. Jody, when work starts on this (I understand that’s after the couple of issues currently assigned to me are fixed), I’ll need some bag of representative plots that I can run the server through to reproduce the issue locally, so that we don’t miss something.

IOW, running it through all public plots is an incredibly effective way of ferreting out all memory leaks that can happen in a single render pass (it doesn’t catch leaks from interactions, though).

What does this mean (if anything) for a Python distribution of this app?

It doesn’t mean anything unless we expect python users to use a docker container (which I don’t think is a good idea).

Looking at puppeteer docs, I came across GoogleChrome/puppeteer#1603 - so yeah we’re not the only one having issue with docker + chromium 🙃

I’ve been watching this issue; it looks like it got closed / fixed today.

There’s nothing special about the host /dev/shm other than size. On my Ubuntu system it defaults to 2 GiB, whereas Docker defaults to 64 MiB. Docker 1.10 makes shm size configurable using the --shm-size option, and I was able to do a successful run with 128 MiB.

On the Chrome side, the underlying issue has just been fixed in Chromium but there’s no Chrome release yet, and it’s certainly not in Electron… https://bugs.chromium.org/p/chromium/issues/detail?id=736452

Also Chrome may need up to 512 MiB so I suggest that as a --shm-size value if possible.
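As an illustration, a hypothetical startup check (not something orca does) could verify the available /dev/shm size from Node before accepting requests:

```js
// Hedged sketch: warn at server startup when /dev/shm is smaller than the
// ~512 MiB Chrome may need. Relies on `df` being available in the image.
const { execSync } = require('child_process');

function shmSizeMiB() {
  // -P gives POSIX output; the second column is the size in 1K blocks
  const lastLine = execSync('df -kP /dev/shm').toString().trim().split('\n').pop();
  const sizeKiB = parseInt(lastLine.split(/\s+/)[1], 10);
  return sizeKiB / 1024;
}

const size = shmSizeMiB();
if (size < 512) {
  console.warn(`/dev/shm is only ${size.toFixed(0)} MiB; ` +
    'Chrome may need up to 512 MiB (e.g. run Docker with --shm-size=512m)');
}
```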

My next steps are to see if we can increase /dev/shm size (via --shm-size, mounting the host’s /dev/shm, or mounting a new volume there inside the container) in GKE and Replicated.

Just to confirm, it works with docker run -p 9091:9091 -v /dev/shm:/dev/shm isdebug

This is the minimal set of arguments:

  • -p 9091:9091 to map the port used, so I can access it from my machine
  • -v /dev/shm:/dev/shm to map in the /dev/shm device
  • isdebug the name of the image I’m using

Got it: docker run -p 9091:9091 --name isdebug -ti -v /dev/shm:/dev/shm isdebug

Everything except -v /dev/shm:/dev/shm is for my debugging convenience and shouldn’t affect the issue.

So mapping in the shared memory device works around the issue. I’ll check whether this is possible in On-Prem and GKE, but it seems unlikely, especially with GKE 😿. It may just be that this is a good clue towards what’s going on with Electron or Chrome… at first glance this really feels like a Chrome bug, but it’s too early to say.

Using an Ubuntu 16.04 VM with as close a copy of the Docker image as I could create (for obvious reasons directories like /boot needed to be preserved in their VM state), the error does not occur. (I’m still using my external Ubuntu 16.04 X server.)

This suggests that the issue is extremely specific to running under Docker. There really shouldn’t be any significant differences there (unless Electron is somehow accessing the graphics hardware directly rather than through X).

Unfortunately creating the VM was fairly time consuming and I still managed to delete some important things (VMware tools for one thing) without which the VM can’t be used. So I’m going to abandon this line of investigation unless someone else sees a lot of value in it.

Right now I’m going to reach out to Replicated to see if they have any ideas.

Maybe we should try pinging Mikola or one of his friends (e.g. Hugh Kennedy) about this topic? Perhaps they came across the same problems before.

cc @bpostlethwaite

(btw, I’m working on the setup with your previous help, but may not be around for too long as it’s getting late here, so you have a better chance to try with the directly preceding one)

@monfera If you can think of experiments to try (related to ignore-gpu-blacklist or otherwise), please try them and let us know!

I realize Xvfb has different properties from an X server running with a real graphics card. That’s why I tried it yesterday using an X server running on Ubuntu 16.04 (with access to my laptop’s graphics card). To summarize, with image-export-server running in Docker using an external Ubuntu X server, the hang still occurred. With it running on my development machine using the same external Ubuntu X server, no hang occurred. Is it possible that Electron is accessing the graphics card in some way that does not go through the X11 protocol? This seems unlikely but if so it could explain the difference.

My next experiment is to put an externally-built (and non-hanging) image-exporter-server and Electron into a Docker container and test that.

@scjody just jumping in to see if I can help with this stuff, but I may just be stating the obvious (perhaps incorrectly assuming that your development machine uses its own graphics drivers for WebGL content rendering).

> What’s different about the app (or Electron) when it’s built and run inside Docker vs. outside?

One difference is that when Electron is used on a desktop, the WebGL API calls go through whatever graphics driver the desktop has, e.g. from Nvidia, AMD, or Intel integrated graphics. In a Docker container, or on CI systems in general, it runs headless but still expects a display, in this case Xvfb.

Many of our WebGL plots run on stack.gl and gl-vis which, unlike regl, don’t provide automatic resource management. The typical WebGL resources are shader programs and buffer bindings that link typed JS arrays to contents on the GPU used by the shaders (uniforms, attributes, textures, etc.). As there’s no automatic resource management, a small issue in one specific trace type may yield state inconsistency, e.g. no buffer bound to an enabled attribute (as seen in your log), or sometimes the other way around.
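As a hedged illustration (not code from orca or the stack.gl modules), this is the shape of inconsistency being described: an attribute left enabled after its buffer goes away, which some drivers tolerate and others report as an error.

```js
// Minimal WebGL sketch of the "no buffer bound to an enabled attribute" state.
const canvas = document.createElement('canvas');
const gl = canvas.getContext('webgl');

const buffer = gl.createBuffer();
gl.bindBuffer(gl.ARRAY_BUFFER, buffer);
gl.bufferData(gl.ARRAY_BUFFER, new Float32Array([0, 0, 1, 0, 0, 1]), gl.STATIC_DRAW);
gl.enableVertexAttribArray(0);
gl.vertexAttribPointer(0, 2, gl.FLOAT, false, 0, 0);

// A manual cleanup path that frees the buffer but forgets to disable the
// attribute leaves the context in an inconsistent state...
gl.deleteBuffer(buffer);
// gl.disableVertexAttribArray(0); // <- the step that is easy to miss

// ...and a later draw may then warn or error, depending on the driver.
gl.drawArrays(gl.TRIANGLES, 0, 3);
```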

Various drivers, and by extension Xvfb too, have different idiosyncrasies and tolerances for handling slightly out-of-spec WebGL state. As we’ve encountered such things in the past, it might be that Xvfb is more sensitive to some of them.

Something else: apparently we’re running tests with ignore-gpu-blacklist, which in turn might disable some WebGL extensions depending on the graphics stack; if we rely on one of these and a related warning gets swallowed, we might get seemingly unrelated state-inconsistency reports just like the above. A candidate is OES_vertex_array_object. We seem to use it only via a polyfill, but the same graphics card may process a real VAO and the polyfill differently, or not properly ‘disusing’ the polyfill may also lead to inconsistencies.
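For illustration, a minimal check of whether the native extension is exposed (a hedged sketch, not code from orca or plotly.js):

```js
// Check whether native VAOs are available on the current graphics stack,
// since the flags in use can change which extensions the driver exposes.
const canvas = document.createElement('canvas');
const gl = canvas.getContext('webgl');

const vaoExt = gl.getExtension('OES_vertex_array_object');
if (vaoExt) {
  // Native path: one VAO captures the attribute setup and is rebound per draw.
  const vao = vaoExt.createVertexArrayOES();
  vaoExt.bindVertexArrayOES(vao);
} else {
  // Polyfill path: attribute state has to be re-issued manually, which is
  // where a missed "disuse" step can creep in.
  console.warn('OES_vertex_array_object not available on this stack');
}
```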

It might be worth trying to listen to

https://electronjs.org/docs/api/app#event-gpu-process-crashed

in the app code. I’ll give this a shot this afternoon.
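For reference, a minimal sketch of that experiment (the handler body is an assumption, not something from the orca codebase):

```js
// Log GPU process crashes from the Electron main process. The event and its
// `killed` argument are documented at the URL above.
const { app } = require('electron');

app.on('gpu-process-crashed', (event, killed) => {
  console.error(`GPU process crashed (killed by the OS: ${killed})`);
  // Possible follow-up (assumption): exit so the orchestrator can restart
  // the server instead of leaving it hung.
});
```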

@monfera The best way I’ve found is to run test/image/mocks/gl* or test/image/mocks/gl3d* (from plotly.js). Both sets of mocks will reproduce the issue, but at different places.