orca: Hang after too many gl requests in Docker
From the tests documented starting at https://github.com/plotly/streambed/issues/9865#issuecomment-349995119:
When image-exporter is run as an imageserver in Docker, after a small number of gl requests (30-40), the image-exporter hangs completely. The request in progress times out, and the server won’t accept any more connections and must be restarted.
Two examples:

- If `test/image/make_baseline.js` is used to run `gl*`, it hangs at `gl3d_chrisp-nan-1.json`.
- If `test/image/make_baseline.js` is used to run `gl3d*`, it makes it past `gl3d_chrisp-nan-1.json` and hangs at `gl3d_snowden_altered.json`.
This means that the issue is unlikely to be specific to any one plot, but rather some resource becomes exhausted or something builds up to the point where image generation can’t proceed.
About this issue

- State: closed
- Created 7 years ago
- Comments: 39 (39 by maintainers)
Oh, I should mention: I tried running the gl image tests in Docker without the `ignore-gpu-blacklist` flag yesterday. Without that flag the gl images fail to generate.
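For context, a Chromium switch like `ignore-gpu-blacklist` is normally forwarded from the Electron main process before the `ready` event. The snippet below is a generic sketch of that pattern, not the image-exporter's actual startup code:

```js
// Generic sketch: forwarding a Chromium switch from an Electron main process.
// Illustrative only; the real image-exporter startup code may differ.
const { app, BrowserWindow } = require('electron')

// Must be called before the 'ready' event for the switch to take effect.
app.commandLine.appendSwitch('ignore-gpu-blacklist')

app.on('ready', () => {
  // A hidden window would then render WebGL plots with the blacklist relaxed.
  const win = new BrowserWindow({ show: false })
  win.loadURL('about:blank')
})
```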
@monfera @etpinard I’ve added some debugging stuff to the Docker container in #44.

- `Xvfb` runs with X errors sent to STDOUT (instead of being ignored), and auditing is turned on. This means it prints a message for every X connection and disconnection, which shows whether Electron is connecting and disconnecting (spoiler alert: it isn’t).
- `Xvfb`’s screen dimensions have been doubled.

Usage:

- Build the image (`docker build -f deployment/Dockerfile -t isdebug .`), or grab it from quay.
- Run the container as `isdebug` and expose the VNC and imageserver ports: `docker rm -f isdebug ; docker run -p 9091:9091 -p 5900:5900 --name isdebug -ti isdebug`
- Start VNC inside the container: `docker exec -ti isdebug /vnc`
- Connect to `localhost:0` using your favourite client, such as `gvncviewer` on Ubuntu or Chicken of the VNC on OS X.
- Run the tests: `node test/image/make_baseline.js gl*` (you’ll want to modify `testContainerUrl` in `tasks/util/constants.js`, like in https://github.com/plotly/plotly.js/tree/image-exporter-testing)

In my case the image exporter window where the work was occurring was in fluxbox tab 2. Some errors were printed during normal operation (at this point images were being generated successfully):
When the server hung, a different message was printed:
I have not had the chance to investigate the significance of either of these things.
Self-assigned it based on info from @scjody that @jackparmer suggested I continue with the `gl` plot leak detection & fix work. Jody, when work starts on this (I understand after the couple of issues assigned to me are fixed), I’ll need some bag of representative plots that I can run the server through to reproduce the issue locally, so that we don’t miss something.

IOW, running it through all public plots is an incredibly effective way of ferreting out all memory leaks that can happen on a single render pass (it doesn’t catch leaks from interactions though).
It doesn’t mean anything unless we expect Python users to use a Docker container (which I don’t think is a good idea).
I’ve been watching this issue; it looks like it got closed / fixed today.
There’s nothing special about the host `/dev/shm` other than size. On my Ubuntu system it defaults to 2 GiB, whereas Docker defaults to 64 MiB. Docker 1.10 makes shm size configurable using the `--shm-size` option, and I was able to do a successful run with 128 MiB.

On the Chrome side, the underlying issue has just been fixed in Chromium but there’s no Chrome release yet, and it’s certainly not in Electron… https://bugs.chromium.org/p/chromium/issues/detail?id=736452

Also, Chrome may need up to 512 MiB, so I suggest that as a `--shm-size` value if possible.

My next steps are to see if we can increase `/dev/shm` size (via `--shm-size`, mounting the host’s `/dev/shm`, or mounting a new volume there inside the container) in GKE and Replicated.
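To make that suggestion concrete: with the `isdebug` container from earlier in this thread, a larger shared-memory segment would presumably be requested with something like `docker run --shm-size=512m -p 9091:9091 --name isdebug -ti isdebug` (a hypothetical invocation combining the `--shm-size` option and the 512 MiB value mentioned above; the actual deployment flags may differ).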
Just to confirm, it works with `docker run -p 9091:9091 -v /dev/shm:/dev/shm isdebug`.

This is the minimal set of arguments:

- `-p 9091:9091` to map the port used, so I can access it from my machine
- `-v /dev/shm:/dev/shm` to map in the `/dev/shm` device
- `isdebug`, the name of the image I’m using

Got it:

`docker run -p 9091:9091 --name isdebug -ti -v /dev/shm:/dev/shm isdebug`

Everything except `-v /dev/shm:/dev/shm` is for my debugging convenience and shouldn’t affect the issue.

So mapping in the shared memory device works around the issue. I’ll check if this is possible in On-Prem and GKE, but it seems unlikely, especially with GKE 😿. It might just be that this is a good clue towards what’s going on with Electron or Chrome… at first glance this really feels like a Chrome bug, but it’s too early to say.
Using an Ubuntu 16.04 VM with as close a copy of the Docker image as I could create (for obvious reasons, directories like `/boot` needed to be preserved in their VM state), the error does not occur. (I’m still using my external Ubuntu 16.04 X server.)

This suggests that the issue is extremely specific to running under Docker. There really shouldn’t be any significant differences there (unless Electron is somehow accessing the graphics hardware directly rather than through X).
Unfortunately, creating the VM was fairly time-consuming, and I still managed to delete some important things (VMware Tools, for one) without which the VM can’t be used. So I’m going to abandon this line of investigation unless someone else sees a lot of value in it.
Right now I’m going to reach out to Replicated to see if they have any ideas.
Maybe we should try pinging Mikola or one of his friends (e.g. Hugh Kennedy) about this topic? Perhaps they’ve come across the same problems before.
cc @bpostlethwaite
(BTW, I’m working on the setup with your previous help, but I may not be around for too long as it’s getting late here, so you have a better chance trying the directly preceding one.)
@monfera If you can think of experiments to try (related to `ignore-gpu-blacklist` or otherwise), please try them and let us know!

I realize `Xvfb` has different properties from an X server running with a real graphics card. That’s why I tried it yesterday using an X server running on Ubuntu 16.04 (with access to my laptop’s graphics card). To summarize, with `image-export-server` running in Docker using an external Ubuntu X server, the hang still occurred. With it running on my development machine using the same external Ubuntu X server, no hang occurred. Is it possible that Electron is accessing the graphics card in some way that does not go through the X11 protocol? This seems unlikely, but if so it could explain the difference.

My next experiment is to put an externally-built (and non-hanging) `image-exporter-server` and Electron into a Docker container and test that.

@scjody Just jumping in to see if I can help with this stuff, but I may just be stating the obvious (perhaps incorrectly assuming that your development machine uses its own graphics drivers for WebGL content rendering).
> What’s different about the app (or Electron) when it’s built and run inside Docker vs. outside?

One difference is that when Electron is used on a desktop, the WebGL API calls go through whatever graphics chip driver the desktop has, e.g. from Nvidia, AMD, or Intel integrated graphics. In a Docker container, or on CI systems in general, it’s running in headless mode but still expecting a display driver, in this case `Xvfb`.

Many of our WebGL plots run on `stack.gl` and `gl-vis`, which, unlike `regl`, don’t provide automatic resource management. The typical WebGL resources are the shader programs and the buffer bindings that link typed JS arrays to contents on the GPU, used by the shaders (`uniform`s, `attribute`s, `texture`s, etc.). As there’s no automatic resource management, a small issue in one specific trace type may yield state inconsistency, e.g. no buffer is bound to an enabled attribute (as seen in your log), or sometimes the other way around.

Various drivers, and by extension `Xvfb` too, have different idiosyncrasies and tolerances for handling slightly out-of-spec WebGL state. As we did encounter such things in the past, it might be that `Xvfb` is more sensitive to some of those.

Something else: apparently we’re running tests with `ignore-gpu-blacklist`, which in turn might disable some WebGL extensions depending on the graphics stack; if we rely on one of these and a related warning gets swallowed, we might get seemingly unrelated state inconsistency reports just like the above. A candidate is `OES_vertex_array_object`. We seem to only use it via a polyfill, but the same graphics card may process a real VAO and the polyfill differently, or not properly ‘disusing’ the polyfill may lead to inconsistencies too.
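To illustrate the two failure modes described above, here is a generic WebGL sketch (hypothetical names, not plotly.js or gl-vis code): a missing extension that only surfaces as a swallowed warning, and manual resource management that leaks a little GPU state whenever a cleanup call is missed.

```js
// Generic illustration only; plotly.js structures its GL code differently.
function renderPlot(gl, vertices) {
  // 1. With ignore-gpu-blacklist the driver may lack extensions we rely on;
  //    checking explicitly makes the failure visible instead of silent.
  const vao = gl.getExtension('OES_vertex_array_object')
  if (!vao) console.warn('OES_vertex_array_object unavailable; falling back to polyfill')

  // 2. stack.gl / gl-vis style code manages GPU resources by hand.
  const buffer = gl.createBuffer()
  gl.bindBuffer(gl.ARRAY_BUFFER, buffer)
  gl.bufferData(gl.ARRAY_BUFFER, vertices, gl.STATIC_DRAW)

  // ... compile program, bind attributes, draw ...

  // Forgetting this deleteBuffer() (or the matching deleteProgram(),
  // deleteTexture(), etc.) leaves state on the GPU after every request,
  // which is exactly the kind of build-up suspected in this issue.
  gl.deleteBuffer(buffer)
}
```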
It might be worth trying to listen to https://electronjs.org/docs/api/app#event-gpu-process-crashed in the app code. I’ll give this a shot this afternoon.
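A minimal sketch of what that listener might look like, assuming the Electron main process and the `(event, killed)` signature documented at the link above; the reaction shown (exit so a supervisor restarts the server) is just one option, not necessarily what image-exporter does:

```js
const { app } = require('electron')

// Log GPU process crashes so a hang can be correlated with a crash.
app.on('gpu-process-crashed', (event, killed) => {
  console.error(`GPU process crashed (killed: ${killed})`)
  process.exit(1)
})
```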
@monfera The best way I’ve found is to run `test/image/mocks/gl*` or `test/image/mocks/gl3d*` (from plotly.js). Both sets of mocks will reproduce the issue, but at different places.