terracotta: Flask is erroring out with BrokenProcessPool

Hi there!

We have started using Terracotta in our K8s infrastructure in production. Basically, we serve the WSGI Flask application (terracotta.server.app:app) with gunicorn, alongside an internal gRPC server that takes internal requests, queries the Terracotta HTTP endpoint for a singleband tile, and returns it as a bytes object.
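
For context, the gRPC side is essentially just a thin HTTP client around the tile endpoint. A minimal sketch of that fetch, assuming a local Terracotta server; the base URL and the fetch_singleband_tile helper are placeholders of ours, not Terracotta API:

```python
import requests

# Placeholder base URL for the gunicorn-served Terracotta instance.
TERRACOTTA_URL = "http://localhost:5000"

def fetch_singleband_tile(keys: str, z: int, x: int, y: int) -> bytes:
    """Fetch a singleband PNG tile from the Terracotta HTTP endpoint."""
    url = f"{TERRACOTTA_URL}/singleband/{keys}/{z}/{x}/{y}.png"
    response = requests.get(url, timeout=10)
    response.raise_for_status()  # surfaces the 500s caused by the error below
    return response.content  # raw PNG bytes, passed back through gRPC as-is
```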

However, while the first 10-50 requests work fine, I now get this error from Terracotta on every request afterwards:

 [-] Exception on /singleband/some_path/25/10/506/313.png [GET]
Traceback (most recent call last):
  File "/usr/local/lib/python3.8/dist-packages/flask/app.py", line 2447, in wsgi_app
    response = self.full_dispatch_request()
  File "/usr/local/lib/python3.8/dist-packages/flask/app.py", line 1952, in full_dispatch_request
    rv = self.handle_user_exception(e)
  File "/usr/local/lib/python3.8/dist-packages/flask/app.py", line 1821, in handle_user_exception
    reraise(exc_type, exc_value, tb)
  File "/usr/local/lib/python3.8/dist-packages/flask/_compat.py", line 39, in reraise
    raise value
  File "/usr/local/lib/python3.8/dist-packages/flask/app.py", line 1950, in full_dispatch_request
    rv = self.dispatch_request()
  File "/usr/local/lib/python3.8/dist-packages/flask/app.py", line 1936, in dispatch_request
    return self.view_functions[rule.endpoint](**req.view_args)
  File "/usr/local/lib/python3.8/dist-packages/terracotta/server/flask_api.py", line 49, in inner
    return fun(*args, **kwargs)
  File "/usr/local/lib/python3.8/dist-packages/terracotta/server/singleband.py", line 121, in get_singleband
    return _get_singleband_image(keys, tile_xyz)
  File "/usr/local/lib/python3.8/dist-packages/terracotta/server/singleband.py", line 166, in _get_singleband_image
    image = singleband(parsed_keys, tile_xyz=tile_xyz, **options)
  File "/usr/lib/python3.8/contextlib.py", line 75, in inner
    return func(*args, **kwds)
  File "/usr/local/lib/python3.8/dist-packages/terracotta/handlers/singleband.py", line 43, in singleband
    tile_data = xyz.get_tile_data(
  File "/usr/local/lib/python3.8/dist-packages/terracotta/xyz.py", line 44, in get_tile_data
    return driver.get_raster_tile(
  File "/usr/local/lib/python3.8/dist-packages/terracotta/drivers/base.py", line 20, in inner
    return fun(self, *args, **kwargs)
  File "/usr/local/lib/python3.8/dist-packages/terracotta/drivers/raster_base.py", line 557, in get_raster_tile
    future = executor.submit(retrieve_tile)
  File "/usr/lib/python3.8/concurrent/futures/process.py", line 629, in submit
    raise BrokenProcessPool(self._broken)
concurrent.futures.process.BrokenProcessPool: A child process terminated abruptly, the process pool is not usable anymore

The worst thing about this is that the Flask application doesn't actually crash. Instead, every subsequent request throws the error above. That's problematic because K8s then doesn't know that the pod needs to be restarted. In the longer term, it also means that we could never handle the volume of requests we have (around 50 RPS) using Terracotta if this persists.
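
In the meantime we are considering a self-check that K8s can use as a liveness probe, so the pod at least gets restarted once the endpoint starts returning 500s. This is just a sketch on our side, not part of Terracotta; the probe URL and tile coordinates are placeholders:

```python
#!/usr/bin/env python3
"""Liveness check: exit non-zero if a known-good tile request errors out."""
import sys
import requests

# Placeholder tile; any dataset/tile that should always succeed works here.
PROBE_URL = "http://localhost:5000/singleband/some_path/25/10/506/313.png"

def main() -> int:
    try:
        response = requests.get(PROBE_URL, timeout=5)
    except requests.RequestException:
        return 1
    # A broken process pool shows up as HTTP 500 on every request.
    return 0 if response.ok else 1

if __name__ == "__main__":
    sys.exit(main())
```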

Has anyone encountered this yet?

About this issue

  • State: closed
  • Created 3 years ago
  • Reactions: 2
  • Comments: 26

Most upvoted comments

Good point. I’ll make a release later today.

That is very probable. I never saw the overall memory/CPU usage exceed any of my pod's limits, but it could be that gunicorn's worker allocation did some background magic that caused the error. Otherwise, I could also imagine that there were simply too many processes running at the same time, causing this assortment of errors. Either way, thank you a lot for following up and giving some ideas as to what it could have been. Maybe still a valid discussion for future reference.

OKAY! So, I got it to work, and I am a bit embarrassed about the underlying issue. Basically, I had far too many workers for the gunicorn Flask HTTP server given the number of cores that I had. Reducing them to a lower number got rid of all the concurrency and I/O errors and made my deployment as a whole more stable. There are no broken-pipe errors or core dumps anymore.
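
For anyone hitting the same thing: the change boiled down to pinning the gunicorn worker count to the CPUs actually available to the pod instead of a hard-coded large number. A minimal gunicorn.conf.py sketch; the exact formula and cap are our choice, not a Terracotta requirement:

```python
# gunicorn.conf.py -- loaded via `gunicorn -c gunicorn.conf.py terracotta.server.app:app`
import multiprocessing

# Keep the worker count in line with the cores available to the pod; each
# worker may spawn its own process pool for tile retrieval, so over-provisioning
# multiplies quickly. Note that cpu_count() reports host CPUs, not the pod's
# cgroup limit, hence the explicit cap.
workers = min(multiprocessing.cpu_count(), 4)
threads = 1
timeout = 60
bind = "0.0.0.0:5000"
```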

I can really recommend switching to a MySQL database. Concurrency plus SQLite has caused quite some trouble for me before…
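
A minimal sketch of pointing Terracotta at MySQL instead of the SQLite file, using get_driver with a mysql:// path; host, credentials, and database name are placeholders:

```python
import terracotta as tc

# Placeholder connection details for the MySQL metadata backend.
driver = tc.get_driver("mysql://user:password@mysql-host/terracotta")

with driver.connect():
    print(driver.get_datasets())  # sanity check that the metadata DB is reachable
```

For the server itself, the same connection string typically goes into the driver path setting (the TC_-prefixed environment variable), if I recall the config correctly.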

Alright. I’ll try to add a setting to disable multiprocessing within the coming days and make a release as soon as #203 is merged. That should do as a workaround.
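
Once that setting lands, I imagine it being toggled like the other Terracotta settings; the setting name below is hypothetical until the release is out:

```python
import terracotta as tc

# Hypothetical setting name -- check the release notes for the actual one.
# Terracotta settings can also be set via TC_-prefixed environment variables.
tc.update_settings(USE_MULTIPROCESSING=False)
```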

Permanent fix ideas, with increasing cleverness:

  1. Disable multiprocessing by default and accept that /rgb takes 3x as long as /singleband.
  2. Try multithreading again with GDAL 3, maybe the race condition is fixed by now.
  3. Detect whether the process pool is broken and spawn new workers as needed (roughly as sketched below).
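
For option 3, a rough sketch of what the recovery could look like around the tile retrieval path; the executor handling here is simplified and not Terracotta's actual internals:

```python
from concurrent.futures import ProcessPoolExecutor
from concurrent.futures.process import BrokenProcessPool

_executor = ProcessPoolExecutor(max_workers=3)

def submit_with_recovery(fn, *args, **kwargs):
    """Submit a task, replacing the pool once if it has been marked broken."""
    global _executor
    try:
        return _executor.submit(fn, *args, **kwargs)
    except BrokenProcessPool:
        # A crashed child poisons the whole pool; throw it away and retry once
        # with a fresh pool instead of failing every subsequent request.
        _executor.shutdown(wait=False)
        _executor = ProcessPoolExecutor(max_workers=3)
        return _executor.submit(fn, *args, **kwargs)
```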