terracotta: Flask is erroring out with BrokenProcessPool
Hi there!
We have started using Terracotta in our production K8s infrastructure. We serve the WSGI Flask application (terracotta.server.app:app) with gunicorn, alongside an internal gRPC server that takes internal requests, queries the Terracotta HTTP endpoint for a singleband tile, and returns it as a bytes object.
However, while the first 10-50 requests work fine, I now get this error from terracotta afterwards:
[-] Exception on /singleband/some_path/25/10/506/313.png [GET]
Traceback (most recent call last):
  File "/usr/local/lib/python3.8/dist-packages/flask/app.py", line 2447, in wsgi_app
    response = self.full_dispatch_request()
  File "/usr/local/lib/python3.8/dist-packages/flask/app.py", line 1952, in full_dispatch_request
    rv = self.handle_user_exception(e)
  File "/usr/local/lib/python3.8/dist-packages/flask/app.py", line 1821, in handle_user_exception
    reraise(exc_type, exc_value, tb)
  File "/usr/local/lib/python3.8/dist-packages/flask/_compat.py", line 39, in reraise
    raise value
  File "/usr/local/lib/python3.8/dist-packages/flask/app.py", line 1950, in full_dispatch_request
    rv = self.dispatch_request()
  File "/usr/local/lib/python3.8/dist-packages/flask/app.py", line 1936, in dispatch_request
    return self.view_functions[rule.endpoint](**req.view_args)
  File "/usr/local/lib/python3.8/dist-packages/terracotta/server/flask_api.py", line 49, in inner
    return fun(*args, **kwargs)
  File "/usr/local/lib/python3.8/dist-packages/terracotta/server/singleband.py", line 121, in get_singleband
    return _get_singleband_image(keys, tile_xyz)
  File "/usr/local/lib/python3.8/dist-packages/terracotta/server/singleband.py", line 166, in _get_singleband_image
    image = singleband(parsed_keys, tile_xyz=tile_xyz, **options)
  File "/usr/lib/python3.8/contextlib.py", line 75, in inner
    return func(*args, **kwds)
  File "/usr/local/lib/python3.8/dist-packages/terracotta/handlers/singleband.py", line 43, in singleband
    tile_data = xyz.get_tile_data(
  File "/usr/local/lib/python3.8/dist-packages/terracotta/xyz.py", line 44, in get_tile_data
    return driver.get_raster_tile(
  File "/usr/local/lib/python3.8/dist-packages/terracotta/drivers/base.py", line 20, in inner
    return fun(self, *args, **kwargs)
  File "/usr/local/lib/python3.8/dist-packages/terracotta/drivers/raster_base.py", line 557, in get_raster_tile
    future = executor.submit(retrieve_tile)
  File "/usr/lib/python3.8/concurrent/futures/process.py", line 629, in submit
    raise BrokenProcessPool(self._broken)
concurrent.futures.process.BrokenProcessPool: A child process terminated abruptly, the process pool is not usable anymore
The worst thing about this is that the Flask application doesn’t actually exit. Instead, every subsequent request throws the error above. That’s problematic because K8s then doesn’t know that the pod needs to be restarted. In the long run, it also means that we could never handle our request volume (around 50 RPS) with Terracotta if this persists.
Has anyone encountered this yet?
About this issue
- State: closed
- Created 3 years ago
- Reactions: 2
- Comments: 26
Commits related to this issue
- respawn broken process pool (#205) — committed to DHI/terracotta by dionhaefner 3 years ago
- Merge pull request #206 from DHI-GRAS/respawn-broken-pool Respawn broken process pool (#205) — committed to DHI/terracotta by dionhaefner 3 years ago
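For readers wondering what "respawning" a broken pool can look like, here is a minimal sketch using only the standard library. It is not Terracotta's actual implementation; the class name `RespawningPool`, its `factory` parameter, and the `respawns` counter are made up for illustration. It only covers the case from the traceback above, where `submit()` itself raises `BrokenProcessPool`.

```python
import threading
from concurrent.futures import Executor, Future, ProcessPoolExecutor
from concurrent.futures.process import BrokenProcessPool
from typing import Any, Callable


class RespawningPool:
    """Hypothetical wrapper that replaces its executor once it is broken."""

    def __init__(self, factory: Callable[[], Executor] = ProcessPoolExecutor) -> None:
        self._factory = factory          # builds a fresh executor on demand
        self._lock = threading.Lock()    # serialise swaps between request threads
        self._executor = factory()
        self.respawns = 0                # how often the pool had to be rebuilt

    def submit(self, fn: Callable[..., Any], *args: Any, **kwargs: Any) -> Future:
        with self._lock:
            try:
                return self._executor.submit(fn, *args, **kwargs)
            except BrokenProcessPool:
                # A broken pool never recovers, so discard it and retry once
                # on a new one. A second failure propagates to the caller.
                self._executor.shutdown(wait=False)
                self._executor = self._factory()
                self.respawns += 1
                return self._executor.submit(fn, *args, **kwargs)
```

Note that a task that was already running when the child process died still fails: its future raises `BrokenProcessPool` from `result()`, and this sketch does not retry it. Respawning also does not address whatever killed the child in the first place.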
Done
Good point. I’ll make a release later today.
That is very probable. I never saw the overall memory / CPU usage exceed any of my pod’s limits, but it could be that gunicorn’s allocation did some background magic to cause the error. Otherwise, I could also imagine that there were just too many processes running at the same time, causing this pile of errors. Either way, thanks a lot for following up and offering ideas about what it could have been. Maybe this is still a valid discussion for future reference.
OKAY! So, I got it to work, and I am a bit embarrassed by the underlying issue. Basically, I had far too many workers for the gunicorn Flask HTTP server relative to the number of cores I had. Reducing the worker count got rid of all the concurrency and IO errors and made my deployment as a whole more stable. There are no more broken-pipe errors or core dumps.
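To make the fix above concrete, here is a rough sketch of a gunicorn config file (gunicorn config files are plain Python, and `workers` is a real gunicorn setting). The helper `capped_workers`, its parameters, and the numbers are assumptions for illustration, not values from this deployment.

```python
import multiprocessing
from typing import Optional


def capped_workers(requested: int, cores: Optional[int] = None,
                   max_per_core: int = 1) -> int:
    """Hypothetical helper: never start more workers than the cores allow."""
    if cores is None:
        # Caveat: inside a container this typically reports the host's cores,
        # not the pod's CPU limit, so passing `cores` explicitly is safer.
        cores = multiprocessing.cpu_count()
    return max(1, min(requested, cores * max_per_core))


# gunicorn picks up module-level names such as `workers` from its config file.
# The values here are placeholders.
workers = capped_workers(requested=32, cores=4)
```

Saved as, say, `gunicorn.conf.py`, this could be used with `gunicorn -c gunicorn.conf.py terracotta.server.app:app`. Keep in mind that each worker may start its own process pool for reading rasters, so the total number of processes can be a multiple of `workers`.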
I can really recommend switching to a MySQL database. Concurrency + SQLite has caused quite some trouble for me before…
Alright. I’ll try to add a setting to disable multiprocessing within the coming days and make a release as soon as #203 is merged. That should do as a workaround.
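A switch like that could look roughly like the sketch below. This is only an illustration of the idea, not the setting that was eventually released: the function `make_executor` and the environment variable `EXAMPLE_USE_MULTIPROCESSING` are hypothetical names. With multiprocessing off, work runs on a single thread inside the worker process, so there is no child process that can die and break a pool.

```python
import os
from concurrent.futures import Executor, ProcessPoolExecutor, ThreadPoolExecutor
from typing import Mapping, Optional

_FALSY = ("0", "false", "no", "off")


def make_executor(env: Optional[Mapping[str, str]] = None) -> Executor:
    """Pick an executor based on a (hypothetical) on/off setting."""
    env = os.environ if env is None else env
    flag = env.get("EXAMPLE_USE_MULTIPROCESSING", "true").strip().lower()
    if flag in _FALSY:
        # Same submit()/result() interface, but no child processes involved.
        return ThreadPoolExecutor(max_workers=1)
    return ProcessPoolExecutor()
```

Because both executors share the `submit()` interface, calling code such as `executor.submit(retrieve_tile)` would not need to change.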
Permanent fix ideas, with increasing cleverness:
/rgb takes 3x as long as /singleband.