concourse: long periods of no output + worker rebalancing causes containerd containers to exit
Summary
We run Tanzu bbr and om processes in Concourse that take a long time and produce no output, specifically when bbr validate runs against a 500+ GB blobstore, or when a stemcell upgrade runs in an environment with many on-demand Redis/RabbitMQ service instances. The ‘upgrade-all-service-instances’ errand in the Redis/RabbitMQ update produces no output until all of the service instances have been updated.
When we migrated from 6.7.2 to 7.0.0 and switched our workers from the garden runtime to containerd, these tasks started being killed after roughly one hour of output inactivity with the following message:
Backend error: Exit status: 500, message: {"Type":"","Message":"load proc: process does not exist task: not found: unknown","Handle":"","ProcessID":"","Binary":""}
When we switch back to garden, these problems go away.
We could not find any setting that controls this behavior, although we did try containerd.request_timeout: 24h (this is a BOSH deployment of Concourse).
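For reference, this is roughly how we applied that setting. This is a minimal sketch of a BOSH ops file; the instance group and job names are assumptions about a typical concourse-bosh-release deployment, and only the containerd.request_timeout property is taken from what we actually tried above.

    # Hypothetical ops file for the Concourse BOSH deployment (paths illustrative)
    - type: replace
      path: /instance_groups/name=worker/jobs/name=worker/properties/runtime?
      value: containerd
    - type: replace
      path: /instance_groups/name=worker/jobs/name=worker/properties/containerd?/request_timeout
      value: 24h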
Steps to reproduce
Run bbr or om in a task whose process goes more than 1 hour without producing output (a minimal pipeline sketch follows).
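For anyone reproducing this without bbr or om, a task that simply stays silent for more than an hour should trigger the same behavior on a containerd worker. This is an illustrative sketch, not the exact pipeline used later in test-issue-6632:

    jobs:
    - name: long-silent-task
      plan:
      - task: stay-quiet
        config:
          platform: linux
          image_resource:
            type: registry-image
            source: {repository: busybox}
          run:
            path: sh
            # prints nothing for ~75 minutes, then exits 0;
            # on containerd workers the task is killed with the 500 error above
            args: ["-c", "sleep 4500 && echo done"]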
Expected results
Task completes if the command completes
Actual results
Task is terminated with:
Backend error: Exit status: 500, message: {"Type":"","Message":"load proc: process does not exist task: not found: unknown","Handle":"","ProcessID":"","Binary":""}
Additional context
Triaging info
- Concourse version: 7.0.0
- Browser (if applicable): N/A
- Did this used to work? Yes - using garden runtime and it still does with garden runtime
About this issue
- Original URL
- State: closed
- Created 3 years ago
- Reactions: 1
- Comments: 18 (12 by maintainers)
Commits related to this issue
- B: wrap stdin when using containerd with TTY [#6632] containerd treats closing of stdin as a SIGHUP to the container. But because we're forwarding stdio over the internet, we might run into connecti... — committed to concourse/concourse by chenbh 3 years ago
Sure enough, we set up a job in https://ci.concourse-ci.org/teams/main/pipelines/test-issue-6632 and saw the issue occur again. In the logs, right before the ‘process does not exist’ error, we again see: … so it definitely seems related to rebalancing.
The interesting log lines from the web:
Going to dig into this more later today.
Oh, try this one - it doesn’t kill the processor - @taylorsilva
Sure, I’ll let it run longer. I had to do other work, so that’s why I eventually brought it down. Running it also turns my Mac into a nice space heater 😝
New information: my initial one-hour assertion is apparently false (or incomplete). Analyzing repeated failures of the above pipeline shows a pattern where the pipeline is terminated with the above error at the same minute and second, but a different hour, on every occurrence. Each of these runs was terminated at 43 minutes and 51 seconds past the hour.
That is ODD!