concourse: long periods of no output + worker rebalancing cause containerd containers to exit

Summary

We are running Tanzu bbr and om processes in Concourse that take a long time with no output. Specifically, this happens when bbr validate runs against a 500+ GB blobstore, or during a stemcell upgrade in an environment with many instances of on-demand Redis/RabbitMQ. The ‘upgrade-all-service-instances’ errand in the Redis/RabbitMQ update does not produce any output until all the service instances have been updated.

When we migrated from 6.7.2 to 7.0.0 and switched our workers to the containerd runtime instead of Garden, we saw tasks being killed after 1 hour of output inactivity with the following message:

Backend error: Exit status: 500, message: {"Type":"","Message":"load proc: process does not exist task: not found: unknown","Handle":"","ProcessID":"","Binary":""}

When we switch back to Garden, these problems go away.

We could not find any settings to control this behavior, although we did try containerd.request_timeout: 24h (this is a BOSH deployment of Concourse).
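
For reference, a sketch of how a property like that can be set with a BOSH ops file. The instance group and job names (‘worker’) and the exact path are assumptions about a typical concourse-bosh-release manifest, not a copy of our actual ops file; in any case the setting did not change the behavior for us.

# Ops-file sketch for the containerd.request_timeout property we tried;
# adjust the path to match your own deployment manifest.
- type: replace
  path: /instance_groups/name=worker/jobs/name=worker/properties/containerd?/request_timeout
  value: 24h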

Steps to reproduce

Run bbr or om with a command that goes more than 1 hour without producing output.
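
It does not have to be bbr or om: any task whose command stays silent long enough for a worker rebalance to happen appears to hit the same error (see the test pipeline and rebalance discussion in the comments below). A minimal sketch, with an arbitrary image and sleep duration:

jobs:
- name: silent-job
  plan:
  - task: silent-task
    config:
      platform: linux
      image_resource:
        type: registry-image
        source: {repository: busybox}
      run:
        path: /bin/sh
        args:
        - -c
        - |
          echo starting
          sleep 7200   # two hours with no output at all
          echo done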

Expected results

Task completes if the command completes

Actual results

Task is terminated with:

Backend error: Exit status: 500, message: {"Type":"","Message":"load proc: process does not exist task: not found: unknown","Handle":"","ProcessID":"","Binary":""}

Additional context

Triaging info

  • Concourse version: 7.0.0
  • Browser (if applicable): N/A
  • Did this used to work? Yes - with the Garden runtime, and it still does when we switch back to Garden

About this issue

  • State: closed
  • Created 3 years ago
  • Reactions: 1
  • Comments: 18 (12 by maintainers)

Most upvoted comments

Sure enough, we set up a job in https://ci.concourse-ci.org/teams/main/pipelines/test-issue-6632 and saw the issue occur again. In the logs, right before the ‘process does not exist’ error, we again see:

worker.beacon-runner.beacon.rebalance.run.command-exited-after-signal

…so it definitely seems related to rebalancing

The interesting log lines from the web:

{"timestamp":"2021-03-05T02:30:11.358938100Z","level":"debug","source":"atc","message":"atc.tracker.notify.run.task-step.registering-outputs","data":{"build":"2","build_id":2,"job":"sleeptest","job-id":1,"outputs":null,"pipeline":"wait-no-output-6632","session":"22.62.2.3","step-name":"sleeptest","team":"main"}}
{"timestamp":"2021-03-05T02:30:11.359003100Z","level":"info","source":"atc","message":"atc.tracker.notify.run.errored","data":{"build":"2","build_id":2,"error":"Backend error: Exit status: 500, message: {\"Type\":\"\",\"Message\":\"load proc: process does not exist task: not found: unknown\",\"Handle\":\"\",\"ProcessID\":\"\",\"Binary\":\"\"}\n","job":"sleeptest","pipeline":"wait-no-output-6632","session":"22.62.2","team":"main"}}
{"timestamp":"2021-03-05T02:30:11.371671300Z","level":"info","source":"atc","message":"atc.tracker.notify.run.finish.errored","data":{"build":"2","build_id":2,"error":"Backend error: Exit status: 500, message: {\"Type\":\"\",\"Message\":\"load proc: process does not exist task: not found: unknown\",\"Handle\":\"\",\"ProcessID\":\"\",\"Binary\":\"\"}\n","job":"sleeptest","pipeline":"wait-no-output-6632","session":"22.62.2.9","team":"main"}}
{"timestamp":"2021-03-05T02:30:11.374923000Z","level":"debug","source":"atc","message":"atc.tracker.notify.run.lock.release.released","data":{"build":"2","build_id":2,"id":[1,2],"job":"sleeptest","pipeline":"wait-no-output-6632","session":"22.62.2.1.2","team":"main"}}

Going to dig into this more later today.

Oh, try this one - it doesn’t peg the CPU - @taylorsilva

jobs:
- name: sleeptest
  plan:
  - task: sleeptest
    privileged: true
    config:
      platform: linux
      image_resource:
        type: docker-image
        source:
          #repository: cvglpnexus01.td.afg:8082/hub.docker.intel.com/swce-docker/pipeline-base
          repository: cvglpnexus01.td.afg:8082/bash
          tag: latest
          insecure_registries:
            - cvglpnexus01.td.afg:8082
          password: ((nexus_credential.password))
          username: ((nexus_credential.username))
      run:
        path: /bin/sh
        args:
        - -c
        - |
          echo Starting
          touch inputfile
          date >> inputfile
          # tail -f prints the line above, then blocks forever with no further
          # output, simulating a long-running command that stays silent
          tail -f inputfile
          date  # never reached; tail -f does not exit on its own

Sure, I’ll let it run longer. I had to do other work, which is why I eventually brought it down. Running it also turns my Mac into a nice space heater 😝

New information. Apparently, my initial 1-hour assertion is false (or at least incomplete). Analyzing repeated failures of the above pipeline shows a pattern: every occurrence is terminated with the above error at the same minute and second past the hour, with only the hour differing. Each of these runs was terminated at 43 minutes and 51 seconds past the hour.

That is ODD!

538487  sleeptest/sleeptest/14                errored    2021-03-04@01:51:04-0500  2021-03-04@05:43:51-0500  3h52m47s  platform-proving-grounds  <redacted email>
525383  sleeptest/sleeptest/13                errored    2021-03-03@17:55:16-0500  2021-03-03@21:43:50-0500  3h48m34s  platform-proving-grounds  <redacted email>
520273  sleeptest/sleeptest/12                errored    2021-03-03@14:50:27-0500  2021-03-03@17:43:51-0500  2h53m24s  platform-proving-grounds  <redacted email>
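
Since each failure lands at the same minute and second past the hour, it looks like it is tracking some worker-side timer with an hourly cadence rather than the amount of output inactivity. If that timer is the beacon rebalance, one way to test (and possibly work around) it would be to stretch the rebalance interval so a silent task can finish before the next rebalance. This is a sketch only: we have not verified the property name or path against the concourse-bosh-release worker job spec, so treat both as assumptions and check the spec for your version.

# Hypothetical ops file: lengthen the worker rebalance interval so it no
# longer fires mid-task. Property name and path are assumptions; the worker
# job spec for your release version is the authority here.
- type: replace
  path: /instance_groups/name=worker/jobs/name=worker/properties/rebalance_interval?
  value: 24h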