concourse: Aggregate issue: 'too many open files'

Please provide:

  • Which component is emitting this?
  • What files are open? (lsof -p <process id>)
  • Concourse version
  • Deployment type (e.g. binary, Bosh, K8s, Docker)
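
A rough way to gather that information on a Linux host is below (the pgrep pattern is an assumption; adjust it for however your deployment names the process):

# Find the Concourse process (web or worker)
pid=$(pgrep -f 'concourse (web|worker)' | head -n1)

# Count open file descriptors and show the per-process limit
ls /proc/$pid/fd | wc -l
grep 'Max open files' /proc/$pid/limits

# Full listing of what those descriptors point at
lsof -p "$pid"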

About this issue

  • Original URL
  • State: closed
  • Created 7 years ago
  • Comments: 22 (9 by maintainers)

Most upvoted comments

We’ve also been seeing this issue across multiple versions; it became particularly noticeable after upgrading to 4.0, and we’ve currently paused at 4.1. We reached a stage where our cluster kept falling over on a daily basis; adding more workers stretched that out so that we now fall over every three days. Open-file exhaustion is the culprit. We’ll try drastically increasing the instance type and the open-files ulimit and report back. We really don’t like the idea of downgrading major versions…

The fact that this is still an issue makes me feel sad. What needs to happen for this to stop being an issue? Would bpm work?

We’re seeing this behaviour on the web instances, which run the ATC.

Concourse version: 4.2.1

We’ve had to resort to bosh sshing in and raising the soft ulimit for open files above the default 1024 inside the monit job that starts the ATC. Once we restarted the job via monit, it picked up the new ulimit and has worked nicely.
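
Roughly what that manual workaround looks like for us (job and ctl script names vary between releases; on our deployment the monit job is atc with a ctl script under /var/vcap/jobs/atc/bin, and the deployment/instance names below are just examples):

bosh -d concourse ssh web/0
sudo -i

# Add a higher soft limit near the top of the ctl script that monit runs,
# e.g. insert: ulimit -n 65536
vi /var/vcap/jobs/atc/bin/atc_ctl

# Restart the job so it picks up the new limit
monit restart atc
monit summary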

However, this is not a proper fix, as a bosh restart/recreate/upgrade will wipe out our customisations.

We did try using ulimit-release, but there appears to be some sort of race condition: the settings were applied to the wrong monit job.

From our side, it would be much easier if we could set this via the bosh manifest and have it cascade down to the atc process, like this: https://github.com/concourse/concourse/pull/1530/commits/e9e23a17f87decc571a7851b17759463f58d3518 from @gerhard

A bit of feedback on what we did that vastly improved uptime. We’re using 4.2.1 now, at scale.

  1. Disable the syslog config on the web nodes. Our syslog servers are a bit rubbish and sometimes don’t accept logs.

  2. Revert to overlayfs instead of btrfs for the baggageclaim driver on worker nodes (see the ops-file sketch after this list): https://bosh.io/jobs/baggageclaim?source=github.com/concourse/concourse-bosh-release&version=4.2.1#p%3Ddriver

  3. Ensure that garden-runc-release v1.17.1 is used, as it fixes the #1884 bug: https://github.com/cloudfoundry/garden-runc-release/releases/tag/v1.17.1
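
For point 2, this is roughly the ops file we apply to switch the driver (the instance group and job names match our manifest for concourse-bosh-release 4.x; verify them against yours):

cat > use-overlay-driver.yml <<'EOF'
# Switch baggageclaim from btrfs to the overlay driver on the worker instance group
- type: replace
  path: /instance_groups/name=worker/jobs/name=baggageclaim/properties/driver?
  value: overlay
EOF

bosh -d concourse deploy concourse.yml -o use-overlay-driver.yml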

After upgrading to 4.x, the RabbitMQ team has had to reap and recreate workers every few days, otherwise no jobs get scheduled. Now that we’ve found similar reports that point at file descriptors, we’ll try running lsof on the worker VMs the next time it happens.
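
For reference, “reap and recreate” for us is roughly the following (the fly target and instance names are placeholders):

# List workers and their state; stalled workers stop having work scheduled on them
fly -t main workers

# Remove the stalled worker’s record so a recreated VM can register cleanly
fly -t main prune-worker -w <worker-name>

# Then recreate the worker VM, e.g. with BOSH
bosh -d concourse recreate worker/<instance-id>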

Ran into this as well now: a stalling worker emitting the following log lines (on the worker):

2018/10/19 13:13:10 http: Accept error: accept tcp [::]:7777: accept4: too many open files; retrying in 1s
2018/10/19 13:13:10 http: Accept error: accept tcp [::]:7788: accept4: too many open files; retrying in 1s

Could not find any log lines in the ATC related to the above (it’s configured to only log errors). When running lsof -i on the worker in question, I can see many connections in the CLOSE_WAIT state (examples below):

concourse 2650   concourse  333u  IPv6 71027808      0t0  TCP ip-10-9-14-54.eu-west-1.compute.internal:7788->ip-10-9-18-210.eu-west-1.compute.internal:36158 (CLOSE_WAIT)
concourse 2650   concourse  334u  IPv6 71026430      0t0  TCP ip-10-9-14-54.eu-west-1.compute.internal:7788->ip-10-9-18-210.eu-west-1.compute.internal:33218 (CLOSE_WAIT)
concourse 2650   concourse  336u  IPv6 71027045      0t0  TCP ip-10-9-14-54.eu-west-1.compute.internal:7788->ip-10-9-18-210.eu-west-1.compute.internal:34810 (CLOSE_WAIT)

So Concourse workers are leaking sockets somehow…
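
A quick way to confirm and quantify the leak on the worker (pid 2650 is the worker process from the lsof output above; the ports are the ones in the accept errors):

# Count sockets stuck in CLOSE_WAIT that are held by the worker process
lsof -i -a -p 2650 | grep -c CLOSE_WAIT

# Same view from the kernel side, limited to the baggageclaim/garden ports
ss -tan state close-wait '( sport = :7788 or sport = :7777 )' | wc -l

# Watch the count over time to confirm it grows steadily rather than in bursts
watch -n 60 "lsof -i -a -p 2650 | grep -c CLOSE_WAIT"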

Because the exhaustion issue was causing instability and unreliability in one of our core systems (Concourse), we just downgraded to 3.14.1, which doesn’t appear to exhaust itself (or if it does, it does so at a much slower rate).

After moving the (single) worker to bare metal, the situation did not improve at all. We went back to the dockerized version, creating a separate Concourse worker per team, all hosted on a single host, which resolved the open-fd issue.
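
For anyone curious, the per-team setup is roughly one container per team on the same host, along these lines (image tag, key paths, and the TSA address are placeholders; the env var names are the ones the concourse/concourse image documents, so double-check them against your version):

docker run -d --privileged --name worker-team-a \
  -e CONCOURSE_TSA_HOST=web.example.com:2222 \
  -e CONCOURSE_TSA_PUBLIC_KEY=/keys/tsa_host_key.pub \
  -e CONCOURSE_TSA_WORKER_PRIVATE_KEY=/keys/worker_key \
  -e CONCOURSE_TEAM=team-a \
  -v /opt/concourse/keys:/keys \
  concourse/concourse:4.2.1 worker

# Repeat with CONCOURSE_TEAM=team-b, team-c, ... so each team gets its own worker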

It seems like the advice of one worker = one physical machine does not work well for bigger hosts and larger team/pipeline setups. Something seems to grow faster than linearly with more pipelines/resources/teams.