concourse: Performance regression with 'overlay' driver and privileged containers

Bug Report

  • Concourse version: 3.3.1
  • Deployment type (BOSH/Docker/binary): binary
  • Infrastructure/IaaS: AWS
  • Browser (if applicable): chrome
  • Did this used to work? Yes

We are noticing that various tasks hang for 1-10 minutes before executing. (The UI doesn't show the 'loading' spinner; the build just hangs.)

Our setup:

  • 2 workers with 16 GB RAM and 4 cores, 1 web node with 8 GB RAM and 4 cores (IIRC)
  • Postgres db lives on the web VM
  • Workers and web never get above 10% CPU
  • Three pipelines with < 5 jobs, 1 pipeline with ~20 jobs, some of which have 3000+ builds (most are low hundreds though)

Things we have tried:

This was also seen by @krishicks in the Slack concourse#general channel, I believe.

About this issue

  • State: closed
  • Created 7 years ago
  • Comments: 24 (12 by maintainers)

Most upvoted comments

For anyone not following along in #1966, we found and fixed the source of a lot of the btrfs instability that led us to switching to overlay in the first place. We can now more easily recommend that people switch back to it with the next release of Concourse (3.9), and we’ll consider changing the default in the future. Until we either change the default or find a way to improve the overlay performance (not likely), I’ll leave this issue open.
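
For anyone wanting to try the switch back on a binary deployment, a minimal sketch of selecting the volume driver when starting a worker follows. Verify the exact flag and environment variable names against concourse worker --help for your version.

# Minimal sketch (standalone binary): choose the btrfs volume driver when
# starting the worker. Flag/env var names should be checked against
# `concourse worker --help` for your version; the btrfs driver also expects
# btrfs tooling (btrfs-progs) to be installed on the worker host.

# Pass the flag on the worker command line, alongside your existing
# --work-dir and TSA flags:
concourse worker --work-dir /opt/concourse/worker --baggageclaim-driver btrfs

# ...or set the equivalent environment variable before starting the worker:
export CONCOURSE_BAGGAGECLAIM_DRIVER=btrfs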

I see this issue on my Concourse setup, version 4.2.1. Is there any information I should provide?

We have the same issue with 4.2.1 with the binary release.

I still see this issue in 3.14.1, even after switching to the btrfs driver. The job just keeps waiting and sometimes shows 'waiting for docker to come up…' for more than 15 minutes.

We did see improvement when switching back.

I had forgotten about this issue and realized it was affecting me on a new Concourse deploy. Thanks for reminding me to switch drivers! On Thu, Dec 28, 2017 at 09:56 Timothy R. Chavez notifications@github.com wrote:

Did the folks who switched back to btrfs notice a performance improvement? Have you noticed increased instability? cc: @krishicks, @jadekler

FWIW the slowness and the lack of feedback drive dev-folk here bonkers.


With the switch to overlay by default in 3.1, the tradeoff is that privileged tasks/resources (i.e. the Docker Image resource) have a performance penalty.

Unfortunately with container tech as it is today, you either get instability (btrfs) or slowness (overlay). We chose the latter.
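
To make the slowness side of that tradeoff concrete: the shiftfs item in the list below implies that the overhead for privileged workloads on overlay comes from remapping the ownership of every file in a volume between uid namespaces. Here is a purely illustrative sketch of that kind of walk; this is an assumption about the mechanism, not Concourse's actual code, and the path and offset are made up.

#!/bin/bash
# Illustrative only: per-file uid/gid shifting, i.e. the kind of work shiftfs
# would make unnecessary. NOT Concourse's actual code; the path and offset
# below are hypothetical examples.
ROOTFS=/opt/concourse/worker/volumes/live/some-handle/volume
OFFSET=100000   # hypothetical subuid/subgid offset

# Every file and directory in a potentially multi-gigabyte image rootfs has to
# be visited and re-owned, which is why a step can appear hung for minutes.
find "$ROOTFS" -print0 | while IFS= read -r -d '' path; do
  uid=$(stat -c %u "$path")
  gid=$(stat -c %g "$path")
  chown -h "$((uid + OFFSET)):$((gid + OFFSET))" "$path"
done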

Here are a few paths forward:

  • Improve the UI feedback, so you at least know what it’s doing and can know it’s not stuck (so more than just a spinner). We’ve been generally in need of finer-grained progress indicators on the build page. (/cc @jma @Lindsayauchin)
  • Look for optimizations in the runtime (/cc @topherbullock).
  • Pray to the kernel gods that shiftfs gets merged and we can kill this nasty performance overhead.
  • Add the ability to limit the inputs to a put, if what's taking so long is transferring data that isn't actually needed by the task/put. See https://github.com/concourse/concourse/issues/1202 (a sketch of what this could look like follows this list)
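
On that last point: later Concourse releases did add an inputs: parameter to put steps (it was not available at the time of this thread), which lets a put mount only the artifacts it actually needs instead of every input in the plan. A rough sketch of the idea, with hypothetical resource and artifact names:

# Sketch only: the inputs: key on a put step shipped in later Concourse
# releases and did not exist when this comment was written. Names below
# (app-image, built-image-dir) are hypothetical.
- put: app-image
  inputs: [built-image-dir]   # stream only this artifact to the put container
  params:
    build: built-image-dir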

I was using the following and DID NOT see this issue:

  • Concourse 3.0.1 (standalone binaries)
  • Google Compute Engine VMs (Ubuntu 14.04)
  • baggage claim driver: btrfs

I recently upgraded to Concourse 3.3.4, and I DO see this issue.

I upgraded to Concourse 3.1.1 in between 3.0.1 and 3.3.4 and I don’t believe I saw the issue, but I could be mistaken.

Note: we are using the default baggageclaim driver (overlay) in 3.3.4.

It seems like certain jobs are affected by it more than others.

I have one job that builds a docker image (see below)

All the tasks run quickly, with no lag, but when it gets to the docker-image-resource "put" step for "app-image", it takes 6+ minutes before it starts spinning.

During that 6+ minutes, it visually just looks like it has not started.

I also see that there is a container listed for that step when I do a fly containers. However, I cannot intercept into it (I get an SSH bad handshake, so I assume it's because the container has not actually spun up yet?).

- name: build-image-backend
  serial_groups: [build-docker]
  plan:
  - aggregate:
    - get: ci-runner-image
    - get: ci-scripts
    - get: ci-secrets
    - get: backend-version
      params: {bump: patch}
    - get: backend-dev
      passed: [unit-tests-backend]
      trigger: true
  - aggregate:
    - get: base-backend-image
      params: {save: true}
      passed: [build-base-backend]
    - task: prep-docker-image
      image: ci-runner-image
      file: ci-scripts/tasks/backend/prep-docker-backend.yml
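  # This put is the step where the multi-minute delay described above is
  # observed; docker-image puts run in privileged containers.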
  - put: backend-image
    params:
      load_base: base-backend-image
      build: output
      tag: backend-version/number
      tag_as_latest: true
    get_params:
      skip_download: true
  - put: backend-version
    params:
      file: backend-version/number

Before / After: (screenshots omitted)

During this 6+ minutes, I ran a watch on fly volumes for the container handle associated with that put step. I see a few entries in fly containers for that handle, and then after about 6 minutes a new volume entry shows up in watch fly volumes | grep the-container-handle. So it seems like it's related to the filesystem or volumes?
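
For anyone trying to reproduce this kind of observation, here is a sketch of the inspection commands used above. "ci" is a placeholder fly target, and whether intercept accepts --handle depends on your fly version, so check fly intercept --help.

# List containers and volumes on the target ("ci" is a placeholder target name):
fly -t ci containers
fly -t ci volumes

# Watch for a volume tied to the stuck put step's container handle:
watch 'fly -t ci volumes | grep the-container-handle'

# Try to get a shell in that container once it exists (support for --handle
# varies by fly version; check fly intercept --help):
fly -t ci intercept --handle the-container-handle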

It would be great to get this fixed. Also, if there is anything I could look into (logs or anything else to help track this down), that would be appreciated.

We got a lot of gains with the new caches feature, but lost them in some CI pipelines with this noticeable lag 😃

Cheers