concourse: Network latency from web to postgres causes slow initialization in task and put steps

Bug Report

A slow connection from the web node(s) to the database slows down the “initialization” part of task and put steps quite a bit. In one setup where the web and database nodes were in different data centers, we saw the initialization period go up to a minute for each task and put step in a job. When the web and database nodes were placed in the same data center, the initialization period went down to 5 seconds.

Steps to Reproduce

This can be reproduced on master using the docker-compose setup, but pointing the web node at a remote database. I’ve tried this with Postgres on AWS m5.large instances, on AWS RDS in us-east-1 and eu-west-1, and on GCP CloudSQL in europe-north1.

I used this pipeline for the tests:

---
resources:
- name: git-repo
  type: git
  source:
    uri: https://github.com/concourse/git-resource
    branch: master

jobs:
- name: test
  plan:
  - get: git-repo
  - task: print
    config:
      platform: linux
      image_resource:
        type: docker-image
        source: {repository: busybox}
      run:
        path: sh
        args:
          - -exc
          - |
            lssdf
    attempts: 2

When the database was on my local machine, the initialization took about 3 seconds. When it was in us-east-1, latencies to the database were around 30 ms and the initialization took about 9 seconds. When it was in Europe, latencies were around 100 ms and the initialization took 20 seconds.

Additional Context

I suspect that network latency to the database leads to some kind of compounding effect in Concourse, given the number of queries and the mutexes protecting each database access. I tried adding runtime.SetMutexProfileFraction(1) to the ATC, and captured some of the info which I’ve made available here. Also, in the web logs, I noticed lock.acquire.not-acquired-already-held-locally popping up a few times.
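
For reference, the change amounts to something like the minimal sketch below. This is a standalone example rather than the actual patch I applied to the ATC, and the localhost:6060 pprof listener is an assumption for illustration:

package main

import (
    "log"
    "net/http"
    _ "net/http/pprof" // registers the /debug/pprof/ handlers, including /debug/pprof/mutex
    "runtime"
)

func main() {
    // Record every mutex contention event (a fraction of 1 samples all of them).
    runtime.SetMutexProfileFraction(1)

    // Expose the profiles; the contention report can then be inspected with
    //   go tool pprof http://localhost:6060/debug/pprof/mutex
    log.Println(http.ListenAndServe("localhost:6060", nil))
}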

Version Info

  • Concourse version: v5.4.0
  • Deployment type (BOSH/Docker/binary): Docker, Kubernetes, binary
  • Infrastructure/IaaS:
  • Browser (if applicable):
  • Did this used to work?

About this issue

  • Original URL
  • State: closed
  • Created 5 years ago
  • Reactions: 1
  • Comments: 15 (12 by maintainers)

Most upvoted comments

Our installation also experienced this issue. By collecting execution traces I was able to generate a report showing when goroutines were blocked on synchronization (as well as blocked for other reasons). The crux seemed to be in the LockFactory code (atc/db/lock/lock.go), where many goroutines were waiting for the acquireMutex.

To summarize the code:

  • Various DB-accessing bits of ATC want to acquire locks, such as a pipeline-scheduling lock or a volume-creating lock.
  • This is the responsibility of the LockFactory, an effectively global object which maintains a map of all existing locks. Since maps in Go are not safe for concurrent access, it’s guarded by a mutex (the lockRepo mutex).
  • Additionally, lock acquisition is registered in the database, via Postgres “advisory locks”. This happens in a lockDB member of the factory. It uses a dedicated connection pool with one connection, and is guarded by another mutex.
  • So whenever anybody wants a lock, they first have to take the acquireMutex of the LockFactory. Once they have it they can also take the lockDB’s mutex. After a database round-trip, the mutexes are released. (Along the way, they take the lockRepo mutex a couple of times in order to check and then register in the list of locks.) Releasing a lock works the same way. A simplified sketch of this flow follows below.
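
To make that funneling concrete, here is a heavily simplified sketch of the pattern described above. It is not the actual code in atc/db/lock/lock.go; the types and names are assumptions loosely based on that description:

package lockdemo

import (
    "database/sql"
    "sync"
)

type lockFactory struct {
    acquireMutex sync.Mutex // serializes ALL lock acquisitions and releases

    repoMutex sync.Mutex       // guards the map of locally held locks
    repo      map[int]struct{} // the "lockRepo"

    dbMutex sync.Mutex // guards the dedicated DB connection
    conn    *sql.DB    // pool configured with a single connection
}

func newLockFactory(conn *sql.DB) *lockFactory {
    return &lockFactory{repo: map[int]struct{}{}, conn: conn}
}

func (f *lockFactory) Acquire(id int) (bool, error) {
    // Every caller funnels through this one mutex, so the database
    // round-trip further down blocks every other acquisition attempt.
    f.acquireMutex.Lock()
    defer f.acquireMutex.Unlock()

    f.repoMutex.Lock()
    _, alreadyHeld := f.repo[id]
    f.repoMutex.Unlock()
    if alreadyHeld {
        return false, nil // the "not-acquired-already-held-locally" case
    }

    // Database round-trip while acquireMutex is held: with ~100 ms of network
    // latency, this alone stalls every other goroutine for ~100 ms.
    f.dbMutex.Lock()
    var acquired bool
    err := f.conn.QueryRow("SELECT pg_try_advisory_lock($1)", id).Scan(&acquired)
    f.dbMutex.Unlock()
    if err != nil || !acquired {
        return false, err
    }

    f.repoMutex.Lock()
    f.repo[id] = struct{}{}
    f.repoMutex.Unlock()
    return true, nil
}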

Therefore, if database access is “slow”, then whoever is holding the acquireMutex holds it for a long time while everybody else waits. In practice, what we saw was a lot of pipeline-scheduler goroutines trying to grab it for scheduling (the AcquireSchedulingLock path), and a smaller number of step-exec goroutines trying to grab it for volume management. The step-exec goroutines were mostly unlucky, causing visible delays in step initialization of several minutes per step. The scheduler goroutines were mostly unlucky too, of course, but most pipelines aren’t running most of the time, so their delays were not really visible.

While we were able to remedy this operationally by reducing latency between web and DB nodes, the logical bottleneck is still present. People could experience this again due to transient network conditions, or perhaps DB slowness.

The overall locking design is not entirely clear to me, but there seems to be some lower-hanging fruit here:

  1. Instead of using a single global lock factory, split it into several factories used for different purposes. The lock types do not collide, so there should be no need to serialize their acquisition, and they can be held in separate maps. Then the step-exec processes can proceed without being blocked by other pipelines trying to get scheduling locks (see the sketch after this list).
  2. Replace the map-and-mutex combo with a lock-free data structure, so that multiple processes could acquire/release locks at the same time. Perhaps the lockDB mutex is also not needed?
  3. Invoke runtime.SetMutexProfileFraction at some early point in ATC’s life, so that pprof can generate mutex reports directly, rather than us having to observe lock contention by indirect means.
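
As a rough illustration of point 1, building on the hypothetical lockFactory type from the sketch earlier in the thread (the AcquireVolumeCreatingLock name below is made up for the example), the split could look something like this:

// Hypothetical sketch only: one factory per lock purpose, each with its own
// acquireMutex, map, and DB connection, so that a slow advisory-lock
// round-trip for pipeline scheduling no longer blocks volume-related lock
// acquisition during step execution.
type lockFactories struct {
    scheduling *lockFactory // pipeline-scheduling locks
    volumes    *lockFactory // volume-creating locks
    // ... one factory per non-colliding lock type
}

func (fs *lockFactories) AcquireSchedulingLock(pipelineID int) (bool, error) {
    return fs.scheduling.Acquire(pipelineID)
}

func (fs *lockFactories) AcquireVolumeCreatingLock(volumeID int) (bool, error) {
    return fs.volumes.Acquire(volumeID)
}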

I re-tested this by removing the resource and get step from the sample pipeline above, but the initialization still takes a long time.

Also, I found a much easier way to simulate network latency on the web -> db path in docker-compose than spinning up a remote database. 😂 After identifying which vethXXXXX network interface corresponds to the web container, run these on the host machine:

# introduce ~50ms latency (±10ms jitter, normal distribution) on the web container's interface
sudo tc qdisc add dev vethXXXXX root netem delay 50ms 10ms distribution normal
# remove the latency when done testing
sudo tc qdisc del dev vethXXXXX root netem

The workers are always co-located with the ATC: for example, as different containers within the same network for Docker testing, or on different VMs within the same availability zone when testing with the binary. I have not tried testing with the ATC and worker on the same VM, though.