concourse: Network latency from web to postgres causes slow initialization in task and put steps
Bug Report
A slow connection from the web node(s) to the database slows down the “initialization” part of task and put steps considerably. In one setup where the web and database nodes were in different data centers, we saw the initialization period go up to a minute for each task and put step in a job. When the web and database nodes were placed in the same data center, the initialization period went down to about 5 seconds.
Steps to Reproduce
This can be reproduced on master using the docker-compose setup, but with the web node pointed at a remote database. I’ve tried this with Postgres on AWS m5.large instances, AWS RDS in us-east-1 and eu-west-1, as well as GCP Cloud SQL in europe-north1.
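For example, assuming the stock docker-compose.yml from the Concourse repo, an override file along these lines points the web node at an external Postgres instead of the bundled `db` service (the hostname and credentials below are placeholders):

```yaml
# docker-compose.override.yml (sketch): point the web node at a remote Postgres.
# Hostname and credentials are placeholders.
version: '3'

services:
  web:
    environment:
      CONCOURSE_POSTGRES_HOST: postgres.example.com
      CONCOURSE_POSTGRES_USER: concourse_user
      CONCOURSE_POSTGRES_PASSWORD: concourse_pass
      CONCOURSE_POSTGRES_DATABASE: concourse
```

docker-compose merges `docker-compose.override.yml` automatically, so no other changes are needed.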
I used this pipeline for the tests:
---
resources:
- name: git-repo
  type: git
  source:
    uri: https://github.com/concourse/git-resource
    branch: master

jobs:
- name: test
  plan:
  - get: git-repo
  - task: print
    config:
      platform: linux
      image_resource:
        type: docker-image
        source: {repository: busybox}
      run:
        path: sh
        args:
        - -exc
        - |
          lssdf
    attempts: 2
When the database was on my local machine, initialization took about 3 seconds. When it was in us-east-1, latency to the database was around 30 ms and initialization took about 9 seconds. When it was in Europe, latency was around 100 ms and initialization took about 20 seconds. The roughly linear growth with round-trip time suggests each step’s initialization performs on the order of 150–200 sequential database round trips.
Additional Context
I suspect that network latency to the database compounds in Concourse through the number of queries issued and the mutexes protecting each database access. I tried adding `runtime.SetMutexProfileFraction(1)` to the ATC and captured some of the info, which I’ve made available here. Also, `lock.acquire.not-acquired-already-held-locally` popped up a few times in the web logs.
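For reference, enabling this is a small change; the sketch below shows what it could look like, with the standalone `main` and the side port 6060 as illustrative assumptions rather than the actual ATC wiring:

```go
package main

import (
	"log"
	"net/http"
	_ "net/http/pprof" // registers the /debug/pprof/* handlers on the default mux
	"runtime"
)

func main() {
	// Record every mutex contention event; without this, the pprof mutex
	// profile stays empty.
	runtime.SetMutexProfileFraction(1)

	// Expose the pprof endpoints on a side port (placeholder address).
	go func() {
		log.Println(http.ListenAndServe("localhost:6060", nil))
	}()

	// ... the rest of the ATC startup would continue here ...
	select {}
}
```

With that in place, `go tool pprof http://localhost:6060/debug/pprof/mutex` pulls a mutex-contention report directly.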
Version Info
- Concourse version: v5.4.0
- Deployment type (BOSH/Docker/binary): Docker, Kubernetes, binary
- Infrastructure/IaaS:
- Browser (if applicable):
- Did this used to work?
About this issue
- Original URL
- State: closed
- Created 5 years ago
- Reactions: 1
- Comments: 15 (12 by maintainers)
Our installation also experienced this issue. By retrieving traces I was able to generate a report showing when goroutines were blocked on synchronization (as well as blocking for other reasons). The crux seemed to be in the LockFactory code (`atc/db/lock/lock.go`), where many routines were waiting for the `acquireMutex`. To summarize the code:

- Held locks are registered in an in-memory list (guarded by the `lockRepo` mutex).
- Database round-trips go through the `lockDB` member of the factory. It uses a dedicated connection pool with one connection, and is guarded by another mutex.
- To acquire a lock, goroutines first take the `acquireMutex` of the LockFactory. Once they have it they can also get the `lockDB`’s mutex. After a database round-trip, the mutexes are released. (Along the way, they take the `lockRepo` mutex a couple of times in order to check and then register in the list of locks.)
- Releasing a lock works in the same way.

Therefore, if database access is “slow”, whoever is holding the `acquireMutex` will have it for a long time while everybody else waits. In practice, what we saw was a lot of pipeline scheduler goroutines trying to grab it for scheduling (the `AcquireSchedulingLock` path), and a smaller number of step-exec goroutines trying to grab it for volume management. The latter were mostly unlucky, causing visible delays in step initialization of several minutes per step. The scheduler routines were also mostly unlucky, of course, but most pipelines aren’t running most of the time, so their delays were not really visible.

While we were able to remedy this operationally by reducing latency between the web and DB nodes, the logical bottleneck is still present. People could experience this again due to transient network conditions, or perhaps DB slowness.
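To make the shape of the bottleneck concrete, here is a minimal sketch of the pattern summarized above (this is not the actual Concourse code; the type names, fields, and the advisory-lock query are simplified assumptions):

```go
package lock

import (
	"database/sql"
	"sync"
)

// lockFactory mirrors the shape described above: a factory-wide mutex is held
// across the database round-trip, so one slow query stalls every other caller.
type lockFactory struct {
	acquireMutex sync.Mutex // serializes every Acquire/Release call
	db           *lockDB    // dedicated single-connection pool with its own mutex
}

type lockDB struct {
	mu   sync.Mutex
	conn *sql.DB // pool limited to one connection
}

// Acquire tries to take an advisory lock identified by id.
func (f *lockFactory) Acquire(id int) (bool, error) {
	f.acquireMutex.Lock()
	defer f.acquireMutex.Unlock() // every other goroutine queues here

	f.db.mu.Lock()
	defer f.db.mu.Unlock()

	// The query itself is cheap, but the network round-trip costs one RTT; at
	// 100 ms latency, both mutexes are held for at least 100 ms per call.
	var acquired bool
	err := f.db.conn.QueryRow("SELECT pg_try_advisory_lock($1)", id).Scan(&acquired)
	return acquired, err
}
```

Because the `acquireMutex` is held for the whole round-trip, N pending acquisitions take roughly N round-trip times to drain, which matches the compounding effect suspected in the original report.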
The overall locking design is not entirely clear to me, but it seems there might be some lower-hanging fruit available by:

- Dropping the `acquireMutex`; in that case, maybe the `lockDB` mutex is also not needed?
- Calling `runtime.SetMutexProfileFraction` at some early point in the ATC’s life, so that pprof can generate mutex reports directly, rather than us having to observe lock contention by indirect means.

I re-tested this by removing the resource and `get` step from the sample pipeline above, but the initialization still takes a long time.
Also, I found a much easier way to simulate network latency on the web -> db path in docker-compose rather than spinning up a remote database. 😂 After identifying which `vethXXXXX` network interface on the host corresponds to the web container, run a couple of commands on the host machine.
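Something along these lines with `tc`/netem adds and later removes the artificial delay (the interface name `vethXXXXX` and the 100 ms value are placeholders):

```sh
# Add 100ms of artificial latency on the veth interface backing the web container.
sudo tc qdisc add dev vethXXXXX root netem delay 100ms

# ... trigger a build and watch the step initialization times ...

# Remove the artificial delay again.
sudo tc qdisc del dev vethXXXXX root netem
```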
The workers are always co-located with the ATC, for example as different containers on the same network for Docker testing, or on different VMs within the same availability zone when testing with the binary. I have not tried testing with the ATC and worker on the same VM, though.