aardvark-dns: DNS requests timeout
I am using podman as a docker replacement on our gitlab-runner host. I have a concurrency limit of 40 containers, and when I start my tests, I get DNS resolution errors.
Testing environment:
- 16 vCPUs
- 24GB memory
- CentOS 9 stream
- podman 4.6.1-5.
- slirp4netns: slirp4netns-1.2.2-1.el9.x86_64
- aardvark-dns: aardvark-dns-1.7.0-1.el9
While running tests, I get random DNS resolution failures inside containers (actual host replaced with host.example.tld):
Example 1:
Cloning into 'spec/fixtures/modules/yumrepo_core'...
ssh: Could not resolve hostname host.example.tld: Temporary failure in name resolution
fatal: Could not read from remote repository.
Please make sure you have the correct access rights
and the repository exists.
Example 2:
$ bundle install -j $(nproc)
Fetching gem metadata from https://host.example.tld/nexus/repository/GroupRubyGems/..
Fetching gem metadata from https://host.example.tld/nexus/repository/GroupRubyGems/..
Could not find gem 'beaker (~> 5)' in any of the gem sources listed in your
Example 3:
Initialized empty Git repository in /builds/puppet/freeradius/.git/
Created fresh repository.
fatal: unable to access 'https://host.example.tld/puppet/freeradius.git/': Could not resolve host: host.example.tld
Cleaning up project directory and file based variables
This does not happen in every container; it is sporadic and random. If I switch back to the CNI backend, it works without errors.
I tried running up to 8 containers and flooding the DNS server with lookups, but I could not trigger a DNS resolution error. I will try to ramp that up to 30-40 containers and see if I can reproduce it.
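For anyone who wants to try the same kind of lookup flood, here is a minimal sketch in Python. It is not the exact test used above. It assumes Python 3 is available inside the container and that no local caching resolver sits in front of aardvark-dns; the function names `resolve_once` and `flood` and the default counts are made up for illustration.

```python
import socket
from concurrent.futures import ThreadPoolExecutor


def resolve_once(hostname, resolver=socket.getaddrinfo):
    """Return None if the lookup succeeds, or the error text if it fails."""
    try:
        resolver(hostname, None)
        return None
    except socket.gaierror as exc:
        return str(exc)


def flood(hostname, total=2000, workers=40, resolver=socket.getaddrinfo):
    """Run `total` lookups spread over `workers` threads.

    Returns the list of error messages from the lookups that failed.
    The defaults are arbitrary illustration values, not tuned numbers.
    """
    with ThreadPoolExecutor(max_workers=workers) as pool:
        results = pool.map(lambda _: resolve_once(hostname, resolver), range(total))
        return [err for err in results if err is not None]


# Example call (performs real DNS queries, so it is left commented out):
# failures = flood("host.example.tld")
# print(len(failures), "failed lookups;", sorted(set(failures)))
```

Running this in several containers at once should put load on the shared resolver. The `resolver` parameter exists only so the logic can be exercised without network access.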
If anyone has an idea how to debug this, I will gladly look into it as far as my knowledge allows.
About this issue
- State: closed
- Created 9 months ago
- Reactions: 5
- Comments: 24 (10 by maintainers)
We have been hit by this issue as well using Podman with GitLab CI containers. It seems that even when using the FF_NETWORK_PER_BUILD flag in GitLab CI to use a separate network between containers, a single aardvark-dns process ends up being used and experiences this issue.
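One way to check how many aardvark-dns processes are actually running on the runner host is to scan `/proc`. The following is a rough sketch, assuming a Linux host and that the process name shows up as `aardvark-dns` in `/proc/<pid>/comm`; the helper name `find_processes` is hypothetical.

```python
from pathlib import Path


def find_processes(name, proc_root="/proc"):
    """Return the sorted PIDs whose /proc/<pid>/comm equals `name`."""
    pids = []
    for entry in Path(proc_root).iterdir():
        if not entry.name.isdigit():
            continue  # skip non-process entries such as "self" or "meminfo"
        try:
            comm = (entry / "comm").read_text().strip()
        except OSError:
            continue  # process exited meanwhile or is not readable
        if comm == name:
            pids.append(int(entry.name))
    return sorted(pids)


# Example call on the runner host:
# print(find_processes("aardvark-dns"))
```

If this returns a single PID while many job networks exist, that matches the observation that one process serves all of them.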
Hi,
I am experiencing the same symptoms described by @matejzero under similar conditions: I use one GitLab Runner with the Docker executor (which is configured to use a rootless and unprivileged Podman socket). I see many DNS resolution failures; an estimated half of all CI jobs fail because of this issue.
Troubleshooting steps:
This pretty much leaves only the container network stack as the potential cause.
When GitLab Runner jobs are started and stopped, I can see bursts of the following log lines in journald:
I can reproduce these messages using a custom test job; however, I found that the presence of one of these messages is not sufficient to cause a DNS resolution failure.