aardvark-dns: DNS requests timeout

I am using Podman as a Docker replacement on our gitlab-runner host. The runner has a concurrency limit of 40 containers, and when I start my tests, I get DNS resolution errors.

Testing environment:

  • 16 vCPUs
  • 24GB memory
  • CentOS 9 stream
  • podman 4.6.1-5.
  • slirp4netns: slirp4netns-1.2.2-1.el9.x86_64
  • aardvark-dns: aardvark-dns-1.7.0-1.el9

While running tests, I get sporadic DNS resolution failures inside containers (actual host replaced with host.example.tld):

Example 1:
Cloning into 'spec/fixtures/modules/yumrepo_core'...
ssh: Could not resolve hostname host.example.tld: Temporary failure in name resolution
fatal: Could not read from remote repository.
Please make sure you have the correct access rights
and the repository exists.

Example 2:
$ bundle install -j $(nproc)
Fetching gem metadata from https://host.example.tld/nexus/repository/GroupRubyGems/..
Fetching gem metadata from https://host.example.tld/nexus/repository/GroupRubyGems/..
Could not find gem 'beaker (~> 5)' in any of the gem sources listed in your

Example 3:
Initialized empty Git repository in /builds/puppet/freeradius/.git/
Created fresh repository.
fatal: unable to access 'https://host.example.tld/puppet/freeradius.git/': Could not resolve host: host.example.tld
Cleaning up project directory and file based variables

This does not happen in every container; it is sporadic and random. If I switch back to the CNI backend, everything works without errors.

I tried running up to 8 containers and flooding the DNS server with lookups, but I could not trigger a resolution error. I will try to ramp that up to 30-40 containers and see if I can reproduce the failure.
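For anyone wanting to run a similar flood test, a minimal sketch might look like the following. The hostnames, repetition count, and worker count are assumptions, not from this issue; run it inside a container on the affected network and point it at hosts that are normally resolvable there.

```python
#!/usr/bin/env python3
# Hypothetical DNS stress test: issue many concurrent lookups through the
# container's resolver to try to reproduce the sporadic failures.
import socket
from concurrent.futures import ThreadPoolExecutor

def resolve(name):
    """Return (name, error); error is None on success."""
    try:
        socket.getaddrinfo(name, None)
        return name, None
    except socket.gaierror as exc:
        return name, exc

def stress(names, workers=40):
    """Resolve all names concurrently; return the list of failures."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        results = pool.map(resolve, names)
    return [(n, e) for n, e in results if e is not None]

if __name__ == "__main__":
    # Placeholder target: replace with hosts reachable from the container.
    targets = ["localhost"] * 200
    failures = stress(targets)
    print(f"{len(failures)} of {len(targets)} lookups failed")
```

Running several copies of this in parallel containers approximates the 30-40 container load described above.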

If anyone has an idea how to debug this, I will gladly look into it as far as my knowledge allows.

About this issue

  • State: closed
  • Created 9 months ago
  • Reactions: 5
  • Comments: 24 (10 by maintainers)

Most upvoted comments

We have been hit by this issue as well, using Podman with GitLab CI containers. It seems that even when using the FF_NETWORK_PER_BUILD flag in GitLab CI to give each build its own network, a single aardvark-dns process ends up serving all of them and experiences this issue.
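For reference, that GitLab Runner feature flag is enabled through a CI variable; a minimal fragment (job name and image are placeholders) might look like:

```yaml
# .gitlab-ci.yml — request a separate network per build.
# Note: per the comment above, a single aardvark-dns process may still
# serve every per-build network, so this flag alone does not avoid the issue.
variables:
  FF_NETWORK_PER_BUILD: "true"

test-job:
  image: alpine:latest
  script:
    - nslookup host.example.tld
```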

Hi,

I am experiencing the same symptoms described by @matejzero under similar conditions: I use one GitLab Runner with the Docker executor (configured to use a rootless, unprivileged Podman socket). I see many DNS resolution failures; an estimated half of all CI jobs fail due to this issue.

Troubleshooting steps:

  • First, I tried to mitigate this by running a local caching DNS resolver on the host, which did not help.
  • By running a special GitLab CI job that performs hundreds of unique DNS lookups, I could confirm that whenever a lookup fails, the caching resolver on the host never even receives the query.
  • Then I assumed that packet loss due to high network load between the host and the containers could be at fault, but even when I limited the host's incoming network bandwidth, the issue did not improve.

This pretty much leaves only the container network stack as the potential cause.
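A sketch of the unique-lookup check described above: each query carries a never-before-seen label, so it cannot be answered from any cache and must appear in the host resolver's query log if it ever leaves the container. BASE_DOMAIN is a placeholder assumption; replace it with a zone whose resolver logs you can inspect.

```python
#!/usr/bin/env python3
# Sketch: issue uniquely named DNS queries and log them with timestamps,
# so failures can be correlated against the host resolver's query log.
import socket
import time
import uuid

BASE_DOMAIN = "dnstest.example.tld"  # placeholder: use a zone you control

def unique_lookups(count):
    """Issue `count` unique queries; yield (timestamp, name, ok) tuples."""
    for _ in range(count):
        name = f"{uuid.uuid4().hex[:12]}.{BASE_DOMAIN}"
        try:
            socket.getaddrinfo(name, None)
            ok = True
        except socket.gaierror:
            ok = False
        yield time.strftime("%H:%M:%S"), name, ok

if __name__ == "__main__":
    for ts, name, ok in unique_lookups(300):
        # A FAIL here with no matching entry in the resolver's log means
        # the query never made it out of the container network stack.
        print(ts, name, "OK" if ok else "FAIL")
```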

When GitLab Runner jobs are started and stopped, I can see bursts of the following log lines in journald:

Feb 19 08:29:18 myhostname aardvark-dns[11348]: Received SIGHUP will refresh servers: 1
Feb 19 08:29:19 myhostname aardvark-dns[11348]: Received SIGHUP will refresh servers: 1
Feb 19 08:29:19 myhostname aardvark-dns[11348]: Received SIGHUP will refresh servers: 1
Feb 19 08:29:19 myhostname aardvark-dns[11348]: Received SIGHUP will refresh servers: 1
Feb 19 08:29:19 myhostname aardvark-dns[11348]: Received SIGHUP will refresh servers: 1
Feb 19 08:29:19 myhostname aardvark-dns[11348]: Received SIGHUP will refresh servers: 1
Feb 19 08:29:19 myhostname aardvark-dns[11348]: Received SIGHUP will refresh servers: 1
Feb 19 08:29:19 myhostname aardvark-dns[11348]: Received SIGHUP will refresh servers: 1
Feb 19 08:29:20 myhostname aardvark-dns[11348]: Received SIGHUP will refresh servers: 1
Feb 19 08:29:21 myhostname aardvark-dns[11348]: Received SIGHUP will refresh servers: 1
Feb 19 08:29:21 myhostname aardvark-dns[11348]: Received SIGHUP will refresh servers: 1
Feb 19 08:29:21 myhostname aardvark-dns[11348]: Received SIGHUP will refresh servers: 1
Feb 19 08:29:21 myhostname aardvark-dns[11348]: Received SIGHUP will refresh servers: 1
Feb 19 08:29:22 myhostname aardvark-dns[11348]: Received SIGHUP will refresh servers: 1
Feb 19 08:29:22 myhostname aardvark-dns[11348]: Received SIGHUP will refresh servers: 1
Feb 19 08:29:25 myhostname aardvark-dns[11348]: Received SIGHUP will refresh servers: 1
Feb 19 08:29:25 myhostname aardvark-dns[11348]: Received SIGHUP will refresh servers: 1

I can reproduce these messages using a custom test job; however, the presence of these messages alone is not sufficient to cause a DNS resolution failure.