moby: Restarting dockerd breaks the embedded DNS server

Description

Ubuntu Unattended-upgrades upgraded docker-ce and friends. As part of the upgrade, dockerd was restarted. Subsequently, running containers could not resolve the name of peers on the same host (Docker bridge networking in use).

Log has many pairs of these two messages

Resolver Start failed for container .... setting up IP table rules failed:  (iptables failed: iptables --wait -t nat -N DOCKER_OUTPUT: iptables: Chain already exists
failed to populate fields for osl sandbox

Reproduce

Create a clean VM, updata all packages, install docker, add ubuntu user to the docker group then reboot

multipass -c 2 -m 4G -d 20G jam jammy
multipass shell jam
sudo apt update
sudo apt upgrade -y
sudo apt install ca-certificates curl gnupg
sudo install -m 0755 -d /etc/apt/keyrings
curl -fsSL https://download.docker.com/linux/ubuntu/gpg | sudo gpg --dearmor -o /etc/apt/keyrings/docker.gpg
sudo chmod a+r /etc/apt/keyrings/docker.gpg
echo   "deb [arch="$(dpkg --print-architecture)" signed-by=/etc/apt/keyrings/docker.gpg] https://download.docker.com/linux/ubuntu \
"$(. /etc/os-release && echo "$VERSION_CODENAME")" stable" |   sudo tee /etc/apt/sources.list.d/docker.list
sudo apt install docker-ce docker-ce-cli containerd.io docker-buildx-plugin docker-compose-plugin
sudo nano /etc/docker/daemon.json
sudo usermod -a -G docker ubuntu
sudo reboot

daemon.json contents, note “live-restore”: true

{
    "icc": false,
    "log-driver": "syslog",
    "log-opts": {
        "syslog-facility": "local1",
        "tag": "{{.Name}}"
    },
    "live-restore": true,
    "userland-proxy": false,
    "userns-remap": "default",
    "default-address-pools":
        [
            {"base":"192.168.255.0/24","size":29}
        ],
    "bip": "192.168.255.253/30"
}

Following the reboot, start two containers using Docker compose.

docker pull bash:latest
nano dc.yml
docker compose -f dc.yml -p t up -d

Contents of dc.yml

services:
cona:
    entrypoint: ["/bin/sleep","1d"]
    hostname: hosta
    image: bash:latest
    restart: unless-stopped
    networks:
    - neta
    - netb
    security_opt:
    - no-new-privileges:true
conb:
    entrypoint: ["/bin/sleep","1d"]
    hostname: hostb
    image: bash:latest
    restart: unless-stopped
    networks:
    - neta
    - netb
    security_opt:
    - no-new-privileges:true
networks:
neta:
netb:

Prove that the embedded DNS server works

ubuntu@jam:~$ docker exec t-cona-1 ping conb
PING conb (192.168.255.26): 56 data bytes
64 bytes from 192.168.255.26: seq=0 ttl=64 time=0.224 ms
64 bytes from 192.168.255.26: seq=1 ttl=64 time=0.204 ms
^C
ubuntu@jam:~$ docker exec t-conb-1 ping cona
PING cona (192.168.255.27): 56 data bytes
64 bytes from 192.168.255.27: seq=0 ttl=64 time=0.118 ms
64 bytes from 192.168.255.27: seq=1 ttl=64 time=0.159 ms
^C
ubuntu@jam:~$

Restart dockerd and observe the logs

ubuntu@jam:~$ sudo systemctl restart docker
ubuntu@jam:~$ sudo systemctl status docker
● docker.service - Docker Application Container Engine
    Loaded: loaded (/lib/systemd/system/docker.service; enabled; vendor preset: enabled)
    Active: active (running) since Tue 2023-05-30 09:18:16 AEST; 5s ago
TriggeredBy: ● docker.socket
    Docs: https://docs.docker.com
Main PID: 1479 (dockerd)
    Tasks: 9
    Memory: 26.3M
        CPU: 267ms
    CGroup: /system.slice/docker.service
            └─1479 /usr/bin/dockerd -H fd:// --containerd=/run/containerd/containerd.sock

May 30 09:18:16 jam dockerd[1479]: time="2023-05-30T09:18:16.369133373+10:00" level=error msg="Resolver Start failed for container b5ceaf056e1a2603abde826490cf3ac2d18d8af6095c64a6716e5e7b544bae86, \"setting up I>
May 30 09:18:16 jam dockerd[1479]: time="2023-05-30T09:18:16.369164615+10:00" level=error msg="failed to populate fields for osl sandbox 09d0460ebab447eb650884ff0e4371b86125185b8776db0ef8ba4903e915258a"
May 30 09:18:16 jam dockerd[1479]: time="2023-05-30T09:18:16.371111533+10:00" level=error msg="Resolver Start failed for container 11881e3c01dfb4f8244342ec0cc241d7c18250d1d9956d69725a9b4e1d868f0d, \"setting up I>
May 30 09:18:16 jam dockerd[1479]: time="2023-05-30T09:18:16.371129838+10:00" level=error msg="failed to populate fields for osl sandbox 0ef5358411a1bbe4d96d4d318b3a31243109436e47a1c6324ec54322be9a63da"
May 30 09:18:16 jam dockerd[1479]: time="2023-05-30T09:18:16.371269276+10:00" level=info msg="there are running containers, updated network configuration will not take affect"
May 30 09:18:16 jam dockerd[1479]: time="2023-05-30T09:18:16.371445983+10:00" level=info msg="Loading containers: done."
May 30 09:18:16 jam dockerd[1479]: time="2023-05-30T09:18:16.385157565+10:00" level=info msg="Docker daemon" commit=659604f graphdriver=overlay2 version=24.0.2
May 30 09:18:16 jam dockerd[1479]: time="2023-05-30T09:18:16.385214548+10:00" level=info msg="Daemon has completed initialization"
May 30 09:18:16 jam dockerd[1479]: time="2023-05-30T09:18:16.402196640+10:00" level=info msg="API listen on /run/docker.sock"
May 30 09:18:16 jam systemd[1]: Started Docker Application Container Engine.
ubuntu@jam:~$

Observe that the embedded DNS server is broken

ubuntu@jam:~$ docker exec t-cona-1 ping conb
ping: bad address 'conb'
ubuntu@jam:~$ docker exec t-conb-1 ping cona
ping: bad address 'cona'
ubuntu@jam:~$

Expected behavior

With live-restore=true in play, the embedded DNS server should have operated correctly after the docker daemon restart

docker version

Client: Docker Engine - Community
 Version:           24.0.2
 API version:       1.43
 Go version:        go1.20.4
 Git commit:        cb74dfc
 Built:             Thu May 25 21:51:00 2023
 OS/Arch:           linux/amd64
 Context:           default

Server: Docker Engine - Community
 Engine:
  Version:          24.0.2
  API version:      1.43 (minimum version 1.12)
  Go version:       go1.20.4
  Git commit:       659604f
  Built:            Thu May 25 21:51:00 2023
  OS/Arch:          linux/amd64
  Experimental:     false
 containerd:
  Version:          1.6.21
  GitCommit:        3dce8eb055cbb6872793272b4f20ed16117344f8
 runc:
  Version:          1.1.7
  GitCommit:        v1.1.7-0-g860f061
 docker-init:
  Version:          0.19.0
  GitCommit:        de40ad0

docker info

Client: Docker Engine - Community
 Version:    24.0.2
 Context:    default
 Debug Mode: false
 Plugins:
  buildx: Docker Buildx (Docker Inc.)
    Version:  v0.10.5
    Path:     /usr/libexec/docker/cli-plugins/docker-buildx
  compose: Docker Compose (Docker Inc.)
    Version:  v2.18.1
    Path:     /usr/libexec/docker/cli-plugins/docker-compose

Server:
 Containers: 2
  Running: 2
  Paused: 0
  Stopped: 0
 Images: 1
 Server Version: 24.0.2
 Storage Driver: overlay2
  Backing Filesystem: extfs
  Supports d_type: true
  Using metacopy: false
  Native Overlay Diff: true
  userxattr: false
 Logging Driver: syslog
 Cgroup Driver: systemd
 Cgroup Version: 2
 Plugins:
  Volume: local
  Network: bridge host ipvlan macvlan null overlay
  Log: awslogs fluentd gcplogs gelf journald json-file local logentries splunk syslog
 Swarm: inactive
 Runtimes: io.containerd.runc.v2 runc
 Default Runtime: runc
 Init Binary: docker-init
 containerd version: 3dce8eb055cbb6872793272b4f20ed16117344f8
 runc version: v1.1.7-0-g860f061
 init version: de40ad0
 Security Options:
  apparmor
  seccomp
   Profile: builtin
  userns
  cgroupns
 Kernel Version: 5.15.0-72-generic
 Operating System: Ubuntu 22.04.2 LTS
 OSType: linux
 Architecture: x86_64
 CPUs: 2
 Total Memory: 3.828GiB
 Name: jam
 ID: 68a7b3d7-8a6d-4204-921e-ef786a0e7ac3
 Docker Root Dir: /var/lib/docker/165536.165536
 Debug Mode: false
 Experimental: false
 Insecure Registries:
  corunastylis:5000
  127.0.0.0/8
 Live Restore Enabled: true
 Default Address Pools:
   Base: 192.168.255.0/24, Size: 29

Additional Info

We have placed an APT hold on docker-ce docker-ce-cli docker-ce-rootless-extras docker-compose-plugin docker-buildx-plugin until this bug is resolved.

About this issue

  • Original URL
  • State: closed
  • Created a year ago
  • Comments: 15 (9 by maintainers)

Most upvoted comments

The fixes have been merged to the 24.0 branch so will be included in the next patch, v24.0.3

I was able to reproduce the Resolver Start failed for container logs on an Ubuntu Jammy multipass instance. The detailed reproduction steps were invaluable; big thank you to @bruceesmith! It was in fact directly related to a second root cause. I have confirmed that the combination of #45654 and #45657 fixes DNS on live-restored containers on Ubuntu Jammy.

Anyway, the root cause is that https://github.com/moby/moby/commit/f0be4d126dacb19a35046536b041a5989ff0dc95 accidentally introduced a functional change which broke sandbox restore.

🤦 Right… had to be a literal + in between 🙈

Ifaces[i.srcName+"+"+i.dstPrefix] = ifaceOptions