moby: Panic in libnetwork during daemon start (panic: runtime error: index out of range [0] with length 0)
Description
_From https://forums.docker.com/t/dockerd-crashing-in-libnetwork-on-launch/134710_
I’ve recently started seeing a weird dockerd crash in libnetwork on start. I tried downgrading, but that didn’t help. I also tried uninstalling the Nvidia container tools. The only solution that fixed the crash was completely wiping away /var/lib/docker. The only container I am running on this daemon is Jellyfin. However, the problem returns on system reboot.
Reproduce
$ sudo dockerd
INFO[2023-02-19T17:29:03.455357713-05:00] Starting up
INFO[2023-02-19T17:29:03.456995192-05:00] [core] [Channel #1] Channel created module=grpc
INFO[2023-02-19T17:29:03.457036179-05:00] [core] [Channel #1] original dial target is: "unix:///run/containerd/containerd.sock" module=grpc
INFO[2023-02-19T17:29:03.457075584-05:00] [core] [Channel #1] parsed dial target is: {Scheme:unix Authority: Endpoint:run/containerd/containerd.sock URL:{Scheme:unix Opaque: User: Host: Path:/run/containerd/containerd.sock RawPath: OmitHost:false ForceQuery:false RawQuery: Fragment: RawFragment:}} module=grpc
INFO[2023-02-19T17:29:03.457099058-05:00] [core] [Channel #1] Channel authority set to "localhost" module=grpc
INFO[2023-02-19T17:29:03.457237128-05:00] [core] [Channel #1] Resolver state updated: {
"Addresses": [
{
"Addr": "/run/containerd/containerd.sock",
"ServerName": "",
"Attributes": {},
"BalancerAttributes": null,
"Type": 0,
"Metadata": null
}
],
"ServiceConfig": null,
"Attributes": null
} (resolver returned new addresses) module=grpc
INFO[2023-02-19T17:29:03.457335134-05:00] [core] [Channel #1] Channel switches to new LB policy "pick_first" module=grpc
INFO[2023-02-19T17:29:03.457397451-05:00] [core] [Channel #1 SubChannel #2] Subchannel created module=grpc
INFO[2023-02-19T17:29:03.457475959-05:00] [core] [Channel #1 SubChannel #2] Subchannel Connectivity change to CONNECTING module=grpc
INFO[2023-02-19T17:29:03.457505585-05:00] [core] [Channel #1 SubChannel #2] Subchannel picks a new address "/run/containerd/containerd.sock" to connect module=grpc
INFO[2023-02-19T17:29:03.457674493-05:00] [core] [Channel #1] Channel Connectivity change to CONNECTING module=grpc
INFO[2023-02-19T17:29:03.457919766-05:00] [core] [Channel #1 SubChannel #2] Subchannel Connectivity change to READY module=grpc
INFO[2023-02-19T17:29:03.457944513-05:00] [core] [Channel #1] Channel Connectivity change to READY module=grpc
INFO[2023-02-19T17:29:03.458479222-05:00] [core] [Channel #4] Channel created module=grpc
INFO[2023-02-19T17:29:03.458491565-05:00] [core] [Channel #4] original dial target is: "unix:///run/containerd/containerd.sock" module=grpc
INFO[2023-02-19T17:29:03.458502737-05:00] [core] [Channel #4] parsed dial target is: {Scheme:unix Authority: Endpoint:run/containerd/containerd.sock URL:{Scheme:unix Opaque: User: Host: Path:/run/containerd/containerd.sock RawPath: OmitHost:false ForceQuery:false RawQuery: Fragment: RawFragment:}} module=grpc
INFO[2023-02-19T17:29:03.458510812-05:00] [core] [Channel #4] Channel authority set to "localhost" module=grpc
INFO[2023-02-19T17:29:03.458532703-05:00] [core] [Channel #4] Resolver state updated: {
"Addresses": [
{
"Addr": "/run/containerd/containerd.sock",
"ServerName": "",
"Attributes": {},
"BalancerAttributes": null,
"Type": 0,
"Metadata": null
}
],
"ServiceConfig": null,
"Attributes": null
} (resolver returned new addresses) module=grpc
INFO[2023-02-19T17:29:03.458550045-05:00] [core] [Channel #4] Channel switches to new LB policy "pick_first" module=grpc
INFO[2023-02-19T17:29:03.458564683-05:00] [core] [Channel #4 SubChannel #5] Subchannel created module=grpc
INFO[2023-02-19T17:29:03.458578569-05:00] [core] [Channel #4 SubChannel #5] Subchannel Connectivity change to CONNECTING module=grpc
INFO[2023-02-19T17:29:03.458598016-05:00] [core] [Channel #4 SubChannel #5] Subchannel picks a new address "/run/containerd/containerd.sock" to connect module=grpc
INFO[2023-02-19T17:29:03.458637110-05:00] [core] [Channel #4] Channel Connectivity change to CONNECTING module=grpc
INFO[2023-02-19T17:29:03.458767345-05:00] [core] [Channel #4 SubChannel #5] Subchannel Connectivity change to READY module=grpc
INFO[2023-02-19T17:29:03.458800679-05:00] [core] [Channel #4] Channel Connectivity change to READY module=grpc
INFO[2023-02-19T17:29:03.474131360-05:00] [graphdriver] using prior storage driver: overlay2
INFO[2023-02-19T17:29:03.479738078-05:00] Loading containers: start.
INFO[2023-02-19T17:29:03.672352245-05:00] failed to read ipv6 net.ipv6.conf.<bridge>.accept_ra bridge=docker0 syspath=/proc/sys/net/ipv6/conf/docker0/accept_ra
panic: runtime error: index out of range [0] with length 0
goroutine 1 [running]:
github.com/docker/docker/libnetwork.(*controller).reservePools(0x5626885d0300?)
/go/src/github.com/docker/docker/libnetwork/controller.go:862 +0xb65
github.com/docker/docker/libnetwork.New({0xc000628230, 0x8, 0xe})
/go/src/github.com/docker/docker/libnetwork/controller.go:233 +0x6bc
github.com/docker/docker/daemon.(*Daemon).initNetworkController(0xc000b3c000, 0xc000ccd590)
/go/src/github.com/docker/docker/daemon/daemon_unix.go:851 +0x4e
github.com/docker/docker/daemon.(*Daemon).restore(0xc000b3c000)
/go/src/github.com/docker/docker/daemon/daemon.go:478 +0x51d
github.com/docker/docker/daemon.NewDaemon({0x5626885e4b38?, 0xc000ea6640}, 0xc0008f5b80, 0xc0005d2090)
/go/src/github.com/docker/docker/daemon/daemon.go:1085 +0x2b79
main.(*DaemonCli).start(0xc000cb8420, 0xc000b39490)
/go/src/github.com/docker/docker/cmd/dockerd/daemon.go:200 +0x9f6
main.runDaemon(...)
/go/src/github.com/docker/docker/cmd/dockerd/docker_unix.go:14
main.newDaemonCommand.func1(0xc000960c00?, {0x562689965730?, 0x0?, 0x0?})
/go/src/github.com/docker/docker/cmd/dockerd/docker.go:38 +0x5e
github.com/docker/docker/vendor/github.com/spf13/cobra.(*Command).execute(0xc000960c00, {0xc000052240, 0x0, 0x0})
/go/src/github.com/docker/docker/vendor/github.com/spf13/cobra/command.go:916 +0x862
github.com/docker/docker/vendor/github.com/spf13/cobra.(*Command).ExecuteC(0xc000960c00)
/go/src/github.com/docker/docker/vendor/github.com/spf13/cobra/command.go:1044 +0x3bd
github.com/docker/docker/vendor/github.com/spf13/cobra.(*Command).Execute(...)
/go/src/github.com/docker/docker/vendor/github.com/spf13/cobra/command.go:968
main.main()
/go/src/github.com/docker/docker/cmd/dockerd/docker.go:102 +0x15d
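For context, a panic of the form `index out of range [0] with length 0` means the code at `controller.go:862` indexed the first element of an empty slice — here, presumably pool/IPAM state restored from the datastore that was expected to be non-empty. The exact expression in `reservePools` isn't shown in this report, so the following Go sketch is only a hypothetical illustration of the failure mode and of the kind of length guard that avoids it (all names are made up):

```go
// Hypothetical illustration of the failure mode; the real code lives in
// libnetwork's (*controller).reservePools. Indexing element 0 of an empty
// slice panics exactly like the trace above.
package main

import "fmt"

type ipamInfo struct{ Pool string }

func main() {
	// Stand-in for stale network state: IPv6 enabled, but no IPAM info
	// was persisted for the network.
	var v6Info []*ipamInfo

	// Unguarded access, as the panic suggests:
	//   pool := v6Info[0].Pool // panic: index out of range [0] with length 0

	// Defensive form:
	if len(v6Info) == 0 {
		fmt.Println("skipping pool reservation: no IPAM info recorded")
		return
	}
	fmt.Println(v6Info[0].Pool)
}
```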
Expected behavior
No response
docker version
Client: Docker Engine - Community
Version: 23.0.1
API version: 1.42
Go version: go1.19.5
Git commit: a5ee5b1
Built: Thu Feb 9 19:47:01 2023
OS/Arch: linux/amd64
Context: default
Server: Docker Engine - Community
Engine:
Version: 23.0.1
API version: 1.42 (minimum version 1.12)
Go version: go1.19.5
Git commit: bc3805a
Built: Thu Feb 9 19:47:01 2023
OS/Arch: linux/amd64
Experimental: false
containerd:
Version: 1.6.18
GitCommit: 2456e983eb9e37e47538f59ea18f2043c9a73640
runc:
Version: 1.1.4
GitCommit: v1.1.4-0-g5fd4c4d
docker-init:
Version: 0.19.0
GitCommit: de40ad0
docker info
$ docker info
Client:
Context: default
Debug Mode: false
Plugins:
buildx: Docker Buildx (Docker Inc.)
Version: v0.10.2
Path: /usr/libexec/docker/cli-plugins/docker-buildx
compose: Docker Compose (Docker Inc.)
Version: v2.16.0
Path: /usr/libexec/docker/cli-plugins/docker-compose
scan: Docker Scan (Docker Inc.)
Version: v0.23.0
Path: /usr/libexec/docker/cli-plugins/docker-scan
Server:
Containers: 1
Running: 1
Paused: 0
Stopped: 0
Images: 1
Server Version: 23.0.1
Storage Driver: overlay2
Backing Filesystem: extfs
Supports d_type: true
Using metacopy: false
Native Overlay Diff: true
userxattr: false
Logging Driver: json-file
Cgroup Driver: systemd
Cgroup Version: 2
Plugins:
Volume: local
Network: bridge host ipvlan macvlan null overlay
Log: awslogs fluentd gcplogs gelf journald json-file local logentries splunk syslog
Swarm: inactive
Runtimes: nvidia runc io.containerd.runc.v2
Default Runtime: runc
Init Binary: docker-init
containerd version: 2456e983eb9e37e47538f59ea18f2043c9a73640
runc version: v1.1.4-0-g5fd4c4d
init version: de40ad0
Security Options:
apparmor
seccomp
Profile: builtin
cgroupns
Kernel Version: 5.15.0-60-generic
Operating System: Ubuntu 22.04.2 LTS
OSType: linux
Architecture: x86_64
CPUs: 10
Total Memory: 11.41GiB
Name: carthage.lan
ID: ZBCI:OVG2:MZ4Z:YKDJ:Z3T4:EG2G:NUWN:OQ2J:AIZW:62EC:IVJL:BZGJ
Docker Root Dir: /var/lib/docker
Debug Mode: false
Registry: https://index.docker.io/v1/
Experimental: false
Insecure Registries:
127.0.0.0/8
Live Restore Enabled: false
Additional Info
No response
About this issue
- State: closed
- Created a year ago
- Reactions: 1
- Comments: 20 (12 by maintainers)
Yes, that’s as good as we can get without modifying `docker-compose` or `dockerd`. There’s no sign of the “failed to remove network” type error I was hoping for, which disproves my hypothesis. But I think I know what’s actually going on.

It looks like `ross-river` is doing everything twice, concurrently. In particular, it appears to be invoking `docker-compose down` twice. The logs are consistent with two instances of `docker-compose down` racing to tear down the stack.

Racing to delete the `sdp_backend` network:
The “Calling DELETE…” debug logs from `dockerd` mark the API request being received, and the `urllib3.connectionpool._make_request` debug logs from `docker-compose` (via `ross-river`) mark the API response being received. Note the HTTP status code.

It looks like (the `docker-compose` process whose logs are being relayed through) thread 13920 wins the race to delete the network. While the daemon is working on deleting the network, 13921 also tries to delete it and gets a 404 response. A short time later, the network delete which 13920 requested completes, receiving an HTTP 204 response.
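To make the race concrete, here is a hypothetical Go sketch using the `github.com/docker/docker/client` package (the network name and the two-client setup are assumptions, standing in for the two `docker-compose` processes): both issue `DELETE /networks/{id}` concurrently, the winner gets a nil error (HTTP 204), and the loser gets a not-found error (HTTP 404).

```go
// Two concurrent deletes of the same network: one succeeds, one sees 404.
package main

import (
	"context"
	"fmt"
	"sync"

	"github.com/docker/docker/client"
	"github.com/docker/docker/errdefs"
)

func main() {
	cli, err := client.NewClientWithOpts(client.FromEnv, client.WithAPIVersionNegotiation())
	if err != nil {
		panic(err)
	}
	defer cli.Close()

	var wg sync.WaitGroup
	for i := 0; i < 2; i++ {
		wg.Add(1)
		go func(id int) {
			defer wg.Done()
			err := cli.NetworkRemove(context.Background(), "sdp_backend") // assumed name
			switch {
			case err == nil:
				fmt.Printf("thread %d: delete succeeded (204)\n", id)
			case errdefs.IsNotFound(err):
				fmt.Printf("thread %d: network already gone (404)\n", id)
			default:
				fmt.Printf("thread %d: unexpected error: %v\n", id, err)
			}
		}(i)
	}
	wg.Wait()
}
```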
The logs around deleting the `sdp_servicebus` network might be the smoking gun. Unlike the `sdp_backend` deletion, there is no HTTP 204 response for deleting `sdp_servicebus`. There is, however, this log from the same thread ID as the race winner:

What I think is happening is that there are two actors concurrently racing to do the equivalent of `docker-compose down; systemctl stop docker.service`, which results in the daemon being shut down while one of the networks is still in the process of being deleted. The daemon is not shutting down as gracefully as intended: it exits before all in-flight API requests complete, killing the network delete operation before it has a chance to delete the network object from the datastore. If true, there are two bugs in the daemon:

To work around these issues, stop racing multiple `docker-compose down` instances, or at least ensure that all instances run to completion before shutting down the daemon.
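For illustration, “shutting down gracefully” here means what `net/http`’s `Server.Shutdown` provides: stop accepting new connections, but let in-flight requests — such as a slow network delete — run to completion before the process exits. A minimal sketch follows (this is not dockerd’s actual shutdown path; the endpoint, address, and timeout are assumptions):

```go
// Graceful-shutdown sketch: Shutdown drains in-flight handlers instead of
// killing them mid-request.
package main

import (
	"context"
	"fmt"
	"net/http"
	"os"
	"os/signal"
	"syscall"
	"time"
)

func main() {
	mux := http.NewServeMux()
	mux.HandleFunc("/networks/", func(w http.ResponseWriter, r *http.Request) {
		time.Sleep(2 * time.Second) // stand-in for a slow network delete
		w.WriteHeader(http.StatusNoContent)
	})

	srv := &http.Server{Addr: "127.0.0.1:8080", Handler: mux}
	go func() {
		if err := srv.ListenAndServe(); err != http.ErrServerClosed {
			fmt.Println("serve error:", err)
		}
	}()

	stop := make(chan os.Signal, 1)
	signal.Notify(stop, syscall.SIGTERM, os.Interrupt)
	<-stop

	// Shutdown blocks until in-flight handlers return (or the context
	// expires); exiting without this is what kills the pending delete.
	ctx, cancel := context.WithTimeout(context.Background(), 10*time.Second)
	defer cancel()
	if err := srv.Shutdown(ctx); err != nil {
		fmt.Println("forced shutdown, requests may have been dropped:", err)
	}
}
```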