moby: docker containers experience packet loss or slow network (EAI_AGAIN in application)

I’ve had a similar issue to #12364 and/or #11407 where my production node.js apps inside the container simply stopped responding. However, I thought the circumstances were different enough to warrant a separate issue.

Once a container stops responding, it will not respond again until the docker daemon is restarted. Not all containers stop responding at the same time. I’m led to the conclusion that this is a docker issue rather than a node issue, because I have several different services running on the same server and all of them experience apparent heavy packet loss hours before the error below occurs, which eventually seems to crash node in turn. Restarting the docker daemon cleared up both the error and the packet loss. The exception caught in the app is interesting in that it is a rarely occurring EAI_AGAIN error (a temporary DNS failure, although not temporary in this case), which led me to believe it could be related to #12364 and/or #11407.

Errors I am seeing in the node app

2015-04-21T08:26:21.415Z - info: --> method:[GET] url:[/status] status:[200] time:[1ms]
events.js:85
      throw er; // Unhandled 'error' event
            ^
Error: getaddrinfo EAI_AGAIN
    at Object.exports._errnoException (util.js:742:11)
    at errnoException (dns.js:46:15)
    at Object.onlookup [as oncomplete] (dns.js:91:26)
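
For reference, a rough way to check whether the failure is confined to the container rather than the host (the container name web-app below is just a placeholder; if getent isn’t available in a minimal image, nslookup works as well):

# on the host: resolution should succeed
getent hosts registry.npmjs.org

# inside the suspect container: this is where EAI_AGAIN shows up
docker exec web-app getent hosts registry.npmjs.org

# compare the resolvers the container is actually using
docker exec web-app cat /etc/resolv.conf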

Details

[root@server1 root]# docker --version
Docker version 1.5.0, build a8a31ef/1.5.0

[root@server1 root]# uname -a
Linux server1 3.14.20-20.44.amzn1.x86_64 #1 SMP Mon Oct 6 22:52:46 UTC 2014 x86_64 x86_64 x86_64 GNU/Linux

Thank you

About this issue

  • Original URL
  • State: open
  • Created 9 years ago
  • Reactions: 12
  • Comments: 48 (9 by maintainers)

Most upvoted comments

Thank you guys for sharing. I had the same problem and it was solved using service docker restart.

+1 @HudsonAkridge, we have the same network issues when sending/receiving a lot of packets. We always hit it in unexpected places when running the tests.

Dear @aboch and @thaJeztah, please join the conversation if you can. This is a very bad bug, and we feel it every day.

All hosts have the latest stable Docker version, and all have the same issues.

These hosts are standalone sandbox servers used for testing, without swarm or any other connections.

We start the tests from the Windows host where that Docker instance runs, against the application inside Docker.

docker info

Containers: 12
 Running: 0
 Paused: 0
 Stopped: 12
Images: 1803
Server Version: 17.03.0-ce
Storage Driver: aufs
 Root Dir: /var/lib/docker/aufs
 Backing Filesystem: extfs
 Dirs: 1408
 Dirperm1 Supported: true
Logging Driver: json-file
Cgroup Driver: cgroupfs
Plugins:
 Volume: local
 Network: bridge host macvlan null overlay
Swarm: inactive
Runtimes: runc
Default Runtime: runc
Init Binary: docker-init
containerd version: 977c511eda0925a723debdc94d09459af49d082a
runc version: a01dafd48bc1c7cc12bdb01206f9fea7dd6feb70
init version: 949e6fa
Security Options:
 seccomp
  Profile: default
Kernel Version: 4.9.12-moby
Operating System: Alpine Linux v3.5
OSType: linux
Architecture: x86_64
CPUs: 2
Total Memory: 11.2 GiB
Name: moby
ID: 7RMP:T34G:KAEO:WHV5:3XDD:352G:63VB:S3OS:RZUU:U7US:ZI7S:ZJUS
Docker Root Dir: /var/lib/docker
Debug Mode (client): false
Debug Mode (server): false
Registry: https://index.docker.io/v1/
Experimental: false
Insecure Registries:
 127.0.0.0/8
Live Restore Enabled: false

docker version


Client:
 Version:      17.03.0-ce
 API version:  1.26
 Go version:   go1.7.5
 Git commit:   60ccb22
 Built:        Thu Feb 23 10:40:59 2017
 OS/Arch:      windows/amd64

Server:
 Version:      17.03.0-ce
 API version:  1.26 (minimum version 1.12)
 Go version:   go1.7.5
 Git commit:   3a232c8
 Built:        Tue Feb 28 07:52:04 2017
 OS/Arch:      linux/amd64
 Experimental: false

Not sure if this is a good idea, but I just noticed that setting the docker0 MAC address to the eth0 value, as described here, dramatically improves the networking. E.g., with eth0 showing Link encap:Ethernet HWaddr 00:15:b2:a9:8f:6e, I did:

# ip link set docker0 address 00:15:b2:a9:8f:6e
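
A generic version of the same workaround, reading whatever MAC eth0 currently has instead of hard-coding it (a sketch; adjust the interface names to your setup):

# read the MAC from eth0 and assign it to the docker0 bridge
MAC=$(cat /sys/class/net/eth0/address)
ip link set docker0 address "$MAC"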

My environment:

root@selenium-cloud-m-6:~# docker version
Client:
 Version:      17.05.0-ce
 API version:  1.29
 Go version:   go1.7.5
 Git commit:   89658be
 Built:        Thu May  4 22:06:06 2017
 OS/Arch:      linux/amd64

Server:
 Version:      17.05.0-ce
 API version:  1.29 (minimum version 1.12)
 Go version:   go1.7.5
 Git commit:   89658be
 Built:        Thu May  4 22:06:06 2017
 OS/Arch:      linux/amd64
 Experimental: false

In my case LRO was turned off. Hope it helps someone.
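
If you want to check the LRO state on your own host before trying anything, something along these lines should do it (output naming can vary slightly between drivers):

# show current offload settings for the physical interface
ethtool -k eth0 | grep -i large-receive-offload

# disable LRO if it turns out to be on
ethtool -K eth0 lro off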

We’re also experiencing a very similar issue across the bridge network. Several large requests go through quickly, then the next one or two requests hang for a long while trying to establish a connection to the other docker container. My suspicion is that something in docker then gets restarted or refreshed, we get 5 or 6 more large requests through quickly, and then another hang until the same restart point is reached.

Very cumbersome to deal with, and restarting the docker daemon is not an acceptable solution for our situation.
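
A rough way to make the hangs visible is to time the connect phase separately from the transfer using curl’s -w timers (service-b:3001 below is only a placeholder for whatever the other container exposes):

# run from inside the calling container; the connect time jumps during a hang
for i in $(seq 1 20); do
  curl -s -o /dev/null \
       -w 'connect=%{time_connect}s total=%{time_total}s\n' \
       http://service-b:3001/health
done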

Great troubleshooting, @randunel. Thanks a lot - it helped me figure out what is going on, as we’re also seeing exactly the same behaviour, with the very same strace/tcpdump captures as shown above.

Has anyone heard about any plans for having this addressed? The strange thing is that there have been no updates since August 2016, considering that DNS resolution is so unreliable it cannot be trusted at all.

@angeloluciani that is more a workaround than a solution…

@thaJeztah I am still experiencing this issue with 1.9.1. Don’t have my hands on 1.10.0 yet.

The issue seems to stem from the iptables-based bridging. Something happens where some containers simply stop sending valid outgoing traffic; tcpdump didn’t really offer conclusive diagnostics. One workaround I found was to run iptables-save, restart iptables (which flushes the rules), then iptables-restore. Networking to the container resumes without restarting the container instance.

I have not tried the dummy mac address workaround so I can’t confirm that works. This is difficult to troubleshoot because as soon as we poke and prod, the services revive. Here’s what’s brought it back to life.

  • docker restart <container>
  • iptables-save > tmpfile; service iptables restart; iptables-restore < tmpfile
  • service docker restart (more shotgun approach to the docker container restart)

I think the common thing between these is that docker reconfigures the iptables rules in each case. Even if it’s not a docker issue directly (it may be iptables), I believe this is a docker networking design issue and should be looked into at the very least.
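
Spelled out, the iptables workaround looks roughly like this (a sketch, assuming a RHEL/Amazon Linux style service iptables init script; double-check before running it on a box you care about):

# dump the current rules, flush them via the init script, then restore them
iptables-save > /tmp/iptables.backup
service iptables restart
iptables-restore < /tmp/iptables.backup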

FWIW, we were able to eliminate the DNS issues we had on larger machines that performed a high rate of DNS resolution by running a local cache (dnsmasq). It’s been a few months, but I believe this was related to some kind of port exhaustion: lookups would stack up in the conntrack table, and once it hit roughly 65K entries things would go south.
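
One possible shape for such a local cache (a sketch only; 172.17.0.1 is typically the docker0 bridge address but both it and the cache size are assumptions that will differ per host):

# /etc/dnsmasq.conf on the host -- listen on the docker bridge and cache lookups
listen-address=172.17.0.1
bind-interfaces
cache-size=10000

Containers can then be pointed at the host-side cache with --dns 172.17.0.1, either on the daemon or per docker run.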

@brudrafon Thanks for the info.

The symptoms here are subtly different from the article you posted. The article describes the host losing connectivity; in this case it is the docker containers that lose connectivity, not the host.

For example, I have a single host running 5 containers: container a on port 3000, container b on port 3001, etc…

The symptom is that any one random container suddenly becomes unable to make outgoing calls in a timely fashion. The container still receives requests but cannot respond, because it is unable to communicate with dependent services (e.g. DNS, DB, etc.), DNS being the most obvious example. Some, none, or all containers may fail at the same time. While all of this is happening, connectivity to and from the host itself remains unaffected. Restarting the docker daemon works around the issue temporarily but only delays the inevitable.

I will give the info in the article a try, but even if it helps it is still just a workaround. Docker really must either fix this in their project or champion the upstream fix in the bridge driver.