moby: docker containers experience packet loss or slow network (EAI_AGAIN in application)
I’ve had an issue similar to #12364 and/or #11407 where my production node.js apps inside the container simply stopped responding. However, I thought the circumstances were different enough to warrant a separate issue.
Once a container stops responding it will not respond again until the docker daemon is restarted. Not all containers stop responding at the same time. I’m led to the conclusion that this is a docker issue rather than a node one, because I have several different services running on the same server and all of them experience apparent heavy packet loss hours before the error below occurs, which eventually seems to crash node in turn. Restarting the docker daemon cleared up both the error and the packet loss. The exception caught in the app is interesting in that it is a seldom-occurring EAI_AGAIN error (DNS temporary failure; also not temporary in this case), which led me to believe it could be related to #12364 and/or #11407.
Errors I am seeing in the node app
2015-04-21T08:26:21.415Z - info: --> method:[GET] url:[/status] status:[200] time:[1ms]
events.js:85
throw er; // Unhandled 'error' event
^
Error: getaddrinfo EAI_AGAIN
at Object.exports._errnoException (util.js:742:11)
at errnoException (dns.js:46:15)
at Object.onlookup [as oncomplete] (dns.js:91:26)
Details
[root@server1 root]# docker --version
Docker version 1.5.0, build a8a31ef/1.5.0
[root@server1 root]# uname -a
Linux server1 3.14.20-20.44.amzn1.x86_64 #1 SMP Mon Oct 6 22:52:46 UTC 2014 x86_64 x86_64 x86_64 GNU/Linux
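For what it’s worth, when the error hits, one way to confirm the DNS side from inside a container is something like the following. This is only a rough sketch: the container name and test hostname are placeholders, and getent may not exist in minimal images.
```
# Compare name resolution on the host vs. inside the affected container.
# <container> and example.com are placeholders.
getent hosts example.com                           # from the host
docker exec <container> getent hosts example.com   # from inside the container
docker exec <container> cat /etc/resolv.conf       # which resolver the container uses
```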
Thank you
Thank you guys for sharing, I had the same problem and it was solved using service docker restart.
+1 @HudsonAkridge, we have the same network issues when we send/receive a lot of packets. We always hit it in unexpected places when running the tests.
Dear @aboch and @thaJeztah, please join the conversation if you can. It is a very bad bug, and we feel it every day.
All hosts have the latest stable Docker version and all have the same issues.
Those hosts are standalone sandbox servers for testing, without swarm or other connections.
We run the tests from the Windows host where that Docker is installed, against the application inside docker.
docker info
docker version
Not sure, guys, if this is a good idea, but I just noticed that setting the docker0 MAC address to the eth0 value, as described here, dramatically improves network stuff (rough sketch below). E.g. having eth0 Link encap:Ethernet HWaddr 00:15:b2:a9:8f:6e, I did:
My environment:
In my case LRO was turned off. Hope it helps someone.
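For reference, a minimal sketch of that MAC-address workaround, assuming the interfaces are named docker0 and eth0 (adjust for your host); treat it as an illustration, not a verified fix:
```
# Copy eth0's MAC address onto docker0 (interface names are assumptions).
ETH0_MAC=$(cat /sys/class/net/eth0/address)
sudo ip link set dev docker0 address "$ETH0_MAC"

# Check whether large-receive-offload (LRO) is enabled on eth0,
# since the comment above mentions LRO being off in that case.
ethtool -k eth0 | grep large-receive-offload
```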
We’re also experiencing a very similar issue between the bridge network. Several large requests will go through quickly, then the next one or two requests will hang for a long while trying to establish a connection to the other docker container, then my suspicion is something in docker is restarted or refreshed, and we get 5 or 6 more large requests through quickly, then another hang until the same restart point is reached.
Very cumbersome to deal with, and restarting the docker daemon is not an acceptable way to solve it in our situation.
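One rough way to observe the pattern described above is to time repeated cross-container requests; the container name, target address, and port below are placeholders, and curl is assumed to be installed in the calling container:
```
# Time 20 requests from container-a to another container over the docker bridge.
# container-a, 172.17.0.3, and port 3001 are placeholders; adjust for your setup.
for i in $(seq 1 20); do
  docker exec container-a \
    curl -s -o /dev/null -w "request $i: %{time_total}s (HTTP %{http_code})\n" \
    http://172.17.0.3:3001/
done
```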
Great troubleshooting, @randunel . Thanks a lot - helped me figure out the situation as we’re also experiencing absolutely the same situation with the very same straces/tcpdump captures as shown above.
Has anyone heard about any plans for having this addressed? The strange thing is that there have been no updates since August 2016, considering that DNS resolution is so unreliable that it cannot be trusted at all.
@angeloluciani that is more a workaround than a solution…
@thaJeztah I am still experiencing this issue with 1.9.1. Don’t have my hands on 1.10.0 yet.
The issue seems to be in the iptables-based bridging. Something happens where some containers simply stop sending valid outgoing traffic. tcpdump didn’t really offer conclusive diagnostics. One work-around I found was to run iptables-save, restart iptables (flush), then iptables-restore (sketched below); the networking to the container resumes without restarting the container instance.
I have not tried the dummy mac address workaround so I can’t confirm that works. This is difficult to troubleshoot because as soon as we poke and prod, the services revive. Here’s what’s brought it back to life.
I think the common thing between these is that docker reconfigures the iptables rules in each case. If it’s not a docker issue directly (iptables), I believe this is a docker networking design issue and it should be looked into at the very least.
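For reference, a rough sketch of that iptables-save / flush / iptables-restore workaround; the file path is arbitrary, and note that flushing briefly removes all rules (including docker’s NAT rules), so verify on your own distro before trying it:
```
# Snapshot the current rules (docker's included), flush, then reload the snapshot.
iptables-save > /tmp/iptables.rules
iptables -F                              # flush the filter table
iptables -t nat -F                       # flush the nat table (docker's port mappings)
iptables-restore < /tmp/iptables.rules   # reload the snapshot
```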
FWIW, we were able to eliminate the DNS issues we had on larger machines that performed a high rate of DNS resolution by running a local cache (dnsmasq). It’s been a few months, but I believe this was related to some kind of port exhaustion: the entries would stack up in the conntrack table, and once it hit roughly 65K things would go south.
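A sketch of that mitigation, with the caveat that package and service names vary by distro, the image name and HOST_IP are placeholders, and the conntrack paths assume the nf_conntrack module is loaded:
```
# Run a local caching DNS server on the host.
yum install -y dnsmasq && service dnsmasq start   # or: apt-get install dnsmasq

# Point a container at the host-local cache (HOST_IP and my-image are placeholders).
docker run --dns "$HOST_IP" my-image

# Watch conntrack usage against its ceiling (the ~65K mentioned above).
cat /proc/sys/net/netfilter/nf_conntrack_count
cat /proc/sys/net/netfilter/nf_conntrack_max
```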
@brudrafon Thanks for the info.
The symptoms here are very subtly different from those in the article you posted. The article describes the symptoms as the host losing connectivity; in this case, the docker containers are the instances that lose connectivity, not the host.
For example, I have a single host running 5 containers: container a on port 3000, container b on port 3001, etc.
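Purely to illustrate that layout (the container names and images below are hypothetical):
```
# Hypothetical layout only; names and images are made up.
docker run -d --name container-a -p 3000:3000 my-node-service-a
docker run -d --name container-b -p 3001:3001 my-node-service-b
# ...and so on for the remaining containers.
```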
The symptoms are that any one random container is suddenly unable to make outgoing calls in a timely fashion. The container is receiving requests but is not able to respond because it is unable to communicate with dependent services (e.g. DNS, DB, etc.), DNS being the most obvious example. Some, none, or all of the containers may fail at the same time. All the while this is happening, connectivity to/from the host remains unaffected. Restarting the docker daemon works around the issue temporarily but only delays the inevitable.
I will give the info in the article a try, though even if it helps it is still just a workaround. Docker really must either fix this in their project or champion the upstream fix in the bridge driver.