weave: "Sometimes containers cannot connect but they still respond to ping"
Seems similar to another report on Slack from @fermayo; TL;DR is that Weave is apparently seeing the same MAC as coming from two different peers, and this happens at the same time as networking gets broken to the container that really owns that MAC.
I tried to recreate locally and failed.
As reported on weave-users Slack:
jakolehm "we have really strange problem with weave net… sometimes containers cannot connect but they still respond to ping (weave 1.4.5) and this happens on coreos stable (latest) and this happens pretty randomly let’s say we have two peers, A and B … if we send request from peer A to container in peer B we don’t see packet in peer B at all
bryan do you see it on the weave bridge?
jakolehm no only on peer A weave bridge
bryan ok, this is crossing two hosts, so you have two weave bridges
jakolehm yes ping seems to work, and we see the packets
bryan Check the destination MAC address does actually correspond to the container you are trying to hit
jakolehm it does, checked but, ping reply mac was different
jakolehm actually ping reply mac address is something that we cannot find in any of the machines in this cluster
jakolehm actually it seems that request destination mac is wrong also for tcp connections
12:56:37.698792 4e:fa:a8:29:0a:e2 (oui Unknown) > ba:6e:b6:d8:d8:b9 (oui Unknown), ethertype IPv4 (0x0800), length 74: 10.81.31.139.50716 > weave-test-3.gcp-1.kontena.local.http: Flags [S], seq 3193307477, win 27400, options [mss 1370,sackOK,TS val 286470583 ecr 0,nop,wscale 7], length 0
ba:6e:b6:d8:d8:b9 should be f2:e2:6e:f4:a3:ce
matthias what host does weave think that MAC is on? logs should tell you.
jakolehm
INFO: 2016/07/06 10:42:08.675404 Discovered remote MAC ba:6e:b6:d8:d8:b9 at 5e:5a:4a:45:b3:a9(gcp-1-4.c.kontena-1162.internal)
INFO: 2016/07/06 11:26:34.352968 Expired MAC ba:6e:b6:d8:d8:b9 at 5e:5a:4a:45:b3:a9(gcp-1-4.c.kontena-1162.internal)
INFO: 2016/07/06 11:41:12.600574 Discovered remote MAC ba:6e:b6:d8:d8:b9 at 5e:5a:4a:45:b3:a9(gcp-1-4.c.kontena-1162.internal)
INFO: 2016/07/06 12:07:34.364139 Expired MAC ba:6e:b6:d8:d8:b9 at 5e:5a:4a:45:b3:a9(gcp-1-4.c.kontena-1162.internal)
INFO: 2016/07/06 12:10:06.398987 Discovered remote MAC ba:6e:b6:d8:d8:b9 at 5e:5a:4a:45:b3:a9(gcp-1-4.c.kontena-1162.internal)
INFO: 2016/07/06 12:14:08.173212 Discovered remote MAC ba:6e:b6:d8:d8:b9 at 5e:5a:4a:45:b3:a9(gcp-1-4.c.kontena-1162.internal)
ERRO: 2016/07/06 12:18:01.123991 Captured frame from MAC (ba:6e:b6:d8:d8:b9) associated with another peer 5e:5a:4a:45:b3:a9(gcp-1-4.c.kontena-1162.internal)
ERRO: 2016/07/06 12:20:26.996998 Captured frame from MAC (ba:6e:b6:d8:d8:b9) associated with another peer 5e:5a:4a:45:b3:a9(gcp-1-4.c.kontena-1162.internal)
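The log excerpt above is easier to reason about once the events are tallied per MAC. The following is a minimal Go sketch that does this, assuming only the three message shapes visible in the excerpt ("Discovered remote MAC", "Expired MAC", "Captured frame from MAC ... associated with another peer"). The names `macEvent`, `parseLine` and `tally` are hypothetical helpers for illustration and are not part of Weave.

```go
package main

import (
	"bufio"
	"fmt"
	"regexp"
	"strings"
)

// macPat matches a lower-case colon-separated MAC address.
const macPat = `([0-9a-f]{2}(?::[0-9a-f]{2}){5})`

// Patterns for the three log message shapes quoted in this issue.
var (
	discoveredRe = regexp.MustCompile(`Discovered remote MAC ` + macPat + ` at (\S+)`)
	expiredRe    = regexp.MustCompile(`Expired MAC ` + macPat + ` at (\S+)`)
	conflictRe   = regexp.MustCompile(`Captured frame from MAC \(` + macPat + `\) associated with another peer (\S+)`)
)

// macEvent is one MAC-related log line (hypothetical type for this sketch).
type macEvent struct {
	Kind string // "discovered", "expired" or "conflict"
	MAC  string
	Peer string // peer as printed in the log, e.g. "5e:5a:...(hostname)"
}

// parseLine extracts a macEvent from a single log line, if it matches.
func parseLine(line string) (macEvent, bool) {
	if m := conflictRe.FindStringSubmatch(line); m != nil {
		return macEvent{"conflict", m[1], m[2]}, true
	}
	if m := discoveredRe.FindStringSubmatch(line); m != nil {
		return macEvent{"discovered", m[1], m[2]}, true
	}
	if m := expiredRe.FindStringSubmatch(line); m != nil {
		return macEvent{"expired", m[1], m[2]}, true
	}
	return macEvent{}, false
}

// tally counts events per MAC and per kind across a multi-line log.
func tally(logText string) map[string]map[string]int {
	counts := make(map[string]map[string]int)
	sc := bufio.NewScanner(strings.NewReader(logText))
	for sc.Scan() {
		ev, ok := parseLine(sc.Text())
		if !ok {
			continue
		}
		if counts[ev.MAC] == nil {
			counts[ev.MAC] = make(map[string]int)
		}
		counts[ev.MAC][ev.Kind]++
	}
	return counts
}

func main() {
	// Three lines copied from the log excerpt in this issue.
	sample := `INFO: 2016/07/06 12:07:34.364139 Expired MAC ba:6e:b6:d8:d8:b9 at 5e:5a:4a:45:b3:a9(gcp-1-4.c.kontena-1162.internal)
INFO: 2016/07/06 12:10:06.398987 Discovered remote MAC ba:6e:b6:d8:d8:b9 at 5e:5a:4a:45:b3:a9(gcp-1-4.c.kontena-1162.internal)
ERRO: 2016/07/06 12:18:01.123991 Captured frame from MAC (ba:6e:b6:d8:d8:b9) associated with another peer 5e:5a:4a:45:b3:a9(gcp-1-4.c.kontena-1162.internal)`
	for mac, kinds := range tally(sample) {
		fmt.Printf("%s discovered=%d expired=%d conflict=%d\n",
			mac, kinds["discovered"], kinds["expired"], kinds["conflict"])
	}
}
```

A MAC with a non-zero conflict count on one host and "Discovered remote" entries pointing at a different peer is a candidate for the duplicate-source-address symptom described here.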
jakolehm bryan: what is a “local mac” … where does weave get that?
bryan It’s printed when we see a packet with a source mac we have never seen before, on the weave bridge.
Since there ought to be no way for packets to get onto the bridge except from a locally-running container, we think it’s from one of those.
jakolehm but I can’t find that mac on “gcp-1-4” machine
bryan it’s possible it went away
jakolehm but I restarted weave and it’s coming back…
bryan that’s interesting
jakolehm one of the first locally discovered macs
bryan I guess you could tcpdump the weave bridge and see if the packet itself gives any clues
this is somewhat consistent with the “MAC associated with another peer” message - if we’ve never seen the src address before we print “local MAC”, and if we have seen it on another peer we print “associated …”
so, since you do get the latter, it must be something of a race which one is taken as the “real” home of the packet
and the real question is how come we are seeing packets with the same src address on two different weave bridges?
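The race bryan describes can be illustrated with a toy model. The following Go sketch is not Weave's actual implementation. It assumes a single table mapping each source MAC to the peer where it was first seen, with the first sighting winning. The names `macCache`, `captureLocal` and `learnRemote`, and the peer names `peer-A` and `peer-B`, are hypothetical.

```go
package main

import "fmt"

// macCache is a toy model of a per-router table: source MAC -> owning peer.
// Hypothetical type; it only mirrors the behaviour described in the chat.
type macCache struct {
	self  string            // name of the peer this router runs on
	owner map[string]string // MAC -> peer where it was first seen
}

func newMACCache(self string) *macCache {
	return &macCache{self: self, owner: make(map[string]string)}
}

// learnRemote records a MAC reported as belonging to another peer.
// In this model an existing entry is never overwritten.
func (c *macCache) learnRemote(mac, peer string) {
	if _, known := c.owner[mac]; !known {
		c.owner[mac] = peer
	}
}

// captureLocal models a frame seen on the local weave bridge and returns
// the kind of message the chat says would be printed.
func (c *macCache) captureLocal(mac string) string {
	prev, known := c.owner[mac]
	switch {
	case !known:
		// Never seen before: assume it belongs to a local container.
		c.owner[mac] = c.self
		return "local MAC"
	case prev != c.self:
		// Already attributed to a different peer.
		return "associated with another peer " + prev
	default:
		return "known"
	}
}

func main() {
	mac := "ba:6e:b6:d8:d8:b9"

	// Order 1: the local bridge sees the MAC first.
	a := newMACCache("peer-A")
	fmt.Println(a.captureLocal(mac))
	a.learnRemote(mac, "peer-B") // ignored in this model: peer-A already owns it

	// Order 2: the remote sighting arrives first.
	b := newMACCache("peer-A")
	b.learnRemote(mac, "peer-B")
	fmt.Println(b.captureLocal(mac))
}
```

With the same two sightings in a different order, the model attributes the MAC to a different peer, which is the "something of a race" noted above. It says nothing about why the same source address appears on two bridges in the first place.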
jakolehm
13:59:53.379567 ARP, Request who-has weave-test-3.gcp-1.kontena.local tell 10.81.31.139, length 28
13:59:53.379578 ARP, Reply weave-test-3.gcp-1.kontena.local is-at ba:6e:b6:d8:d8:b9 (oui Unknown), length 28
bryan and is that container on a different machine?
jakolehm yes
About this issue
- State: open
- Created 8 years ago
- Comments: 40 (21 by maintainers)
@panuhorsmalahti unfortunately we have two sets of symptoms in this issue; please can you open your own issue to avoid making things worse?