weave: "Sometimes containers cannot connect but they still respond to ping"

Seems similar to another report on Slack from @fermayo; TL;DR is that Weave is apparently seeing the same MAC as coming from two different peers, and this happens at the same time as networking gets broken to the container that really owns that MAC.

I tried to recreate locally and failed.

As reported on weave-users Slack:

jakolehm we have really strange problem with weave net… sometimes containers cannot connect but they still respond to ping (weave 1.4.5)

jakolehm and this happens on coreos stable (latest) and this happens pretty randomly

jakolehm let’s say we have two peers, A and B … if we send request from peer A to container in peer B we don’t see packet in peer B at all

bryan do you see it on the weave bridge?

jakolehm no only on peer A weave bridge

bryan ok, this is crossing two hosts, so you have two weave bridges

jakolehm yes ping seems to work, and we see the packets

bryan Check the destination MAC address does actually correspond to the container you are trying to hit

jakolehm it does, checked but, ping reply mac was different

jakolehm actually ping reply mac address is something that we cannot find in any of the machines in this cluster

jakolehm actually it seems that request destination mac is wrong also for tcp connections

12:56:37.698792 4e:fa:a8:29:0a:e2 (oui Unknown) > ba:6e:b6:d8:d8:b9 (oui Unknown), ethertype IPv4 (0x0800), length 74: 10.81.31.139.50716 > weave-test-3.gcp-1.kontena.local.http: Flags [S], seq 3193307477, win 27400, options [mss 1370,sackOK,TS val 286470583 ecr 0,nop,wscale 7], length 0

ba:6e:b6:d8:d8:b9 should be f2:e2:6e:f4:a3:ce
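The check bryan suggests (does the destination MAC on the wire match the target container's MAC?) can be scripted against link-level tcpdump output like the line above. This is only a minimal sketch: it assumes the `src (oui …) > dst (oui …)` layout shown in the capture, and the function names are hypothetical, not part of tcpdump or Weave.

```python
import re

# A MAC address as six lowercase hex octets separated by colons.
MAC = r"(?:[0-9a-f]{2}:){5}[0-9a-f]{2}"

# Matches "SRC (oui ...) > DST (oui ...)" as printed in the capture above.
FRAME = re.compile(rf"({MAC}) \(oui [^)]*\) > ({MAC}) \(oui [^)]*\)")


def frame_macs(line):
    """Return (src_mac, dst_mac) from one capture line, or None if absent."""
    match = FRAME.search(line)
    return match.groups() if match else None


def destination_matches(line, expected_dst):
    """True/False if the frame's destination MAC equals expected_dst,
    None if the line carries no link-level header."""
    macs = frame_macs(line)
    if macs is None:
        return None
    _src, dst = macs
    return dst == expected_dst


# Capture line from the report (trimmed after the frame length).
line = ("12:56:37.698792 4e:fa:a8:29:0a:e2 (oui Unknown) > "
        "ba:6e:b6:d8:d8:b9 (oui Unknown), ethertype IPv4 (0x0800), length 74:")

# The container's real MAC according to the report.
print(destination_matches(line, "f2:e2:6e:f4:a3:ce"))
```

For the frame in this report the check fails, which is the mismatch jakolehm describes.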

matthias what host does weave think that MAC is on? logs should tell you.

jakolehm

INFO: 2016/07/06 10:42:08.675404 Discovered remote MAC ba:6e:b6:d8:d8:b9 at 5e:5a:4a:45:b3:a9(gcp-1-4.c.kontena-1162.internal)
INFO: 2016/07/06 11:26:34.352968 Expired MAC ba:6e:b6:d8:d8:b9 at 5e:5a:4a:45:b3:a9(gcp-1-4.c.kontena-1162.internal)
INFO: 2016/07/06 11:41:12.600574 Discovered remote MAC ba:6e:b6:d8:d8:b9 at 5e:5a:4a:45:b3:a9(gcp-1-4.c.kontena-1162.internal)
INFO: 2016/07/06 12:07:34.364139 Expired MAC ba:6e:b6:d8:d8:b9 at 5e:5a:4a:45:b3:a9(gcp-1-4.c.kontena-1162.internal)
INFO: 2016/07/06 12:10:06.398987 Discovered remote MAC ba:6e:b6:d8:d8:b9 at 5e:5a:4a:45:b3:a9(gcp-1-4.c.kontena-1162.internal)
INFO: 2016/07/06 12:14:08.173212 Discovered remote MAC ba:6e:b6:d8:d8:b9 at 5e:5a:4a:45:b3:a9(gcp-1-4.c.kontena-1162.internal)
ERRO: 2016/07/06 12:18:01.123991 Captured frame from MAC (ba:6e:b6:d8:d8:b9) associated with another peer 5e:5a:4a:45:b3:a9(gcp-1-4.c.kontena-1162.internal)
ERRO: 2016/07/06 12:20:26.996998 Captured frame from MAC (ba:6e:b6:d8:d8:b9) associated with another peer 5e:5a:4a:45:b3:a9(gcp-1-4.c.kontena-1162.internal)
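To spot this pattern in a longer log, one option is to count, per MAC, how often it was discovered on a remote peer and how often a locally captured frame conflicted with that record. The sketch below assumes only the two message formats visible in the excerpt above; the function names are hypothetical and the sample is a subset of those lines.

```python
import re
from collections import defaultdict

# A few lines copied from the log excerpt above.
LOG = """\
INFO: 2016/07/06 10:42:08.675404 Discovered remote MAC ba:6e:b6:d8:d8:b9 at 5e:5a:4a:45:b3:a9(gcp-1-4.c.kontena-1162.internal)
INFO: 2016/07/06 11:26:34.352968 Expired MAC ba:6e:b6:d8:d8:b9 at 5e:5a:4a:45:b3:a9(gcp-1-4.c.kontena-1162.internal)
ERRO: 2016/07/06 12:18:01.123991 Captured frame from MAC (ba:6e:b6:d8:d8:b9) associated with another peer 5e:5a:4a:45:b3:a9(gcp-1-4.c.kontena-1162.internal)
"""

MAC = r"(?:[0-9a-f]{2}:){5}[0-9a-f]{2}"
DISCOVERED = re.compile(rf"Discovered remote MAC ({MAC}) at ({MAC})\((\S+)\)")
CONFLICT = re.compile(
    rf"Captured frame from MAC \(({MAC})\) associated with another peer ({MAC})\((\S+)\)"
)


def summarize(log_text):
    """Per MAC: number of remote discoveries, number of conflicts, peers named."""
    stats = defaultdict(lambda: {"discovered": 0, "conflicts": 0, "peers": set()})
    for line in log_text.splitlines():
        for key, pattern in (("discovered", DISCOVERED), ("conflicts", CONFLICT)):
            match = pattern.search(line)
            if match:
                mac, _peer_id, host = match.groups()
                stats[mac][key] += 1
                stats[mac]["peers"].add(host)
                break
    return dict(stats)


def conflicted_macs(log_text):
    """MACs that were captured locally while recorded against another peer."""
    return sorted(mac for mac, s in summarize(log_text).items() if s["conflicts"])


print(conflicted_macs(LOG))
```

Any MAC this returns was seen on the local bridge while the router still attributed it to a different peer, which is the symptom under discussion.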

jakolehm bryan: what is a “local mac” … where does weave get that?

bryan It’s printed when we see a packet with a source mac we have never seen before, on the weave bridge.

Since there ought to be no way for packets to get onto the bridge except from a locally-running container, we think it’s from one of those.

jakolehm but I can’t find that mac on “gcp-1-4” machine

bryan it’s possible it went away

jakolehm but I restarted weave and it’s coming back…

bryan that’s interesting

jakolehm one of the first locally discovered macs

bryan I guess you could tcpdump the weave bridge and see if the packet itself gives any clues

bryan this is somewhat consistent with the “MAC associated with another peer” message - if we’ve never seen the src address before we print “local MAC”, and if we have seen it on another peer we print “associated …”

bryan so, since you do get the latter, it must be something of a race which one is taken as the “real” home of the packet

bryan and the real question is how come we are seeing packets with the same src address on two different weave bridges?
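The rule bryan describes can be modelled in a few lines to show why the outcome depends on ordering. This is a toy model under stated assumptions, not Weave's actual code: the class, method names, and return values are hypothetical, and what the real router does when a remote record arrives for a MAC already learned locally is not stated in the discussion, so the model leaves the existing record alone.

```python
class MacCache:
    """Toy model of one peer's view of which peer owns each source MAC."""

    LOCAL = "local"

    def __init__(self):
        self.owner = {}  # MAC -> LOCAL or the name of a remote peer

    def learn_remote(self, mac, peer):
        """Record that another peer reported this MAC (first record wins here)."""
        self.owner.setdefault(mac, peer)

    def capture_local(self, mac):
        """Classify a frame seen on the local bridge by its source MAC."""
        current = self.owner.get(mac)
        if current is None:
            # Never seen before: treated as belonging to a local container.
            self.owner[mac] = self.LOCAL
            return "local MAC"
        if current != self.LOCAL:
            # Already recorded against a different peer.
            return f"associated with another peer {current}"
        return None  # already known as local, nothing to report


mac = "ba:6e:b6:d8:d8:b9"

# Ordering 1: the local frame arrives first.
first_local = MacCache()
print(first_local.capture_local(mac))

# Ordering 2: the remote record arrives first, then a local frame.
first_remote = MacCache()
first_remote.learn_remote(mac, "gcp-1-4")
print(first_remote.capture_local(mac))
```

The same two events produce different results depending on which is processed first, which is the race bryan refers to. It does not explain why the same source address appears on two bridges in the first place.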

jakolehm

13:59:53.379567 ARP, Request who-has weave-test-3.gcp-1.kontena.local tell 10.81.31.139, length 28
13:59:53.379578 ARP, Reply weave-test-3.gcp-1.kontena.local is-at ba:6e:b6:d8:d8:b9 (oui Unknown), length 28

bryan and is that container on a different machine?

jakolehm yes

About this issue

  • State: open
  • Created 8 years ago
  • Comments: 40 (21 by maintainers)

Most upvoted comments

@panuhorsmalahti unfortunately we have two sets of symptoms in this issue; please can you open your own issue to avoid making things worse?