weave: [fastdp] can get stuck in 'sleeve' mode even though fastdp is possible

As things stand, fastdp can only be selected once, during the first HeartbeatTimeout seconds of a connection’s life. If we fail to establish fastdp connectivity during that time, or if it later fails, then we are stuck in ‘sleeve’ mode.
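For readers unfamiliar with the mechanism, the one-shot selection window can be pictured roughly as below. This is a hypothetical Go sketch with an assumed timeout value, not weave's actual code: fastdp is only attempted during the initial heartbeat window, and once that window closes the connection stays on sleeve.

```go
// Hypothetical sketch of the one-shot selection window (assumed timeout,
// not weave's actual code).
package main

import (
	"fmt"
	"time"
)

const heartbeatTimeout = 3 * time.Second // illustrative value only

// selectOverlay waits for fastdp connectivity during the initial window.
// If the window expires first, the connection stays on sleeve and no
// further attempt is made -- the behaviour this issue describes.
func selectOverlay(fastdpReady <-chan struct{}) string {
	select {
	case <-fastdpReady:
		return "fastdp"
	case <-time.After(heartbeatTimeout):
		return "sleeve"
	}
}

func main() {
	ready := make(chan struct{}) // never signalled: fastdp heartbeat never arrives
	fmt.Println("selected overlay:", selectOverlay(ready)) // prints "sleeve"
}
```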

About this issue

  • State: closed
  • Created 9 years ago
  • Reactions: 5
  • Comments: 32 (15 by maintainers)

Most upvoted comments

I am working on the fix for this issue.

Sorry, but AFAIK there is no WIP or a design doc. Also, I haven’t thought about it much.

https://github.com/weaveworks/weave/blob/v2.2.0/router/overlay_switch.go is responsible for choosing which overlay (in most cases either sleeve or fastdp) to use between two peers. Each overlay sends periodic heartbeat messages to detect failures in an overlay connection. E.g. when fastdp times out waiting for a heartbeat message from a peer (https://github.com/weaveworks/weave/blob/v2.2.0/router/fastdp.go#L713), it notifies overlay_switch, which then chooses the next best overlay for the connection (https://github.com/weaveworks/weave/blob/v2.2.0/router/overlay_switch.go#L318).
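To make that "choose another best overlay" step concrete, here is a minimal, hypothetical sketch (illustrative names, not weave's actual types): each forwarder is marked unhealthy when it misses a heartbeat, and the switch picks the highest-priority healthy one that remains.

```go
// Hypothetical sketch of the overlay selection described above
// (illustrative names, not weave's actual types).
package main

import "fmt"

type forwarder struct {
	name     string
	priority int  // higher is preferred; fastdp > sleeve
	healthy  bool // set false when a heartbeat is missed
}

// bestOverlay scans the remaining healthy forwarders and returns the
// preferred one, mirroring the switch's behaviour after a failure.
func bestOverlay(fwds []forwarder) string {
	best, bestPrio := "", -1
	for _, f := range fwds {
		if f.healthy && f.priority > bestPrio {
			best, bestPrio = f.name, f.priority
		}
	}
	return best
}

func main() {
	fwds := []forwarder{
		{name: "fastdp", priority: 2, healthy: false}, // missed heartbeat
		{name: "sleeve", priority: 1, healthy: true},
	}
	fmt.Println("using", bestOverlay(fwds)) // falls back to sleeve
}
```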

A possible fix is to retry establishing the connection (with a backoff timer) after it fails due to a missed heartbeat. Once the connection is re-established, the existing overlay_switch implementation will handle the rest.
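A minimal sketch of that retry idea, assuming illustrative names and backoff values (this is not the code from the eventual PR): after a missed heartbeat the forwarder keeps retrying with an exponentially growing delay, and once a heartbeat ack comes back, overlay_switch's existing logic takes over.

```go
// Hypothetical retry-with-backoff sketch (assumed names and timings,
// not the actual weave fix).
package main

import (
	"errors"
	"fmt"
	"time"
)

const (
	initialBackoff = 1 * time.Second // real values would likely be larger
	maxBackoff     = 1 * time.Minute
)

// tryEstablishFastdp stands in for sending a fastdp heartbeat and waiting
// for the ack; here it fails twice and then succeeds, to show the flow.
func tryEstablishFastdp(attempt int) error {
	if attempt < 2 {
		return errors.New("missed heartbeat ack")
	}
	return nil
}

func retryFastdp() {
	backoff := initialBackoff
	for attempt := 0; ; attempt++ {
		if err := tryEstablishFastdp(attempt); err == nil {
			// Once re-established, the existing overlay_switch logic adds
			// fastdp back to the forwarder list and starts using it again.
			fmt.Printf("fastdp re-established after %d retries\n", attempt)
			return
		}
		time.Sleep(backoff)
		if backoff < maxBackoff {
			backoff *= 2
		}
	}
}

func main() { retryFastdp() }
```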

Do these messages look like the heartbeat is acknowledged and the connection is switched?

@Cryptophobia When there is an issue (which causes the connection to drop from fastdp to sleeve), you should see log messages like the ones below.

DEBU: 2018/08/30 09:32:00.781565 fastdp ->[192.168.56.100:6784|7e:ee:ea:63:fb:d1(192.168.56.100)]: missed Heartbeat from peer, marking fastdp forwarder as un-healthy
DEBU: 2018/08/30 09:32:00.781656 overlay_switch ->[7e:ee:ea:63:fb:d1(192.168.56.100)] Removing fastdp from the list of forwarders

and then, after the retry, you should see recovery messages like the ones below

DEBU: 2018/08/30 09:32:40.784595 fastdp ->[192.168.56.100:6784|7e:ee:ea:63:fb:d1(192.168.56.100)]: got Heartbeat from peer, marking fastdp forwarder as healthy
DEBU: 2018/08/30 09:32:40.784655 overlay_switch ->[7e:ee:ea:63:fb:d1(192.168.56.100)] Adding fastdp to the list of forwarders
INFO: 2018/08/30 09:32:40.784673 overlay_switch ->[7e:ee:ea:63:fb:d1(192.168.56.100)] using fastdp

Please check whether you see a pattern like this, which confirms that connections are getting upgraded to fastdp.
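If it helps to confirm this across many connections, a tiny, hypothetical helper (not part of weave) could count those transitions in a captured log. It simply greps for the exact messages quoted above; how you capture the weave container's log is up to you.

```go
// Hypothetical log-scanning helper: pipe a captured weave log into stdin
// and it counts the fallback and recovery messages quoted above.
package main

import (
	"bufio"
	"fmt"
	"os"
	"strings"
)

func main() {
	removed, using := 0, 0
	sc := bufio.NewScanner(os.Stdin)
	sc.Buffer(make([]byte, 0, 1024*1024), 1024*1024) // ODP log lines can be very long
	for sc.Scan() {
		line := sc.Text()
		if strings.Contains(line, "Removing fastdp from the list of forwarders") {
			removed++
		}
		if strings.Contains(line, "using fastdp") {
			using++
		}
	}
	fmt.Printf("fastdp removed %d times, switched to fastdp %d times\n", removed, using)
}
```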

Great, thank you @murali-reddy. 👍 I am able to test in two of the staging clusters now.

Looks like I am getting lots of these messages, which are a good indication that the connections are upgraded to fastdp:

DEBU: 2018/09/04 18:59:49.749160 sleeve ->[172.33.40.133:6783|1e:55:d5:c2:25:ab(ip-172-33-40-133.ec2.internal)]: sendHeartbeat
DEBU: 2018/09/04 18:59:49.749792 fastdp ->[172.33.40.133:6784|1e:55:d5:c2:25:ab(ip-172-33-40-133.ec2.internal)]: sending Heartbeat to peer
DEBU: 2018/09/04 18:59:49.754691 sleeve ->[172.33.40.133:6783|1e:55:d5:c2:25:ab(ip-172-33-40-133.ec2.internal)]: handleHeartbeat
DEBU: 2018/09/04 18:59:49.759046 ODP miss: map[3:InPortFlowKey{vport: 2} 22:UnknownFlowKey{type: 22, key: 00000000, mask: exact} 23:UnknownFlowKey{type: 23, key: 0000, mask: exact} 15:BlobFlowKey{type: 15, key: 00000200, mask: ffffffff} 25:UnknownFlowKey{type: 25, key: 00000000000000000000000000000000, mask: exact} 19:BlobFlowKey{type: 19, key: 00000000, mask: ffffffff} 20:BlobFlowKey{type: 20, key: 00000000, mask: ffffffff} 2:BlobFlowKey{type: 2, key: 00000000, mask: ffffffff} 16:TunnelFlowKey{id: 0000000000fbc75b, ipv4src: 172.33.88.133, ipv4dst: 172.33.140.28, ttl: 64, tpsrc: 22788, tpdst: 6784} 24:UnknownFlowKey{type: 24, key: 00000000, mask: exact} 4:EthernetFlowKey{src: 00:00:00:00:00:00, dst: 00:00:00:00:00:00}] on port 2
DEBU: 2018/09/04 18:59:49.759122 fastdp ->[172.33.40.133:6784|1e:55:d5:c2:25:ab(ip-172-33-40-133.ec2.internal)]: handleVxlanSpecialPacket
DEBU: 2018/09/04 18:59:49.759149 fastdp ->[172.33.40.133:6784|1e:55:d5:c2:25:ab(ip-172-33-40-133.ec2.internal)]: Got Heartbeat Ack from peer
DEBU: 2018/09/04 18:59:50.619606 ODP miss: map[19:BlobFlowKey{type: 19, key: 00000000, mask: ffffffff} 18:BlobFlowKey{type: 18, key: 0002, mask: ffff} 3:InPortFlowKey{vport: 2} 9:BlobFlowKey{type: 9, key: cf4618eb, mask: ffffffff} 25:UnknownFlowKey{type: 25, key: 00000000000000000000000000000000, mask: exact} 23:UnknownFlowKey{type: 23, key: 0000, mask: exact} 2:BlobFlowKey{type: 2, key: 00000000, mask: ffffffff} 22:UnknownFlowKey{type: 22, key: 00000000, mask: exact} 16:TunnelFlowKey{id: 0000000000fbc75b, ipv4src: 172.33.88.133, ipv4dst: 172.33.140.28, ttl: 64, tpsrc: 36937, tpdst: 6784} 6:BlobFlowKey{type: 6, key: 0800, mask: ffff} 4:EthernetFlowKey{src: c5:d2:f2:ef:e9:5f, dst: 32:ez:32:25:5f:39} 15:BlobFlowKey{type: 15, key: 00000200, mask: ffffffff} 20:BlobFlowKey{type: 20, key: 00000000, mask: ffffffff} 24:UnknownFlowKey{type: 24, key: 00000000, mask: exact} 7:BlobFlowKey{type: 7, key: 646c00026460000d06004000, mask: ffffffffffffffffffffffff}] on port 2
DEBU: 2018/09/04 18:59:50.619913 Creating ODP flow FlowSpec{keys: [InPortFlowKey{vport: 2} TunnelFlowKey{id: 0000000000fbc75b, ipv4src: 172.33.88.133, ipv4dst: 172.33.140.28} EthernetFlowKey{src: c5:d2:f2:ef:e9:5f, dst: 32:ez:32:25:5f:39}], actions: [OutputAction{vport: 1}]}
DEBU: 2018/09/04 18:59:50.620116 Creating ODP flow FlowSpec{keys: [EthernetFlowKey{src: 32:ez:32:25:5f:39, dst: c5:d2:f2:ef:e9:5f} InPortFlowKey{vport: 1}], actions: [SetTunnelAction{id: 000000000075bfbc, ipv4src: 172.33.140.28, ipv4dst: 172.33.88.133, tos: 0, ttl: 64, df: true, csum: false} OutputAction{vport: 2}]}

Will run for longer and monitor the outputs in the debug logs. Do these messages look like the heartbeat is acknowledged and the connection is switched?

So I made a fix https://github.com/weaveworks/weave/pull/3385 for this issue and tested it out.

If anyone wishes to help out with testing, please use the image muralireddy/weave-kube:retry-fastdp.

I am not sure how to create a situation with a real handshake timeout that triggers the fallback to sleeve mode, so I injected the failure in the code and tested both the fallback (fastdp -> sleeve) and the retry (sleeve -> fastdp).

@Cryptophobia @brb Is there a way to measure when the switch to sleeve can happen? In our case it was a heartbeat timeout (we got that from the weave log), but we don't know for sure why the timeout happened. Is there any Prometheus metric or log info that can help us figure out the reasons for the timeout? @brb mentioned some reasons here: #3252 (comment)

I believe that CPU or the kernel (offloading ipsec) are the most likely culprits, but I am not sure what the best way to verify would be. I am running a docker image and testing using the methods outlined here https://github.com/weaveworks/weave/issues/3252#issuecomment-371467574 in order to isolate the cause.

Is there a way to measure when the switch to sleeve can happen?

There is a request to expose the status as a Prometheus metric (#3376), which is not implemented yet, but you may be able to infer it as per the comment https://github.com/weaveworks/weave/issues/3376#issuecomment-412124840
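Until that metric exists, one workaround is to poll the router's local status endpoint and count how many connections report fastdp versus sleeve. A rough sketch follows; the endpoint path and port (127.0.0.1:6784/status/connections) are assumptions based on what the weave script queries, so verify them against your deployment.

```go
// Rough sketch: query the router's local status endpoint (path and port
// assumed, verify against your deployment) and count fastdp vs sleeve
// connections.
package main

import (
	"fmt"
	"io"
	"net/http"
	"strings"
)

func main() {
	resp, err := http.Get("http://127.0.0.1:6784/status/connections")
	if err != nil {
		fmt.Println("could not reach weave status endpoint:", err)
		return
	}
	defer resp.Body.Close()

	body, err := io.ReadAll(resp.Body)
	if err != nil {
		fmt.Println("failed to read status:", err)
		return
	}
	text := string(body)
	fmt.Printf("connections using fastdp: %d, sleeve: %d\n",
		strings.Count(text, "fastdp"), strings.Count(text, "sleeve"))
}
```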