weave: [fastdp] can get stuck in 'sleeve' mode even though fastdp is possible

As things stand, fastdp can only be selected once, during the first HeartbeatTimeout seconds of a connection’s life. If we fail to establish fastdp connectivity during that time, or if it later fails, then we are stuck in ‘sleeve’ mode.
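For readers unfamiliar with the mechanism, the one-shot selection window can be pictured roughly as below. This is a hypothetical Go sketch with an assumed timeout value, not weave's actual code: fastdp is only attempted during the initial heartbeat window, and once that window closes the connection stays on sleeve.

```go
// Hypothetical sketch of the one-shot selection window (assumed timeout,
// not weave's actual code).
package main

import (
	"fmt"
	"time"
)

const heartbeatTimeout = 3 * time.Second // illustrative value only

// selectOverlay waits for fastdp connectivity during the initial window.
// If the window expires first, the connection stays on sleeve and no
// further attempt is made -- the behaviour this issue describes.
func selectOverlay(fastdpReady <-chan struct{}) string {
	select {
	case <-fastdpReady:
		return "fastdp"
	case <-time.After(heartbeatTimeout):
		return "sleeve"
	}
}

func main() {
	ready := make(chan struct{}) // never signalled: fastdp heartbeat never arrives
	fmt.Println("selected overlay:", selectOverlay(ready)) // prints "sleeve"
}
```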

About this issue

  • State: closed
  • Created 9 years ago
  • Reactions: 5
  • Comments: 32 (15 by maintainers)

Most upvoted comments

I am working on the fix for this issue.

Sorry, but AFAIK there is no WIP or a design doc. Also, I haven’t thought about it much.

https://github.com/weaveworks/weave/blob/v2.2.0/router/overlay_switch.go is responsible for choosing which overlay (in most cases either sleeve or fastdp) to use between two peers. Each overlay sends periodic heartbeat messages to detect failures in an overlay connection. E.g. when fastdp times out waiting for a heartbeat message from a peer (https://github.com/weaveworks/weave/blob/v2.2.0/router/fastdp.go#L713), it notifies overlay_switch, which then chooses the next best overlay for the connection (https://github.com/weaveworks/weave/blob/v2.2.0/router/overlay_switch.go#L318).
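To make that "choose another best overlay" step concrete, here is a minimal, hypothetical sketch (illustrative names, not weave's actual types): each forwarder is marked unhealthy when it misses a heartbeat, and the switch picks the highest-priority healthy one that remains.

```go
// Hypothetical sketch of the overlay selection described above
// (illustrative names, not weave's actual types).
package main

import "fmt"

type forwarder struct {
	name     string
	priority int  // higher is preferred; fastdp > sleeve
	healthy  bool // set false when a heartbeat is missed
}

// bestOverlay scans the remaining healthy forwarders and returns the
// preferred one, mirroring the switch's behaviour after a failure.
func bestOverlay(fwds []forwarder) string {
	best, bestPrio := "", -1
	for _, f := range fwds {
		if f.healthy && f.priority > bestPrio {
			best, bestPrio = f.name, f.priority
		}
	}
	return best
}

func main() {
	fwds := []forwarder{
		{name: "fastdp", priority: 2, healthy: false}, // missed heartbeat
		{name: "sleeve", priority: 1, healthy: true},
	}
	fmt.Println("using", bestOverlay(fwds)) // falls back to sleeve
}
```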

A possible fix is to retry establishing the connection (with a backoff timer) after it fails due to a missed heartbeat. Once the connection is re-established, the existing overlay_switch implementation will handle the rest.
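A minimal sketch of that retry idea, assuming illustrative names and backoff values (this is not the code from the eventual PR): after a missed heartbeat the forwarder keeps retrying with an exponentially growing delay, and once a heartbeat ack comes back, overlay_switch's existing logic takes over.

```go
// Hypothetical retry-with-backoff sketch (assumed names and timings,
// not the actual weave fix).
package main

import (
	"errors"
	"fmt"
	"time"
)

const (
	initialBackoff = 1 * time.Second // real values would likely be larger
	maxBackoff     = 1 * time.Minute
)

// tryEstablishFastdp stands in for sending a fastdp heartbeat and waiting
// for the ack; here it fails twice and then succeeds, to show the flow.
func tryEstablishFastdp(attempt int) error {
	if attempt < 2 {
		return errors.New("missed heartbeat ack")
	}
	return nil
}

func retryFastdp() {
	backoff := initialBackoff
	for attempt := 0; ; attempt++ {
		if err := tryEstablishFastdp(attempt); err == nil {
			// Once re-established, the existing overlay_switch logic adds
			// fastdp back to the forwarder list and starts using it again.
			fmt.Printf("fastdp re-established after %d retries\n", attempt)
			return
		}
		time.Sleep(backoff)
		if backoff < maxBackoff {
			backoff *= 2
		}
	}
}

func main() { retryFastdp() }
```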

Do these messages look like the heartbeat is acknowledged and the connection is switched?

@Cryptophobia When there is an issue (which causes the connection to drop from fastdp to sleeve), you should see log messages like the ones below.

DEBU: 2018/08/30 09:32:00.781565 fastdp ->[192.168.56.100:6784|7e:ee:ea:63:fb:d1(192.168.56.100)]: missed Heartbeat from peer, marking fastdp forwarder as un-healthy
DEBU: 2018/08/30 09:32:00.781656 overlay_switch ->[7e:ee:ea:63:fb:d1(192.168.56.100)] Removing fastdp from the list of forwarders

and then, after the retry, you should see recovery messages like the ones below

DEBU: 2018/08/30 09:32:40.784595 fastdp ->[192.168.56.100:6784|7e:ee:ea:63:fb:d1(192.168.56.100)]: got Heartbeat from peer, marking fastdp forwarder as healthy
DEBU: 2018/08/30 09:32:40.784655 overlay_switch ->[7e:ee:ea:63:fb:d1(192.168.56.100)] Adding fastdp to the list of forwarders
INFO: 2018/08/30 09:32:40.784673 overlay_switch ->[7e:ee:ea:63:fb:d1(192.168.56.100)] using fastdp

Please check whether you see a pattern like this, which confirms that connections are getting upgraded to fastdp.
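If it helps to confirm this across many connections, a tiny, hypothetical helper (not part of weave) could count those transitions in a captured log. It simply greps for the exact messages quoted above; how you capture the weave container's log is up to you.

```go
// Hypothetical log-scanning helper: pipe a captured weave log into stdin
// and it counts the fallback and recovery messages quoted above.
package main

import (
	"bufio"
	"fmt"
	"os"
	"strings"
)

func main() {
	removed, using := 0, 0
	sc := bufio.NewScanner(os.Stdin)
	sc.Buffer(make([]byte, 0, 1024*1024), 1024*1024) // ODP log lines can be very long
	for sc.Scan() {
		line := sc.Text()
		if strings.Contains(line, "Removing fastdp from the list of forwarders") {
			removed++
		}
		if strings.Contains(line, "using fastdp") {
			using++
		}
	}
	fmt.Printf("fastdp removed %d times, switched to fastdp %d times\n", removed, using)
}
```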

Great, thank you @murali-reddy. 👍 I am able to test in two of the staging clusters now.

Looks like I am getting lots of these messages, which are a good indication that the connections are upgraded to fastdp:

DEBU: 2018/09/04 18:59:49.749160 sleeve ->[172.33.40.133:6783|1e:55:d5:c2:25:ab(ip-172-33-40-133.ec2.internal)]: sendHeartbeat
DEBU: 2018/09/04 18:59:49.749792 fastdp ->[172.33.40.133:6784|1e:55:d5:c2:25:ab(ip-172-33-40-133.ec2.internal)]: sending Heartbeat to peer
DEBU: 2018/09/04 18:59:49.754691 sleeve ->[172.33.40.133:6783|1e:55:d5:c2:25:ab(ip-172-33-40-133.ec2.internal)]: handleHeartbeat
DEBU: 2018/09/04 18:59:49.759046 ODP miss: map[3:InPortFlowKey{vport: 2} 22:UnknownFlowKey{type: 22, key: 00000000, mask: exact} 23:UnknownFlowKey{type: 23, key: 0000, mask: exact} 15:BlobFlowKey{type: 15, key: 00000200, mask: ffffffff} 25:UnknownFlowKey{type: 25, key: 00000000000000000000000000000000, mask: exact} 19:BlobFlowKey{type: 19, key: 00000000, mask: ffffffff} 20:BlobFlowKey{type: 20, key: 00000000, mask: ffffffff} 2:BlobFlowKey{type: 2, key: 00000000, mask: ffffffff} 16:TunnelFlowKey{id: 0000000000fbc75b, ipv4src: 172.33.88.133, ipv4dst: 172.33.140.28, ttl: 64, tpsrc: 22788, tpdst: 6784} 24:UnknownFlowKey{type: 24, key: 00000000, mask: exact} 4:EthernetFlowKey{src: 00:00:00:00:00:00, dst: 00:00:00:00:00:00}] on port 2
DEBU: 2018/09/04 18:59:49.759122 fastdp ->[172.33.40.133:6784|1e:55:d5:c2:25:ab(ip-172-33-40-133.ec2.internal)]: handleVxlanSpecialPacket
DEBU: 2018/09/04 18:59:49.759149 fastdp ->[172.33.40.133:6784|1e:55:d5:c2:25:ab(ip-172-33-40-133.ec2.internal)]: Got Heartbeat Ack from peer
DEBU: 2018/09/04 18:59:50.619606 ODP miss: map[19:BlobFlowKey{type: 19, key: 00000000, mask: ffffffff} 18:BlobFlowKey{type: 18, key: 0002, mask: ffff} 3:InPortFlowKey{vport: 2} 9:BlobFlowKey{type: 9, key: cf4618eb, mask: ffffffff} 25:UnknownFlowKey{type: 25, key: 00000000000000000000000000000000, mask: exact} 23:UnknownFlowKey{type: 23, key: 0000, mask: exact} 2:BlobFlowKey{type: 2, key: 00000000, mask: ffffffff} 22:UnknownFlowKey{type: 22, key: 00000000, mask: exact} 16:TunnelFlowKey{id: 0000000000fbc75b, ipv4src: 172.33.88.133, ipv4dst: 172.33.140.28, ttl: 64, tpsrc: 36937, tpdst: 6784} 6:BlobFlowKey{type: 6, key: 0800, mask: ffff} 4:EthernetFlowKey{src: c5:d2:f2:ef:e9:5f, dst: 32:ez:32:25:5f:39} 15:BlobFlowKey{type: 15, key: 00000200, mask: ffffffff} 20:BlobFlowKey{type: 20, key: 00000000, mask: ffffffff} 24:UnknownFlowKey{type: 24, key: 00000000, mask: exact} 7:BlobFlowKey{type: 7, key: 646c00026460000d06004000, mask: ffffffffffffffffffffffff}] on port 2
DEBU: 2018/09/04 18:59:50.619913 Creating ODP flow FlowSpec{keys: [InPortFlowKey{vport: 2} TunnelFlowKey{id: 0000000000fbc75b, ipv4src: 172.33.88.133, ipv4dst: 172.33.140.28} EthernetFlowKey{src: c5:d2:f2:ef:e9:5f, dst: 32:ez:32:25:5f:39}], actions: [OutputAction{vport: 1}]}
DEBU: 2018/09/04 18:59:50.620116 Creating ODP flow FlowSpec{keys: [EthernetFlowKey{src: 32:ez:32:25:5f:39, dst: c5:d2:f2:ef:e9:5f} InPortFlowKey{vport: 1}], actions: [SetTunnelAction{id: 000000000075bfbc, ipv4src: 172.33.140.28, ipv4dst: 172.33.88.133, tos: 0, ttl: 64, df: true, csum: false} OutputAction{vport: 2}]}

Will run for longer and monitor the outputs in the debug logs. Do these messages look like the heartbeat is acknowledged and the connection is switched?

So I made a fix https://github.com/weaveworks/weave/pull/3385 for this issue and tested it out.

If anyone wishes to help out with testing, please use the image muralireddy/weave-kube:retry-fastdp.

I am not sure how to create a situation with a real handshake timeout that triggers the fallback to sleeve mode, so I injected the failure in the code and tested both the fallback (fastdp -> sleeve) and the retry (sleeve -> fastdp).

@Cryptophobia @brb Is there a way to measure when the switch to sleeve can happen? In our case it was a heartbeat timeout (we got that from the weave log), but we don't know for sure why the timeout happened. Is there any Prometheus metric or log info that can help us figure out the reasons for the timeout? @brb mentioned some reasons here: #3252 (comment)

I believe that CPU or the kernel (offloading ipsec) are the most likely culprits, but I am not sure what the best way to verify would be. I am running a docker image and testing using the methods outlined here https://github.com/weaveworks/weave/issues/3252#issuecomment-371467574 in order to isolate the cause.

Is there a way to measure when the switch to sleeve can happen?

There is a request to expose the status as a Prometheus metric (#3376), which is not implemented yet, but you may be able to infer it as per the comment https://github.com/weaveworks/weave/issues/3376#issuecomment-412124840
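Until that metric exists, one workaround is to poll the router's local status endpoint and count how many connections report fastdp versus sleeve. A rough sketch follows; the endpoint path and port (127.0.0.1:6784/status/connections) are assumptions based on what the weave script queries, so verify them against your deployment.

```go
// Rough sketch: query the router's local status endpoint (path and port
// assumed, verify against your deployment) and count fastdp vs sleeve
// connections.
package main

import (
	"fmt"
	"io"
	"net/http"
	"strings"
)

func main() {
	resp, err := http.Get("http://127.0.0.1:6784/status/connections")
	if err != nil {
		fmt.Println("could not reach weave status endpoint:", err)
		return
	}
	defer resp.Body.Close()

	body, err := io.ReadAll(resp.Body)
	if err != nil {
		fmt.Println("failed to read status:", err)
		return
	}
	text := string(body)
	fmt.Printf("connections using fastdp: %d, sleeve: %d\n",
		strings.Count(text, "fastdp"), strings.Count(text, "sleeve"))
}
```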