weave: [fastdp] can get stuck in 'sleeve' mode even though fastdp is possible
As things stand, fastdp can only be selected once, during the first HeartbeatTimeout seconds of a connection’s life. If we fail to establish fastdp connectivity during that time, or if it later fails, then we are stuck in ‘sleeve’ mode.
About this issue
- State: closed
- Created 9 years ago
- Reactions: 5
- Comments: 32 (15 by maintainers)
Commits related to this issue
- Add logic to re-try chooseBest() which will select best forwarder changes differentiate beteween fatal errors (for e.g. ipsec init) from transient errors (heartbeast misses). In case of transient err... — committed to weaveworks/weave by murali-reddy 6 years ago
- changes differentiate beteween fatal errors (for e.g. ipsec init) from transient errors (heartbeast misses). In case of transient errors OverlayForwarder is marked to be unhealty. When overlay forward... — committed to weaveworks/weave by murali-reddy 6 years ago
I am working on the fix for this issue.
Sorry, but AFAIK there is no WIP or a design doc. Also, I haven’t thought about it much.
https://github.com/weaveworks/weave/blob/v2.2.0/router/overlay_switch.go is responsible for choosing which overlay (in most cases either sleeve or fastdp) to use between two peers. Each overlay sends periodic heartbeat messages to detect failures in an overlay connection. E.g. when fastdp times out waiting for a heartbeat message from a peer (https://github.com/weaveworks/weave/blob/v2.2.0/router/fastdp.go#L713), it notifies overlay_switch, which then chooses the next-best overlay for the connection (https://github.com/weaveworks/weave/blob/v2.2.0/router/overlay_switch.go#L318).
A possible fix is to retry establishing the connection (with a backoff timer) after it has failed due to a missing heartbeat. Once it is established, the existing overlay_switch implementation will handle the rest.
@Cryptophobia When there is an issue (due to which a connection gets dropped from fastdp to sleeve), you should see a log message like the one below.
And then you should see recovery after retry, with messages like the ones below.
Please see if you have a pattern like this, which confirms that connections are getting upgraded to fastdp.
Great, thank you @murali-reddy. 👍 I am able to test in two of the staging clusters now.
Looks like I am getting lots of these messages, which are a good indication that the connections are upgraded to fastdp:
Will run for longer and monitor the outputs in the debug logs. Do these messages look right to you, i.e. that the heartbeat is acknowledged and the connection is switched?
So I made a fix, https://github.com/weaveworks/weave/pull/3385, for this issue and tested it out.
If anyone wishes to help out with testing, please use the image muralireddy/weave-kube:retry-fastdp
I am not sure how to create a situation where we have a real handshake timeout that triggers the fallback to sleeve mode. I injected the failure in the code and tested both the fallback (fastdp->sleeve) and the retry (sleeve->fastdp).
I believe that the CPU or the kernel (offloading ipsec) are the most likely culprits, but I am not sure what the best way to verify this would be. Perhaps by running a docker image and testing with the methods outlined in https://github.com/weaveworks/weave/issues/3252#issuecomment-371467574 in order to isolate the cause.
There is a request to provide the status as a Prometheus metric (#3376), which is not implemented yet, but you may be able to infer it as per the comment https://github.com/weaveworks/weave/issues/3376#issuecomment-412124840
thx @brb