kubernetes: [Failing Test] kind-ipv6-master-parallel (ci-kubernetes-kind-ipv6-e2e-parallel)

Which jobs are failing:

kind-ipv6-master-parallel (ci-kubernetes-kind-ipv6-e2e-parallel)

Which test(s) are failing:

[sig-network] Services should create endpoints for unready pods is failing consistently.

Since when has it been failing:

07-09-2020 11:39 CDT

Testgrid link:

https://testgrid.k8s.io/sig-release-master-blocking#kind-ipv6-master-parallel

Reason for failure:

test/e2e/network/service.go:1979
Jul 10 05:27:25.533: expected un-ready endpoint for Service slow-terminating-unready-pod within 5m0s, stdout: 
test/e2e/network/service.go:2061

Anything else we need to know:

/cc @kubernetes/ci-signal /sig network /priority critical-urgent /milestone v1.19

About this issue

  • State: closed
  • Created 4 years ago
  • Comments: 20 (18 by maintainers)

Most upvoted comments

At long last, I think I have a solution: https://github.com/kubernetes/kubernetes/pull/93089. It is admittedly rather ugly, so I’m very open to other approaches for fixing the bug.

For a bit more context: when I added manual IP diffing yesterday I had actually fixed the root issue, but didn’t realize it because a DNS issue kept the test failing (I didn’t initially notice the difference in curl error codes). @danwinship was completely right that getEndpointAddresses should have done the trick, but unfortunately we don’t always have a reliable source for the Service IP family at that point, which is why this only showed up on IPv6 tests. It also only showed up because IPv6 IP allocation was slow enough that the Pod IP was not yet allocated by the time syncService ran, so we had to rely on pod change detection to handle this.
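
To make that concrete, here is a rough sketch of the selection problem (illustrative only; addressForService is a made-up helper, not the actual endpoints controller code): without a reliable Service IP family, you can’t tell which of a dual-stack pod’s IPs to publish.

// Rough sketch only -- addressForService is a hypothetical helper, not the
// real controller code. It shows the general problem: pick the pod address
// matching the Service's IP family, and when the family isn't reliably
// known, fall back to the first pod IP, which may be the wrong family on a
// dual-stack node.
package main

import (
	"fmt"
	"net"

	v1 "k8s.io/api/core/v1"
)

func addressForService(podIPs []v1.PodIP, family *v1.IPFamily) (string, error) {
	for _, podIP := range podIPs {
		parsed := net.ParseIP(podIP.IP)
		if parsed == nil {
			continue
		}
		isV6 := parsed.To4() == nil
		switch {
		case family == nil:
			// IP family unknown -- the situation described above.
			return podIP.IP, nil
		case *family == v1.IPv6Protocol && isV6:
			return podIP.IP, nil
		case *family == v1.IPv4Protocol && !isV6:
			return podIP.IP, nil
		}
	}
	return "", fmt.Errorf("no pod IP matching the service IP family")
}

func main() {
	family := v1.IPv6Protocol
	addr, _ := addressForService([]v1.PodIP{{IP: "10.0.0.5"}, {IP: "fd00::5"}}, &family)
	fmt.Println(addr) // fd00::5
}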

Thanks also to @aojea for all the help tracking this one down; it took a while to figure out, and he was very helpful in talking through various theories and even just helping me debug failing tests with Kind.

I’ve spent some time digging into this one, and it seems likely that it’s tied to my change that makes kube-proxy read from EndpointSlices instead of Endpoints: https://github.com/kubernetes/kubernetes/pull/92736. This test started failing for kind-ipv6 just after that was merged. I did try building a new cluster from master now and running this test, and it consistently passed, so there has to be something more here, related either to dual-stack or to Kind.

I also took some time to look at the related code in both the EndpointSlice and Endpoints controllers, as well as kube-proxy’s parsing of EndpointSlices and Endpoints, and could not find a meaningful difference. I’m going to try to figure out whether there’s anything unique to dual-stack that would affect this.
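
For anyone following along, this is roughly the shape of the data on both sides (an illustrative sketch with made-up helper names, not the actual kube-proxy code): a classic Endpoints object splits addresses into ready and not-ready lists per subset, while an EndpointSlice carries a per-endpoint Ready condition that consumers are documented to treat as ready when nil.

// Illustrative comparison only -- not kube-proxy's real code paths.
package main

import (
	"fmt"

	v1 "k8s.io/api/core/v1"
	discovery "k8s.io/api/discovery/v1beta1"
)

// readyAddrsFromEndpoints collects the addresses a consumer would treat as
// usable from a classic Endpoints object. For a publishNotReadyAddresses
// (tolerate-unready) Service, the controller is expected to put even
// unready pods here rather than in NotReadyAddresses.
func readyAddrsFromEndpoints(eps *v1.Endpoints) []string {
	var out []string
	for _, sub := range eps.Subsets {
		for _, a := range sub.Addresses {
			out = append(out, a.IP)
		}
	}
	return out
}

// readyAddrsFromSlice does the same for an EndpointSlice, where readiness
// is a per-endpoint condition; a nil Ready is documented as "assume ready".
func readyAddrsFromSlice(slice *discovery.EndpointSlice) []string {
	var out []string
	for _, ep := range slice.Endpoints {
		if ep.Conditions.Ready == nil || *ep.Conditions.Ready {
			out = append(out, ep.Addresses...)
		}
	}
	return out
}

func main() {
	ready := true // what the controller should set for a tolerate-unready Service
	eps := &v1.Endpoints{Subsets: []v1.EndpointSubset{{
		Addresses: []v1.EndpointAddress{{IP: "fd00::5"}},
	}}}
	slice := &discovery.EndpointSlice{
		AddressType: discovery.AddressTypeIPv6,
		Endpoints: []discovery.Endpoint{{
			Addresses:  []string{"fd00::5"},
			Conditions: discovery.EndpointConditions{Ready: &ready},
		}},
	}
	fmt.Println(readyAddrsFromEndpoints(eps)) // [fd00::5]
	fmt.Println(readyAddrsFromSlice(slice))   // [fd00::5] -- equivalent result
}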

/assign /cc @aojea

I’m not sure why, but @aojea did create a follow-up issue to look into it. While testing, I found that even a one-second delay between pod and service creation was enough for an IPv6 address to be consistently assigned to the Pod before syncService was called. Without that delay, the initial syncService call always happened before an IPv6 address was assigned; this was never an issue for IPv4 clusters in my testing.
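
A minimal sketch of that idea, with a made-up predicate name (not the actual controller code): a pod update that only adds an IP still has to count as an endpoint-relevant change, so the Service gets requeued and the endpoint created by the initial syncService call is corrected.

// Hypothetical predicate, for illustration only: should this pod update
// trigger a resync of the Services that select it?
package main

import (
	"fmt"
	"reflect"

	v1 "k8s.io/api/core/v1"
)

func podEndpointChanged(oldPod, newPod *v1.Pod) bool {
	// An IP showing up late (e.g. a slow IPv6 allocation) is exactly the
	// case that gets missed if only readiness and labels are compared.
	return oldPod.Status.PodIP != newPod.Status.PodIP ||
		!reflect.DeepEqual(oldPod.Status.PodIPs, newPod.Status.PodIPs)
}

func main() {
	oldPod := &v1.Pod{}
	newPod := &v1.Pod{Status: v1.PodStatus{
		PodIP:  "fd00::5",
		PodIPs: []v1.PodIP{{IP: "fd00::5"}},
	}}
	fmt.Println(podEndpointChanged(oldPod, newPod)) // true -- requeue the Service
}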

How to repro (I didn’t have time to dig more into this, but for reference):

# cd to your kubernetes repo
kind build node-image
# note: to reproduce the IPv6 flavor you need an IPv6 cluster, e.g. a kind
# config with "networking.ipFamily: ipv6" passed via --config
kind create cluster --image kindest/node:latest
make WHAT="test/e2e/e2e.test vendor/github.com/onsi/ginkgo/ginkgo cmd/kubectl"
kind get kubeconfig > kconf
./_output/local/go/bin/e2e.test -kubeconfig kconf -ginkgo.focus="Services should create endpoints for unready pods"

OK, one more update for the day. Sadly, no root cause yet; I’ve just been ruling out potential reasons for this to break. I ran this test a few more times on my e2e cluster and it consistently passed. I modified it to leave the setup in place so I could look at the iptables config with and without the DualStack feature flag enabled, and it looks like it worked correctly in both cases.

Here’s the diff of the iptables-save output on a node, first without dual-stack enabled and then with it enabled (--cluster-cidr=10.64.0.0/14,fde4:8dba:82e1::/48 --feature-gates="IPv6DualStack=true"): https://www.diffchecker.com/6gU6VMfk. Although there are slight variations in the diff, none of them relate to the tolerate-unready Service; those lines are identical in both cases. Maybe I need to go further and set both a v4 and a v6 IP on the Service; I’m not sure what else I need to properly replicate this. My cluster is using defaults everywhere except for the kube-proxy config changes above.

Maybe there’s something in Kind that’s affecting this, and the EndpointSlice commit isn’t as closely tied to it as I initially thought? Or maybe the EndpointSlice implementation adds enough extra delay that the iptables lock issue comes up here as well?

~We have identified a problem in KIND that is causing flakiness, due to some components holding the iptables lock and causing disruption to the services.~ ~I will prepare a PR in the next few days; it should fix more of the flakiness, but I can’t promise anything 😉 🤞 We still have to investigate why IPv6 is more sensitive to these problems than IPv4, but that’s in the backlog~

Scratch that; I mixed up this issue with a similar one 👁️