kubernetes: "Services should be rejected when no endpoints exist" is flaky

The test “Services should be rejected when no endpoints exist” is flaky. For example, https://testgrid.k8s.io/presubmits-kubernetes-blocking#pull-kubernetes-e2e-gce currently shows 4 flakes in roughly the last 10 days.

The problem is that the test assumes that every bad connection will be rejected with an ICMP Destination Unreachable, but in fact the kernel rate-limits ICMP errors:

icmp_ratelimit - INTEGER
	Limit the maximal rates for sending ICMP packets whose type matches
	icmp_ratemask (see below) to specific targets.
	0 to disable any limiting,
	otherwise the minimal space between responses in milliseconds.
	Note that another sysctl, icmp_msgs_per_sec limits the number
	of ICMP packets	sent on all targets.
	Default: 1000

...

icmp_ratemask - INTEGER
	Mask made of ICMP types for which rates are being limited.
	Significant bits: IHGFEDCBA9876543210
	Default mask:     0000001100000011000 (6168)

	Bit definitions (see include/linux/icmp.h):
		0 Echo Reply
		3 Destination Unreachable *
		4 Source Quench *
		5 Redirect
		8 Echo Request
		B Time Exceeded *
		C Parameter Problem *
		D Timestamp Request
		E Timestamp Reply
		F Info Request
		G Info Reply
		H Address Mask Request
		I Address Mask Reply

	* These are rate limited by default (see default mask above)

so depending on what else is going on, the ICMP replies expected by this test might get suppressed.
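
To make the default mask above concrete, here is a small sketch that decodes an icmp_ratemask value into the ICMP types it rate-limits. The bit table is copied from the documentation quoted above; the function name `rate_limited_types` is just an illustrative choice.

```python
# Bit positions are ICMP type numbers, as listed in the kernel documentation
# quoted above (B = 11, C = 12, ... I = 18).
ICMP_TYPES = {
    0: "Echo Reply",
    3: "Destination Unreachable",
    4: "Source Quench",
    5: "Redirect",
    8: "Echo Request",
    11: "Time Exceeded",
    12: "Parameter Problem",
    13: "Timestamp Request",
    14: "Timestamp Reply",
    15: "Info Request",
    16: "Info Reply",
    17: "Address Mask Request",
    18: "Address Mask Reply",
}


def rate_limited_types(mask: int) -> list[str]:
    """Return the names of the ICMP types whose bit is set in the mask."""
    return [name for bit, name in sorted(ICMP_TYPES.items()) if mask & (1 << bit)]


DEFAULT_MASK = 6168  # default icmp_ratemask from the documentation

print(rate_limited_types(DEFAULT_MASK))
# ['Destination Unreachable', 'Source Quench', 'Time Exceeded', 'Parameter Problem']
```

Destination Unreachable (bit 3) is in the default mask, which is why the replies this test waits for are subject to the limit.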

There is some variation in just how flaky this test is; it flakes much more in OpenShift than in upstream kubernetes. It’s not clear yet whether this is caused by something specific to OpenShift or just by the different base Linux distros in the two test environments. (eg, OpenShift on RHEL 7 flakes much more than OpenShift on RHEL 8.)

So, minimally, the test should retry a few times so that it’s more likely to get the expected result even when rate-limiting is happening.
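
A minimal sketch of that retry idea, not the actual e2e test code (which lives in the Kubernetes Go test suite). It assumes a hypothetical `probe` callable that returns True when the connection attempt was actively rejected and False when it merely timed out. The 1.5 s default delay is an assumption chosen to exceed the default icmp_ratelimit spacing of 1000 ms.

```python
import time
from typing import Callable


def rejected_within(
    probe: Callable[[], bool],
    attempts: int = 5,
    delay_s: float = 1.5,
    sleep: Callable[[float], None] = time.sleep,
) -> bool:
    """Call probe() up to `attempts` times; return True once it reports a rejection.

    probe is hypothetical: True means "connection was rejected",
    False means "no rejection seen" (for example, a timeout).
    """
    if attempts < 1:
        raise ValueError("attempts must be at least 1")
    for attempt in range(attempts):
        if probe():
            return True
        # Wait before the next try, but not after the last one.
        if attempt < attempts - 1:
            sleep(delay_s)
    return False


# Usage with a fake probe: the first two replies are suppressed, the third arrives.
outcomes = iter([False, False, True])
print(rejected_within(lambda: next(outcomes), sleep=lambda _s: None))  # True
```

A real probe should still fail the test outright if the connection unexpectedly succeeds; retrying only makes sense for the "no reply" case.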

Possibly we could rewrite the test so that rather than relying on actually getting an ICMP Destination Unreachable, we just use iptables tracing to ensure that we are hitting the -j REJECT rule as expected, and call that good enough.
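
One way to approximate this without full tracing is to compare per-rule packet counters before and after the connection attempt. Note this swaps in a different technique (the counters printed by `iptables-save -c`) for the tracing suggested above. The sketch below only parses text in that format; the sample rules, service name, and IP address are hand-written and hypothetical, not captured from a real cluster.

```python
import re

# Lines from `iptables-save -c` start with "[packets:bytes]" followed by the rule.
COUNTED_RULE = re.compile(r"^\[(\d+):(\d+)\]\s+(.*)$")


def reject_packet_count(save_output: str, match_text: str) -> int:
    """Sum the packet counters of REJECT rules whose text contains match_text."""
    total = 0
    for line in save_output.splitlines():
        m = COUNTED_RULE.match(line.strip())
        if m and "-j REJECT" in m.group(3) and match_text in m.group(3):
            total += int(m.group(1))
    return total


def reject_rule_was_hit(before: str, after: str, match_text: str) -> bool:
    """True if the matching REJECT rules saw more packets in `after` than in `before`."""
    return reject_packet_count(after, match_text) > reject_packet_count(before, match_text)


# Hypothetical sample data in the `iptables-save -c` format.
SAMPLE_BEFORE = """\
[0:0] -A KUBE-SERVICES -d 10.0.0.10/32 -p tcp -m comment --comment "e2e/no-pods has no endpoints" -m tcp --dport 80 -j REJECT --reject-with icmp-port-unreachable
"""
SAMPLE_AFTER = """\
[3:180] -A KUBE-SERVICES -d 10.0.0.10/32 -p tcp -m comment --comment "e2e/no-pods has no endpoints" -m tcp --dport 80 -j REJECT --reject-with icmp-port-unreachable
"""

print(reject_rule_was_hit(SAMPLE_BEFORE, SAMPLE_AFTER, "e2e/no-pods"))  # True
```

This checks that the rule matched, which is independent of whether the kernel then suppressed the ICMP reply.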

Although, either way, we’re effectively saying that “services should be rejected when no endpoints exist” is an ideal but not actually a promise…

/sig network
/priority important-soon

About this issue

State: closed
Created 4 years ago
Reactions: 1
Comments: 16 (14 by maintainers)

Most upvoted comments

@aojea note that I was linking to PR runs not periodic runs (since I had that link handy) so in some cases failures might be due to bad PRs…

danwinship on May 12, 2020

wow, are we really hitting these limits?

In OCP it looks like we’re hitting a kernel bug such that the rate limiting counters may not get reset correctly in some circumstances (which is making us fail the test almost 100% of the time with RHEL 8.2 kernels). That’s being worked on.

For the k8s CI flakes, I’m guessing that the problem is that sometimes the parallelization of tests ends up pitting the “rejected when no endpoints” test against some other test that generates a lot of ICMP messages (either Destination Unreachable or something else; there’s only a single rate limiter for all rate-limited ICMP message types).

danwinship on May 9, 2020
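
The shared-limiter guess in the comment above can be illustrated with a deliberately simplified model. This is not the kernel's algorithm: it allows at most one rate-limited ICMP message per interval and ignores bursts and per-destination state. The `simulate` function and the event labels are hypothetical.

```python
def simulate(events, interval_ms=1000):
    """Return the (time_ms, label) events that pass a single shared limiter.

    Simplified model: once a message is sent, every rate-limited message of
    any type is suppressed until interval_ms has elapsed.
    """
    sent = []
    next_allowed = 0
    for t, label in sorted(events):
        if t >= next_allowed:
            sent.append((t, label))
            next_allowed = t + interval_ms
    return sent


# Alone on the node, the test's Destination Unreachable reply goes out.
quiet = [(500, "test: dest-unreachable")]
print(simulate(quiet))
# [(500, 'test: dest-unreachable')]

# A parallel test that triggers ICMP errors every second uses up the budget,
# so the reply the "no endpoints" test is waiting for is suppressed.
busy = [
    (0, "other: time-exceeded"),
    (500, "test: dest-unreachable"),
    (1000, "other: time-exceeded"),
    (2000, "other: time-exceeded"),
]
print(simulate(busy))
# [(0, 'other: time-exceeded'), (1000, 'other: time-exceeded'), (2000, 'other: time-exceeded')]
```

In this model a single retry at a fixed offset can lose to the same competing traffic repeatedly, which is one argument for several retries rather than one.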