kubernetes: [sig-node] Summary API [NodeConformance] when querying /stats/summary ... networking info is nil on containerd

Failure cluster 79384df20f4e672ed9e1

Test: [sig-node] Summary API [NodeConformance] when querying /stats/summary should report resource usage through the stats api

Jobs:

  • ci-cgroup-systemd-containerd-node-e2e
  • ci-cos-containerd-node-e2e
  • pull-kubernetes-node-e2e-containerd
Error text:
test/e2e_node/summary_test.go:53
Timed out after 180.001s.
Expected
    <string>: Summary
to match fields: {
.Pods[summary-test-3122::stats-busybox-1].Network:
	Expected
	    <string>: NetworkStats
	to match fields: {
	[.InterfaceStats.Name:
		Expected
		    <string>: 
		to equal
		    <string>: eth0, .InterfaceStats.RxBytes:
		Expected
		    <*uint64 | 0x0>: nil
		not to be <nil>, .InterfaceStats.RxErrors:
		Expected
		    <*uint64 | 0x0>: nil
		not to be <nil>, .InterfaceStats.TxBytes:
		Expected
		    <*uint64 | 0x0>: nil
		not to be <nil>, .InterfaceStats.TxErrors:
		Expected
		    <*uint64 | 0x0>: nil
		not to be <nil>]
	.Interfaces:
		Expected
		    <[]v1alpha1.InterfaceStats | len:0, cap:0>: nil
		not to be nil
	}
	
}

test/e2e_node/summary_test.go:327

Recent failures:

  • 3/27/2022, 9:20:16 AM ci-cgroup-systemd-containerd-node-e2e
  • 3/27/2022, 3:20:08 AM ci-cos-containerd-node-e2e
  • 3/26/2022, 9:20:08 PM ci-cos-containerd-node-e2e
  • 3/26/2022, 4:06:12 AM ci-cos-containerd-node-e2e
  • 3/25/2022, 11:58:17 AM ci-cos-containerd-node-e2e

Started flaking 03/23.

/kind flake

/sig node

/priority important-soon

/milestone v1.24

About this issue

  • State: closed
  • Created 2 years ago
  • Comments: 66 (66 by maintainers)

Most upvoted comments

https://github.com/google/cadvisor/pull/3103 is ready now, and the CI jobs for cadvisor are 🟢 as well.

@bobbypage can you please merge it, release v0.45, and open a PR against k/k to update the cadvisor version used by k8s?

thanks, Dims

I don’t see how the metrics grabber is related to this. There is a bug in getNodeSummary, though: it doesn’t work for IPv6 addresses.

diff --git a/test/e2e_node/util.go b/test/e2e_node/util.go
index 1c8b1699f74..8f803460e68 100644
--- a/test/e2e_node/util.go
+++ b/test/e2e_node/util.go
@@ -23,9 +23,11 @@ import (
        "flag"
        "fmt"
        "io"
+       "net"
        "net/http"
        "os/exec"
        "regexp"
+       "strconv"
        "strings"
        "time"
 
@@ -82,7 +84,7 @@ func getNodeSummary() (*stats.Summary, error) {
        if err != nil {
                return nil, fmt.Errorf("failed to get current kubelet config")
        }
-       req, err := http.NewRequest("GET", fmt.Sprintf("http://%s:%d/stats/summary", kubeletConfig.Address, kubeletConfig.ReadOnlyPort), nil)
+       req, err := http.NewRequest("GET", fmt.Sprintf("http://%s/stats/summary", net.JoinHostPort(kubeletConfig.Address, strconv.Itoa(int(kubeletConfig.ReadOnlyPort)))), nil)
        if err != nil {
                return nil, fmt.Errorf("failed to build http request: %v", err)
        }

Also, these checks:

			gomega.Eventually(getNodeSummary, 180*time.Second, 15*time.Second).Should(matchExpectations)
			ginkgo.By("Validating /stats/summary are consistent")
			// Then the summary should match the expectations a few more times.
			gomega.Consistently(getNodeSummary, 30*time.Second, 15*time.Second).Should(matchExpectations)

are not providing any information. I think we should compare against something different, so that artifacts do not fail the comparison.

I’ll submit a patch

By the way, we are now also trying to reproduce the bug with extra logging in https://github.com/kubernetes/kubernetes/pull/109472, so far with no luck. @mmiranda96 and @bobbypage tried locally and never managed to reproduce it. @ruiwen-zhao had some luck reproducing it earlier in https://github.com/kubernetes/kubernetes/pull/109371, but it seems not any longer.

@liggitt this test is generally very flaky in part because it tests so many (too many?) things…

https://github.com/kubernetes/kubernetes/issues/108836 had issues with CPUStats.
https://github.com/kubernetes/kubernetes/issues/104292 was swap-only?

This failure is networkstats timing out specifically.

reproduced with a bit more logging in https://github.com/kubernetes/kubernetes/pull/109371#issuecomment-1104422428

TLDR, my change below intermittently caused an “invalid memory address or nil pointer dereference”:

framework.Logf("ruiwen-zhao: /stats/suammry: %+v", summary)
for _, p := range summary.Pods {
	// Dereferencing p.Network panics whenever it is nil.
	framework.Logf("ruiwen-zhao: NetworkStats: %+v", *p.Network)
}

So I guess p.Network is a nil pointer when the test fails.
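
A nil-safe variant of that logging could look roughly like the sketch below. The PodStats and NetworkStats types here are trimmed-down stand-ins invented for the example (the real stats API types have many more fields), and describeNetwork is a hypothetical helper, not something from the test code.

```go
package main

import "fmt"

// NetworkStats is a hypothetical, trimmed-down stand-in for the stats API type.
type NetworkStats struct {
	Name    string
	RxBytes *uint64
}

// PodStats is a hypothetical, trimmed-down stand-in for the stats API type.
type PodStats struct {
	Name    string
	Network *NetworkStats
}

// describeNetwork formats a pod's network stats and checks for nil first,
// so a pod without network stats is reported instead of causing a panic.
func describeNetwork(p PodStats) string {
	if p.Network == nil {
		return fmt.Sprintf("%s: NetworkStats: <nil>", p.Name)
	}
	return fmt.Sprintf("%s: NetworkStats: %+v", p.Name, *p.Network)
}

func main() {
	rx := uint64(1024)
	pods := []PodStats{
		{Name: "stats-busybox-0", Network: &NetworkStats{Name: "eth0", RxBytes: &rx}},
		{Name: "stats-busybox-1"}, // Network left nil, as suspected in the failing runs
	}
	for _, p := range pods {
		fmt.Println(describeNetwork(p))
	}
}
```

Logging the nil case explicitly would also show which pod is missing its network stats, instead of aborting the run on the first one.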

@mikebrow we could try an approach like #101960 to fix this.

Also, comparing the test run time in the latest 5 runs with the earliest 5 runs, I don’t see a big difference:

// Latest 5 runs - all passing
E2eNode Suite: [sig-node] Summary API [NodeConformance] when querying /stats/summary should report resource usage through the stats api	1m1s

E2eNode Suite: [sig-node] Summary API [NodeConformance] when querying /stats/summary should report resource usage through the stats api	1m4s

E2eNode Suite: [sig-node] Summary API [NodeConformance] when querying /stats/summary should report resource usage through the stats api	49s

E2eNode Suite: [sig-node] Summary API [NodeConformance] when querying /stats/summary should report resource usage through the stats api	1m4s

E2eNode Suite: [sig-node] Summary API [NodeConformance] when querying /stats/summary should report resource usage through the stats api	1m8s

---
// 5 runs after 2022/03/23 21:14:10, all passing

E2eNode Suite: [sig-node] Summary API [NodeConformance] when querying /stats/summary should report resource usage through the stats api	55s

E2eNode Suite: [sig-node] Summary API [NodeConformance] when querying /stats/summary should report resource usage through the stats api	1m6s

E2eNode Suite: [sig-node] Summary API [NodeConformance] when querying /stats/summary should report resource usage through the stats api	50s

E2eNode Suite: [sig-node] Summary API [NodeConformance] when querying /stats/summary should report resource usage through the stats api	1m4s

E2eNode Suite: [sig-node] Summary API [NodeConformance] when querying /stats/summary should report resource usage through the stats api	51s

So the passing runs are stable and take only about 1 minute, whereas failed runs take more than 3 minutes, which matches the 180-second timeout in the error text. This does not look like a performance regression.

@helayoty this is one of three issues SIG Node is tracking as a possible regression in 1.24. We need a clear answer before it can be removed from the milestone.

👍 for explanation; 👎 for unclear test signal 😃