kubernetes: CRI getFsInfo logs errors for valid filesystems mounted after kubelet start
What happened:
New overlay filesystems created for Pods after kubelet start trigger repeated error messages. For example:
cri_stats_provider.go:375] Failed to get the info of the filesystem with mountpoint "/var/lib/containers/storage/overlay/2243263dc8ba6c55d8867d3e47bcaaa145d15f0ace19420824d639d86defe2e3/merged": failed to get device for dir "/var/lib/containers/storage/overlay/2243263dc8ba6c55d8867d3e47bcaaa145d15f0ace19420824d639d86defe2e3/merged": could not find device with major: 0, minor: 83 in cached partitions map.
Error messages are logged every 10 seconds for every filesystem created after kubelet start. For large kubelet nodes, this issue creates an excessive amount of error logs.
What you expected to happen:
New overlay filesystems created for Pods after kubelet start should not log error messages.
How to reproduce it (as minimally and precisely as possible):
Use cri_stats_provider with cri-o and configure overlay storage. UsingLegacyCadvisorStats() needs to return false per https://github.com/cri-o/cri-o/pull/3054 for cri_stats_provider to be used. Start with an empty /var/lib/containers/storage and schedule Pods on the node.
Anything else we need to know?:
As part of https://github.com/kubernetes/kubernetes/pull/59475, cadvisor.GetFsInfoByFsUUID() was changed to cadvisor.GetDirFsInfo() in getFsInfo(); however, the test for cadvisorfs.ErrNoSuchDevice was not modified, changing how getFsInfo() handles a cache miss from cadvisor.
cadvisor caches UUIDs, partitions, and mounts on kubelet start, and the cache is never refreshed. Given that getFsInfo() is expected to fail to retrieve data from cadvisor in some cases (see the comments in ImageFsStats() and https://github.com/kubernetes/heapster/issues/1793), this condition should not be logged as an error. Alternatively, cadvisor fs could be modified to refresh on a cache miss, enabling getFsInfo() to return data for these filesystems.
Environment:
- Kubernetes version (use kubectl version): 1.18.6
- CRI: CRI-O
About this issue
- Original URL
- State: closed
- Created 4 years ago
- Comments: 45 (19 by maintainers)
Hi folks, I still have this issue in the most recent Kubernetes, 1.24.6. The fix from google/cadvisor/pull/3018 went into cAdvisor 0.44.0, and that version has been in Kubernetes for a long time already, since 1.24.0:
- https://github.com/kubernetes/kubernetes/pull/109029 Dep bump to runc 1.1.0, cadvisor 0.44.0
- https://github.com/kubernetes/kubernetes/pull/109675 Automated cherry pick of #109658: Bump cAdvisor to v0.44.1 (28.4.2022)
So is there possibly any chance to progress with a review of this PR please? https://github.com/kubernetes/kubernetes/pull/100448
@haircommander @ml-
Right now I am thinking of either handling the cache miss better, or returning the DeviceInfo even if the fs type is not btrfs in cadvisor, or both.
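One possible reading of "handling the cache miss better" is to refresh the cached partitions map once when a lookup misses. The Go sketch below only illustrates that idea under stated assumptions: partitionCache, devID, scan, and newDemoCache are hypothetical names, and the device numbers are made up for the demo. It does not reflect how cadvisor is actually structured.

```go
package main

import (
	"fmt"
	"sync"
)

// devID identifies a device by major and minor number.
type devID struct{ major, minor uint }

// partitionCache is a hypothetical stand-in for a partitions map that is
// filled at start-up; scan is an assumed hook that re-reads current mounts.
type partitionCache struct {
	mu      sync.Mutex
	devices map[devID]string
	scan    func() map[devID]string
	rescans int // number of refreshes triggered by misses
}

// lookup returns the device name for id, refreshing the cache once on a miss.
func (c *partitionCache) lookup(id devID) (string, error) {
	c.mu.Lock()
	defer c.mu.Unlock()
	if name, ok := c.devices[id]; ok {
		return name, nil
	}
	// Cache miss: the mount may have appeared after start-up, so re-scan.
	c.devices = c.scan()
	c.rescans++
	if name, ok := c.devices[id]; ok {
		return name, nil
	}
	return "", fmt.Errorf("could not find device with major: %d, minor: %d after refresh", id.major, id.minor)
}

// newDemoCache builds a cache that knows one device at start-up, while the
// scan hook also reports an overlay mount created later.
func newDemoCache() *partitionCache {
	return &partitionCache{
		devices: map[devID]string{{8, 1}: "/dev/sda1"},
		scan: func() map[devID]string {
			return map[devID]string{{8, 1}: "/dev/sda1", {0, 83}: "overlay"}
		},
	}
}

func main() {
	c := newDemoCache()
	for _, id := range []devID{{8, 1}, {0, 83}, {0, 99}} {
		name, err := c.lookup(id)
		fmt.Printf("id=%v name=%q err=%v rescans=%d\n", id, name, err, c.rescans)
	}
}
```

A refresh on every miss could be expensive on a node with many mounts, so a real change would probably need some form of rate limiting. That trade-off is not addressed in this sketch.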