amazon-vpc-cni-k8s: Couldn't get resource list for external.metrics.k8s.io/v1beta1
Image: 602401143452.dkr.ecr.ap-northeast-1.amazonaws.com/amazon-k8s-cni:v1.4.1
Symptoms
All nodes are ready and healthy, and secondary IPs are being allocated. But the aws-node
pod periodically prints the following error message:
ERROR: logging before flag.Parse: E0525 01:44:40.871589 11 memcache.go:147] couldn't get resource list for external.metrics.k8s.io/v1beta1: <nil>
Why is it trying to access the external metrics anyway?
Extra information
$ kubectl version
Client Version: version.Info{Major:"1", Minor:"12", GitVersion:"v1.12.7", GitCommit:"6f482974b76db3f1e0f5d24605a9d1d38fad9a2b", GitTreeState:"clean", BuildDate:"2019-03-25T02:52:13Z", GoVersion:"go1.10.8", Compiler:"gc", Platform:"darwin/amd64"}
Server Version: version.Info{Major:"1", Minor:"12+", GitVersion:"v1.12.6-eks-d69f1b", GitCommit:"d69f1bf3669bf00b7f4a758e978e0e7a1e3a68f7", GitTreeState:"clean", BuildDate:"2019-02-28T20:26:10Z", GoVersion:"go1.10.8", Compiler:"gc", Platform:"linux/amd64"}
$ kubectl describe pod aws-node-4g4wp -n kube-system
Name:               aws-node-4g4wp
Namespace:          kube-system
Priority:           2000001000
PriorityClassName:  system-node-critical
Node:               <redacted>.compute.internal/x.x.x.x
Start Time:         Sat, 25 May 2019 10:27:21 +0900
Labels:             controller-revision-hash=7f8557c6b8
                    k8s-app=aws-node
                    pod-template-generation=1
Annotations:        scheduler.alpha.kubernetes.io/critical-pod:
Status:             Running
IP:                 10.100.3.5
Controlled By:      DaemonSet/aws-node
Containers:
  aws-node:
    Container ID:   docker://7139948319c5d13793a4349bd8a891f6a362546c839a8070a4638dc210b9473d
    Image:          602401143452.dkr.ecr.ap-northeast-1.amazonaws.com/amazon-k8s-cni:v1.4.1
    Image ID:       docker-pullable://602401143452.dkr.ecr.ap-northeast-1.amazonaws.com/amazon-k8s-cni@sha256:7652301be646d5796febc547a29781f4da6bdd61abc4940e22ec3098c884211e
    Port:           61678/TCP
    Host Port:      61678/TCP
    State:          Running
      Started:      Sat, 25 May 2019 10:27:30 +0900
    Ready:          True
    Restart Count:  0
    Requests:
      cpu:  10m
    Environment:
      AWS_VPC_K8S_CNI_LOGLEVEL:  DEBUG
      MY_NODE_NAME:              (v1:spec.nodeName)
      WATCH_NAMESPACE:           kube-system (v1:metadata.namespace)
    Mounts:
      /host/etc/cni/net.d from cni-net-dir (rw)
      /host/opt/cni/bin from cni-bin-dir (rw)
      /host/var/log from log-dir (rw)
      /var/run/docker.sock from dockersock (rw)
      /var/run/secrets/kubernetes.io/serviceaccount from aws-node-token-8cbrg (ro)
Conditions:
  Type              Status
  Initialized       True
  Ready             True
  ContainersReady   True
  PodScheduled      True
Volumes:
  cni-bin-dir:
    Type:          HostPath (bare host directory volume)
    Path:          /opt/cni/bin
    HostPathType:
  cni-net-dir:
    Type:          HostPath (bare host directory volume)
    Path:          /etc/cni/net.d
    HostPathType:
  log-dir:
    Type:          HostPath (bare host directory volume)
    Path:          /var/log
    HostPathType:
  dockersock:
    Type:          HostPath (bare host directory volume)
    Path:          /var/run/docker.sock
    HostPathType:
  aws-node-token-8cbrg:
    Type:        Secret (a volume populated by a Secret)
    SecretName:  aws-node-token-8cbrg
    Optional:    false
QoS Class:       Burstable
Node-Selectors:  <none>
Tolerations:     node.kubernetes.io/disk-pressure:NoSchedule
                 node.kubernetes.io/memory-pressure:NoSchedule
                 node.kubernetes.io/network-unavailable:NoSchedule
                 node.kubernetes.io/not-ready:NoExecute
                 node.kubernetes.io/unreachable:NoExecute
                 node.kubernetes.io/unschedulable:NoSchedule
Events:          <none>
About this issue
- Original URL
- State: closed
- Created 5 years ago
- Reactions: 39
- Comments: 36 (15 by maintainers)
Cause
The issue happens when the external metrics API is enabled (most likely because a prometheus adapter is deployed) but no external metric is declared. This is a totally valid scenario, but it is not handled correctly.
You can check whether you are in this situation by querying the external metrics API and confirming that the resources field of the response is empty.
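A minimal sketch of such a check, assuming the standard external.metrics.k8s.io/v1beta1 group and version registered by the adapter:

kubectl get --raw /apis/external.metrics.k8s.io/v1beta1
# Expect an APIResourceList answer; in this scenario its "resources" list is empty ([]).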
This behaviour comes from k8s/client-go and was fixed in November 2018 by this commit: https://github.com/kubernetes/client-go/commit/58652542545735c4b421dbf3631d81c7ec58e0c1#diff-8cc59cc0342ea488b3db0c5a07f4bffda3edb171ca58c6fc0265d10e515018dfL134
amazon-vpc-cni-k8s uses an old version of the client-go library, from before the fix: https://github.com/aws/amazon-vpc-cni-k8s/blob/0d5bc7ea411d0202d799a1ccc8c35b6bdba0b9ce/go.mod#L50 (2018-08)
How to fix
To fix this issue, one has to update k8s/client-go to a version that contains the commit mentioned earlier.
Workaround
If you don’t want the prometheus adapter
Properly remove the prometheus-adapter chart; this will unregister the external and custom metrics APIs from the apiserver.
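For example, if the adapter was installed with Helm (the release name and namespace below are assumptions, adjust to your setup):

helm uninstall prometheus-adapter -n monitoring   # Helm 3; on Helm 2 use "helm delete --purge prometheus-adapter"
kubectl get apiservices                           # verify external.metrics.k8s.io / custom.metrics.k8s.io are no longer registered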
If you do want the prometheus adapter
We don’t control the CNI version AWS deploys when we request an EKS cluster, so as long as the library is not updated we’ll have to live with it. The easiest way to make sure the error stops popping up is to declare a dummy external metric.
For example, with the prometheus-adapter chart one could do:
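This is a sketch rather than the exact snippet from the original comment; it assumes the chart’s rules.external values key, the adapter’s rule syntax, the chart location at the time (stable/prometheus-adapter), and a Helm release named prometheus-adapter:

cat > dummy-external-metric-values.yaml <<'EOF'
rules:
  external:
    - seriesQuery: '{__name__="up"}'
      resources:
        overrides:
          namespace: {resource: "namespace"}
      name:
        as: "dummy_external_metric"
      metricsQuery: 'count(<<.Series>>{<<.LabelMatchers>>})'
EOF
helm upgrade prometheus-adapter stable/prometheus-adapter --reuse-values -f dummy-external-metric-values.yaml
kubectl get --raw /apis/external.metrics.k8s.io/v1beta1   # should now list one external metric, and the aws-node noise stops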
This appears to be happening in 1.5 too.
I’ve deployed version 1.6 on our clusters and it seems to have been resolved! Maybe someone can verify this for their situation?
We’re still getting this on 1.6.3. Any ideas?
I noticed that after installing the CloudWatch adapter (kubectl apply -f https://raw.githubusercontent.com/awslabs/k8s-cloudwatch-adapter/master/deploy/adapter.yaml) these errors changed from the external.metrics endpoint to custom.metrics:
ERROR: logging before flag.Parse: E0716 12:54:37.701124 13 memcache.go:147] couldn't get resource list for custom.metrics.k8s.io/v1beta1: <nil>
It would be interesting to see what metric(s) aws-node is looking for and what impact the errors have on the cluster.
Since issue #950 has been closed, I will update here. On EKS version v1.17.9-eks-4c6976 with aws-node version v1.7.3, the custom metrics error still gets logged. Our deployment file is up to date with the example config for 1.7.
Same issue on EKS v1.17.9-eks-4c6976 with aws-node version v1.7.1; the container keeps restarting:
EDIT: For me the errors were caused by an outdated aws-cni deployment. You might have an outdated version (like me); check here and update your cluster accordingly. This solved the error and now everything works okay.
This error (memcache.go:147] couldn't get resource list for external.metrics.k8s.io/v1beta1) means an aggregated API server has been configured in k8s but is not functional for some reason. The vpc-cni code (somewhere deep in client-go) is trying to list all available cluster resources (just the discovery metadata, not the actual resources themselves) - this is fairly common in client-go, and the internet is full of bug reports like this one 😕 You can reproduce the above by doing something like kubectl api-versions.
I agree that vpc-cni doesn’t actually need this information, which is why the OP found that vpc-cni continued just fine despite this “error” message.
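A quick way to see the same discovery behaviour outside of aws-node (nothing here is specific to the CNI; just a sketch):

kubectl api-versions      # runs the same discovery; prints an error for any broken aggregated API group
kubectl get apiservices   # the AVAILABLE column shows which registration is failing and why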
Happens on v1.5.5 too, but does not seem to affect anything.
Client-go is updated and the operator-sdk dependency is removed: https://github.com/aws/amazon-vpc-cni-k8s/pull/1419. Release 1.8 has these fixes. Closing the issue for now; please try it and let us know if the issue still exists.
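To check which CNI version a cluster is actually running before and after an upgrade (the DaemonSet name and namespace match the describe output above):

kubectl get daemonset aws-node -n kube-system -o jsonpath='{.spec.template.spec.containers[0].image}'
# e.g. 602401143452.dkr.ecr.ap-northeast-1.amazonaws.com/amazon-k8s-cni:v1.4.1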
Just ignore the “error”? It’s purely noise. (I think the “failed to start” issue above was found to be unrelated?)
Confusion is bad and we should absolutely fix the code to not report this “error” - which will require a change to client-go. … but at the same time, there’s no actual functionality impact here, so it’s going to be a low priority for everyone.
If you have this error yourself and you want it to go away, your best bet is to fix the metrics service (or whatever aggregated API server is reported in the error) or remove the relevant apiservice registration (kubectl get apiservices). Test with kubectl api-versions.
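For example, to inspect the registrations and, only if nothing should be serving it, remove the broken one (the apiservice name below is the one reported later in this thread; substitute whatever shows up as unavailable on your cluster):

kubectl get apiservices | grep False                        # registrations the apiserver cannot reach
kubectl delete apiservice v1beta1.external.metrics.k8s.io   # only if no adapter is supposed to serve it
kubectl api-versions                                        # should now complete without discovery errors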
We have the same issue, and for us the v1beta1.external.metrics.k8s.io apiservice has been created for the Datadog cluster agent: https://docs.datadoghq.com/agent/cluster_agent/external_metrics/
I’m not sure where to go from here.
For 1, it would probably be one of 2 things:
@SaranBalaji90 It is starting. This doesn’t seem to cause any issues that we can tell.
I’m seeing this when going from 1.6.1 to 1.6.2, and rolling back to 1.6.1 seemed to have no effect.
I’ve tried extending the probes as described in https://github.com/aws/amazon-vpc-cni-k8s/issues/872, since I noticed the same timeout 1s issue, but with the same effect. There’s no eksctl here.
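For reference, “extending the probes” can be done along these lines (a sketch: the probe paths assume the v1.6+ manifests where aws-node has liveness/readiness probes, and 10 seconds is an arbitrary value):

kubectl patch daemonset aws-node -n kube-system --type='json' -p='[
  {"op": "replace", "path": "/spec/template/spec/containers/0/livenessProbe/timeoutSeconds", "value": 10},
  {"op": "replace", "path": "/spec/template/spec/containers/0/readinessProbe/timeoutSeconds", "value": 10}
]'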
Just updated the plugin in all 4 of my clusters and I can confirm 1.6 resolves this issue