kops: Too many endpoints in the kubernetes service
/kind bug
1. What kops version are you running?
Version 1.21.2 (git-f86388fb1ec8872b0ca1819cf98f84d18f7263a4)
2. What Kubernetes version are you running?
Client Version: version.Info{Major:"1", Minor:"22", GitVersion:"v1.22.2", GitCommit:"8b5a19147530eaac9476b0ab82980b4088bbc1b2", GitTreeState:"clean", BuildDate:"2021-09-15T21:38:50Z", GoVersion:"go1.16.8", Compiler:"gc", Platform:"linux/amd64"}
Server Version: version.Info{Major:"1", Minor:"21", GitVersion:"v1.21.5", GitCommit:"aea7bbadd2fc0cd689de94a54e5b7b758869d691", GitTreeState:"clean", BuildDate:"2021-09-15T21:04:16Z", GoVersion:"go1.16.8", Compiler:"gc", Platform:"linux/amd64"}
3. What cloud provider are you using? AWS
Hello, we have a connectivity issue with our pods.
We currently see too many IPs in the kubernetes endpoints; many of them reference terminated masters.
$ kubectl describe service kubernetes -n default
Name: kubernetes
Namespace: default
Labels: component=apiserver
provider=kubernetes
Annotations: <none>
Selector: <none>
Type: ClusterIP
IP Family Policy: SingleStack
IP Families: IPv4
IP: 100.64.0.1
IPs: 100.64.0.1
Port: https 443/TCP
TargetPort: 443/TCP
Endpoints: 172.31.20.192:443,172.31.23.72:443,172.31.31.210:443 + 6 more...
Session Affinity: ClientIP
Events: <none>
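The describe output truncates the address list; for reference, the full set of endpoint IPs can be dumped with a plain get (command only shown here):
# Dump every address behind the kubernetes service, not just the first three
$ kubectl get endpoints kubernetes -n default -o jsonpath='{.subsets[*].addresses[*].ip}'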
The iptables NAT table lists all of them:
# iptables -L -v -n -t nat
Chain KUBE-SVC-NPX46M4PTMTKRN6Y (1 references)
pkts bytes target prot opt in out source destination
2 120 KUBE-SEP-XXDHYK5XOL7C2QMK all -- * * 0.0.0.0/0 0.0.0.0/0 /* default/kubernetes:https */ recent: CHECK seconds: 10800 reap name: KUBE-SEP-XXDHYK5XOL7C2QMK side: source mask: 255.255.255.255
0 0 KUBE-SEP-T2U4L34UORPF3KEV all -- * * 0.0.0.0/0 0.0.0.0/0 /* default/kubernetes:https */ recent: CHECK seconds: 10800 reap name: KUBE-SEP-T2U4L34UORPF3KEV side: source mask: 255.255.255.255
0 0 KUBE-SEP-H7IE5EIZNU7MRD2J all -- * * 0.0.0.0/0 0.0.0.0/0 /* default/kubernetes:https */ recent: CHECK seconds: 10800 reap name: KUBE-SEP-H7IE5EIZNU7MRD2J side: source mask: 255.255.255.255
0 0 KUBE-SEP-IKGK4FJJWDT2DVXT all -- * * 0.0.0.0/0 0.0.0.0/0 /* default/kubernetes:https */ recent: CHECK seconds: 10800 reap name: KUBE-SEP-IKGK4FJJWDT2DVXT side: source mask: 255.255.255.255
0 0 KUBE-SEP-EQDDZXJQDFIUJYBY all -- * * 0.0.0.0/0 0.0.0.0/0 /* default/kubernetes:https */ recent: CHECK seconds: 10800 reap name: KUBE-SEP-EQDDZXJQDFIUJYBY side: source mask: 255.255.255.255
0 0 KUBE-SEP-ROWRN2NBYRGLX5XA all -- * * 0.0.0.0/0 0.0.0.0/0 /* default/kubernetes:https */ recent: CHECK seconds: 10800 reap name: KUBE-SEP-ROWRN2NBYRGLX5XA side: source mask: 255.255.255.255
0 0 KUBE-SEP-KWMDMS2VX7LWBYQL all -- * * 0.0.0.0/0 0.0.0.0/0 /* default/kubernetes:https */ recent: CHECK seconds: 10800 reap name: KUBE-SEP-KWMDMS2VX7LWBYQL side: source mask: 255.255.255.255
0 0 KUBE-SEP-T77HFFV4XKKCYE7Z all -- * * 0.0.0.0/0 0.0.0.0/0 /* default/kubernetes:https */ recent: CHECK seconds: 10800 reap name: KUBE-SEP-T77HFFV4XKKCYE7Z side: source mask: 255.255.255.255
0 0 KUBE-SEP-ZWM4VW5XILR6J3V3 all -- * * 0.0.0.0/0 0.0.0.0/0 /* default/kubernetes:https */ recent: CHECK seconds: 10800 reap name: KUBE-SEP-ZWM4VW5XILR6J3V3 side: source mask: 255.255.255.255
The first IP is currently unavailable, and we see errors when a pod starts: Get https://100.64.0.1:443/api?timeout=32s: dial tcp 100.64.0.1:443: i/o timeout
Is it possible to clean up the service and shrink the endpoint list to only the available masters?
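For what it's worth, each KUBE-SEP chain above DNATs to exactly one endpoint address, and because the service uses ClientIP session affinity (the recent match with 10800s above), a client that happened to be pinned to a terminated master keeps being sent there until the entry ages out. The endpoint behind a given chain can be checked with the chain name taken from the listing above:
# iptables -t nat -L KUBE-SEP-XXDHYK5XOL7C2QMK -n -v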
7. Please provide your cluster manifest.
apiVersion: kops.k8s.io/v1alpha2
kind: Cluster
spec:
  additionalPolicies:
    node: |
      [
        { "Action": "sts:AssumeRole", "Effect": "Allow", "Resource": "*" },
        { "Action": "ec2:AssociateAddress", "Effect": "Allow", "Resource": "*" },
        { "Action": "ec2:AttachVolume", "Effect": "Allow", "Resource": "*" },
        { "Action": "ec2:DetachVolume", "Effect": "Allow", "Resource": "*" },
        { "Action": "ec2:ModifyInstanceAttribute", "Effect": "Allow", "Resource": "*" }
      ]
  api:
    dns: {}
  authorization:
    alwaysAllow: {}
  certManager:
    enabled: true
  channel: stable
  cloudProvider: aws
  configBase: s3://kops-xxx/xxx
  containerRuntime: containerd
  dnsZone: xxx
  etcdClusters:
  - etcdMembers:
    - instanceGroup: master-us-east-1a
      name: a
    - instanceGroup: master-us-east-1b
      name: b
    - instanceGroup: master-us-east-1c
      name: c
    name: main
  - etcdMembers:
    - instanceGroup: master-us-east-1a
      name: a
    - instanceGroup: master-us-east-1b
      name: b
    - instanceGroup: master-us-east-1c
      name: c
    name: events
  iam:
    allowContainerRegistry: true
    legacy: false
  kubeDNS:
    provider: CoreDNS
  kubelet:
    anonymousAuth: false
    authenticationTokenWebhook: true
    authorizationMode: Webhook
    volumeStatsAggPeriod: 0s
  kubernetesApiAccess:
  - 0.0.0.0/0
  kubernetesVersion: 1.21.5
  masterInternalName: k8s.internal.xxx
  masterPublicName: k8s.xxx
  metricsServer:
    enabled: true
  networkCIDR: 172.31.0.0/16
  networkID: xxx
  networking:
    amazonvpc: {}
  nonMasqueradeCIDR: 100.64.0.0/10
  subnets:
    ...
  topology:
    dns:
      type: Public
    masters: public
    nodes: public
Thank you in advance for any help, Francesco
About this issue
- State: closed
- Created 3 years ago
- Reactions: 1
- Comments: 22 (19 by maintainers)
I have the same issue: when I upgraded my cluster from 1.21 to 1.22, the kubernetes endpoint has more servers than configured masters.
I wanted to apply the troubleshooting steps to delete the master leases, but my etcd-manager-main pod has different certificates configured,
and the etcdctl connection fails with those certificates.
Finally, I found the kube-apiserver certificates on the control-plane machine, and the troubleshooting steps worked perfectly.
Yes, this should have been fixed in 1.22.2, at least for the known causes of this behavior.
The kube-apiserver certificates are not present in the etcd-manager pods, but you can connect using the certificates that should be in /etc/kubernetes/pki.
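Roughly, listing the leases with those certificates looks like the sketch below. The 4001 client port is the kops default for etcd-main, but the exact certificate file names vary between kops versions, so treat the paths as placeholders:
# On a control-plane node; certificate file names are placeholders and may differ per kops version.
ETCDCTL_API=3 etcdctl \
  --endpoints=https://127.0.0.1:4001 \
  --cacert=/etc/kubernetes/pki/kube-apiserver/etcd-ca.crt \
  --cert=/etc/kubernetes/pki/kube-apiserver/etcd-client.crt \
  --key=/etc/kubernetes/pki/kube-apiserver/etcd-client.key \
  get --prefix /registry/masterleases/ --keys-only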
@fvasco your issue above got me on the right track to fix my problem. I had been able to reproduce the timeouts but hadn’t quite figured out why yet, as everything looked okay on the cluster and infra.
I had 5 control-plane nodes listed in etcd, which is what the default kubernetes service's endpoints are populated from.
During my upgrade from 1.21 to 1.22, the etcd upgrade was stuck going to 3.5 and I terminated all 3 masters and let them come back (effectively performing the etcd restore process). This allowed etcd to proceed but then I was hit with the random network failures.
Following the docs to find and delete the additional master leases in etcd resolved the issue for me.
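In case it helps anyone else, the cleanup step was essentially the following, run against etcd-main from a control-plane node. The IP is a placeholder for a terminated master, and the certificate paths are the same placeholders as in the sketch above; the remaining apiservers re-create their own leases shortly afterwards, so only the stale entries stay gone.
# Delete the lease of a terminated control-plane node (placeholder IP and cert paths).
ETCDCTL_API=3 etcdctl \
  --endpoints=https://127.0.0.1:4001 \
  --cacert=/etc/kubernetes/pki/kube-apiserver/etcd-ca.crt \
  --cert=/etc/kubernetes/pki/kube-apiserver/etcd-client.crt \
  --key=/etc/kubernetes/pki/kube-apiserver/etcd-client.key \
  del /registry/masterleases/<terminated-master-ip>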