consul-k8s: Endpoints Controller queuing up service registrations/deregistrations when request to agent on a terminated pod does not time out
Community Note
- Please vote on this issue by adding a 👍 reaction to the original issue to help the community and maintainers prioritize this request. Searching for pre-existing feature requests helps us consolidate datapoints for identical requirements into a single place, thank you!
- Please do not leave "+1" or other comments that do not add relevant new information or questions; they generate extra noise for issue followers and do not help prioritize the request.
- If you are interested in working on this issue or have submitted a pull request, please leave a comment.
Overview of the Issue
Since upgrading the consul-k8s project to 0.33, we have been seeing frequent failures in our primary cluster. I have not, at least so far, observed this behavior in the other clusters we run. The primary difference with this cluster is that it runs close to 100 nodes and has several connect-injected services that are frequently scaling up and down via the horizontal pod autoscaler.
Observed Symptoms
- new pod starts
- the endpoints controller does not register the service with Consul
- the consul-connect-inject-init container is stuck on "Unable to find registered services; retrying"
- the Consul agent on the service pod's node logs errors because the service is not registered:
[ERROR] agent.http: Request error: method=GET url=/v1/agent/services?filter=Meta%5B%22pod-name%22%5D+%3D%3D+%22example-service-6848b97cd7-fr27c%22+and+Meta%5B%22k8s-namespace%22%5D+%3D%3D+%22aggregates-service%22 from=10.0.36.110:49094 error="ACL not found"
At this point the pod is stuck in this state. The endpoints controller never actually registers the service. After a few minutes our on-call engineers are paged due to the stuck pods. Deleting the pods in this state usually gets things back on track.
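For anyone else debugging this, the lookup that the init container and the agent log refer to can be replayed by hand against the local agent to see whether the service ever got registered. This is only a sketch: the agent address is a placeholder, and the token path and pod/namespace values are taken from the log line above and our Helm values, so adjust them for your environment.

```sh
# Replay the init container's service lookup against the local Consul agent.
# <node-ip>, the token file path, and the pod/namespace names are placeholders.
POD_NAME="example-service-6848b97cd7-fr27c"
K8S_NAMESPACE="aggregates-service"

curl -sk \
  --header "X-Consul-Token: $(cat /consul/userconfig/consul-secrets/consul.token)" \
  --get "https://<node-ip>:8501/v1/agent/services" \
  --data-urlencode "filter=Meta[\"pod-name\"] == \"${POD_NAME}\" and Meta[\"k8s-namespace\"] == \"${K8S_NAMESPACE}\""
```

An empty `{}` response means the endpoints controller has not registered the service on that agent yet, which matches the stuck state described above.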
- AWS EKS 1.20 (Server Version: v1.20.7-eks-d88609)
- Servers on Consul 1.10.2
- Agents on Consul 1.10.2
I have confirmed that these pods are present in the service's endpoints.
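For reference, that check is just the standard endpoints lookup; the service and namespace names below are placeholders for illustration.

```sh
# Confirm the pod's IP appears in the Kubernetes Endpoints object for its service.
# Service and namespace names are placeholders.
kubectl get endpoints example-service -n aggregates-service \
  -o jsonpath='{.subsets[*].addresses[*].ip}'
```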
Helm Values
fullnameOverride: consul
# Available parameters and their default values for the Consul chart.
global:
  enabled: false
  domain: consul
  image: "xxxxx.dkr.ecr.us-west-2.amazonaws.com/consul:1.10.2"
  imageK8S: "xxxx.dkr.ecr.us-west-2.amazonaws.com/consul-k8s-control-plane:0.33.0"
  imageEnvoy: "xxxxx.dkr.ecr.us-west-2.amazonaws.com/envoy:v1.16.4"
  datacenter: xxxx
  enablePodSecurityPolicies: false
  gossipEncryption:
    secretName: consul-secrets
    secretKey: gossip-encryption-key
  tls:
    enabled: true
    enableAutoEncrypt: true
    serverAdditionalDNSSANs: []
    serverAdditionalIPSANs: []
    verify: true
    httpsOnly: false
    caCert:
      secretName: consul-secrets
      secretKey: ca.crt
    caKey:
      secretName: null
      secretKey: null
server:
  enabled: false
externalServers:
  enabled: true
  hosts: [xxxxxxxxxx]
  httpsPort: 443
  tlsServerName: null
  useSystemRoots: true
client:
  enabled: true
  image: null
  join:
    - "provider=aws tag_key=consul-datacenter tag_value=xxxxxx"
  grpc: true
  exposeGossipPorts: false
  resources:
    requests:
      memory: "400Mi"
      cpu: "200m"
    limits:
      cpu: "500m"
      memory: "400Mi"
  extraConfig: |
    {
      "telemetry": {
        "disable_hostname": true,
        "prometheus_retention_time": "6h"
      }
    }
  extraVolumes:
    - type: secret
      name: consul-secrets
      load: false
    - type: secret
      name: consul-acl-config
      load: true
  tolerations: ""
  nodeSelector: null
  annotations: null
  extraEnvironmentVars:
    CONSUL_HTTP_TOKEN_FILE: /consul/userconfig/consul-secrets/consul.token
dns:
  enabled: true
ui:
  enabled: false
syncCatalog:
  enabled: true
  image: null
  default: true # true will sync by default, otherwise requires annotation
  toConsul: true
  toK8S: false
  k8sPrefix: null
  consulPrefix: null
  k8sTag: k8s-cluster-name
  syncClusterIPServices: true
  nodePortSyncType: ExternalFirst
  aclSyncToken:
    secretName: consul-secrets
    secretKey: consul-k8s-sync.token
connectInject:
  enabled: true
  replicas: 2
  default: false
  resources:
    requests:
      memory: "500Mi"
      cpu: "100m"
    limits:
      cpu: null
      memory: "750Mi"
  overrideAuthMethodName: kubernetes
  aclInjectToken:
    secretName: consul-secrets
    secretKey: connect-inject.token
  centralConfig:
    enabled: true
  sidecarProxy:
    resources:
      requests:
        memory: 150Mi
        cpu: 100m
      limits:
        memory: 150Mi
I could use some guidance on what information would be most useful to help debug this. Thanks for your help!
About this issue
- Original URL
- State: closed
- Created 3 years ago
- Reactions: 4
- Comments: 58 (19 by maintainers)
Commits related to this issue
- Add support for ingress gateway CRD (#714) — committed to lawliet89/consul-k8s by lkysow 4 years ago
- Adding GH-714 as bug fix to change log. — committed to hashicorp/consul-k8s by jmurret 2 years ago
- Adding GH-714 as bug fix to change log. (#1219) * Adding GH-714 as bug fix to change log. * Update CHANGELOG.md Co-authored-by: Luke Kysow <1034429+lkysow@users.noreply.github.com> Co-author... — committed to hashicorp/consul-k8s by jmurret 2 years ago
I believe this issue may be caused by https://github.com/hashicorp/consul-k8s/issues/779. Our environment auto scales quite frequently and there are often scenarios where a node that has gone away will still show up in the Consul member list.
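If it helps anyone else confirm they are in the same situation, the member list can be checked for nodes that have departed but are still being remembered. A rough sketch only; it assumes you can exec into one of the Consul client pods (the pod name is a placeholder) and that the agent picks up its token from the environment:

```sh
# Look for nodes the cluster still lists even though they have left or failed.
# The client pod name is a placeholder; any agent works.
kubectl exec -n consul consul-client-xxxxx -- consul members | grep -E 'left|failed'
```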
I deployed 0.40 with https://github.com/hashicorp/consul-k8s/pull/991 into a staging cluster last week and things are looking better thus far. I have expanded the test by deploying 0.40 to a lower-traffic production cluster. I'll update the issue after observing behavior further.
Yeah for sure; this is a high priority for us to fix now.
@lkysow Thanks, looking forward to adopting the fix!
We have managed to resolve the root cause of the connect-injector pod crashing: it was due to insufficient memory being allocated to the pod. The default memory allocation for each pod is 50Mi, but the pod was actually using up to 70Mi, which led to frequent crashes.
After bumping the memory allocation to 100Mi, the crashes stopped and we no longer see liveness probe errors in the events.
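For anyone wanting to apply the same change through the chart rather than editing the deployment, it maps onto the connectInject.resources values shown earlier in this issue. This is a sketch; the release name, chart reference, and namespace are assumptions about your setup, and 100Mi is just the figure mentioned above.

```sh
# Raise the connect injector's memory request/limit to 100Mi.
# Release name, chart reference, and namespace are placeholders.
helm upgrade consul hashicorp/consul -n consul --reuse-values \
  --set connectInject.resources.requests.memory=100Mi \
  --set connectInject.resources.limits.memory=100Mi
```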
We have 53 connect services total on the mesh; 8 of those run directly on EC2 outside of Kubernetes.
We never experienced behavior like this with consul-k8s 0.25. I have tried every version of the consul-k8s project since 0.33 and have run into these problems in production every time. I have almost never seen this behavior in our pre-prod environments, which have been using the latest consul-k8s starting with 0.33 up to 0.42 now. The only difference between them is the scale: the production workloads get a lot more traffic and autoscale to handle spikes.
Our consul cluster is usually around 144 nodes. The raft commit time is between 10 and 25ms at all times. Consul usually has 10-40 catalog operations a second. We run 5 consul servers on c6g.xlarge instances. CPU utilization has peaked at 10% on the leader.
I'm happy to provide any other helpful info. I'd love to not rely on restarting the inject controller every 3 minutes to keep things working.
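For completeness, the restart workaround mentioned above is just a periodic rollout restart of the injector deployment. This is a sketch; the deployment and namespace names are assumptions, so check `kubectl get deploy -n consul` for the actual name in your install.

```sh
# Stopgap mentioned above: restart the connect injector deployment.
# Deployment/namespace names are assumptions; verify with `kubectl get deploy -n consul`.
kubectl rollout restart deployment/consul-connect-injector-webhook-deployment -n consul
```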
In production I run the endpoints controller with
I haven't had any issues with the controller pods crashing, and they have never been close to using the amount of memory I've allocated for them. That sounds like a different issue than what I've been seeing.
I may have to downgrade the production cluster I was testing this upgrade in. The on-call engineer was paged again due to pods stuck in an init state.
The pod ids service-86dd585d4-q4785 and service-86dd585d4-xb5ng do not appear anywhere in the inject controller logs. The consul agent is healthy on both of those nodes. I do see the following in the consul agent logs on those nodes. Eventually pod service-86dd585d4-xb5ng started, but the delay between the pod starting and it becoming healthy is way too long.
The only thing jumping out at me in the injector logs is this:

2022-02-01T17:38:15.459Z ERROR controller.endpoints Reconciler error {"reconciler group": "", "reconciler kind": "Endpoints", "name": "service", "namespace": "service", "error": "1 error occurred:\n\t* Put \"https://10.0.66.169:8501/v1/agent/service/register\": dial tcp 10.0.66.169:8501: connect: connection refused\n\n"}
sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).processNextWorkItem
    /home/circleci/go/pkg/mod/sigs.k8s.io/controller-runtime@v0.10.2/pkg/internal/controller/controller.go:266
sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Start.func2.2
    /home/circleci/go/pkg/mod/sigs.k8s.io/controller-runtime@v0.10.2/pkg/internal/controller/controller.go:227

This error is for a node that no longer exists:

ip-10-0-66-169.us-west-1.compute.internal  10.0.86.68:8301  left  client  1.10.7  2  datacenter  default  <default>

I had to go ahead and downgrade our prod clusters. Next week I'll try and reproduce in a test cluster.
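For anyone seeing the same Reconciler error, a quick way to confirm the address it is dialing belongs to a node that no longer exists. The IP and node name below are the ones from the logs above; the commands are only a sketch.

```sh
# Does Kubernetes still know about a node with that internal IP?
kubectl get nodes -o wide | grep 10.0.66.169

# What does the Consul member list still say about it?
# Run from any Consul agent, e.g. by exec'ing into a client pod first.
consul members | grep ip-10-0-66-169
```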