consul-k8s: Endpoints Controller queuing up service registrations/deregistrations when request to agent on a terminated pod does not time out

Community Note

  • Please vote on this issue by adding a 👍 reaction to the original issue to help the community and maintainers prioritize this request. Searching for pre-existing feature requests helps us consolidate datapoints for identical requirements into a single place, thank you!
  • Please do not leave "+1" or other comments that do not add relevant new information or questions, they generate extra noise for issue followers and do not help prioritize the request.
  • If you are interested in working on this issue or have submitted a pull request, please leave a comment.

Overview of the Issue

Since upgrading consul-k8s to 0.33, we have been seeing frequent failures in our primary cluster. I have not yet observed this behavior in the other clusters we run. The main difference with this cluster is that it runs close to 100 nodes and has several connect-injected services that frequently scale up and down via the horizontal pod autoscaler.

Observed Symptoms

  • A new pod starts
  • The endpoints controller does not register the service with Consul
  • The consul-connect-inject-init container is stuck on Unable to find registered services; retrying
  • The Consul agent on the service pod's node logs errors because the service is not registered:
[ERROR] agent.http: Request error: method=GET url=/v1/agent/services?filter=Meta%5B%22pod-name%22%5D+%3D%3D+%22example-service-6848b97cd7-fr27c%22+and+Meta%5B%22k8s-namespace%22%5D+%3D%3D+%22aggregates-service%22 from=10.0.36.110:49094 error="ACL not found"

At this point the pod is stuck in this state; the endpoints controller never actually registers the service. After a few minutes our on-call engineers are paged because of the stuck pods. Deleting the pods in this state usually gets things back on track.

  • AWS EKS 1.20 Server Version: v1.20.7-eks-d88609
  • Servers on Consul 1.10.2
  • Agents on 1.10.2

I have confirmed that these pods are present in the service’s endpoints.
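
For reference, this is roughly how I check by hand where the registration is getting lost. This is only a sketch: the pod name, namespace, and HTTPS agent port 8501 are taken from the logs above, while the service name, node agent address, and token/TLS handling are assumptions based on our setup.

# Confirm the pod IP shows up in the Kubernetes Endpoints object
kubectl -n aggregates-service get endpoints example-service -o wide

# Ask the Consul agent on the pod's node the same question the init container asks
curl -sk --get \
  --header "X-Consul-Token: $CONSUL_HTTP_TOKEN" \
  --data-urlencode 'filter=Meta["pod-name"] == "example-service-6848b97cd7-fr27c" and Meta["k8s-namespace"] == "aggregates-service"' \
  https://<node-agent-ip>:8501/v1/agent/services

When the controller has not registered the pod, the second call returns an empty result even though the first shows the pod IP in the Endpoints object.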

Helm Values

fullnameOverride: consul
# Available parameters and their default values for the Consul chart.

global:
  enabled: false
  domain: consul
  image: "xxxxx.dkr.ecr.us-west-2.amazonaws.com/consul:1.10.2"
  imageK8S: "xxxx.dkr.ecr.us-west-2.amazonaws.com/consul-k8s-control-plane:0.33.0"
  imageEnvoy: "xxxxx.dkr.ecr.us-west-2.amazonaws.com/envoy:v1.16.4"
  datacenter: xxxx
  enablePodSecurityPolicies: false
  gossipEncryption:
    secretName: consul-secrets
    secretKey: gossip-encryption-key
  tls:
    enabled: true
    enableAutoEncrypt: true
    serverAdditionalDNSSANs: []
    serverAdditionalIPSANs: []
    verify: true
    httpsOnly: false
    caCert:
      secretName: consul-secrets
      secretKey: ca.crt
    caKey:
      secretName: null
      secretKey: null

server:
  enabled: false
externalServers:
  enabled: true
  hosts: [xxxxxxxxxx]
  httpsPort: 443
  tlsServerName: null
  useSystemRoots: true
client:
  enabled: true
  image: null
  join:
    - "provider=aws tag_key=consul-datacenter tag_value=xxxxxx"
  grpc: true
  exposeGossipPorts: false
  resources:
    requests:
      memory: "400Mi"
      cpu: "200m"
    limits:
      cpu: "500m"
      memory: "400Mi"
  extraConfig: |
    {
      "telemetry": {
        "disable_hostname": true,
        "prometheus_retention_time": "6h"
      }
    }
  extraVolumes:
    - type: secret
      name: consul-secrets
      load: false
    - type: secret
      name: consul-acl-config
      load: true 
  tolerations: ""
  nodeSelector: null
  annotations: null
  extraEnvironmentVars:
    CONSUL_HTTP_TOKEN_FILE: /consul/userconfig/consul-secrets/consul.token

dns:
  enabled: true

ui:
  enabled: false

syncCatalog:
  enabled: true
  image: null
  default: true # true will sync by default, otherwise requires annotation
  toConsul: true
  toK8S: false
  k8sPrefix: null
  consulPrefix: null
  k8sTag: k8s-cluster-name
  syncClusterIPServices: true
  nodePortSyncType: ExternalFirst
  aclSyncToken:
    secretName: consul-secrets
    secretKey: consul-k8s-sync.token

connectInject:
  enabled: true
  replicas: 2
  default: false
  resources:
    requests:
      memory: "500Mi"
      cpu: "100m"
    limits:
      cpu: null
      memory: "750Mi"
  overrideAuthMethodName: kubernetes
  aclInjectToken:
    secretName: consul-secrets
    secretKey: connect-inject.token
  centralConfig:
    enabled: true
  sidecarProxy:
    resources:
      requests:
        memory: 150Mi
        cpu: 100m
      limits:
        memory: 150Mi

I could use some guidance on what information would be most useful to help debug this. Thanks for your help!

About this issue

  • Original URL
  • State: closed
  • Created 3 years ago
  • Reactions: 4
  • Comments: 58 (19 by maintainers)

Most upvoted comments

I believe this issue may be caused by https://github.com/hashicorp/consul-k8s/issues/779. Our environment auto scales quite frequently and there are often scenarios where a node that has gone away will still show up in the Consul member list.
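
A quick way to spot that state is to filter the member list by status. This is a sketch; the node name is a placeholder, and the consul CLI is assumed to have the usual address/token/TLS environment variables set.

# Show members that have left or failed but are still in the member list
consul members -status='left|failed'

# Remove a departed node from the member list entirely
consul force-leave -prune <departed-node-name>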

I deployed 0.40 with https://github.com/hashicorp/consul-k8s/pull/991 into a staging cluster last week and things are looking better thus far. I have expanded the test by deploying 0.40 to a lower traffic production cluster. I’ll update the issue after observing behavior further.

Yeah for sure; this is a high priority for us to fix now.

@raphlcx the one issue with "ignore pods because…" is here: https://github.com/hashicorp/consul-k8s/issues/1143. I have a fix for that.

@lkysow Thanks, looking forward to adopting the fix!

Regarding the container failing its liveness probe: that's unexpected. Is there anything in the events?

We have managed to resolve the root cause of the connect-injector pod crashing: insufficient memory allocated to the pod. The default memory allocation for the pod is 50Mi, but it was actually using up to 70Mi, which led to frequent crashes.

After bumping the memory allocation to 100Mi, the crashes stopped and we no longer see liveness probe errors in the events.
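
For reference, the change is just the chart's connectInject resources. A sketch of the equivalent helm command, assuming a release installed from hashicorp/consul (the release name here is a placeholder):

# Raise the injector's memory from the 50Mi chart default to 100Mi
helm upgrade consul hashicorp/consul --reuse-values \
  --set connectInject.resources.requests.memory=100Mi \
  --set connectInject.resources.limits.memory=100Mi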

  1. Roughly 278. This is an estimate, since it is the pod count of connect services at this specific point in time.

We have 53 connect services total on the mesh. 8 of those run directly on ec2 outside of Kubernetes.

  1. We autoscale both the underlying nodes and the pods on top of them quite frequently.

We never experienced behavior like this with consul-k8s 0.25. I have tried every version of consul-k8s since 0.33 and have run into these problems in production every time. I have almost never seen this behavior in our pre-prod environments, which have been running the latest consul-k8s from 0.33 up to 0.42 now. The only difference between them is scale: the production workloads get a lot more traffic and autoscale to handle spikes.

Our Consul cluster usually has around 144 nodes. Raft commit time stays between 10 and 25ms, and Consul usually handles 10-40 catalog operations per second. We run 5 Consul servers on c6g.xlarge instances; CPU utilization has peaked at 10% on the leader.

I’m happy to provide any other helpful info. I’d love to not rely on restarting the inject controller every 3 minutes to keep things working 😃
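
The restart workaround itself is nothing fancy; it is roughly equivalent to running the following on a schedule. The deployment name and namespace here are assumptions and depend on the chart release.

kubectl -n consul rollout restart deployment/consul-connect-injector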

In production I run the endpoints controller with the following resources:

resources:
    requests:
      memory: "1000Mi"
      cpu: "200m"
    limits:
      cpu: null # the null removes the default cpu limit present in the helm chart.
      memory: "1500Mi"

I haven't had any issues with the controller pods crashing, and they have never come close to using the amount of memory I've allocated for them. That sounds like a different issue from what I've been seeing.

I may have to downgrade the production cluster I was testing this upgrade in. The on-call engineer was paged again due to pods stuck in an init state.

NAME                                          READY   STATUS     RESTARTS   AGE   IP            NODE                                        NOMINATED NODE   READINESS GATES
service-86dd585d4-q4785           0/3     Init:1/2   0          20s   10.0.77.65    ip-10-0-94-66.us-west-1.compute.internal    <none>           0/1
service-86dd585d4-xb5ng           0/3     Init:1/2   4          11m   10.0.86.135   ip-10-0-64-120.us-west-1.compute.internal   <none>           0/1
kubectl logs service-86dd585d4-9x25x -c consul-connect-inject-init    
2022-02-01T17:48:10.339Z [INFO]  Consul login complete
2022-02-01T17:48:10.339Z [INFO]  Checking that the ACL token exists when reading it in the stale consistency mode
2022-02-01T17:48:10.340Z [ERROR] Unable to read ACL token; retrying: err="Unexpected response code: 403 (ACL not found)"
2022-02-01T17:48:10.441Z [INFO]  Successfully read ACL token from the server
2022-02-01T17:48:10.443Z [INFO]  Unable to find registered services; retrying
2022-02-01T17:48:11.444Z [INFO]  Unable to find registered services; retrying
2022-02-01T17:48:12.445Z [INFO]  Unable to find registered services; retrying
2022-02-01T17:48:13.447Z [INFO]  Unable to find registered services; retrying
2022-02-01T17:48:14.448Z [INFO]  Unable to find registered services; retrying
2022-02-01T17:48:15.450Z [INFO]  Unable to find registered services; retrying
2022-02-01T17:48:16.451Z [INFO]  Unable to find registered services; retrying
2022-02-01T17:48:17.452Z [INFO]  Unable to find registered services; retrying
2022-02-01T17:48:18.453Z [INFO]  Unable to find registered services; retrying
2022-02-01T17:48:19.455Z [INFO]  Unable to find registered services; retrying

The pod IDs service-86dd585d4-q4785 and service-86dd585d4-xb5ng do not appear anywhere in the inject controller logs. The Consul agent is healthy on both of those nodes. I do see the following in the Consul agent logs on those nodes:

2022-02-01T17:50:28.013Z [ERROR] agent.http: Request error: method=GET url=/v1/acl/token/self?stale= from=10.0.77.65:33428 error="ACL not found"
2022-02-01T17:49:16.015Z [ERROR] agent.http: Request error: method=GET url=/v1/acl/token/self?stale= from=10.0.86.135:52496 error="ACL not found"
2022-02-01T17:52:38.993Z [ERROR] agent.http: Request error: method=GET url=/v1/acl/token/self?stale= from=10.0.86.135:45138 error="ACL not found"
$ kubectl get events | grep service-86dd585d4-xb5ng                     
18m         Normal    Scheduled                pod/service-86dd585d4-xb5ng               Successfully assigned service/service-86dd585d4-xb5ng to ip-10-0-64-120.us-west-1.compute.internal
18m         Normal    Pulled                   pod/service-86dd585d4-xb5ng               Container image "063772531253.dkr.ecr.us-west-1.amazonaws.com/consul:1.10.7" already present on machine
18m         Normal    Created                  pod/service-86dd585d4-xb5ng               Created container copy-consul-bin
18m         Normal    Started                  pod/service-86dd585d4-xb5ng               Started container copy-consul-bin
9m12s       Normal    Pulled                   pod/service-86dd585d4-xb5ng               Container image "063772531253.dkr.ecr.us-west-1.amazonaws.com/consul-k8s-control-plane:0.40.0" already present on machine
9m12s       Normal    Created                  pod/service-86dd585d4-xb5ng               Created container consul-connect-inject-init
9m12s       Normal    Started                  pod/service-86dd585d4-xb5ng               Started container consul-connect-inject-init
6m42s       Warning   BackOff                  pod/service-86dd585d4-xb5ng               Back-off restarting failed container
18m         Normal    SuccessfulCreate         replicaset/service-86dd585d4              Created pod: service-86dd585d4-xb5ng

Eventually pod service-86dd585d4-xb5ng started, but the delay between the pod starting and it becoming healthy is far too long.

The only thing jumping out at me in the injector logs is this:

2022-02-01T17:38:15.443Z	INFO	controller.endpoints	Consul client agent is not ready, skipping deregistration	{"consul-agent": "consul-client-rrfww", "svc": "service"}
2022-02-01T17:38:15.459Z	ERROR	controller.endpoints	Reconciler error	{"reconciler group": "", "reconciler kind": "Endpoints", "name": "service", "namespace": "service", "error": "1 error occurred:\n\t* Put \"https://10.0.66.169:8501/v1/agent/service/register\": dial tcp 10.0.66.169:8501: connect: connection refused\n\n"}
sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).processNextWorkItem
	/home/circleci/go/pkg/mod/sigs.k8s.io/controller-runtime@v0.10.2/pkg/internal/controller/controller.go:266
sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Start.func2.2
	/home/circleci/go/pkg/mod/sigs.k8s.io/controller-runtime@v0.10.2/pkg/internal/controller/controller.go:227
2022-02-01T17:38:19.157Z	INFO	controller.endpoints	retrieved	{"name": "fluent-bit", "ns": "fluentd"}
2022-02-01T17:38:19.157Z	INFO	controller.endpoints	ignoring because endpoints pods have not been injected	{"name": "fluent-bit", "ns": "fluentd"}
2022-02-01T17:38:35.940Z	INFO	controller.endpoints	retrieved	{"name": "service", "ns": "service"}
2022-02-01T17:38:35.940Z	INFO	controller.endpoints	registering service with Consul	{"name": "service", "id": "service-86dd585d4-dxwnr-service", "agentIP": "10.0.94.60"}
2022-02-01T17:38:35.942Z	INFO	controller.endpoints	registering proxy service with Consul	{"name": "service-sidecar-proxy"}
2022-02-01T17:38:35.945Z	INFO	controller.endpoints	updating health check status for service	{"name": "service", "reason": "Kubernetes health checks passing", "status": "passing"}
2022-02-01T17:38:35.946Z	INFO	controller.endpoints	registering service with Consul	{"name": "service", "id": "service-86dd585d4-r6vdl-service", "agentIP": "10.0.72.103"}
2022-02-01T17:38:35.949Z	INFO	controller.endpoints	registering proxy service with Consul	{"name": "service-sidecar-proxy"}
2022-02-01T17:38:35.951Z	INFO	controller.endpoints	updating health check status for service	{"name": "service", "reason": "Kubernetes health checks passing", "status": "passing"}
2022-02-01T17:38:35.952Z	INFO	controller.endpoints	registering service with Consul	{"name": "service", "id": "service-86dd585d4-9x25x-service", "agentIP": "10.0.66.169"}
2022-02-01T17:38:35.955Z	ERROR	controller.endpoints	failed to register service	{"name": "service", "error": "Unexpected response code: 403 (Permission denied: Missing service:write on service)"}
sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Reconcile
	/home/circleci/go/pkg/mod/sigs.k8s.io/controller-runtime@v0.10.2/pkg/internal/controller/controller.go:114
sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).reconcileHandler
	/home/circleci/go/pkg/mod/sigs.k8s.io/controller-runtime@v0.10.2/pkg/internal/controller/controller.go:311
sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).processNextWorkItem
	/home/circleci/go/pkg/mod/sigs.k8s.io/controller-runtime@v0.10.2/pkg/internal/controller/controller.go:266
sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Start.func2.2
	/home/circleci/go/pkg/mod/sigs.k8s.io/controller-runtime@v0.10.2/pkg/internal/controller/controller.go:227
2022-02-01T17:38:35.955Z	ERROR	controller.endpoints	failed to register services or health check	{"name": "service", "ns": "service", "error": "Unexpected response code: 403 (Permission denied: Missing service:write on service)"}
sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).reconcileHandler
	/home/circleci/go/pkg/mod/sigs.k8s.io/controller-runtime@v0.10.2/pkg/internal/controller/controller.go:311
sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).processNextWorkItem
	/home/circleci/go/pkg/mod/sigs.k8s.io/controller-runtime@v0.10.2/pkg/internal/controller/controller.go:266
sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Start.func2.2
	/home/circleci/go/pkg/mod/sigs.k8s.io/controller-runtime@v0.10.2/pkg/internal/controller/controller.go:227

This Reconciler error:

2022-02-01T17:38:15.459Z	ERROR	controller.endpoints	Reconciler error	{"reconciler group": "", "reconciler kind": "Endpoints", "name": "service", "namespace": "service", "error": "1 error occurred:\n\t* Put \"https://10.0.66.169:8501/v1/agent/service/register\": dial tcp 10.0.66.169:8501: connect: connection refused\n\n"}

is for a node that no longer exists:

ip-10-0-66-169.us-west-1.compute.internal  10.0.86.68:8301  left  client  1.10.7  2  datacenter  default  <default>
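
To confirm that, I check the member list for departed nodes and probe the dead agent address directly. A rough sketch; the agent IP and HTTPS port come from the error above, the 5-second timeout is arbitrary, and TLS verification is skipped here.

# List members that are in the left state
consul members -status=left

# Probe the agent HTTPS port with a short connect timeout
curl -sk --connect-timeout 5 -o /dev/null -w '%{http_code}\n' https://10.0.66.169:8501/v1/agent/self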

I had to go ahead and downgrade our prod clusters. Next week I'll try to reproduce this in a test cluster.