kubernetes: Unable to update LoadBalancer's TargetGroup: DuplicateTargetGroupName

What happened:

kube-controller-manager was unable to update an AWS LoadBalancer’s TargetGroup.

What you expected to happen:

TargetGroup is updated without error when hosts change.

How to reproduce it (as minimally and precisely as possible):

This has only happened twice so far, and only with one of our many LoadBalancer services.

Anything else we need to know?:

We encountered this error while troubleshooting an issue with this service’s connectivity:

I1120 02:58:23.948688       1 event.go:281] Event(v1.ObjectReference{Kind:"Service", Namespace:"tram-ingress", Name:"traefik-public", UID:"a9044ddf-4ac6-4a7f-9292-5e78943ca390", APIVersion:"v1", ResourceVersion:"8005791693", FieldPath:""}): type: 'Warning' reason: 'SyncLoadBalancerFailed' Error syncing load balancer: failed to ensure load balancer: error creating load balancer target group: "DuplicateTargetGroupName: A target group with the same name 'k8s-tramingr-traefikp-ca55f6909f' exists, but with different settings\n\tstatus code: 400, request id: 1b252f46-8897-421a-907e-127756658c08"

The error has apparently been recurring several times a minute for over a week, and only for this one service.
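If it helps anyone hitting the same thing, a quick way to see which settings the existing target group has is to describe it by the name from the error and compare against what the controller is trying to create. A minimal sketch with aws-sdk-go (v1), assuming AWS credentials and region come from the environment; the target group name is the one from the error above:

package main

import (
	"fmt"
	"log"

	"github.com/aws/aws-sdk-go/aws"
	"github.com/aws/aws-sdk-go/aws/session"
	"github.com/aws/aws-sdk-go/service/elbv2"
)

func main() {
	sess := session.Must(session.NewSession())
	svc := elbv2.New(sess)

	// Look up the target group named in the error message.
	out, err := svc.DescribeTargetGroups(&elbv2.DescribeTargetGroupsInput{
		Names: []*string{aws.String("k8s-tramingr-traefikp-ca55f6909f")},
	})
	if err != nil {
		log.Fatal(err)
	}

	// Print the settings that typically trigger "exists, but with different
	// settings": protocol, port, target type, and the health check config.
	for _, tg := range out.TargetGroups {
		fmt.Printf("arn=%s protocol=%s port=%d targetType=%s hcProtocol=%s hcPort=%s hcPath=%s\n",
			aws.StringValue(tg.TargetGroupArn),
			aws.StringValue(tg.Protocol),
			aws.Int64Value(tg.Port),
			aws.StringValue(tg.TargetType),
			aws.StringValue(tg.HealthCheckProtocol),
			aws.StringValue(tg.HealthCheckPort),
			aws.StringValue(tg.HealthCheckPath))
	}
}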

I was able to resolve the error by manually removing the listener from the load balancer in the AWS console and then deleting the target group. The target group I removed had the name listed in the error above but contained no targets. The new target group, created and attached to a listener automatically, contained the full, correct list of instances.
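For reference, the same manual cleanup could be scripted. This is only a sketch of what I did by hand, not anything the controller does itself; the listener and target group ARNs below are hypothetical placeholders you would substitute after finding them in the console or via DescribeListeners / DescribeTargetGroups:

package main

import (
	"log"

	"github.com/aws/aws-sdk-go/aws"
	"github.com/aws/aws-sdk-go/aws/session"
	"github.com/aws/aws-sdk-go/service/elbv2"
)

func main() {
	sess := session.Must(session.NewSession())
	svc := elbv2.New(sess)

	// Hypothetical ARNs: the listener that forwards to the stale target group,
	// and the stale target group itself.
	listenerARN := "arn:aws:elasticloadbalancing:REGION:ACCOUNT:listener/..."
	targetGroupARN := "arn:aws:elasticloadbalancing:REGION:ACCOUNT:targetgroup/k8s-tramingr-traefikp-ca55f6909f/..."

	// Step 1: remove the listener so nothing references the target group.
	if _, err := svc.DeleteListener(&elbv2.DeleteListenerInput{
		ListenerArn: aws.String(listenerARN),
	}); err != nil {
		log.Fatal(err)
	}

	// Step 2: delete the empty target group. The controller then recreates it
	// with the correct settings and registers the current instances.
	if _, err := svc.DeleteTargetGroup(&elbv2.DeleteTargetGroupInput{
		TargetGroupArn: aws.String(targetGroupARN),
	}); err != nil {
		log.Fatal(err)
	}
}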

Later that evening, after resolving the error above, we made a NodePort change to the service. The same error message reappeared and we had to apply the manual workaround a second time.

As of yesterday we have been rolling all ~150 nodes in our cluster; we expect the node roll and host updates to continue for several more hours today. We currently have 1588 Services of type LoadBalancer in this cluster. No other service has had this problem.

Environment:

  • Kubernetes version (use kubectl version): v1.17.13
  • Cloud provider or hardware configuration: AWS
  • OS (e.g: cat /etc/os-release): FCOS 32.20201018.3.0
  • Kernel (e.g. uname -a): 5.8.15-201.fc32.x86_64

About this issue

  • Original URL
  • State: closed
  • Created 4 years ago
  • Reactions: 5
  • Comments: 17 (7 by maintainers)

Most upvoted comments

The Kubernetes project currently lacks enough contributors to adequately respond to all issues and PRs.

This bot triages issues and PRs according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Mark this issue or PR as fresh with /remove-lifecycle stale
  • Mark this issue or PR as rotten with /lifecycle rotten
  • Close this issue or PR with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale