kubernetes: AWS PV Storage Causing API Timeouts.
Kubernetes version: > 1.3.8
It has been reported that this bug is in the 1.4.0 branch and above. I have validated it on 1.4.7 and 1.4.6. We have not validated the 1.5.x branch, and an exponential back-off patch was just merged (https://github.com/kubernetes/kubernetes/pull/38766), but we do not believe it addresses the root problem.
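For context, the merged patch adds back-off between retries. A minimal sketch of that pattern in plain Go (illustrative only, not the code from PR #38766; the failing call is a stand-in):

```go
package main

import (
	"errors"
	"fmt"
	"time"
)

// callWithBackoff retries op with exponentially increasing delays between
// attempts. Illustrates the back-off pattern referenced above; it is not the
// controller's actual retry code.
func callWithBackoff(op func() error, maxRetries int, base time.Duration) error {
	delay := base
	for attempt := 0; attempt < maxRetries; attempt++ {
		if err := op(); err == nil {
			return nil
		}
		time.Sleep(delay)
		delay *= 2 // double the wait after every failure
	}
	return errors.New("operation failed after retries")
}

func main() {
	// Hypothetical AWS call stand-in that always fails with a throttling error.
	err := callWithBackoff(func() error { return errors.New("RequestLimitExceeded") }, 5, 200*time.Millisecond)
	fmt.Println(err)
}
```

Back-off only spaces out the retries; it does not reduce the number of baseline calls the controller makes on every reconcile pass, which is why we do not think it fixes the root problem.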
Environment
- Cloud provider or hardware configuration: AWS
- OS (e.g. from /etc/os-release): Debian
- Kernel (e.g. uname -a): custom kernel maintained by the AWS team; Linux 4.4.26-k8s #1 SMP Fri Oct 21 05:21:13 UTC 2016 x86_64 GNU/Linux
- Install tools: kops
- Others:
What happened:
As you add more PV storage to a cluster, the controller starts to spam the EC2 API heavily. It has been reported that as few as 20 attached PVs will start causing EC2 API timeouts. This is a showstopper that makes PV-attached storage unusable on AWS. As the AWS account nears its API call limit the problem cascades, to the point that retries flood the API. One of our accounts is at around 24k calls per hour.
The controller makes API calls at such a high rate that they time out and are retried, which only adds to the flood. It is making far too many calls just to validate that a node exists and that a volume is attached to it. The call that times out most often is DescribeInstances.
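To give a feel for where the volume comes from: if the controller issued one DescribeInstances call per attached volume per sync pass and the pass ran, say, every 5 seconds, 20 volumes would already mean 20 × (3600 / 5) = 14,400 calls per hour, the same order of magnitude as the ~24k/hour we are seeing once retries pile on (the 5-second interval is an assumption for the arithmetic, not a measured value). A hedged sketch of batching all node lookups into a single DescribeInstances call with aws-sdk-go (instance IDs and structure are illustrative; this is not the attach/detach controller's actual code path):

```go
package main

import (
	"fmt"
	"log"

	"github.com/aws/aws-sdk-go/aws"
	"github.com/aws/aws-sdk-go/aws/session"
	"github.com/aws/aws-sdk-go/service/ec2"
)

// describeNodesBatched looks up many instances in one DescribeInstances call
// instead of issuing one call per volume/node.
func describeNodesBatched(svc *ec2.EC2, instanceIDs []string) (*ec2.DescribeInstancesOutput, error) {
	ids := make([]*string, 0, len(instanceIDs))
	for _, id := range instanceIDs {
		ids = append(ids, aws.String(id))
	}
	return svc.DescribeInstances(&ec2.DescribeInstancesInput{InstanceIds: ids})
}

func main() {
	sess := session.Must(session.NewSession(&aws.Config{Region: aws.String("us-west-2")}))
	svc := ec2.New(sess)

	// Hypothetical node instance IDs.
	out, err := describeNodesBatched(svc, []string{"i-0123456789abcdef0", "i-0fedcba9876543210"})
	if err != nil {
		log.Fatal(err)
	}
	fmt.Println("reservations:", len(out.Reservations))
}
```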
The cluster is at steady state. We do not have volume churn, i.e. we are not adding and removing volumes.
What you expected to happen:
To be able to have 500 PVs attached to a cluster without overwhelming the EC2 API.
How to reproduce it (as minimally and precisely as possible):
- Use kops to create a cluster: https://github.com/kubernetes/kops/blob/master/docs/aws.md (don't use kube-up.sh; it is flaky and EOL).
- Update the master controller to -v 11 and restart the controller.
- Enable CloudTrail.
- Install 20 deployments with 20 attached PVCs (see the sketch below for one way to generate the manifests).
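A minimal sketch for the last step: a small Go program that prints 20 Deployment + PVC manifest pairs to pipe into kubectl. The names, image, storage-class annotation, and sizes are illustrative assumptions, not taken from the original report, and it assumes a StorageClass that dynamically provisions EBS volumes (gp2 here).

```go
package main

import "fmt"

// Prints 20 Deployment + PVC manifest pairs for `kubectl apply -f -`.
// All names and values below are illustrative.
const manifest = `---
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: pv-spam-claim-%[1]d
  annotations:
    volume.beta.kubernetes.io/storage-class: "gp2"
spec:
  accessModes: ["ReadWriteOnce"]
  resources:
    requests:
      storage: 1Gi
---
apiVersion: extensions/v1beta1
kind: Deployment
metadata:
  name: pv-spam-deploy-%[1]d
spec:
  replicas: 1
  template:
    metadata:
      labels:
        app: pv-spam-%[1]d
    spec:
      containers:
      - name: app
        image: nginx
        volumeMounts:
        - name: data
          mountPath: /data
      volumes:
      - name: data
        persistentVolumeClaim:
          claimName: pv-spam-claim-%[1]d
`

func main() {
	for i := 0; i < 20; i++ {
		fmt.Printf(manifest, i)
	}
}
```

Assuming it is saved as gen-pvc-deploys.go, run `go run gen-pvc-deploys.go | kubectl apply -f -`.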
You will get:
I0106 03:00:38.501732 7 log_handler.go:33] AWS request: ec2 DescribeInstances
I0106 03:00:38.502157 7 log_handler.go:33] AWS request: ec2 DescribeInstances
I0106 03:00:38.566153 7 log_handler.go:33] AWS request: ec2 DescribeInstances
I0106 03:00:38.593830 7 reconciler.go:195] Volume "kubernetes.io/aws-ebs/aws://us-west-2c/vol-0860881328c9f524c"/Node "ip-172-20-15-41.us-west-2.compute.internal" is attached--touching.
I0106 03:00:38.593869 7 reconciler.go:195] Volume "kubernetes.io/aws-ebs/aws://us-west-2a/vol-0a5c43d0a3ba5501e"/Node "ip-172-20-5-129.us-west-2.compute.internal" is attached--touching.
I0106 03:00:38.593882 7 reconciler.go:195] Volume "kubernetes.io/aws-ebs/aws://us-west-2b/vol-08f2d5e73460c5f1f"/Node "ip-172-20-9-218.us-west-2.compute.internal" is attached--touching.
I0106 03:00:38.593895 7 reconciler.go:195] Volume "kubernetes.io/aws-ebs/aws://us-west-2b/vol-0d8875331a33b8223"/Node "ip-172-20-9-218.us-west-2.compute.internal" is attached--touching.
These lines show up in the controller logs. I have only tested an HA setup, but will validate a single-master setup shortly.
Anything else do we need to know:
This issue is bad enough that it is crashing controllers.
AWS rate-limits by API call and by failed API calls. Some API limits are per region, and some are account-wide.
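One client-side mitigation is to throttle calls before they hit the AWS limit, rather than retrying after RequestLimitExceeded. A minimal sketch using a token-bucket limiter (the 5 req/s, burst 10 budget and the wrapped call are assumptions, not what the controller does today):

```go
package main

import (
	"context"
	"log"

	"github.com/aws/aws-sdk-go/aws"
	"github.com/aws/aws-sdk-go/aws/session"
	"github.com/aws/aws-sdk-go/service/ec2"
	"golang.org/x/time/rate"
)

// throttledDescribe waits for a token before each DescribeInstances call so the
// client stays under a self-imposed budget instead of hammering the EC2 API.
func throttledDescribe(ctx context.Context, svc *ec2.EC2, limiter *rate.Limiter, input *ec2.DescribeInstancesInput) (*ec2.DescribeInstancesOutput, error) {
	if err := limiter.Wait(ctx); err != nil {
		return nil, err
	}
	return svc.DescribeInstances(input)
}

func main() {
	sess := session.Must(session.NewSession(&aws.Config{Region: aws.String("us-west-2")}))
	svc := ec2.New(sess)
	limiter := rate.NewLimiter(rate.Limit(5), 10) // illustrative budget: 5 calls/s, burst of 10

	out, err := throttledDescribe(context.Background(), svc, limiter, &ec2.DescribeInstancesInput{})
	if err != nil {
		log.Fatal(err)
	}
	log.Printf("reservations: %d", len(out.Reservations))
}
```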
The code that is getting called is from here:
TL;DR:
Anyone using 1.4.0+ on AWS will exceed their rate limits with as few as 20 PVs attached to a cluster. We have one account at about 24k API calls per hour because of timeouts. This makes PVs unusable on AWS.
About this issue
- Original URL
- State: closed
- Created 7 years ago
- Reactions: 11
- Comments: 44 (42 by maintainers)
Commits related to this issue
- Merge pull request #39551 from chrislovecnm/reconciler-time-increases Automatic merge from submit-queue (batch tested with PRs 39628, 39551, 38746, 38352, 39607) Increasing times on reconciling volu... — committed to kubernetes/kubernetes by deleted user 7 years ago
- Merge pull request #39842 from gnufied/fix-aws-2x-calls Automatic merge from submit-queue (batch tested with PRs 39625, 39842) AWS: Remove duplicate calls to DescribeInstance during volume operation... — committed to kubernetes/kubernetes by deleted user 7 years ago
- Merge pull request #39842 from gnufied/fix-aws-2x-calls Automatic merge from submit-queue (batch tested with PRs 39625, 39842) AWS: Remove duplicate calls to DescribeInstance during volume operation... — committed to nckturner/aws-cloud-controller-manager by deleted user 7 years ago
Any way we can get this merged into v1.5.3? We have many PVCs, and our AWS API gets jammed by this bug.
Plan sounds great to me!