kubernetes: 1.10.x upgrade causes api-server to consume a lot of resources and eventually OOM
/kind bug /sig api-machinery
What happened:
Upgraded from 1.9 (either 1.9.3 or 1.9.6) to 1.10 (either 1.10.0 or 1.10.1). After a few hours of running, the api server's CPU usage hits 100% and it consumes more and more memory. The only error logs are:
E0413 12:45:48.549420 1 authentication.go:63] Unable to authenticate the request due to an error: [invalid bearer token, [invalid bearer token, Token has been invalidated]]
E0413 12:45:48.549680 1 errors.go:90] no context found for request
Other symptoms are:
- slow api server responses
- nodes becoming NotReady
What you expected to happen:
Things to work.
How to reproduce it (as minimally and precisely as possible):
Upgrade hyperkube from 1.9 to 1.10.
Anything else we need to know?:
Environment:
etcd: v3.3.2
Client Version: version.Info{Major:"1", Minor:"9", GitVersion:"v1.9.3", GitCommit:"d2835416544f298c919e2ead3be3d0864b52323b", GitTreeState:"clean", BuildDate:"2018-02-07T12:22:21Z", GoVersion:"go1.9.2", Compiler:"gc", Platform:"linux/amd64"}
Server Version: version.Info{Major:"1", Minor:"10", GitVersion:"v1.10.1+coreos.0", GitCommit:"baafb306bb191971a84cb1796420d093de7e6014", GitTreeState:"clean", BuildDate:"2018-04-12T21:14:39Z", GoVersion:"go1.9.3", Compiler:"gc", Platform:"linux/amd64"}
AWS: 3 etcd instances, 3 masters in an ASG, 3 workers in an ASG
NAME="Container Linux by CoreOS"
ID=coreos
VERSION=1688.5.3
VERSION_ID=1688.5.3
BUILD_ID=2018-04-03-0547
PRETTY_NAME="Container Linux by CoreOS 1688.5.3 (Rhyolite)"
ANSI_COLOR="38;5;75"
HOME_URL="https://coreos.com/"
BUG_REPORT_URL="https://issues.coreos.com"
COREOS_BOARD="amd64-usr"
Linux ip-10-66-23-26 4.14.32-coreos #1 SMP Tue Apr 3 05:21:26 UTC 2018 x86_64 Intel(R) Xeon(R) CPU E5-2686 v4 @ 2.30GHz GenuineIntel GNU/Linux
Our api-server pod definition: https://github.com/utilitywarehouse/tf_kube_ignition/blob/master/resources/kube-apiserver.yaml
About this issue
- State: closed
- Created 6 years ago
- Reactions: 4
- Comments: 24 (20 by maintainers)
Commits related to this issue
- Merge pull request #64153 from liggitt/automated-cherry-pick-of-#61459-upstream-release-1.10 Automatic merge from submit-queue. Automated cherry pick of #61459: etcd client add dial timeout Cherry ... — committed to kubernetes/kubernetes by deleted user 6 years ago
- Merge pull request #64153 from liggitt/automated-cherry-pick-of-#61459-upstream-release-1.10 Automatic merge from submit-queue. Automated cherry pick of #61459: etcd client add dial timeout Cherry ... — committed to kubernetes/apiserver by k8s-publishing-bot 6 years ago
Opened https://github.com/kubernetes/kubernetes/pull/64153 to cherry-pick the fix back to 1.10.x.
#64153 is merged and will be in 1.10.4, planned for 6/6.