kubeadm: kubeadm upgrade fails when hostname != node name and when kubeadm config is used
This is a follow-up to https://github.com/kubernetes/kubeadm/issues/1757
What keywords did you search in kubeadm issues before filing this one?
upgrade kubeadm hostname
Is this a BUG REPORT or FEATURE REQUEST?
BUG REPORT
Versions
kubeadm version (use kubeadm version): kubeadm version: &version.Info{Major:"1", Minor:"16", GitVersion:"v1.16.0", GitCommit:"2bd9643cee5b3b3a5ecbd3af49d09018f0773c77", GitTreeState:"clean", BuildDate:"2019-09-18T14:34:01Z", GoVersion:"go1.12.9", Compiler:"gc", Platform:"linux/amd64"}
Environment:
- Kubernetes version (use kubectl version):
Client Version: version.Info{Major:"1", Minor:"15", GitVersion:"v1.15.3", GitCommit:"2d3c76f9091b6bec110a5e63777c332469e0cba2", GitTreeState:"clean", BuildDate:"2019-08-19T11:13:54Z", GoVersion:"go1.12.9", Compiler:"gc", Platform:"linux/amd64"}
Server Version: version.Info{Major:"1", Minor:"15", GitVersion:"v1.15.3", GitCommit:"2d3c76f9091b6bec110a5e63777c332469e0cba2", GitTreeState:"clean", BuildDate:"2019-08-19T11:05:50Z", GoVersion:"go1.12.9", Compiler:"gc", Platform:"linux/amd64"}
- Cloud provider or hardware configuration: bare metal
- OS (e.g. from /etc/os-release): ubuntu:bionic
- Kernel (e.g. uname -a): Linux hq-srv11 4.15.0-64-generic #73-Ubuntu SMP Thu Sep 12 13:16:13 UTC 2019 x86_64 x86_64 x86_64 GNU/Linux
- Others:
What happened?
Running kubeadm upgrade apply v1.16.0 --config=/path/to/config.yaml --dry-run ended up in an infinite loop of:
[dryrun] Resource name: "hq-srv11"
[dryrun] The GET request didn't yield any result, the API Server returned a NotFound error.
[dryrun] Would perform action GET on resource "nodes" in API group "core/v1"
[dryrun] Resource name: "hq-srv11"
[dryrun] The GET request didn't yield any result, the API Server returned a NotFound error.
[dryrun] Would perform action GET on resource "nodes" in API group "core/v1"
[dryrun] Resource name: "hq-srv11"
[dryrun] The GET request didn't yield any result, the API Server returned a NotFound error.
[dryrun] Would perform action GET on resource "nodes" in API group "core/v1"
[dryrun] Resource name: "hq-srv11"
[dryrun] The GET request didn't yield any result, the API Server returned a NotFound error.
[dryrun] Would perform action GET on resource "nodes" in API group "core/v1"
[dryrun] Resource name: "hq-srv11"
[dryrun] The GET request didn't yield any result, the API Server returned a NotFound error.
and the same with more verbose output:
I0924 20:42:22.238048 16521 patchnode.go:30] [patchnode] Uploading the CRI Socket information "/var/run/dockershim.sock" to the Node API object "hq-srv11" as an annotation
[dryrun] Would perform action GET on resource "nodes" in API group "core/v1"
[dryrun] Resource name: "hq-srv11"
I0924 20:42:22.738465 16521 round_trippers.go:423] curl -k -v -XGET -H "Accept: application/json" -H "User-Agent: kubeadm/v1.16.0 (linux/amd64) kubernetes/2bd9643" 'https://10.50.8.1:6443/api/v1/nodes/hq-srv11'
I0924 20:42:22.742984 16521 round_trippers.go:443] GET https://10.50.8.1:6443/api/v1/nodes/hq-srv11 404 Not Found in 4 milliseconds
I0924 20:42:22.743053 16521 round_trippers.go:449] Response Headers:
I0924 20:42:22.743110 16521 round_trippers.go:452] Content-Type: application/json
I0924 20:42:22.743126 16521 round_trippers.go:452] Content-Length: 186
I0924 20:42:22.743142 16521 round_trippers.go:452] Date: Tue, 24 Sep 2019 20:42:22 GMT
I0924 20:42:22.743183 16521 request.go:968] Response Body: {"kind":"Status","apiVersion":"v1","metadata":{},"status":"Failure","message":"nodes \"hq-srv11\" not found","reason":"NotFound","details":{"name":"hq-srv11","kind":"nodes"},"code":404}
[dryrun] The GET request didn't yield any result, the API Server returned a NotFound error.
[dryrun] Would perform action GET on resource "nodes" in API group "core/v1"
[dryrun] Resource name: "hq-srv11"
I traced back to see where that value comes from and found the source of the problem:
func SetJoinDynamicDefaults(cfg *kubeadmapi.JoinConfiguration) error {
	addControlPlaneTaint := false
	if cfg.ControlPlane != nil {
		addControlPlaneTaint = true
	}
	if err := SetNodeRegistrationDynamicDefaults(&cfg.NodeRegistration, addControlPlaneTaint); err != nil {
		return err
	}
	return SetJoinControlPlaneDefaults(cfg.ControlPlane)
}

// SetNodeRegistrationDynamicDefaults checks and sets configuration values for the NodeRegistration object
func SetNodeRegistrationDynamicDefaults(cfg *kubeadmapi.NodeRegistrationOptions, ControlPlaneTaint bool) error {
	var err error
	cfg.Name, err = kubeadmutil.GetHostname(cfg.Name)
	if err != nil {
		return err
	}
	// ... (rest of the function omitted in this excerpt)

// GetHostname returns OS's hostname if 'hostnameOverride' is empty; otherwise, return 'hostnameOverride'
// NOTE: This function copied from pkg/util/node package to avoid external kubeadm dependency
func GetHostname(hostnameOverride string) (string, error) {
	hostName := hostnameOverride
	if len(hostName) == 0 {
		nodeName, err := os.Hostname()
		if err != nil {
			return "", errors.Wrap(err, "couldn't determine hostname")
		}
		hostName = nodeName
	}
	// ... (rest of the function omitted in this excerpt)
As you can see, unless the node name is specified explicitly, os.Hostname is used, and the hostname of this machine is hq-srv11:
# hostname
hq-srv11
# hostname -f
hq-srv11.<redacted-org-domain-name>
while the nodes in the cluster were registered with an explicitly set FQDN:
# kubectl get nodes
NAME                                  STATUS   ROLES    AGE   VERSION
hq-srv11.<redacted-org-domain-name>   Ready    master   63d   v1.15.3
What you expected to happen?
I believe the name of the node should be obtained from the API, or at least correlated with what’s in the API, since the hostname does not necessarily match the node name.
How to reproduce it (as minimally and precisely as possible)?
Initialise a cluster on an older version with a node that has a non-default name and with a kubeadm config, using kubeadm init --node-name=foo, then upgrade, using the kubeadm config again.
Anything else we need to know?
About this issue
- State: closed
- Created 5 years ago
- Comments: 25 (20 by maintainers)
exactly, that is because we don’t want users to use it.
the --config flag was added to upgrade to allow reconfiguration of the existing cluster, which is now supported using the kubeadm kustomize feature (see the changelog for 1.16). yet, reconfiguring the cluster using this flag is not recommended.
i agree. this needs a line or two in this document: https://github.com/kubernetes/website/blob/master/content/en/docs/tasks/administer-cluster/kubeadm/kubeadm-upgrade.md
/kind documentation
/assign
ok, so i did some investigation here.
@zerkms your suggestion to use certificates to fetch the node name is already used actually, but only when the user is not providing a config file and the configuration is fetched from the cluster. see https://github.com/kubernetes/kubernetes/blob/2e6b073a3f800654ec217e763fcb97412308a9db/cmd/kubeadm/app/util/config/cluster.go#L113
this is like so because the dynamic defaulting of node name from certificates happens only for nodes that have the kubelet config and certificates present already and a configuration is fetched from the cluster. if you pass a configuration file kubeadm will default the node name to your hostname. this is by design.
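For readers unfamiliar with how a node name can be recovered from a certificate: kubelet client credentials identify the node with a user name of the form system:node:<nodeName>, carried in the certificate's common name. The sketch below only illustrates that naming convention. `nodeNameFromCommonName` and `nodeUserPrefix` are hypothetical names, this is not the code at the link above, and reading and parsing the certificate itself is left out.

```go
package main

import (
	"errors"
	"fmt"
	"strings"
)

// nodeUserPrefix is the user-name prefix used for kubelet client identities.
const nodeUserPrefix = "system:node:"

// nodeNameFromCommonName (hypothetical helper) extracts the node name from a
// certificate common name of the form "system:node:<nodeName>".
func nodeNameFromCommonName(cn string) (string, error) {
	if !strings.HasPrefix(cn, nodeUserPrefix) {
		return "", fmt.Errorf("common name %q does not start with %q", cn, nodeUserPrefix)
	}
	name := strings.TrimPrefix(cn, nodeUserPrefix)
	if name == "" {
		return "", errors.New("common name carries an empty node name")
	}
	return name, nil
}

func main() {
	// The FQDN here is a placeholder for the redacted one in the report.
	name, err := nodeNameFromCommonName("system:node:hq-srv11.example.internal")
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Println(name) // prints hq-srv11.example.internal
}
```

A name derived this way reflects what the node was registered as, independent of what os.Hostname returns on the machine.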
dynamically defaulting your node name to a value from the kubelet and certificates when already passing --config to apply is an option, but i don’t think we should do this.
the explicit flag that @SataQiu added is a workaround for your use case. there is a similar flag for CRI socket. but i’m personally not in favor of adding more flags.
your existing workaround is to have such a config:
my question for you @zerkms is why are you passing --config to apply? this acts like reconfiguration and while kubeadm supports it, it should not be done in the first place. if your config is missing important information it will be defaulted with dynamic values, such as the host name of the node.
Yeah, that might be one way. We can extract the host name from the certificate. But I’m not quite sure. @zerkms @neolit123
Thanks @zerkms I have reproduced the problem through the following steps:
I’m going to dig into how we can solve this problem.
i will try to reproduce this again tomorrow.