kubeadm: Regression - etcd datadir permissions not set on etcd grow

Is this a BUG REPORT or FEATURE REQUEST?

BUG REPORT

Versions

kubeadm version (use kubeadm version): 1.14+

Environment: N/A

What happened?

With the release of etcd 3.4.10, the datadir permissions now need to be 0700 or etcd won’t start. There was an issue (#1308) where perms were set on join before starting the etcd container as a security control, overriding the default behavior of creating a nonexistent directory with mode 0755. However, in a later cleanup, that necessary os.MkdirAll call was removed. The omission went unnoticed for several releases since etcd didn’t complain, but with etcd-io/etcd#11798 (in 3.4.10), the new etcd cluster on the second node does not start.

I’m pretty sure this will break anyone on k8s 1.14 or newer who upgrades to etcd 3.4.10 or newer without first fixing the /var/lib/etcd perms.

What you expected to happen?

/var/lib/etcd (or whatever the var is set to) should be set to 0700. 😃

How to reproduce it (as minimally and precisely as possible)?

With etcd 3.4.10 or newer, join a second master node, then run ls -ld /var/lib/etcd on that node.
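
For anyone who prefers a programmatic check over ls -ld, here is a minimal Go sketch of the same inspection. The helper name checkDataDirMode and the constant wantMode are hypothetical (they are not part of kubeadm or etcd), and the default path is simply the /var/lib/etcd mentioned above; treat it as an illustration of the check, not as what etcd itself runs.

```go
package main

import (
	"fmt"
	"os"
)

// wantMode is the mode this issue says etcd 3.4.10+ expects on its data directory.
const wantMode os.FileMode = 0700

// checkDataDirMode (hypothetical helper) returns the permission bits of dir
// and whether they are exactly wantMode.
func checkDataDirMode(dir string) (os.FileMode, bool, error) {
	info, err := os.Stat(dir)
	if err != nil {
		return 0, false, err
	}
	if !info.IsDir() {
		return 0, false, fmt.Errorf("%s is not a directory", dir)
	}
	perm := info.Mode().Perm()
	return perm, perm == wantMode, nil
}

func main() {
	// Assumption: the data dir is at the kubeadm default unless a path is given.
	dir := "/var/lib/etcd"
	if len(os.Args) > 1 {
		dir = os.Args[1]
	}
	perm, ok, err := checkDataDirMode(dir)
	if err != nil {
		fmt.Fprintln(os.Stderr, "cannot check data dir:", err)
		return
	}
	fmt.Printf("%s has mode %04o (matches 0700: %t)\n", dir, uint32(perm), ok)
}
```

On an affected node, this would be expected to report a mode other than 0700 (0755 according to the description above).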

Anything else we need to know?

It’s worth explicitly noting that the first control plane node added works fine. It’s just the second and subsequent nodes, which are handled in a separate location in the code, that exhibit the problem.

About this issue

  • State: closed
  • Created 4 years ago
  • Comments: 26 (18 by maintainers)

Most upvoted comments

actually, I think @jingyih is working on resolving the regression in etcd … not sure if kubeadm should work around it or wait for an etcd fix. I think kubernetes manifests will stay on 3.4.9 until this is resolved

I discussed with @spzala about this. We are thinking about providing a warning message instead of enforcing the file permission: https://github.com/etcd-io/etcd/pull/12242

updated the PR to only create the directory if it does not exist on init/join-control-plane, but not chmod it. https://github.com/kubernetes/kubernetes/pull/94102

/remove-priority important-soon
/priority backlog

lowering priority since the fix in etcd was applied and 1.20 will include 3.4.13+. (there is also discussion to backport the etcd version to 1.19)

ok, in the above PR i’ve added a chmod 700 in kubeadm init even if the directory already exists, just in case.

@neolit123 We are currently using kubeadm / k8s 1.18.8. We use an external etcd cluster which we provisioned ourselves on dedicated local-storage VMs. We added the etcd cluster to kubeadm via the kubeadm-config configmap. I was able to upgrade the etcd cluster after changing the directory permissions of /var/lib/etcd to 700.

ok, so originally a similar fix was added here in the function that creates the static pods for 1.14-pre: https://github.com/kubernetes/kubernetes/commit/836f413cf1096c9b020b20319d0767aee4f9b990#diff-c4574f3918f016aeb3b32f5d9cb62ed6

later that code was moved (by ereslibre): https://github.com/kubernetes/kubernetes/commit/981bf1930c73a7d95bbbd1dc9b3bfff122ad09a8

and then this refactor that you linked indeed omitted it. https://github.com/kubernetes/kubernetes/pull/73452 that PR was very noisy and had 140+ comments and we might have missed this.

so yes, we should not let the kubelet create the path with 755, and include the following:

// pre-create the etcd data directory with the right permissions
if err := os.MkdirAll(cfg.Etcd.Local.DataDir, 0700); err != nil {
	return errors.Wrapf(err, "failed to create etcd directory %q", cfg.Etcd.Local.DataDir)
}
// if the directory already existed and the above call was a no-op, ensure it has the right permissions
// or otherwise etcd 3.4.10+ will fail:
// https://github.com/etcd-io/etcd/pull/11798
if err := os.Chmod(cfg.Etcd.Local.DataDir, 0700); err != nil {
	return errors.Wrapf(err, "failed to chmod etcd directory %q", cfg.Etcd.Local.DataDir)
}

at this line: https://github.com/kubernetes/kubernetes/blob/master/cmd/kubeadm/app/cmd/phases/join/controlplanejoin.go#L133

thanks for reporting it.
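
To illustrate why the snippet above calls os.Chmod in addition to os.MkdirAll, here is a small self-contained sketch. It is not the kubeadm code: ensureDataDir and demo are hypothetical names, errors are wrapped with the standard library's fmt.Errorf rather than errors.Wrapf, and a temporary directory stands in for the real etcd data dir. It assumes a Unix-like system.

```go
package main

import (
	"fmt"
	"os"
	"path/filepath"
)

// ensureDataDir (hypothetical helper) creates dir with mode 0700 if it is
// missing, then chmods it so a pre-existing directory also ends up 0700.
func ensureDataDir(dir string) error {
	if err := os.MkdirAll(dir, 0700); err != nil {
		return fmt.Errorf("failed to create etcd directory %q: %w", dir, err)
	}
	// MkdirAll does nothing when dir already exists, so its mode is untouched.
	if err := os.Chmod(dir, 0700); err != nil {
		return fmt.Errorf("failed to chmod etcd directory %q: %w", dir, err)
	}
	return nil
}

// demo simulates a data dir that something else already created as 0755 and
// returns its permission bits before and after ensureDataDir runs.
func demo() (before, after os.FileMode, err error) {
	tmp, err := os.MkdirTemp("", "etcd-perms-")
	if err != nil {
		return 0, 0, err
	}
	defer os.RemoveAll(tmp)

	dir := filepath.Join(tmp, "etcd")
	if err = os.Mkdir(dir, 0755); err != nil {
		return 0, 0, err
	}
	// Force 0755 explicitly so the demo does not depend on the umask.
	if err = os.Chmod(dir, 0755); err != nil {
		return 0, 0, err
	}
	info, err := os.Stat(dir)
	if err != nil {
		return 0, 0, err
	}
	before = info.Mode().Perm()

	if err = ensureDataDir(dir); err != nil {
		return 0, 0, err
	}
	info, err = os.Stat(dir)
	if err != nil {
		return 0, 0, err
	}
	return before, info.Mode().Perm(), nil
}

func main() {
	before, after, err := demo()
	if err != nil {
		fmt.Fprintln(os.Stderr, err)
		return
	}
	fmt.Printf("before: %04o, after: %04o\n", uint32(before), uint32(after))
}
```

Dropping the os.Chmod call from ensureDataDir would leave the pre-existing directory at 0755, which is the situation the second control plane node ends up in.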