kubeadm: upgrade-1-28-latest and upgrade-addons-before-controlplane-1-28-latest failed

https://testgrid.k8s.io/sig-cluster-lifecycle-kubeadm#kubeadm-kinder-upgrade-1-28-latest
https://testgrid.k8s.io/sig-cluster-lifecycle-kubeadm#kubeadm-kinder-upgrade-addons-before-controlplane-1-28-latest

Both jobs keep failing after https://github.com/kubernetes/release/pull/3254.

/assign

ref https://github.com/kubernetes/kubeadm/issues/2925

See https://github.com/kubernetes/kubeadm/issues/2927#issuecomment-1713870411 for the conclusion.

About this issue

  • State: closed
  • Created 10 months ago
  • Comments: 71 (71 by maintainers)

Most upvoted comments

The latest diff result (https://storage.googleapis.com/kubernetes-jenkins/logs/ci-kubernetes-e2e-kubeadm-kinder-upgrade-1-28-latest/1701751770478284800/build-log.txt) shows that the default values are no longer injected, which is what we expect.

I0913 00:25:24.641339    3894 staticpods.go:225] Pod manifest files diff:
@@ -46 +45,0 @@
-      successThreshold: 1
@@ -62 +60,0 @@
-      successThreshold: 1
@@ -64,2 +61,0 @@
-    terminationMessagePath: /dev/termination-log
-    terminationMessagePolicy: File
@@ -71,2 +66,0 @@
-  dnsPolicy: ClusterFirst
-  enableServiceLinks: true
@@ -76,2 +69,0 @@
-  restartPolicy: Always
-  schedulerName: default-scheduler
@@ -81 +72,0 @@
-  terminationGracePeriodSeconds: 30

Now we should fix the v1.28 branch, and then the CI will be green. Waiting for https://github.com/kubernetes/kubernetes/pull/120605 to be merged.

I think after we remove the reference to k8s.io/kubernetes/pkg/apis/core/v1 from the v1.28 branch, everything will be back to normal, because the Pod defaulter is registered into the Scheme by: https://github.com/kubernetes/kubernetes/blob/160fe010f32fd1896917fecad680769ad0e40ca0/pkg/apis/core/v1/register.go#L29-L34

func init() {
        // We only register manually written functions here. The registration of the
        // generated functions takes place in the generated files. The separation
        // makes the code compile even when the generated files are missing.
        localSchemeBuilder.Register(addDefaultingFuncs, addConversionFuncs)
}

That’s why the default values are injected…
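
For context, a blank import is enough to trigger that init(). A minimal sketch (the package path is the one from the snippet above; the rest is illustrative and just prints a marker):

package main

import (
	"fmt"

	// Blank import for side effects only: it runs the init() shown above,
	// which registers addDefaultingFuncs into the shared core/v1 scheme
	// builder that client-go's clientset scheme is built from.
	_ "k8s.io/kubernetes/pkg/apis/core/v1"
)

func main() {
	fmt.Println("core/v1 defaulting funcs registered via init()")
}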

Some testing on my side confirms this is really Go version related. Defaults will always be generated with Go 1.20, so whether or not k8s.io/kubernetes/pkg/apis/core/v1 is imported, the diff will always be empty, as both manifests are generated with defaults. Defaults will not be generated with Go 1.21 by default, and I suspect kind updated its Go version recently, so this issue is hit.
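
If it helps, a quick way to check whether the core/v1 defaulters actually ended up registered in a given build (a sketch; Scheme.Default and the clientset scheme are existing APIs, the check itself is mine):

package main

import (
	"fmt"

	corev1 "k8s.io/api/core/v1"
	clientsetscheme "k8s.io/client-go/kubernetes/scheme"
)

func main() {
	// Apply whatever defaulting funcs are registered for v1.Pod in the
	// clientset scheme. If the core/v1 defaulters were registered at init
	// time, DNSPolicy comes back as "ClusterFirst"; otherwise it stays "".
	pod := &corev1.Pod{}
	clientsetscheme.Scheme.Default(pod)
	fmt.Printf("DNSPolicy after Default(): %q\n", pod.Spec.DNSPolicy)
}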

> shows that the default values are no longer injected, this is what we expect.

Some testing in kinder shows the new manifest does not have the defaults generated, but the old manifest for each of the Pods does.

so we are preparing some patches for 1. and 2. (noted above) and we are going to apply them to master and then backport 1. to 1.28.

but our upgrade CI is only failing from 1.28 -> master upgrades. the master upgrade is using the 1.29-pre kubeadm binary. this makes the whole problem very confusing.

logically it should have also failed in the 1.27 -> 1.28 upgrades, because 1.28 is when we added the problem code (extra envs). there could be other factors at play here and we might not understand why this is happening exactly…

1 and 2 merged. we also updated the diff tool. https://github.com/kubernetes/kubernetes/commits/master

the e2e tests are still failing: https://testgrid.k8s.io/sig-cluster-lifecycle-kubeadm#kubeadm-kinder-upgrade-1-28-latest

https://storage.googleapis.com/kubernetes-jenkins/logs/ci-kubernetes-e2e-kubeadm-kinder-upgrade-1-28-latest/1701661172480086016/build-log.txt

this is from before the diff tool update, but we are still getting the strange defaulter diff. our deserializer is now not a converting one and we removed the dependency on the internal packages…

[upgrade/staticpods] Renewing etcd-server certificate
I0912 18:26:16.593921    2794 staticpods.go:225] Pod manifest files diff:
  &v1.Pod{
  	TypeMeta:   {Kind: "Pod", APIVersion: "v1"},
  	ObjectMeta: {Name: "etcd", Namespace: "kube-system", Labels: {"component": "etcd", "tier": "control-plane"}, Annotations: {"kubeadm.kubernetes.io/etcd.advertise-client-urls": "https://172.17.0.3:2379"}, ...},
  	Spec: v1.PodSpec{
  		Volumes:        {{Name: "etcd-certs", VolumeSource: {HostPath: &{Path: "/etc/kubernetes/pki/etcd", Type: &"DirectoryOrCreate"}}}, {Name: "etcd-data", VolumeSource: {HostPath: &{Path: "/var/lib/etcd", Type: &"DirectoryOrCreate"}}}},
  		InitContainers: nil,
  		Containers: []v1.Container{
  			{
  				... // 11 identical fields
  				VolumeMounts:  {{Name: "etcd-data", MountPath: "/var/lib/etcd"}, {Name: "etcd-certs", MountPath: "/etc/kubernetes/pki/etcd"}},
  				VolumeDevices: nil,
  				LivenessProbe: &v1.Probe{
  					... // 2 identical fields
  					TimeoutSeconds:                15,
  					PeriodSeconds:                 10,
- 					SuccessThreshold:              0,
+ 					SuccessThreshold:              1,
  					FailureThreshold:              8,
  					TerminationGracePeriodSeconds: nil,
  				},
  				ReadinessProbe: nil,
  				StartupProbe: &v1.Probe{
  					... // 2 identical fields
  					TimeoutSeconds:                15,
  					PeriodSeconds:                 10,
- 					SuccessThreshold:              0,
+ 					SuccessThreshold:              1,
  					FailureThreshold:              24,
  					TerminationGracePeriodSeconds: nil,
  				},
  				Lifecycle:                nil,
- 				TerminationMessagePath:   "",
+ 				TerminationMessagePath:   "/dev/termination-log",
- 				TerminationMessagePolicy: "",
+ 				TerminationMessagePolicy: "File",
  				ImagePullPolicy:          "IfNotPresent",
  				SecurityContext:          nil,
  				... // 3 identical fields
  			},
  		},
  		EphemeralContainers:           nil,
- 		RestartPolicy:                 "",
+ 		RestartPolicy:                 "Always",
- 		TerminationGracePeriodSeconds: nil,
+ 		TerminationGracePeriodSeconds: &30,
  		ActiveDeadlineSeconds:         nil,
- 		DNSPolicy:                     "",
+ 		DNSPolicy:                     "ClusterFirst",
  		NodeSelector:                  nil,
  		ServiceAccountName:            "",
  		... // 10 identical fields
  		Subdomain:     "",
  		Affinity:      nil,
- 		SchedulerName: "",
+ 		SchedulerName: "default-scheduler",
  		Tolerations:   nil,
  		HostAliases:   nil,
  		... // 3 identical fields
  		ReadinessGates:     nil,
  		RuntimeClassName:   nil,
- 		EnableServiceLinks: nil,
+ 		EnableServiceLinks: &true,
  		PreemptionPolicy:   nil,
  		Overhead:           nil,
  		... // 6 identical fields
  	},
  	Status: {},
  }

I0912 18:26:16.594089    2794 certs.go:519] validating certificate period for etcd CA certificate
I0912 18:26:16.594901    2794 certs.go:519] validating certificate period for etcd/ca certificate
[upgrade/staticpods] Renewing etcd-peer certificate
[upgrade/staticpods] Renewing etcd-healthcheck-client certificate
[upgrade/staticpods] Moved new manifest to "/etc/kubernetes/manifests/etcd.yaml" and backed up old manifest to "/etc/kubernetes/tmp/kubeadm-backup-manifests-2023-09-12-18-26-15/etcd.yaml"
[upgrade/staticpods] Waiting for the kubelet to restart the component
[upgrade/staticpods] This might take a minute or longer depending on the component/version gap (timeout 5m0s)
I0912 18:26:18.742230    2794 round_trippers.go:553] GET https://172.17.0.7:6443/api/v1/namespaces/kube-system/pods/etcd-kinder-upgrade-control-plane-1?timeout=10s 200 OK in 5 milliseconds
I0912 18:26:19.251245    2794 round_trippers.go:553] GET https://172.17.0.7:6443/api/v1/namespaces/kube-system/pods/etcd-kinder-upgrade-control-plane-1?timeout=10s 200 OK in 6 milliseconds
I0912 18:26:19.750397    2794 round_trippers.go:553] GET https://172.17.0.7:6443/api/v1/namespaces/kube-system/pods/etcd-kinder-upgrade-control-plane-1?timeout=10s 200 OK in 4 milliseconds
I0912 18:26:20.249612    2794 round_trippers.go:553] GET https://172.17.0.7:6443/api/v1/namespaces/kube-system/pods/etcd-kinder-upgrade-control-plane-1?timeout=10s 200 OK in 4 milliseconds
....
...
I0912 18:31:18.249358    2794 round_trippers.go:553] GET https://172.17.0.7:6443/api/v1/namespaces/kube-system/pods/etcd-kinder-upgrade-control-plane-1?timeout=10s 200 OK in 3 milliseconds
I0912 18:31:18.751030    2794 round_trippers.go:553] GET https://172.17.0.7:6443/api/v1/namespaces/kube-system/pods/etcd-kinder-upgrade-control-plane-1?timeout=10s 200 OK in 6 milliseconds
I0912 18:31:18.755776    2794 round_trippers.go:553] GET https://172.17.0.7:6443/api/v1/namespaces/kube-system/pods/etcd-kinder-upgrade-control-plane-1?timeout=10s 200 OK in 3 milliseconds
I0912 18:31:18.756932    2794 etcd.go:588] [etcd] attempting to see if all cluster endpoints ([https://172.17.0.4:2379 https://172.17.0.3:2379 https://172.17.0.2:2379]) are available 1/10
[upgrade/etcd] Failed to upgrade etcd: couldn't upgrade control plane. kubeadm has tried to recover everything into the earlier state. Errors faced: static Pod hash for component etcd on Node kinder-upgrade-control-plane-1 did not change after 5m0s: timed out waiting for the condition
[upgrade/etcd] Waiting for previous etcd to become available
[upgrade/etcd] Etcd was rolled back and is now available
static Pod hash for component etcd on Node kinder-upgrade-control-plane-1 did not change after 5m0s: timed out waiting for the condition
couldn't upgrade control plane. kubeadm has tried to recover everything into the earlier state. Errors faced
k8s.io/kubernetes/cmd/kubeadm/app/phases/upgrade.rollbackOldManifests
	cmd/kubeadm/app/phases/upgrade/staticpods.go:527
k8s.io/kubernetes/cmd/kubeadm/app/phases/upgrade.upgradeComponent
	cmd/kubeadm/app/phases/upgrade/staticpods.go:256
k8s.io/kubernetes/cmd/kubeadm/app/phases/upgrade.performEtcdStaticPodUpgrade
	cmd/kubeadm/app/phases/upgrade/staticpods.go:340
k8s.io/kubernetes/cmd/kubeadm/app/phases/upgrade.StaticPodControlPlane
	cmd/kubeadm/app/phases/upgrade/staticpods.go:467
k8s.io/kubernetes/cmd/kubeadm/app/phases/upgrade.PerformStaticPodUpgrade
	cmd/kubeadm/app/phases/upgrade/staticpods.go:619
k8s.io/kubernetes/cmd/kubeadm/app/cmd/upgrade.PerformControlPlaneUpgrade
	cmd/kubeadm/app/cmd/upgrade/apply.go:216
k8s.io/kubernetes/cmd/kubeadm/app/cmd/upgrade.runApply
	cmd/kubeadm/app/cmd/upgrade/apply.go:156
k8s.io/kubernetes/cmd/kubeadm/app/cmd/upgrade.newCmdApply.func1
	cmd/kubeadm/app/cmd/upgrade/apply.go:74
github.com/spf13/cobra.(*Command).execute
	vendor/github.com/spf13/cobra/command.go:940
github.com/spf13/cobra.(*Command).ExecuteC
	vendor/github.com/spf13/cobra/command.go:1068
github.com/spf13/cobra.(*Command).Execute
	vendor/github.com/spf13/cobra/command.go:992
k8s.io/kubernetes/cmd/kubeadm/app.Run
	cmd/kubeadm/app/kubeadm.go:50
main.main
	cmd/kubeadm/kubeadm.go:25
runtime.main
	/usr/local/go/src/runtime/proc.go:267
runtime.goexit
	/usr/local/go/src/runtime/asm_amd64.s:1650
fatal error when trying to upgrade the etcd cluster, rolled the state back to pre-upgrade state
k8s.io/kubernetes/cmd/kubeadm/app/phases/upgrade.performEtcdStaticPodUpgrade
	cmd/kubeadm/app/phases/upgrade/staticpods.go:369
k8s.io/kubernetes/cmd/kubeadm/app/phases/upgrade.StaticPodControlPlane
	cmd/kubeadm/app/phases/upgrade/staticpods.go:467
k8s.io/kubernetes/cmd/kubeadm/app/phases/upgrade.PerformStaticPodUpgrade
	cmd/kubeadm/app/phases/upgrade/staticpods.go:619
k8s.io/kubernetes/cmd/kubeadm/app/cmd/upgrade.PerformControlPlaneUpgrade
	cmd/kubeadm/app/cmd/upgrade/apply.go:216
k8s.io/kubernetes/cmd/kubeadm/app/cmd/upgrade.runApply
	cmd/kubeadm/app/cmd/upgrade/apply.go:156
k8s.io/kubernetes/cmd/kubeadm/app/cmd/upgrade.newCmdApply.func1
	cmd/kubeadm/app/cmd/upgrade/apply.go:74
github.com/spf13/cobra.(*Command).execute
	vendor/github.com/spf13/cobra/command.go:940
github.com/spf13/cobra.(*Command).ExecuteC
	vendor/github.com/spf13/cobra/command.go:1068
github.com/spf13/cobra.(*Command).Execute
	vendor/github.com/spf13/cobra/command.go:992
k8s.io/kubernetes/cmd/kubeadm/app.Run
	cmd/kubeadm/app/kubeadm.go:50
main.main
	cmd/kubeadm/kubeadm.go:25
runtime.main
	/usr/local/go/src/runtime/proc.go:267
runtime.goexit
	/usr/local/go/src/runtime/asm_amd64.s:1650
[upgrade/apply] FATAL
k8s.io/kubernetes/cmd/kubeadm/app/cmd/upgrade.runApply
	cmd/kubeadm/app/cmd/upgrade/apply.go:157
k8s.io/kubernetes/cmd/kubeadm/app/cmd/upgrade.newCmdApply.func1
	cmd/kubeadm/app/cmd/upgrade/apply.go:74
github.com/spf13/cobra.(*Command).execute
	vendor/github.com/spf13/cobra/command.go:940
github.com/spf13/cobra.(*Command).ExecuteC
	vendor/github.com/spf13/cobra/command.go:1068
github.com/spf13/cobra.(*Command).Execute
	vendor/github.com/spf13/cobra/command.go:992
k8s.io/kubernetes/cmd/kubeadm/app.Run
	cmd/kubeadm/app/kubeadm.go:50
main.main
	cmd/kubeadm/kubeadm.go:25
runtime.main
	/usr/local/go/src/runtime/proc.go:267
runtime.goexit
	/usr/local/go/src/runtime/asm_amd64.s:1650
Error: failed to exec action kubeadm-upgrade: exit status 1

2. In our function for reading a Pod manifest, we should switch to using codecs.UniversalDeserializer().Decode(...): https://github.com/kubernetes/kubernetes/blob/master/cmd/kubeadm/app/util/marshal.go#L57-L78 https://github.com/kubernetes/kubernetes/blob/master/staging/src/k8s.io/apimachinery/pkg/runtime/serializer/codec_factory.go#L269 TODO

I added the change for point 2 (using the universal deserializer to decode) in https://github.com/kubernetes/kubernetes/pull/120549.
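
For illustration, a minimal sketch of reading a static Pod manifest through the universal deserializer (the helper name, path, and error handling here are mine; the key property is that this decoder performs no conversion and no defaulting):

package main

import (
	"fmt"
	"os"

	corev1 "k8s.io/api/core/v1"
	clientsetscheme "k8s.io/client-go/kubernetes/scheme"
)

// readStaticPod decodes a static Pod manifest without going through a
// defaulting codec, so no default values are injected into the result.
func readStaticPod(path string) (*corev1.Pod, error) {
	data, err := os.ReadFile(path)
	if err != nil {
		return nil, err
	}
	obj, _, err := clientsetscheme.Codecs.UniversalDeserializer().Decode(data, nil, nil)
	if err != nil {
		return nil, err
	}
	pod, ok := obj.(*corev1.Pod)
	if !ok {
		return nil, fmt.Errorf("manifest %s is not a v1.Pod", path)
	}
	return pod, nil
}

func main() {
	pod, err := readStaticPod("/etc/kubernetes/manifests/etcd.yaml")
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	// With no defaulting at decode time, unset fields stay empty.
	fmt.Printf("dnsPolicy: %q\n", pod.Spec.DNSPolicy)
}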

It took me most of the day to investigate this issue, and I still have some question marks in my mind:

  • Why did it start failing on 6/9 instead of two months ago, when the change was merged?
  • Why can kubeadm pass the upgrade locally from exactly the same init version and target version (defaults are generated for both manifests)? Some change caused the defaults to be generated; this happened recently, as they used to not be generated. In any case, the two manifests should be identical to each other unless one was patched, since they are produced using the same method to generate a Pod object (see the snippet below):
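	// Both manifests are decoded with the same helper, so any defaulting
	// applied at decode time should affect both sides of the diff equally.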
	pod1, err := ReadStaticPodFromDisk(path1)
	if err != nil {
		return false, err
	}
	pod2, err := ReadStaticPodFromDisk(path2)
	if err != nil {
		return false, err
	}

There is only one change in apimachinery during that period; I am not sure whether it is related. Anyway, I will follow up and see if we can root-cause it.

Sorry, just got back; I am looking into this.

No worries. This is a problem caused by imports that register types, so it is hard to find the root cause at first.

I am still stuck trying to reproduce the problem with the test https://github.com/kubernetes/kubernetes/pull/120549/files#diff-95eb7d0b2fcbe4b0c4be43f325c40d36b8c7d1b21d9f83a4330f2aa799f47846R75. https://github.com/kubernetes/kubernetes/pull/120549/files#diff-95eb7d0b2fcbe4b0c4be43f325c40d36b8c7d1b21d9f83a4330f2aa799f47846R264-R266

  • the generated YAML does not include the default dnsPolicy.
  • TestCreateLocalEtcdStaticPodManifestFileWithPatches is similar.

After I moved the test TestCreateLocalEtcdStaticPodManifestFileWithPatches to cmd/kubeadm/, it failed like the TestFunc by @neolit123 in the comments above.

https://github.com/kubernetes/kubernetes/compare/master...pacoxu:pacoxu-double-check?expand=1

Adding the import below to cmd/kubeadm/app/apis/kubeadm/v1beta4/zz_generated.defaults.go will fail the test that was written by @neolit123:

	_ "k8s.io/kubernetes/pkg/apis/core/v1"

Reproduced it minimally; will post it on Slack:

package main

import (
	"testing"

	"github.com/pkg/errors"
	v1 "k8s.io/api/core/v1"
	"k8s.io/apimachinery/pkg/runtime"
	"k8s.io/apimachinery/pkg/runtime/schema"
	"k8s.io/apimachinery/pkg/runtime/serializer"
	clientsetscheme "k8s.io/client-go/kubernetes/scheme"
	"sigs.k8s.io/yaml"
)

var input = []byte(`
apiVersion: v1
kind: Pod
metadata:
  name: foo
`)

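// unmarshalFromYamlForCodecs decodes through codecs.DecoderToVersion, i.e. a
// defaulting codec: whether defaults show up in the decoded object depends on
// which defaulting funcs were registered into the scheme at init time.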
func unmarshalFromYamlForCodecs(buffer []byte, gv schema.GroupVersion, codecs serializer.CodecFactory) (runtime.Object, error) {
	const mediaType = runtime.ContentTypeYAML
	info, ok := runtime.SerializerInfoForMediaType(codecs.SupportedMediaTypes(), mediaType)
	if !ok {
		return nil, errors.Errorf("unsupported media type %q", mediaType)
	}

	decoder := codecs.DecoderToVersion(info.Serializer, gv)
	obj, err := runtime.Decode(decoder, buffer)
	if err != nil {
		return nil, errors.Wrapf(err, "failed to decode %s into runtime.Object", buffer)
	}
	return obj, nil
}

func TestFunc(t *testing.T) {
	t.Logf("\ninput:\n%s\n", input)
	obj, err := unmarshalFromYamlForCodecs(input, v1.SchemeGroupVersion, clientsetscheme.Codecs)
	if err != nil {
		t.Fatalf("error: %v\n", err)
	}
	pod := obj.(*v1.Pod)
	// t.Logf("%+v\n", pod)

	output, err := yaml.Marshal(pod)
	if err != nil {
		t.Fatalf("error: %v\n", err)
	}
	t.Logf("\noutput:\n\n%s\n", output)
}

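Note the contrast between the two runs below: on the master commit the decoded Pod comes back with defaults injected (dnsPolicy, restartPolicy, schedulerName, and so on), while on the release-1.27 commit it does not.
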
commit fd8f2c7fc65b467b1856dc9a83c06d12bd92e586 (HEAD -> master, origin/master, origin/HEAD)

=== RUN   TestFunc
   ~/go/src/k8s.io/kubernetes/cmd/kubeadm/kubeadm_test.go:54:
        input:

        apiVersion: v1
        kind: Pod
        metadata:
          name: foo

   ~/go/src/k8s.io/kubernetes/cmd/kubeadm/kubeadm_test.go:66:
        output:

        apiVersion: v1
        kind: Pod
        metadata:
          creationTimestamp: null
          name: foo
        spec:
          containers: null
          dnsPolicy: ClusterFirst
          enableServiceLinks: true
          restartPolicy: Always
          schedulerName: default-scheduler
          securityContext: {}
          terminationGracePeriodSeconds: 30
        status: {}

--- PASS: TestFunc (0.01s)
PASS
ok  	k8s.io/kubernetes/cmd/kubeadm	0.076s

commit 3b874af3878ec7769af3c37afba22fc4d232e57e (HEAD -> release-1.27)

Running tool: /usr/local/go/bin/go test -timeout 30s -run ^TestFunc$ k8s.io/kubernetes/cmd/kubeadm -count=1 -v

=== RUN   TestFunc
  ~/go/src/k8s.io/kubernetes/cmd/kubeadm/kubeadm_test.go:38:
        input:

        apiVersion: v1
        kind: Pod
        metadata:
          name: foo

    ~/go/src/k8s.io/kubernetes/cmd/kubeadm/kubeadm_test.go:51:
        output:

        apiVersion: v1
        kind: Pod
        metadata:
          creationTimestamp: null
          name: foo
        spec:
          containers: null
        status: {}

--- PASS: TestFunc (0.01s)
PASS
ok  	k8s.io/kubernetes/cmd/kubeadm	0.072s

I posted this to the #api-machinery Slack channel: https://kubernetes.slack.com/archives/C0EG7JC6T/p1694264129815039. Let me do some generation tests locally.