kubeflow: [profile-controller] profiles CRD does not preserve unknown fields in plugins.spec
Higher-level KF 1.3 distribution tracking issue: https://github.com/kubeflow/manifests/issues/1798
UPDATE 4/8
We identified the root cause, explained in https://github.com/kubeflow/kubeflow/issues/5813#issuecomment-815450791. The issue affects both the GCP and AWS plugins and should block the release.
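For context, a Profile that uses a plugin looks roughly like this (a sketch based on the Kubeflow docs; names and values here are illustrative). The nested `spec` under `plugins` is exactly the schemaless subtree whose fields get pruned:

```yaml
apiVersion: kubeflow.org/v1
kind: Profile
metadata:
  name: kubeflow-user
spec:
  owner:
    kind: User
    name: user@example.com
  plugins:
    # The plugin spec below carries fields the profiles CRD schema does not
    # declare; under a v1 CRD the API server prunes them away unless the
    # schema explicitly preserves unknown fields.
    - kind: WorkloadIdentity
      spec:
        gcpServiceAccount: kf-user@my-project.iam.gserviceaccount.com
```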
Previous status
We are currently debugging an issue on GCP: when the profile controller attempts to set the IAM policy for a profile, it crashes with a strange log.
Error message
```
2021-04-07T03:03:30.655Z INFO controllers.Profile Start to Reconcile. {"profile": "/kubeflow-user", "namespace": "", "name": "kubeflow-user"}
2021-04-07T03:03:30.655Z INFO controllers.Profile List of labels to be added to namespace {"profile": "/kubeflow-user", "labels": {"app.kubernetes.io/part-of":"kubeflow-profile","istio.io/rev":"asm-192-1","katib-metricscollector-injection":"enabled","pipelines.kubeflow.org/enabled":"true","serving.kubeflow.org/inferenceservice":"enabled"}}
2021-04-07T03:03:30.655Z INFO controllers.Profile Updating Istio AuthorizationPolicy {"profile": "kubeflow-user", "namespace": "kubeflow-user", "name": "ns-owner-access-istio"}
2021-04-07T03:03:30.672Z INFO controllers.Profile Updating RoleBinding {"profile": "kubeflow-user", "namespace": "kubeflow-user", "name": "namespaceAdmin"}
2021-04-07T03:03:30.677Z INFO controllers.Profile No update on resource quota {"profile": "/kubeflow-user", "spec": "&ResourceQuotaSpec{Hard:ResourceList{},Scopes:[],ScopeSelector:nil,}"}
2021-04-07T03:03:30.684Z INFO controllers.Profile Patch Annotation for service account: {"profile": "kubeflow-user", "namespace ": "kubeflow-user", "name ": "default-editor"}
Observed a panic: runtime.boundsError{x:-24, y:0, signed:true, code:0x3} (runtime error: slice bounds out of range [-24:])
2021-04-07T03:03:30.689Z INFO controllers.Profile Setting up iam policy. {"profile": "kubeflow-user", "ServiceAccount": ""}
goroutine 307 [running]:
k8s.io/apimachinery/pkg/util/runtime.logPanic(0x1ad91c0, 0xc000b0cda0)
	/go/pkg/mod/k8s.io/apimachinery@v0.19.3/pkg/util/runtime/runtime.go:74 +0x95
k8s.io/apimachinery/pkg/util/runtime.HandleCrash(0x0, 0x0, 0x0)
	/go/pkg/mod/k8s.io/apimachinery@v0.19.3/pkg/util/runtime/runtime.go:48 +0x89
panic(0x1ad91c0, 0xc000b0cda0)
	/usr/local/go/src/runtime/panic.go:969 +0x1b9
github.com/kubeflow/kubeflow/components/profile-controller/controllers.(*GcpWorkloadIdentity).GetProjectID(0xc0009c49a0, 0x0, 0x1aee480, 0x16, 0x0)
	/workspace/controllers/plugin_workload_identity.go:55 +0x230
github.com/kubeflow/kubeflow/components/profile-controller/controllers.(*GcpWorkloadIdentity).updateWorkloadIdentity(0xc0009c49a0, 0xc000043ba0, 0xd, 0x1be8e3b, 0xe, 0x1ca59c0, 0x1e97a20, 0xc0009c4a10)
	/workspace/controllers/plugin_workload_identity.go:86 +0x45
github.com/kubeflow/kubeflow/components/profile-controller/controllers.(*GcpWorkloadIdentity).ApplyPlugin(0xc0009c49a0, 0xc00039f380, 0xc000154a80, 0x1, 0x1)
	/workspace/controllers/plugin_workload_identity.go:50 +0x247
github.com/kubeflow/kubeflow/components/profile-controller/controllers.(*ProfileReconciler).Reconcile(0xc00039f380, 0x0, 0x0, 0xc000137f40, 0xd, 0xc000a38080, 0xc000930d10, 0x484a88, 0xc000882000)
	/workspace/controllers/profile_controller.go:282 +0x1d75
sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).reconcileHandler(0xc000174120, 0x1a244c0, 0xc0000c4c60, 0x0)
	/go/pkg/mod/sigs.k8s.io/controller-runtime@v0.6.3/pkg/internal/controller/controller.go:244 +0x2a9
sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).processNextWorkItem(0xc000174120, 0x203000)
	/go/pkg/mod/sigs.k8s.io/controller-runtime@v0.6.3/pkg/internal/controller/controller.go:218 +0xb0
sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).worker(0xc000174120)
	/go/pkg/mod/sigs.k8s.io/controller-runtime@v0.6.3/pkg/internal/controller/controller.go:197 +0x2b
k8s.io/apimachinery/pkg/util/wait.BackoffUntil.func1(0xc000a38060)
	/go/pkg/mod/k8s.io/apimachinery@v0.19.3/pkg/util/wait/wait.go:155 +0x5f
k8s.io/apimachinery/pkg/util/wait.BackoffUntil(0xc000a38060, 0x1e4ea20, 0xc0001706c0, 0x1, 0xc000115ec0)
	/go/pkg/mod/k8s.io/apimachinery@v0.19.3/pkg/util/wait/wait.go:156 +0xad
k8s.io/apimachinery/pkg/util/wait.JitterUntil(0xc000a38060, 0x3b9aca00, 0x0, 0xc00007f401, 0xc000115ec0)
	/go/pkg/mod/k8s.io/apimachinery@v0.19.3/pkg/util/wait/wait.go:133 +0x98
k8s.io/apimachinery/pkg/util/wait.Until(0xc000a38060, 0x3b9aca00, 0xc000115ec0)
	/go/pkg/mod/k8s.io/apimachinery@v0.19.3/pkg/util/wait/wait.go:90 +0x4d
created by sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Start.func1
	/go/pkg/mod/sigs.k8s.io/controller-runtime@v0.6.3/pkg/internal/controller/controller.go:179 +0x416
panic: runtime error: slice bounds out of range [-24:] [recovered]
	panic: runtime error: slice bounds out of range [-24:]
(the re-raised panic prints the same goroutine 307 stack as above)
```
Error
Summary: From the log, the crash happens when the profile controller tries to set up the IAM policy for the profile. The core error is `panic: runtime error: slice bounds out of range [-24:]`. One strange detail: the log line `Setting up iam policy. {"profile": "kubeflow-user", "ServiceAccount": ""}` has an empty ServiceAccount string, and it appears right before the crash.
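To illustrate the core error, here is a minimal Go sketch of that failure mode (assuming the code slices off a fixed 24-character GCP service-account suffix, which matches the -24 in the panic):

```go
package main

import "fmt"

// 24 characters long, matching the -24 in the panic message.
const gcpSASuffix = ".iam.gserviceaccount.com"

func main() {
	sa := "" // ServiceAccount was empty, as in the log line above
	// len(sa)-len(gcpSASuffix) = 0-24 = -24; slicing a string with a
	// negative low bound panics at runtime with:
	//   panic: runtime error: slice bounds out of range [-24:]
	fmt.Println(sa[len(sa)-len(gcpSASuffix):])
}
```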
Configuration
Our kustomization chain, from downstream to upstream, is as follows:
- https://github.com/zijianjoy/gcp-blueprints/blob/match-upstream-share/apps/kubeflow-apps/kustomization.yaml
- https://github.com/zijianjoy/gcp-blueprints/blob/match-upstream-share/apps/kubeflow-apps/apps/kustomization.yaml#L25 (with patch https://github.com/zijianjoy/gcp-blueprints/blob/match-upstream-share/apps/kubeflow-apps/apps/workload-identity-bindings-patch.yaml)
- https://github.com/zijianjoy/gcp-blueprints/blob/match-upstream-share/apps/profiles/kustomization.yaml
Urgency
This issue is blocking us from validating the multi-user feature of the KF 1.3 deployment on GCP. Until it is fixed, we are not able to deploy or validate the other Kubeflow apps.
I think I found the root cause. The Profile CRD did not specify the field `preserveUnknownFields`. The default behavior of CRDs changed between v1beta1 and v1 (see https://kubernetes.io/docs/tasks/extend-kubernetes/custom-resources/custom-resource-definitions/#field-pruning): in v1beta1, unknown fields were preserved by default, but after converting to v1, unknown fields are pruned by default. In https://github.com/kubeflow/kubeflow/commit/23a529d0736a6a511ac191168d49cd941864535b, the profiles CRD was bumped to CRD v1. Therefore, we can no longer add unknown fields to `plugins.spec` in https://github.com/kubeflow/manifests/blob/d36fc9c0555c936c7b71fd273b8e4604985ebba8/apps/profiles/upstream/crd/bases/kubeflow.org_profiles.yaml#L65
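Under apiextensions.k8s.io/v1, `preserveUnknownFields: true` is no longer allowed; pruning has to be disabled per subtree in the OpenAPI schema instead. A simplified sketch of what the `plugins` entry in the profiles CRD schema needs (abbreviated, not the full manifest):

```yaml
# Simplified excerpt of the v1 CRD schema for spec.plugins.
plugins:
  type: array
  items:
    type: object
    properties:
      kind:
        type: string
      spec:
        type: object
        # Opt this subtree out of v1's default pruning, so arbitrary
        # plugin-specific fields (gcpServiceAccount, awsIamRole, ...)
        # survive a round-trip through the API server.
        x-kubernetes-preserve-unknown-fields: true
```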
I updated the title to reflect the root cause and to make it clearer that this is not GCP specific.
That is a good point! We might need to look more closely at this CRD change: https://github.com/kubeflow/kubeflow/commit/23a529d0736a6a511ac191168d49cd941864535b?branch=23a529d0736a6a511ac191168d49cd941864535b&diff=split#diff-4cceae2882a85453456ceab092e7c9cc138e55b2cc907437c88e7f2d96f25c48
After adding some logging: the gcpServiceAccount was still set up until this line: https://github.com/kubeflow/kubeflow/blob/ebc0c4f073397537412694f9f255dbb3fbdf2043/components/profile-controller/controllers/profile_controller.go#L593. But after `PatchDefaultPluginSpec()` returns, the `Spec` in the plugin is gone. I think it may be related to the Golang version upgrade: https://github.com/kubeflow/kubeflow/pull/5617/files
I'm not sure why `gcp.GcpServiceAccount` is initially empty; it is supposed to be set to a default value in https://github.com/kubeflow/kubeflow/blob/ebc0c4f073397537412694f9f255dbb3fbdf2043/components/profile-controller/controllers/profile_controller.go#L582-L594
I think the next step is:
The crashing line of code seems to be https://github.com/kubeflow/kubeflow/blob/ebc0c4f073397537412694f9f255dbb3fbdf2043/components/profile-controller/controllers/plugin_workload_identity.go#L55: because something upstream wasn't working, `gcp.GcpServiceAccount` is empty, so it crashes with `len(gcp.GcpServiceAccount) - len(GCP_SA_SUFFIX) = -24`.
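For illustration, here is a defensive version of that parsing step. This is a sketch, not the actual controller code; `getProjectID` and the assumed email format `<name>@<project-id>.iam.gserviceaccount.com` are illustrative:

```go
package main

import (
	"fmt"
	"strings"
)

const gcpSASuffix = ".iam.gserviceaccount.com"

// getProjectID is a hypothetical guarded variant: it validates the suffix
// before trimming it, so an empty or malformed service-account email yields
// an error instead of a slice-bounds panic.
func getProjectID(gcpServiceAccount string) (string, error) {
	if !strings.HasSuffix(gcpServiceAccount, gcpSASuffix) {
		return "", fmt.Errorf("service account %q does not end with %q",
			gcpServiceAccount, gcpSASuffix)
	}
	trimmed := strings.TrimSuffix(gcpServiceAccount, gcpSASuffix)
	parts := strings.Split(trimmed, "@")
	if len(parts) != 2 || parts[1] == "" {
		return "", fmt.Errorf("cannot parse project ID from %q", gcpServiceAccount)
	}
	return parts[1], nil
}

func main() {
	// With the empty value seen in the log, this errors instead of panicking.
	if _, err := getProjectID(""); err != nil {
		fmt.Println("error:", err)
	}
}
```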
From the area of code triggering the issue, it seems completely GCP-specific – the workload identity plugin: https://github.com/kubeflow/kubeflow/blob/ebc0c4f073397537412694f9f255dbb3fbdf2043/components/profile-controller/controllers/plugin_workload_identity.go#L49
Shall we fork profile controller for GCP and fix it in our fork?