kubeflow: Notebook CRD Instances Missing Status Fields
/kind bug
What steps did you take and what happened: This issue occurs occasionally, but I don't understand the notebook-controller well enough to reproduce it reliably. It has happened on multiple occasions.
From the user's perspective, when they launch a notebook via the web interface, the newly launched notebook shows the spinning icon indefinitely and the “CONNECT” button is never enabled. The StatefulSet and pod seem to start fine, but the notebook is never accessible.
What did you expect to happen: The notebook should launch and enable the “CONNECT” button after a short period so that the user can access it.
Anything else you would like to add: notebook-controller logs mention reconciler errors like:
2021-07-20T15:12:12.424Z INFO controllers.Notebook Updating Status {"notebook": "test-ns/test", "namespace": "test-ns", "name": "test"}
2021-07-20T15:12:12.431Z ERROR controller-runtime.controller Reconciler error {"controller": "notebook", "request": "test-ns/test", "error": "Notebook.kubeflow.org \"test\" is invalid: status.conditions: Invalid value: \"null\": status.conditions in body must be of type array: \"null\""}
github.com/go-logr/zapr.(*zapLogger).Error
/go/pkg/mod/github.com/go-logr/zapr@v0.1.0/zapr.go:128
sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).reconcileHandler
/go/pkg/mod/sigs.k8s.io/controller-runtime@v0.2.0/pkg/internal/controller/controller.go:218
sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).processNextWorkItem
/go/pkg/mod/sigs.k8s.io/controller-runtime@v0.2.0/pkg/internal/controller/controller.go:192
sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).worker
/go/pkg/mod/sigs.k8s.io/controller-runtime@v0.2.0/pkg/internal/controller/controller.go:171
k8s.io/apimachinery/pkg/util/wait.JitterUntil.func1
/go/pkg/mod/k8s.io/apimachinery@v0.0.0-20190404173353-6a84e37a896d/pkg/util/wait/wait.go:152
k8s.io/apimachinery/pkg/util/wait.JitterUntil
/go/pkg/mod/k8s.io/apimachinery@v0.0.0-20190404173353-6a84e37a896d/pkg/util/wait/wait.go:153
k8s.io/apimachinery/pkg/util/wait.Until
/go/pkg/mod/k8s.io/apimachinery@v0.0.0-20190404173353-6a84e37a896d/pkg/util/wait/wait.go:88
The instance of this Notebook CRD is:
#> k get notebooks test -o yaml
apiVersion: kubeflow.org/v1
kind: Notebook
metadata:
  annotations:
    notebooks.kubeflow.org/server-type: jupyter
  creationTimestamp: "2021-07-20T14:54:58Z"
  generation: 1
  labels:
    app: test
  managedFields:
  - apiVersion: kubeflow.org/v1beta1
    fieldsType: FieldsV1
    fieldsV1:
      f:metadata:
        f:annotations:
          .: {}
          f:notebooks.kubeflow.org/server-type: {}
        f:labels:
          .: {}
          f:app: {}
      f:spec:
        .: {}
        f:template:
          .: {}
          f:spec:
            .: {}
            f:containers: {}
            f:serviceAccountName: {}
            f:tolerations: {}
            f:volumes: {}
    manager: Swagger-Codegen
    operation: Update
    time: "2021-07-20T14:54:58Z"
  name: test
  namespace: test-ns
  resourceVersion: "1175354267"
  selfLink: /apis/kubeflow.org/v1/namespaces/test-ns/notebooks/test
  uid: 697a9472-4252-4803-a8ff-8f3ea38452b6
spec:
  template:
    spec:
      containers:
      - env: []
        image: <REDACTED>:latest
        imagePullPolicy: IfNotPresent
        name: test
        resources:
          limits:
            cpu: "4.8"
            memory: 9.6Gi
          requests:
            cpu: "4"
            memory: 8Gi
        volumeMounts:
        - mountPath: /dev/shm
          name: dshm
      serviceAccountName: default-editor
      tolerations: []
      volumes:
      - emptyDir:
          medium: Memory
        name: dshm
It seems like the status fields are never created, so the notebook-controller complains of an invalid type in the nonexistent status.conditions field when it goes to reconcile against the pod state.
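The validation error in the log is consistent with how Go's encoding/json serializes slices: a nil slice is encoded as JSON null, while an empty non-nil slice is encoded as []. A minimal sketch of that behaviour, assuming simplified stand-in types (NotebookStatus and NotebookCondition below are hypothetical and much smaller than the real API types):

```go
package main

import (
	"encoding/json"
	"fmt"
)

// NotebookCondition and NotebookStatus are hypothetical, simplified stand-ins
// for the Notebook CR status types; they are not the real kubeflow API types.
type NotebookCondition struct {
	Type string `json:"type"`
}

type NotebookStatus struct {
	Conditions    []NotebookCondition `json:"conditions"`
	ReadyReplicas int32               `json:"readyReplicas"`
}

// marshalStatus returns the JSON encoding of a status value.
func marshalStatus(s NotebookStatus) string {
	b, err := json.Marshal(s)
	if err != nil {
		panic(err)
	}
	return string(b)
}

func main() {
	// Zero value: Conditions is a nil slice and is encoded as null.
	fmt.Println(marshalStatus(NotebookStatus{}))
	// Explicitly empty slice: Conditions is encoded as [].
	fmt.Println(marshalStatus(NotebookStatus{Conditions: []NotebookCondition{}}))
}
```

Only the second form would satisfy a schema that requires status.conditions to be an array.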
Environment:
- Kubeflow version: public.ecr.aws/j1r0q0g6/notebooks/notebook-controller:v1.3.0-rc.0
- kfctl version: (use kfctl version):
- Kubernetes platform: kubeadm
- Kubernetes version: 1.19
- OS (e.g. from /etc/os-release): Ubuntu 20.04.2 LTS
About this issue
- State: closed
- Created 3 years ago
- Reactions: 1
- Comments: 15 (9 by maintainers)
Commits related to this issue
- Remove issue workaround Remove the workaround for this issue: https://github.com/kubeflow/kubeflow/issues/6056 — committed to canonical/notebook-operators by deleted user 3 years ago
- fix: temporarily disable test_notebook due to upstream bug Can re-enable when [this](https://github.com/kubeflow/kubeflow/issues/6056) is resolved — committed to canonical/bundle-kubeflow by ca-scribner 2 years ago
- Fix #6056: Update Notebook status properly Signed-off-by: Apostolos Gerakaris apoger@arrikto.com — committed to apo-ger/kubeflow by apo-ger 2 years ago
- Fix #6056: Update Notebook status properly (#6628) * Fix #6056: Update Notebook status properly Signed-off-by: Apostolos Gerakaris apoger@arrikto.com * Added suggested code changes Signed-off-by: ... — committed to kubeflow/kubeflow by apo-ger 2 years ago
- Fix #6056: Update Notebook status properly (#6628) * Fix #6056: Update Notebook status properly Signed-off-by: Apostolos Gerakaris apoger@arrikto.com * Added suggested code changes Signed-off-by: ... — committed to arrikto/kubeflow by apo-ger 2 years ago
- Fix #6056: Update Notebook status properly (#6628) * Fix #6056: Update Notebook status properly Signed-off-by: Apostolos Gerakaris apoger@arrikto.com * Added suggested code changes Signed-off-by: ... — committed to maroroman/kubeflow by apo-ger 2 years ago
- Kf1.7 upgrade (#142) * release: Images for the 1.5.0 tag (#6398) Signed-off-by: Kimonas Sotirchos <kimwnasptd@arrikto.com> * added env for app secure cookies (#6395) * build(deps): bump gith... — committed to StatCan/kubeflow by mathis-marcotte a year ago
- Fix #6056: Update Notebook status properly (#6628) * Fix #6056: Update Notebook status properly Signed-off-by: Apostolos Gerakaris apoger@arrikto.com * Added suggested code changes Signed-off-by: ... — committed to harshad16/odh-kubeflow by apo-ger 2 years ago
- Fix #6056: Update Notebook status properly (#6628) * Fix #6056: Update Notebook status properly Signed-off-by: Apostolos Gerakaris apoger@arrikto.com * Added suggested code changes Signed-off-by: ... — committed to ca-scribner/kubeflow by apo-ger 2 years ago
I managed to reproduce the error by following the instructions from here https://github.com/canonical/bundle-kubeflow/issues/460
After further investigation with @kimwnasptd, we found that the root cause of this error is a race condition when updating the various status fields of a Notebook CR.
Details
Initially, we wanted to understand how the Notebook CR ended up with a null status.Conditions.
The current implementation of the controller performs a Status().Update() in two places:
- when the status.readyReplicas field of the STS has changed (code)
- when the ContainerState of the container that has the same name as the Notebook CR has changed (code).
It is important to note that before we perform any update, the Notebook CR status holds its default values, including a null status.Conditions.
Also, the current implementation checks whether the status.ReadyReplicas field of the CR must be updated before it updates the status.Conditions field (code).
So, we found that the reconciliation loop can be slow enough that both the ContainerState in the Pod has changed (i.e. waiting, running) and the status.ReadyReplicas field of the STS has changed (from 0 to 1). In that case a Status().Update() (from this line) is performed before the controller is able to update the Notebook CR's Status.Conditions field accordingly, which remains at the default null value.
Solution Proposal
In order to solve the above situation, we propose three changes, one of which is to move Status().Update() to the end of Reconcile().
I added this before r.Status().Update and it seems to work.
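The race described under Details, and the idea of a single Status().Update() at the end of Reconcile(), can be illustrated with a small simulation. This is a sketch under stated assumptions, not the notebook-controller code: Status, Observed, updateStatus and both reconcile functions are hypothetical, updateStatus only imitates the API server rejecting a null status.conditions, and initializing the conditions to an empty slice is an assumption of this sketch rather than something the thread confirms.

```go
package main

import (
	"errors"
	"fmt"
)

// Hypothetical, simplified stand-ins; not the real notebook-controller types.
type Condition struct{ Type string }

type Status struct {
	ReadyReplicas int32
	Conditions    []Condition // nil until something populates it
}

// Observed holds what one reconcile sees in the cluster.
type Observed struct {
	ReadyReplicas  int32  // from the StatefulSet status
	ContainerState string // e.g. "Waiting" or "Running", from the Pod
}

// updateStatus imitates CRD validation: a nil slice would be sent as JSON
// null and rejected because the schema requires an array.
func updateStatus(stored *Status, desired Status) error {
	if desired.Conditions == nil {
		return errors.New(`status.conditions in body must be of type array: "null"`)
	}
	*stored = desired
	return nil
}

// reconcileTwoUpdates follows the ordering described above: readyReplicas is
// written first, and conditions are only reached if that write succeeds.
func reconcileTwoUpdates(stored *Status, obs Observed) error {
	desired := *stored
	if obs.ReadyReplicas != desired.ReadyReplicas {
		desired.ReadyReplicas = obs.ReadyReplicas
		if err := updateStatus(stored, desired); err != nil {
			return err // conditions are never populated on this path
		}
	}
	// Simplification: any observed container state counts as a change.
	if obs.ContainerState != "" {
		desired.Conditions = append(desired.Conditions, Condition{Type: obs.ContainerState})
		return updateStatus(stored, desired)
	}
	return nil
}

// reconcileSingleUpdate computes the whole status first and writes it once.
func reconcileSingleUpdate(stored *Status, obs Observed) error {
	desired := Status{
		ReadyReplicas: obs.ReadyReplicas,
		// Copy into a non-nil slice so the field is never sent as null.
		Conditions: append([]Condition{}, stored.Conditions...),
	}
	if obs.ContainerState != "" {
		desired.Conditions = append(desired.Conditions, Condition{Type: obs.ContainerState})
	}
	return updateStatus(stored, desired)
}

func main() {
	// Both the replica count and the container state changed before this reconcile.
	obs := Observed{ReadyReplicas: 1, ContainerState: "Running"}
	var a, b Status
	fmt.Println("two updates:", reconcileTwoUpdates(&a, obs))
	fmt.Println("single update:", reconcileSingleUpdate(&b, obs))
}
```

In this simulation the two-update ordering fails on its first write whenever both values changed within the same reconcile, and because it returns early it would fail the same way on every retry. The single-update version writes a complete status in one call.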