kubeflow: Notebook CRD Instances Missing Status Fields

/kind bug

What steps did you take and what happened: This issue occurs occasionally, but I don’t understand the notebook-controller well enough to reproduce it reliably. It has happened on multiple occasions.

From the user’s perspective, when they launch a notebook via the web interface, the newly launched notebook stays with the spinning icon indefinitely and never enables the “CONNECT” button. The statefulset and pod seem to start fine, but the notebook is never accessible.

What did you expect to happen: The notebook should launch and enable the “CONNECT” button after a short period so that the user can access it.

Anything else you would like to add: notebook-controller logs mention reconciler errors like:

2021-07-20T15:12:12.424Z	INFO	controllers.Notebook	Updating Status	{"notebook": "test-ns/test", "namespace": "test-ns", "name": "test"}
2021-07-20T15:12:12.431Z	ERROR	controller-runtime.controller	Reconciler error	{"controller": "notebook", "request": "test-ns/test", "error": "Notebook.kubeflow.org \"test\" is invalid: status.conditions: Invalid value: \"null\": status.conditions in body must be of type array: \"null\""}
github.com/go-logr/zapr.(*zapLogger).Error
	/go/pkg/mod/github.com/go-logr/zapr@v0.1.0/zapr.go:128
sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).reconcileHandler
	/go/pkg/mod/sigs.k8s.io/controller-runtime@v0.2.0/pkg/internal/controller/controller.go:218
sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).processNextWorkItem
	/go/pkg/mod/sigs.k8s.io/controller-runtime@v0.2.0/pkg/internal/controller/controller.go:192
sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).worker
	/go/pkg/mod/sigs.k8s.io/controller-runtime@v0.2.0/pkg/internal/controller/controller.go:171
k8s.io/apimachinery/pkg/util/wait.JitterUntil.func1
	/go/pkg/mod/k8s.io/apimachinery@v0.0.0-20190404173353-6a84e37a896d/pkg/util/wait/wait.go:152
k8s.io/apimachinery/pkg/util/wait.JitterUntil
	/go/pkg/mod/k8s.io/apimachinery@v0.0.0-20190404173353-6a84e37a896d/pkg/util/wait/wait.go:153
k8s.io/apimachinery/pkg/util/wait.Until
	/go/pkg/mod/k8s.io/apimachinery@v0.0.0-20190404173353-6a84e37a896d/pkg/util/wait/wait.go:88

The instance of this Notebook CRD is:

#> k get notebooks test -o yaml
apiVersion: kubeflow.org/v1
kind: Notebook
metadata:
  annotations:
    notebooks.kubeflow.org/server-type: jupyter
  creationTimestamp: "2021-07-20T14:54:58Z"
  generation: 1
  labels:
    app: test
  managedFields:
  - apiVersion: kubeflow.org/v1beta1
    fieldsType: FieldsV1
    fieldsV1:
      f:metadata:
        f:annotations:
          .: {}
          f:notebooks.kubeflow.org/server-type: {}
        f:labels:
          .: {}
          f:app: {}
      f:spec:
        .: {}
        f:template:
          .: {}
          f:spec:
            .: {}
            f:containers: {}
            f:serviceAccountName: {}
            f:tolerations: {}
            f:volumes: {}
    manager: Swagger-Codegen
    operation: Update
    time: "2021-07-20T14:54:58Z"
  name: test
  namespace: test-ns
  resourceVersion: "1175354267"
  selfLink: /apis/kubeflow.org/v1/namespaces/test-ns/notebooks/test
  uid: 697a9472-4252-4803-a8ff-8f3ea38452b6
spec:
  template:
    spec:
      containers:
      - env: []
        image: <REDACTED>:latest
        imagePullPolicy: IfNotPresent
        name: test
        resources:
          limits:
            cpu: "4.8"
            memory: 9.6Gi
          requests:
            cpu: "4"
            memory: 8Gi
        volumeMounts:
        - mountPath: /dev/shm
          name: dshm
      serviceAccountName: default-editor
      tolerations: []
      volumes:
      - emptyDir:
          medium: Memory
        name: dshm

It seems like the status fields are never created, so the notebook-controller complains of an invalid type in the nonexistent status.conditions field when it goes to reconcile against pod state.

Environment:

  • Kubeflow version: public.ecr.aws/j1r0q0g6/notebooks/notebook-controller:v1.3.0-rc.0
  • kfctl version: (use kfctl version):
  • Kubernetes platform: kubeadm
  • Kubernetes version: 1.19
  • OS (e.g. from /etc/os-release): Ubuntu 20.04.2 LTS

About this issue

  • State: closed
  • Created 3 years ago
  • Reactions: 1
  • Comments: 15 (9 by maintainers)

Most upvoted comments

I managed to reproduce the error by following the instructions from here https://github.com/canonical/bundle-kubeflow/issues/460

After further investigation with @kimwnasptd, we found that the root cause of this error is a race condition when updating the various status fields of a Notebook CR.


Details

Initially we wanted to understand how the Notebook CR ended up with null status.Conditions.

The current implementation of the controller performs a Status().Update() in two places:

  1. When the status.readyReplicas field of STS has changed (code)
  2. When the ContainerState of the container that has the same name as the Notebook CR has changed (code).

It is important to note that, before we perform any update, the default values of a Notebook CR status are:

"instance.Status": {
    "conditions": null,
    "readyReplicas": 0,
    "containerState": {}
}
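The null value for conditions is what Go's encoding/json produces for a nil slice, which is the zero value of a slice field. The sketch below illustrates this with simplified stand-in types (NotebookCondition and NotebookStatus here are assumptions for illustration, not the controller's actual API types, and marshalStatus is a hypothetical helper). It shows that a nil slice is encoded as null while an explicitly empty slice is encoded as [], which is what a schema requiring "type array" accepts.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// NotebookCondition and NotebookStatus are simplified stand-ins for the
// controller's API types; only the JSON field names follow the output above.
type NotebookCondition struct {
	Type string `json:"type"`
}

type NotebookStatus struct {
	Conditions    []NotebookCondition `json:"conditions"`
	ReadyReplicas int32               `json:"readyReplicas"`
}

// marshalStatus (hypothetical helper) returns the JSON encoding as a string.
func marshalStatus(s NotebookStatus) string {
	b, err := json.Marshal(s)
	if err != nil {
		panic(err)
	}
	return string(b)
}

func main() {
	// Zero value: Conditions is a nil slice and is encoded as null.
	fmt.Println(marshalStatus(NotebookStatus{}))
	// prints {"conditions":null,"readyReplicas":0}

	// Explicitly empty slice: encoded as [], which is a valid array.
	fmt.Println(marshalStatus(NotebookStatus{Conditions: []NotebookCondition{}}))
	// prints {"conditions":[],"readyReplicas":0}
}
```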

Also, the current implementation checks if the status.ReadyReplicas field of the CR must be updated before it updates the status.Conditions field (code).

We found that the reconciliation loop can be slow enough that, by the time it runs, both the ContainerState in the Pod has changed (e.g. waiting, running) and the status.ReadyReplicas field of the STS has changed (from 0 to 1). In that case a Status().Update() (from this line) is performed before the controller is able to update the Notebook CR’s Status.Conditions field, which therefore still holds its default null value.
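The ordering problem can be reproduced in a small model. This is a simplified simulation, not the controller's code: Status, validate, and reconcile are hypothetical names, validate stands in for the API server's schema check, and conditions are reduced to plain strings. It only demonstrates that updating readyReplicas before conditions have ever been populated gets rejected, and that the stored status then never changes.

```go
package main

import (
	"errors"
	"fmt"
)

// Status is a simplified stand-in for the Notebook CR status.
type Status struct {
	Conditions    []string
	ReadyReplicas int32
}

// validate mimics the schema check seen in the log: conditions must be an array.
func validate(s Status) error {
	if s.Conditions == nil {
		return errors.New(`status.conditions in body must be of type array: "null"`)
	}
	return nil
}

// reconcile mimics the problematic ordering: the readyReplicas update is
// submitted before conditions are populated. On error the stored status is
// returned unchanged, as a rejected update would leave it.
func reconcile(stored Status, stsReady int32, containerState string) (Status, error) {
	s := stored
	if s.ReadyReplicas != stsReady {
		s.ReadyReplicas = stsReady
		if err := validate(s); err != nil { // first status update
			return stored, err
		}
	}
	if containerState != "" {
		s.Conditions = append(s.Conditions, containerState)
		if err := validate(s); err != nil { // second status update
			return stored, err
		}
	}
	return s, nil
}

func main() {
	// Fast loop: container state is observed while readyReplicas is still 0.
	s, err := reconcile(Status{}, 0, "Waiting")
	fmt.Println(s.Conditions, err)
	s, err = reconcile(s, 1, "Running")
	fmt.Println(s.ReadyReplicas, err)

	// Slow loop: both changes are observed in the first pass.
	s, err = reconcile(Status{}, 1, "Running")
	fmt.Println(s.Conditions == nil, err)
}
```

In the slow case every retry starts from the same stored status with nil conditions, so the model fails in the same way on each pass, which is consistent with the notebook never becoming ready.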


Solution Proposal

In order to solve the above situation we propose the following three changes:

  1. Add an initialization function for the status of a Notebook CR
    • Having undefined fields makes troubleshooting of error situations more difficult.
  2. Perform a single Status().Update() at the end of Reconcile()
  3. Mirror the Pod conditions to the Notebook instead of appending them. https://github.com/kubeflow/kubeflow/issues/6528
    • This will be addressed by this PR, which is under review by @kimwnasptd
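A minimal sketch of how the three proposals above could fit together, under stated assumptions: the types are simplified stand-ins, and initStatus and computeStatus are hypothetical names rather than functions from the controller or the PR under review. The idea is to build the whole desired status in memory, with conditions always non-nil and mirrored from the Pod, so that the caller only has to submit one update at the end of the reconcile.

```go
package main

import "fmt"

// Simplified stand-ins for the controller's API types.
type NotebookCondition struct{ Type string }

type NotebookStatus struct {
	Conditions    []NotebookCondition
	ReadyReplicas int32
}

// initStatus (hypothetical) gives every field a defined, non-nil value.
func initStatus(s *NotebookStatus) {
	if s.Conditions == nil {
		s.Conditions = []NotebookCondition{}
	}
}

// computeStatus (hypothetical) builds the complete desired status in memory.
// Pod conditions are mirrored, i.e. they replace the previous list instead of
// being appended to it. The caller would then issue a single status update.
func computeStatus(current NotebookStatus, stsReady int32, podConditions []NotebookCondition) NotebookStatus {
	next := current
	initStatus(&next)
	next.ReadyReplicas = stsReady
	// Copy into a fresh non-nil slice so the result is never nil, even when
	// the Pod has no conditions yet.
	next.Conditions = append([]NotebookCondition{}, podConditions...)
	return next
}

func main() {
	pod := []NotebookCondition{{Type: "Running"}}

	// Both changes observed in one pass: the status is still valid.
	s := computeStatus(NotebookStatus{}, 1, pod)
	fmt.Println(s.ReadyReplicas, len(s.Conditions))

	// Reconciling again with the same Pod state does not grow the list.
	s = computeStatus(s, 1, pod)
	fmt.Println(len(s.Conditions))

	// No Pod conditions yet: the list is empty but not nil.
	s = computeStatus(NotebookStatus{}, 0, nil)
	fmt.Println(s.Conditions != nil, len(s.Conditions))
}
```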

I added this

	if instance.Status.Conditions == nil {
		instance.Status.Conditions = make([]v1beta1.NotebookCondition, 0)
	}

before r.Status().Update, and it seems to work.