noobaa-core: Operator pod panics and restarts when the node hosting the endpoint pod is shut down

Environment info

[root@api.ns.cp.fyre.ibm.com ~]# oc version
Client Version: 4.7.13
Server Version: 4.7.13
Kubernetes Version: v1.20.0+df9c838
[root@api.ns.cp.fyre.ibm.com ~]# noobaa version
INFO[0000] CLI version: 5.9.0
INFO[0000] noobaa-image: noobaa/noobaa-core:master-20210719
INFO[0000] operator-image: noobaa/noobaa-operator:5.9.0
[root@api.ns.cp.fyre.ibm.com ~]#

Actual behavior

  1. The operator pod panics and restarts when the node on which the endpoint pod is scheduled is shut down

Expected behavior

  1. No panic should appear in the operator logs, and the operator pod should not restart

Steps to reproduce

  1. Install noobaa and start a copy object operation into a bucket
  2. While the copy operation is running, shut down the node on which noobaa is installed (I have only the endpoint pod scheduled on that node, no other noobaa pods)
  3. Start the node

Inf node: [root@api.ns.cp.fyre.ibm.com ~]# oc get pod -o wide

NAME                                               READY   STATUS        RESTARTS   AGE   IP             NODE                         NOMINATED NODE   READINESS GATES
noobaa-core-0                                      1/1     Running       0          73m   10.254.3.167   master1.ns.cp.fyre.ibm.com
noobaa-db-pg-0                                     1/1     Running       0          60m   10.254.4.17    master0.ns.cp.fyre.ibm.com
noobaa-default-backing-store-noobaa-pod-62daf8d7   0/1     Terminating   0          38m                  master2.ns.cp.fyre.ibm.com
noobaa-endpoint-565dbbd667-gfzt2                   1/1     Running       0          74m   10.254.4.14    master0.ns.cp.fyre.ibm.com
noobaa-operator-6d54447bc5-hr7sb                   1/1     Running       1          19h   10.254.3.136   master1.ns.cp.fyre.ibm.com

[root@api.ns.cp.fyre.ibm.com ~]#

More information - Screenshots / Logs / Other output

operator.log must-gather.local.2716179569581607829.tar.gz

About this issue

  • State: closed
  • Created 3 years ago
  • Comments: 20 (12 by maintainers)

Most upvoted comments

@Igor and I discussed a solution: instead of panicking immediately when encountering an unknown error, the operator will return a temporary error and the reconcile will requeue. If the error recurs several times, the operator will panic. @nimrod-becker WDYT?

AFAIU it’s not a recurring panic, and after the operator restarted it did not happen again. @nehasharma5 am I right?

If so, I think we should keep the panic and not change it. The panic is there to avoid silent failures when encountering unknown errors. If we see that this specific error repeats in many cases, maybe we can ignore it specifically, but I wouldn’t remove the panic entirely. @igorpick @nimrod-becker WDYT?