noobaa-core: Operator pod panics and restarts when shutting down the node on which the endpoint is scheduled

Environment info

[root@api.ns.cp.fyre.ibm.com ~]# oc version
Client Version: 4.7.13
Server Version: 4.7.13
Kubernetes Version: v1.20.0+df9c838
[root@api.ns.cp.fyre.ibm.com ~]# noobaa version
INFO[0000] CLI version: 5.9.0
INFO[0000] noobaa-image: noobaa/noobaa-core:master-20210719
INFO[0000] operator-image: noobaa/noobaa-operator:5.9.0
[root@api.ns.cp.fyre.ibm.com ~]#

Actual behavior

The operator pod logs a panic and restarts when the node on which the endpoint is scheduled is shut down.

Expected behavior

No panic should appear in the operator logs, and the operator pod should not restart.

Steps to reproduce

Install noobaa and start a copy object operation into a bucket.
While the copy operation is running, shut down the node on which noobaa is installed (I have only the endpoint pod scheduled on that node, no other noobaa pods).
Start the node again.

Inf node:

[root@api.ns.cp.fyre.ibm.com ~]# oc get pod -o wide
NAME                                               READY   STATUS        RESTARTS   AGE   IP             NODE                         NOMINATED NODE   READINESS GATES
noobaa-core-0                                      1/1     Running       0          73m   10.254.3.167   master1.ns.cp.fyre.ibm.com
noobaa-db-pg-0                                     1/1     Running       0          60m   10.254.4.17    master0.ns.cp.fyre.ibm.com
noobaa-default-backing-store-noobaa-pod-62daf8d7   0/1     Terminating   0          38m                  master2.ns.cp.fyre.ibm.com
noobaa-endpoint-565dbbd667-gfzt2                   1/1     Running       0          74m   10.254.4.14    master0.ns.cp.fyre.ibm.com
noobaa-operator-6d54447bc5-hr7sb                   1/1     Running       1          19h   10.254.3.136   master1.ns.cp.fyre.ibm.com

[root@api.ns.cp.fyre.ibm.com ~]#

More information - Screenshots / Logs / Other output

operator.log must-gather.local.2716179569581607829.tar.gz

About this issue

State: closed
Created 3 years ago
Comments: 20 (12 by maintainers)

Most upvoted comments

@Igor and I discussed a solution in which, instead of panicking immediately when encountering an unknown error, the operator returns a temporary error and the reconcile is requeued. If the error recurs several times, the operator will then panic. @nimrod-becker WDYT?

dannyzaken on Aug 19, 2021

AFAIU it’s not a recurring panic, and after the operator restarted it did not happen again. @nehasharma5 am I right?

If so, I think we should keep the panic and not change it. The panic is there to avoid silent failures when encountering unknown errors. If we see that this specific error repeats in many cases, maybe we can ignore it specifically, but I wouldn’t remove the panic entirely. @igorpick @nimrod-becker WDYT?

dannyzaken on Aug 4, 2021