noobaa-core: IO upload fails when the worker node running the NooBaa core pod is shut down

Environment info

  • NooBaa Version: 5.9.2 (CLI 5.9.2, mcg-operator 4.9.2)
  • Platform: OpenShift 4.9.5 (ODF 4.9.2), Kubernetes v1.22.0-rc.0+a44d0f0

oc version
Client Version: 4.9.5
Server Version: 4.9.5
Kubernetes Version: v1.22.0-rc.0+a44d0f0

oc get csv
NAME                  DISPLAY                       VERSION   REPLACES              PHASE
mcg-operator.v4.9.2   NooBaa Operator               4.9.2     mcg-operator.v4.9.1   Succeeded
ocs-operator.v4.9.2   OpenShift Container Storage   4.9.2     ocs-operator.v4.9.1   Succeeded
odf-operator.v4.9.2   OpenShift Data Foundation     4.9.2     odf-operator.v4.9.1   Succeeded

ODF:4.9.2-9 build

noobaa status
INFO[0000] CLI version: 5.9.2
INFO[0000] noobaa-image: quay.io/rhceph-dev/odf4-mcg-core-rhel8@sha256:5507f2c1074bfb023415f0fef16ec42fbe6e90c540fc45f1111c8c929e477910
INFO[0000] operator-image: quay.io/rhceph-dev/odf4-mcg-rhel8-operator@sha256:b314ad9f15a10025bade5c86857a7152c438b405fdba26f64826679a5c5bff1b
INFO[0000] noobaa-db-image: quay.io/rhceph-dev/rhel8-postgresql-12@sha256:623bdaa1c6ae047db7f62d82526220fac099837afd8770ccc6acfac4c7cff100
INFO[0000] Namespace: openshift-storage

Actual behavior

  1. IO was spawned in the background by 3 concurrent users (50G/40G/30G files) onto their individual buckets, and then the worker1 node, which was running the noobaa-core pod at the time, was shut down. The pod moved to the worker2 node; however, the IO failed as shown below.

grep upload /tmp/noobaa-core-worker1.down.04Feb2022.log
upload failed: …/dd_file_40G to s3://newbucket-u5300-01feb/dd_file_40G An error occurred (InternalError) when calling the UploadPart operation (reached max retries: 2): We encountered an internal error. Please try again.
upload failed: …/dd_file_30G to s3://newbucket-u5302-01feb/dd_file_30G An error occurred (InternalError) when calling the UploadPart operation (reached max retries: 2): We encountered an internal error. Please try again.
upload failed: …/dd_file_50G to s3://newbucket-u5301-01feb/dd_file_50G An error occurred (InternalError) when calling the UploadPart operation (reached max retries: 2): We encountered an internal error. Please try again.

urllib3/connectionpool.py:1013: InsecureRequestWarning: Unverified HTTPS request is being made to host ‘10.17.127.180’. Adding certificate verification is strongly advised. See: https://urllib3.readthedocs.io/en/latest/advanced-usage.html#ssl-warnings
upload failed: …/dd_file_30G to s3://newbucket-u5302-01feb/dd_file_30G An error occurred (InternalError) when calling the UploadPart operation (reached max retries: 2): We encountered an internal error. Please try again.
…
urllib3/connectionpool.py:1013: InsecureRequestWarning: Unverified HTTPS request is being made to host ‘10.17.127.179’. Adding certificate verification is strongly advised. See: https://urllib3.readthedocs.io/en/latest/advanced-usage.html#ssl-warnings
urllib3/connectionpool.py:1013: InsecureRequestWarning: Unverified HTTPS request is being made to host ‘10.17.127.179’. Adding certificate verification is strongly advised. See: https://urllib3.readthedocs.io/en/latest/advanced-usage.html#ssl-warnings
upload failed: …/dd_file_50G to s3://newbucket-u5301-01feb/dd_file_50G An error occurred (InternalError) when calling the UploadPart operation (reached max retries: 2): We encountered an internal error. Please try again.
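A diagnostic that was not part of the original run, but may help confirm what these failures left behind: the interrupted multipart uploads can be listed through the S3 API. The endpoints, buckets and credentials are the ones from the reproduction step below; the s3api call itself is only a suggested check.

# Hypothetical diagnostic, not from the original report: list multipart uploads that were
# started but never completed on each bucket (same per-user credentials as the cp commands below).
aws --endpoint https://10.17.127.178 --no-verify-ssl s3api list-multipart-uploads --bucket newbucket-u5300-01feb
aws --endpoint https://10.17.127.179 --no-verify-ssl s3api list-multipart-uploads --bucket newbucket-u5301-01feb
aws --endpoint https://10.17.127.180 --no-verify-ssl s3api list-multipart-uploads --bucket newbucket-u5302-01feb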

Steps to reproduce

  1. Run concurrent IO from 3 nodes onto the individual buckets:

AWS_ACCESS_KEY_ID=vCvYu1lY0AfMTJZ5n9HB AWS_SECRET_ACCESS_KEY=LFHnnQsxxS0iXOS4eDkNU1K7x1IfYG8CtgrvIsin aws --endpoint https://10.17.127.178 --no-verify-ssl s3 cp /root/dd_file_40G s3://newbucket-u5300-01feb &
AWS_ACCESS_KEY_ID=mdTnAsuzireuISl5DFXO AWS_SECRET_ACCESS_KEY=y0UvsBKs+R4FFez+FtV/tqT7e+hSToizQqPApGog aws --endpoint https://10.17.127.179 --no-verify-ssl s3 cp /root/dd_file_50G s3://newbucket-u5301-01feb &
AWS_ACCESS_KEY_ID=DDZVAUjYrCCODgg7sCbZ AWS_SECRET_ACCESS_KEY=ku5QVHRa45O/XM+z2kRwHtLtIOh1J64dyPa6Ig9b aws --endpoint https://10.17.127.180 --no-verify-ssl s3 cp /root/dd_file_30G s3://newbucket-u5302-01feb &
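For reference, a minimal sketch of how this workload can be driven end to end. The dd commands and the AWS_MAX_ATTEMPTS/AWS_RETRY_MODE settings are assumptions added for illustration (raising the client retry budget only makes the failure easier to observe; it does not address the server-side InternalError); the endpoints, file names and buckets are the ones used above.

# Hypothetical reproduction helper (not the exact script used in the report).
# Generate the large test files (sizes from the report: 40G / 50G / 30G).
dd if=/dev/zero of=/root/dd_file_40G bs=1M count=40960
dd if=/dev/zero of=/root/dd_file_50G bs=1M count=51200
dd if=/dev/zero of=/root/dd_file_30G bs=1M count=30720

# Optionally raise the AWS CLI (v2) retry budget before starting the uploads.
export AWS_RETRY_MODE=standard AWS_MAX_ATTEMPTS=10

# Run the three uploads in the background against the three MetalLB IPs,
# exporting the per-user AWS_ACCESS_KEY_ID / AWS_SECRET_ACCESS_KEY as in the commands above.
aws --endpoint https://10.17.127.178 --no-verify-ssl s3 cp /root/dd_file_40G s3://newbucket-u5300-01feb &
aws --endpoint https://10.17.127.179 --no-verify-ssl s3 cp /root/dd_file_50G s3://newbucket-u5301-01feb &
aws --endpoint https://10.17.127.180 --no-verify-ssl s3 cp /root/dd_file_30G s3://newbucket-u5302-01feb &
wait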

  2. NooBaa pods are as follows:

NAME                                               READY   STATUS    RESTARTS         AGE     IP              NODE                                  NOMINATED NODE   READINESS GATES
noobaa-core-0                                      1/1     Running   0                15h     10.254.14.166   worker1.rkomandu-ta.cp.fyre.ibm.com   <none>           <none>
noobaa-db-pg-0                                     1/1     Running   0                2d22h   10.254.23.217   worker2.rkomandu-ta.cp.fyre.ibm.com   <none>           <none>
noobaa-default-backing-store-noobaa-pod-77176233   1/1     Running   0                3d3h    10.254.18.15    worker0.rkomandu-ta.cp.fyre.ibm.com   <none>           <none>
noobaa-endpoint-7bdd48fccb-8cjcn                   1/1     Running   0                179m    10.254.16.88    worker0.rkomandu-ta.cp.fyre.ibm.com   <none>           <none>
noobaa-endpoint-7bdd48fccb-f7llk                   1/1     Running   0                178m    10.254.14.167   worker1.rkomandu-ta.cp.fyre.ibm.com   <none>           <none>
noobaa-endpoint-7bdd48fccb-gwtqr                   1/1     Running   0                178m    10.254.20.132   worker2.rkomandu-ta.cp.fyre.ibm.com   <none>           <none>
noobaa-operator-54877b7dc9-zjsvl                   1/1     Running   0                2d23h   10.254.18.86    worker0.rkomandu-ta.cp.fyre.ibm.com   <none>           <none>
ocs-metrics-exporter-7955bfc785-cn2zl              1/1     Running   0                2d23h   10.254.18.84    worker0.rkomandu-ta.cp.fyre.ibm.com   <none>           <none>
ocs-operator-57d785c8c7-bqpfl                      1/1     Running   16 (6h51m ago)   2d23h   10.254.18.90    worker0.rkomandu-ta.cp.fyre.ibm.com   <none>           <none>
odf-console-756c9c8bc7-4jsfl                       1/1     Running   0                2d23h   10.254.18.88    worker0.rkomandu-ta.cp.fyre.ibm.com   <none>           <none>
odf-operator-controller-manager-89746b599-z64f6    2/2     Running   16 (9h ago)      2d23h   10.254.18.87    worker0.rkomandu-ta.cp.fyre.ibm.com   <none>           <none>
rook-ceph-operator-74864f7c6f-rlf6c                1/1     Running   0                2d23h   10.254.18.82    worker0.rkomandu-ta.cp.fyre.ibm.com   <none>           <none>

  3. Worker1 was shut down. Node status just before the shutdown (all nodes Ready):

[root@api.rkomandu-ta.cp.fyre.ibm.com ~]# oc get nodes
NAME                                  STATUS   ROLES    AGE   VERSION
master0.rkomandu-ta.cp.fyre.ibm.com   Ready    master   56d   v1.22.0-rc.0+a44d0f0
master1.rkomandu-ta.cp.fyre.ibm.com   Ready    master   56d   v1.22.0-rc.0+a44d0f0
master2.rkomandu-ta.cp.fyre.ibm.com   Ready    master   56d   v1.22.0-rc.0+a44d0f0
worker0.rkomandu-ta.cp.fyre.ibm.com   Ready    worker   56d   v1.22.0-rc.0+a44d0f0
worker1.rkomandu-ta.cp.fyre.ibm.com   Ready    worker   56d   v1.22.0-rc.0+a44d0f0
worker2.rkomandu-ta.cp.fyre.ibm.com   Ready    worker   56d   v1.22.0-rc.0+a44d0f0

Worker1, the node where noobaa-core was running, was then brought down.
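The report does not show the exact command used to bring the node down; a hedged sketch of one common way to do it on OpenShift, and to watch the resulting failover, is:

# Hypothetical: shut down worker1 from a debug pod on that node (one of several possible methods).
oc debug node/worker1.rkomandu-ta.cp.fyre.ibm.com -- chroot /host shutdown -h now

# Watch the noobaa pods fail over (matches the "Every 3.0s" watch output shown later).
watch -n 3 'oc get pods -n openshift-storage -o wide'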

  4. The noobaa-core pod starts moving to worker2:

NAME                                               READY   STATUS              RESTARTS        AGE     IP              NODE                                  NOMINATED NODE   READINESS GATES
noobaa-core-0                                      0/1     ContainerCreating   0               1s      <none>          worker2.rkomandu-ta.cp.fyre.ibm.com   <none>           <none>
noobaa-db-pg-0                                     1/1     Running             0               2d22h   10.254.23.217   worker2.rkomandu-ta.cp.fyre.ibm.com   <none>           <none>
noobaa-default-backing-store-noobaa-pod-77176233   1/1     Running             0               3d3h    10.254.18.15    worker0.rkomandu-ta.cp.fyre.ibm.com   <none>           <none>
noobaa-endpoint-7bdd48fccb-8cjcn                   1/1     Running             0               3h14m   10.254.16.88    worker0.rkomandu-ta.cp.fyre.ibm.com   <none>           <none>
noobaa-endpoint-7bdd48fccb-gwtqr                   1/1     Running             0               3h13m   10.254.20.132   worker2.rkomandu-ta.cp.fyre.ibm.com   <none>           <none>
noobaa-endpoint-7bdd48fccb-hjh9r                   0/1     Pending             0               1s      <none>          <none>                                <none>           <none>
noobaa-operator-54877b7dc9-zjsvl                   1/1     Running             0               2d23h   10.254.18.86    worker0.rkomandu-ta.cp.fyre.ibm.com   <none>           <none>
ocs-metrics-exporter-7955bfc785-cn2zl              1/1     Running             0               2d23h   10.254.18.84    worker0.rkomandu-ta.cp.fyre.ibm.com   <none>           <none>
ocs-operator-57d785c8c7-bqpfl                      1/1     Running             16 (7h6m ago)   2d23h   10.254.18.90    worker0.rkomandu-ta.cp.fyre.ibm.com   <none>           <none>
odf-console-756c9c8bc7-4jsfl                       1/1     Running             0               2d23h   10.254.18.88    worker0.rkomandu-ta.cp.fyre.ibm.com   <none>           <none>
odf-operator-controller-manager-89746b599-z64f6    2/2     Running             16 (10h ago)    2d23h   10.254.18.87    worker0.rkomandu-ta.cp.fyre.ibm.com   <none>           <none>
rook-ceph-operator-74864f7c6f-rlf6c                1/1     Running             0               2d23h   10.254.18.82    worker0.rkomandu-ta.cp.fyre.ibm.com   <none>           <none>

  5. oc get nodes (worker1 is now NotReady):

[root@api.rkomandu-ta.cp.fyre.ibm.com ~]# oc get nodes
NAME                                  STATUS     ROLES    AGE   VERSION
master0.rkomandu-ta.cp.fyre.ibm.com   Ready      master   56d   v1.22.0-rc.0+a44d0f0
master1.rkomandu-ta.cp.fyre.ibm.com   Ready      master   56d   v1.22.0-rc.0+a44d0f0
master2.rkomandu-ta.cp.fyre.ibm.com   Ready      master   56d   v1.22.0-rc.0+a44d0f0
worker0.rkomandu-ta.cp.fyre.ibm.com   Ready      worker   56d   v1.22.0-rc.0+a44d0f0
worker1.rkomandu-ta.cp.fyre.ibm.com   NotReady   worker   56d   v1.22.0-rc.0+a44d0f0
worker2.rkomandu-ta.cp.fyre.ibm.com   Ready      worker   56d   v1.22.0-rc.0+a44d0f0

  6. The noobaa-core pod is Running on worker2 (migrated from the worker1 node):

Every 3.0s: oc get pods -n openshift-storage -o wide api.rkomandu-ta.cp.fyre.ibm.com: Fri Feb 4 01:08:28 2022

NAME                                               READY   STATUS    RESTARTS        AGE     IP              NODE                                  NOMINATED NODE   READINESS GATES
noobaa-core-0                                      1/1     Running   0               60s     10.254.20.168   worker2.rkomandu-ta.cp.fyre.ibm.com   <none>           <none>
noobaa-db-pg-0                                     1/1     Running   0               2d22h   10.254.23.217   worker2.rkomandu-ta.cp.fyre.ibm.com   <none>           <none>
noobaa-default-backing-store-noobaa-pod-77176233   1/1     Running   0               3d3h    10.254.18.15    worker0.rkomandu-ta.cp.fyre.ibm.com   <none>           <none>
noobaa-endpoint-7bdd48fccb-8cjcn                   1/1     Running   0               3h15m   10.254.16.88    worker0.rkomandu-ta.cp.fyre.ibm.com   <none>           <none>
noobaa-endpoint-7bdd48fccb-gwtqr                   1/1     Running   0               3h14m   10.254.20.132   worker2.rkomandu-ta.cp.fyre.ibm.com   <none>           <none>
noobaa-endpoint-7bdd48fccb-hjh9r                   0/1     Pending   0               60s     <none>          <none>                                <none>           <none>
noobaa-operator-54877b7dc9-zjsvl                   1/1     Running   0               2d23h   10.254.18.86    worker0.rkomandu-ta.cp.fyre.ibm.com   <none>           <none>
ocs-metrics-exporter-7955bfc785-cn2zl              1/1     Running   0               2d23h   10.254.18.84    worker0.rkomandu-ta.cp.fyre.ibm.com   <none>           <none>
ocs-operator-57d785c8c7-bqpfl                      1/1     Running   16 (7h7m ago)   2d23h   10.254.18.90    worker0.rkomandu-ta.cp.fyre.ibm.com   <none>           <none>
odf-console-756c9c8bc7-4jsfl                       1/1     Running   0               2d23h   10.254.18.88    worker0.rkomandu-ta.cp.fyre.ibm.com   <none>           <none>
odf-operator-controller-manager-89746b599-z64f6    2/2     Running   16 (10h ago)    2d23h   10.254.18.87    worker0.rkomandu-ta.cp.fyre.ibm.com   <none>           <none>
rook-ceph-operator-74864f7c6f-rlf6c                1/1     Running   0               2d23h   10.254.18.82    worker0.rkomandu-ta.cp.fyre.ibm.com   <none>           <none>

  7. The upload fails as shown above.

Expected behavior

  1. The upload shouldn’t fail: the IO can still be serviced via the MetalLB IPs (HA is available), so the uploads should complete.
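As context for the HA expectation, the LoadBalancer services behind the MetalLB IPs and the endpoint pods backing them can be checked roughly as follows (a sketch only; the exact service names depend on how the MetalLB services were created in this cluster):

# Illustrative check: which LoadBalancer services expose S3, and which endpoint pods remain Ready.
oc -n openshift-storage get svc -o wide | grep -i loadbalancer
oc -n openshift-storage get pods -o wide | grep noobaa-endpoint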
NooBaa endpoint log snippet from worker0 (note that the worker1 node had been brought down):
----------------------------------------------------------------------------------------------------
Feb-4 9:18:06.793 [Endpoint/14] [ERROR] CONSOLE:: Error: Warning stuck surround_count item
    at Semaphore.surround_count (/root/node_modules/noobaa-core/src/util/semaphore.js:86:29)
    at async NamespaceFS.upload_multipart (/root/node_modules/noobaa-core/src/sdk/namespace_fs.js:831:30)
    at async Object.put_object_uploadId [as handler] (/root/node_modules/noobaa-core/src/endpoint/s3/ops/s3_put_object_uploadId.js:31:17)
    at async handle_request (/root/node_modules/noobaa-core/src/endpoint/s3/s3_rest.js:149:19)
    at async Object.s3_rest [as handler] (/root/node_modules/noobaa-core/src/endpoint/s3/s3_rest.js:68:9)
Feb-4 9:18:06.794 [Endpoint/14] [ERROR] CONSOLE:: Error: Warning stuck surround_count item
    at Semaphore.surround_count (/root/node_modules/noobaa-core/src/util/semaphore.js:86:29)
    at async NamespaceFS.upload_multipart (/root/node_modules/noobaa-core/src/sdk/namespace_fs.js:831:30)
    at async Object.put_object_uploadId [as handler] (/root/node_modules/noobaa-core/src/endpoint/s3/ops/s3_put_object_uploadId.js:31:17)
    at async handle_request (/root/node_modules/noobaa-core/src/endpoint/s3/s3_rest.js:149:19)
    at async Object.s3_rest [as handler] (/root/node_modules/noobaa-core/src/endpoint/s3/s3_rest.js:68:9)
Feb-4 9:18:06.794 [Endpoint/14] [ERROR] CONSOLE:: Error: Warning stuck surround_count item
    at Semaphore.surround_count (/root/node_modules/noobaa-core/src/util/semaphore.js:86:29)
    at async NamespaceFS.upload_multipart (/root/node_modules/noobaa-core/src/sdk/namespace_fs.js:831:30)
    at async Object.put_object_uploadId [as handler] (/root/node_modules/noobaa-core/src/endpoint/s3/ops/s3_put_object_uploadId.js:31:17)
    at async handle_request (/root/node_modules/noobaa-core/src/endpoint/s3/s3_rest.js:149:19)
    at async Object.s3_rest [as handler] (/root/node_modules/noobaa-core/src/endpoint/s3/s3_rest.js:68:9)
Feb-4 9:18:06.993 [Endpoint/14] [ERROR] CONSOLE:: Error: Warning stuck surround_count item
    at Semaphore.surround_count (/root/node_modules/noobaa-core/src/util/semaphore.js:86:29)
    at async NamespaceFS.upload_multipart (/root/node_modules/noobaa-core/src/sdk/namespace_fs.js:831:30)
    at async Object.put_object_uploadId [as handler] (/root/node_modules/noobaa-core/src/endpoint/s3/ops/s3_put_object_uploadId.js:31:17)
    at async handle_request (/root/node_modules/noobaa-core/src/endpoint/s3/s3_rest.js:149:19)
    at async Object.s3_rest [as handler] (/root/node_modules/noobaa-core/src/endpoint/s3/s3_rest.js:68:9)
Feb-4 9:18:07.115 [Endpoint/14] [ERROR] CONSOLE:: Error: Warning stuck surround_count item
    at Semaphore.surround_count (/root/node_modules/noobaa-core/src/util/semaphore.js:86:29)
    at async NamespaceFS.upload_multipart (/root/node_modules/noobaa-core/src/sdk/namespace_fs.js:831:30)
    at async Object.put_object_uploadId [as handler] (/root/node_modules/noobaa-core/src/endpoint/s3/ops/s3_put_object_uploadId.js:31:17)
    at async handle_request (/root/node_modules/noobaa-core/src/endpoint/s3/s3_rest.js:149:19)
    at async Object.s3_rest [as handler] (/root/node_modules/noobaa-core/src/endpoint/s3/s3_rest.js:68:9)
Feb-4 9:18:07.276 [Endpoint/14] [ERROR] CONSOLE:: Error: Warning stuck surround_count item
    at Semaphore.surround_count (/root/node_modules/noobaa-core/src/util/semaphore.js:86:29)
    at async NamespaceFS.upload_multipart (/root/node_modules/noobaa-core/src/sdk/namespace_fs.js:831:30)
    at async Object.put_object_uploadId [as handler] (/root/node_modules/noobaa-core/src/endpoint/s3/ops/s3_put_object_uploadId.js:31:17)
    at async handle_request (/root/node_modules/noobaa-core/src/endpoint/s3/s3_rest.js:149:19)
    at async Object.s3_rest [as handler] (/root/node_modules/noobaa-core/src/endpoint/s3/s3_rest.js:68:9)
Feb-4 9:18:07.336 [Endpoint/14] [ERROR] CONSOLE:: Error: Warning stuck surround_count item
    at Semaphore.surround_count (/root/node_modules/noobaa-core/src/util/semaphore.js:86:29)
    at async NamespaceFS.upload_multipart (/root/node_modules/noobaa-core/src/sdk/namespace_fs.js:831:30)
    at async Object.put_object_uploadId [as handler] (/root/node_modules/noobaa-core/src/endpoint/s3/ops/s3_put_object_uploadId.js:31:17)
    at async handle_request (/root/node_modules/noobaa-core/src/endpoint/s3/s3_rest.js:149:19)
    at async Object.s3_rest [as handler] (/root/node_modules/noobaa-core/src/endpoint/s3/s3_rest.js:68:9)
Feb-4 9:18:08.107 [Endpoint/14] [ERROR] CONSOLE:: Error: Warning stuck surround_count item
    at Semaphore.surround_count (/root/node_modules/noobaa-core/src/util/semaphore.js:86:29)
    at async NamespaceFS.upload_multipart (/root/node_modules/noobaa-core/src/sdk/namespace_fs.js:831:30)
    at async Object.put_object_uploadId [as handler] (/root/node_modules/noobaa-core/src/endpoint/s3/ops/s3_put_object_uploadId.js:31:17)
    at async handle_request (/root/node_modules/noobaa-core/src/endpoint/s3/s3_rest.js:149:19)
    at async Object.s3_rest [as handler] (/root/node_modules/noobaa-core/src/endpoint/s3/s3_rest.js:68:9)
Feb-4 9:18:08.307 [Endpoint/14] [ERROR] CONSOLE:: Error: Warning stuck surround_count item
    at Semaphore.surround_count (/root/node_modules/noobaa-core/src/util/semaphore.js:86:29)
    at async NamespaceFS.upload_multipart (/root/node_modules/noobaa-core/src/sdk/namespace_fs.js:831:30)
    at async Object.put_object_uploadId [as handler] (/root/node_modules/noobaa-core/src/endpoint/s3/ops/s3_put_object_uploadId.js:31:17)
    at async handle_request (/root/node_modules/noobaa-core/src/endpoint/s3/s3_rest.js:149:19)
    at async Object.s3_rest [as handler] (/root/node_modules/noobaa-core/src/endpoint/s3/s3_rest.js:68:9)
Feb-4 9:18:08.635 [Endpoint/14] [ERROR] CONSOLE:: Error: Warning stuck surround_count item
    at Semaphore.surround_count (/root/node_modules/noobaa-core/src/util/semaphore.js:86:29)
    at async NamespaceFS.upload_multipart (/root/node_modules/noobaa-core/src/sdk/namespace_fs.js:831:30)
    at async Object.put_object_uploadId [as handler] (/root/node_modules/noobaa-core/src/endpoint/s3/ops/s3_put_object_uploadId.js:31:17)
    at async handle_request (/root/node_modules/noobaa-core/src/endpoint/s3/s3_rest.js:149:19)
    at async Object.s3_rest [as handler] (/root/node_modules/noobaa-core/src/endpoint/s3/s3_rest.js:68:9)
2022-02-04 09:18:11.796835 [PID-14/TID-14] [L1] FS::FSWorker::Begin: Readdir _path=/nsfs/noobaa-s3res-4080029599
2022-02-04 09:18:11.796976 [PID-14/TID-24] [L1] FS::FSWorker::Execute: Readdir _path=/nsfs/noobaa-s3res-4080029599 _uid=0 _gid=0 _backend=GPFS
2022-02-04 09:18:11.797332 [PID-14/TID-24] [L1] FS::FSWorker::Execute: Readdir _path=/nsfs/noobaa-s3res-4080029599  took: 0.270146 ms
2022-02-04 09:18:11.797409 [PID-14/TID-14] [L1] FS::FSWorker::OnOK: Readdir _path=/nsfs/noobaa-s3res-4080029599
2022-02-04 09:18:11.797557 [PID-14/TID-14] [L1] FS::FSWorker::Begin: Stat _path=/nsfs/noobaa-s3res-4080029599
2022-02-04 09:18:11.797623 [PID-14/TID-23] [L1] FS::FSWorker::Execute: Stat _path=/nsfs/noobaa-s3res-4080029599 _uid=0 _gid=0 _backend=GPFS
2022-02-04 09:18:11.797679 [PID-14/TID-23] [L1] FS::FSWorker::Execute: Stat _path=/nsfs/noobaa-s3res-4080029599  took: 0.01195 ms
2022-02-04 09:18:11.797720 [PID-14/TID-14] [L1] FS::Stat::OnOK: _path=/nsfs/noobaa-s3res-4080029599 _stat_res.st_ino=3 _stat_res.st_size=262144
Feb-4 9:18:11.797 [Endpoint/14] [L0] core.server.bg_services.namespace_monitor:: update_last_monitoring: monitoring for noobaa-s3res-4080029599, 61f22b92543779002bee71a5 finished successfully..
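The snippet above was taken from an endpoint pod that stayed up; a hedged sketch of how the relevant lines can be pulled from the surviving endpoint pods (pod names taken from the listings above, grep patterns and output paths are illustrative):

# Illustrative only: collect and filter logs from the endpoint pods that stayed Running.
oc -n openshift-storage logs noobaa-endpoint-7bdd48fccb-8cjcn | grep -E 'surround_count|InternalError' > /tmp/noobaa-endpoint-worker0.log
oc -n openshift-storage logs noobaa-endpoint-7bdd48fccb-gwtqr | grep -E 'surround_count|InternalError' > /tmp/noobaa-endpoint-worker2.log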

Must-gather (“oc adm must-gather”) was collected on the cluster:

1 - when the node (worker1) was down: must-gather-collected-when-worker1.down.tar.gz

2 - when the node (worker1) returned to the Active state: must-gather-collected-when-worker1.up.tar.gz


About this issue

  • State: open
  • Created 2 years ago
  • Comments: 20 (5 by maintainers)

Most upvoted comments

@nimrod-becker, for pods without PVs, like core, the reschedule happens once the API server marks the node as NotReady, which happens about 1 minute after the node failure. For pods with PVs, like the db, there is additional storage-system overhead of detaching the PV from the failing node and attaching it to the new node. This overhead depends on the storage system implementation.
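As a hedged illustration of where that delay comes from (not part of the original comment): the NoExecute tolerations on the pod bound how long it stays bound to a node after the node goes NotReady/unreachable before it is evicted and rescheduled. They can be inspected with something like:

# Illustrative check: the node.kubernetes.io/not-ready and node.kubernetes.io/unreachable
# tolerations (and their tolerationSeconds) control how long the pod tolerates a failed node.
oc -n openshift-storage get pod noobaa-core-0 -o yaml | grep -B1 -A3 'node.kubernetes.io/'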