noobaa-core: Intermittent "We encountered an internal error. Please try again"

Environment info

This comes from HPO-core defect https://github.ibm.com/IBMSpectrumScale/hpo-core/issues/555. While running a cosbench test I get occasional errors; sometimes the run passes and sometimes it does not.

The errors occurred somewhere between these times:

Start of run: Wed Mar 2 15:30:03 EST 2022
End of run:   Wed Mar 2 15:34:19 EST 2022

A section from the cosbench log which shows the failure


```
================================================== stage: s1-prepare_GB ==================================================
---------------------------------- mission: MA4C534C68B, driver: app6 ----------------------------------
2022-03-02 15:30:02,008 [INFO] [Log4jLogManager] - will append log to file /root/cosbench/log/mission/MA4C534C68B.log
2022-03-02 15:30:02,235 [INFO] [NoneStorage] - performing PUT at /s5001b1/s5001o1_GB
2022-03-02 15:30:02,236 [INFO] [NoneStorage] - performing PUT at /s5001b1/s5001o3_GB
2022-03-02 15:30:02,236 [INFO] [NoneStorage] - performing PUT at /s5001b1/s5001o5_GB
2022-03-02 15:30:02,236 [INFO] [NoneStorage] - performing PUT at /s5001b1/s5001o2_GB
2022-03-02 15:30:02,236 [INFO] [NoneStorage] - performing PUT at /s5001b1/s5001o4_GB
2022-03-02 15:30:05,808 [INFO] [NoneStorage] - performing PUT at /s5001b2/s5001o4_GB
2022-03-02 15:30:06,344 [INFO] [NoneStorage] - performing PUT at /s5001b2/s5001o1_GB
2022-03-02 15:30:07,363 [WARN] [S3Storage] - below exception encountered when creating object s5001o2_GB at s5001b1: Failed to reset the request input stream; If the request involves an input stream, the maximum stream buffer size can be configured via request.getRequestClientOptions().setReadLimit(int)
2022-03-02 15:30:07,365 [INFO] [NoneStorage] - performing PUT at /s5001b2/s5001o2_GB
2022-03-02 15:30:10,220 [INFO] [NoneStorage] - performing PUT at /s5001b3/s5001o1_GB
2022-03-02 15:30:10,269 [INFO] [NoneStorage] - performing PUT at /s5001b2/s5001o3_GB
2022-03-02 15:30:11,647 [INFO] [NoneStorage] - performing PUT at /s5001b2/s5001o5_GB
2022-03-02 15:30:14,219 [INFO] [NoneStorage] - performing PUT at /s5001b3/s5001o3_GB
2022-03-02 15:30:16,229 [INFO] [NoneStorage] - performing PUT at /s5001b3/s5001o4_GB
2022-03-02 15:30:18,461 [INFO] [NoneStorage] - performing PUT at /s5001b3/s5001o2_GB
2022-03-02 15:30:19,597 [INFO] [NoneStorage] - performing PUT at /s5001b4/s5001o4_GB
2022-03-02 15:30:19,948 [WARN] [S3Storage] - below exception encountered when creating object s5001o3_GB at s5001b3: Failed to reset the request input stream; If the request involves an input stream, the maximum stream buffer size can be configured via request.getRequestClientOptions().setReadLimit(int)
2022-03-02 15:30:19,948 [INFO] [NoneStorage] - performing PUT at /s5001b4/s5001o3_GB
2022-03-02 15:30:21,821 [INFO] [NoneStorage] - performing PUT at /s5001b4/s5001o1_GB
2022-03-02 15:30:23,736 [INFO] [NoneStorage] - performing PUT at /s5001b3/s5001o5_GB
2022-03-02 15:30:31,439 [INFO] [NoneStorage] - performing PUT at /s5001b4/s5001o2_GB
2022-03-02 15:30:39,427 [INFO] [NoneStorage] - performing PUT at /s5001b4/s5001o5_GB
```

Picking out this error to follow:

```
2022-03-02 15:30:07,363 [WARN] [S3Storage] - below exception encountered when creating object s5001o2_GB at s5001b1: Failed to reset the request input stream; If the request involves an input stream, the maximum stream buffer size can be configured via request.getRequestClientOptions().setReadLimit(int)
```

This warning is client-side: the AWS Java SDK used by cosbench tried to retry the failed PUT but could not rewind the multi-gigabyte request input stream. The warning itself is a symptom; the server-side error that triggered the retry is what matters.

Looking for that error in the noobaa log (the cosbench timestamps are EST, so 15:30:07 EST corresponds to the 20:30:07 UTC timestamps in the endpoint logs):

```
[root@c83f1-infa internal]# oc get pod -o wide | grep -i endpoint | grep dan3 | awk '{print $1}' | xargs -ti oc logs {} | grep 20:30:07 | grep s5001o2
oc logs noobaa-endpoint-749785777b-2fzxd
oc logs noobaa-endpoint-749785777b-2g7s4
oc logs noobaa-endpoint-749785777b-6jp42
oc logs noobaa-endpoint-749785777b-6wgcc
oc logs noobaa-endpoint-749785777b-7sx8t
oc logs noobaa-endpoint-749785777b-btkhh
oc logs noobaa-endpoint-749785777b-gsrx9
oc logs noobaa-endpoint-749785777b-spp8b
oc logs noobaa-endpoint-749785777b-wf4p4
oc logs noobaa-endpoint-749785777b-xsqxc
Mar-2 20:30:07.379 [Endpoint/14] [ERROR] core.endpoint.s3.s3_rest:: S3 ERROR <?xml version="1.0" encoding="UTF-8"?><Error><Code>InternalError</Code><Message>We encountered an internal error. Please try again.</Message><Resource>/s5001b1/s5001o2_GB</Resource><RequestId>l0a0j66s-amgalv-n9a</RequestId></Error> PUT /s5001b1/s5001o2_GB {"host":"metallb","authorization":"AWS lDT36NrV8XNBMg0p8ESh:DNkEsP3hIfzKYQSH5sqT2iKjKW8=","user-agent":"aws-sdk-java/1.10.76 Linux/4.18.0-240.el8.x86_64 OpenJDK_64-Bit_Server_VM/25.265-b01/1.8.0_265","amz-sdk-invocation-id":"f7e25c94-f1b5-436a-bc64-97c730ead3fa","amz-sdk-retry":"0/0/","date":"Wed, 02 Mar 2022 20:30:02 GMT","content-type":"application/octet-stream","content-length":"3000000000","connection":"Keep-Alive"} TypeError: callback is not a function
Mar-2 20:30:07.427 [Endpoint/14] [L0] core.endpoint.s3.ops.s3_put_object:: PUT OBJECT s5001b2 s5001o2_GB
```
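One thing worth noting in that endpoint log line is the trailing "TypeError: callback is not a function": the endpoint hit an internal error and then apparently crashed again inside its error-handling path, which is what surfaces to the client as the generic InternalError. The following TypeScript sketch is purely illustrative of that class of bug (putObject and the omitted callback are invented for the example; this is not noobaa's code):

```
// Illustrative only: a Node-style API whose error path assumes a callback was provided.
import { writeFile } from 'node:fs';

function putObject(path: string, body: Buffer, callback?: (err: NodeJS.ErrnoException | null) => void) {
    writeFile(path, body, (err) => {
        // If no callback was supplied, this line itself throws
        // "TypeError: callback is not a function", masking `err`
        // (for example an EINVAL coming back from a failed write).
        callback!(err);
    });
}

// A caller that forgot the callback, which is what the log's TypeError suggests happened internally:
putObject('/tmp/demo_object', Buffer.from('payload'));
```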

```
[root@c83f1-infa internal]# noobaa status
INFO[0000] CLI version: 5.9.0
INFO[0000] noobaa-image: noobaa/noobaa-core:5.9_nsfsfixes-20220215
INFO[0000] operator-image: quay.io/rhceph-dev/odf4-mcg-rhel8-operator@sha256:773dbaded46fa0a024c28a92f98f4ad64d370011ed1405f2b16f39b9258eb6b2
INFO[0000] noobaa-db-image: quay.io/rhceph-dev/rhel8-postgresql-12@sha256:da0b8d525b173ef472ff4c71fae60b396f518860d6313c4f3287b844aab6d622
INFO[0000] Namespace: openshift-storage
INFO[0000]
INFO[0000] CRD Status:
INFO[0000] ✅ Exists: CustomResourceDefinition "noobaas.noobaa.io"
INFO[0000] ✅ Exists: CustomResourceDefinition "backingstores.noobaa.io"
INFO[0000] ✅ Exists: CustomResourceDefinition "namespacestores.noobaa.io"
INFO[0000] ✅ Exists: CustomResourceDefinition "bucketclasses.noobaa.io"
INFO[0000] ✅ Exists: CustomResourceDefinition "noobaaaccounts.noobaa.io"
INFO[0000] ✅ Exists: CustomResourceDefinition "objectbucketclaims.objectbucket.io"
INFO[0000] ✅ Exists: CustomResourceDefinition "objectbuckets.objectbucket.io"
INFO[0000]
INFO[0000] Operator Status:
INFO[0000] ✅ Exists: Namespace "openshift-storage"
INFO[0000] ✅ Exists: ServiceAccount "noobaa"
INFO[0000] ✅ Exists: ServiceAccount "noobaa-endpoint"
INFO[0000] ✅ Exists: Role "mcg-operator.v4.9.3-noobaa-7fdbb75fd7"
INFO[0000] ✅ Exists: Role "mcg-operator.v4.9.3-noobaa-endpoint-65854bfccb"
INFO[0000] ✅ Exists: RoleBinding "mcg-operator.v4.9.3-noobaa-7fdbb75fd7"
INFO[0000] ✅ Exists: RoleBinding "mcg-operator.v4.9.3-noobaa-endpoint-65854bfccb"
INFO[0000] ✅ Exists: ClusterRole "mcg-operator.v4.9.3-644558fccd"
INFO[0000] ✅ Exists: ClusterRoleBinding "mcg-operator.v4.9.3-644558fccd"
INFO[0000] ⬛ (Optional) Not Found: ValidatingWebhookConfiguration "admission-validation-webhook"
INFO[0000] ⬛ (Optional) Not Found: Secret "admission-webhook-secret"
INFO[0000] ⬛ (Optional) Not Found: Service "admission-webhook-service"
INFO[0000] ✅ Exists: Deployment "noobaa-operator"
INFO[0000]
INFO[0000] System Wait Ready:
INFO[0000] ✅ System Phase is "Ready".
INFO[0000]
INFO[0000]
INFO[0000] System Status:
INFO[0000] ✅ Exists: NooBaa "noobaa"
INFO[0000] ✅ Exists: StatefulSet "noobaa-core"
INFO[0000] ✅ Exists: ConfigMap "noobaa-config"
INFO[0000] ✅ Exists: Service "noobaa-mgmt"
INFO[0000] ✅ Exists: Service "s3"
INFO[0000] ✅ Exists: Secret "noobaa-db"
INFO[0000] ✅ Exists: ConfigMap "noobaa-postgres-config"
INFO[0000] ✅ Exists: ConfigMap "noobaa-postgres-initdb-sh"
INFO[0000] ✅ Exists: StatefulSet "noobaa-db-pg"
INFO[0000] ✅ Exists: Service "noobaa-db-pg"
INFO[0000] ✅ Exists: Secret "noobaa-server"
INFO[0000] ✅ Exists: Secret "noobaa-operator"
INFO[0000] ✅ Exists: Secret "noobaa-endpoints"
INFO[0000] ✅ Exists: Secret "noobaa-admin"
INFO[0000] ✅ Exists: StorageClass "openshift-storage.noobaa.io"
INFO[0000] ✅ Exists: BucketClass "noobaa-default-bucket-class"
INFO[0000] ✅ Exists: Deployment "noobaa-endpoint"
INFO[0000] ✅ Exists: HorizontalPodAutoscaler "noobaa-endpoint"
INFO[0000] ✅ (Optional) Exists: BackingStore "noobaa-default-backing-store"
INFO[0001] ⬛ (Optional) Not Found: CredentialsRequest "noobaa-aws-cloud-creds"
INFO[0001] ⬛ (Optional) Not Found: CredentialsRequest "noobaa-azure-cloud-creds"
INFO[0001] ⬛ (Optional) Not Found: Secret "noobaa-azure-container-creds"
INFO[0001] ⬛ (Optional) Not Found: Secret "noobaa-gcp-bucket-creds"
INFO[0001] ⬛ (Optional) Not Found: CredentialsRequest "noobaa-gcp-cloud-creds"
INFO[0001] ✅ (Optional) Exists: PrometheusRule "noobaa-prometheus-rules"
INFO[0001] ✅ (Optional) Exists: ServiceMonitor "noobaa-mgmt-service-monitor"
INFO[0001] ✅ (Optional) Exists: ServiceMonitor "s3-service-monitor"
INFO[0001] ✅ (Optional) Exists: Route "noobaa-mgmt"
INFO[0001] ✅ (Optional) Exists: Route "s3"
INFO[0001] ✅ Exists: PersistentVolumeClaim "db-noobaa-db-pg-0"
INFO[0001] ✅ System Phase is "Ready"
INFO[0001] ✅ Exists: "noobaa-admin"

#------------------#
#- Mgmt Addresses -#
#------------------#

ExternalDNS : [https://noobaa-mgmt-openshift-storage.apps.ocp4.pokprv.stglabs.ibm.com]
ExternalIP  : []
NodePorts   : [https://10.28.20.46:32659]
InternalDNS : [https://noobaa-mgmt.openshift-storage.svc:443]
InternalIP  : [https://172.30.179.247:443]
PodPorts    : [https://10.128.3.154:8443]

#--------------------#
#- Mgmt Credentials -#
#--------------------#

email    : admin@noobaa.io
password : BWb0l3hATSA1+BUyJrShrw==

#----------------#
#- S3 Addresses -#
#----------------#

ExternalDNS : [https://s3-openshift-storage.apps.ocp4.pokprv.stglabs.ibm.com]
ExternalIP  : []
NodePorts   : [https://10.28.20.47:30905 https://10.28.20.46:30905 https://10.28.20.47:30905 https://10.28.20.45:30905 https://10.28.20.45:30905 https://10.28.20.45:30905 https://10.28.20.47:30905 https://10.28.20.45:30905 https://10.28.20.46:30905 https://10.28.20.46:30905 https://10.28.20.47:30905 https://10.28.20.45:30905 https://10.28.20.46:30905 https://10.28.20.45:30905 https://10.28.20.46:30905 https://10.28.20.47:30905 https://10.28.20.46:30905 https://10.28.20.45:30905 https://10.28.20.47:30905 https://10.28.20.45:30905 https://10.28.20.46:30905 https://10.28.20.47:30905 https://10.28.20.47:30905 https://10.28.20.46:30905 https://10.28.20.45:30905 https://10.28.20.45:30905 https://10.28.20.47:30905 https://10.28.20.47:30905 https://10.28.20.46:30905 https://10.28.20.46:30905]
InternalDNS : [https://s3.openshift-storage.svc:443]
InternalIP  : [https://172.30.167.87:443]
PodPorts    : [https://10.128.1.89:6443 https://10.128.2.173:6443 https://10.128.1.90:6443 https://10.128.4.152:6443 https://10.128.4.154:6443 https://10.128.4.150:6443 https://10.128.1.92:6443 https://10.128.4.147:6443 https://10.128.2.178:6443 https://10.128.2.176:6443 https://10.128.1.93:6443 https://10.128.4.153:6443 https://10.128.2.182:6443 https://10.128.4.148:6443 https://10.128.2.175:6443 https://10.128.1.87:6443 https://10.128.2.177:6443 https://10.128.4.149:6443 https://10.128.1.88:6443 https://10.128.4.155:6443 https://10.128.2.174:6443 https://10.128.1.95:6443 https://10.128.1.86:6443 https://10.128.2.181:6443 https://10.128.4.151:6443 https://10.128.4.146:6443 https://10.128.1.94:6443 https://10.128.1.91:6443 https://10.128.2.179:6443 https://10.128.2.180:6443]

#------------------#
#- S3 Credentials -#
#------------------#

AWS_ACCESS_KEY_ID     : xbtyDghqOmoE3RDoWULe
AWS_SECRET_ACCESS_KEY : q9BE7S7KTVZX8N/A5LB9sh9Kv1dSOB/d1P2QjEwg

#------------------#
#- Backing Stores -#
#------------------#

NAME                           TYPE      TARGET-BUCKET   PHASE   AGE
noobaa-default-backing-store   pv-pool                   Ready   6d1h28m2s

#--------------------#
#- Namespace Stores -#
#--------------------#

NAME                      TYPE   TARGET-BUCKET   PHASE   AGE
noobaa-s3res-3777249441   nsfs                   Ready   6d1h27m56s

#------------------#
#- Bucket Classes -#
#------------------#

NAME                          PLACEMENT                                                         NAMESPACE-POLICY   QUOTA   PHASE   AGE
noobaa-default-bucket-class   {"tiers":[{"backingStores":["noobaa-default-backing-store"]}]}    null               null    Ready   6d1h28m2s

#-------------------#
#- NooBaa Accounts -#
#-------------------#

No noobaa accounts found.

#-----------------#
#- Bucket Claims -#
#-----------------#

No OBCs found.
```

Must-gathers are in the Box note https://ibm.ent.box.com/folder/145794528783?s=uueh7fp424vxs2bt4ndrnvh7uusgu6tocd. Could not get `oc adm must-gather --image=registry.redhat.io/ocs4/ocs-must-gather-rhel8` because of this error:

```
[must-gather-j669f] POD 2022-03-02T21:43:59.287340940Z collecting dump cephblockpools
[must-gather-j669f] POD 2022-03-02T21:43:59.508140313Z collecting dump cephclusters
[must-gather-j669f] POD 2022-03-02T21:43:59.738039077Z collecting dump cephfilesystems
[must-gather-j669f] POD 2022-03-02T21:43:59.978852540Z collecting dump cephobjectstores
[must-gather-j669f] POD 2022-03-02T21:44:00.207898724Z collecting dump cephobjectstoreusers
[must-gather-j669f] POD 2022-03-02T21:44:01.724486444Z Error from server (NotFound): pods "must-gather-j669f-helper" not found
[must-gather-j669f] POD 2022-03-02T21:44:01.738185148Z waiting for helper pod to come up in openshift-storage namespace. Retrying 1
[must-gather-j669f] POD 2022-03-02T21:44:06.926669468Z Error from server (NotFound): pods "must-gather-j669f-helper" not found
[must-gather-j669f] POD 2022-03-02T21:44:06.961050391Z waiting for helper pod to come up in openshift-storage namespace. Retrying 2
[must-gather-j669f] POD 2022-03-02T21:44:12.175332535Z Error from server (NotFound): pods "must-gather-j66
```

Your environment

```
[root@c83f1-infa internal]# oc get csv
NAME                  DISPLAY                       VERSION   REPLACES              PHASE
mcg-operator.v4.9.3   NooBaa Operator               4.9.3     mcg-operator.v4.9.2   Succeeded
ocs-operator.v4.9.3   OpenShift Container Storage   4.9.3     ocs-operator.v4.9.2   Succeeded
odf-operator.v4.9.3   OpenShift Data Foundation     4.9.3     odf-operator.v4.9.2   Succeeded
[root@c83f1-infa internal]#
```

- POK Bare Metal, Real Fast cluster
- 30 endpoint pods
- MD5 disabled
- Default CPU/memory on the endpoint pods
- 1 account being tested, even though there are 100 accounts
- 10 buckets for this account
- 128 workers in Cosbench
- 4 app nodes, but it looks like only 2 were used

Steps to reproduce

Cosbench xml

```
<?xml version="1.0" encoding="UTF-8" ?>
<workload name="GBrangesizeobj" description="GB range read and write workload">
  <storage type="s3" config="accesskey=lDT36NrV8XNBMg0p8ESh;secretkey=TZDzPCqqbBhjs1nzQX0Y0fK27F2ANCTP81iQjjqj;endpoint=http://metallb:80;path_style_access=true" />
  <workflow>
    <workstage name="prepare_GB">
      <work name="prepare_GB" workers="128" interval="10" rampup="0" type="prepare" ratio="100" config="cprefix=s5001b;oprefix=s5001o;osuffix=_GB;containers=r(1,4);objects=r(1,5);sizes=r(1,3)GB"/>
    </workstage>
    <workstage name="rangewritereaddelete">
      <work name="GBrangewritereaddelete" workers="128" interval="10" rampup="0" runtime="1800">
        <operation type="write" ratio="20" config="cprefix=s5001b;oprefix=s5001o;osuffix=_GB;containers=r(1,4);objects=r(7,9);sizes=r(1,3)GB"/>
        <operation type="read" ratio="60" config="cprefix=s5001b;oprefix=s5001o;osuffix=_GB;containers=r(1,4);objects=r(1,3)"/>
        <operation type="delete" ratio="20" config="cprefix=s5001b;oprefix=s5001o;osuffix=_GB;containers=r(1,4);objects=r(4,6)"/>
      </work>
    </workstage>
    <workstage name="cleanup_GB">
      <work name="cleanup_GB" workers="128" interval="10" rampup="0" type="cleanup" ratio="100" config="cprefix=s5001b;oprefix=s5001o;osuffix=_GB;containers=r(1,4);objects=r(1,9)"/>
    </workstage>
  </workflow>
</workload>
```

Expected behaviour

Should be able to write objects consistently


About this issue

  • State: closed
  • Created 2 years ago
  • Comments: 30 (12 by maintainers)

Most upvoted comments

I've confirmed that this defect is indeed fixed by patching the NooBaa CR to the build containing the fix:

```
[root@c83f1-app1 RW_moki]# kubectl patch noobaa noobaa --patch '{"spec": {"image": "noobaa/noobaa-core:nsfs_backport_5.9-20220331"}}' --type="merge"
```

Verification of 6896: launch cosbench:


```
[root@c83f1-dan4 rand_read_write_delete_GB_1hour]# date ; sh /root/cosbench/cli.sh submit /root/cosbench/workloads/RW_workloads/rand_read_write_delete_GB_1hour/s5001_GB_random_size_num_of_objects.xml
Mon Apr  4 14:37:37 EDT 2022
Accepted with ID: w907
```

Check noobaa endpoint logs

```
[root@c83f1-app1 RW_moki]# oc logs deployment/noobaa-endpoint --all-containers=true --prefix=true --max-log-requests=40 --since=15m | grep -i PUT | grep -i error
Found 6 pods, using pod/noobaa-endpoint-7fdb5b75fd-cvd4h
[root@c83f1-app1 RW_moki]#
```

Look at cosbench logs:

Cosbench logs don’t show any sign of error:
```
================================================== stage: s1-prepare_rGB ==================================================
---------------------------------- mission: MEF5DE3C0D6, driver: app6 ----------------------------------
2022-04-04 14:37:37,011 [INFO] [Log4jLogManager] - will append log to file /root/cosbench/log/mission/MEF5DE3C0D6.log
2022-04-04 14:37:37,243 [INFO] [NoneStorage] - performing PUT at /s5001b1/s5001o1_rGB
2022-04-04 14:37:37,245 [INFO] [NoneStorage] - performing PUT at /s5001b1/s5001o2_rGB
2022-04-04 14:37:37,313 [INFO] [NoneStorage] - performing PUT at /s5001b2/s5001o1_rGB
2022-04-04 14:37:37,314 [INFO] [NoneStorage] - performing PUT at /s5001b2/s5001o2_rGB
2022-04-04 14:37:37,346 [INFO] [NoneStorage] - performing PUT at /s5001b3/s5001o2_rGB
2022-04-04 14:37:37,386 [INFO] [NoneStorage] - performing PUT at /s5001b3/s5001o1_rGB
2022-04-04 14:37:37,394 [INFO] [NoneStorage] - performing PUT at /s5001b4/s5001o2_rGB
2022-04-04 14:37:37,463 [INFO] [NoneStorage] - performing PUT at /s5001b4/s5001o1_rGB
---------------------------------- mission: M6F5DE3C0D6, driver: app7 ----------------------------------
2022-04-04 14:37:37,011 [INFO] [Log4jLogManager] - will append log to file /root/cosbench/log/mission/M6F5DE3C0D6.log
---------------------------------- mission: M3F5DE3C0F6, driver: app8 ----------------------------------
2022-04-04 14:37:37,009 [INFO] [Log4jLogManager] - will append log to file /root/cosbench/log/mission/M3F5DE3C0F6.log
---------------------------------- mission: MBF5DE3C0E5, driver: dan4 ----------------------------------
2022-04-04 14:37:37,015 [INFO] [Log4jLogManager] - will append log to file /root/cosbench/log/mission/MBF5DE3C0E5.log
```

I tested 2 more times just to be sure and did not see the signature of this defect.

I do not know this team's process for closing defects. Do I wait until the fix is in an ODF build, or is patch verification sufficient?

@romayalon @MonicaLemay wow, what great work on this issue! I would add it to the pantheon of issues as a "higgs-bugson" from the good old list of unusual software bugs.

@nimrod-becker and @akmithal will be able to advise on closing.

Hi Monica, I think it really helped. The file system error is still EINVAL, and according to https://linux.die.net/man/2/writev:

EINVAL
The sum of the iov_len values overflows an ssize_t value. Or, the vector count iovcnt is less than zero or greater than the permitted maximum.

And from the logs:

```
2022-03-25 17:10:56.132046 [PID-15/TID-15] [L1] FS::FSWorker::OnError: FileWritev _wrap->_path=/nsfs/noobaa-s3res-3777249441/s5003/s5003b4/.noobaa-nsfs_622115d4c8206400294c9c6e/uploads/c47886f6-b3c9-4b40-817c-a32055aa2049 _total_len=8388608 buffers_len=1315  error.Message()=Invalid argument
```

With buffers_len=1315 I think we passed that maximum: according to the man page, on Linux the limit advertised by these mechanisms (IOV_MAX) is 1024, and this write used 1315 iovec entries.
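To make the suspected root cause concrete, here is a minimal Node.js (TypeScript) sketch, not noobaa's actual code, of the kind of guard that avoids this EINVAL: never hand writev more than IOV_MAX buffers in a single call, and flush a long buffer list in chunks instead. The helper name writeAllBuffers, the file path, and the per-buffer size are invented for the example; only the 1024 limit comes from the writev(2) man page quoted above.

```
// Illustrative sketch only: chunk a long buffer list so each writev call
// stays at or below the Linux IOV_MAX of 1024 iovec entries.
import { open } from 'node:fs/promises';

const IOV_MAX = 1024; // limit on the iovec count accepted by writev(2) on Linux

// Hypothetical helper: write `buffers` to `path`, at most IOV_MAX buffers per writev call.
async function writeAllBuffers(path: string, buffers: Buffer[]): Promise<void> {
    const fh = await open(path, 'w');
    try {
        for (let i = 0; i < buffers.length; i += IOV_MAX) {
            // Passing all 1315 buffers in one call is what fails with EINVAL;
            // each slice here holds at most 1024 entries.
            await fh.writev(buffers.slice(i, i + IOV_MAX));
        }
    } finally {
        await fh.close();
    }
}

// Roughly the shape of the failing write: 1315 buffers totalling ~8 MiB.
const parts = Array.from({ length: 1315 }, () => Buffer.alloc(6380));
writeAllBuffers('/tmp/iov_max_demo', parts).catch(console.error);
```

A production version would also need to handle short writes (writev may write fewer bytes than requested), but chunking alone is enough to keep each call under the kernel's iovec limit.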