ceph-csi: RBD: OOMKills occur when the secret metadata encryption type is used with multiple PVC create requests.

I tested secret-based encryption with 3.7.1 and I don't see any crash with the limits below:

    Limits:
      cpu:     500m
      memory:  256Mi
    Requests:
      cpu:     250m
      memory:  256Mi
[🎩︎]mrajanna@fedora rbd $]kubectl get pvc
NAME      STATUS   VOLUME                                     CAPACITY   ACCESS MODES   STORAGECLASS      AGE
claim0    Bound    pvc-81cb47af-dfbd-4360-b444-7636e8a2c359   1Gi        RWO            rook-ceph-block   4m48s
claim1    Bound    pvc-c1c20239-db18-4c56-b9da-8745b0046428   1Gi        RWO            rook-ceph-block   4m47s
claim10   Bound    pvc-7d6e66d8-ceea-4e0c-9775-64aa84b1548b   1Gi        RWO            rook-ceph-block   4m46s
claim2    Bound    pvc-3a61744c-2d0e-46c1-9d8c-3b0f5f49574c   1Gi        RWO            rook-ceph-block   4m47s
claim3    Bound    pvc-aa2613fb-4db8-4254-801b-8d9d72e83979   1Gi        RWO            rook-ceph-block   4m47s
claim4    Bound    pvc-00fbe104-1809-4a11-8c39-9e3ceee9d5c9   1Gi        RWO            rook-ceph-block   4m47s
claim5    Bound    pvc-7ad36255-755b-4bdd-a88c-4bbf695e8b69   1Gi        RWO            rook-ceph-block   4m47s
claim6    Bound    pvc-bcd48a37-3d8d-47ce-b780-511020690397   1Gi        RWO            rook-ceph-block   4m47s
claim7    Bound    pvc-1fac5dcf-1668-489a-8799-16630b74e971   1Gi        RWO            rook-ceph-block   4m47s
claim8    Bound    pvc-949006fd-5b47-4e9c-acc1-3808894245f8   1Gi        RWO            rook-ceph-block   4m46s
claim9    Bound    pvc-05ab1ba5-4c8f-41a5-ab09-c042b6089b23   1Gi        RWO            rook-ceph-block   4m46s

But when I tested with the metadata encryption type, I can see the crash. Does this confirm we have a memory leak?

_Originally posted by @Madhu-1 in https://github.com/ceph/ceph-csi/issues/3402#issuecomment-1278691078_

About this issue

  • State: closed
  • Created 2 years ago
  • Comments: 23 (15 by maintainers)

Most upvoted comments

On Mon, Jan 09, 2023 at 12:59:35AM -0800, Lennart Jern wrote:

> Some more experiments also hint that the issue is the parallel scrypt.Key calls. With a memory limit of 256MiB, I can create 100 PVCs without issue as long as I do it one at a time with a 1 sec sleep in between. On the other hand, if I create 5-10 PVCs in one go, I immediately get an OOM.
>
> Any ideas for how to solve this? I'm not familiar with the code, so I'm unsure where to start looking. 🙁

Sounds like we need to limit the number of concurrent scrypt.Key calls. A semaphore like https://pkg.go.dev/golang.org/x/sync/semaphore might help with that.

I guess it needs some configurable option, passed on the command line of the container.
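
A minimal sketch of that idea, not the actual ceph-csi change: it caps concurrent key derivations with a buffered channel, used here as a standard-library stand-in for the `semaphore.Weighted` type from the package linked above. The `limiter` type, the `runBatch` helper, and the `--max-key-derivations` flag are all hypothetical names for illustration, and the `time.Sleep` call stands in for the memory-hungry `scrypt.Key` call.

```go
package main

import (
	"flag"
	"fmt"
	"sync"
	"sync/atomic"
	"time"
)

// limiter caps how many callers may run an expensive operation at once.
// Hypothetical helper; a buffered channel acts as a counting semaphore.
type limiter struct {
	slots chan struct{}
}

// newLimiter returns a limiter that admits at most n concurrent callers.
func newLimiter(n int) *limiter {
	if n < 1 {
		n = 1
	}
	return &limiter{slots: make(chan struct{}, n)}
}

// do blocks until a slot is free, runs fn, then releases the slot.
func (l *limiter) do(fn func()) {
	l.slots <- struct{}{}
	defer func() { <-l.slots }()
	fn()
}

// runBatch starts one goroutine per request, routes each through the
// limiter, and returns the highest concurrency actually observed.
func runBatch(l *limiter, requests int, work func()) int64 {
	var running, peak int64
	var wg sync.WaitGroup
	for i := 0; i < requests; i++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			l.do(func() {
				cur := atomic.AddInt64(&running, 1)
				// Record a new peak if this caller raised it.
				for {
					old := atomic.LoadInt64(&peak)
					if cur <= old || atomic.CompareAndSwapInt64(&peak, old, cur) {
						break
					}
				}
				work() // placeholder for the scrypt.Key call
				atomic.AddInt64(&running, -1)
			})
		}()
	}
	wg.Wait()
	return peak
}

func main() {
	// Hypothetical flag name; the real option (if any) may differ.
	maxDerivations := flag.Int("max-key-derivations", 2,
		"maximum number of concurrent key derivations")
	flag.Parse()

	l := newLimiter(*maxDerivations)
	peak := runBatch(l, 10, func() { time.Sleep(10 * time.Millisecond) })
	fmt.Printf("peak concurrent derivations: %d (limit %d)\n", peak, *maxDerivations)
}
```

With a cap like this, a burst of 5-10 PVC requests would queue for a slot instead of all deriving keys at the same moment, so peak memory scales with the limit rather than with the number of in-flight requests.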

This OOMKill happens in the csi-rbdplugin nodeplugin pod, right? The nodeplugin calls cryptsetup.

This issue was originally opened to address OOMKills in the provisioner pod during volume creation. That seems to be resolved now.

I'd expect the CO (in this case the kubelet, or whichever component is responsible for issuing NodePublish/NodeStage calls) to have some kind of limit on simultaneous CSI calls.

Similar to the one for csi-provisioner: https://github.com/kubernetes-csi/external-provisioner/blob/cd81ed5d31835d1aabad74db869c1165df9e3666/cmd/csi-provisioner/csi-provisioner.go#L81