ceph-csi: [RBD] parallel PVC creation on a newly created block pool will hang

Describe the bug

The following bug in librbd causes parallel PVC creation requests on a newly created block pool to hang.

Concurrent rbd_pool_init() or rbd_create() operations on an unvalidated
(uninitialized) pool trigger a lockup in ValidatePoolRequest state
machine caused by blocking selfmanaged_snap_{create,remove}() calls.

Ceph issue tracker : https://tracker.ceph.com/issues/52537

Ceph pacific backport pr with fix : https://github.com/ceph/ceph/pull/43113

Environment details

  • Image/version of Ceph CSI driver : v3.4.0

Steps to reproduce

  • Create a new RBD block pool (with no images) and a StorageClass that uses it with the CSI provisioner.
  • Create multiple PVCs in parallel.
  • The creation requests stay in Pending state indefinitely.
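The reproduction steps above can be scripted. The following is a minimal sketch, not something taken from the original report: it assumes `kubectl` is on the PATH and pointed at the affected cluster, and the helper names (`make_pvc`, `apply_manifest`, `create_pvcs_in_parallel`), the `repro-pvc-` name prefix, and the `1Gi` size are all hypothetical.

```python
import json
import subprocess
from concurrent.futures import ThreadPoolExecutor


def make_pvc(name, storage_class, size="1Gi"):
    """Build a minimal PVC manifest as a dict (kubectl accepts JSON input)."""
    return {
        "apiVersion": "v1",
        "kind": "PersistentVolumeClaim",
        "metadata": {"name": name},
        "spec": {
            "accessModes": ["ReadWriteOnce"],
            "storageClassName": storage_class,
            "resources": {"requests": {"storage": size}},
        },
    }


def apply_manifest(manifest):
    """Pipe one manifest to `kubectl apply -f -`; raises on a non-zero exit."""
    subprocess.run(
        ["kubectl", "apply", "-f", "-"],
        input=json.dumps(manifest),
        text=True,
        check=True,
    )


def create_pvcs_in_parallel(storage_class, count=10):
    """Submit `count` PVCs at once so the provisioner sees concurrent requests."""
    manifests = [make_pvc(f"repro-pvc-{i}", storage_class) for i in range(count)]
    with ThreadPoolExecutor(max_workers=count) as pool:
        # list() waits for every submission and surfaces any kubectl error.
        list(pool.map(apply_manifest, manifests))
    return [m["metadata"]["name"] for m in manifests]
```

Calling `create_pvcs_in_parallel("<storageclass_name>")` against a StorageClass backed by a freshly created, empty pool should leave the resulting PVCs in Pending on affected versions.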

Actual results

  • The creation requests stay in Pending state indefinitely.

Expected behavior

  • The creation requests should succeed.

Updated Work Around (does not leave stale omap entries, thanks @Madhu-1)

  • Run rbd pool init <pool_name> directly on the cluster or from the CSI pods.
  • Restart the csi rbdplugin provisioner pod.
  • The pending PVCs will then go to Bound state without leaving any stale resources.

After the above steps, parallel PVC creation requests should work fine.
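As a rough sketch of the workaround steps, the two commands could be assembled like this. Only `rbd pool init <pool_name>` comes from this issue; the namespace argument and the `app=csi-rbdplugin-provisioner` label selector are assumptions about how the provisioner is deployed and should be checked against your cluster. The function names are hypothetical, and the restart is done by deleting the pod so that its controller recreates it.

```python
import shlex
import subprocess


def workaround_commands(pool_name, namespace,
                        selector="app=csi-rbdplugin-provisioner"):
    """Return the workaround as argv lists, in the order they should run.

    `selector` is an assumed pod label; verify it for your deployment.
    """
    return [
        # Initialize the pool once, before any concurrent image creation.
        ["rbd", "pool", "init", pool_name],
        # Delete the provisioner pod(s) so they are recreated (a restart).
        ["kubectl", "-n", namespace, "delete", "pod", "-l", selector],
    ]


def run_workaround(pool_name, namespace, dry_run=True):
    """Print-only by default; pass dry_run=False to actually execute."""
    commands = workaround_commands(pool_name, namespace)
    if dry_run:
        return [shlex.join(cmd) for cmd in commands]
    for cmd in commands:
        subprocess.run(cmd, check=True)  # stop at the first failure
    return [shlex.join(cmd) for cmd in commands]
```

With the default `dry_run=True` the function only returns the command strings, which is a safer way to review them first. The `rbd` command has to run somewhere with access to the Ceph cluster credentials, so in practice the two steps may need to be run from different places.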

Work Around (Not recommended, will leave stale omap entries)

  • Delete the ongoing PVC creation requests.

  • Restart the csi rbdplugin provisioner pod.

  • Then either:

    • issue a single PVC create request, which will succeed,
    • or call rbd pool init <pool_name> on the Ceph cluster.

After the above steps, parallel PVC creation requests should work fine.

About this issue

  • State: closed
  • Created 3 years ago
  • Comments: 22 (12 by maintainers)

Most upvoted comments

@Rakshith-R this can be closed? Thanks madhu for reminding.

Yes, Ceph 16.2.7 was released in December (https://github.com/ceph/ceph/releases/tag/v16.2.7). It contains the fix, so cephcsi v3.5.0 will in turn have the fix for this issue. Closing the issue for that reason.

Docker Hub did not have the latest version of Ceph, so base images are now being pulled from quay.io, and release v3.5.1 should have the fix for this issue. Refer to #2796.

Thanks for notifying on this one