rook: RGW crashlooping due to dynamic resharding on big bucket

Is this a bug report or feature request?

  • Bug Report

Deviation from expected behavior: When dynamic resharding kicks in on a big bucket (more than 30 million objects), a lock can be held on the bucket for a long time (15–20 minutes). During that window the RGW answers very slowly or not at all, so the liveness probe kills it. Since dynamic resharding is driven by the RGW itself, killing the RGW also kills the reshard. Once the RGW restarts, it begins the reshard again and is killed again, and so on. To work around this we had to disable the operator and remove the livenessProbe from the RGW deployment (a rough sketch of that workaround is shown below).
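For reference, the workaround amounted to something like the following. This is only a sketch: the namespace, deployment name, and container index are examples and will differ per cluster.

```sh
# Scale the operator down so it does not immediately reconcile the probe back
kubectl -n rook-ceph scale deployment rook-ceph-operator --replicas=0

# Strip the livenessProbe from the RGW deployment
# (deployment name and container index 0 are examples; adjust to your object store)
kubectl -n rook-ceph patch deployment rook-ceph-rgw-my-store-a --type=json \
  -p='[{"op": "remove", "path": "/spec/template/spec/containers/0/livenessProbe"}]'
```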

Expected behavior: I would expect the RGW not to be killed, so the reshard can complete. I agree that such a bucket should probably have been pre-sharded, but having the reshard constantly killed is not acceptable either.

Could we work on a way to disable the livenessProbe, or to configure its timeout?
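Until that is configurable in the CR, a manual mitigation might look like the sketch below: relax the probe in place rather than removing it, or turn off dynamic resharding so it can be run during a maintenance window. The deployment name, container index, and chosen values are assumptions, not tested settings.

```sh
# Relax the existing probe instead of removing it
# ("add" sets or overrides the field; names and values are examples)
kubectl -n rook-ceph patch deployment rook-ceph-rgw-my-store-a --type=json -p='[
  {"op": "add", "path": "/spec/template/spec/containers/0/livenessProbe/timeoutSeconds", "value": 10},
  {"op": "add", "path": "/spec/template/spec/containers/0/livenessProbe/failureThreshold", "value": 30}
]'

# Alternatively, disable dynamic resharding in Ceph and reshard manually off-peak
# (run from the Rook toolbox)
ceph config set client.rgw rgw_dynamic_resharding false
```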

How to reproduce it (minimal and precise):

Create a huge bucket and observe how resharding (to roughly 1k shards) impacts the RGW; see the commands sketched below.
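To observe or trigger the reshard from the Rook toolbox, something along these lines should work (the bucket name and shard count are examples):

```sh
# Check current shard count and object count for the bucket
radosgw-admin bucket stats --bucket=my-bucket

# See pending/active reshard operations
radosgw-admin reshard list
radosgw-admin reshard status --bucket=my-bucket

# Queue a reshard to ~1k shards and process it
radosgw-admin reshard add --bucket=my-bucket --num-shards=1024
radosgw-admin reshard process
```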

File(s) to submit:

  • Cluster CR (custom resource), typically called cluster.yaml, if necessary
  • Operator’s logs, if necessary
  • Crashing pod(s) logs, if necessary

To get logs, use kubectl -n <namespace> logs <pod name>. When pasting logs, always surround them with backticks or use the insert code button from the GitHub UI. Read the GitHub documentation if you need help.

Environment:

  • OS (e.g. from /etc/os-release):
  • Kernel (e.g. uname -a):
  • Cloud provider or hardware configuration:
  • Rook version (use rook version inside of a Rook Pod):
  • Storage backend version (e.g. for ceph do ceph -v):
  • Kubernetes version (use kubectl version):
  • Kubernetes cluster type (e.g. Tectonic, GKE, OpenShift):
  • Storage backend status (e.g. for Ceph use ceph health in the Rook Ceph toolbox):

About this issue

  • State: closed
  • Created 4 years ago
  • Comments: 19 (16 by maintainers)

Most upvoted comments

@leseb sorry, I don’t have those logs anymore. I will try to reproduce this week or next on our test env.