rook: RGW crashlooping due to dynamic resharding on big bucket

Is this a bug report or feature request?

  • Bug Report

Deviation from expected behavior: When dynamic resharding kicks in on a big bucket (more than 30 million objects), a lock can be held on the bucket for a long time (15–20 minutes). During that window the RGW answers very slowly or not at all, so the liveness probe kills it. Since dynamic resharding is driven by the RGW itself, killing the RGW also kills the reshard. Once the RGW restarts, it begins the reshard again and is killed again, and so on. To work around this we had to disable the operator and remove the livenessProbe from the RGW deployment (a rough sketch of that workaround is shown below).
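For reference, the workaround amounted to something like the following. This is only a sketch: the namespace, deployment name, and container index are examples and will differ per cluster.

```sh
# Scale the operator down so it does not immediately reconcile the probe back
kubectl -n rook-ceph scale deployment rook-ceph-operator --replicas=0

# Strip the livenessProbe from the RGW deployment
# (deployment name and container index 0 are examples; adjust to your object store)
kubectl -n rook-ceph patch deployment rook-ceph-rgw-my-store-a --type=json \
  -p='[{"op": "remove", "path": "/spec/template/spec/containers/0/livenessProbe"}]'
```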

Expected behavior: I would expect the RGW not to be killed, so the reshard can complete. I agree that such a bucket should probably have been pre-sharded, but having the reshard constantly killed is not acceptable either.

Could we work on a way to disable the livenessProbe, or to configure its timeout?
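Until that is configurable in the CR, a manual mitigation might look like the sketch below: relax the probe in place rather than removing it, or turn off dynamic resharding so it can be run during a maintenance window. The deployment name, container index, and chosen values are assumptions, not tested settings.

```sh
# Relax the existing probe instead of removing it
# ("add" sets or overrides the field; names and values are examples)
kubectl -n rook-ceph patch deployment rook-ceph-rgw-my-store-a --type=json -p='[
  {"op": "add", "path": "/spec/template/spec/containers/0/livenessProbe/timeoutSeconds", "value": 10},
  {"op": "add", "path": "/spec/template/spec/containers/0/livenessProbe/failureThreshold", "value": 30}
]'

# Alternatively, disable dynamic resharding in Ceph and reshard manually off-peak
# (run from the Rook toolbox)
ceph config set client.rgw rgw_dynamic_resharding false
```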

How to reproduce it (minimal and precise):

Create a huge bucket and observe how resharding (to roughly 1k shards) impacts the RGW; see the commands sketched below.
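To observe or trigger the reshard from the Rook toolbox, something along these lines should work (the bucket name and shard count are examples):

```sh
# Check current shard count and object count for the bucket
radosgw-admin bucket stats --bucket=my-bucket

# See pending/active reshard operations
radosgw-admin reshard list
radosgw-admin reshard status --bucket=my-bucket

# Queue a reshard to ~1k shards and process it
radosgw-admin reshard add --bucket=my-bucket --num-shards=1024
radosgw-admin reshard process
```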

File(s) to submit:

  • Cluster CR (custom resource), typically called cluster.yaml, if necessary
  • Operator’s logs, if necessary
  • Crashing pod(s) logs, if necessary

To get logs, use kubectl -n <namespace> logs <pod name>. When pasting logs, always surround them with backticks or use the insert code button from the GitHub UI. Read the GitHub documentation if you need help.

Environment:

  • OS (e.g. from /etc/os-release):
  • Kernel (e.g. uname -a):
  • Cloud provider or hardware configuration:
  • Rook version (use rook version inside of a Rook Pod):
  • Storage backend version (e.g. for ceph do ceph -v):
  • Kubernetes version (use kubectl version):
  • Kubernetes cluster type (e.g. Tectonic, GKE, OpenShift):
  • Storage backend status (e.g. for Ceph use ceph health in the Rook Ceph toolbox):

About this issue

  • State: closed
  • Created 4 years ago
  • Comments: 19 (16 by maintainers)

Most upvoted comments

@leseb sorry, I don’t have those logs anymore. I will try to reproduce this week or next on our test env.