rook: RGW crashlooping due to dynamic resharding on big bucket
Is this a bug report or feature request?
- Bug Report
Deviation from expected behavior: When dynamic resharding of a big bucket (more than 30 million objects) occurs, a lock on the bucket can be held for a long time (15–20 minutes). The RGW then stops answering, or answers very slowly, and gets killed by the liveness probe. Since dynamic resharding is managed by the RGW, killing the RGW also kills the resharding. Once the RGW restarts, it begins the resharding again, is killed again, and so on. To break this loop we had to disable the operator and remove the liveness probe from the RGW deployment.
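As an immediate stopgap (not part of the original report, just a hedged sketch): dynamic resharding can be switched off at the Ceph level so the RGW stops re-entering the reshard on every restart. The option name `rgw_dynamic_resharding` is a standard Ceph setting; the config target (`client.rgw` vs. a specific gateway id) depends on the deployment.

```sh
# Hedged workaround sketch: stop the RGW from re-triggering the reshard
# on every restart by disabling dynamic resharding, then drive the
# reshard manually instead. Run from the Rook toolbox pod.
ceph config set client.rgw rgw_dynamic_resharding false

# Verify the setting took effect
ceph config get client.rgw rgw_dynamic_resharding
```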
Expected behavior: I would expect the RGW not to be killed, so that the reshard can complete. I agree that such a bucket should probably have been pre-sharded, but having the reshard constantly killed is not acceptable either (a pre-sharding sketch follows below).
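For the pre-sharding mentioned above, a minimal sketch using standard `radosgw-admin` commands; the bucket name and shard count are placeholders, not values from the report:

```sh
# Inspect the bucket's current shard count and object count
radosgw-admin bucket stats --bucket=my-big-bucket

# Reshard manually to a shard count sized for the object count
# (Ceph's default target is roughly 100k objects per shard)
radosgw-admin bucket reshard --bucket=my-big-bucket --num-shards=1024
```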
Could we work on a way to disable the liveness probe, or to configure its timeout?
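For reference, later Rook releases expose a probe override on the CephObjectStore CR. A hedged sketch, assuming a version where `healthCheck.livenessProbe` is supported (the field layout may differ across releases):

```yaml
apiVersion: ceph.rook.io/v1
kind: CephObjectStore
metadata:
  name: my-store
  namespace: rook-ceph
spec:
  gateway:
    port: 80
    instances: 1
  # Hedged sketch: disable (or retune) the operator-managed liveness
  # probe so a long-running reshard does not get the RGW killed.
  healthCheck:
    livenessProbe:
      disabled: true
```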
How to reproduce it (minimal and precise):
Create a huge bucket and watch how resharding (to ~1k shards) impacts the RGW; see the shortcut sketch below.
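Uploading 30M+ objects is slow, so a hedged reproduction shortcut is to lower the per-shard object threshold so dynamic resharding triggers with far fewer objects. `rgw_max_objs_per_shard` is a standard Ceph option; the value here is arbitrary:

```sh
# Make dynamic resharding fire early: reshard once a shard holds ~1000
# objects instead of the default ~100000. Run from the Rook toolbox.
ceph config set client.rgw rgw_max_objs_per_shard 1000

# Then upload a few thousand objects to a fresh bucket and watch the
# RGW pod get restarted by its liveness probe:
kubectl -n rook-ceph get pods -w
```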
File(s) to submit:
- Cluster CR (custom resource), typically called `cluster.yaml`, if necessary
- Operator's logs, if necessary
- Crashing pod(s) logs, if necessary
To get logs, use `kubectl -n <namespace> logs <pod name>`.
When pasting logs, always surround them with backticks or use the insert code button from the GitHub UI.
Read the GitHub documentation if you need help.
Environment:
- OS (e.g. from /etc/os-release):
- Kernel (e.g. `uname -a`):
- Cloud provider or hardware configuration:
- Rook version (use `rook version` inside of a Rook Pod):
- Storage backend version (e.g. for Ceph do `ceph -v`):
- Kubernetes version (use `kubectl version`):
- Kubernetes cluster type (e.g. Tectonic, GKE, OpenShift):
- Storage backend status (e.g. for Ceph use `ceph health` in the Rook Ceph toolbox):
About this issue
- State: closed
- Created 4 years ago
- Comments: 19 (16 by maintainers)
Commits related to this issue
- rgw: fix startup probe It's better to set the same handler to startupProbe as livenessProbe. Otherwise, we might hit the following problem. https://github.com/rook/rook/issues/6304 Signed-off-by: S... — committed to rook/rook by satoru-takeuchi 2 years ago (the same commit was carried into the cybozu-go/rook and parth-gr/rook forks)
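The committed fix gives the startupProbe the same handler as the livenessProbe, so a slow-starting RGW (e.g. one resuming a reshard) gets a grace window before liveness checks begin. A hedged sketch of what such probes can look like on the rgw deployment; the path, port, and thresholds are assumptions, not the exact values from the commit:

```yaml
# Hedged sketch: same handler on both probes. The product
# failureThreshold * periodSeconds on the startupProbe bounds how long
# the RGW may take to come up before the livenessProbe starts killing it.
startupProbe:
  httpGet:
    path: /
    port: 8080
  failureThreshold: 18   # ~3 minutes of grace at 10s periods
  periodSeconds: 10
livenessProbe:
  httpGet:
    path: /
    port: 8080
  failureThreshold: 3
  periodSeconds: 10
```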
@leseb sorry, I don’t have those logs anymore. I will try to reproduce this week or next on our test env.