kube: Weird intermittent hangs in tower after updating to 0.69
Current and expected behavior
So this isn't going to be the best error report, because the problem seems to be intermittent, but I'm filing it anyway in case someone else is running into the same problem.
I was on kube 0.64 and everything was working fine, but when I upgraded to 0.69 I noticed that sometimes my k8s resources weren't being created. I enabled all debug logs and it seems to get stuck on `tower::buffer::worker service.ready=true processing request`. I didn't have this issue locally, and it's intermittent.
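For reference, the debug logs above can be enabled in a few ways; one minimal sketch, assuming the app uses the `tracing-subscriber` crate with its `env-filter` feature (the filter string is only an example, not taken from the report):

```rust
use tracing_subscriber::EnvFilter;

fn init_logging() {
    // Honour RUST_LOG if set, otherwise default to debug logs for the
    // crates involved in the hang (kube's client stack, tower, hyper).
    tracing_subscriber::fmt()
        .with_env_filter(
            EnvFilter::try_from_default_env()
                .unwrap_or_else(|_| EnvFilter::new("info,kube=debug,tower=debug,hyper=debug")),
        )
        .init();
}
```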
Since then I have downgraded back to 0.64 and everything seems to be working fine.
Sorry that I couldn't be more helpful here. Maybe I should file this in the tower repo instead. Maybe this is only present after the tower-http 0.2 upgrade. When I have more time I will try a few different things to see if I can narrow it down.
Was curious if anyone else using kube-rs has seen a similar issue.
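For illustration, the requests that stall here are ordinary typed calls going through kube's buffered client; a minimal sketch of such a call (the `ConfigMap` type and `default` namespace are placeholders, not taken from the report):

```rust
use k8s_openapi::api::core::v1::ConfigMap;
use kube::{api::PostParams, Api, Client};

// Any request routed through the client can be affected; ConfigMap is just
// a stand-in resource type.
async fn create_configmap(client: Client, cm: ConfigMap) -> Result<ConfigMap, kube::Error> {
    let api: Api<ConfigMap> = Api::namespaced(client, "default");
    // During one of the hangs, a call like this never resolves even though the
    // last log line is `tower::buffer::worker service.ready=true processing request`.
    api.create(&PostParams::default(), &cm).await
}
```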
Possible solution
No response
Additional context
No response
Environment
- gke cluster (1.20.12-gke.1500)
- debian:bullseye-slim
Configuration and features
No response
Affected crates
kube-client
Would you like to work on fixing this bug?
no
About this issue
- Original URL
- State: closed
- Created 2 years ago
- Reactions: 1
- Comments: 18 (18 by maintainers)
Commits related to this issue
- temp debug for kube-rs/kube-rs#829 Signed-off-by: clux <sszynrae@gmail.com> — committed to kube-rs/controller-rs by clux 2 years ago
Sorry for this 😦

Thanks for getting the fix released quickly @clux @teozkr.

I thought I had tested it, but maybe that was only with the initial version that used `Mutex`. We can also remove `cached_token()` with `Mutex`. What can we do to prevent this in the future? Always update and test `controller-rs` before release?

0.69.1 confirmed working in cluster. Thanks!
0.69.1 is now released, and contains the fix!
The fix is now in `master`, working on getting a 0.69.1 release out the door ASAP.

think i got it. swapping the rwlock for a mutex and taking the lock outside that branch seems to stop it from happening.
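For illustration only (this is not kube's actual code): a sketch of the rwlock-vs-mutex shape described above, using tokio's async locks and a hypothetical token cache. The broken variant awaits a write lock while its own read guard is still alive, so the future never completes and everything waiting on it appears to hang; the fixed shape takes a single `Mutex` once, outside the branch.

```rust
use std::sync::Arc;
use tokio::sync::{Mutex, RwLock};

// Stand-in for re-reading or refreshing a service account token.
async fn refresh_token() -> String {
    "new-token".to_string()
}

// Hazard shape: the read guard is still held when the same task awaits the
// write lock, so the write is never granted and this future hangs forever.
async fn token_with_rwlock(cache: Arc<RwLock<Option<String>>>) -> String {
    let guard = cache.read().await;
    if let Some(tok) = guard.as_ref() {
        return tok.clone();
    }
    let mut writer = cache.write().await; // deadlock: `guard` is still alive here
    let fresh = refresh_token().await;
    *writer = Some(fresh.clone());
    fresh
}

// Fix shape from the comment above: one Mutex, locked once outside the branch,
// so there is no reader-to-writer upgrade across await points.
async fn token_with_mutex(cache: Arc<Mutex<Option<String>>>) -> String {
    let mut guard = cache.lock().await;
    if guard.is_none() {
        *guard = Some(refresh_token().await);
    }
    guard.clone().expect("token was just populated")
}
```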
with the following local change on `controller-rs`, i.e. building from kube-rs branch `test-hangs` (where i reverted the linked pr for token reloading), the issue no longer happens, so this is the cause.

I can reproduce it in the `controller-rs` repo on main now: `tilt up` against `k3d` (and you can `k apply` then `k delete` one of the `yaml/instance*` files to force the controller to do work). After about a minute the controller stops handling messages and prints a bunch of tower and hyper messages:
Update: I upgraded from 0.64 all the way to 0.69. I did not see the issue on any of the versions until 0.69. On 0.69 the issue does not appear right away, only after a minute or 2.

Edit 1: @teozkr diff between 0.64 and 0.69

Edit 2: Diff between 0.68 and 0.69

My plan is to upgrade from 0.64 to 0.69 one version at a time and see when I see the error. I'll report back findings.
Hm, odd. Could you run `cargo update` on the latest build of your app that does work fine for you (whether that ends up using kube-rs 0.64, 0.68, or something in between), and confirm that that still works? And could you post a `diff` between the `Cargo.lock`s of the newest working and oldest broken versions?