kube: Weird intermittent hangs in tower after updating to 0.69

Current and expected behavior

So this isn’t going to be the best error report because the problem seems to be intermittent. But I’m going to make it anyway because maybe someone else might be having the same problem.

I was on kube 0.64, everything was working fine, but when I upgraded to 0.69 I noticed that sometimes my k8s resources weren’t being created. I enabled all debug logs are it seems to get stuck on. tower::buffer::worker service.ready=true processing request. I didn’t have this issue locally. And its intermittent.

Since then I have downgraded back to 0.64 and everything seems to be working fine.

Sorry that I couldn’t be more helpful here. Maybe I should make this in the tower repo. Maybe this is only present after the tower-http 0.2 upgrade. When I have more time I will try a few different things to see if I can narrow it down.

Was curious if anyone else using kube-rs has seen a similar issue.

Possible solution

No response

Additional context

No response

Environment

  • gke cluster (1.20.12-gke.1500)
  • debian:bullseye-slim

Configuration and features

No response

Affected crates

kube-client

Would you like to work on fixing this bug?

no

About this issue

  • Original URL
  • State: closed
  • Created 2 years ago
  • Reactions: 1
  • Comments: 18 (18 by maintainers)

Commits related to this issue

Most upvoted comments

Sorry for this 😦 Thanks for getting the fix released quickly @clux @teozkr.

I thought I had tested it, but maybe that was only with the initial version that used Mutex. We can also remove cached_token() with Mutex.


What can we do to prevent this in the future? Always update and test controller-rs before release?

0.69.1 confirmed working in cluster. Thanks!

0.69.1 is now released, and contains the fix!

The fix is now in master, working on getting a 0.69.1 release out the door ASAP.

think i got it. swapping the rwlock for a mutex and taking the lock outside that branch start seems to stop it from happening

with the following local change on controller-rs:

diff --git Cargo.toml Cargo.toml
index 76fd00e..5b5685e 100644
--- Cargo.toml
+++ Cargo.toml
@@ -51,10 +51,10 @@ tonic = { version = "0.5.2", optional = true }
 
 [dependencies.kube]
 features = ["runtime", "client", "derive"]
-version = "0.69.0"
+#version = "0.69.0"
 
 # testing new releases - ignore
-#git = "https://github.com/kube-rs/kube-rs.git"
+git = "https://github.com/kube-rs/kube-rs.git"
 #rev = "3362657aac00d7f71adce1a5d5deb6aad429ab8d"
-#branch = "master"
+branch = "test-hangs"
 #path = "../kube-rs/kube"

i.e. building from kube-rs branch test-hangs (where i reverted the linked pr for token reloading) the issue no longer happens, so this is the cause.

I can reproduce it in controller-rs repo in main now: tilt up against k3d (and you can k apply then k delete one of the yaml/instance* files to force the controller to work)

after about a minute the controller stops handling messages and it prints a bunch of tower and hyper messages:

foo-controll… │ {"timestamp":"2022-02-16T16:54:52.700113Z","level":"TRACE","fields":{"message":"worker polling for next message"},"target":"tower::buffer::worker"}



foo-controll… │ {"timestamp":"2022-02-16T16:54:55.384825Z","level":"TRACE","fields":{"message":"decode; state=Chunked(Size, 0)"},"target":"hyper::proto::h1::decode"}
foo-controll… │ {"timestamp":"2022-02-16T16:54:55.384847Z","level":"TRACE","fields":{"message":"Read chunk hex size"},"target":"hyper::proto::h1::decode"}
foo-controll… │ {"timestamp":"2022-02-16T16:54:55.384875Z","level":"TRACE","fields":{"message":"received 905 bytes"},"target":"hyper::proto::h1::io"}
foo-controll… │ {"timestamp":"2022-02-16T16:54:55.384883Z","level":"TRACE","fields":{"message":"Read chunk hex size"},"target":"hyper::proto::h1::decode"}
foo-controll… │ {"timestamp":"2022-02-16T16:54:55.384888Z","level":"TRACE","fields":{"message":"Read chunk hex size"},"target":"hyper::proto::h1::decode"}
foo-controll… │ {"timestamp":"2022-02-16T16:54:55.384893Z","level":"TRACE","fields":{"message":"Read chunk hex size"},"target":"hyper::proto::h1::decode"}
foo-controll… │ {"timestamp":"2022-02-16T16:54:55.384897Z","level":"TRACE","fields":{"message":"Chunk size is 898"},"target":"hyper::proto::h1::decode"}
foo-controll… │ {"timestamp":"2022-02-16T16:54:55.384904Z","level":"DEBUG","fields":{"message":"incoming chunked header: 0x382 (898 bytes)"},"target":"hyper::proto::h1::decode"}
foo-controll… │ {"timestamp":"2022-02-16T16:54:55.384910Z","level":"TRACE","fields":{"message":"Chunked read, remaining=898"},"target":"hyper::proto::h1::decode"}
foo-controll… │ {"timestamp":"2022-02-16T16:54:55.384925Z","level":"TRACE","fields":{"message":"flushed({role=client}): State { reading: Body(Chunked(BodyCr, 0)), writing: KeepAlive, keep_alive: Busy }"},"target":"hyper::proto::h1::conn"}
foo-controll… │ {"timestamp":"2022-02-16T16:54:55.384940Z","level":"TRACE","fields":{"message":"flushed({role=client}): State { reading: Body(Chunked(BodyCr, 0)), writing: KeepAlive, keep_alive: Busy }"},"target":"hyper::proto::h1::conn"}
foo-controll… │ {"timestamp":"2022-02-16T16:54:55.384955Z","level":"TRACE","fields":{"message":"decode; state=Chunked(BodyCr, 0)"},"target":"hyper::proto::h1::decode"}
foo-controll… │ {"timestamp":"2022-02-16T16:54:55.384962Z","level":"TRACE","fields":{"message":"Read chunk hex size"},"target":"hyper::proto::h1::decode"}
foo-controll… │ {"timestamp":"2022-02-16T16:54:55.384974Z","level":"TRACE","fields":{"message":"flushed({role=client}): State { reading: Body(Chunked(Size, 0)), writing: KeepAlive, keep_alive: Busy }"},"target":"hyper::proto::h1::conn"}
foo-controll… │ {"timestamp":"2022-02-16T16:54:56.112946Z","level":"TRACE","fields":{"message":"decode; state=Chunked(Size, 0)"},"target":"hyper::proto::h1::decode"}
foo-controll… │ {"timestamp":"2022-02-16T16:54:56.112967Z","level":"TRACE","fields":{"message":"Read chunk hex size"},"target":"hyper::proto::h1::decode"}
foo-controll… │ {"timestamp":"2022-02-16T16:54:56.112992Z","level":"TRACE","fields":{"message":"received 903 bytes"},"target":"hyper::proto::h1::io"}
foo-controll… │ {"timestamp":"2022-02-16T16:54:56.112999Z","level":"TRACE","fields":{"message":"Read chunk hex size"},"target":"hyper::proto::h1::decode"}
foo-controll… │ {"timestamp":"2022-02-16T16:54:56.113003Z","level":"TRACE","fields":{"message":"Read chunk hex size"},"target":"hyper::proto::h1::decode"}
foo-controll… │ {"timestamp":"2022-02-16T16:54:56.113006Z","level":"TRACE","fields":{"message":"Read chunk hex size"},"target":"hyper::proto::h1::decode"}
foo-controll… │ {"timestamp":"2022-02-16T16:54:56.113009Z","level":"TRACE","fields":{"message":"Chunk size is 896"},"target":"hyper::proto::h1::decode"}
foo-controll… │ {"timestamp":"2022-02-16T16:54:56.113014Z","level":"DEBUG","fields":{"message":"incoming chunked header: 0x380 (896 bytes)"},"target":"hyper::proto::h1::decode"}
foo-controll… │ {"timestamp":"2022-02-16T16:54:56.113018Z","level":"TRACE","fields":{"message":"Chunked read, remaining=896"},"target":"hyper::proto::h1::decode"}
foo-controll… │ {"timestamp":"2022-02-16T16:54:56.113030Z","level":"TRACE","fields":{"message":"flushed({role=client}): State { reading: Body(Chunked(BodyCr, 0)), writing: KeepAlive, keep_alive: Busy }"},"target":"hyper::proto::h1::conn"}
foo-controll… │ {"timestamp":"2022-02-16T16:54:56.113066Z","level":"TRACE","fields":{"message":"decode; state=Chunked(BodyCr, 0)"},"target":"hyper::proto::h1::decode"}
foo-controll… │ {"timestamp":"2022-02-16T16:54:56.113074Z","level":"TRACE","fields":{"message":"Read chunk hex size"},"target":"hyper::proto::h1::decode"}
foo-controll… │ {"timestamp":"2022-02-16T16:54:56.113084Z","level":"TRACE","fields":{"message":"flushed({role=client}): State { reading: Body(Chunked(Size, 0)), writing: KeepAlive, keep_alive: Busy }"},"target":"hyper::proto::h1::conn"}
foo-controll… │ {"timestamp":"2022-02-16T16:54:56.113109Z","level":"TRACE","fields":{"message":"Non-error received, resetting backoff"},"target":"kube_runtime::utils::stream_backoff"}
foo-controll… │ {"timestamp":"2022-02-16T16:55:03.040278Z","level":"TRACE","fields":{"message":"decode; state=Chunked(Size, 0)"},"target":"hyper::proto::h1::decode"}
foo-controll… │ {"timestamp":"2022-02-16T16:55:03.040343Z","level":"TRACE","fields":{"message":"Read chunk hex size"},"target":"hyper::proto::h1::decode"}
foo-controll… │ {"timestamp":"2022-02-16T16:55:03.040448Z","level":"TRACE","fields":{"message":"received 118 bytes"},"target":"hyper::proto::h1::io"}
foo-controll… │ {"timestamp":"2022-02-16T16:55:03.040473Z","level":"TRACE","fields":{"message":"Read chunk hex size"},"target":"hyper::proto::h1::decode"}
foo-controll… │ {"timestamp":"2022-02-16T16:55:03.040487Z","level":"TRACE","fields":{"message":"Read chunk hex size"},"target":"hyper::proto::h1::decode"}
foo-controll… │ {"timestamp":"2022-02-16T16:55:03.040502Z","level":"TRACE","fields":{"message":"Chunk size is 112"},"target":"hyper::proto::h1::decode"}
foo-controll… │ {"timestamp":"2022-02-16T16:55:03.040516Z","level":"DEBUG","fields":{"message":"incoming chunked header: 0x70 (112 bytes)"},"target":"hyper::proto::h1::decode"}
foo-controll… │ {"timestamp":"2022-02-16T16:55:03.040542Z","level":"TRACE","fields":{"message":"Chunked read, remaining=112"},"target":"hyper::proto::h1::decode"}
foo-controll… │ {"timestamp":"2022-02-16T16:55:03.040581Z","level":"TRACE","fields":{"message":"flushed({role=client}): State { reading: Body(Chunked(BodyCr, 0)), writing: KeepAlive, keep_alive: Busy }"},"target":"hyper::proto::h1::conn"}
foo-controll… │ {"timestamp":"2022-02-16T16:55:03.040765Z","level":"TRACE","fields":{"message":"decode; state=Chunked(BodyCr, 0)"},"target":"hyper::proto::h1::decode"}
foo-controll… │ {"timestamp":"2022-02-16T16:55:03.040804Z","level":"TRACE","fields":{"message":"Read chunk hex size"},"target":"hyper::proto::h1::decode"}
foo-controll… │ {"timestamp":"2022-02-16T16:55:03.040847Z","level":"TRACE","fields":{"message":"flushed({role=client}): State { reading: Body(Chunked(Size, 0)), writing: KeepAlive, keep_alive: Busy }"},"target":"hyper::proto::h1::conn"}

Update: I upgraded from 0.64 all the way to 0.69. I did not see the issue on any of the versions until 0.69

On 0.69 the issue does not appear right away only after a minute or 2.

Edit 1:

@teozkr diff between 0.64 and 0.69

diff --git a/services/session_operator/Cargo.lock b/services/session_operator/Cargo.lock
index 9931e18..bc29c24 100644
--- a/services/session_operator/Cargo.lock
+++ b/services/session_operator/Cargo.lock
@@ -100,6 +100,17 @@ version = "1.1.0"
 source = "registry+https://github.com/rust-lang/crates.io-index"
 checksum = "d468802bab17cbc0cc575e9b053f41e72aa36bfa6b7f55e3529ffa43161b97fa"
 
+[[package]]
+name = "backoff"
+version = "0.4.0"
+source = "registry+https://github.com/rust-lang/crates.io-index"
+checksum = "b62ddb9cb1ec0a098ad4bbf9344d0713fa193ae1a80af55febcff2627b6a00c1"
+dependencies = [
+ "getrandom 0.2.4",
+ "instant",
+ "rand 0.8.5",
+]
+
 [[package]]
 name = "backtrace"
 version = "0.3.64"
@@ -387,16 +398,6 @@ dependencies = [
  "syn",
 ]
 
-[[package]]
-name = "dashmap"
-version = "4.0.2"
-source = "registry+https://github.com/rust-lang/crates.io-index"
-checksum = "e77a43b28d0668df09411cb0bc9a8c2adc40f9a048afe863e05fd43251e8e39c"
-dependencies = [
- "cfg-if",
- "num_cpus",
-]
-
 [[package]]
 name = "data-encoding"
 version = "2.3.2"
@@ -444,6 +445,12 @@ dependencies = [
  "winapi",
 ]
 
+[[package]]
+name = "dyn-clone"
+version = "1.0.4"
+source = "registry+https://github.com/rust-lang/crates.io-index"
+checksum = "ee2626afccd7561a06cf1367e2950c4718ea04565e20fb5029b6c7d8ad09abcf"
+
 [[package]]
 name = "ed25519"
 version = "1.3.0"
@@ -731,6 +738,12 @@ dependencies = [
  "pin-project-lite",
 ]
 
+[[package]]
+name = "http-range-header"
+version = "0.3.0"
+source = "registry+https://github.com/rust-lang/crates.io-index"
+checksum = "0bfe8eed0a9285ef776bb792479ea3834e8b94e13d615c2f66d03dd50a435a29"
+
 [[package]]
 name = "httparse"
 version = "1.6.0"
@@ -890,9 +903,9 @@ dependencies = [
 
 [[package]]
 name = "k8s-openapi"
-version = "0.13.1"
+version = "0.14.0"
 source = "registry+https://github.com/rust-lang/crates.io-index"
-checksum = "4f8de9873b904e74b3533f77493731ee26742418077503683db44e1b3c54aa5c"
+checksum = "0489fc937cc7616a9abfa61bf39c250d7e32e1325ef028c8d9278dd24ea395b3"
 dependencies = [
  "base64",
  "bytes",
@@ -904,9 +917,9 @@ dependencies = [
 
 [[package]]
 name = "kube"
-version = "0.64.0"
+version = "0.69.0"
 source = "registry+https://github.com/rust-lang/crates.io-index"
-checksum = "84dcc2f8ca3f2427a72acc31fa9538159f6b33a97002e315a3fcd5323cf51a2b"
+checksum = "1f06ab5c1d0d0bffbc96240728792fbe0f50139cf1320c285c8dc8eb0b4a09da"
 dependencies = [
  "k8s-openapi",
  "kube-client",
@@ -917,9 +930,9 @@ dependencies = [
 
 [[package]]
 name = "kube-client"
-version = "0.64.0"
+version = "0.69.0"
 source = "registry+https://github.com/rust-lang/crates.io-index"
-checksum = "8957106140aa24a76de3f7d005966f381b30a4cd6a9c003b3bba6828e9617535"
+checksum = "600ef6c2724d235b5fc725254e08d7bd1fd8de95976ca9c99d060b1cb7ce91e9"
 dependencies = [
  "base64",
  "bytes",
@@ -938,6 +951,7 @@ dependencies = [
  "openssl",
  "pem",
  "pin-project",
+ "secrecy",
  "serde",
  "serde_json",
  "serde_yaml",
@@ -952,9 +966,9 @@ dependencies = [
 
 [[package]]
 name = "kube-core"
-version = "0.64.0"
+version = "0.69.0"
 source = "registry+https://github.com/rust-lang/crates.io-index"
-checksum = "2ec73e7d8e937dd055d962af06e635e262fdb6ed341c36ecf659d4fece0a8005"
+checksum = "2a87945c0eccb682d387c3400c50c3bd8750d80127919a0a25119d68e7bd5d0a"
 dependencies = [
  "chrono",
  "form_urlencoded",
@@ -962,6 +976,7 @@ dependencies = [
  "json-patch",
  "k8s-openapi",
  "once_cell",
+ "schemars",
  "serde",
  "serde_json",
  "thiserror",
@@ -969,9 +984,9 @@ dependencies = [
 
 [[package]]
 name = "kube-derive"
-version = "0.64.0"
+version = "0.69.0"
 source = "registry+https://github.com/rust-lang/crates.io-index"
-checksum = "c6651bfae82bc23439da1099174b52bcbf68df065dc33317c912e3c5c5cea43c"
+checksum = "b8f6527537a5a33679223b354401e328f2d665d805f7c8468b73a7cd631789e9"
 dependencies = [
  "darling",
  "proc-macro2",
@@ -982,16 +997,18 @@ dependencies = [
 
 [[package]]
 name = "kube-runtime"
-version = "0.64.0"
+version = "0.69.0"
 source = "registry+https://github.com/rust-lang/crates.io-index"
-checksum = "6b090d3d7b43e2d60fa93ca51b19fe9f2e05a5252c97880fe834f8fa9f2de605"
+checksum = "dc3a141c952b29f5ef3a1f0ec397dcf1a473be55f162bceac9ebec046c524330"
 dependencies = [
- "dashmap",
+ "ahash",
+ "backoff",
  "derivative",
  "futures",
  "json-patch",
  "k8s-openapi",
  "kube-client",
+ "parking_lot",
  "pin-project",
  "serde",
  "serde_json",
@@ -1630,6 +1647,30 @@ dependencies = [
  "winapi",
 ]
 
+[[package]]
+name = "schemars"
+version = "0.8.8"
+source = "registry+https://github.com/rust-lang/crates.io-index"
+checksum = "c6b5a3c80cea1ab61f4260238409510e814e38b4b563c06044edf91e7dc070e3"
+dependencies = [
+ "dyn-clone",
+ "schemars_derive",
+ "serde",
+ "serde_json",
+]
+
+[[package]]
+name = "schemars_derive"
+version = "0.8.8"
+source = "registry+https://github.com/rust-lang/crates.io-index"
+checksum = "41ae4dce13e8614c46ac3c38ef1c0d668b101df6ac39817aebdaa26642ddae9b"
+dependencies = [
+ "proc-macro2",
+ "quote",
+ "serde_derive_internals",
+ "syn",
+]
+
 [[package]]
 name = "scopeguard"
 version = "1.1.0"
@@ -1646,6 +1687,16 @@ dependencies = [
  "untrusted",
 ]
 
+[[package]]
+name = "secrecy"
+version = "0.8.0"
+source = "registry+https://github.com/rust-lang/crates.io-index"
+checksum = "9bd1c54ea06cfd2f6b63219704de0b9b4f72dcc2b8fdef820be6cd799780e91e"
+dependencies = [
+ "serde",
+ "zeroize",
+]
+
 [[package]]
 name = "security-framework"
 version = "2.6.1"
@@ -1699,6 +1750,17 @@ dependencies = [
  "syn",
 ]
 
+[[package]]
+name = "serde_derive_internals"
+version = "0.25.0"
+source = "registry+https://github.com/rust-lang/crates.io-index"
+checksum = "1dbab34ca63057a1f15280bdf3c39f2b1eb1b54c17e98360e511637aef7418c6"
+dependencies = [
+ "proc-macro2",
+ "quote",
+ "syn",
+]
+
 [[package]]
 name = "serde_json"
 version = "1.0.79"
@@ -2092,17 +2154,19 @@ dependencies = [
 
 [[package]]
 name = "tower-http"
-version = "0.1.3"
+version = "0.2.2"
 source = "registry+https://github.com/rust-lang/crates.io-index"
-checksum = "81eca72647e58054bbfa41e6f297c23436f1c60aff6e5eb38455a0f9ca420bb5"
+checksum = "33ed53e1fd082268841975cb39587fa9fca9a7a30da84f97818ef667c97f2872"
 dependencies = [
  "base64",
+ "bitflags",
  "bytes",
  "futures-core",
  "futures-util",
  "http",
  "http-body",
- "pin-project",
+ "http-range-header",
+ "pin-project-lite",
  "tower-layer",
  "tower-service",
  "tracing",

Edit 2

Diff between 0.68 - 0.69

diff --git a/services/session_operator/Cargo.lock b/services/session_operator/Cargo.lock
index bb70dab..bc29c24 100644
--- a/services/session_operator/Cargo.lock
+++ b/services/session_operator/Cargo.lock
@@ -917,9 +917,9 @@ dependencies = [
 
 [[package]]
 name = "kube"
-version = "0.68.0"
+version = "0.69.0"
 source = "registry+https://github.com/rust-lang/crates.io-index"
-checksum = "d41f782ddd187a0d8965607679bbd741052106b75330696ce7d7712b13194d88"
+checksum = "1f06ab5c1d0d0bffbc96240728792fbe0f50139cf1320c285c8dc8eb0b4a09da"
 dependencies = [
  "k8s-openapi",
  "kube-client",
@@ -930,9 +930,9 @@ dependencies = [
 
 [[package]]
 name = "kube-client"
-version = "0.68.0"
+version = "0.69.0"
 source = "registry+https://github.com/rust-lang/crates.io-index"
-checksum = "cced1a3e738c2a4bccb52d33066632369c668f366a71c9c58139d9360e1a341d"
+checksum = "600ef6c2724d235b5fc725254e08d7bd1fd8de95976ca9c99d060b1cb7ce91e9"
 dependencies = [
  "base64",
  "bytes",
@@ -966,9 +966,9 @@ dependencies = [
 
 [[package]]
 name = "kube-core"
-version = "0.68.0"
+version = "0.69.0"
 source = "registry+https://github.com/rust-lang/crates.io-index"
-checksum = "6878656f5e349ef08b0a48d2b1841f1e68c4cf1ef42524feadad2f9a109dd888"
+checksum = "2a87945c0eccb682d387c3400c50c3bd8750d80127919a0a25119d68e7bd5d0a"
 dependencies = [
  "chrono",
  "form_urlencoded",
@@ -984,9 +984,9 @@ dependencies = [
 
 [[package]]
 name = "kube-derive"
-version = "0.68.0"
+version = "0.69.0"
 source = "registry+https://github.com/rust-lang/crates.io-index"
-checksum = "73f5fa510a64791242047fea4d807a5da06514ee714cace4b942d38a1126f0a8"
+checksum = "b8f6527537a5a33679223b354401e328f2d665d805f7c8468b73a7cd631789e9"
 dependencies = [
  "darling",
  "proc-macro2",
@@ -997,9 +997,9 @@ dependencies = [
 
 [[package]]
 name = "kube-runtime"
-version = "0.68.0"
+version = "0.69.0"
 source = "registry+https://github.com/rust-lang/crates.io-index"
-checksum = "0990fe3ab89a1217e6a4d4952e1ab0a24783265ed413228efbc489a00f86b3da"
+checksum = "dc3a141c952b29f5ef3a1f0ec397dcf1a473be55f162bceac9ebec046c524330"
 dependencies = [
  "ahash",
  "backoff",

My plan is to upgrade from 0.64 - 0.69 one version at a time and see when I see the error. I’ll report back findings.

Hm, odd.

  1. Can you confirm that the bug is new in 0.69, and does not happen with 0.68?
  2. Would you mind running cargo update on the latest build of your app that does work fine for you (whether that ends up using kube-rs 0.64, 0.68, or something in between), and confirm that that still works?
  3. Would you mind posting the diff between the Cargo.locks of the newest working and oldest broken versions?