linkerd2: linkerd-proxy crashes with "supplied instant is later than self" (AWS EC2/EKS)
What is the issue?
Linkerd proxy crashes intermittently with the following error message:
thread 'main' panicked at 'supplied instant is later than self', library/std/src/time.rs:281:48
thread 'main' panicked at 'supplied instant is later than self', library/std/src/time.rs:281:48
stack backtrace:
0: 0x55ca07b4ba84 - <unknown>
1: 0x55ca0713d55c - <unknown>
...
37: 0x55ca0708129a - <unknown>
38: 0x0 - <unknown>
thread panicked while panicking. aborting.
How can it be reproduced?
Deploy Linkerd 2.11.1-stable to AWS EKS and wait for the proxy crashes to occur.
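A minimal reproduction sketch, assuming the linkerd CLI at the matching release, kubectl pointed at the EKS cluster, and placeholder names (<namespace>, <deployment>, <pod>) for whichever workload is meshed:
# Install the control plane at the affected version and verify it
linkerd install | kubectl apply -f -
linkerd check
# Mesh an existing workload so it receives a linkerd-proxy sidecar
kubectl get deploy <deployment> -n <namespace> -o yaml | linkerd inject - | kubectl apply -f -
# Watch for restarts; the panic appears in the crashed proxy container's logs
kubectl get pods -n <namespace> -w
kubectl logs <pod> -n <namespace> -c linkerd-proxy --previous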
Logs, error output, etc
- OS and kernel version
[ssm-user@ip-10-0-20-45 bin]$ cat /etc/os-release
NAME="Amazon Linux"
VERSION="2"
ID="amzn"
ID_LIKE="centos rhel fedora"
VERSION_ID="2"PRETTY_NAME="Amazon Linux 2"ANSI_COLOR="0;33"CPE_NAME="cpe:2.3:o:amazon:amazon_linux:2"
HOME_URL="https://amazonlinux.com/"
- Output for one core from /proc/cpuinfo
[ssm-user@ip-10-0-20-45 bin]$ cat /proc/cpuinfo
processor : 0
vendor_id : AuthenticAMD
cpu family : 23
model : 1
model name : AMD EPYC 7571
stepping : 2
microcode : 0x800126c
cpu MHz : 2199.758
cache size : 512 KB
physical id : 0
siblings : 2
core id : 0
cpu cores : 1
apicid : 0
initial apicid : 0
fpu : yes
fpu_exception : yes
cpuid level : 13
wp : yes
flags : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx mmxext fxsr_opt pdpe1gb rdtscp lm constant_tsc rep_good nopl nonstop_tsc cpuid extd_apicid tsc_known_freq pni pclmulqdq ssse3 fma cx16 sse4_1 sse4_2 movbe popcnt aes xsave avx f16c rdrand hypervisor lahf_lm cmp_legacy cr8_legacy abm sse4a misalignsse 3dnowprefetch topoext vmmcall fsgsbase bmi1 avx2 smep bmi2 rdseed adx smap clflushopt sha_ni xsaveopt xsavec xgetbv1 clzero xsaveerptr arat npt nrip_save
bugs : sysret_ss_attrs null_seg spectre_v1 spectre_v2 spec_store_bypass
bogomips : 4399.51
TLB size : 2560 4K pages
clflush size : 64
cache_alignment : 64
address sizes : 48 bits physical, 48 bits virtual
power management:
processor : 1
vendor_id : AuthenticAMD
cpu family : 23
model : 1
model name : AMD EPYC 7571
stepping : 2
microcode : 0x800126c
cpu MHz : 2199.758
cache size : 512 KB
physical id : 0
siblings : 2
core id : 0
cpu cores : 1
apicid : 1
initial apicid : 1
fpu : yes
fpu_exception : yes
cpuid level : 13
wp : yes
flags : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx mmxext fxsr_opt pdpe1gb rdtscp lm constant_tsc rep_good nopl nonstop_tsc cpuid extd_apicid tsc_known_freq pni pclmulqdq ssse3 fma cx16 sse4_1 sse4_2 movbe popcnt aes xsave avx f16c rdrand hypervisor lahf_lm cmp_legacy cr8_legacy abm sse4a misalignsse 3dnowprefetch topoext vmmcall fsgsbase bmi1 avx2 smep bmi2 rdseed adx smap clflushopt sha_ni xsaveopt xsavec xgetbv1 clzero xsaveerptr arat npt nrip_save
bugs : sysret_ss_attrs null_seg spectre_v1 spectre_v2 spec_store_bypass
bogomips : 4399.51
TLB size : 2560 4K pages
clflush size : 64
cache_alignment : 64
address sizes : 48 bits physical, 48 bits virtual
power management:
- hypervisor if the system is virtualized
[ssm-user@ip-10-0-20-45 bin]$ ls /sys/hypervisor/
[ssm-user@ip-10-0-20-45 bin]$
- selected clock source
[ssm-user@ip-10-0-20-45 bin]$ cat /sys/devices/system/clocksource/clocksource0/current_clocksource
tsc
- Output of linkerd check -o short
13:36 $ linkerd check -o short
Linkerd core checks
===================
Status check results are √
Linkerd extensions checks
=========================
Status check results are √
Environment
- Kubernetes Version: 1.21
- Cluster Environment: AWS EKS
- Host OS: Amazon Linux
- Linkerd version: 2.11.1-stable
Possible solution
No response
Additional context
No response
Would you like to work on fixing this bug?
No response
About this issue
- Original URL
- State: closed
- Created 2 years ago
- Comments: 15 (7 by maintainers)
Commits related to this issue
- Ban uses of `Instant` operations that can panic When comparing instances, we should use saturating varieties to help ensure that we can't hit panics. This change bans uses of `std::time::Instant::{d... — committed to linkerd/linkerd2-proxy by olix0r 2 years ago
- Avoid panics in uses of `Instant` We have reports of runtime panics (linkerd/linkerd2#7748) that sound a lot like rust-lang/rust#86470. We don't have any evidence that these panics originate in hyper... — committed to olix0r/hyper by olix0r 2 years ago
- Avoid time operations that can panic We have reports of runtime panics (linkerd/linkerd2#7748) that sound a lot like rust-lang/rust#86470. We don't have any evidence that these panics originate in hy... — committed to tower-rs/tower by olix0r 2 years ago
- Avoid time operations that can panic We have reports of runtime panics (linkerd/linkerd2#7748) that sound a lot like rust-lang/rust#86470. We don't have any evidence that these panics originate in to... — committed to tower-rs/tower by olix0r 2 years ago
- Avoid time operations that can panic We have reports of runtime panics (linkerd/linkerd2#7748) that sound a lot like rust-lang/rust#86470. We don't have any evidence that these panics originate in to... — committed to tower-rs/tower by olix0r 2 years ago
- Avoid time operations that can panic We have reports of runtime panics (linkerd/linkerd2#7748) that sound a lot like rust-lang/rust#86470. We don't have any evidence that these panics originate in h2... — committed to hyperium/h2 by olix0r 2 years ago
- Avoid time operations that can panic We have reports of runtime panics (linkerd/linkerd2#7748) that sound a lot like rust-lang/rust#86470. We don't have any evidence that these panics originate in h2... — committed to hyperium/h2 by olix0r 2 years ago
- Avoid time operations that can panic We have reports of runtime panics (linkerd/linkerd2#7748) that sound a lot like rust-lang/rust#86470. We don't have any evidence that these panics originate in to... — committed to tower-rs/tower by olix0r 2 years ago
- Ban uses of `Instant` operations that can panic (#1456) When comparing instances, we should use saturating varieties to help ensure that we can't hit panics. This change bans uses of `std::time::... — committed to linkerd/linkerd2-proxy by olix0r 2 years ago
- Pin git deps for dependencies that have `Instant` workarounds tokio & tower have been patched to avoid issues described in linkerd/linkerd2#7748, but they have not yet been released. This change pins... — committed to linkerd/linkerd2-proxy by olix0r 2 years ago
- Pin git deps for dependencies that have `Instant` workarounds tokio & tower have been patched to avoid issues described in linkerd/linkerd2#7748, but they have not yet been released. This change pins... — committed to linkerd/linkerd2-proxy by olix0r 2 years ago
- Pin git deps for dependencies that have `Instant` workarounds (#1497) tokio & tower have been patched to avoid issues described in linkerd/linkerd2#7748, but they have not yet been released. This ch... — committed to linkerd/linkerd2-proxy by olix0r 2 years ago
- Ban uses of `Instant` operations that can panic (#1456) When comparing instances, we should use saturating varieties to help ensure that we can't hit panics. This change bans uses of `std::time::Ins... — committed to linkerd/linkerd2-proxy by olix0r 2 years ago
@fcrespofastly As mentioned previously, this is a bug in the interaction between the Rust standard library and Amazon Linux, which has a buggy time source, so it’s going to be difficult for us to completely eliminate this issue until it is fixed upstream.
That said, we’ve put in place workarounds in linkerd2-proxy and several ecosystem projects (tokio, tower, hyper) that should reduce the likelihood of encountering this bug. I’ve put up linkerd/linkerd2-proxy#1497 to pin git dependencies while we wait for tokio & tower to cut proper releases, and I’ve published a proxy build with these changes.
You can use this build by setting namespace/workload annotations:
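For illustration only (the exact image and tag the maintainer published are not reproduced here), the override uses Linkerd's standard proxy-image/proxy-version annotations on a namespace or workload, roughly:
# Placeholder image/tag; substitute the build referenced above
kubectl annotate namespace <namespace> \
  config.linkerd.io/proxy-image=<proxy-image> \
  config.linkerd.io/proxy-version=<proxy-tag>
# Meshed pods pick up the new proxy on their next restart
kubectl rollout restart deploy -n <namespace>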
Or set it globally by upgrading with the appropriate Helm values:
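A hedged sketch, assuming the linkerd2 Helm chart's proxy.image values (value names can differ between chart versions) and the same placeholder image/tag as above:
# Placeholder image/tag; value names may vary by chart version
helm upgrade linkerd2 linkerd/linkerd2 \
  --set proxy.image.name=<proxy-image> \
  --set proxy.image.version=<proxy-tag>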
Same here:
t3.2xlarge
Thanks @virenrshah. We’ve got a few workarounds that will become available as our dependencies release new versions. In the meantime, you could try engaging AWS support or reprovisioning impacted nodes.
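A rough sketch of reprovisioning an impacted node, assuming it belongs to an EKS node group (or auto scaling group) that replaces terminated instances; the node name and instance ID are placeholders:
# Stop scheduling onto the node and evict its pods
kubectl cordon <node-name>
kubectl drain <node-name> --ignore-daemonsets --delete-emptydir-data
# Terminate the instance so the node group provisions a fresh one
aws ec2 terminate-instances --instance-ids <instance-id>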
I’m told that AWS has reproduced the issue but I’m not aware of how long it will take for fixes to be available on their side.