linkerd2: linkerd-proxy crashes with "supplied instant is later than self" (AWS EC2/EKS)
What is the issue?
Linkerd proxy crashes intermittently with the following error message:
thread 'main' panicked at 'supplied instant is later than self', library/std/src/time.rs:281:48
thread 'main' panicked at 'supplied instant is later than self', library/std/src/time.rs:281:48
stack backtrace:
0: 0x55ca07b4ba84 - <unknown>
1: 0x55ca0713d55c - <unknown>
...
37: 0x55ca0708129a - <unknown>
38: 0x0 - <unknown>
thread panicked while panicking. aborting.
How can it be reproduced?
Deploy Linkerd 2.11.1-stable to AWS EKS and wait for the proxy crashes to occur.
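A minimal reproduction sketch, assuming the linkerd CLI at the matching release, kubectl pointed at the EKS cluster, and placeholder names (<namespace>, <deployment>, <pod>) for whichever workload is meshed:
# Install the control plane at the affected version and verify it
linkerd install | kubectl apply -f -
linkerd check
# Mesh an existing workload so it receives a linkerd-proxy sidecar
kubectl get deploy <deployment> -n <namespace> -o yaml | linkerd inject - | kubectl apply -f -
# Watch for restarts; the panic appears in the crashed proxy container's logs
kubectl get pods -n <namespace> -w
kubectl logs <pod> -n <namespace> -c linkerd-proxy --previous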
Logs, error output, etc
- OS and kernel version
[ssm-user@ip-10-0-20-45 bin]$ cat /etc/os-release
NAME="Amazon Linux"
VERSION="2"
ID="amzn"
ID_LIKE="centos rhel fedora"
VERSION_ID="2"PRETTY_NAME="Amazon Linux 2"ANSI_COLOR="0;33"CPE_NAME="cpe:2.3:o:amazon:amazon_linux:2"
HOME_URL="https://amazonlinux.com/"
- Output for one core from /proc/cpuinfo
[ssm-user@ip-10-0-20-45 bin]$ cat /proc/cpuinfo
processor : 0
vendor_id : AuthenticAMD
cpu family : 23
model : 1
model name : AMD EPYC 7571
stepping : 2
microcode : 0x800126c
cpu MHz : 2199.758
cache size : 512 KB
physical id : 0
siblings : 2
core id : 0
cpu cores : 1
apicid : 0
initial apicid : 0
fpu : yes
fpu_exception : yes
cpuid level : 13
wp : yes
flags : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx mmxext fxsr_opt pdpe1gb rdtscp lm constant_tsc rep_good nopl nonstop_tsc cpuid extd_apicid tsc_known_freq pni pclmulqdq ssse3 fma cx16 sse4_1 sse4_2 movbe popcnt aes xsave avx f16c rdrand hypervisor lahf_lm cmp_legacy cr8_legacy abm sse4a misalignsse 3dnowprefetch topoext vmmcall fsgsbase bmi1 avx2 smep bmi2 rdseed adx smap clflushopt sha_ni xsaveopt xsavec xgetbv1 clzero xsaveerptr arat npt nrip_save
bugs : sysret_ss_attrs null_seg spectre_v1 spectre_v2 spec_store_bypass
bogomips : 4399.51
TLB size : 2560 4K pages
clflush size : 64
cache_alignment : 64
address sizes : 48 bits physical, 48 bits virtual
power management:
processor : 1
vendor_id : AuthenticAMD
cpu family : 23
model : 1
model name : AMD EPYC 7571
stepping : 2
microcode : 0x800126c
cpu MHz : 2199.758
cache size : 512 KB
physical id : 0
siblings : 2
core id : 0
cpu cores : 1
apicid : 1
initial apicid : 1
fpu : yes
fpu_exception : yes
cpuid level : 13
wp : yes
flags : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx mmxext fxsr_opt pdpe1gb rdtscp lm constant_tsc rep_good nopl nonstop_tsc cpuid extd_apicid tsc_known_freq pni pclmulqdq ssse3 fma cx16 sse4_1 sse4_2 movbe popcnt aes xsave avx f16c rdrand hypervisor lahf_lm cmp_legacy cr8_legacy abm sse4a misalignsse 3dnowprefetch topoext vmmcall fsgsbase bmi1 avx2 smep bmi2 rdseed adx smap clflushopt sha_ni xsaveopt xsavec xgetbv1 clzero xsaveerptr arat npt nrip_save
bugs : sysret_ss_attrs null_seg spectre_v1 spectre_v2 spec_store_bypass
bogomips : 4399.51
TLB size : 2560 4K pages
clflush size : 64
cache_alignment : 64
address sizes : 48 bits physical, 48 bits virtual
power management:
- hypervisor if the system is virtualized
[ssm-user@ip-10-0-20-45 bin]$ ls /sys/hypervisor/
[ssm-user@ip-10-0-20-45 bin]$
- selected clock source
[ssm-user@ip-10-0-20-45 bin]$ cat /sys/devices/system/clocksource/clocksource0/current_clocksource
tsc
- Output of linkerd check -o short
13:36 $ linkerd check -o short
Linkerd core checks
===================
Status check results are √
Linkerd extensions checks
=========================
Status check results are √
Environment
- Kubernetes Version: 1.21
- Cluster Environment: AWS EKS
- Host OS: Amazon Linux
- Linkerd version: 2.11.1-stable
Possible solution
No response
Additional context
No response
Would you like to work on fixing this bug?
No response
About this issue
- Original URL
- State: closed
- Created 2 years ago
- Comments: 15 (7 by maintainers)
Commits related to this issue
- Ban uses of `Instant` operations that can panic When comparing instances, we should use saturating varieties to help ensure that we can't hit panics. This change bans uses of `std::time::Instant::{d... — committed to linkerd/linkerd2-proxy by olix0r 2 years ago
- Avoid panics in uses of `Instant` We have reports of runtime panics (linkerd/linkerd2#7748) that sound a lot like rust-lang/rust#86470. We don't have any evidence that these panics originate in hyper... — committed to olix0r/hyper by olix0r 2 years ago
- Avoid time operations that can panic We have reports of runtime panics (linkerd/linkerd2#7748) that sound a lot like rust-lang/rust#86470. We don't have any evidence that these panics originate in hy... — committed to tower-rs/tower by olix0r 2 years ago
- Avoid time operations that can panic We have reports of runtime panics (linkerd/linkerd2#7748) that sound a lot like rust-lang/rust#86470. We don't have any evidence that these panics originate in to... — committed to tower-rs/tower by olix0r 2 years ago
- Avoid time operations that can panic We have reports of runtime panics (linkerd/linkerd2#7748) that sound a lot like rust-lang/rust#86470. We don't have any evidence that these panics originate in to... — committed to tower-rs/tower by olix0r 2 years ago
- Avoid time operations that can panic We have reports of runtime panics (linkerd/linkerd2#7748) that sound a lot like rust-lang/rust#86470. We don't have any evidence that these panics originate in h2... — committed to hyperium/h2 by olix0r 2 years ago
- Avoid time operations that can panic We have reports of runtime panics (linkerd/linkerd2#7748) that sound a lot like rust-lang/rust#86470. We don't have any evidence that these panics originate in h2... — committed to hyperium/h2 by olix0r 2 years ago
- Avoid time operations that can panic We have reports of runtime panics (linkerd/linkerd2#7748) that sound a lot like rust-lang/rust#86470. We don't have any evidence that these panics originate in to... — committed to tower-rs/tower by olix0r 2 years ago
- Ban uses of `Instant` operations that can panic (#1456) When comparing instances, we should use saturating varieties to help ensure that we can't hit panics. This change bans uses of `std::time::... — committed to linkerd/linkerd2-proxy by olix0r 2 years ago
- Pin git deps for dependencies that have `Instant` workarounds tokio & tower have been patched to avoid issues described in linkerd/linkerd2#7748, but they have not yet been released. This change pins... — committed to linkerd/linkerd2-proxy by olix0r 2 years ago
- Pin git deps for dependencies that have `Instant` workarounds tokio & tower have been patched to avoid issues described in linkerd/linkerd2#7748, but they have not yet been released. This change pins... — committed to linkerd/linkerd2-proxy by olix0r 2 years ago
- Pin git deps for dependencies that have `Instant` workarounds (#1497) tokio & tower have been patched to avoid issues described in linkerd/linkerd2#7748, but they have not yet been released. This ch... — committed to linkerd/linkerd2-proxy by olix0r 2 years ago
- Ban uses of `Instant` operations that can panic (#1456) When comparing instances, we should use saturating varieties to help ensure that we can't hit panics. This change bans uses of `std::time::Ins... — committed to linkerd/linkerd2-proxy by olix0r 2 years ago
@fcrespofastly As mentioned previously, this is a bug in the interaction between the Rust standard library and Amazon Linux, which has a buggy time source, so it’s going to be difficult for us to completely eliminate this issue until it is fixed upstream.
That said, we’ve put in place workarounds in linkerd2-proxy and several ecosystem projects (tokio, tower, hyper) that should reduce the likelihood of encountering this bug. I’ve put up linkerd/linkerd2-proxy#1497 to pin git dependencies while we wait for tokio & tower to cut proper releases, and I’ve published a proxy build with these changes.
You can use this build by setting namespace/workload annotations:
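For illustration only (the exact image and tag the maintainer published are not reproduced here), the override uses Linkerd's standard proxy-image/proxy-version annotations on a namespace or workload, roughly:
# Placeholder image/tag; substitute the build referenced above
kubectl annotate namespace <namespace> \
  config.linkerd.io/proxy-image=<proxy-image> \
  config.linkerd.io/proxy-version=<proxy-tag>
# Meshed pods pick up the new proxy on their next restart
kubectl rollout restart deploy -n <namespace>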
Or set it globally by upgrading with the appropriate Helm values:
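A hedged sketch, assuming the linkerd2 Helm chart's proxy.image values (value names can differ between chart versions) and the same placeholder image/tag as above:
# Placeholder image/tag; value names may vary by chart version
helm upgrade linkerd2 linkerd/linkerd2 \
  --set proxy.image.name=<proxy-image> \
  --set proxy.image.version=<proxy-tag>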
Same here:
t3.2xlarge
Thanks @virenrshah. We’ve got a few workarounds that will become available as our dependencies release new versions. In the meantime, you could try engaging AWS support or reprovisioning impacted nodes.
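A rough sketch of reprovisioning an impacted node, assuming it belongs to an EKS node group (or auto scaling group) that replaces terminated instances; the node name and instance ID are placeholders:
# Stop scheduling onto the node and evict its pods
kubectl cordon <node-name>
kubectl drain <node-name> --ignore-daemonsets --delete-emptydir-data
# Terminate the instance so the node group provisions a fresh one
aws ec2 terminate-instances --instance-ids <instance-id>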
I’m told that AWS has reproduced the issue but I’m not aware of how long it will take for fixes to be available on their side.