cilium: CI: K8sVerifier Runs the kernel verifier against Cilium's BPF datapath: libbpf: Error in bpf_object__probe_loading():Operation not permitted(1)
Test Name
K8sVerifier Runs the kernel verifier against Cilium's BPF datapath
Failure Output
FAIL: Failed to load BPF program bpf_lxc with datapath configuration:
-DSKIP_DEBUG=1 -DENABLE_IPV4=1 -DENABLE_IPV6=1 -DENABLE_HOST_SERVICES_TCP=1 -DENABLE_HOST_SERVICES_UDP=1 -DENABLE_HOST_REDIRECT=1 -DENABLE_ROUTING=1 -DNO_REDIRECT=1 -DPOLICY_VERDICT_NOTIFY=1 -DALLOW_ICMP_FRAG_NEEDED=1 -DENABLE_IDENTITY_MARK=1 -DMONITOR_AGGREGATION=3 -DCT_REPORT_FLAGS=0x0002 -DENABLE_HOST_FIREWALL=1 -DENABLE_ICMP_RULE=1 -DENABLE_CUSTOM_CALLS=1 -DENABLE_IPSEC=1 -DIP_POOLS=1 -DENCAP_IFINDEX=1 -DTUNNEL_MODE=1
Stack Trace
Expected command: kubectl exec -n default test-verifier -- env TC_PROGS="" XDP_PROGS="" CG_PROGS="" TC_PROGS="bpf_lxc" ./test/bpf/verifier-test.sh
To succeed, but it failed:
Exitcode: 1
Err: exit status 1
Standard Output
=> Loading bpf_lxc.c:from-container...
Standard Error
[..]
libbpf: prog '__send_drop_notify': unrecognized ELF section name '2/1'
libbpf: prog 'tail_icmp6_send_echo_reply': unrecognized ELF section name '2/3'
libbpf: prog 'tail_icmp6_send_time_exceeded': unrecognized ELF section name '2/5'
libbpf: prog 'tail_icmp6_handle_ns': unrecognized ELF section name '2/4'
libbpf: prog 'tail_handle_ipv6_cont': unrecognized ELF section name '2/26'
libbpf: prog 'tail_ipv6_ct_egress': unrecognized ELF section name '2/32'
libbpf: prog 'tail_handle_ipv6': unrecognized ELF section name '2/10'
libbpf: prog 'tail_handle_ipv4_cont': unrecognized ELF section name '2/25'
libbpf: prog 'tail_ipv4_ct_egress': unrecognized ELF section name '2/29'
libbpf: prog 'tail_handle_ipv4': unrecognized ELF section name '2/7'
libbpf: prog 'tail_handle_arp': unrecognized ELF section name '2/6'
libbpf: prog 'handle_xgress': unrecognized ELF section name 'from-container'
libbpf: prog 'tail_ipv6_policy': unrecognized ELF section name '2/12'
libbpf: prog 'tail_ipv6_to_endpoint': unrecognized ELF section name '2/14'
libbpf: prog 'tail_ipv6_ct_ingress_policy_only': unrecognized ELF section name '2/31'
libbpf: prog 'tail_ipv6_ct_ingress': unrecognized ELF section name '2/30'
libbpf: prog 'tail_ipv4_policy': unrecognized ELF section name '2/11'
libbpf: prog 'tail_ipv4_to_endpoint': unrecognized ELF section name '2/13'
libbpf: prog 'tail_ipv4_ct_ingress_policy_only': unrecognized ELF section name '2/28'
libbpf: prog 'tail_ipv4_ct_ingress': unrecognized ELF section name '2/27'
libbpf: prog 'handle_policy': unrecognized ELF section name '1/0xffff'
libbpf: prog 'handle_to_container': unrecognized ELF section name 'to-container'
libbpf: Error in bpf_object__probe_loading():Operation not permitted(1). Couldn't load trivial BPF program. Make sure your kernel supports BPF (CONFIG_BPF_SYSCALL=y) and/or that RLIMIT_MEMLOCK is set to big enough value.
libbpf: failed to load object './test/bpf/../../bpf/bpf_lxc.o'
Unable to load program
command terminated with exit code 1
Resources
- Jenkins URL: https://jenkins.cilium.io/job/Cilium-PR-K8s-1.16-kernel-4.9/1911/
- ZIP file(s): test_results_Cilium-PR-K8s-1.16-kernel-4.9_1911_BDD-Test-PR.zip
Anything else?
Only observed on 4.9 so far
About this issue
- State: closed
- Created 2 years ago
- Comments: 23 (23 by maintainers)
Commits related to this issue
- libbpf: Opt-in for rlimit bump We are hitting -EPERM errors when trying to load some programs in Cilium's CI, and this seems to be due to the succession of program loads hitting the memlock rlimit va... — committed to qmonnet/iproute2 by qmonnet 2 years ago
- libbpf: Opt-in for rlimit bump We are hitting -EPERM errors when trying to load some programs in Cilium's CI, and this seems to be due to the succession of program loads hitting the memlock rlimit va... — committed to isovalent/iproute2 by qmonnet 2 years ago
- iproute2: Bump iproute2/libbpf to get rlimit bump Context: https://github.com/cilium/cilium/issues/20288#issuecomment-1185551102 To work around the -EPERM returned when trying to load programs in th... — committed to cilium/image-tools by qmonnet 2 years ago
- test: Update test-verifier image Our cilium/iproute2 and cilium/libbpf dependencies were updated to fix an issue where we don't bump the rlimit and thus sometimes fail to load BPF programs and maps. ... — committed to pchaigno/cilium by pchaigno 2 years ago
- test: Update test-verifier image Our cilium/iproute2 and cilium/libbpf dependencies were updated to fix an issue where we don't bump the rlimit and thus sometimes fail to load BPF programs and maps. ... — committed to cilium/cilium by pchaigno 2 years ago
- test: Update test-verifier image [ upstream commit e3e1a299678c775fc1364f038506d313bcd975b0 ] Our cilium/iproute2 and cilium/libbpf dependencies were updated to fix an issue where we don't bump the ... — committed to ldelossa/cilium by pchaigno 2 years ago
- test: Update test-verifier image [ upstream commit e3e1a299678c775fc1364f038506d313bcd975b0 ] Our cilium/iproute2 and cilium/libbpf dependencies were updated to fix an issue where we don't bump the ... — committed to aanm/cilium by pchaigno 2 years ago
- test: Update test-verifier image [ upstream commit e3e1a299678c775fc1364f038506d313bcd975b0 ] Our cilium/iproute2 and cilium/libbpf dependencies were updated to fix an issue where we don't bump the ... — committed to cilium/cilium by pchaigno 2 years ago
- test: Update test-verifier image Our cilium/iproute2 and cilium/libbpf dependencies were updated to fix an issue where we don't bump the rlimit and thus sometimes fail to load BPF programs and maps. ... — committed to dezmodue/cilium by pchaigno 2 years ago
- provision: bump 4.9 kernel package to 4.9.326 This version contains a fix [1], [2] (courtesy of Quentin) for a verifier issue [3] we've been seeing in Cilium CI. [1] https://lore.kernel.org/stable/2... — committed to cilium/packer-ci-build by tklauser 2 years ago
- provision: Bump 4.9 kernel packages (4.9.326, to fix a kernel bug) We have been hitting a kernel bug on 4.9 for the verifier tests. An underflow on the memlock rlimit counter, caused by the reallocat... — committed to cilium/packer-ci-build by qmonnet 2 years ago
- vagrant: Bump 4.9 Vagrant box (Linux 4.9.326, to fix a kernel bug) We have been hitting a kernel bug on 4.9 for the verifier tests. An underflow on the memlock rlimit counter, caused by the reallocat... — committed to qmonnet/cilium by qmonnet 2 years ago
- vagrant: Bump 4.9 Vagrant box (Linux 4.9.326, to fix a kernel bug) We have been hitting a kernel bug on 4.9 for the verifier tests. An underflow on the memlock rlimit counter, caused by the reallocat... — committed to cilium/cilium by qmonnet 2 years ago
- vagrant: Bump 4.9 Vagrant box (Linux 4.9.326, to fix a kernel bug) [ upstream commit 07e7fb0073ab387108ac6b4c126df1a34e36d5d2 ] (Backporters note: only update the v4.9 image, not the cilium-dev imag... — committed to cilium/cilium by tklauser 2 years ago
Got it!!
This is an underflow on the rlimit counter indeed. Adding printk()s to the kernel, I can observe four program loads charged for 6 pages, but then uncharged for 7. After uncharging those, the rlimit counter (user->locked_vm) is at 18446744073709551612 instead of 0.
Looking at the kernel code, prog->pages (the value changing from 6 to 7) can indeed be modified if the BPF program is reallocated. This happens for example if we add new instructions and they don’t fit on the last page used by the program. Looking further, this has been fixed (hi Daniel!) on newer versions.
It is likely that the change in the compilation options from https://github.com/cilium/cilium/pull/19938/commits brought the number of instructions of a program just under a multiple of PAGE_SIZE, and it goes over the threshold when the verifier adds the prologue or patches the context accesses.
Now for the bad news: Daniel’s fix is in 4.10, but apparently it was never backported to 4.9. I suppose the cleanest way to fix this would be to send a backport, and to update the image to 4.9.y after the patch has been merged.
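The arithmetic behind the observed counter value can be checked with a small model. This is only a sketch, not kernel code: it assumes 4 KiB pages, models user->locked_vm as an unsigned 64-bit counter (consistent with the value printed above), and uses made-up program sizes (24568 and 24584 bytes) chosen so that the program grows from 6 to 7 pages between charge and uncharge. The names pages_for, LockedVm and simulate are hypothetical.

```python
PAGE_SIZE = 4096            # assumption: 4 KiB pages
U64_MASK = (1 << 64) - 1    # model locked_vm as an unsigned 64-bit counter


def pages_for(size_bytes: int) -> int:
    """Number of pages needed to hold size_bytes, rounded up."""
    return (size_bytes + PAGE_SIZE - 1) // PAGE_SIZE


class LockedVm:
    """Toy stand-in for the per-user locked memory counter."""

    def __init__(self) -> None:
        self.value = 0

    def charge(self, pages: int) -> None:
        self.value = (self.value + pages) & U64_MASK

    def uncharge(self, pages: int) -> None:
        # No lower bound check: going below zero wraps around,
        # which is the underflow described in the comment above.
        self.value = (self.value - pages) & U64_MASK


def simulate(loads: int, size_at_charge: int, size_at_uncharge: int) -> int:
    """Charge `loads` programs at one size, uncharge them at another."""
    counter = LockedVm()
    for _ in range(loads):
        counter.charge(pages_for(size_at_charge))
    for _ in range(loads):
        counter.uncharge(pages_for(size_at_uncharge))
    return counter.value


if __name__ == "__main__":
    # Hypothetical sizes: just under 6 pages when charged,
    # just over 6 pages after the program was reallocated.
    print(simulate(4, 24568, 24584))
```

With four loads charged for 6 pages and uncharged for 7, the model ends at 2^64 - 4, which matches the 18446744073709551612 reported above. Once the counter holds a value that large, any later charge would exceed the memlock limit, which would explain the -EPERM on subsequent loads.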
Greg took the patch, it’s currently in the queue for 4.9.
Edit 2022-08-17: Made it to the queue/4.9 of linux-stable-rc. Should be in v4.9.326.
Backport submitted to the stable branch for 4.9.
It seems that the rlimit bump is not happening.
I expected either libbpf or tc directly to do it. It turns out that the rlimit bump in libbpf is more recent than I thought, and our libbpf fork doesn’t have it. Then iproute2’s tc does have an rlimit bump, but in lib/bpf_legacy.c.
So what probably happened is that at some point our iproute2 version switched to libbpf and stopped raising the rlimit itself; but the libbpf version it uses is ~4 months too old to have libbpf’s rlimit bump. So probably no component raises the rlimit, and sometimes the default limit is not enough.
I’m not aware of any difference in rlimit handling between kernel versions (other than the switch to cgroup-based accounting, obviously), although we could imagine that the delay for reclaiming the memory when programs/maps are unloaded could be slightly longer on old kernels due to implementation details. Looks like a race anyway, since it doesn’t trigger all the time.
Some potential workarounds:
- ulimit -l in terminal before launching the tests?
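For illustration, the kind of rlimit bump discussed above can be sketched with Python's standard resource module. This is not what libbpf or tc do internally (they are written in C); it is only a minimal model of the same idea, and the function name bump_memlock_rlimit is hypothetical. It tries to lift RLIMIT_MEMLOCK to unlimited and, assuming the process lacks the privilege to do so, falls back to raising the soft limit to the current hard limit.

```python
import resource


def bump_memlock_rlimit():
    """Raise RLIMIT_MEMLOCK as far as this process is allowed to.

    Returns the (soft, hard) limits in effect afterwards.
    """
    _soft, hard = resource.getrlimit(resource.RLIMIT_MEMLOCK)
    unlimited = (resource.RLIM_INFINITY, resource.RLIM_INFINITY)
    try:
        # Works for privileged processes, or if the hard limit
        # is already unlimited.
        resource.setrlimit(resource.RLIMIT_MEMLOCK, unlimited)
    except (ValueError, OSError):
        # Unprivileged fallback: the soft limit may always be
        # raised up to the existing hard limit.
        resource.setrlimit(resource.RLIMIT_MEMLOCK, (hard, hard))
    return resource.getrlimit(resource.RLIMIT_MEMLOCK)


if __name__ == "__main__":
    print(bump_memlock_rlimit())
```

Resource limits are inherited by child processes, so a bump like this only helps if it runs in the process that loads the programs or in one of its ancestors, which is also why setting ulimit -l in the launching shell can work.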