ingress-nginx: Worker Segmentation Fault (v0.44.0)
NGINX Ingress controller version: 0.44.0
Kubernetes version (use kubectl version):
1.18.3
Environment:
- AWS
- Debian 10.8
- Linux 5.4.96
- Weave-net 2.6.5
What happened: We encountered a major production outage a few days ago that was traced back to ingress-nginx. ingress-nginx pod logs were filled with messages like the following:
2021/02/22 22:30:18 [alert] 27#27: worker process 38 exited on signal 11 (core dumped)
We discovered that the following message was being printed to the system log as well (timed with the worker exits):
Feb 22 22:30:18 ip-10-153-47-170 kernel: traps: nginx[24701] general protection fault ip:7f03278adc59 sp:7ffcdba0aa10 error:0 in ld-musl-x86_64.so.1[7f032789f000+48000]
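For what it's worth, the faulting offset inside ld-musl can be recovered from that kernel line and looked up with addr2line. The sketch below is only illustrative: the library path is the standard musl one, the packages assume an Alpine-based image, and source-level resolution only works if the musl debug symbols can actually be found.

# offset inside the library = ip - base: 0x7f03278adc59 - 0x7f032789f000 = 0xec59
apk add --no-cache binutils musl-dbg    # binutils provides addr2line (assumption: Alpine-based image)
addr2line -f -e /lib/ld-musl-x86_64.so.1 0xec59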
I ultimately traced the problem to the ingress-nginx version we were running and reverted production to 0.43.0 until the underlying issue could be identified.
We have a few other lower-load ingress-nginx deployments that have remained on 0.44.0, and they show apparently random worker crashes as well; however, enough workers stay running and the crashes are infrequent enough that those deployments seemingly remain stable.
I was able to get a worker coredump from one of those infrequent crashes and the backtrace is as follows:
Core was generated by `nginx: worker process is shutting down '.
Program terminated with signal SIGSEGV, Segmentation fault.
#0 a_crash () at ./arch/x86_64/atomic_arch.h:108
108 ./arch/x86_64/atomic_arch.h: No such file or directory.
[Current thread is 1 (LWP 38)]
(gdb) backtrace
#0 a_crash () at ./arch/x86_64/atomic_arch.h:108
#1 get_nominal_size (end=0x7f03273ee90c "", p=0x7f03273ec4d0 "") at src/malloc/mallocng/meta.h:169
#2 __libc_free (p=0x7f03273ec4d0) at src/malloc/mallocng/free.c:110
#3 0x00007f0327811f7b in lj_vm_ffi_call () from /usr/local/lib/libluajit-5.1.so.2
#4 0x00007f0327858077 in lj_ccall_func (L=<optimized out>, cd=<optimized out>) at lj_ccall.c:1382
#5 0x00007f032786e38d in lj_cf_ffi_meta___call (L=0x7f032368ee58) at lib_ffi.c:230
#6 0x00007f032780fb45 in lj_BC_FUNCC () from /usr/local/lib/libluajit-5.1.so.2
#7 0x000056357cdf639a in ngx_http_lua_run_thread (L=L@entry=0x7f03236a5380, r=r@entry=0x7f0327403030, ctx=ctx@entry=0x7f0323459bb0, nrets=<optimized out>, nrets@entry=1)
at /tmp/build/lua-nginx-module-138c1b96423aa26defe00fe64dd5760ef17e5ad8/src/ngx_http_lua_util.c:1167
#8 0x000056357ce1530a in ngx_http_lua_timer_handler (ev=<optimized out>) at /tmp/build/lua-nginx-module-138c1b96423aa26defe00fe64dd5760ef17e5ad8/src/ngx_http_lua_timer.c:650
#9 0x000056357ce13b03 in ngx_http_lua_abort_pending_timers (ev=0x7f032740d790) at /tmp/build/lua-nginx-module-138c1b96423aa26defe00fe64dd5760ef17e5ad8/src/ngx_http_lua_timer.c:899
#10 0x000056357cd1de6d in ngx_close_idle_connections (cycle=cycle@entry=0x7f0327407570) at src/core/ngx_connection.c:1352
#11 0x000056357cd3981c in ngx_worker_process_cycle (cycle=0x7f0327407570, data=<optimized out>) at src/os/unix/ngx_process_cycle.c:791
#12 0x000056357cd376f1 in ngx_spawn_process (cycle=cycle@entry=0x7f0327407570, proc=proc@entry=0x56357cd396af <ngx_worker_process_cycle>, data=data@entry=0x0, name=name@entry=0x56357ce5ef4f "worker process", respawn=respawn@entry=-4)
at src/os/unix/ngx_process.c:199
#13 0x000056357cd38310 in ngx_start_worker_processes (cycle=cycle@entry=0x7f0327407570, n=1, type=type@entry=-4) at src/os/unix/ngx_process_cycle.c:378
#14 0x000056357cd3a359 in ngx_master_process_cycle (cycle=0x7f0327407570, cycle@entry=0x7f032740c210) at src/os/unix/ngx_process_cycle.c:234
#15 0x000056357cd0cad9 in main (argc=<optimized out>, argv=<optimized out>) at src/core/nginx.c:386
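For anyone who wants to capture a similar backtrace from their own crashes, the rough procedure below is a minimal sketch: the core_pattern location, the nginx binary path, and the core file name are assumptions based on typical ingress-nginx containers and will likely need adjusting.

# On the node: send core files to a writable, predictable path (assumption: you may change core_pattern)
echo '/tmp/core.%e.%p' > /proc/sys/kernel/core_pattern
# Also make sure the worker's core size rlimit is not 0 (e.g. via nginx's worker_rlimit_core directive)
# Inside the controller container: install gdb plus the musl debug symbols,
# then open the core against the nginx binary (binary and core paths are assumptions)
apk add --no-cache gdb musl-dbg
gdb /usr/local/nginx/sbin/nginx /tmp/core.nginx.38
(gdb) bt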
One of the major differences between 0.43.0 and 0.44.0 is the update to Alpine 3.13; perhaps the version of musl in use is the issue, and it would be appropriate to revert that change until Alpine has released a fixed version?
/kind bug
We also still have the issue on 0.47. Multiple crashes on ingress patch/create:
We have a lot of ingresses (280) and frequent patch/create (development server). Not a single crash since rollback to 0.43.
I can confirm we saw the same issue on AKS with version 0.45.0. Issue went away when we downgraded to 0.43.0.
Ok, I think I figured out the bug, I will fix it and kick out a new lua-resty-balancer release soon.
In short, there is a memory corruption bug in resty.balancer, but we usually do not hit it. It surfaces because musl libc changed its malloc implementation in musl 1.2.1 (https://git.musl-libc.org/cgit/musl/commit/?h=v1.2.1&id=73cc775bee53300c7cf759f37580220b18ac13d3), which is first included in Alpine 3.13.0 (https://www.alpinelinux.org/posts/Alpine-3.13.0-released.html), which in turn is first included in ingress-nginx v0.44.0. The information from the previous steps is still welcome, just in case I'm wrong.
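A quick way to confirm which musl each controller image actually ships is to invoke the musl dynamic loader with no arguments, which prints its version. The sketch below assumes the upstream image references are pullable as written; substitute whatever images you actually deploy.

# Print the bundled musl version (image references are assumptions)
docker run --rm --entrypoint /lib/ld-musl-x86_64.so.1 k8s.gcr.io/ingress-nginx/controller:v0.43.0 2>&1 | head -n 2
docker run --rm --entrypoint /lib/ld-musl-x86_64.so.1 k8s.gcr.io/ingress-nginx/controller:v0.44.0 2>&1 | head -n 2
# Any version >= 1.2.1 contains the new mallocng allocator referenced above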
@rikatz yeah, downgrade to v1.19.9 with the resolver patch should be a better choice.
Using the OpenResty release is always the recommended way. We usually need some fixes when upgrading the NGINX core for a new OpenResty release in my experience.
By the way, the docker image with the core file is still welcome.
My gut feeling is that this segmentation fault has nothing to do with the NGINX core (just a guess, not sure yet).
Hi folks, I’m from OpenResty Community, one of the OpenResty Core Developers. I think I can help to debug this segmentation fault issue. Could someone help to:
- provide a docker image of the accident scene that contains a core file and all binary files in the running container, so that we can use gdb to debug it locally. If you do not want to provide it in public, you can send it to my personal email (doujiang24@gmail.com).
- or, install OpenResty XRay in the accident container, and share the XRay console URL with me. https://openresty.com/en/xray/
- or, use openresty-gdb-utils to get the Lua backtrace by yourself. https://github.com/openresty/openresty-gdb-utils
Here is my debug plan for now:
We may dig more after getting the results from the previous steps.
hey folks,
Thanks for your help and patience.
Yes, testing with the new Alpine might be worth something. Alpine 3.14 got some changes in syscalls that might lead to other problems (we discussed that in #ingress-nginx-dev on Slack), so we are also seriously considering moving to Debian slim in future releases.
I'm pretty low on bandwidth today, but I can release an ingress-nginx image using Alpine 3.14 in my personal repo if someone wants to test it, and maybe create an alternative Debian release as well so you can test that. Sounds good?
About LuaJIT I’ll defer to @moonming and @ElvinEfendi
We have a couple of k8s clusters with a stable rate of signal 11 errors on v0.43.0. I've tried switching to libmimalloc.so, and then to v0.49.0 + libmimalloc.so, but it had no effect. So I had to try building a Debian-based image for v0.49.0 based on the work in https://github.com/kubernetes/ingress-nginx/pull/7593. For those who want to test, the image is available as sepa/ingress-nginx-debian:v0.49.0. Unfortunately, the error now looks like:
From the metrics it looks like some kind of memory leak: each of the dips in avg(nginx_ingress_controller_nginx_process_resident_memory_bytes) is a signal 11 error. At 17:00 I switched 3 pods to the Debian-based image; then at 17:24 and 17:26 there are dips from signal 6 errors.
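For anyone else who wants to try the libmimalloc.so preload mentioned above before switching base images, the usual approach is to set LD_PRELOAD on the controller container. This is a minimal sketch; the deployment name, namespace, and library path are assumptions that must match your install and image.

# Preload mimalloc in the controller (resource names and library path are assumptions)
kubectl set env deployment/ingress-nginx-controller -n ingress-nginx \
  LD_PRELOAD=/usr/local/lib/libmimalloc.so
# Remove the override again
kubectl set env deployment/ingress-nginx-controller -n ingress-nginx LD_PRELOAD-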
Hey folks, we’ve just released controller v0.47.0 -> https://github.com/kubernetes/ingress-nginx/releases/tag/controller-v0.47.0
Can you please confirm whether this still happens? It now uses nginx v1.20.1.
We are discussing again whether we should start using Debian slim instead of Alpine (Ingress NGINX already did that in the past).
Thanks
/close
This one has been solved, thank you very much @doujiang24
I’m moving the discussion about ssl endless coredumps to #7080
Running 4 replicas of v0.49.2 for 24h so far without a crash! There is also one replica of an old v0.44.0 image that has been running in the same cluster for 24h and has also had no crash so far. The strange thing is that back in April we had something like a dozen crashes per day with v0.44.X and v0.45.X, and currently it seems not to be reproducible.
The cluster is quite large, with ~1400 ingress objects and around 200 req/s. From an infrastructure point of view, the only significant change was the switch from RHEL 7 (with Docker) to Debian 10 (with containerd), but I don't think that is related to the problem disappearing. It's probably related to some user who previously used a special ingress configuration (e.g. certain special annotations), but that's also just a guess…
Thank you. Let me upgrade it.
Oh, it seems there is a bug in chash_point_sort; I will verify it soon. The information from the previous steps is still welcome.
/remove-triage needs-information
/remove-triage not-reproducible
/triage accepted
Just learned about this issue the hard way! I confirm the issue is present on 0.46.0 as well. Planning a downgrade till the issue is fixed.
A bit more than 48h into testing v0.49.2 (4 replicas), no SIGSEGV so far. In the same period there were 361 SIGSEGVs on v0.44.0 (1 replica), so at least for me the issue seems to be gone and the fix in the lua-resty-balancer module helped. I took a look at nearly two dozen coredumps and all of them crashed in Lua code, not a single one in SSL-related code, so I think I can't help any further at the moment.
We have released v0.49.2 and v1.0.2 with the fix.
Can you all please test it and give us some feedback?
Thank you so much, especially @doujiang24 and @tao12345666333, for the help!
I’m currently experimenting with v0.49.1, but so far there are no crashes (100m running). As soon as I get a crash with a dump I can send it to you by e-mail, @doujiang24, but I’m not allowed to share it with the wider public.
Yes:
[alert] 28#28: worker process 107 exited on signal 11
Unfortunately, no :(
Will try today
Thank you.
Thanks @doujiang24 I have notified @rikatz on slack.
I will do the same on a test cluster and report back.
If I were someone interested in working on this (I don’t really have time at the moment for something we no longer use), my next steps would be to build a new Alpine source container using a musl library compiled from git source, and then use that source image to build an nginx-controller image and test that. There are a number of commits since the musl 1.2.2 release that mention changes to malloc in some way, and there is a decent chance that one of them fixes this issue.
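A rough sketch of that experiment is below, intended to run in an Alpine 3.13 build container; the clone URL and configure flags follow musl's documented defaults, but treat the whole thing as untested scaffolding rather than a known-good recipe.

# Inside an alpine:3.13 build container (sketch only, not a tested build recipe)
apk add --no-cache build-base git
git clone https://git.musl-libc.org/git/musl && cd musl
./configure --prefix=/usr --syslibdir=/lib
make -j"$(nproc)" && make install
# The freshly built libc.so / ld-musl-x86_64.so.1 could then be copied into a
# controller image to test whether a post-release malloc fix resolves the crashes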
I can confirm we saw the same issue on GCP/GKE with version 0.45.0. Issue also went away with 0.43.0.
From our Compute engine, we also found that:
On this cluster, we have a lot of ingress (~200). We didn’t see this issue on a similar cluster with quite similar ingress volume.
Hey, as the segfaults are relatively infrequent and difficult to reproduce, shouldn’t we be working with data that’s more readily accessible? As I demonstrated above, we observe roughly double the CPU usage between 0.43 and 0.44, and it’s not a huge leap to say that whatever is causing that additional load is only going to exacerbate config reloads (already a high-CPU event).
The CPU increase should be relatively trivial to reproduce. In the above example we’re running 6 pods with 500m CPU (no limits) with each pod doing around 250-300ops/sec.
@sepich It is likely to be another bug, maybe on the OpenSSL side. Would it be possible to send me a core file? This one seems harder to debug than the previous one, so a core file would be very much appreciated. Thanks!
Also, has anyone else who hit the segfault tried the new release? Feedback is welcome. Thanks!
Hi @derek-burdick, please follow these steps, thanks! (A worked gdb session is sketched after this list.)
- use gdb:
  - bt, and also bt full, is helpful, like the backtrace in https://github.com/kubernetes/ingress-nginx/issues/6896#issue-813960604
  - switch to the frame of lj_vm_ffi_call by using frame N, where N is the frame number in the backtrace (N is 3 for the backtrace sample in https://github.com/kubernetes/ingress-nginx/issues/6896#issue-813960604)
  - run info symbol *(long *)$rbx
- use the lbt command in openresty-gdb-utils (https://github.com/openresty/openresty-gdb-utils#lbt) to get the Lua backtrace.
Also, there are easier options, as mentioned in https://github.com/kubernetes/ingress-nginx/issues/6896#issuecomment-918845073, such as using OpenResty XRay.
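Put together, a gdb session following those steps would look roughly like the sketch below; the binary and core file paths are assumptions, and lbt only exists after loading openresty-gdb-utils as described in that repository's README.

# Paths are assumptions; adjust to your container layout
gdb /usr/local/nginx/sbin/nginx /tmp/core.38
(gdb) bt full                     # full backtrace including locals
(gdb) frame 3                     # use the frame number of lj_vm_ffi_call in *your* backtrace
(gdb) info symbol *(long *)$rbx   # shows which C function the FFI call was entering
# After loading openresty-gdb-utils per its README:
(gdb) lbt                         # Lua-land backtrace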
Hi folks. The latest releases (0.49.1 and 1.0.1) have been downgraded to nginx v1.19.9 and alpine was upgraded to 3.14.2
Can you please test it and provide us some feedback whether the problem persists?
Thank you so much!
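A quick way to confirm what a running controller pod actually ships before and after such an upgrade is sketched below; the deployment name, namespace, and binary path are assumptions and may differ in your install.

# Resource names and paths are assumptions; adjust for your install
kubectl exec -n ingress-nginx deploy/ingress-nginx-controller -- /usr/local/nginx/sbin/nginx -v
kubectl exec -n ingress-nginx deploy/ingress-nginx-controller -- cat /etc/alpine-release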
@longwuyuan I will try to have some time next week to test & provide core dump
Thanks @laghoule and please keep me posted!
Hi folks.
We are planning some actions on this:
To help me reproduce this:
Still happens on 0.47 for us. We got two crashes within a few hours. Error:
Maybe this is slightly off topic, as I don’t know whether the CPU spikes we saw were 100% caused by something in Alpine, but would it be worth providing a Debian-based image as an option? From what I gathered from #6527, the motivations were:
But would the same goals still be achievable with some trimmed down versions of debian? Like distroless?
Although I still love Alpine for a lot of things, I have also moved away from it for many projects due to some well-known issues, like the performance hit (mainly because of musl libc, I think) and networking/DNS problems (e.g. https://kubernetes.io/docs/tasks/administer-cluster/dns-debugging-resolution/#known-issues).
@alfianabdi Well, first of all, the ?? entries in your backtrace indicate that you don’t have all of the debugging symbols installed. Run the following to install the musl debug symbols prior to running that gdb command:
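The command block itself did not survive the copy here; on Alpine-based controller images the musl debug symbols come from the musl-dbg package, so the intended command was presumably something like the following (a reconstruction, not the original comment's exact text):

# Reconstructed command (assumption): install gdb and the musl debug symbols
apk add --no-cache gdb musl-dbg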