ingress-nginx: Worker Segmentation Fault (v0.44.0)

NGINX Ingress controller version: 0.44.0

Kubernetes version (use kubectl version): 1.18.3

Environment:

  • AWS
  • Debian 10.8
  • Linux 5.4.96
  • Weave-net 2.6.5

What happened: We encountered a major production outage a few days ago that was traced back to ingress-nginx. The ingress-nginx pod logs were filled with messages like the following:

2021/02/22 22:30:18 [alert] 27#27: worker process 38 exited on signal 11 (core dumped)

We discovered that the following message was being printed to the system log as well (timed with the worker exits):

Feb 22 22:30:18 ip-10-153-47-170 kernel: traps: nginx[24701] general protection fault ip:7f03278adc59 sp:7ffcdba0aa10 error:0 in ld-musl-x86_64.so.1[7f032789f000+48000]
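
To correlate the two yourself, something along these lines works (the namespace, label selector, and node-access method are assumptions about a typical install, not specific to our setup):

# count worker crashes in the controller logs
kubectl logs -n ingress-nginx -l app.kubernetes.io/name=ingress-nginx --tail=-1 --since=24h \
  | grep -c 'exited on signal 11'

# on the affected node, look for the matching musl faults in the kernel log
journalctl -k --since "24 hours ago" | grep ld-musl
# or, without systemd:
dmesg -T | grep 'general protection'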

I ultimately narrowed the problem down to the version of ingress-nginx we were running and reverted production to 0.43.0 until we could identify the underlying issue.

We have a few other lower-load ingress-nginx deployments that have remained on 0.44.0, and we have observed apparently random worker crashes there as well; however, there are always enough running workers and the crashes are infrequent enough that things remain stable.

I was able to get a worker coredump from one of those infrequent crashes and the backtrace is as follows:

Core was generated by `nginx: worker process is shutting down              '.
Program terminated with signal SIGSEGV, Segmentation fault.
#0  a_crash () at ./arch/x86_64/atomic_arch.h:108
108	./arch/x86_64/atomic_arch.h: No such file or directory.
[Current thread is 1 (LWP 38)]
(gdb) backtrace
#0  a_crash () at ./arch/x86_64/atomic_arch.h:108
#1  get_nominal_size (end=0x7f03273ee90c "", p=0x7f03273ec4d0 "") at src/malloc/mallocng/meta.h:169
#2  __libc_free (p=0x7f03273ec4d0) at src/malloc/mallocng/free.c:110
#3  0x00007f0327811f7b in lj_vm_ffi_call () from /usr/local/lib/libluajit-5.1.so.2
#4  0x00007f0327858077 in lj_ccall_func (L=<optimized out>, cd=<optimized out>) at lj_ccall.c:1382
#5  0x00007f032786e38d in lj_cf_ffi_meta___call (L=0x7f032368ee58) at lib_ffi.c:230
#6  0x00007f032780fb45 in lj_BC_FUNCC () from /usr/local/lib/libluajit-5.1.so.2
#7  0x000056357cdf639a in ngx_http_lua_run_thread (L=L@entry=0x7f03236a5380, r=r@entry=0x7f0327403030, ctx=ctx@entry=0x7f0323459bb0, nrets=<optimized out>, nrets@entry=1)
    at /tmp/build/lua-nginx-module-138c1b96423aa26defe00fe64dd5760ef17e5ad8/src/ngx_http_lua_util.c:1167
#8  0x000056357ce1530a in ngx_http_lua_timer_handler (ev=<optimized out>) at /tmp/build/lua-nginx-module-138c1b96423aa26defe00fe64dd5760ef17e5ad8/src/ngx_http_lua_timer.c:650
#9  0x000056357ce13b03 in ngx_http_lua_abort_pending_timers (ev=0x7f032740d790) at /tmp/build/lua-nginx-module-138c1b96423aa26defe00fe64dd5760ef17e5ad8/src/ngx_http_lua_timer.c:899
#10 0x000056357cd1de6d in ngx_close_idle_connections (cycle=cycle@entry=0x7f0327407570) at src/core/ngx_connection.c:1352
#11 0x000056357cd3981c in ngx_worker_process_cycle (cycle=0x7f0327407570, data=<optimized out>) at src/os/unix/ngx_process_cycle.c:791
#12 0x000056357cd376f1 in ngx_spawn_process (cycle=cycle@entry=0x7f0327407570, proc=proc@entry=0x56357cd396af <ngx_worker_process_cycle>, data=data@entry=0x0, name=name@entry=0x56357ce5ef4f "worker process", respawn=respawn@entry=-4)
    at src/os/unix/ngx_process.c:199
#13 0x000056357cd38310 in ngx_start_worker_processes (cycle=cycle@entry=0x7f0327407570, n=1, type=type@entry=-4) at src/os/unix/ngx_process_cycle.c:378
#14 0x000056357cd3a359 in ngx_master_process_cycle (cycle=0x7f0327407570, cycle@entry=0x7f032740c210) at src/os/unix/ngx_process_cycle.c:234
#15 0x000056357cd0cad9 in main (argc=<optimized out>, argv=<optimized out>) at src/core/nginx.c:386
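
For anyone who wants to inspect a similar core themselves, here is a rough sketch of how a backtrace like this can be produced inside the controller container (the pod name and core path are placeholders, and you may need a root shell to install packages):

# exec into the controller pod
kubectl exec -it -n ingress-nginx ingress-nginx-controller-xxxxx -- sh

# inside the (Alpine-based) container: install gdb and the musl debug symbols
apk add gdb musl-dbg

# open the core against the nginx binary shipped in the image
gdb -ex bt /usr/local/nginx/sbin/nginx /tmp/core.38   # core path is an assumption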

One of the major differences between 0.43.0 and 0.44.0 is the update to Alpine 3.13. Perhaps the version of musl it ships is the issue, and it would be appropriate to revert that change until Alpine releases a fixed version?

/kind bug

About this issue

  • State: closed
  • Created 3 years ago
  • Reactions: 26
  • Comments: 117 (52 by maintainers)

Most upvoted comments

We also still have the issue on 0.47. Multiple crashes on ingresses patch/create:

2021/06/07 12:27:41 [alert] 59#59: worker process 700 exited on signal 11 (core dumped)

156095:Jun  7 12:27:09 XXXXXXXXX kernel: [14957845.655837] traps: nginx[15990] general protection ip:7f19693e2c59 sp:7fff5cf74be0 error:0 in ld-musl-x86_64.so.1[7f19693d4000+48000]

We have a lot of ingresses (280) and frequent patch/create (development server). Not a single crash since rollback to 0.43.

I can confirm we saw the same issue on AKS with version 0.45.0. Issue went away when we downgraded to 0.43.0.

OK, I think I figured out the bug. I will fix it and push out a new lua-resty-balancer release soon.

In short, there is a memory corruption bug in resty.balancer that we do not usually hit. It surfaces because musl libc changed its malloc implementation in musl 1.2.1 (https://git.musl-libc.org/cgit/musl/commit/?h=v1.2.1&id=73cc775bee53300c7cf759f37580220b18ac13d3), which first shipped in Alpine 3.13.0 (https://www.alpinelinux.org/posts/Alpine-3.13.0-released.html), which in turn first shipped in ingress-nginx v0.44.0.

The information from the previous steps is still welcome, just in case I’m wrong.

@rikatz yeah, downgrading to v1.19.9 with the resolver patch should be the better choice.

Using the OpenResty release is always the recommended way; in my experience we usually need some fixes when upgrading the NGINX core for a new OpenResty release.

By the way, the docker image with the core file is still welcome.

My gut feeling is that this segmentation fault has nothing to do with the NGINX core (just a guess, not sure yet).

Hi folks, I’m from the OpenResty community, one of the OpenResty core developers. I think I can help to debug this segmentation fault issue. Could someone help to:

  1. provide a docker image of the crash scene that contains the core file and all binaries from the running container, so that we can use gdb to debug it locally. If you do not want to share it publicly, you can send it to my personal email (doujiang24@gmail.com).

  2. or, install OpenResty XRay in the affected container and share the XRay console URL with me. https://openresty.com/en/xray/

  3. or, use openresty-gdb-utils to get the Lua backtrace by yourself. https://github.com/openresty/openresty-gdb-utils

Here is my debug plan for now:

  1. get the Lua backtrace. It will usually show which Lua module is running.
  2. get the C function that LuaJIT is calling via FFI. Here is my gdb plan:
    frame 3
    info symbol *(long *)$rbx

We may dig further after getting the results of the previous steps.
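
If it helps, those two gdb steps can also be run non-interactively against a core file; here is a sketch (the binary and core paths are placeholders, and frame 3 only applies to the sample backtrace above):

gdb -batch \
    -ex 'bt' \
    -ex 'frame 3' \
    -ex 'info symbol *(long *)$rbx' \
    /usr/local/nginx/sbin/nginx /tmp/core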

hey folks,

Thanks for your help and patience.

Yes, testing with the new Alpine might be worthwhile. Alpine 3.14 has some syscall changes that might lead to other problems (we discussed that in #ingress-nginx-dev on Slack), so we are seriously considering moving to Debian slim in future releases as well.

I’m pretty low on bandwidth today, but I can publish an ingress-nginx image using Alpine 3.14 in my personal repo if someone wants to test it, and maybe create an alternative Debian release to test as well. Sounds good?

About LuaJIT I’ll defer to @moonming and @ElvinEfendi

We have a couple of k8s clusters with a stable rate of signal 11 errors on v0.43.0. I tried switching to libmimalloc.so, then to v0.49.0 + libmimalloc.so, but it had no effect. So I tried building a Debian image for v0.49.0 based on the work in https://github.com/kubernetes/ingress-nginx/pull/7593:

# build image with nginx 1.20
git checkout rikatz/new-base-image
docker build -t nginx-debian images/nginx-debian/rootfs/

# use controller go binaries from original image
git checkout controller-v0.49.0
cat << 'EOF' > rootfs/Dockerfile  # quote EOF so the ${...} build args below are not expanded by this shell
ARG BASE_IMAGE

FROM k8s.gcr.io/ingress-nginx/controller:v0.49.0 as orig
FROM ${BASE_IMAGE}

ARG TARGETARCH
ARG VERSION
ARG COMMIT_SHA
ARG BUILD_ID=UNSET

LABEL org.opencontainers.image.title="NGINX Ingress Controller for Kubernetes"
LABEL org.opencontainers.image.documentation="https://kubernetes.github.io/ingress-nginx/"
LABEL org.opencontainers.image.source="https://github.com/kubernetes/ingress-nginx"
LABEL org.opencontainers.image.vendor="The Kubernetes Authors"
LABEL org.opencontainers.image.licenses="Apache-2.0"
LABEL org.opencontainers.image.version="${VERSION}"
LABEL org.opencontainers.image.revision="${COMMIT_SHA}"

LABEL build_id="${BUILD_ID}"

WORKDIR  /etc/nginx

COPY --chown=www-data:www-data --from=orig /etc/nginx /etc/nginx
COPY --chown=www-data:www-data --from=orig /dbg /
COPY --chown=www-data:www-data --from=orig /nginx-ingress-controller /
COPY --chown=www-data:www-data --from=orig /wait-shutdown /

# Fix permission during the build to avoid issues at runtime
# with volumes (custom templates)
RUN bash -xeu -c ' \
  writeDirs=( \
    /etc/ingress-controller \
    /etc/ingress-controller/ssl \
    /etc/ingress-controller/auth \
    /var/log \
    /var/log/nginx \
  ); \
  for dir in "${writeDirs[@]}"; do \
    mkdir -p ${dir}; \
    chown -R www-data.www-data ${dir}; \
  done'

RUN apt-get update -qq && apt-get install -y libcap2-bin \
  && setcap    cap_net_bind_service=+ep /nginx-ingress-controller \
  && setcap -v cap_net_bind_service=+ep /nginx-ingress-controller \
  && setcap    cap_net_bind_service=+ep /usr/local/nginx/sbin/nginx \
  && setcap -v cap_net_bind_service=+ep /usr/local/nginx/sbin/nginx \
  && setcap    cap_net_bind_service=+ep /usr/bin/dumb-init \
  && setcap -v cap_net_bind_service=+ep /usr/bin/dumb-init \
  && apt-get purge -y libcap2 libcap2-bin libpam-cap

USER www-data

# Create symlinks to redirect nginx logs to stdout and stderr docker log collector
RUN  ln -sf /dev/stdout /var/log/nginx/access.log \
  && ln -sf /dev/stderr /var/log/nginx/error.log

ENTRYPOINT ["/usr/bin/dumb-init", "--"]

CMD ["/nginx-ingress-controller"]
EOF

# build image with controller ontop of nginx-debian
ARCH=amd64 BASE_IMAGE=nginx-debian REGISTRY=sepa make image

For those who want to test, the image is available as sepa/ingress-nginx-debian:v0.49.0. Unfortunately, the error now looks like this:

Sep 10, 2021 @ 17:26:47.971 | 2021/09/10 14:26:47 [alert] 26#26: worker process 69 exited on signal 6
Sep 10, 2021 @ 17:26:47.904 | double free or corruption (!prev)

From the metrics it looks like some kind of memory leak: each of the dips in avg(nginx_ingress_controller_nginx_process_resident_memory_bytes) corresponds to a signal 11 error. At 17:00 I switched 3 pods to the Debian-based image; the dips at 17:24 and 17:26 are then from signal 6 errors (metrics screenshot attached).


Hey folks, we’ve just released controller v0.47.0 -> https://github.com/kubernetes/ingress-nginx/releases/tag/controller-v0.47.0

Can you please confirm whether this still happens? It now uses nginx v1.20.1.

We are discussing again whether we should start using Debian slim instead of Alpine (Ingress NGINX already did that in the past).

Thanks

/close

This one has been solved, thank you very much @doujiang24

I’m moving the discussion about the endless SSL coredumps to #7080.

@sepich This is likely to be another bug, maybe on the OpenSSL side. Would it be possible to send me a core file? This one seems harder to debug than the previous one, so a core file would be very much appreciated. Thanks!

Also, has anyone else who got segfaults tried the new release? Feedback is welcome. Thanks!

Running 4 replicas of v0.49.2 for 24h now, so far without a crash! There is also one replica of an old v0.44.0 image running in the same cluster for the same 24h, which so far also had no crash. The strange thing is that back in April we had about a dozen crashes per day with v0.44.X and v0.45.X, and currently it seems not to be reproducible.

The cluster is quite large, with ~1400 ingress objects and around 200 req/s. From an infra point of view the only significant change was the switch from RHEL 7 (with docker) to Debian 10 (with containerd), but I don’t think that is why the problem disappeared. It’s probably related to some user who previously used a special Ingress configuration (e.g. certain annotations), but that’s also just a guess…

Thank you. Let me upgrade it.

Oh, it seems there is a bug in chash_point_sort; I will verify it soon. The information from the previous steps is still welcome.

/remove-triage needs-information
/remove-triage not-reproducible
/triage accepted

Just learned about this issue the hard way! I confirm the issue is present on 0.46.0 as well. Planning a downgrade till the issue is fixed.

A bit more than 48h into testing v0.49.2 (4 replicas), no SIGSEGV so far. In the same time there were 361 SIGSEGVs on v0.44.0 (1 replica), so at least for me the issue seems to be gone and the fix in the lua-resty-balancer module helped. I took a look at nearly two dozen coredumps and all of them crashed in Lua code, not a single one in SSL-related code, so I don’t think I can help any further at the moment.
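
For anyone triaging a pile of cores the same way, a rough batch classification loop (the core directory is a placeholder):

for core in /var/cores/core.*; do
  echo "== $core"
  # the top few frames are usually enough to tell Lua crashes from SSL ones
  gdb -batch -ex 'bt 8' /usr/local/nginx/sbin/nginx "$core" 2>/dev/null | head -n 12
done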

We have released v0.49.2 and v1.0.2 with the fix.

Can you all please test and give us some feedback?

Thank you so much, especially @doujiang24 and @tao12345666333 for the help!

I’m currently experimenting with v0.49.1, and so far there are no crashes (100m running). As soon as I get a crash with a dump I can send it to you by e-mail, @doujiang24, but I’m not allowed to share it with the wider public.

Do you mean the segmentation fault errors?

Yes, [alert] 28#28: worker process 107 exited on signal 11

If yes, do you have core files?

Unfortunately, no :(

It would be very helpful if you could provide a core file, as mentioned in #6896 (comment).

Will try today

Thank you.

Thanks @doujiang24 I have notified @rikatz on slack.

I will do the same on a test cluster and report back.

If I were someone interested in working on this (I don’t really have time at the moment for something we no longer use), my next step would be to build a new Alpine source container using a musl library compiled from git source, and then use that source image to build an nginx-controller image and test that. There are a number of commits since the musl 1.2.2 release that mention changes to malloc in some way, and there is a decent chance that one of them fixes this issue.
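
In case someone does pick this up, here is a very rough, untested sketch of that idea run inside the Alpine 3.13 base image (the clone URL and paths are from memory and may need adjusting):

# build musl from git and run nginx against it, for testing only
apk add build-base git
git clone https://git.musl-libc.org/git/musl /tmp/musl
cd /tmp/musl && ./configure --prefix=/opt/musl-git && make && make install
# musl's libc.so doubles as the dynamic loader, so it can launch a binary directly
/opt/musl-git/lib/libc.so /usr/local/nginx/sbin/nginx -g 'daemon off;'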

I can confirm we saw the same issue on GCP/GKE with version 0.45.0. Issue also went away with 0.43.0.

From our Compute Engine nodes, we also found:

[39968.569424] traps: nginx[1006038] general protection fault ip:7f6a54085c59 sp:7ffc2d3b6230 error:0 in ld-musl-x86_64.so.1[7f6a54077000+48000]

On this cluster we have a lot of ingresses (~200). We didn’t see this issue on a similar cluster with a comparable ingress volume.

Hey, since the segfaults are relatively infrequent and difficult to reproduce, shouldn’t we be working with data that’s more readily accessible? As I demonstrated above, we observe roughly double the CPU usage between 0.43 and 0.44, and it’s not a huge leap to say that whatever is causing that additional load is only going to exacerbate config reloads (already a high-CPU event).

The CPU increase should be relatively trivial to reproduce. In the above example we’re running 6 pods with 500m CPU (no limits), each pod handling around 250-300 ops/sec.

Hi @derek-burdick, please follow these steps. Thanks!

Using gdb:

  1. show the C-land backtrace by using bt (bt full is also helpful), like the backtrace in https://github.com/kubernetes/ingress-nginx/issues/6896#issue-813960604
  2. select the frame that is running lj_vm_ffi_call by using frame N, where N is the frame number in the backtrace. N is 3 for the backtrace sample in https://github.com/kubernetes/ingress-nginx/issues/6896#issue-813960604
  3. show the function that FFI is calling by using info symbol *(long *)$rbx.

Use the lbt command from openresty-gdb-utils (https://github.com/openresty/openresty-gdb-utils#lbt) to get the Lua backtrace.

There are also other options that may be easier, as mentioned in https://github.com/kubernetes/ingress-nginx/issues/6896#issuecomment-918845073, like using OpenResty XRay.
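
For reference, a minimal sketch of the openresty-gdb-utils route (paths are placeholders; follow the repo README for the exact .gdbinit setup):

git clone https://github.com/openresty/openresty-gdb-utils /tmp/openresty-gdb-utils
# add the directory/py/source lines from the repo README to ~/.gdbinit, then:
gdb -ex 'bt' -ex 'lbt' /usr/local/nginx/sbin/nginx /tmp/core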

Hi folks. In the latest releases (0.49.1 and 1.0.1), nginx has been downgraded to v1.19.9 and Alpine has been upgraded to 3.14.2.

Can you please test them and let us know whether the problem persists?

Thank you so much!

@longwuyuan I will try to have some time next week to test & provide core dump

Thanks @laghoule and please keep me posted!

Hi folks.

We are planning some actions on this:

  • We are updating all the Lua* stack (and other modules) in nginx and planning to release a new version later next week.
  • As soon as we can get rid of the v1 madness (probably next week as well) I plan to start doing some tests with debian-slim instead of alpine

To help me reproduce this:

  • Does this occur only in environments with many ingress/endpoint objects, or can it be reproduced in a single-ingress environment?
  • Does this happen only under high load? If I use vegeta or another stress-test tool, can I trigger the coredumps? (A rough sketch of what I have in mind follows below.)

/assign
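
Something like this is what I have in mind (the ingress name, hostname, and label selector are placeholders, not a confirmed repro):

# keep nginx reloading by touching an ingress object in a loop
while true; do
  kubectl annotate ingress demo-ingress reload-test="$(date +%s)" --overwrite
  sleep 5
done &

# keep some traffic flowing while the reloads happen (vegeta is one option)
echo "GET http://demo.example.com/" | vegeta attack -rate=200 -duration=10m | vegeta report

# watch for crashing workers
kubectl logs -n ingress-nginx -l app.kubernetes.io/name=ingress-nginx -f | grep 'exited on signal'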

Still happens on 0.47 for us. We got two crashes within a few hours. Error:

I0608 05:45:39.848955       6 controller.go:163] "Backend successfully reloaded"
I0608 05:45:39.849571       6 event.go:282] Event(v1.ObjectReference{Kind:"Pod", Namespace:"xxx", Name:"nginx-ingress-controller-57fb9bd94d-54545", UID:"20a138f0-3f03-40aa-b0ae-d6fb45ed92b5", APIVersion:"v1", ResourceVersion:"796859441", FieldPath:""}): type: 'Normal' reason: 'RELOAD' NGINX reload triggered due to a change in configuration
W0608 05:45:39.858318       6 nginx.go:1182] the server xxx has SSL configured but the SSL certificate does not contains a CN for xxx. Redirects will not work for HTTPS to HTTPS
W0608 05:45:39.858334       6 nginx.go:1182] the server xxx has SSL configured but the SSL certificate does not contains a CN for xxx. Redirects will not work for HTTPS to HTTPS
W0608 05:45:39.858393       6 nginx.go:1182] the server xxx has SSL configured but the SSL certificate does not contains a CN for xxx. Redirects will not work for HTTPS to HTTPS
2021/06/08 05:45:41 [alert] 28#28: worker process 38939 exited on signal 11 (core dumped)
2021/06/08 05:45:41 [alert] 28#28: worker process 38938 exited on signal 11 (core dumped)
2021/06/08 05:45:42 [alert] 28#28: worker process 39006 exited on signal 11 (core dumped)

Maybe slightly off topic, as I don’t know whether the CPU spikes we saw were 100% caused by something in Alpine, but would it be worth providing a Debian-based image as an option? From what I gathered in #6527, the motivations were:

  1. Smaller image size
  2. Minimal CVE exposures

But would the same goals still be achievable with a trimmed-down version of Debian, like distroless?

Although I still love Alpine for a lot of things, I have also moved away from it for many projects due to some well-known issues like the performance hit (mainly because of musl libc, I think) and networking/DNS problems (e.g. https://kubernetes.io/docs/tasks/administer-cluster/dns-debugging-resolution/#known-issues).

@alfianabdi Well, first of all, the ?? entries in your backtrace indicate that you don’t have all of the debugging symbols installed.

Run the following to install the musl debug symbols prior to running that gdb command:

apk add musl-dbg