ingress-nginx: NGINX ingress creating endless core dumps

NGINX Ingress controller version: v0.41.2

Kubernetes version (use kubectl version):

Client Version: version.Info{Major:"1", Minor:"19", GitVersion:"v1.19.3", GitCommit:"1e11e4a2108024935ecfcb2912226cedeafd99df", GitTreeState:"clean", BuildDate:"2020-10-14T12:50:19Z", GoVersion:"go1.15.2", Compiler:"gc", Platform:"darwin/amd64"}
Server Version: version.Info{Major:"1", Minor:"18+", GitVersion:"v1.18.9-eks-d1db3c", GitCommit:"d1db3c46e55f95d6a7d3e5578689371318f95ff9", GitTreeState:"clean", BuildDate:"2020-10-20T22:18:07Z", GoVersion:"go1.13.15", Compiler:"gc", Platform:"linux/amd64"}

Environment:

  • Cloud provider or hardware configuration: AWS EKS

What happened: My NGINX ingress controllers started to create endless core dump files. These filled up the filesystem on some of my nodes, causing disk pressure and evicting other pods. I have not enabled debug logging or intentionally configured NGINX to create core dumps.

What you expected to happen: No core dumps (though I'm not sure whether preventing core dumps is the right fix); gdb output is at the bottom.

How to reproduce it: I'm not sure why this is happening now. We have autoscaling enabled and I don't think we are hitting resource limits.

Anything else we need to know: I managed to copy a core dump and tried to investigate it, but couldn't get anything very informative out of it:

GNU gdb (GDB) 7.11
Copyright (C) 2016 Free Software Foundation, Inc.
License GPLv3+: GNU GPL version 3 or later <http://gnu.org/licenses/gpl.html>
This is free software: you are free to change and redistribute it.
There is NO WARRANTY, to the extent permitted by law.  Type "show copying"
and "show warranty" for details.
This GDB was configured as "x86_64-pc-linux-gnu".
Type "show configuration" for configuration details.
For bug reporting instructions, please see:
<http://www.gnu.org/software/gdb/bugs/>.
Find the GDB manual and other documentation resources online at:
<http://www.gnu.org/software/gdb/documentation/>.
For help, type "help".
Type "apropos word" to search for commands related to "word"...
Reading symbols from nginx...done.
[New LWP 4969]
[New LWP 4981]
[New LWP 4971]
[New LWP 4979]
[New LWP 4974]
[New LWP 4982]
[New LWP 4983]
[New LWP 4977]
[New LWP 4970]
[New LWP 4972]
[New LWP 4988]
[New LWP 4986]
[New LWP 4973]
[New LWP 4980]
[New LWP 4978]
[New LWP 4976]
[New LWP 4975]
[New LWP 4984]
[New LWP 4989]
[New LWP 4987]
[New LWP 4985]
[New LWP 4990]
[New LWP 5001]
[New LWP 4994]
[New LWP 4995]
[New LWP 4991]
[New LWP 5000]
[New LWP 4992]
[New LWP 4999]
[New LWP 4996]
[New LWP 4993]
[New LWP 4997]
[New LWP 4998]

warning: Unexpected size of section `.reg-xstate/4969' in core file.

warning: Can't read pathname for load map: No error information.
Core was generated by `nginx: worker process                               '.
Program terminated with signal SIGSEGV, Segmentation fault.

warning: Unexpected size of section `.reg-xstate/4969' in core file.
#0  0x00007fd38a72d3ab in ?? () from /lib/libcrypto.so.1.1
[Current thread is 1 (LWP 4969)]
(gdb) backtrace
#0  0x00007fd38a72d3ab in ?? () from /lib/libcrypto.so.1.1
#1  0x00007fd38a72be21 in ?? () from /lib/libcrypto.so.1.1
#2  0x00007fd38a72bf24 in ASN1_item_free () from /lib/libcrypto.so.1.1
#3  0x00007fd38a94e62b in SSL_SESSION_free () from /lib/libssl.so.1.1
#4  0x00007fd38a7e2cdc in OPENSSL_LH_doall_arg () from /lib/libcrypto.so.1.1
#5  0x00007fd38a94f76c in SSL_CTX_flush_sessions () from /lib/libssl.so.1.1
#6  0x00007fd38a965896 in ?? () from /lib/libssl.so.1.1
#7  0x00007fd38a959f48 in ?? () from /lib/libssl.so.1.1
#8  0x00007fd38a948ec2 in SSL_do_handshake () from /lib/libssl.so.1.1
#9  0x000055614f8c0174 in ngx_ssl_handshake (c=c@entry=0x7fd38a2c4418) at src/event/ngx_event_openssl.c:1694
#10 0x000055614f8c058d in ngx_ssl_handshake_handler (ev=0x7fd38a0ebc40) at src/event/ngx_event_openssl.c:2061
#11 0x000055614f8bac1f in ngx_epoll_process_events (cycle=0x55615199b2f0, timer=<optimized out>, flags=<optimized out>) at src/event/modules/ngx_epoll_module.c:901
#12 0x000055614f8adc62 in ngx_process_events_and_timers (cycle=cycle@entry=0x55615199b2f0) at src/event/ngx_event.c:257
#13 0x000055614f8b82fc in ngx_worker_process_cycle (cycle=0x55615199b2f0, data=<optimized out>) at src/os/unix/ngx_process_cycle.c:774
#14 0x000055614f8b6233 in ngx_spawn_process (cycle=cycle@entry=0x55615199b2f0, proc=0x55614f8b81d2 <ngx_worker_process_cycle>, data=0x0, name=0x55614f9dae3f "worker process", respawn=respawn@entry=0) at src/os/unix/ngx_process.c:199
#15 0x000055614f8b73aa in ngx_reap_children (cycle=cycle@entry=0x55615199b2f0) at src/os/unix/ngx_process_cycle.c:641
#16 0x000055614f8b9036 in ngx_master_process_cycle (cycle=0x55615199b2f0) at src/os/unix/ngx_process_cycle.c:174
#17 0x000055614f88ba00 in main (argc=<optimized out>, argv=<optimized out>) at src/core/nginx.c:385

In the meantime, I added a LimitRange with a default ephemeral-storage limit of 10Gi to keep the pods from exhausting node storage (my pods reached ~60Gi of storage usage from core dumps alone); a sketch of such a resource is below.
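A minimal sketch of that kind of LimitRange, assuming the controller runs in an ingress-nginx namespace (the name, namespace, and request value here are illustrative assumptions, not the exact resource used):

apiVersion: v1
kind: LimitRange
metadata:
  name: ephemeral-storage-limits   # illustrative name
  namespace: ingress-nginx         # assumption: the controller's namespace
spec:
  limits:
    - type: Container
      default:
        ephemeral-storage: 10Gi    # default limit for containers that don't set one
      defaultRequest:
        ephemeral-storage: 100Mi   # assumed default request; tune for your nodes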

/kind bug

About this issue

  • State: closed
  • Created 3 years ago
  • Comments: 33 (21 by maintainers)

Most upvoted comments

While recompiling OpenSSL I found a workaround for this issue: edit the ssl_session_cache builtin:1000 shared:SSL:10m; line in nginx.conf and drop builtin:1000 (illustrated below). From the docs:

builtin: a cache built in OpenSSL; use of the built-in cache can cause memory fragmentation.

using only shared cache without the built-in cache should be more efficient.

Unfortunately this is not exposed via an annotation, so you have to edit the template. There is even a Stack Overflow article about this. It would be interesting to know why builtin:1000 is hardcoded. I understand this is not a fix for the OpenSSL issue, but maybe drop builtin from the template for everybody, as the docs suggest?
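For illustration only (the exact line lives in the controller's nginx template; the sizes shown are the ones quoted above), the change amounts to:

# Before (as rendered from the ingress-nginx template):
ssl_session_cache builtin:1000 shared:SSL:10m;

# After (workaround: drop the built-in cache, keep only the shared one):
ssl_session_cache shared:SSL:10m;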

yeah sure, I will open a new PR and add that as a configuration 😃

No, the core dumps only contained:

#0  0x00007fa81c6d9c59 in ?? () from /lib/ld-musl-x86_64.so.1
#1  0x00000000000000a0 in ?? ()
#2  0x00007fff2b051e20 in ?? ()
#3  0x00007fff2b051db0 in ?? ()
#4  0x0000000000000000 in ?? ()

which doesn’t mean anything to me really.

It is more or less an educated guess, based on the fact that around the time we had this issue the controller logs were spammed with invalid-certificate errors.

@tokers has the option to set worker_rlimit_core ever been added? We're now facing this issue and more or less know the root cause in our case: a chain of user errors in configuring a number of certificates, which cert-manager endlessly retries to validate via the HTTP solver but fails because they're not set up properly, which leads to SSL errors in ingress-nginx, which seems to lead to core dumps, which fill the disk, and in the end everything is dead.

I realise that suppressing the core dumps only hides the issue, but in our scenario that would be much preferable to a few misconfigured certs taking out the entire ingress. A sketch of the relevant nginx directive is below.
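If such a setting were exposed, the underlying change would presumably be something like the following in the main context of nginx.conf (worker_rlimit_core and working_directory are standard nginx core directives; the values here are only an illustration):

# Effectively disable core dump files for worker processes:
worker_rlimit_core 0;

# Or keep small, truncated dumps in a dedicated directory instead:
# worker_rlimit_core 8m;
# working_directory /tmp/cores/;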

You could be hitting a limit or a memory violation; it's hard to tell which until the core backtrace is explicit. Your earlier post shows '??' symbols in gdb, followed by frames in libcrypto and libssl. I am no developer so can't help much, but as someone said elsewhere, '??' means you are missing debug symbols. The crypto/ssl frames could mean that all of your TLS configuration is coming into play and nginx could not handle the size; as you say, you have thousands.

You can upgrade to the most recent release of the ingress controller, check how to run gdb on nginx core dumps, and post another backtrace that shows the size or other details of the data structure it is complaining about:

Unexpected size of section `.reg-xstate/4969' in core file

Also, you can try to replicate the same number of objects in another cluster while spreading the load.