unbound: null pointer in services/outside_network.c:160 reuse_cmp / rbtree_find_less_equal in Unbound 1.13.0 release

Using my amd64 linux on centos7 build from https://github.com/NLnetLabs/unbound/issues/393#issuecomment-760618418 with these commits added to 1.13.0:

4d51c6b
08968ba
422213c
7e46204bf73ecdee56e1ab1c48e1829d71cdbc0a

I am still getting rare crashes. I’ve caught one here, in reuse_cmp having a nullptr for key2, coming from node->key: https://github.com/NLnetLabs/unbound/blob/ca497815b82587d5e7db7ddb83e9c30fa68585f8/util/rbtree.c#L525-L528

The backtrace is:

(gdb) bt
#0  reuse_cmp_addrportssl (key1=0x7fde2e2737e8, key2=0x0) at services/outside_network.c:144
#1  0x000055e21f6ba7c1 in   (key1=0x7fde2e2737e8, key2=0x0) at services/outside_network.c:160
#2  0x000055e21f6759ce in rbtree_find_less_equal (rbtree=rbtree@entry=0x7fdd7a428198, key=key@entry=0x7fde2e2737e8, result=result@entry=0x7fde2e2737c8) at util/rbtree.c:527
#3  0x000055e21f6baf0c in reuse_tcp_find (outnet=outnet@entry=0x7fdd7a428090, addr=addr@entry=0x7fdd6718d6f0, addrlen=16, use_ssl=<optimized out>) at services/outside_network.c:480
#4  0x000055e21f6bbf5f in use_free_buffer (outnet=outnet@entry=0x7fdd7a428090) at services/outside_network.c:723
#5  0x000055e21f6bc4fb in outnet_tcp_cb (c=0x7fdd61fd0ce0, arg=0x7fdd61fd0bb0, error=<optimized out>, reply_info=0x7fdd61fd0d18) at services/outside_network.c:1095
#6  0x000055e21f6b4087 in tcp_callback_reader (c=0x7fdd61fd0ce0) at util/netevent.c:1144
#7  0x000055e21f6b5548 in comm_point_tcp_handle_read (fd=217, c=0x7fdd61fd0ce0, short_ok=0) at util/netevent.c:1668
#8  0x000055e21f6b584b in comm_point_tcp_handle_callback (fd=217, event=<optimized out>, arg=0x7fdd61fd0ce0) at util/netevent.c:2062
#9  0x00007fde30723a14 in event_base_loop () from /lib64/libevent-2.0.so.5
#10 0x000055e21f6b1fac in comm_base_dispatch (b=<optimized out>) at util/netevent.c:246
#11 0x000055e21f62e499 in worker_work (worker=worker@entry=0x55e2216803b0) at daemon/worker.c:1941
#12 0x000055e21f6222bf in thread_start (arg=0x55e2216803b0) at daemon/daemon.c:540
#13 0x00007fde3009bea5 in start_thread () from /lib64/libpthread.so.0
#14 0x00007fde2fdc496d in clone () from /lib64/libc.so.6

At the null pointer, *node is normal except for the null:

(gdb) up
#1  0x000055e21f6ba7c1 in reuse_cmp (key1=0x7fde2e2737e8, key2=0x0) at services/outside_network.c:160
160		r = reuse_cmp_addrportssl(key1, key2);
(gdb) up
#2  0x000055e21f6759ce in rbtree_find_less_equal (rbtree=rbtree@entry=0x7fdd7a428198, key=key@entry=0x7fde2e2737e8, result=result@entry=0x7fde2e2737c8) at util/rbtree.c:527
527			r = rbtree->cmp(key, node->key);
(gdb) p *node
$2 = {parent = 0x7fdd61a09468, left = 0x55e21f922460 <rbtree_null_node>, right = 0x55e21f922460 <rbtree_null_node>, key = 0x0, color = 1 '\001'}

The core file is 6.36 GB; I can certainly share it and the centos7 rpm files out-of-band if you’d like to investigate directly, or I am happy to dig around in the core file in response to your questions. Thanks again!

About this issue

Original URL
State: closed
Created 3 years ago
Reactions: 2
Comments: 35 (16 by maintainers)

Commits related to this issue

- Attempt to fix NULL keys in the reuse_tcp tree; relates to #411. — committed to NLnetLabs/unbound by gthess 3 years ago
Merge remote-tracking branch 'nlnet/master' * nlnet/master: - Fix for Python 3.9, no longer use deprecated functions of PyEval_CallObject (now PyObject_Call), PyEval_InitThreads (now none), PyP... — committed to jedisct1/unbound by jedisct1 3 years ago
- Debug output for #411 and #439: printout internal error and details. — committed to NLnetLabs/unbound by wcawijngaards 3 years ago
- Fix for #411: Depth protect for crash on deleted element timeout. — committed to NLnetLabs/unbound by wcawijngaards 3 years ago
Merge remote-tracking branch 'nlnet/master' * nlnet/master: (61 commits) - Fix that testcode dohclient has OpenSSL initialisation calls. - Further fix for #468: detect SSL_CTX_set_alpn_protos for... — committed to jedisct1/unbound by jedisct1 3 years ago
- Fix for #411, #439, #469: Reset the DNS message ID when moving queries between TCP streams. - Refactor for uniform way to produce random DNS message IDs. — committed to NLnetLabs/unbound by gthess 3 years ago
Merge remote-tracking branch 'nlnet/master' * nlnet/master: - zonemd-check: yesno option, default no, enables the processing of ZONEMD records for that zone. - Merge #496 from banburybill: Use ... — committed to jedisct1/unbound by jedisct1 3 years ago
- Fix for #411, #439, #469: stream reuse, fix linking when touching the tcp_reuse LRU list. — committed to NLnetLabs/unbound by gthess 3 years ago
- Fix for #411, #439, #469: stream reuse, fix LRU list when reuse is already in the tree. — committed to NLnetLabs/unbound by gthess 3 years ago
- Fix for #411, #439, #469: stream reuse, fix outnet deletion for all non-free pending_tcp. — committed to NLnetLabs/unbound by gthess 3 years ago
- Fix for #411, #439, #469: stream reuse, fix loop in the free pending_tcp list. — committed to NLnetLabs/unbound by gthess 3 years ago
- Changelog entry for #513: Stream reuse, attempt to fix #411, #439, #469. — committed to NLnetLabs/unbound by gthess 3 years ago
Merge remote-tracking branch 'nlnet/master' * nlnet/master: - Changelog entry for #513: Stream reuse, attempt to fix #411, #439, #469. - Fix readzone unknown type print for memory resize. - F... — committed to jedisct1/unbound by jedisct1 3 years ago
Merge pull request #513 from NLnetLabs/tcp_reuse_fix Stream reuse, attempt to fix #411, #439, #469 — committed to internetstandards/unbound by gthess 3 years ago

Most upvoted comments

Hi @jcjones, @Mityai, @internationils, There is a possible fix on master branch (https://github.com/NLnetLabs/unbound/commit/ff6b527184b33ffe1e2b643db8a32fae8061fc5a) for this. It would be great if you could test and provide feedback!

gthess on May 19, 2021

Hi @jcjones, @Mityai, @internationils, Further fixes have been merged (PR #513) to the master branch. Our own testing does not yield the issue anymore but it would be great if you could test and provide feedback!

gthess on Jul 26, 2021

Hi @jcjones, @Mityai, just checking if you were able to test with the aforementioned fix.

No, not yet. Had to move this off my juggling-stack for other issues and haven’t had time to reintroduce it in the interval. As soon as possible, though.

jcjones on May 28, 2021

I can’t test it, but the PFsense people have grabbed it already… https://redmine.pfsense.org/issues/11316#change-53857

internationils on May 19, 2021

I’m still seeing these spuriously with 1.13.1 release, they look the same. E.g.:

Core was generated by `/usr/sbin/unbound -d'.
Program terminated with signal 11, Segmentation fault.
#0  reuse_cmp_addrportssl (key1=0x7f3c049a6588, key2=0x0) at services/outside_network.c:148
148		r = sockaddr_cmp(&r1->addr, r1->addrlen, &r2->addr, r2->addrlen);
Missing separate debuginfos, use: debuginfo-install glibc-2.17-323.el7_9.x86_64 keyutils-libs-1.5.8-3.el7.x86_64 krb5-libs-1.15.1-50.el7.x86_64 libcom_err-1.42.9-19.el7.x86_64 libevent-2.0.21-4.el7.x86_64 libselinux-2.5-15.el7.x86_64 openssl-libs-1.0.2k-21.el7_9.x86_64 pcre-8.32-17.el7.x86_64 protobuf-c-1.0.2-3.el7.x86_64 zlib-1.2.7-19.el7_9.x86_64
(gdb) bt
#0  reuse_cmp_addrportssl (key1=0x7f3c049a6588, key2=0x0) at services/outside_network.c:148
#1  0x000055aefd4df251 in reuse_cmp (key1=0x7f3c049a6588, key2=0x0) at services/outside_network.c:164
#2  0x000055aefd49a45e in rbtree_find_less_equal (rbtree=rbtree@entry=0x7f3b52428198, key=key@entry=0x7f3c049a6588, 
    result=result@entry=0x7f3c049a6568) at util/rbtree.c:527
#3  0x000055aefd4df99c in reuse_tcp_find (outnet=0x7f3b52428090, addr=addr@entry=0x7f3b3c856808, addrlen=28, use_ssl=<optimized out>)
    at services/outside_network.c:487
#4  0x000055aefd4e322d in pending_tcp_query (sq=sq@entry=0x7f3b3c8567b0, packet=packet@entry=0x7f3b52428350, 
    timeout=timeout@entry=3000, callback=callback@entry=0x55aefd4e37d0 <serviced_tcp_callback>, 
    callback_arg=callback_arg@entry=0x7f3b3c8567b0) at services/outside_network.c:2120
#5  0x000055aefd4e3789 in serviced_tcp_initiate (sq=0x7f3b3c8567b0, buff=0x7f3b52428350) at services/outside_network.c:2807
#6  0x000055aefd4e3d51 in serviced_udp_callback (c=0x7f3b3c287800, arg=0x7f3b3c8567b0, error=<optimized out>, rep=0x7f3c049a6c30)
    at services/outside_network.c:3010
#7  0x000055aefd4e198a in outnet_udp_cb (c=0x7f3b3c287800, arg=0x7f3b52428090, error=<optimized out>, reply_info=0x7f3c049a6c30)
    at services/outside_network.c:1243
#8  0x000055aefd4d6d58 in comm_point_udp_callback (fd=159, event=<optimized out>, arg=<optimized out>) at util/netevent.c:769
#9  0x00007f3c06e56a14 in event_base_loop () from /lib64/libevent-2.0.so.5
#10 0x000055aefd4d69cc in comm_base_dispatch (b=<optimized out>) at util/netevent.c:246
#11 0x000055aefd453779 in worker_work (worker=worker@entry=0x55aefdd24e60) at daemon/worker.c:1949
#12 0x000055aefd4474cf in thread_start (arg=0x55aefdd24e60) at daemon/daemon.c:540
#13 0x00007f3c067ceea5 in start_thread () from /lib64/libpthread.so.0
#14 0x00007f3c064f79fd in clone () from /lib64/libc.so.6

No instances of “internal error” in the logs at all. No output from unbound in the logs anytime close to the segfault.

Most interesting to me is that I have two of these same segfaults occurring within 10 minutes of each other, whereas before they always seemed to need substantial time to reproduce:

$ sudo coredumpctl list
TIME                            PID   UID   GID SIG PRESENT EXE
Sat 2021-02-20 12:32:09 UTC   20030   997   993  11   /usr/sbin/unbound
Sun 2021-02-21 06:49:44 UTC   26011   997   993  11   /usr/sbin/unbound
Tue 2021-02-23 04:41:03 UTC    8149   997   993  11   /usr/sbin/unbound
Thu 2021-02-25 08:25:12 UTC   26630   997   993  11   /usr/sbin/unbound
Thu 2021-02-25 11:43:34 UTC   10285   997   993  11   /usr/sbin/unbound
Mon 2021-03-01 19:41:24 UTC   24626   997   993  11 * /usr/sbin/unbound
Mon 2021-03-01 20:43:46 UTC   20576   997   993  11 * /usr/sbin/unbound
Mon 2021-03-01 23:27:30 UTC   25400   997   993  11 * /usr/sbin/unbound
Mon 2021-03-01 23:37:13 UTC    4760   997   993  11 * /usr/sbin/unbound

jcjones on Mar 4, 2021

I’m afraid a reproducer is not going to be likely given the nature of the input data to these instances; sorry. Still, if something comes up, I’ll let you know.

jcjones on Jan 27, 2021

Thanks! At least this excludes a path I was looking at: forwarders and tls configuration. Having a NULL on the key and the node somehow still being part of the tree is the fault here; still looking into it. If you notice a way to reliably reproduce it that would be a big plus.

gthess on Jan 27, 2021