unbound: null pointer in services/outside_network.c:160 reuse_cmp / rbtree_find_less_equal in Unbound 1.13.0 release
Using my amd64 Linux build on CentOS 7 from https://github.com/NLnetLabs/unbound/issues/393#issuecomment-760618418 with these commits added to 1.13.0:
- 4d51c6b
- 08968ba
- 422213c
- 7e46204bf73ecdee56e1ab1c48e1829d71cdbc0a
I am still getting rare crashes. I’ve caught one here, in reuse_cmp having a nullptr for key2, coming from node->key:
https://github.com/NLnetLabs/unbound/blob/ca497815b82587d5e7db7ddb83e9c30fa68585f8/util/rbtree.c#L525-L528
The backtrace is:
(gdb) bt
#0 reuse_cmp_addrportssl (key1=0x7fde2e2737e8, key2=0x0) at services/outside_network.c:144
#1 0x000055e21f6ba7c1 in reuse_cmp (key1=0x7fde2e2737e8, key2=0x0) at services/outside_network.c:160
#2 0x000055e21f6759ce in rbtree_find_less_equal (rbtree=rbtree@entry=0x7fdd7a428198, key=key@entry=0x7fde2e2737e8, result=result@entry=0x7fde2e2737c8) at util/rbtree.c:527
#3 0x000055e21f6baf0c in reuse_tcp_find (outnet=outnet@entry=0x7fdd7a428090, addr=addr@entry=0x7fdd6718d6f0, addrlen=16, use_ssl=<optimized out>) at services/outside_network.c:480
#4 0x000055e21f6bbf5f in use_free_buffer (outnet=outnet@entry=0x7fdd7a428090) at services/outside_network.c:723
#5 0x000055e21f6bc4fb in outnet_tcp_cb (c=0x7fdd61fd0ce0, arg=0x7fdd61fd0bb0, error=<optimized out>, reply_info=0x7fdd61fd0d18) at services/outside_network.c:1095
#6 0x000055e21f6b4087 in tcp_callback_reader (c=0x7fdd61fd0ce0) at util/netevent.c:1144
#7 0x000055e21f6b5548 in comm_point_tcp_handle_read (fd=217, c=0x7fdd61fd0ce0, short_ok=0) at util/netevent.c:1668
#8 0x000055e21f6b584b in comm_point_tcp_handle_callback (fd=217, event=<optimized out>, arg=0x7fdd61fd0ce0) at util/netevent.c:2062
#9 0x00007fde30723a14 in event_base_loop () from /lib64/libevent-2.0.so.5
#10 0x000055e21f6b1fac in comm_base_dispatch (b=<optimized out>) at util/netevent.c:246
#11 0x000055e21f62e499 in worker_work (worker=worker@entry=0x55e2216803b0) at daemon/worker.c:1941
#12 0x000055e21f6222bf in thread_start (arg=0x55e2216803b0) at daemon/daemon.c:540
#13 0x00007fde3009bea5 in start_thread () from /lib64/libpthread.so.0
#14 0x00007fde2fdc496d in clone () from /lib64/libc.so.6
At the null pointer, *node is normal except for the null:
(gdb) up
#1 0x000055e21f6ba7c1 in reuse_cmp (key1=0x7fde2e2737e8, key2=0x0) at services/outside_network.c:160
160 r = reuse_cmp_addrportssl(key1, key2);
(gdb) up
#2 0x000055e21f6759ce in rbtree_find_less_equal (rbtree=rbtree@entry=0x7fdd7a428198, key=key@entry=0x7fde2e2737e8, result=result@entry=0x7fde2e2737c8) at util/rbtree.c:527
527 r = rbtree->cmp(key, node->key);
(gdb) p *node
$2 = {parent = 0x7fdd61a09468, left = 0x55e21f922460 <rbtree_null_node>, right = 0x55e21f922460 <rbtree_null_node>, key = 0x0, color = 1 '\001'}
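To illustrate why a single node with key = 0x0 is enough to crash the lookup, here is a minimal sketch of a less-or-equal descent over a binary search tree with a sentinel leaf. It is a simplified stand-in, not Unbound's code: sketch_node, sketch_null, sketch_cmp and sketch_find_less_equal are hypothetical names, the keys are plain ints, and the -1 return value for a NULL key is an assumption added only to make the failure visible instead of dereferencing the pointer.

```c
#include <stddef.h>

/* Hypothetical stand-in for a tree node; not Unbound's rbnode_type. */
struct sketch_node {
    struct sketch_node *left, *right;
    const int *key; /* NULL here mirrors the key = 0x0 seen in the core file */
};

/* Sentinel leaf, playing the role rbtree_null_node plays in the gdb output. */
static struct sketch_node sketch_null = { &sketch_null, &sketch_null, NULL };

/* Comparator that dereferences both keys, so a NULL key would crash it. */
static int sketch_cmp(const void *a, const void *b)
{
    int x = *(const int *)a;
    int y = *(const int *)b;
    return (x > y) - (x < y);
}

/*
 * Descend from root looking for the largest key <= *key.
 * Returns 1 on an exact match, 0 otherwise (*result is then the closest
 * smaller node, or NULL), and -1 if a node with a NULL key is reached
 * (assumed guard; without it, sketch_cmp would dereference NULL).
 */
static int sketch_find_less_equal(struct sketch_node *root, const int *key,
                                  struct sketch_node **result)
{
    struct sketch_node *node = root;
    *result = NULL;
    while (node != &sketch_null) {
        int r;
        if (node->key == NULL)
            return -1; /* corrupted node still linked into the tree */
        r = sketch_cmp(key, node->key);
        if (r == 0) {
            *result = node;
            return 1;
        }
        if (r < 0) {
            node = node->left;
        } else {
            *result = node; /* candidate: smaller than key */
            node = node->right;
        }
    }
    return 0;
}
```

The guard only turns the crash into a detectable error; it does not explain how a node with a NULL key stayed linked into the tree in the first place.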
The core file is 6.36 GB; I can certainly share it and the centos7 rpm files out-of-band if you’d like to investigate directly, or I am happy to dig around in the core file in response to your questions. Thanks again!
About this issue
- State: closed
- Created 3 years ago
- Reactions: 2
- Comments: 35 (16 by maintainers)
Commits related to this issue
- - Attempt to fix NULL keys in the reuse_tcp tree; relates to #411. — committed to NLnetLabs/unbound by gthess 3 years ago
- Merge remote-tracking branch 'nlnet/master' * nlnet/master: - Fix for Python 3.9, no longer use deprecated functions of PyEval_CallObject (now PyObject_Call), PyEval_InitThreads (now none), PyP... — committed to jedisct1/unbound by jedisct1 3 years ago
- - Debug output for #411 and #439: printout internal error and details. — committed to NLnetLabs/unbound by wcawijngaards 3 years ago
- - Fix for #411: Depth protect for crash on deleted element timeout. — committed to NLnetLabs/unbound by wcawijngaards 3 years ago
- Merge remote-tracking branch 'nlnet/master' * nlnet/master: (61 commits) - Fix that testcode dohclient has OpenSSL initialisation calls. - Further fix for #468: detect SSL_CTX_set_alpn_protos for... — committed to jedisct1/unbound by jedisct1 3 years ago
- - Fix for #411, #439, #469: Reset the DNS message ID when moving queries between TCP streams. - Refactor for uniform way to produce random DNS message IDs. — committed to NLnetLabs/unbound by gthess 3 years ago
- Merge remote-tracking branch 'nlnet/master' * nlnet/master: - zonemd-check: yesno option, default no, enables the processing of ZONEMD records for that zone. - Merge #496 from banburybill: Use ... — committed to jedisct1/unbound by jedisct1 3 years ago
- - Fix for #411, #439, #469: stream reuse, fix linking when touching the tcp_reuse LRU list. — committed to NLnetLabs/unbound by gthess 3 years ago
- - Fix for #411, #439, #469: stream reuse, fix LRU list when reuse is already in the tree. — committed to NLnetLabs/unbound by gthess 3 years ago
- - Fix for #411, #439, #469: stream reuse, fix outnet deletion for all non-free pending_tcp. — committed to NLnetLabs/unbound by gthess 3 years ago
- - Fix for #411, #439, #469: stream reuse, fix loop in the free pending_tcp list. — committed to NLnetLabs/unbound by gthess 3 years ago
- - Changelog entry for #513: Stream reuse, attempt to fix #411, #439, #469. — committed to NLnetLabs/unbound by gthess 3 years ago
- Merge remote-tracking branch 'nlnet/master' * nlnet/master: - Changelog entry for #513: Stream reuse, attempt to fix #411, #439, #469. - Fix readzone unknown type print for memory resize. - F... — committed to jedisct1/unbound by jedisct1 3 years ago
- Merge pull request #513 from NLnetLabs/tcp_reuse_fix Stream reuse, attempt to fix #411, #439, #469 — committed to internetstandards/unbound by gthess 3 years ago
Hi @jcjones, @Mityai, @internationils, There is a possible fix on master branch (https://github.com/NLnetLabs/unbound/commit/ff6b527184b33ffe1e2b643db8a32fae8061fc5a) for this. It would be great if you could test and provide feedback!
Hi @jcjones, @Mityai, @internationils, Further fixes have been merged (PR #513) to the master branch. Our own testing does not yield the issue anymore but it would be great if you could test and provide feedback!
No, not yet. Had to move this off my juggling-stack for other issues and haven’t had time to reintroduce it in the interval. As soon as possible, though.
I can’t test it, but the PFsense people have grabbed it already… https://redmine.pfsense.org/issues/11316#change-53857
I’m still seeing these spuriously with 1.13.1 release, they look the same. E.g.:
No instances of “internal error” in the logs at all. No output from unbound in the logs anytime close to the segfault.
Most interesting to me is that I have two of these same segfaults occurring within 10 minutes of each other, whereas before they always seemed to need substantial time to reproduce:
I’m afraid a reproducer is not going to be likely given the nature of the input data to these instances; sorry. Still, if something comes up, I’ll let you know.
Thanks! At least this excludes a path I was looking at: forwarders and tls configuration. Having a NULL on the key and the node somehow still being part of the tree is the fault here; still looking into it. If you notice a way to reliably reproduce it that would be a big plus.
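One way to narrow down when a NULL-keyed node appears, short of a reliable reproducer, is a debug-only invariant check that walks the tree and counts reachable nodes whose key is NULL. The sketch below is an assumption-laden illustration and not part of Unbound: dbg_node, dbg_null and dbg_count_null_keys are hypothetical names, and it assumes the tree is still acyclic (a corrupted tree with a cycle would make the recursion run forever).

```c
#include <stddef.h>

/* Hypothetical node layout for illustration; not Unbound's rbnode_type. */
struct dbg_node {
    struct dbg_node *left, *right;
    const void *key;
};

/* Sentinel leaf standing in for rbtree_null_node. */
static struct dbg_node dbg_null = { &dbg_null, &dbg_null, NULL };

/*
 * Count nodes reachable from `node` whose key is NULL.
 * A healthy tree should yield 0; anything else means a node was left
 * linked into the tree after its key was cleared.
 * Assumes the tree has no cycles.
 */
static size_t dbg_count_null_keys(const struct dbg_node *node)
{
    if (node == &dbg_null)
        return 0;
    return (node->key == NULL ? 1u : 0u)
        + dbg_count_null_keys(node->left)
        + dbg_count_null_keys(node->right);
}
```

Calling such a check after each insert and delete in a debug build would point at the first operation that leaves the tree inconsistent, at the cost of a full tree walk per call.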