openssl: [3.2.0-beta1] quicapitest hangs for NonStop thread (SPT) builds
When testing NonStop builds for 3.2.0-beta1, quicapitest (and sometimes quic_tserver_test) hangs waiting for initial input. This did not occur with 3.2.0-alpha2 but this may be transient.
Stopping the hung processes to not cause the tests to fail, which is odd itself.
This only applies to pthread builds using the SPT model. PUT is fine, but it is much faster implementation so any transient problems may get masked. Unthreaded builds work consistently fine.
Config dump:
Command line (with current working directory = .):
perl ./Configure nonstop-nsx_spt_floss --prefix=/usr/local-ssl3.2 --openssldir=/usr/local-ssl3.2/ssl --with-rand-seed=rdcpu threads enable-fips -D_REENTRANT
Perl information:
perl
5.30.3 for NSV-D-nonstop_kernel
Enabled features:
acvp-tests
apps
argon2
aria
async
autoalginit
autoerrinit
autoload-config
bf
blake2
bulk
cached-fetch
camellia
capieng
cast
chacha
cmac
cmp
cms
comp
ct
default-thread-pool
deprecated
des
dgram
dh
docs
dsa
dso
dtls
dynamic-engine
ec
ec2m
ecdh
ecdsa
ecx
egd
engine
err
filenames
fips
fips-securitychecks
gost
http
idea
legacy
loadereng
md4
mdc2
module
multiblock
nextprotoneg
ocb
ocsp
padlockeng
pic
pinshared
poly1305
posix-io
psk
quic
rc2
rc4
rdrand
rfc3779
rmd160
scrypt
secure-memory
seed
shared
siphash
siv
sm2
sm2-precomp
sm3
sm4
sock
srp
srtp
sse2
ssl
ssl-trace
static-engine
stdio
tests
thread-pool
threads
tls
ts
ui-console
whirlpool
tls1
tls1-method
tls1_1
tls1_1-method
tls1_2
tls1_2-method
tls1_3
dtls1
dtls1-method
dtls1_2
dtls1_2-method
Disabled features:
afalgeng [not-linux] OPENSSL_NO_AFALGENG
asan [default] OPENSSL_NO_ASAN
asm [no asm_arch] OPENSSL_NO_ASM
brotli [default] OPENSSL_NO_BROTLI
brotli-dynamic [default] OPENSSL_NO_BROTLI_DYNAMIC
buildtest-c++ [default]
winstore [not-windows] OPENSSL_NO_WINSTORE
crypto-mdebug [default] OPENSSL_NO_CRYPTO_MDEBUG
devcryptoeng [default] OPENSSL_NO_DEVCRYPTOENG
ec_nistp_64_gcc_128 [default] OPENSSL_NO_EC_NISTP_64_GCC_128
external-tests [default] OPENSSL_NO_EXTERNAL_TESTS
fuzz-afl [default] OPENSSL_NO_FUZZ_AFL
fuzz-libfuzzer [default] OPENSSL_NO_FUZZ_LIBFUZZER
ktls [default] OPENSSL_NO_KTLS
makedepend [unavailable]
md2 [default] OPENSSL_NO_MD2 (skip crypto/md2)
msan [default] OPENSSL_NO_MSAN
rc5 [default] OPENSSL_NO_RC5 (skip crypto/rc5)
sctp [default] OPENSSL_NO_SCTP
tfo [default] OPENSSL_NO_TFO
trace [default] OPENSSL_NO_TRACE
ubsan [default] OPENSSL_NO_UBSAN
unit-test [default] OPENSSL_NO_UNIT_TEST
uplink [no uplink_arch] OPENSSL_NO_UPLINK
weak-ssl-ciphers [default] OPENSSL_NO_WEAK_SSL_CIPHERS
zlib [default] OPENSSL_NO_ZLIB
zlib-dynamic [default] OPENSSL_NO_ZLIB_DYNAMIC
zstd [default] OPENSSL_NO_ZSTD
zstd-dynamic [default] OPENSSL_NO_ZSTD_DYNAMIC
ssl3 [default] OPENSSL_NO_SSL3
ssl3-method [default] OPENSSL_NO_SSL3_METHOD
Config target attributes:
AR => "ar",
ARFLAGS => "qc",
CC => "c99",
HASHBANGPERL => "/usr/bin/env perl",
RANLIB => "ranlib",
RC => "windres",
bn_ops => "THIRTY_TWO_BIT",
build_file => "Makefile",
build_scheme => [ "unified", "unix" ],
cflags => "-g -O2 -Wextensions -Wnowarn=203,220,272,734,770,1506 -Wbuild_neutral_library -Wverbose -Wtarget=tns/x -Wsystype=oss -Wilp32 -WIEEE_float",
cppflags => "",
defines => [ "OPENSSL_BUILDING_OPENSSL", "OPENSSL_VPROC=\$(OPENSSL_VPROC)", "_XOPEN_SOURCE", "_XOPEN_SOURCE_EXTENDED=1", "_TANDEM_SOURCE", "__NSK_OPTIONAL_TYPES__", "B_ENDIAN", "OPENSSL_TANDEM_FLOSS", "_SPT_MODEL_", "_REENTRANT", "_ENABLE_FLOSS_THREADS" ],
disable => [ ],
dso_scheme => "DLFCN",
enable => [ "egd" ],
ex_libs => "-lrld -lfloss -lspt",
includes => [ "/usr/local/include" ],
lflags => "-Wxld=\"-set systype oss\" -Wxld=\"-set floattype ieee_float\"",
lib_cflags => "",
lib_cppflags => "",
lib_defines => [ ],
module_cflags => "",
module_cxxflags => undef,
module_ldflags => "-Wshared",
multibin => "-spt",
multilib => "-spt",
perl => "/usr/bin/perl",
perl_platform => "Unix",
shared_argfileflag => "-Wxld_obey=",
shared_cflag => "",
shared_defflag => "-Wxld_obey=",
shared_defines => [ ],
shared_extension => ".so",
shared_ldflag => "-Wshared",
shared_rcflag => "",
shared_target => "nonstop-shared",
sys_id => "TANDEM",
thread_defines => [ ],
thread_scheme => "(unknown)",
unistd => "<unistd.h>",
Recorded environment:
AR =
BUILDFILE =
CC =
CFLAGS =
CPPFLAGS =
CROSS_COMPILE =
CXX =
CXXFLAGS =
HASHBANGPERL =
LDFLAGS =
LDLIBS =
OPENSSL_LOCAL_CONFIG_DIR =
PERL =
RANLIB =
RC =
RCFLAGS =
WINDRES =
__CNF_CFLAGS =
__CNF_CPPDEFINES =
__CNF_CPPFLAGS =
__CNF_CPPINCLUDES =
__CNF_CXXFLAGS =
__CNF_LDFLAGS =
__CNF_LDLIBS =
Makevars:
AR = ar
ARFLAGS = qc
ASFLAGS =
CC = c99
CFLAGS =
CPPDEFINES = _REENTRANT
CPPFLAGS =
CPPINCLUDES =
CXXFLAGS =
HASHBANGPERL = /usr/bin/env perl
LDFLAGS =
LDLIBS =
PERL = perl
RANLIB = ranlib
RC = windres
RCFLAGS =
NOTE: These variables only represent the configuration view. The build file
template may have processed these variables further, please have a look at the
build file for more exact data:
Makefile
build file:
Makefile
build file templates:
Configurations/common0.tmpl
Configurations/unix-Makefile.tmpl
About this issue
- Original URL
- State: closed
- Created 8 months ago
- Comments: 169 (169 by maintainers)
Commits related to this issue
- Correct order of ossl_condvar_signal in quic_multistream_test quic_multistream test was issuing a signal on a condvar after dropping the corresponding mutex, not before, leading to potential race con... — committed to nhorman/openssl by nhorman 8 months ago
- Correct order of ossl_condvar_signal in quic_multistream_test quic_multistream test was issuing a signal on a condvar after dropping the corresponding mutex, not before, leading to potential race con... — committed to nhorman/openssl by nhorman 8 months ago
- Correct order of ossl_condvar_signal in quic_multistream_test quic_multistream test was issuing a signal on a condvar after dropping the corresponding mutex, not before, leading to potential race con... — committed to openssl/openssl by nhorman 8 months ago
- Force Nonstop to use fcntl(F_GETFL) in BIO_sock_nbio In tracking down a hang, we found that nonstop platforms were falling into the if #ifdef FIONBIO clause in the implementation of BIO_sock_nbio. W... — committed to nhorman/openssl by nhorman 8 months ago
- Correct 50-nonstop.conf to support QUIC tests under SPT threading models. This fix also separates the FLOSS from SPT configurations which should not have been conflated in the 3.0 series. Related-to... — committed to rsbeckerca/openssl by rsbeckerca 8 months ago
- Force Nonstop to use fcntl(F_GETFL) in BIO_sock_nbio In tracking down a hang, we found that nonstop platforms were falling into the if #ifdef FIONBIO clause in the implementation of BIO_sock_nbio. W... — committed to openssl/openssl by nhorman 8 months ago
- Force Nonstop to use fcntl(F_GETFL) in BIO_sock_nbio In tracking down a hang, we found that nonstop platforms were falling into the if #ifdef FIONBIO clause in the implementation of BIO_sock_nbio. W... — committed to openssl/openssl by nhorman 8 months ago
- Correct 50-nonstop.conf to support QUIC tests under SPT threading models. This fix also separates the FLOSS from SPT configurations which should not have been conflated in the 3.0 series. Related-to... — committed to openssl/openssl by rsbeckerca 8 months ago
- Correct 50-nonstop.conf to support QUIC tests under SPT threading models. This fix also separates the FLOSS from SPT configurations which should not have been conflated in the 3.0 series. Related-to... — committed to openssl/openssl by rsbeckerca 8 months ago
- Correct order of ossl_condvar_signal in quic_multistream_test quic_multistream test was issuing a signal on a condvar after dropping the corresponding mutex, not before, leading to potential race con... — committed to GerHobbelt/openssl by nhorman 8 months ago
First, thanks for the details
Second, whoa, thats way off from what I thought was going on. That looks like its spinning trying to read in the config file while trying to get some random data via test_random before the dgram test even does any real work for the test itself.
I’ll need to think about this for a bit
There were some issues with the new branch in
crypto/bio/bss_dgram_pair.cand50-nonstop.confbut those were corrected. I am proceeding with the build/test cycle.I feel like its kind of the reverse of that. That is to say, previously when you didn’t have SPT_THREAD_AWARE_NONBLOCK configured, we observed in the quic_multistream test that recvfrom mapped to spt_recvfrom, and that yielded the cpu, even when the socket it was called on was non-blocking. By changing the build to set SPT_THREAD_AWARE_NONBLOCK we altered the behavior of recvfrom (which now maps to spt_recvfromx) to return EWOULDBLOCK without issuing a yield, matching the behavior expected by the higher layers of code.
But once we did that, we started seeing this new issue, in which (if I’m reading you right), isn’t strictly a deadlock, but rather a spin loop in the higher layer code that never exits. Looking at the dgram_pair code we see it has no (or very few) syscalls, and makes no calls to recvfrom or sendto (opting to use an in memory fifo instead). I’m wondering if setting SPT_THREAD_AWARE_NONBLOCK altered some other syscall (that was previously a yield point), to no longer yield, causing the recv path to spin forever waiting for some other thread to send, preventing another thread from acquiring the cpu.
If its ok with you I’ll start a new branch with our combined changes, and start instrumenting the receive path for dgram_pair to see whats going on there
Actually, I think thats looking pretty good! If you look at the logs, we clearly have alot of repeated messages, which is to be expected, because the test run quic_update_tick (or whatever the server tick update function is called), which calls down into demux_recv, and it does that iteratively for every entry in the script array. If you note, on lines 153, 175, 194, we see “Handling operation <num>”, which is an indication that we are calling through the script operations properly. and on line 194, we see that the dyn script iteration passed.
So this is really good. There might be other errors, but we’re at least passing one test in which we hung.
in fact looking at line 197, 341, 506, 642, 778, 913, etc, we’re passing several of the tests
It does look like there is an infinite loop at the tail of the log, as you say though, where we are “Handling operation 7” over and over. Will need to dig into that next.
I think a good next step here is to take stock of our progress, identify the changes that we need to fix this hang we’ve been tracking, and propose a PR for that, then move on to this next issue, as it seems somewhat distinct. What do you think?
I think what we need to change for this first hang is:
Though I can confirm from visual inspection of the code that we are binding to 127.0.0.1 in helper_init:
No problem & best wishes. No rush.
Thanks for the great effort you’ve put into maintaining OpenSSL on NonStop.
BIO_socket_nbiois here:https://github.com/openssl/openssl/blob/7cf75e5caa71a54539d4559c3fb6b0a48b92243f/crypto/bio/bio_sock.c#L351-L395
Maybe this needs some adjustment for NonStop? There’s quite a bit of conditional code in that function. @rsbeckerca - can you identify which of the various defines are present for NonStop and therefore what codepath we go down in that function?
Maybe setting the socket to be non-blocking is not working here (server side):
https://github.com/openssl/openssl/blob/7cf75e5caa71a54539d4559c3fb6b0a48b92243f/test/quic_multistream_test.c#L689-L690
and here (client side):
https://github.com/openssl/openssl/blob/7cf75e5caa71a54539d4559c3fb6b0a48b92243f/test/quic_multistream_test.c#L756-L757
Ok, thats good information, and it correlates to the last stack trace you provided. Its really starting to feel like we’re somehow getting stuck in dmux_recv. I’ll instrument that function in the AM, and we’ll see if we can’t pin it down more closely.
Working it now. Will report with results. Thanks.
@rsbeckerca would you mind pulling/trying https://github.com/openssl/openssl/pull/22616 one more time. In cleaning it up on review, I just found one more case of a reversed mutex_unlock/condvar_broadcast that I had missed. It had caused a failure in CI that reported as a timeout which would have effectively been a deadlock.
So the only thing I can think to do here in, given that it doesn’t seem like the stack trace youve managed to obtain doesn’t seem like it.could hang on it’s own is to start instrumenting the code to get some.elevated visibility. I’ll provide a debug patch in the next few days
apologies, I’m not used to seeing nonstop traces, and assumed it came from the parent process.
That said, I may see the problem. quic_multistream_test uses several (under the covers) pthread_cond/pthread_mutex calls, and appears to have reversed the order of mutex_unlock and cond_signal. I’ll have a pr for you shortly to test please