openssl: [3.2.0-beta1] quicapitest hangs for NonStop thread (SPT) builds

When testing NonStop builds for 3.2.0-beta1, quicapitest (and sometimes quic_tserver_test) hangs waiting for initial input. This did not occur with 3.2.0-alpha2 but this may be transient.

Stopping the hung processes to not cause the tests to fail, which is odd itself.

This only applies to pthread builds using the SPT model. PUT is fine, but it is much faster implementation so any transient problems may get masked. Unthreaded builds work consistently fine.

Config dump:

Command line (with current working directory = .):

    perl ./Configure nonstop-nsx_spt_floss --prefix=/usr/local-ssl3.2 --openssldir=/usr/local-ssl3.2/ssl --with-rand-seed=rdcpu threads enable-fips -D_REENTRANT

Perl information:

    perl
    5.30.3 for NSV-D-nonstop_kernel

Enabled features:

    acvp-tests
    apps
    argon2
    aria
    async
    autoalginit
    autoerrinit
    autoload-config
    bf
    blake2
    bulk
    cached-fetch
    camellia
    capieng
    cast
    chacha
    cmac
    cmp
    cms
    comp
    ct
    default-thread-pool
    deprecated
    des
    dgram
    dh
    docs
    dsa
    dso
    dtls
    dynamic-engine
    ec
    ec2m
    ecdh
    ecdsa
    ecx
    egd
    engine
    err
    filenames
    fips
    fips-securitychecks
    gost
    http
    idea
    legacy
    loadereng
    md4
    mdc2
    module
    multiblock
    nextprotoneg
    ocb
    ocsp
    padlockeng
    pic
    pinshared
    poly1305
    posix-io
    psk
    quic
    rc2
    rc4
    rdrand
    rfc3779
    rmd160
    scrypt
    secure-memory
    seed
    shared
    siphash
    siv
    sm2
    sm2-precomp
    sm3
    sm4
    sock
    srp
    srtp
    sse2
    ssl
    ssl-trace
    static-engine
    stdio
    tests
    thread-pool
    threads
    tls
    ts
    ui-console
    whirlpool
    tls1
    tls1-method
    tls1_1
    tls1_1-method
    tls1_2
    tls1_2-method
    tls1_3
    dtls1
    dtls1-method
    dtls1_2
    dtls1_2-method

Disabled features:

    afalgeng            [not-linux]      OPENSSL_NO_AFALGENG
    asan                [default]        OPENSSL_NO_ASAN
    asm                 [no asm_arch]    OPENSSL_NO_ASM
    brotli              [default]        OPENSSL_NO_BROTLI
    brotli-dynamic      [default]        OPENSSL_NO_BROTLI_DYNAMIC
    buildtest-c++       [default]
    winstore            [not-windows]    OPENSSL_NO_WINSTORE
    crypto-mdebug       [default]        OPENSSL_NO_CRYPTO_MDEBUG
    devcryptoeng        [default]        OPENSSL_NO_DEVCRYPTOENG
    ec_nistp_64_gcc_128 [default]        OPENSSL_NO_EC_NISTP_64_GCC_128
    external-tests      [default]        OPENSSL_NO_EXTERNAL_TESTS
    fuzz-afl            [default]        OPENSSL_NO_FUZZ_AFL
    fuzz-libfuzzer      [default]        OPENSSL_NO_FUZZ_LIBFUZZER
    ktls                [default]        OPENSSL_NO_KTLS
    makedepend          [unavailable]
    md2                 [default]        OPENSSL_NO_MD2 (skip crypto/md2)
    msan                [default]        OPENSSL_NO_MSAN
    rc5                 [default]        OPENSSL_NO_RC5 (skip crypto/rc5)
    sctp                [default]        OPENSSL_NO_SCTP
    tfo                 [default]        OPENSSL_NO_TFO
    trace               [default]        OPENSSL_NO_TRACE
    ubsan               [default]        OPENSSL_NO_UBSAN
    unit-test           [default]        OPENSSL_NO_UNIT_TEST
    uplink              [no uplink_arch] OPENSSL_NO_UPLINK
    weak-ssl-ciphers    [default]        OPENSSL_NO_WEAK_SSL_CIPHERS
    zlib                [default]        OPENSSL_NO_ZLIB
    zlib-dynamic        [default]        OPENSSL_NO_ZLIB_DYNAMIC
    zstd                [default]        OPENSSL_NO_ZSTD
    zstd-dynamic        [default]        OPENSSL_NO_ZSTD_DYNAMIC
    ssl3                [default]        OPENSSL_NO_SSL3
    ssl3-method         [default]        OPENSSL_NO_SSL3_METHOD

Config target attributes:

    AR => "ar",
    ARFLAGS => "qc",
    CC => "c99",
    HASHBANGPERL => "/usr/bin/env perl",
    RANLIB => "ranlib",
    RC => "windres",
    bn_ops => "THIRTY_TWO_BIT",
    build_file => "Makefile",
    build_scheme => [ "unified", "unix" ],
    cflags => "-g -O2 -Wextensions -Wnowarn=203,220,272,734,770,1506 -Wbuild_neutral_library -Wverbose -Wtarget=tns/x -Wsystype=oss -Wilp32 -WIEEE_float",
    cppflags => "",
    defines => [ "OPENSSL_BUILDING_OPENSSL", "OPENSSL_VPROC=\$(OPENSSL_VPROC)", "_XOPEN_SOURCE", "_XOPEN_SOURCE_EXTENDED=1", "_TANDEM_SOURCE", "__NSK_OPTIONAL_TYPES__", "B_ENDIAN", "OPENSSL_TANDEM_FLOSS", "_SPT_MODEL_", "_REENTRANT", "_ENABLE_FLOSS_THREADS" ],
    disable => [  ],
    dso_scheme => "DLFCN",
    enable => [ "egd" ],
    ex_libs => "-lrld -lfloss -lspt",
    includes => [ "/usr/local/include" ],
    lflags => "-Wxld=\"-set systype oss\" -Wxld=\"-set floattype ieee_float\"",
    lib_cflags => "",
    lib_cppflags => "",
    lib_defines => [  ],
    module_cflags => "",
    module_cxxflags => undef,
    module_ldflags => "-Wshared",
    multibin => "-spt",
    multilib => "-spt",
    perl => "/usr/bin/perl",
    perl_platform => "Unix",
    shared_argfileflag => "-Wxld_obey=",
    shared_cflag => "",
    shared_defflag => "-Wxld_obey=",
    shared_defines => [  ],
    shared_extension => ".so",
    shared_ldflag => "-Wshared",
    shared_rcflag => "",
    shared_target => "nonstop-shared",
    sys_id => "TANDEM",
    thread_defines => [  ],
    thread_scheme => "(unknown)",
    unistd => "<unistd.h>",

Recorded environment:

    AR =
    BUILDFILE =
    CC =
    CFLAGS =
    CPPFLAGS =
    CROSS_COMPILE =
    CXX =
    CXXFLAGS =
    HASHBANGPERL =
    LDFLAGS =
    LDLIBS =
    OPENSSL_LOCAL_CONFIG_DIR =
    PERL =
    RANLIB =
    RC =
    RCFLAGS =
    WINDRES =
    __CNF_CFLAGS =
    __CNF_CPPDEFINES =
    __CNF_CPPFLAGS =
    __CNF_CPPINCLUDES =
    __CNF_CXXFLAGS =
    __CNF_LDFLAGS =
    __CNF_LDLIBS =

Makevars:

    AR              = ar
    ARFLAGS         = qc
    ASFLAGS         =
    CC              = c99
    CFLAGS          =
    CPPDEFINES      = _REENTRANT
    CPPFLAGS        =
    CPPINCLUDES     =
    CXXFLAGS        =
    HASHBANGPERL    = /usr/bin/env perl
    LDFLAGS         =
    LDLIBS          =
    PERL            = perl
    RANLIB          = ranlib
    RC              = windres
    RCFLAGS         =

NOTE: These variables only represent the configuration view.  The build file
template may have processed these variables further, please have a look at the
build file for more exact data:
    Makefile

build file:

    Makefile

build file templates:

    Configurations/common0.tmpl
    Configurations/unix-Makefile.tmpl

About this issue

  • Original URL
  • State: closed
  • Created 8 months ago
  • Comments: 169 (169 by maintainers)

Commits related to this issue

Most upvoted comments

First, thanks for the details

Second, whoa, thats way off from what I thought was going on. That looks like its spinning trying to read in the config file while trying to get some random data via test_random before the dgram test even does any real work for the test itself.

I’ll need to think about this for a bit

There were some issues with the new branch in crypto/bio/bss_dgram_pair.c and 50-nonstop.conf but those were corrected. I am proceeding with the build/test cycle.

I feel like its kind of the reverse of that. That is to say, previously when you didn’t have SPT_THREAD_AWARE_NONBLOCK configured, we observed in the quic_multistream test that recvfrom mapped to spt_recvfrom, and that yielded the cpu, even when the socket it was called on was non-blocking. By changing the build to set SPT_THREAD_AWARE_NONBLOCK we altered the behavior of recvfrom (which now maps to spt_recvfromx) to return EWOULDBLOCK without issuing a yield, matching the behavior expected by the higher layers of code.

But once we did that, we started seeing this new issue, in which (if I’m reading you right), isn’t strictly a deadlock, but rather a spin loop in the higher layer code that never exits. Looking at the dgram_pair code we see it has no (or very few) syscalls, and makes no calls to recvfrom or sendto (opting to use an in memory fifo instead). I’m wondering if setting SPT_THREAD_AWARE_NONBLOCK altered some other syscall (that was previously a yield point), to no longer yield, causing the recv path to spin forever waiting for some other thread to send, preventing another thread from acquiring the cpu.

If its ok with you I’ll start a new branch with our combined changes, and start instrumenting the receive path for dgram_pair to see whats going on there

Actually, I think thats looking pretty good! If you look at the logs, we clearly have alot of repeated messages, which is to be expected, because the test run quic_update_tick (or whatever the server tick update function is called), which calls down into demux_recv, and it does that iteratively for every entry in the script array. If you note, on lines 153, 175, 194, we see “Handling operation <num>”, which is an indication that we are calling through the script operations properly. and on line 194, we see that the dyn script iteration passed.

So this is really good. There might be other errors, but we’re at least passing one test in which we hung.

in fact looking at line 197, 341, 506, 642, 778, 913, etc, we’re passing several of the tests

It does look like there is an infinite loop at the tail of the log, as you say though, where we are “Handling operation 7” over and over. Will need to dig into that next.

I think a good next step here is to take stock of our progress, identify the changes that we need to fix this hang we’ve been tracking, and propose a PR for that, then move on to this next issue, as it seems somewhat distinct. What do you think?

I think what we need to change for this first hang is:

  1. We need to update the define in BIO_set_nbio to force nonstop to build the fcntl/O_NONBLOCK code rather than the FIONBIO code (as the latter seems like it will never set non-blocking mode).
  2. I think you probably need to document/remember/something to ensure that openssl is built with SPT_THREAD_AWARE_NONBLOCK. Or is that something you think should be integrated into the nonstop config template?

Though I can confirm from visual inspection of the code that we are binding to 127.0.0.1 in helper_init:

ina.s_addr = htonl(0x7f000001UL);

I am out of office until this evening on a family emergency. Please stay tuned.

No problem & best wishes. No rush.

Thanks for the great effort you’ve put into maintaining OpenSSL on NonStop.

BIO_socket_nbio is here:

https://github.com/openssl/openssl/blob/7cf75e5caa71a54539d4559c3fb6b0a48b92243f/crypto/bio/bio_sock.c#L351-L395

Maybe this needs some adjustment for NonStop? There’s quite a bit of conditional code in that function. @rsbeckerca - can you identify which of the various defines are present for NonStop and therefore what codepath we go down in that function?

Ok, thats good information, and it correlates to the last stack trace you provided. Its really starting to feel like we’re somehow getting stuck in dmux_recv. I’ll instrument that function in the AM, and we’ll see if we can’t pin it down more closely.

Ok, not seeing a better way to get visibility on this issue at the moment, so I think we’re just going to have to go with some iterative-printf-ing to gain some insight here

@rsbeckerca I’ve got a branch here: https://github.com/nhorman/openssl/tree/nonstop-quic-hang Based on my previous fixes, adding some printfs to run_worker_script to start drilling down on where this problem may lie. If you would please, build that branch and run it (either by running the quic_multistream_test manually, or by using the test target with V=1 so the printfs are produced). Hopefully this will start to tell us more.

Working it now. Will report with results. Thanks.

@rsbeckerca would you mind pulling/trying https://github.com/openssl/openssl/pull/22616 one more time. In cleaning it up on review, I just found one more case of a reversed mutex_unlock/condvar_broadcast that I had missed. It had caused a failure in CI that reported as a timeout which would have effectively been a deadlock.

So the only thing I can think to do here in, given that it doesn’t seem like the stack trace youve managed to obtain doesn’t seem like it.could hang on it’s own is to start instrumenting the code to get some.elevated visibility. I’ll provide a debug patch in the next few days

apologies, I’m not used to seeing nonstop traces, and assumed it came from the parent process.

That said, I may see the problem. quic_multistream_test uses several (under the covers) pthread_cond/pthread_mutex calls, and appears to have reversed the order of mutex_unlock and cond_signal. I’ll have a pr for you shortly to test please