botan: Sporadic test failure with 3.1.0 (related to BOTAN_CLEAR_CPUID -- 0cd6692)

Hi folks,

Thanks for Botan. I package it in Gentoo and hit the following test failure when first bumping our package to 3.1.0:

SP800-108-Feedback(HMAC(SHA-1)) ran 120 tests in 2.72 msec all ok
SP800-108-Feedback(HMAC(SHA-256)) ran 120 tests in 2.73 msec all ok
SP800-108-Feedback(HMAC(SHA-384)) ran 120 tests in 3.02 msec all ok
SP800-108-Feedback(HMAC(SHA-512)) ran 120 tests in 2.61 msec all ok
SP800-108-Pipeline(CMAC(AES-128)) ran 120 tests in 3.87 msec 1 FAILED
Failure 1: SP800-108-Pipeline(CMAC(AES-128)) unexpected result for derived key
Produced: 7020B91FCED6BBC5A9A2F196
Expected: A2762C4FF7BC4D21E5C25245
XOR Diff: D2569550396AF6E44C60A3D3 (at src/tests/test_kdf.cpp:54)
Note 1: SP800-108-Pipeline(CMAC(AES-128)) Test # 18 SP800-108-Pipeline(CMAC(AES-128)) failed Output=A2762C4FF7BC4D21E5C25245 Secret=F9089D56D9A6C6F6BCB9992D1896510C
SP800-108-Pipeline(CMAC(AES-192)) ran 120 tests in 4.28 msec all ok
SP800-108-Pipeline(CMAC(AES-256)) ran 120 tests in 3.91 msec all ok
SP800-108-Pipeline(CMAC(TripleDES)) ran 120 tests in 1.86 msec all ok
SP800-108-Pipeline(HMAC(SHA-1)) ran 120 tests in 3.97 msec all ok
SP800-108-Pipeline(HMAC(SHA-256)) ran 120 tests in 3.90 msec all ok
SP800-108-Pipeline(HMAC(SHA-384)) ran 120 tests in 3.89 msec all ok
SP800-108-Pipeline(HMAC(SHA-512)) ran 120 tests in 3.88 msec all ok
SP800-56A(HMAC(SHA-1)) ran 200 tests in 2.92 msec all ok
SP800-56A(HMAC(SHA-224)) ran 196 tests in 3.55 msec all ok
SP800-56A(HMAC(SHA-256)) ran 196 tests in 3.55 msec all ok
SP800-56A(HMAC(SHA-384)) ran 196 tests in 3.54 msec all ok
SP800-56A(HMAC(SHA-512)) ran 192 tests in 3.96 msec all ok
[...]
zfec:
ZFEC encoding/decoding ran 10449 tests in 8.10 msec all ok
Tests complete ran 2927570 tests in 22.14 sec 1 tests failed (in kdf_kat)
 * ERROR: dev-libs/botan-3.1.0::gentoo failed (test phase):
 *   Validation tests failed

I can’t reproduce this on subsequent runs yet…

This is with GCC 13.1.1 20230708 on amd64. The machine has ECC ram and I don’t see any EDAC events though…

About this issue

  • Original URL
  • State: closed
  • Created a year ago
  • Comments: 24 (8 by maintainers)

Commits related to this issue

Most upvoted comments

Nice! Trying the diff from the PR now.

(We have a 96-cored arm64 box (and a 266-cored sparc one, but I assume your interest in that is less, although it’s great for alignment checking) if either of you two ever want access to it. It’s purely for development and testing stuff like this, so it’s no bother.)

Edit: I’m going to declare victory for now after 35 good iterations. Thank you!

#3625

I am unable to repro this locally, probably because my desktop doesn’t have enough cores 😭 so I’ll need @reneme or @thesamesam to confirm this fixes it

iff --git a/src/tests/test_zfec.cpp b/src/tests/test_zfec.cpp
index 8e873742d..199cf66c9 100644
--- a/src/tests/test_zfec.cpp
+++ b/src/tests/test_zfec.cpp
@@ -102,7 +102,7 @@ class ZFEC_KAT final : public Text_Based_Test {
       }
 };

-BOTAN_REGISTER_TEST("zfec", "zfec", ZFEC_KAT);
+BOTAN_REGISTER_SERIALIZED_TEST("zfec", "zfec", ZFEC_KAT);

 }  // namespace Botan_Tests

@thesamesam Fixes are included in the newly released 3.1.1. Thanks so much for reporting this and all your help in repro and testing.

Thanks, lemme adjust the ebuild to handle that.

@thesamesam That’s an test that is intentionally UB because that’s the only way we can check an error case. I imagine this happened because you passed the UbSan flags directly via CXXFLAGS - we expect UbSan to be set with --enable-sanitizers=undefined in which case we set a macro and skip tests of this sort.

@reneme I think it is not because of BOTAN_CLEAR_CPUID at all but because of the checks like

      static bool has_clmul()
-         { return has_cpuid_bit(CPUID_CLMUL_BIT); }
+         { return has_ssse3() && has_cpuid_bit(CPUID_CLMUL_BIT); }

so when ZFEC cleared SSSE3 in order to run its test, it implicitly also “cleared” CLMUL. So it increased the blast radius of what was affected. Before that point, when ZFEC cleared SSSE3 it would not have affected CLMUL or AESNI, etc.

The intent of that change was that using say BOTAN_CLEAR_CPUID=ssse3 actually disables use of all code using SSSE3; it would be awkward if you used that flag, and then we immediately use SSSE3 in CLMUL code or etc. But this may be too confusing in other parts of the code; I think we should revert that part of the change, at least for now.

It was looking good at first, but after 4 iterations:

$ grep -rsin failed /var/tmp/portage/dev-libs/botan-3.1.0/temp/build.log -C 5
1823-GF2m ran 131064 tests all ok
1824-gost_3410_keygen:
1825-GOST-34.10 gost_256A keygen ran 28 tests in 3.88 sec all ok
1826-GOST-34.10 secp256r1 keygen ran 28 tests in 2.55 sec all ok
1827-gost_3410_sign:
1828:pubkey/gost_3410_sign.vec ran 1 tests 1 FAILED
1829:Failure 1: Test # 1 A=0x7 B=0x5FBFF498AA938CE739B8E022FBAFEF40563F6E6A3472FC2A514C0CE9DAE23B7E Cofactor=1 Gx=2 Gy=0x8E2A8A0E65147D4BD6316030E16D19C85C97F0A9CA267122B96ABBCEA7E8FC8 Hash=Raw Msg=E53E042B67E6EC678E2E02B12A0352CE1FC6EEE0529CC088119AD872B3C1FB2D Nonce=77105C9B20BCD3122823C8CF6FCC7B956DE33814E95B7FE64FED924594DCEAB3 Oid=1.3.6.1.4.1.25258.2 Order=0x8000000000000000000000000000000150FE8A1892976154C59CFC193ACCF5B3 P=0x8000000000000000000000000000000000000000000000000000000000000431 Signature=01456C64BA4642A1653C235A98A60249BCD6D3F746B631DF928014F6C5BF9C4041AA28D2F1AB148280CD9ED56FEDA41974053554A42767B83AD043FD39DC0493 X=0x7A929ADE789BB9BE10ED359DD39A72C11B60961F49397EEE1D19CE9891EC3B28 failed with exception 'Unknown EC_Group 1.3.6.1.4.1.25258.2' (at src/tests/test_gost_3410.cpp:89)
1830-gost_3410_verify:
1831-GOST 34.10-2001/Raw signature verification ran 8 tests in 211.39 msec all ok
1832-hash_nist_mc:
1833-NIST Monte Carlo SHA-1 ran 1 tests in 9.75 msec all ok
1834-NIST Monte Carlo SHA-224 ran 1 tests in 10.31 msec all ok
--
2637-XMSS/SHAKE_16_512 verify invalid signature ran 28 tests in 519.32 msec all ok
2638-XMSS/SHAKE_20_256 verify invalid signature ran 28 tests in 284.70 msec all ok
2639-XMSS/SHAKE_20_512 verify invalid signature ran 28 tests in 542.47 msec all ok
2640-zfec:
2641-ZFEC encoding/decoding ran 10449 tests in 6.79 msec all ok
2642:Tests complete ran 2927554 tests in 24.00 sec 1 tests failed (in gost_3410_sign)
2643: * ERROR: dev-libs/botan-3.1.0::gentoo failed (test phase):
2644: *   Failed to run command: ./botan-test3 --test-threads=32
2645- *
2646- * Call stack:
2647- *     ebuild.sh, line  136:  Called src_test
2648- *   environment, line 2772:  Called edo './botan-test3' '--test-threads=32'
2649- *   environment, line 1130:  Called die
2650- * The specific snippet of code:
2651: *       "$@" || die -n "Failed to run command: $@"
2652- *
2653- * If you need support, post the output of `emerge --info '=dev-libs/botan-3.1.0::gentoo'`,
2654- * the complete build log and the output of `emerge -pqv '=dev-libs/botan-3.1.0::gentoo'`.
2655- * The complete build log is located at '/var/log/portage/build/dev-libs/botan-3.1.0:20230712-121644.log'.
2656- * For convenience, a symlink to the build log is located at '/var/tmp/portage/dev-libs/botan-3.1.0/temp/build.log'.

You’ve been looping the entire test corpus, right?

Just looping the failing kyber_keygen test didn’t give me any errors in about 4000 iterations, unfortunately.

i=1; while ./botan-test kyber_keygen; do echo $i; i=$(( i+1 )); done

Edit: same with kdf_kat in 20k iterations. 😦

i=1; while ./botan-test kdf_kat; do echo $i; i=$(( i+1 )); done

Edit 2: the full corpus failed rather quickly. 😮

I seem to be able to reproduce it.

Edit 3: 50 iterations of the test corpus on Botan 3.0.0 came back clean.