go: crypto/x509: intermittent failures on darwin/amd64 with ‘x509: “…” certificate is expired’ since 2021-11-06

2022/02/14 12:08:07 failed to run `go mod download -json github.com/google/wycheproof@v0.0.0-20191219022705-2196000605e4`, output: {
	"Path": "github.com/google/wycheproof",
	"Version": "v0.0.0-20191219022705-2196000605e4",
	"Error": "github.com/google/wycheproof@v0.0.0-20191219022705-2196000605e4: Get \"https://proxy.golang.org/github.com/google/wycheproof/@v/v0.0.0-20191219022705-2196000605e4.info\": x509: “misc-sni.google.com” certificate is expired"
}
FAIL	golang.org/x/crypto/internal/wycheproof	0.555s

greplogs --dashboard -md -l -e 'x509: “misc-sni.google.com” certificate is expired'

2022-02-14T20:07:02-8634188-b2cb1bd/darwin-amd64-11_0 2022-02-11T19:36:36-f4118a5-0bde2cf/darwin-amd64-11_0 2022-01-10T21:48:09-03fcf44-1f411e9/darwin-amd64-11_0 2021-12-09T19:01:08-03fcf44-8ff254e/darwin-amd64-11_0 2021-12-08T18:06:06-5770296-7b7efd7/darwin-amd64-11_0 2021-12-08T15:15:40-03fcf44-08025a9/darwin-amd64-11_0 2021-11-12T18:57:22-c33205f-b1b6d92/darwin-amd64-10_15 2021-11-06T19:41:15-ec00351-61d789d/darwin-amd64-11_0

These failures seem to always occur when fetching from proxy.golang.org, but not for any specific package. (They occur intermittently across many different repos and tests.) However, I have only observed them on darwin/amd64 builders. I’m not sure where in the stack these failures are introduced — whether they’re due to a bug in the darwin kernel, a bug (perhaps a race?) in crypto/x509, some misconfiguration on the builders, or something else entirely.

These failures also seem to have started recently (during or shortly before the Go 1.18 code freeze).

@golang/security: since darwin/amd64 is a first class port, could someone assess whether this is a Go 1.18 regression, and whether and how we should mitigate these errors?

About this issue

  • Original URL
  • State: closed
  • Created 2 years ago
  • Comments: 22 (21 by maintainers)

Most upvoted comments

Honestly it is amazing that it works as often as it does (most of the time!).

It works so well because of the register ABI or the generated code, which caused t.Sub(…).Seconds() inside x509 to leave the number we wanted in the floating-point register. As long as it doesn’t get overwritten due to goroutine rescheduling before the libc call, all is well.

I have confirmed this by putting a spurious math.Sin call in the middle.

This program calls x509.Verify directly, from a bunch of goroutines, over and over, on the cert chain from proxy.golang.org. It fails pretty often:

2022/02/18 16:24:56 FAIL 1 (in 2.109s): x509: “misc-sni.google.com” certificate is expired
2022/02/18 16:24:57 FAIL 2 (in 3.065s): x509: “misc-sni.google.com” certificate is expired
2022/02/18 16:25:00 FAIL 3 (in 6.659s): x509: “misc-sni.google.com” certificate is expired
2022/02/18 16:25:02 FAIL 4 (in 8.741s): x509: “misc-sni.google.com” certificate is expired
2022/02/18 16:25:13 FAIL 5 (in 19.139s): x509: “misc-sni.google.com” certificate is expired
2022/02/18 16:25:17 FAIL 6 (in 23.597s): x509: “misc-sni.google.com” certificate is expired
2022/02/18 16:25:31 FAIL 7 (in 37.134s): x509: “misc-sni.google.com” certificate is expired
2022/02/18 16:25:31 FAIL 8 (in 37.686s): x509: “misc-sni.google.com” certificate is expired
2022/02/18 16:25:42 FAIL 9 (in 48.810s): x509: “misc-sni.google.com” certificate is expired
2022/02/18 16:25:44 FAIL 10 (in 50.137s): x509: “misc-sni.google.com” certificate is expired
2022/02/18 16:25:45 FAIL 11 (in 51.371s): x509: “misc-sni.google.com” certificate is expired
2022/02/18 16:25:57 FAIL 12 (in 63.067s): x509: “misc-sni.google.com” certificate is expired

The call is

_, err = c1.Verify(x509.VerifyOptions{
	DNSName:       "proxy.golang.org",
	Intermediates: pool,
	CurrentTime:   time.Now(),
})

If I remove the CurrentTime line, then the problem goes away. This strongly suggests that the time conversion is the problem. And looking more carefully at crypto/tls, it does set CurrentTime unconditionally in handshake_client.go.

The time conversion is done in crypto/x509/internal/macos:

func TimeToCFDateRef(t time.Time) CFRef {
	secs := t.Sub(time.Date(2001, 1, 1, 0, 0, 0, 0, time.UTC)).Seconds()
	ref := CFDateCreate(int(secs))
	return ref
}

The only problem is that CFDateCreate takes a CFAbsoluteTime and a CFAbsoluteTime is seconds since 2001-01-01, yes, but as a CFTimeInterval, which in turn is a double (float64).

We are passing an integer, in an integer register, and leaving the floating-point register where CFDateCreate expects to find its argument completely uninitialized.

Honestly it is amazing that it works as often as it does (most of the time!). I wonder if SecTrustSetVerifyDate treats super weird dates as if the call never happened and falls back to the current date. Or maybe X0 is very often zero, and zero is special, or maybe something else…

In any event, this hack makes the program run without failure for many minutes, when before it failed every few seconds:

% git diff 
diff --git a/src/crypto/x509/internal/macos/corefoundation.go b/src/crypto/x509/internal/macos/corefoundation.go
index eedc41a183..00925292a5 100644
--- a/src/crypto/x509/internal/macos/corefoundation.go
+++ b/src/crypto/x509/internal/macos/corefoundation.go
@@ -12,6 +12,7 @@ package macOS
 import (
 	"errors"
 	"internal/abi"
+	"math"
 	"reflect"
 	"runtime"
 	"time"
@@ -49,7 +50,7 @@ func CFStringToString(ref CFRef) string {
 // TimeToCFDateRef converts a time.Time into an apple CFDateRef
 func TimeToCFDateRef(t time.Time) CFRef {
 	secs := t.Sub(time.Date(2001, 1, 1, 0, 0, 0, 0, time.UTC)).Seconds()
-	ref := CFDateCreate(int(secs))
+	ref := CFDateCreate(secs)
 	return ref
 }
 
@@ -171,8 +172,8 @@ func x509_CFArrayAppendValue_trampoline()
 
 //go:cgo_import_dynamic x509_CFDateCreate CFDateCreate "/System/Library/Frameworks/CoreFoundation.framework/Versions/A/CoreFoundation"
 
-func CFDateCreate(seconds int) CFRef {
-	ret := syscall(abi.FuncPCABI0(x509_CFDateCreate_trampoline), kCFAllocatorDefault, uintptr(seconds), 0, 0, 0, 0)
+func CFDateCreate(seconds float64) CFRef {
+	ret := syscall(abi.FuncPCABI0(x509_CFDateCreate_trampoline), kCFAllocatorDefault, uintptr(math.Float64bits(seconds)), 0, 0, 0, 0)
 	return CFRef(ret)
 }
 func x509_CFDateCreate_trampoline()
diff --git a/src/runtime/sys_darwin_amd64.s b/src/runtime/sys_darwin_amd64.s
index 5d89cda8e6..6a65a80f2a 100644
--- a/src/runtime/sys_darwin_amd64.s
+++ b/src/runtime/sys_darwin_amd64.s
@@ -839,6 +839,7 @@ TEXT runtime·syscallNoErr(SB),NOSPLIT,$0
 	SUBQ	$16, SP
 	MOVQ	(0*8)(DI), R11// fn
 	MOVQ	(2*8)(DI), SI // a2
+	MOVQ	SI, X0 // hack for CFDateCreate
 	MOVQ	(3*8)(DI), DX // a3
 	MOVQ	(4*8)(DI), CX // a4
 	MOVQ	(5*8)(DI), R8 // a5
diff --git a/src/runtime/sys_darwin_arm64.s b/src/runtime/sys_darwin_arm64.s
index 96d2ed1076..9a3b3cc4f8 100644
--- a/src/runtime/sys_darwin_arm64.s
+++ b/src/runtime/sys_darwin_arm64.s
@@ -744,6 +744,7 @@ TEXT runtime·syscallNoErr(SB),NOSPLIT,$0
 
 	MOVD	0(R0), R12	// fn
 	MOVD	16(R0), R1	// a2
+	FMOVD	16(R0), F0 // hack for CFDateCreate
 	MOVD	24(R0), R2	// a3
 	MOVD	32(R0), R3	// a4
 	MOVD	40(R0), R4	// a5
% 

That seems to confirm the diagnosis. I will think more about the right fix.

Reopening to track cherry-pick to Go 1.18 release branch.

It works so well because of the register ABI or the generated code, which caused t.Sub(…).Seconds() inside x509 to leave the number we wanted in the floating-point register.

(I’m on vacation, but this is amazing.)