go: runtime: possible memory corruption caused by CL 304470 "cmd/compile, runtime: add metadata for argument printing in traceback"
OSS-Fuzz reported an issue a few weeks ago that we suspect is memory corruption caused by the runtime. It started on August 16th, so it is likely a Go 1.17 issue.
A slice bounds out of range panic is being reported from calls to regexp.MustCompile(`,\s*`).Split.
However, this is not reproducible with the inputs provided by OSS-Fuzz, so we suspect something else is going on.
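For context, the call shape under suspicion looks roughly like this (a minimal sketch; the actual gonids code at parser.go:675 may differ):

package main

import (
	"fmt"
	"regexp"
)

func main() {
	// Same call shape as the gonids option parser: split an option value on
	// commas followed by optional whitespace. The input here is illustrative.
	optionValue := "msg, sid,  rev"
	parts := regexp.MustCompile(`,\s*`).Split(optionValue, -1)
	fmt.Println(parts) // [msg sid rev]
}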
Below are some of the panic logs:
panic: runtime error: slice bounds out of range [:18416820578376] with length 59413
goroutine 17 [running, locked to thread]:
regexp.(*Regexp).Split(0x10c0000b2640, {0x10c0001c801f, 0x76dbc0}, 0xffffffffffffffff)
regexp/regexp.go:1266 +0x61c
github.com/google/gonids.(*Rule).option(0x10c000068000, {0x100c000096970, {0x10c0001c8016, 0x8}}, 0x10c00029a040)
github.com/google/gonids/parser.go:675 +0x36cf
github.com/google/gonids.parseRuleAux({0x10c0001c8000, 0x630000350400}, 0x0)
github.com/google/gonids/parser.go:943 +0x6b3
github.com/google/gonids.ParseRule(...)
github.com/google/gonids/parser.go:972
github.com/google/gonids.FuzzParseRule({0x630000350400, 0x0, 0x10c000000601})
github.com/google/gonids/fuzz.go:20 +0x54
main.LLVMFuzzerTestOneInput(...)
./main.1689543426.go:21
panic: runtime error: slice bounds out of range [628255583:13888]
goroutine 17 [running, locked to thread]:
regexp.(*Regexp).Split(0x10c0000b2640, {0x10c00033601f, 0x76dbc0}, 0xffffffffffffffff)
regexp/regexp.go:1266 +0x617
github.com/google/gonids.(*Rule).option(0x10c00026cc00, {0x100c00026e190, {0x10c000336016, 0x7}}, 0x10c0001a4300)
github.com/google/gonids/parser.go:675 +0x36cf
github.com/google/gonids.parseRuleAux({0x10c000336000, 0x62f00064a400}, 0x0)
github.com/google/gonids/parser.go:943 +0x6b3
github.com/google/gonids.ParseRule(...)
github.com/google/gonids/parser.go:972
github.com/google/gonids.FuzzParseRule({0x62f00064a400, 0x0, 0x10c000000601})
github.com/google/gonids/fuzz.go:20 +0x54
main.LLVMFuzzerTestOneInput(...)
./main.1689543426.go:21
AddressSanitizer:DEADLYSIGNAL
panic: runtime error: slice bounds out of range [473357973:29412]
goroutine 17 [running, locked to thread]:
regexp.(*Regexp).Split(0x10c0000b2640, {0x10c0002a001f, 0x76dbc0}, 0xffffffffffffffff)
regexp/regexp.go:1266 +0x617
github.com/google/gonids.(*Rule).option(0x10c0001b0180, {0x100c000280100, {0x10c0002a0016, 0xb}}, 0x10c0001ae040)
github.com/google/gonids/parser.go:675 +0x36cf
github.com/google/gonids.parseRuleAux({0x10c0002a0000, 0x632000930800}, 0x0)
github.com/google/gonids/parser.go:943 +0x6b3
github.com/google/gonids.ParseRule(...)
github.com/google/gonids/parser.go:972
github.com/google/gonids.FuzzParseRule({0x632000930800, 0x0, 0x10c000000601})
github.com/google/gonids/fuzz.go:20 +0x54
main.LLVMFuzzerTestOneInput(...)
./main.1689543426.go:21
From rsc@:
The relevant code is processing the [][]int returned from regexp.(*Regexp).FindAllStringIndex.
That [][]int is prepared by repeated append:
func (re *Regexp) FindAllStringIndex(s string, n int) [][]int {
	if n < 0 {
		n = len(s) + 1
	}
	var result [][]int
	re.allMatches(s, nil, n, func(match []int) {
		if result == nil {
			result = make([][]int, 0, startSize)
		}
		result = append(result, match[0:2])
	})
	return result
}
Each of the match[0:2] being appended is prepared in regexp.(*Regexp).doExecute by:

	dstCap = append(dstCap, m.matchcap...)

appending to a zero-length, non-nil slice to copy m.matchcap. And each of the m.matchcap is associated with the *regexp.machine m, which is kept in a sync.Pool for reuse.
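Schematically, the pooling-and-copy pattern described here looks roughly like the sketch below (illustrative only, not the actual regexp internals; machine and findIndex are stand-in names):

package main

import (
	"fmt"
	"sync"
)

// machine stands in for regexp's internal *machine; the real type has much more state.
type machine struct {
	matchcap []int
}

var pool = sync.Pool{
	New: func() interface{} { return &machine{matchcap: make([]int, 2)} },
}

// findIndex reuses a pooled machine, then copies its matchcap out via append
// so the result does not alias pooled memory.
func findIndex() []int {
	m := pool.Get().(*machine)
	defer pool.Put(m)
	m.matchcap[0], m.matchcap[1] = 3, 7 // pretend these are match positions
	dstCap := make([]int, 0, 2)         // zero-length, non-nil destination
	dstCap = append(dstCap, m.matchcap...)
	return dstCap
}

func main() {
	fmt.Println(findIndex()) // [3 7]
}

The append into a fresh backing array is what keeps the returned indexes independent of the pooled machine's reuse.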
The specific corruption is that the integers in the [][]int are clear non-integers (like pointers), which suggests that either one of the appends is losing the reference accidentally during GC or something in sync.Pool is wonky.
This could also be something strange that OSS-Fuzz is doing, and doesn’t necessarily represent a real-world use case.
/cc @golang/security
About this issue
- State: closed
- Created 3 years ago
- Comments: 169 (160 by maintainers)
Commits related to this issue
- gonids: recompile go to hunt bug cf https://github.com/golang/go/issues/49075 Try to git bisect this unreproducible bug — committed to catenacyber/oss-fuzz by catenacyber 2 years ago
- gonids: recompile go to hunt bug (#7876) cf https://github.com/golang/go/issues/49075 Try to git bisect this unreproducible bug — committed to google/oss-fuzz by catenacyber 2 years ago
- gonids: echoes binary fuzz target in build log To debug https://github.com/golang/go/issues/49075 — committed to catenacyber/oss-fuzz by catenacyber 2 years ago
- gonids: echoes binary fuzz target in build log (#7960) To debug https://github.com/golang/go/issues/49075 — committed to google/oss-fuzz by catenacyber 2 years ago
- gonids: recompile go to hunt bug (#7876) cf https://github.com/golang/go/issues/49075 Try to git bisect this unreproducible bug — committed to MartinPetkov/oss-fuzz by catenacyber 2 years ago
- gonids: echoes binary fuzz target in build log (#7960) To debug https://github.com/golang/go/issues/49075 — committed to MartinPetkov/oss-fuzz by catenacyber 2 years ago
- gonids: remove the git bisect logic for golang compiler Now that the commit has been identified cf https://github.com/golang/go/issues/49075 — committed to catenacyber/oss-fuzz by catenacyber 2 years ago
- gonids: remove the git bisect logic for golang compiler Now that the commit has been identified cf https://github.com/golang/go/issues/49075 — committed to catenacyber/oss-fuzz by catenacyber 2 years ago
- options: allow GODEBUG env variable cf https://github.com/golang/go/issues/49075 — committed to catenacyber/clusterfuzz by catenacyber 2 years ago
- gonids: remove the git bisect logic for golang compiler Now that the commit has been identified cf https://github.com/golang/go/issues/49075 And use different fuzzing options — committed to catenacyber/oss-fuzz by catenacyber 2 years ago
- gonids: remove the git bisect logic for golang compiler (#9121) Now that the commit has been identified cf https://github.com/golang/go/issues/49075 — committed to google/oss-fuzz by catenacyber 2 years ago
- options: allow GODEBUG env variable (#2874) cf https://github.com/golang/go/issues/49075 Co-authored-by: Oliver Chang <oliverchang@users.noreply.github.com> — committed to google/clusterfuzz by catenacyber 2 years ago
- gonids: remove the git bisect logic for golang compiler (#9121) Now that the commit has been identified cf https://github.com/golang/go/issues/49075 — committed to eamonnmcmanus/oss-fuzz by catenacyber 2 years ago
- runtime: fix misaligned SP for libfuzzer entry libfuzzer is written in C and so requires by the C abi that SP be aligned correctly mod 16. Normally CALLs need to have SP aligned to 0 mod 16, but beca... — committed to golang/go by randall77 a year ago
- golang: fixes libfuzzer signal handling libfuzzer needs to have its signal handler with SA_ONSTACK because golang stacks are small cf https://github.com/golang/go/issues/49075 — committed to catenacyber/oss-fuzz by catenacyber a year ago
- golang: fixes libfuzzer signal handling libfuzzer needs to have its signal handler with SA_ONSTACK because golang stacks are small cf https://github.com/golang/go/issues/49075 — committed to catenacyber/oss-fuzz by catenacyber a year ago
- golang: fixes libfuzzer signal handling libfuzzer needs to have its signal handler with SA_ONSTACK because golang stacks are small cf https://github.com/golang/go/issues/49075 — committed to catenacyber/oss-fuzz by catenacyber a year ago
- golang: fixes libfuzzer signal handling libfuzzer needs to have its signal handler with SA_ONSTACK because golang stacks are small cf https://github.com/golang/go/issues/49075 — committed to catenacyber/oss-fuzz by catenacyber a year ago
- golang: fixes libfuzzer signal handling libfuzzer needs to have its signal handler with SA_ONSTACK because golang stacks are small cf https://github.com/golang/go/issues/49075 — committed to catenacyber/oss-fuzz by catenacyber a year ago
- golang: fixes libfuzzer signal handling (#10314) libfuzzer needs to have its signal handler with SA_ONSTACK because golang stacks are small and not guarded against overflows Waiting for the clang... — committed to google/oss-fuzz by catenacyber a year ago
I just wanted to pipe in and say thank you to @catenacyber for doggedly running this down.
I managed to reproduce the bug locally! But with the fuzz target built externally (not the one I build locally).
Indeed, but oss-fuzz/clusterfuzz fails to parse it the same way because of braces {} missing from the regexp used to parse them, cf. the GOLANG_STACK_FRAME_FUNCTION_REGEX definition here: https://github.com/google/clusterfuzz/blob/08f52cd1b9c304cf39988561f1241cee9fd5673a/src/clusterfuzz/stacktraces/constants.py#L308
That led oss-fuzz to think the stack traces were different, hence the bugs were different, hence git bisect showed that the bug appeared with this formatting change.
The golang bug was not introduced by that commit, but by one in https://github.com/golang/go/commit/b05903a9f6408065c390ea6c62e523d9f51853a5..https://github.com/golang/go/commit/d3853fb4e6ee2b9f873ab2e41adc0e62a82e73e4: we have known good and bad revisions (even with the different stack trace).
So, see you in 10-ish days
Looks like bisection is making progress: 645cb62ee3 is good, 162d4f9c92 is bad.
270 commits to check between them.
So far so good, with this patch to libFuzzer:
I’m not sure that’s a general fix, but for Go it’s enough because Go does its own sigaltstack. Maybe everyone who cares does their own sigaltstack? In that case, this may be all that’s required.
45dc81d8565adb7d0a62502d039f4930e73d75e0 looks like a good candidate for the fix by @randall77
Since we now know this isn’t a recent regression, I’m moving this issue to 1.20.
Looking forward to further bisection. 😃
One commit per day, 2k commits: so approximately 11 days of bisection (log2(2048) ≈ 11), maybe a few more with skips and confirmation?
It can take months (or longer!) to track down runtime bugs, particularly those without clear reproduction steps. Waiting two weeks for a git bisection seems fine.
@josharian I am starting https://github.com/google/oss-fuzz/pull/7876 to try to get the commit responsible for it. This git bisect gives at most one result per day and we have 2k commits, but more than 2 weeks have passed and there is no progress yet on this…
To define struct sigaction yourself in Go, you could copy the various definitions of sigactiont in the runtime package.
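For example, such a copy might look like the sketch below, assuming linux/amd64 (the layout mirrors the runtime's sigactiont in runtime/defs_linux_amd64.go; other GOOS/GOARCH pairs differ, so treat this as illustrative):

package sigdef // hypothetical package name

// sigactiont mirrors the linux/amd64 definition in the runtime package.
type sigactiont struct {
	saHandler  uintptr // sa_handler / sa_sigaction
	saFlags    uint64  // sa_flags, e.g. SA_SIGINFO|SA_ONSTACK|SA_RESTORER
	saRestorer uintptr // sa_restorer
	saMask     uint64  // sa_mask
}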
Some more investigation, and I think I have a theory as to what is going wrong here.
TL;DR: libfuzzer is setting up a signal handler for alarms which will run on the Go stack. When the signal arrives, the Go stack overflows and corrupts nearby memory. libfuzzer needs to set up its signal handler to run on an alternate signal stack, not on whatever stack happens to be running when the signal arrives.
To debug, I modified the stdlib’s regexp library. The corruption I was seeing was in (*Regexp).FindAllStringIndex. At the point of the inner append the data looks fine, but looping over the result slice at the end finds entries with bogus indexes. A few consecutive entries were being overwritten, like this:
(Each line is the index i in the result array, &result[i][0], and result[i]. The small numbers are correct data, the 0s and large numbers are bogus. Note particularly that the corruption has a “hole” in it where the original data survived. Indexes change from run to run, but the general pattern is constant.)
So the entries are getting overwritten somehow. No GCs were running, and clobberfree=1 didn’t change the values being written, so probably not a GC thing. Unless previous objects allocated to that space were still being used somehow? In any case, I had to find the thing doing the writes.
So I hacked up regexp some more to find the writer. When an entire page consists of just the backing stores of these 2-int slices, I map that page read-only and see what faults on it. I was expecting to see the code doing the clobbering write, but instead I got a strange error: a SEGV happened on a random instruction that can’t fault. The fault address was not specified. That made me think that the kernel was sending us the SEGV and that we weren’t triggering it ourselves. And that made me think that a signal handler was triggered at that instruction, which SEGVd (or the kernel issued the SEGV because it couldn’t even set up the signal context). In the instance I was looking at, a read-only-mapped page was not too far away towards lower addresses.
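For reference, the "map the page read-only and see what faults" trick is roughly the following mechanism (a minimal sketch assuming linux, not the actual regexp hack; the page here comes from mmap rather than the Go heap):

package main

import (
	"fmt"
	"syscall"
)

func main() {
	pageSize := syscall.Getpagesize()

	// Grab a fresh, page-aligned region straight from the kernel so changing
	// its protection cannot disturb memory the Go allocator manages.
	page, err := syscall.Mmap(-1, 0, pageSize,
		syscall.PROT_READ|syscall.PROT_WRITE,
		syscall.MAP_ANON|syscall.MAP_PRIVATE)
	if err != nil {
		panic(err)
	}
	page[0] = 42 // still writable

	// Revoke write access. From now on any write to this page faults, and the
	// resulting SIGSEGV traceback points at whoever did the write.
	if err := syscall.Mprotect(page, syscall.PROT_READ); err != nil {
		panic(err)
	}
	fmt.Println("page is now read-only; reads still work:", page[0])
	// page[0] = 1 // uncommenting this line would fault
}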
A few more runs caught the signal handler in the act:
In the original bug, this handler usually ran just fine, but very occasionally, depending on timing, stack alignment, heap allocation order, etc., its stack frames would clobber something important. With my write-fault setup it fails more often, and always before any of the corruption described above manifests.
I’m still confused as to the pattern of overwrites. If my theory is correct, I would expect the overwrites to be at the very top of pages, as the signal handler’s stack would have grown down from the subsequent page. But instead I see just a stencil of 8 uint64s overwritten, and although often high-ish in a page (the top 1/4), not right at the top. Unless the handler’s stack frame is large (>2KB) and mostly not written to? Maybe?
I can test more once I can build my own clang/libfuzzer. At the rate it is going, it is going to take ~8 hours 😦 Hopefully rebuilds go faster.
Other clue: if I totally disable GC (commenting out runtime.GC() in fuzz.go after the debug.SetGCPercent(-1)), there is no crash.
I’m not entirely sure. The failure mode before that CL is that we could take the address of a stack variable using a LEAQ before that variable was initialized, i.e. it contained junk. If the garbage collector got a hold of that pointer in that state, it would see invalid stack object contents, including bad pointers, and those bad pointers might make it think an object came back from the dead. I’m not sure how it would cause the error you are seeing, though.
ASAN/libfuzzer instrumentation introduces additional function calls where the bad pointer might be observed. Again, I’m not really sure what happens in this particular case, but it is possible.
No idea, except that regabi certainly changes around what function prologs look like, which is where these LEAQs tend to be.
It is in theory possible to get better answers. I could prepare a CL like f959fb3, except that it only applies its change based on GOCOMPILEDEBUG=gossahash=… We could then binary search for the particular function for which applying f959fb3 fixes the issue, and then look at the code of just that function. The code is not that hard to write, but with the low failure rate the binary search would take a while and I’m not sure it is worth it.
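For readers unfamiliar with the approach, the hash-gating idea works roughly like the toy sketch below (illustrative only; the real hooks live in cmd/compile and are driven by the gossahash setting mentioned above): a change is applied only to functions whose name hash matches a binary suffix, and binary-searching over that suffix isolates the one function that matters.

package main

import (
	"fmt"
	"hash/fnv"
	"strconv"
)

// matches reports whether fn's name hash ends with the given bit suffix (e.g. "01").
func matches(fn, suffix string) bool {
	h := fnv.New32a()
	h.Write([]byte(fn))
	bits := strconv.FormatUint(uint64(h.Sum32()), 2)
	if len(suffix) > len(bits) {
		return false
	}
	return bits[len(bits)-len(suffix):] == suffix
}

func main() {
	// Lengthening the suffix halves (roughly) the set of gated functions,
	// which is what makes a binary search over it possible.
	for _, fn := range []string{"regexp.(*Regexp).Split", "gonids.(*Rule).option"} {
		fmt.Println(fn, matches(fn, "01"))
	}
}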
What instrumentation are you referring to? I cannot reproduce with a simple C driver instead of libFuzzer, even if the Go code is still instrumented the same (and libFuzzer code should not be executing when the bug happens). Do you suggest another test?
Also, I do not seem to reproduce the bug when building with -race.
I can reproduce after removing the lexer goroutine in gonids. I can not reproduce when replacing libFuzzer with a custom C driver… I can not reproduce when replacing the fuzz target with a direct call to regexp.Split…
I guess my next step would be to rewrite the fuzz target to use all the code in one file, and then try to minimize it…
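For reference, the "direct call to regexp.Split" variant mentioned above might look roughly like this (a hedged sketch; the target actually run on OSS-Fuzz may differ):

// Package fuzzsplit is an illustrative stand-in, not the real gonids target.
package fuzzsplit

import "regexp"

var commaRe = regexp.MustCompile(`,\s*`)

// FuzzSplit is a libFuzzer-style entry point: it feeds the input straight to
// (*Regexp).Split, bypassing the gonids parser entirely.
func FuzzSplit(data []byte) int {
	_ = commaRe.Split(string(data), -1)
	return 0
}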
Some system stack trace:
It reproduced now twice in 223 runs!
And it also reproduces with my local build!
So, what should I try to debug? Get a core dump? Try some other values for GOTRACEBACK or GODEBUG? Add some recover to get more info after the panic?
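A minimal sketch of that recover idea, assuming a libFuzzer-style target (ParseRule is a hypothetical stub standing in for the real parser):

package fuzzdebug

import "fmt"

// ParseRule is a stub; the real gonids parser would go here.
func ParseRule(s string) (interface{}, error) { return nil, nil }

func FuzzParseRule(data []byte) int {
	defer func() {
		if r := recover(); r != nil {
			// Record the offending input so the crash can be replayed, then
			// re-panic so libFuzzer still treats it as a failure.
			fmt.Printf("panic: %v\non input: %q\n", r, data)
			panic(r)
		}
	}()
	_, _ = ParseRule(string(data))
	return 0
}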
I got it only once:
Now I am trying a bash loop running this command. 153 runs without crashes (over the night) so far…
So @aclements, I think the commit introducing the regression is yours: 9dd71ba91397c7f69571ae7f0810d64f2f38547a
https://github.com/golang/go/compare/b05903a9f6408065c390ea6c62e523d9f51853a5...d3853fb4e6ee2b9f873ab2e41adc0e62a82e73e4
Running the test under the race detector would be a step, if it works.
The stack trace looks pretty much the same as the original one. It is the same stack of functions.
No results
How could I get such an assurance?
New fact: d4aa72002e is not really good, but the bug produces a different stack trace.
So, I will be back to bisecting b05903a9f6…d3853fb4e
Great find, thanks @catenacyber!
Just to clarify, this bisect was pointing OSS-Fuzz at https://github.com/google/gonids/blob/master/fuzz.go#L19 and waiting for failures? We don’t have a consistent reproducer, right?
Is it possible/are there instructions for running this fuzzer locally as a reproducer? Is https://github.com/golang/go/issues/49075#issuecomment-947735939 still accurate? We control the build of Go used in https://github.com/google/oss-fuzz/blob/master/projects/gonids/build.sh#L18?
Looks like the regression ended up with commit 537cde0b4b "cmd/compile, runtime: add metadata for argument printing in traceback" as the culprit.
What is next, @josharian?
Possibly related to #49961, if oss-fuzz uses reflect.Call or friends.
Side note on reproducing the issue: it happened 35 times on November 14th, among more than 2400 runs. It happens about one time in 100 runs…
Thanks Ian, you answered my question. GODEBUG likely needs to be set up before start… which is complicated on oss-fuzz… which is the only place where the bug reproduces…
Then there will be no more point for me to do it. The point is that it is easier to set GODEBUG from within the running fuzz target than outside of it before running it… So, I guess I will think on it…
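A minimal sketch of what setting GODEBUG from inside the target would look like (hypothetical; as noted above, the runtime reads most GODEBUG settings at process start, so this may simply not take effect):

package fuzzenv

import "os"

func init() {
	// "invalidptr=1" is just an example value, not the setting under discussion.
	os.Setenv("GODEBUG", "invalidptr=1")
}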