pebble: Blocked on alloc under macOS

I’m running a local roachprod cluster on macOS with 8 nodes (cockroachdb/cockroach@ce3b29b71f), and creating a table with some data hangs forever:

$ roachprod create local -n 8
$ roachprod start local --racks 3
CREATE DATABASE test;
ALTER DATABASE test CONFIGURE ZONE USING num_replicas = 5, range_min_bytes = 1e6, range_max_bytes=10e6;
USE test;
CREATE TABLE data AS SELECT id, REPEAT('x', 1024) FROM generate_series(1, 1e6) AS id;

The logs keep repeating this message:

E210407 14:01:42.440401 2322884 kv/kvserver/queue.go:1093 ⋮ [n6,raftsnapshot,s6,r78/2:‹/{Table/53/1/-…-Max}›] 5294  (n8,s8):3: remote couldn't accept VIA_SNAPSHOT_QUEUE snapshot ‹c40b7a30› at applied index 13 with error: ‹[n8,s8],r78: cannot apply snapshot: snapshot intersects existing range; initiated GC: [n8,s8,r69/3:/{Table/53/1/-…-Max}] (incoming /{Table/53/1/-9222246136947506714-Max})›

Range 69 is stuck applying Raft entries before a range split, which is why the GC never goes through:

[Screenshot, 2021-04-07 16:06:09]

Looking at the goroutine stacks, we found this Pebble call, which has been blocked in a cgo calloc call for 67 minutes:

goroutine 250 [syscall, 67 minutes]:
github.com/cockroachdb/pebble/internal/manual._Cfunc_calloc(0x79, 0x1, 0x0)
	_cgo_gotypes.go:42 +0x49
github.com/cockroachdb/pebble/internal/manual.New(0x79, 0x73, 0x1056f1, 0x0)
	/Users/erik/Projects/go/src/github.com/cockroachdb/cockroach/vendor/github.com/cockroachdb/pebble/internal/manual/manual.go:40 +0x3d
github.com/cockroachdb/pebble/internal/cache.newValue(0x59, 0x2)
	/Users/erik/Projects/go/src/github.com/cockroachdb/cockroach/vendor/github.com/cockroachdb/pebble/internal/cache/value_normal.go:38 +0x38
github.com/cockroachdb/pebble/internal/cache.(*Cache).Alloc(...)
	/Users/erik/Projects/go/src/github.com/cockroachdb/cockroach/vendor/github.com/cockroachdb/pebble/internal/cache/clockpro.go:696
github.com/cockroachdb/pebble/sstable.(*Reader).readBlock(0xc001e28a00, 0x1056f1, 0x54, 0x0, 0x0, 0x3fd9384950000000, 0x6, 0x7)
	/Users/erik/Projects/go/src/github.com/cockroachdb/cockroach/vendor/github.com/cockroachdb/pebble/sstable/reader.go:1910 +0x165
github.com/cockroachdb/pebble/sstable.(*Reader).readMetaindex(0xc001e28a00, 0x1056f1, 0x54, 0x0, 0x0)
	/Users/erik/Projects/go/src/github.com/cockroachdb/cockroach/vendor/github.com/cockroachdb/pebble/sstable/reader.go:2012 +0x7d
github.com/cockroachdb/pebble/sstable.NewReader(0x13c2d040, 0xc0004c6698, 0xc000c62460, 0xc5590a0, 0xc0014b5560, 0x8718864, 0x18, 0xc001cc2440, 0x1, 0x1, ...)
	/Users/erik/Projects/go/src/github.com/cockroachdb/cockroach/vendor/github.com/cockroachdb/pebble/sstable/reader.go:2309 +0x348
github.com/cockroachdb/pebble.ingestLoad1(0xc0012a01e0, 0xc0016359a0, 0x49, 0x2, 0x73, 0x0, 0x0, 0x0)
	/Users/erik/Projects/go/src/github.com/cockroachdb/cockroach/vendor/github.com/cockroachdb/pebble/ingest.go:60 +0x2cd
github.com/cockroachdb/pebble.ingestLoad(0xc0012a01e0, 0xc0025f6ae0, 0x1, 0x1, 0x2, 0xc00381d718, 0x1, 0x1, 0xc001cc25c0, 0x10000c001cc25f0, ...)
	/Users/erik/Projects/go/src/github.com/cockroachdb/cockroach/vendor/github.com/cockroachdb/pebble/ingest.go:161 +0x1a5
github.com/cockroachdb/pebble.(*DB).Ingest(0xc0007e0400, 0xc0025f6ae0, 0x1, 0x1, 0x1, 0xc0025f6ae0)
	/Users/erik/Projects/go/src/github.com/cockroachdb/cockroach/vendor/github.com/cockroachdb/pebble/ingest.go:528 +0x19e
github.com/cockroachdb/cockroach/pkg/storage.(*Pebble).IngestExternalFiles(0xc001281380, 0x9bbdf50, 0xc0026774d0, 0xc0025f6ae0, 0x1, 0x1, 0x0, 0x0)
	/Users/erik/Projects/go/src/github.com/cockroachdb/cockroach/pkg/storage/pebble.go:1110 +0x4c
github.com/cockroachdb/cockroach/pkg/kv/kvserver.addSSTablePreApply(0x9bbdf50, 0xc0026774d0, 0xc000ec0000, 0x9d0ad68, 0xc001281380, 0x9c08d48, 0xc001222580, 0x7, 0x27, 0xc004cf4000, ...)
	/Users/erik/Projects/go/src/github.com/cockroachdb/cockroach/pkg/kv/kvserver/replica_proposal.go:577 +0x762
github.com/cockroachdb/cockroach/pkg/kv/kvserver.(*replicaAppBatch).runPreApplyTriggersAfterStagingWriteBatch(0xc000b43be0, 0x9bbdf50, 0xc0026774d0, 0xc00106e008, 0x0, 0x0)
	/Users/erik/Projects/go/src/github.com/cockroachdb/cockroach/pkg/kv/kvserver/replica_application_state_machine.go:616 +0xcfc
github.com/cockroachdb/cockroach/pkg/kv/kvserver.(*replicaAppBatch).Stage(0xc000b43be0, 0x9bbe6f8, 0xc00106e008, 0xc001cc3338, 0x5ae3c4b, 0xc000b43df8, 0xc000b43e28)
	/Users/erik/Projects/go/src/github.com/cockroachdb/cockroach/pkg/kv/kvserver/replica_application_state_machine.go:507 +0x3be
github.com/cockroachdb/cockroach/pkg/kv/kvserver/apply.mapCmdIter(0x9c005c0, 0xc000b43df8, 0xc001cc3470, 0x0, 0x0, 0x0, 0x0)
	/Users/erik/Projects/go/src/github.com/cockroachdb/cockroach/pkg/kv/kvserver/apply/cmd.go:175 +0x142
github.com/cockroachdb/cockroach/pkg/kv/kvserver/apply.(*Task).applyOneBatch(0xc001cc3958, 0x9bbdf50, 0xc0026774d0, 0x9c005c0, 0xc000b43dc8, 0x0, 0x0)
	/Users/erik/Projects/go/src/github.com/cockroachdb/cockroach/pkg/kv/kvserver/apply/task.go:280 +0x185
github.com/cockroachdb/cockroach/pkg/kv/kvserver/apply.(*Task).ApplyCommittedEntries(0xc001cc3958, 0x9bbdf50, 0xc0026774d0, 0x1, 0x8760951)
	/Users/erik/Projects/go/src/github.com/cockroachdb/cockroach/pkg/kv/kvserver/apply/task.go:247 +0xc5
github.com/cockroachdb/cockroach/pkg/kv/kvserver.(*Replica).handleRaftReadyRaftMuLocked(0xc000b43b00, 0x9bbdf50, 0xc0026774d0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, ...)
	/Users/erik/Projects/go/src/github.com/cockroachdb/cockroach/pkg/kv/kvserver/replica_raft.go:796 +0xfad
github.com/cockroachdb/cockroach/pkg/kv/kvserver.(*Replica).handleRaftReady(0xc000b43b00, 0x9bbdf50, 0xc0026774d0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, ...)
	/Users/erik/Projects/go/src/github.com/cockroachdb/cockroach/pkg/kv/kvserver/replica_raft.go:459 +0x113
github.com/cockroachdb/cockroach/pkg/kv/kvserver.(*Store).processReady(0xc000e72a00, 0x9bbdf50, 0xc000f1ff20, 0x45)
	/Users/erik/Projects/go/src/github.com/cockroachdb/cockroach/pkg/kv/kvserver/store_raft.go:523 +0x134
github.com/cockroachdb/cockroach/pkg/kv/kvserver.(*raftScheduler).worker(0xc001854b40, 0x9bbdf50, 0xc000f1ff20)
	/Users/erik/Projects/go/src/github.com/cockroachdb/cockroach/pkg/kv/kvserver/scheduler.go:284 +0x2c2
github.com/cockroachdb/cockroach/pkg/util/stop.(*Stopper).RunAsyncTask.func1(0xc0014c0580, 0x9bbdf50, 0xc000f1ff20, 0xc000e83340, 0xc001ad8e70)
	/Users/erik/Projects/go/src/github.com/cockroachdb/cockroach/pkg/util/stop/stopper.go:351 +0xb9
created by github.com/cockroachdb/cockroach/pkg/util/stop.(*Stopper).RunAsyncTask
	/Users/erik/Projects/go/src/github.com/cockroachdb/cockroach/pkg/util/stop/stopper.go:346 +0xfc

About this issue

  • State: closed
  • Created 3 years ago
  • Comments: 26 (14 by maintainers)

Most upvoted comments

This turned out to be caused by cgosymbolizer, which was imported in CockroachDB, and thus it’s not a problem with Pebble. The CockroachDB issue will be resolved by cockroachdb/cockroach#63737.

It’s Go 1.14.

CRDB                         Go        Behavior
v20.2.7 download             1.13.14
v21.1.0-beta.2 download      1.15.11
master @ 84a84d56b3          1.16.3
master @ 84a84d56b3          1.15.11
master @ 84a84d56b3          1.13.15   Won’t compile
release-20.2 @ ae0d209448    1.16.3
release-20.2 @ ae0d209448    1.14.15
release-20.2 @ ae0d209448    1.13.15

The Go 1.14 release notes say:

Goroutines are now asynchronously preemptible. As a result, loops without function calls no longer potentially deadlock the scheduler or significantly delay garbage collection. This is supported on all platforms except windows/arm, darwin/arm, js/wasm, and plan9/*.

A consequence of the implementation of preemption is that on Unix systems, including Linux and macOS systems, programs built with Go 1.14 will receive more signals than programs built with earlier releases. This means that programs that use packages like syscall or golang.org/x/sys/unix will see more slow system calls fail with EINTR errors. Those programs will have to handle those errors in some way, most likely looping to try the system call again. For more information about this see man 7 signal for Linux systems or similar documentation for other systems.

This seems like it could cause what we’re seeing, but I’m not sure if it’s our responsibility or the runtime’s in this case – I don’t believe calloc is a syscall, and the syscall the goroutine refers to is the Cgo switch to the system stack. In any case, I’ll try to fiddle around with some signal handlers.