foundationdb: Unexplained crash in 6.2.11

I recently encountered a crash in a 6.2.11 fdbserver process, and it’s not immediately obvious to me what went wrong. I’m capturing the stack trace here for further investigation:

crashHandler(int) at /opt/foundation/foundationdb/flow/Platform.cpp:2765
?? ??:0
Reference<ArenaBlock>::~Reference() at /opt/foundation/foundationdb/./flow/FastRef.h:114
 (inlined by) Arena::~Arena() at /opt/foundation/foundationdb/./flow/Arena.h:92
 (inlined by) ArenaObjectReader::~ArenaObjectReader() at /opt/foundation/foundationdb/./flow/ObjectSerializer.h:117
 (inlined by) a_body1cont1 at /opt/foundation/foundationdb/fdbrpc/FlowTransport.actor.cpp:651
a_body1 at /opt/foundation/foundationdb/.objs/fdbrpc/FlowTransport.actor.g.cpp:2200
 (inlined by) DeliverActor at /opt/foundation/foundationdb/.objs/fdbrpc/FlowTransport.actor.g.cpp:2423
 (inlined by) deliver at /opt/foundation/foundationdb/fdbrpc/FlowTransport.actor.cpp:636
Endpoint::~Endpoint() at /opt/foundation/foundationdb/./fdbrpc/FlowTransport.h:32
 (inlined by) scanPackets at /opt/foundation/foundationdb/fdbrpc/FlowTransport.actor.cpp:765
(anonymous namespace)::ConnectionReaderActorState<(anonymous namespace)::ConnectionReaderActor>::a_body1loopBody1loopBody1cont3(int) at /opt/foundation/foundationdb/fdbrpc/FlowTransport.actor.cpp:935
a_body1loopBody1loopBody1cont6 at /opt/foundation/foundationdb/.objs/fdbrpc/FlowTransport.actor.g.cpp:3101
 (inlined by) a_body1loopBody1loopBody1cont1 at /opt/foundation/foundationdb/.objs/fdbrpc/FlowTransport.actor.g.cpp:2945
(anonymous namespace)::ConnectionReaderActorState<(anonymous namespace)::ConnectionReaderActor>::a_body1loopBody1loopBody1break1(int) at /opt/foundation/foundationdb/.objs/fdbrpc/FlowTransport.actor.g.cpp:3006
(anonymous namespace)::ConnectionReaderActorState<(anonymous namespace)::ConnectionReaderActor>::a_body1loopBody1loopBody1loopBody1(int) at /opt/foundation/foundationdb/.objs/fdbrpc/FlowTransport.actor.g.cpp:2993
a_body1loopBody1loopBody1loopHead1 at /opt/foundation/foundationdb/.objs/fdbrpc/FlowTransport.actor.g.cpp:2958
 (inlined by) a_body1loopBody1loopBody1 at /opt/foundation/foundationdb/.objs/fdbrpc/FlowTransport.actor.g.cpp:2735
a_body1loopBody1loopHead1 at /opt/foundation/foundationdb/.objs/fdbrpc/FlowTransport.actor.g.cpp:2696
 (inlined by) a_body1loopBody1 at /opt/foundation/foundationdb/.objs/fdbrpc/FlowTransport.actor.g.cpp:2673
 (inlined by) a_body1loopHead1 at /opt/foundation/foundationdb/.objs/fdbrpc/FlowTransport.actor.g.cpp:2664
 (inlined by) a_body1loopBody1cont3 at /opt/foundation/foundationdb/.objs/fdbrpc/FlowTransport.actor.g.cpp:3255
 (inlined by) a_body1loopBody1cont2when1 at /opt/foundation/foundationdb/.objs/fdbrpc/FlowTransport.actor.g.cpp:3261
 (inlined by) a_callback_fire at /opt/foundation/foundationdb/.objs/fdbrpc/FlowTransport.actor.g.cpp:3275
 (inlined by) fire at /opt/foundation/foundationdb/./flow/flow.h:998
void SAV<Void>::send<Void>(Void&&) at /opt/foundation/foundationdb/./flow/flow.h:446
Promise<Void>::~Promise() at /opt/foundation/foundationdb/./flow/flow.h:790
 (inlined by) N2::PromiseTask::~PromiseTask() at /opt/foundation/foundationdb/flow/Net2.actor.cpp:475
 (inlined by) N2::PromiseTask::operator()() at /opt/foundation/foundationdb/flow/Net2.actor.cpp:482
 (inlined by) N2::Net2::run() at /opt/foundation/foundationdb/flow/Net2.actor.cpp:657
main at /opt/foundation/foundationdb/fdbserver/fdbserver.actor.cpp:1802
?? ??:0

About this issue

State: closed
Created 5 years ago
Comments: 22 (14 by maintainers)

Most upvoted comments

With the same repro described above, I ran overnight with #2976 applied and did not see any segfaults or sev40s.

atn34 on Apr 21, 2020

Last night I ran a local fdbserver cluster on master (with the change to enable core dumps) with 20 processes, configured with 20 desired proxies. I modified mako so that it does not stop the network before exiting, and ran the following overnight:

LD_LIBRARY_PATH=lib watch -n.1 ~/build/foundationdb/bin/mako --cluster /etc/foundationdb/fdb.cluster --mode run -s 1 -x g1

This caused a segfault, and I captured a core dump. The core dump confirms that the PingReceiver received a flatbuffers message with an invalid relative offset, which caused the reader to dereference unaddressable memory. This is consistent with the client using an invalid VTableSet.

Separately, in the client we saw data races and heap-use-after-frees under tsan, where a global VTableSet object is destroyed by the main thread as it exits while the network thread is still running. #2976 fixes that issue.
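To make the failure mode concrete, here is a minimal sketch of how a reader that follows a relative offset can validate it before dereferencing. This is not FoundationDB's flat_buffers.h code: `readU32` and `followOffset` are hypothetical names, the buffer is a plain `std::vector<uint8_t>`, and the offset is assumed to be an unsigned 32-bit value in host byte order (C++17 for `std::optional`).

```cpp
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <optional>
#include <vector>

// Hypothetical helper: read a 32-bit value at `pos`, or return nullopt if the
// four bytes would not fit inside the buffer. Assumes host byte order.
std::optional<uint32_t> readU32(const std::vector<uint8_t>& buf, size_t pos) {
    if (pos > buf.size() || buf.size() - pos < sizeof(uint32_t)) return std::nullopt;
    uint32_t v;
    std::memcpy(&v, buf.data() + pos, sizeof(v));
    return v;
}

// Hypothetical helper: follow the relative offset stored at `pos`.
// The target is `pos + offset`. An unchecked reader would compute
// buf.data() + pos + offset and dereference it, which is how a corrupt
// offset turns into a read of unaddressable memory.
std::optional<size_t> followOffset(const std::vector<uint8_t>& buf, size_t pos) {
    auto off = readU32(buf, pos);
    if (!off) return std::nullopt;
    // readU32 succeeded, so buf.size() - pos cannot underflow here.
    if (*off >= buf.size() - pos) return std::nullopt; // target is at or past the end
    return pos + *off;
}
```

A check like this only turns a crash into a detectable error. It does not address why the sender produced a bad offset in the first place, which in this issue is attributed to the client using an invalid VTableSet.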

Interestingly, the backtrace reported by gdb is quite different from the one reported by crashHandler.

gdb:

#0  0x00000000019dcf2d in detail::LoadSaveHelper<detail::FakeRoot<ReplyPromise<Void> >, LoadContext<ArenaObjectReader> >::SerializeFun::operator()<ReplyPromise<Void> > (this=<synthetic pointer>) at /home/anoyes/workspace/foundationdb/flow/flat_buffers.h:1162
#1  serializer<detail::LoadSaveHelper<detail::FakeRoot<ReplyPromise<Void> >, LoadContext<ArenaObjectReader> >::SerializeFun, ReplyPromise<Void> > (visitor=<synthetic pointer>...) at /home/anoyes/workspace/foundationdb/flow/ObjectSerializerTraits.h:43
#2  detail::FakeRoot<ReplyPromise<Void> >::serialize_impl<detail::LoadSaveHelper<detail::FakeRoot<ReplyPromise<Void> >, LoadContext<ArenaObjectReader> >::SerializeFun, 0ul> (archive=<synthetic pointer>..., this=<synthetic pointer>)
    at /home/anoyes/workspace/foundationdb/flow/flat_buffers.h:1113
#3  detail::FakeRoot<ReplyPromise<Void> >::serialize<detail::LoadSaveHelper<detail::FakeRoot<ReplyPromise<Void> >, LoadContext<ArenaObjectReader> >::SerializeFun> (archive=<synthetic pointer>..., this=<synthetic pointer>)
    at /home/anoyes/workspace/foundationdb/flow/flat_buffers.h:1107
#4  detail::LoadSaveHelper<detail::FakeRoot<ReplyPromise<Void> >, LoadContext<ArenaObjectReader> >::load<detail::FakeRoot<ReplyPromise<Void> > > (current=<optimized out>, member=<synthetic pointer>..., this=<synthetic pointer>)
    at /home/anoyes/workspace/foundationdb/flow/flat_buffers.h:957
#5  detail::load_helper<detail::FakeRoot<ReplyPromise<Void> >, LoadContext<ArenaObjectReader> > (context=<synthetic pointer>..., current=<optimized out>, member=<synthetic pointer>...) at /home/anoyes/workspace/foundationdb/flow/flat_buffers.h:1086
#6  detail::load<detail::FakeRoot<ReplyPromise<Void> >, LoadContext<ArenaObjectReader> > (context=<synthetic pointer>..., in=<optimized out>, root=<synthetic pointer>...) at /home/anoyes/workspace/foundationdb/flow/flat_buffers.h:1137
#7  load_members<LoadContext<ArenaObjectReader>, ReplyPromise<Void> > (context=<synthetic pointer>..., in=<optimized out>) at /home/anoyes/workspace/foundationdb/flow/flat_buffers.h:1156
#8  _ObjectReader<ArenaObjectReader>::deserialize<ReplyPromise<Void> > (file_identifier=<optimized out>, this=<optimized out>) at /home/anoyes/workspace/foundationdb/flow/ObjectSerializer.h:85
#9  _ObjectReader<ArenaObjectReader>::deserialize<ReplyPromise<Void> > (item=<synthetic pointer>..., this=<optimized out>) at /home/anoyes/workspace/foundationdb/flow/ObjectSerializer.h:90
#10 PingReceiver::receive (this=<optimized out>, reader=...) at /home/anoyes/workspace/foundationdb/fdbrpc/FlowTransport.actor.cpp:153
#11 0x00000000019d5314 in (anonymous namespace)::DeliverActorState<(anonymous namespace)::DeliverActor>::a_body1cont1 (this=0x7f4c42ae8b20, loopDepth=<optimized out>) at /home/anoyes/workspace/foundationdb/flow/Platform.h:437
#12 0x00000000019d57b0 in (anonymous namespace)::DeliverActorState<(anonymous namespace)::DeliverActor>::a_body1 (loopDepth=0, this=0x7f4c42ae8b20) at /home/anoyes/workspace/foundationdb/flow/IRandom.h:53
#13 (anonymous namespace)::DeliverActor::DeliverActor (inReadSocket=@0x7ffcbe31b00c: true, reader=..., destination=..., self=@0x7ffcbe31aff8: 0x2bb51c0, this=0x7f4c42ae8b00) at fdbrpc/FlowTransport.actor.g.cpp:2558
#14 deliver (self=@0x7ffcbe31aff8: 0x2bb51c0, destination=..., reader=..., inReadSocket=@0x7ffcbe31b00c: true) at /home/anoyes/workspace/foundationdb/fdbrpc/FlowTransport.actor.cpp:645
#15 0x00000000019d5c19 in scanPackets (transport=<optimized out>, unprocessed_begin=@0x7f4c43a2b0c8: 0x7f4c42919e74 "@", e=0x7f4c42919ebc "\245\061\006\t\v", arena=..., peerAddress=..., peerProtocolVersion=...) at /home/anoyes/workspace/foundationdb/fdbrpc/FlowTransport.h:43
#16 0x00000000019d63f8 in (anonymous namespace)::ConnectionReaderActorState<(anonymous namespace)::ConnectionReaderActor>::a_body1loopBody1loopBody1cont4 (this=0x7f4c43a2b0a0, loopDepth=2) at /home/anoyes/workspace/foundationdb/flow/Error.h:46
#17 0x00000000019d105b in (anonymous namespace)::ConnectionReaderActorState<(anonymous namespace)::ConnectionReaderActor>::a_body1loopBody1loopBody1cont7 (loopDepth=2, this=0x7f4c43a2b0a0) at /opt/rh/devtoolset-8/root/usr/include/c++/8/new:169
#18 (anonymous namespace)::ConnectionReaderActorState<(anonymous namespace)::ConnectionReaderActor>::a_body1loopBody1loopBody1cont1 (this=0x7f4c43a2b0a0, loopDepth=2) at fdbrpc/FlowTransport.actor.g.cpp:3089
#19 0x00000000019d1f68 in (anonymous namespace)::ConnectionReaderActorState<(anonymous namespace)::ConnectionReaderActor>::a_body1loopBody1loopBody1break1 (loopDepth=2, this=0x7f4c43a2b0a0) at /home/anoyes/workspace/foundationdb/flow/flow.h:1002
#20 (anonymous namespace)::ConnectionReaderActorState<(anonymous namespace)::ConnectionReaderActor>::a_body1loopBody1loopBody1loopBody1 (loopDepth=3, this=0x7f4c43a2b0a0) at fdbrpc/FlowTransport.actor.g.cpp:3122
#21 (anonymous namespace)::ConnectionReaderActorState<(anonymous namespace)::ConnectionReaderActor>::a_body1loopBody1loopBody1loopHead1 (loopDepth=3, this=0x7f4c43a2b0a0) at fdbrpc/FlowTransport.actor.g.cpp:3102
#22 (anonymous namespace)::ConnectionReaderActorState<(anonymous namespace)::ConnectionReaderActor>::a_body1loopBody1loopBody1 (loopDepth=2, this=0x7f4c43a2b0a0) at fdbrpc/FlowTransport.actor.g.cpp:2879
#23 (anonymous namespace)::ConnectionReaderActorState<(anonymous namespace)::ConnectionReaderActor>::a_body1loopBody1loopHead1 (loopDepth=2, this=<optimized out>) at fdbrpc/FlowTransport.actor.g.cpp:2834
#24 (anonymous namespace)::ConnectionReaderActorState<(anonymous namespace)::ConnectionReaderActor>::a_body1loopBody1 (loopDepth=<optimized out>, this=<optimized out>) at fdbrpc/FlowTransport.actor.g.cpp:2811
#25 (anonymous namespace)::ConnectionReaderActorState<(anonymous namespace)::ConnectionReaderActor>::a_body1loopHead1 (loopDepth=<optimized out>, this=<optimized out>) at fdbrpc/FlowTransport.actor.g.cpp:2802
#26 (anonymous namespace)::ConnectionReaderActorState<(anonymous namespace)::ConnectionReaderActor>::a_body1loopBody1cont3 (_=..., loopDepth=<optimized out>, this=<optimized out>) at fdbrpc/FlowTransport.actor.g.cpp:3415
#27 (anonymous namespace)::ConnectionReaderActorState<(anonymous namespace)::ConnectionReaderActor>::a_body1loopBody1cont2when1 (loopDepth=<optimized out>, _=..., this=<optimized out>) at fdbrpc/FlowTransport.actor.g.cpp:3421
#28 (anonymous namespace)::ConnectionReaderActorState<(anonymous namespace)::ConnectionReaderActor>::a_body1loopBody1cont2when1 (_=..., loopDepth=<optimized out>, this=<optimized out>) at fdbrpc/FlowTransport.actor.g.cpp:3419
#29 (anonymous namespace)::ConnectionReaderActorState<(anonymous namespace)::ConnectionReaderActor>::a_callback_fire (value=..., this=0x7f4c43a2b0a0) at fdbrpc/FlowTransport.actor.g.cpp:3436
#30 ActorCallback<(anonymous namespace)::ConnectionReaderActor, 4, Void>::fire (this=0x7f4c43a2b088, value=...) at /home/anoyes/workspace/foundationdb/flow/flow.h:1003
#31 0x0000000001ad9640 in SAV<Void>::send<Void> (value=..., this=0x7f4c42688ac0) at /home/anoyes/workspace/foundationdb/flow/Error.h:63
#32 Promise<Void>::send<Void> (this=0x7f4c42e03308, value=...) at /home/anoyes/workspace/foundationdb/flow/flow.h:780
#33 N2::PromiseTask::operator() (this=0x7f4c42e03300) at /home/anoyes/workspace/foundationdb/flow/Net2.actor.cpp:2375
#34 N2::Net2::run (this=0x2b8cb20) at /home/anoyes/workspace/foundationdb/flow/Net2.actor.cpp:1129
#35 0x00000000006a5366 in main (argc=<optimized out>, argv=<optimized out>) at /home/anoyes/workspace/foundationdb/fdbserver/fdbserver.actor.cpp:1889

crashHandler:

$ addr2line -e bin/fdbserver -p -C -f -i 0x1afc975 0x7f4c457c45f0 0x19d5314 0x19d57b0 0x19d5c19 0x19d63f8 0x19d105b 0x19d1f68 0x1ad9640 0x6a5366 0x7f4c44f03505
crashHandler(int) at /home/anoyes/workspace/foundationdb/flow/Platform.cpp:2829
?? ??:0
Optional<NetworkAddress>::Optional() at /home/anoyes/workspace/foundationdb/flow/Arena.h:214
 (inlined by) a_body1cont1 at /home/anoyes/workspace/foundationdb/fdbrpc/FlowTransport.actor.cpp:661
a_body1 at /home/anoyes/build/foundationdb/fdbrpc/FlowTransport.actor.g.cpp:2342
 (inlined by) DeliverActor at /home/anoyes/build/foundationdb/fdbrpc/FlowTransport.actor.g.cpp:2558
 (inlined by) deliver at /home/anoyes/workspace/foundationdb/fdbrpc/FlowTransport.actor.cpp:645
scanPackets(TransportData*, unsigned char*&, unsigned char const*, Arena&, NetworkAddress const&, ProtocolVersion) [clone .isra.933] at /home/anoyes/workspace/foundationdb/fdbrpc/FlowTransport.actor.cpp:772
(anonymous namespace)::ConnectionReaderActorState<(anonymous namespace)::ConnectionReaderActor>::a_body1loopBody1loopBody1cont4(int) at /home/anoyes/workspace/foundationdb/fdbrpc/FlowTransport.actor.cpp:941
a_body1loopBody1loopBody1cont7 at /home/anoyes/build/foundationdb/fdbrpc/FlowTransport.actor.g.cpp:3249
 (inlined by) a_body1loopBody1loopBody1cont1 at /home/anoyes/build/foundationdb/fdbrpc/FlowTransport.actor.g.cpp:3089
a_body1loopBody1loopBody1break1 at /home/anoyes/build/foundationdb/fdbrpc/FlowTransport.actor.g.cpp:3141
 (inlined by) a_body1loopBody1loopBody1loopBody1 at /home/anoyes/build/foundationdb/fdbrpc/FlowTransport.actor.g.cpp:3122
 (inlined by) a_body1loopBody1loopBody1loopHead1 at /home/anoyes/build/foundationdb/fdbrpc/FlowTransport.actor.g.cpp:3102
 (inlined by) a_body1loopBody1loopBody1 at /home/anoyes/build/foundationdb/fdbrpc/FlowTransport.actor.g.cpp:2879
 (inlined by) a_body1loopBody1loopHead1 at /home/anoyes/build/foundationdb/fdbrpc/FlowTransport.actor.g.cpp:2834
 (inlined by) a_body1loopBody1 at /home/anoyes/build/foundationdb/fdbrpc/FlowTransport.actor.g.cpp:2811
 (inlined by) a_body1loopHead1 at /home/anoyes/build/foundationdb/fdbrpc/FlowTransport.actor.g.cpp:2802
 (inlined by) a_body1loopBody1cont3 at /home/anoyes/build/foundationdb/fdbrpc/FlowTransport.actor.g.cpp:3415
 (inlined by) a_body1loopBody1cont2when1 at /home/anoyes/build/foundationdb/fdbrpc/FlowTransport.actor.g.cpp:3421
 (inlined by) a_body1loopBody1cont2when1 at /home/anoyes/build/foundationdb/fdbrpc/FlowTransport.actor.g.cpp:3419
 (inlined by) a_callback_fire at /home/anoyes/build/foundationdb/fdbrpc/FlowTransport.actor.g.cpp:3436
 (inlined by) fire at /home/anoyes/workspace/foundationdb/flow/flow.h:1003
void SAV<Void>::send<Void>(Void&&) at /home/anoyes/workspace/foundationdb/flow/flow.h:447
 (inlined by) void Promise<Void>::send<Void>(Void&&) const at /home/anoyes/workspace/foundationdb/flow/flow.h:780
 (inlined by) N2::PromiseTask::operator()() at /home/anoyes/workspace/foundationdb/flow/Net2.actor.cpp:2375
 (inlined by) N2::Net2::run() at /home/anoyes/workspace/foundationdb/flow/Net2.actor.cpp:1129
main at /home/anoyes/workspace/foundationdb/fdbserver/fdbserver.actor.cpp:1882 (discriminator 4)
?? ??:0

The line numbers are off for the gdb backtrace, and the crashHandler one is just bogus.

The bugs fixed in #2976 explain both the segfaults and asserts described in this issue.
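For readers unfamiliar with this class of bug, the sketch below illustrates the general hazard (a global destroyed at process exit while another thread still uses it) and two common ways to avoid it. It is an illustration only and not the change made in #2976: `Table`, `table()`, and `Worker` are hypothetical names, with `Table` standing in for a process-wide object such as the VTableSet.

```cpp
#include <atomic>
#include <map>
#include <string>
#include <thread>

// Hypothetical stand-in for a process-wide lookup table.
struct Table {
    std::map<std::string, int> entries{{"ping", 1}};
    int lookup(const std::string& k) const {
        auto it = entries.find(k);
        return it == entries.end() ? -1 : it->second;
    }
};

// Hazard: a plain global such as `Table g_table;` is destroyed during static
// destruction when main returns, while a still-running thread may be inside
// lookup(). That is a data race and a potential use-after-free.

// Mitigation A: allocate once and never destroy, so no thread can observe a
// destroyed object at exit. The allocation is intentionally leaked.
const Table& table() {
    static const Table* t = new Table();
    return *t;
}

// Mitigation B: stop and join the worker thread before main returns, so
// nothing is running when static destructors execute.
class Worker {
public:
    Worker() : thread_([this] { run(); }) {}
    ~Worker() { stop(); }
    void stop() {
        running_ = false;
        if (thread_.joinable()) thread_.join();
    }
    int iterations() const { return iterations_.load(); }

private:
    void run() {
        while (running_) {
            (void)table().lookup("ping");
            ++iterations_;
            std::this_thread::yield();
        }
    }
    // Declared before thread_ so they are initialized before the thread starts.
    std::atomic<bool> running_{true};
    std::atomic<int> iterations_{0};
    std::thread thread_;
};
```

In the repro above, mako was modified so that it does not stop the network before exiting, which corresponds to skipping the `stop()` step in this sketch.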

atn34 on Apr 20, 2020