foundationdb: Unexplained crash in 6.2.11
I recently encountered a crash in a 6.2.11 fdbserver process, and it’s not immediately obvious to me what went wrong. I’m capturing the stack trace here for further investigation:
crashHandler(int) at /opt/foundation/foundationdb/flow/Platform.cpp:2765
?? ??:0
Reference<ArenaBlock>::~Reference() at /opt/foundation/foundationdb/./flow/FastRef.h:114
(inlined by) Arena::~Arena() at /opt/foundation/foundationdb/./flow/Arena.h:92
(inlined by) ArenaObjectReader::~ArenaObjectReader() at /opt/foundation/foundationdb/./flow/ObjectSerializer.h:117
(inlined by) a_body1cont1 at /opt/foundation/foundationdb/fdbrpc/FlowTransport.actor.cpp:651
a_body1 at /opt/foundation/foundationdb/.objs/fdbrpc/FlowTransport.actor.g.cpp:2200
(inlined by) DeliverActor at /opt/foundation/foundationdb/.objs/fdbrpc/FlowTransport.actor.g.cpp:2423
(inlined by) deliver at /opt/foundation/foundationdb/fdbrpc/FlowTransport.actor.cpp:636
Endpoint::~Endpoint() at /opt/foundation/foundationdb/./fdbrpc/FlowTransport.h:32
(inlined by) scanPackets at /opt/foundation/foundationdb/fdbrpc/FlowTransport.actor.cpp:765
(anonymous namespace)::ConnectionReaderActorState<(anonymous namespace)::ConnectionReaderActor>::a_body1loopBody1loopBody1cont3(int) at /opt/foundation/foundationdb/fdbrpc/FlowTransport.actor.cpp:935
a_body1loopBody1loopBody1cont6 at /opt/foundation/foundationdb/.objs/fdbrpc/FlowTransport.actor.g.cpp:3101
(inlined by) a_body1loopBody1loopBody1cont1 at /opt/foundation/foundationdb/.objs/fdbrpc/FlowTransport.actor.g.cpp:2945
(anonymous namespace)::ConnectionReaderActorState<(anonymous namespace)::ConnectionReaderActor>::a_body1loopBody1loopBody1break1(int) at /opt/foundation/foundationdb/.objs/fdbrpc/FlowTransport.actor.g.cpp:3006
(anonymous namespace)::ConnectionReaderActorState<(anonymous namespace)::ConnectionReaderActor>::a_body1loopBody1loopBody1loopBody1(int) at /opt/foundation/foundationdb/.objs/fdbrpc/FlowTransport.actor.g.cpp:2993
a_body1loopBody1loopBody1loopHead1 at /opt/foundation/foundationdb/.objs/fdbrpc/FlowTransport.actor.g.cpp:2958
(inlined by) a_body1loopBody1loopBody1 at /opt/foundation/foundationdb/.objs/fdbrpc/FlowTransport.actor.g.cpp:2735
a_body1loopBody1loopHead1 at /opt/foundation/foundationdb/.objs/fdbrpc/FlowTransport.actor.g.cpp:2696
(inlined by) a_body1loopBody1 at /opt/foundation/foundationdb/.objs/fdbrpc/FlowTransport.actor.g.cpp:2673
(inlined by) a_body1loopHead1 at /opt/foundation/foundationdb/.objs/fdbrpc/FlowTransport.actor.g.cpp:2664
(inlined by) a_body1loopBody1cont3 at /opt/foundation/foundationdb/.objs/fdbrpc/FlowTransport.actor.g.cpp:3255
(inlined by) a_body1loopBody1cont2when1 at /opt/foundation/foundationdb/.objs/fdbrpc/FlowTransport.actor.g.cpp:3261
(inlined by) a_callback_fire at /opt/foundation/foundationdb/.objs/fdbrpc/FlowTransport.actor.g.cpp:3275
(inlined by) fire at /opt/foundation/foundationdb/./flow/flow.h:998
void SAV<Void>::send<Void>(Void&&) at /opt/foundation/foundationdb/./flow/flow.h:446
Promise<Void>::~Promise() at /opt/foundation/foundationdb/./flow/flow.h:790
(inlined by) N2::PromiseTask::~PromiseTask() at /opt/foundation/foundationdb/flow/Net2.actor.cpp:475
(inlined by) N2::PromiseTask::operator()() at /opt/foundation/foundationdb/flow/Net2.actor.cpp:482
(inlined by) N2::Net2::run() at /opt/foundation/foundationdb/flow/Net2.actor.cpp:657
main at /opt/foundation/foundationdb/fdbserver/fdbserver.actor.cpp:1802
?? ??:0
About this issue
- State: closed
- Created 5 years ago
- Comments: 22 (14 by maintainers)
With the same repro described above, I ran overnight with #2976 applied and did not see any segfaults or sev40s.
Last night I ran a local fdbserver cluster on master (with the change to enable core dumps) with 20 processes, configured with 20 desired proxies. I modified mako so that it does not stop the network before exiting, and ran the following overnight:
LD_LIBRARY_PATH=lib watch -n.1 ~/build/foundationdb/bin/mako --cluster /etc/foundationdb/fdb.cluster --mode run -s 1 -x g1
This caused a segfault, and I captured a core dump. The core dump confirms that the PingReceiver received a flatbuffers message with an invalid relative offset, which caused the reader to dereference unaddressable memory. This is consistent with the client using an invalid VTableSet. Separately, in the client we saw data races and heap-use-after-free errors under TSan, where a global VTableSet object is destroyed by the main thread as it exits while the network thread is still running. #2976 fixes that issue.
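To illustrate why an invalid relative offset leads to a read of unaddressable memory, here is a minimal sketch of following a relative offset inside a buffer with and without trusting it. This is not FoundationDB's ObjectSerializer or the flatbuffers API; `readU32`, `readViaRelativeOffset`, and `makeMessage` are hypothetical names, and the 8-byte layout is an assumption made for the example.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <optional>
#include <vector>

// Hypothetical helper: read a 32-bit value at `pos`, or nothing if it does not fit.
std::optional<uint32_t> readU32(const std::vector<uint8_t>& buf, size_t pos) {
    if (pos > buf.size() || buf.size() - pos < sizeof(uint32_t)) return std::nullopt;
    uint32_t v;
    std::memcpy(&v, buf.data() + pos, sizeof(v));
    return v;
}

// Follow the unsigned relative offset stored at `pos` and read the value it
// points to. An unchecked reader would compute buf.data() + pos + rel and
// dereference it directly, which is where a corrupt offset faults.
std::optional<uint32_t> readViaRelativeOffset(const std::vector<uint8_t>& buf, size_t pos) {
    auto rel = readU32(buf, pos);
    if (!rel) return std::nullopt;
    // readU32 succeeded, so pos <= buf.size() and the subtraction cannot wrap.
    if (*rel > buf.size() - pos) return std::nullopt;
    return readU32(buf, pos + *rel);
}

// Assumed toy layout: bytes 0..3 hold the relative offset, bytes 4..7 the payload.
std::vector<uint8_t> makeMessage(uint32_t rel, uint32_t payload) {
    std::vector<uint8_t> buf(8);
    std::memcpy(buf.data(), &rel, sizeof(rel));
    std::memcpy(buf.data() + 4, &payload, sizeof(payload));
    return buf;
}

// A well-formed offset resolves; a corrupt one is rejected instead of read.
void demo() {
    assert(readViaRelativeOffset(makeMessage(4, 42), 0) == 42u);
    assert(!readViaRelativeOffset(makeMessage(0x7fffffffu, 42), 0));
}
```

A sender that serializes with a corrupted table can emit offsets like the second case, so the receiver crashes even though its own code is correct.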
Interestingly, the backtrace reported by gdb is quite different from the one reported by crashHandler:
gdb:
crashHandler:
The line numbers are off for the gdb backtrace, and the crashHandler one is just bogus.
The bugs fixed in #2976 explain both the segfaults and asserts described in this issue.
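The shutdown race described in this thread (an object destroyed by the main thread while the network thread is still using it) can be avoided in general by stopping and joining the worker before the shared object goes away. The following is a minimal sketch of that pattern only; it is not the change made in #2976, and `Table`, `Network`, and `runDemo` are hypothetical names standing in for the VTableSet and the client network thread.

```cpp
#include <atomic>
#include <cassert>
#include <thread>

// Hypothetical stand-in for shared state such as a global VTableSet.
struct Table {
    int value = 7;
};

// Hypothetical stand-in for the network thread. It reads `table` until stopped.
class Network {
public:
    explicit Network(const Table& table)
      : worker_([this, &table] {
            while (!stop_.load(std::memory_order_acquire)) {
                if (table.value > 0) reads_.fetch_add(1, std::memory_order_relaxed);
                std::this_thread::yield();
            }
        }) {}

    // Joining in the destructor means the thread can never outlive this object.
    ~Network() { stop(); }

    void stop() {
        stop_.store(true, std::memory_order_release);
        if (worker_.joinable()) worker_.join();
    }

    long reads() const { return reads_.load(std::memory_order_relaxed); }

private:
    std::atomic<bool> stop_{false};
    std::atomic<long> reads_{0};
    std::thread worker_; // declared last so the atomics exist before the thread starts
};

long runDemo() {
    Table table;        // in the issue, the analogous object was a global
    Network net(table); // destroyed before `table`, in reverse declaration order
    while (net.reads() == 0) std::this_thread::yield();
    net.stop();         // the worker is joined before `table` is destroyed
    assert(net.reads() > 0);
    return net.reads();
}
```

Skipping the `stop()` call and letting the process exit with the thread still running reproduces the shape of the bug: static destructors run on the main thread while the worker keeps reading.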