depthai-core: start+stop is thread unsafe -- leads to crashes (e.g. bootloader_version test failing 50% of time)

Using the ctest instructions in this repo, or manually running /build/examples/bootloader_version.exe…leads to a hard crash maybe 50% of the time. If I run that EXE in vscode to debug it, I can see XLinkStream::XLinkStream(const XLinkConnection& conn, const std::string& name, std::size_t maxWriteSize) : streamName(name) constructor throws and nothing is catching so it hard crashes.

Setup

  • Microsoft Windows [Version 10.0.19043.1348]
  • VS2019 Community v16.11.5
  • depthai-core c1b697afa432c8d620b097ba57e20a0ffd2ec52d built static amd64 +opencv

Repro

  1. config, build
  2. cd build then ctest

Result

[ctest]  6/61 Test  #6: bootloader_config_test ..............   Passed    0.89 sec
[ctest]       Start  7: device_usbspeed_test
[ctest]  7/61 Test  #7: device_usbspeed_test ................   Passed   15.21 sec
[ctest]       Start  8: unlimited_io_connection_test
[ctest]  8/61 Test  #8: unlimited_io_connection_test ........   Passed   15.63 sec
[ctest]       Start  9: bootloader_version
[ctest]  9/61 Test  #9: bootloader_version ..................***Failed    9.17 sec
[ctest] statusfound -P
[ctest] -- arguments: C:/njs/depthai-core/build/examples/bootloader_version.exe;
[ctest] Found device with name: 14442C10C1A1C6D200-ma2480
[ctest] Version: 0.0.15
[ctest] -- After process executed,  produced the following exit code: -2147483645
[ctest] CMake Error at C:/njs/depthai-core/examples/cmake/ExecuteTestTimeout.cmake:42 (message):
[ctest]    produced an error (-2147483645) while running
[ctest] 
[ctest] 
[ctest] 
[ctest]       Start 10: rgb_camera_control
[ctest] 10/61 Test #10: rgb_camera_control ..................***Failed    2.95 sec
[ctest] statusfound -P
[ctest] -- arguments: C:/njs/depthai-core/build/examples/rgb_camera_control.exe;
[ctest] -- After process executed,  produced the following exit code: Exit code 0xc0000135
[ctest] 
[ctest] CMake Error at C:/njs/depthai-core/examples/cmake/ExecuteTestTimeout.cmake:42 (message):
[ctest]    produced an error (Exit code 0xc0000135...

Expected

No errors or crashes. All tests to pass.

Notes

verbose ctest

C:\njs\depthai-core\build>ctest -V -R bootloader_version
UpdateCTestConfiguration  from :C:/njs/depthai-core/build/DartConfiguration.tcl
Parse Config file:C:/njs/depthai-core/build/DartConfiguration.tcl
UpdateCTestConfiguration  from :C:/njs/depthai-core/build/DartConfiguration.tcl
Parse Config file:C:/njs/depthai-core/build/DartConfiguration.tcl
Test project C:/njs/depthai-core/build
Constructing a list of tests
Done constructing a list of tests
Updating test list for fixtures
Added 0 tests to meet fixture requirements
Checking test dependency graph...
Checking test dependency graph end
test 9
    Start 9: bootloader_version

9: Test command: "C:\Program Files\CMake\bin\cmake.exe" "-DTIMEOUT_SECONDS=10" "-P" "C:/njs/depthai-core/examples/cmake/ExecuteTestTimeout.cmake" "C:/njs/depthai-core/build/examples/bootloader_version.exe"
9: Environment variables:
9:  UBSAN_OPTIONS=halt_on_error=1
9: Test timeout computed to be: 1500
9: statusfound -P
9: -- arguments: C:/njs/depthai-core/build/examples/bootloader_version.exe;
9: Found device with name: 14442C10C1A1C6D200-ma2480
9: Version: 0.0.15
9: -- After process executed,  produced the following exit code: -2147483645
9: CMake Error at C:/njs/depthai-core/examples/cmake/ExecuteTestTimeout.cmake:42 (message):
9:    produced an error (-2147483645) while running
9:
9:
1/1 Test #9: bootloader_version ...............***Failed    4.50 sec

0% tests passed, 1 tests failed out of 1

Total Test time (real) =   4.51 sec

The following tests FAILED:
          9 - bootloader_version (Failed)
Errors while running CTest
Output from these tests are in: C:/njs/depthai-core/build/Testing/Temporary/LastTest.log
Use "--rerun-failed --output-on-failure" to re-run the failed cases verbosely.

Debugger tracing

Run and debug /build/examples/bootloader_version.exe directly in vscode so I can debug the crash

Found device with name: 14442C10C1A1C6D200-ma2480
Version: 0.0.15

Exception thrown at 0x00007FF9D9384F69 in bootloader_version.exe: Microsoft C++ exception: std::runtime_error at memory location 0x0000002BDE0FF878.
Debug Error!

Program: C:\njs\depthai-core\build\examples\bootloader_version.exe
abort() has been called
(Press Retry to debug the application)

the crash is in src\xlink\XLinkStream.cpp line 33

https://github.com/luxonis/depthai-core/blob/c1b697afa432c8d620b097ba57e20a0ffd2ec52d/src/xlink/XLinkStream.cpp#L32-L34

streamID does equal INVALID_STREAM_ID. It is 0xDEADDEAD. and this->streamname = __watchdog

image

About this issue

  • Original URL
  • State: closed
  • Created 3 years ago
  • Comments: 28 (22 by maintainers)

Commits related to this issue

Most upvoted comments

I finished the thread+ownership fixes I’ve planned for this wave. Tests above that often failed in fast loops – passed. Full 61 test suite (6 iterations) in a loop – passed

Overnight I’ll run rgb_depth_aligned.exe to see if there are any noticable memory leaks.

If all is good after the run, I’ll first make a PR for the XLink changes. I want to get some more eyes on it for style+approach. The needed depthai-changes are dependent on these XLink changes. I’ll make a PR for those after we have consensus on XLink.

good news 🎉 I’ve got a 100% test suite passing tests 1->50. This is a dramatic improvement!

test 51 rgb_depth_aligned.exe fails with [StereoDepth(3)] [error] RGB camera calibration missing, aligning to RGB won't work which likely a completely separate issue. I vaguely remember something about early units not having cali in firmware…but I didn’t think I was in that batch. https://docs.luxonis.com/en/latest/pages/calibration/

I’ll remove all my debugging logging and push the changes I made. It’s a much better approach than my earlier hacks. Can still be improved but its a great start.

The majority of these thread/ownership code issues are in depthai (not xlink). These are not xlink bugs.

These bugs point to core design/architectural issues in depthai. Depthai classes need to have new members to hold ownership of things on which they operate, a cascade of ownerships related to them (definitely connection, maybe device, etc) to populate those members, destructors that will cleanup, etc.

Your team should have conversations and whiteboard on how you want to manage ownership using the examples and repro cases and code I’ve pointed to above. This is your code, you have the corpus of knowledge. I don’t have the corpus, don’t have architectural ownership, and don’t have the role to edict how you architect. This is work your team needs to do.

My hacks are private as I’m not overhauling your classes…just quick grabbing ownership so I can start actually coding my app. Its messy, still not entirely thread safe, and definitely not for merging for wide usage.

The xlink issue is separate from all the thread/ownership issues and the xlink “bug” is separate at https://github.com/luxonis/XLink/issues/18. Those xlink changes are isolated and trivial. I can make a PR for those xlink issues.

@themarpe, yes. Will do before end of year. I’ll rebase it on to current develop branch.

More analysis. All above issues in depthai and XLink continue. Focusing on the XLink code…

XLinkDispatcherImpl.c has problems in dispatcherCloseLink() around https://github.com/luxonis/XLink/blob/1eaa307a3f0b5c390ee10c23daaf5fa45aa974e5/src/shared/XLinkDispatcherImpl.c#L464

  • It is missing semaphore protection. It needs to XLink_sem_wait() and XLink_sem_post() before operating on the stream.
  • The thread calling this function is wrongly assuming that it is the only thread that has ever gotten a packet. Errant. Imagine the app thread gets 2 packets that are stored in circular buffer slots 0 and 1. These two packets are at the “front” and marked as blocked (aka gotten/checked-out by the app). Then… this while loop runs. The loop getPacketFromStream() a packet stored in slot 2. And then immediately calls releasePacketFromStream(). Notice it has no pointers or slots identifying which packet to release. 😬 releasePacketFromStream() blindly releases/destroys the packet at the “front”. So it destroys the packet in slot 0 (instead of the packet it just get from slot 2. Crash💥 That slot 0 packet is being used by the app.
  • XLinkStreamReset() blindly memset(stream, 0) without checking for packets still being used. If the above is fixed, I think this will be safe.

I’m attempting to fix the XLink issues without introducing C++ into XLink. 🙄 Yet, it is so much easier with C++ RAII and unique/shared_ptr…but does change the interface and cascade some work. If you are working on this in parallel, let’s share ideas. If you are not yet, then I’ll continue to push forward as I need to get this fixed so my app meets my reliability standards.

I would rather code my app, but I need to stablize depthai-core enough for my use cases.

Sorry for the disappointment @diablodale . And thanks for the help.

…but since depthai is multithreaded there is no safe “duration of the call”. One must always get ownership using the intention of shared_ptr…one must pass it by value so that shared_ptr will do the refcount++ and therefore the connection will live during the duration of the call.

It does, the caller doesn’t decrement its own refcount for the duration till it returns from that function, so as long as the caller has a valid ref, the called function will have it as well.

From what I can see in your CI https://github.com/luxonis/depthai-core/runs/3735947677?check_suite_focus=true you only run two tests. You never run your full test suite. I suspect you are missing lots of crashes. I know some of your full suite needs a gui…and the test timeout dance. But tests 1-9 need no UI you should be able to CI them easily. Overall, I encourage you to run (at minimum) tests 1-11 (rgb_preview) in your CI.

We’ve yet to setup local runners with HW to test against and be tied to CI. We do however test these locally currently (unscalable, working on creating runners - security is a bit of an issue with GH self-hosted runners)

I can’t crash that EXE nor can I expect a user to powercycle OAK on every config change because the user rapidly makes changes…probably hundreds of times in a day.

I understand - I agree this not being a solution, was just a question to see what could be under the crash, as some of this behavior we’ve seen before. Makes it easier to start exploring the root cause. Still, how does running that example in a said manner manifest itself? Same behavior as with running in quick succession?

The reason I’m asking is that there are some XLink issues which are the root of how to resolve this “completely correctly”, and it gives a quick glance on the severity of jumping to them.

And thanks for the snippets and the XLink PR - will take into consideration when circling back to core.

When you are ready, I can push my hacks somewhere so you can see how ownership is missing and how I’ve got temporary hacks in place for some of the most problematic. There is definitely a better approach by shuffling classes/members as my changes are temporary hacks, repetitive, and still have some race conditions.

I agree, we will look into these once we finish current efforts.

Thanks for digging into this issue. XLink is a library provided by Intel for communication between device and host, doesn’t have the best error handling (using exit as you pointed out), we didn’t get into rewriting it, or isolating issues at deinitialization. Looking forward to the pull request.

Also found use of XLinkConnection without ownership two times in DataOutputQueue and DataInputQueue In general, a dev should never pass a shared_ptr as const& because you always (if it is a shared_ptr) have the refcount incremented. So a fix in these two areas is to have a private member of the classes a std::shared_ptr<XLinkConnection> and then pass it in const and in initializer std::move it. E.g.

DataInputQueue::DataInputQueue(const std::shared_ptr<XLinkConnection> conn, const std::string& streamName, unsigned int maxSize, bool blocking)
    : connection(std::move(conn)), queue(maxSize, blocking), name(streamName) {
    // open stream with default XLINK_USB_BUFFER_MAX_SIZE write size
    XLinkStream stream(*connection, name, device::XLINK_USB_BUFFER_MAX_SIZE);

Found the XLink issue. I’ve logged it separately in your xlink repo https://github.com/luxonis/XLink/issues/18