depthai-core: (re)start and watchdog timing is fragile; leads to device not found and startup failures

When the device is stopped and started, the device will sometimes not be found or fail to start.

While testing the fix for issue #278 I’ve isolated a reproducible failure for devices to be found and/or startup. From the discussion in the other issue, it might be related to timing, watchdog, etc. The current approach seems to be fragile.

Setup

  • Microsoft Windows [Version 10.0.19044.1348]
  • MSVS 2019 Community v16.11.7
  • depthai-core v2.13.3 built x64, debug, static lib

Repro

  1. cmake config and build depthai-core for x64, debug, static lib, all examples, all tests
  2. Run ctest with additional params to loop the tests, e.g. --repeat until-fail:3 -VV

Result

[ctest] The following tests FAILED:
[ctest] 	 23 - imu_gyroscope_accelerometer (Failed)
[ctest] 	 53 - system_information (Failed)
[ctest] 	 54 - disparity_encoding (Failed)
[ctest] 	 55 - rgb_encoding (Failed)
[ctest] 	 57 - rgb_mono_encoding (Failed)
[ctest] 	 58 - encoding_max_limit (Failed)
[ctest] Errors while running CTest
[ctest] CTest finished with return code 8

All the failures are similar. Here is full output of ctest fail on loop.zip

And two examples from that log showing similarity. Notice the tests sometimes pass but fail on later loops

...
[ctest] 23: Accelerometer [m/s^2]: x: -0.575 y: 10.036 z: -0.306 
[ctest] 23: Gyroscope timestamp: 5086 ms
[ctest] 23: Gyroscope [rad/s]-- After process executed,  produced the following exit code: Process terminated due to timeout
[ctest]     Test #23: imu_gyroscope_accelerometer .........   Passed   10.74 sec
[ctest]       Start 23: imu_gyroscope_accelerometer
[ctest] 
[ctest] 23: Test command: "C:\Program Files\CMake\bin\cmake.exe" "-DTIMEOUT_SECONDS=10" "-P" "C:/njs/depthai-core/examples/cmake/ExecuteTestTimeout.cmake" "C:/njs/depthai-core/build/examples/imu_gyroscope_accelerometer.exe"
[ctest] 23: Environment variables: 
[ctest] 23:  DEPTHAI_LEVEL=debug
[ctest] 23:  UBSAN_OPTIONS=halt_on_error=1
[ctest] 23: Test timeout computed to be: 1500
[ctest] 23: -- found -P
[ctest] 23: -- arguments: C:/njs/depthai-core/build/examples/imu_gyroscope_accelerometer.exe;
[ctest] 23: [2021-12-01 21:20:31.744] [debug] Library information - version: 2.13.3, commit: ce43051c2ea16088437eb377f18f9ee15214229d from 2021-12-01 20:51:22 +0100, build: 2021-12-01 19:56:05 +0000
[ctest] 23: [2021-12-01 21:20:31.748] [debug] Initialize - finished
[ctest] 23: [2021-12-01 21:20:31.878] [debug] Resources - Archive 'depthai-bootloader-fwp-0.0.15.tar.xz' open: 5ms, archive read: 126ms
[ctest] 23: [2021-12-01 21:20:32.328] [debug] Resources - Archive 'depthai-device-fwp-48fca4a443d841221c94c70d5c05e7e946168636.tar.xz' open: 4ms, archive read: 578ms
[ctest] 23: Stack trace (most recent call last):
[ctest] 23: #9    Object "", at 00007FFAC9657034, in BaseThreadInitThunk
[ctest] 23: #8    Source "d:\a01\_work\6\s\src\vctools\crt\vcstartup\src\startup\exe_main.cpp", line 17, in mainCRTStartup [00007FF63F33D5EE]
[ctest] 23: #7    Source "d:\a01\_work\6\s\src\vctools\crt\vcstartup\src\startup\exe_common.inl", line 331, in __scrt_common_main [00007FF63F33D2BE]
[ctest] 23: #6    Source "d:\a01\_work\6\s\src\vctools\crt\vcstartup\src\startup\exe_common.inl", line 288, in __scrt_common_main_seh [00007FF63F33D3FE]
[ctest] 23: #5    Source "d:\a01\_work\6\s\src\vctools\crt\vcstartup\src\startup\exe_common.inl", line 79, in invoke_main [00007FF63F33D559]
[ctest] 23: #4    Source "C:\njs\depthai-core\examples\IMU\imu_gyroscope_accelerometer.cpp", line 36, in main [00007FF63EB47068]
[ctest] 23:          33:     imu->out.link(xlinkOut->input);
[ctest] 23:          34: 
[ctest] 23:          35:     // Pipeline is defined, now we can connect to the device
[ctest] 23:       >  36:     dai::Device d(pipeline);
[ctest] 23:          37: 
[ctest] 23:          38:     bool firstTs = false;
[ctest] 23: #3    Source "C:\njs\depthai-core\src\device\Device.cpp", line 26, in dai::Device::Device [00007FF63EFEA385]
[ctest] 23:          23: // Common explicit instantiation, to remove the need to define in header
[ctest] 23:          24: constexpr std::size_t Device::EVENT_QUEUE_MAXIMUM_SIZE;
[ctest] 23:          25: 
[ctest] 23:       >  26: Device::Device(const Pipeline& pipeline) : DeviceBase(pipeline.getOpenVINOVersion()) {
[ctest] 23:          27:     tryStartPipeline(pipeline);
[ctest] 23:          28: }
[ctest] 23: #2    Source "C:\njs\depthai-core\src\device\DeviceBase.cpp", line 256, in dai::DeviceBase::DeviceBase [00007FF63F096DAF]
[ctest] 23:         254:     // If no device found, throw
[ctest] 23:         255:     if(!found) throw std::runtime_error("No available devices");
[ctest] 23:       > 256:     init(version, false, "");
[ctest] 23:         257: }
[ctest] 23:         258: 
[ctest] 23:         259: DeviceBase::DeviceBase(OpenVINO::Version version, const char* pathToCmd) {
[ctest] 23: #1    Object "", at 00007FFAC208B460, in CxxThrowException
[ctest] 23: #0    Object "", at 00007FFAC7E54F69, in RaiseException
[ctest] 23: -- After process executed,  produced the following exit code: -529697949
[ctest] 23: CMake Error at C:/njs/depthai-core/examples/cmake/ExecuteTestTimeout.cmake:42 (message):
[ctest] 23:    produced an error (-529697949) while running
[ctest] 23: 
[ctest] 23: 
[ctest] 23/61 Test #23: imu_gyroscope_accelerometer .........***Failed    5.75 sec

...

[ctest] 53: [14442C10C1A1C6D200] [8.159] [system] [info] Memory Usage - DDR: 0.13 / 358.55 MiB, CMX: 2.06 / 2.50 MiB, LeonOS Heap: 7.54 / 78.59 MiB, LeonRT Heap: 2.95 / 23.84 MiB
[ctest] 53: [14442C10C1A1C6D200] [8.159] [system] [info] Temperatures - Average: 38.00 ┬░C, CSS: 39.11 ┬░C, MSS 37.71 ┬░C, UPA: 37.24 ┬░C, DSS: 37.94 ┬░C
[ctest] 53: [14442C10C1A1C6D200] [8.159] [system] [info] Cpu Usage - LeonOS 0.23%, LeonRT: 0.06%
[ctest] 53: -- After process executed,  produced the following exit code: Process terminated due to timeout
[ctest]     Test #53: system_information ..................   Passed   10.94 sec
[ctest]       Start 53: system_information
[ctest] 
[ctest] 53: Test command: "C:\Program Files\CMake\bin\cmake.exe" "-DTIMEOUT_SECONDS=10" "-P" "C:/njs/depthai-core/examples/cmake/ExecuteTestTimeout.cmake" "C:/njs/depthai-core/build/examples/system_information.exe"
[ctest] 53: Environment variables: 
[ctest] 53:  DEPTHAI_LEVEL=debug
[ctest] 53:  UBSAN_OPTIONS=halt_on_error=1
[ctest] 53: Test timeout computed to be: 1500
[ctest] 53: -- found -P
[ctest] 53: -- arguments: C:/njs/depthai-core/build/examples/system_information.exe;
[ctest] 53: [2021-12-01 21:37:23.540] [debug] Library information - version: 2.13.3, commit: ce43051c2ea16088437eb377f18f9ee15214229d from 2021-12-01 20:51:22 +0100, build: 2021-12-01 19:56:05 +0000
[ctest] 53: [2021-12-01 21:37:23.546] [debug] Initialize - finished
[ctest] 53: [2021-12-01 21:37:23.686] [debug] Resources - Archive 'depthai-bootloader-fwp-0.0.15.tar.xz' open: 6ms, archive read: 135ms
[ctest] 53: [2021-12-01 21:37:24.161] [debug] Resources - Archive 'depthai-device-fwp-48fca4a443d841221c94c70d5c05e7e946168636.tar.xz' open: 7ms, archive read: 613ms
[ctest] 53: Stack trace (most recent call last):
[ctest] 53: #9    Object "", at 00007FFAC9657034, in BaseThreadInitThunk
[ctest] 53: #8    Source "d:\a01\_work\6\s\src\vctools\crt\vcstartup\src\startup\exe_main.cpp", line 17, in mainCRTStartup [00007FF6A13A226E]
[ctest] 53: #7    Source "d:\a01\_work\6\s\src\vctools\crt\vcstartup\src\startup\exe_common.inl", line 331, in __scrt_common_main [00007FF6A13A1F3E]
[ctest] 53: #6    Source "d:\a01\_work\6\s\src\vctools\crt\vcstartup\src\startup\exe_common.inl", line 288, in __scrt_common_main_seh [00007FF6A13A207E]
[ctest] 53: #5    Source "d:\a01\_work\6\s\src\vctools\crt\vcstartup\src\startup\exe_common.inl", line 79, in invoke_main [00007FF6A13A21D9]
[ctest] 53: #4    Source "C:\njs\depthai-core\examples\SystemLogger\system_information.cpp", line 37, in main [00007FF6A1028E14]
[ctest] 53:          34:     sysLog->out.link(xout->input);
[ctest] 53:          35: 
[ctest] 53:          36:     // Connect to device and start pipeline
[ctest] 53:       >  37:     dai::Device device(pipeline);
[ctest] 53:          38: 
[ctest] 53:          39:     // Output queue will be used to get the system info
[ctest] 53:          40:     auto qSysInfo = device.getOutputQueue("sysinfo", 4, false);
[ctest] 53: #3    Source "C:\njs\depthai-core\src\device\Device.cpp", line 26, in dai::Device::Device [00007FF6A1032085]
[ctest] 53:          23: // Common explicit instantiation, to remove the need to define in header
[ctest] 53:          24: constexpr std::size_t Device::EVENT_QUEUE_MAXIMUM_SIZE;
[ctest] 53:          25: 
[ctest] 53:       >  26: Device::Device(const Pipeline& pipeline) : DeviceBase(pipeline.getOpenVINOVersion()) {
[ctest] 53:          27:     tryStartPipeline(pipeline);
[ctest] 53:          28: }
[ctest] 53: #2    Source "C:\njs\depthai-core\src\device\DeviceBase.cpp", line 256, in dai::DeviceBase::DeviceBase [00007FF6A10EA5DF]
[ctest] 53:         254:     // If no device found, throw
[ctest] 53:         255:     if(!found) throw std::runtime_error("No available devices");
[ctest] 53:       > 256:     init(version, false, "");
[ctest] 53:         257: }
[ctest] 53:         258: 
[ctest] 53:         259: DeviceBase::DeviceBase(OpenVINO::Version version, const char* pathToCmd) {
[ctest] 53: #1    Object "", at 00007FFAC208B460, in CxxThrowException
[ctest] 53: #0    Object "", at 00007FFAC7E54F69, in RaiseException
[ctest] 53: -- After process executed,  produced the following exit code: -529697949
[ctest] 53: CMake Error at C:/njs/depthai-core/examples/cmake/ExecuteTestTimeout.cmake:42 (message):
[ctest] 53:    produced an error (-529697949) while running
[ctest] 53: 
[ctest] 53: 
[ctest]     Test #53: system_information ..................***Failed    5.22 sec

Expected

No failures to find devices on restart

About this issue

  • Original URL
  • State: open
  • Created 3 years ago
  • Comments: 17 (11 by maintainers)

Most upvoted comments

Casual test results of windows, shared lib, debug with develop 04d5149e7cfddc4dc871d55d1a3d9ee270b76835

DEFAULT_SEARCH_TIME failed tests
2 sec 18
3 sec 6
4 sec 4
5 sec 0

Since it is the same repo steps and same line of code failing and same race-like behavior…I suspect this is all the same OP issue at the top… failure to reliably handle the shutdown/startup of the device while the app (the test suite) is trying to re-connect.

Yesterday released my first alpha for my app 🎉 so today I wanted to verify bug fixes from your team that are on develop. My app used a depthai-core private build based on v2.14.1 + my fixes that were merged to develop but not yet on main.

Since we have already identified a timing relation between app stop and/or fail…and the time when a device can again be connected… I suspect that subtle changes in my laptop, the develop branch code, or a butterfly in Japan caused just enough delay so to cause the reconnect timeout to randomly exceed its limit on 8% of the tests in a full suite run.

Thursday I’ll change the connect timeout in depthai-core and see if that makes it go away. And if I shorten it that it makes it more common. If so, I think this is just choosing a timeout that is adequate for most cases. We touched on this idea elsewhere. I’ll also run the test suite with my app’s private build to see if it repros there.

I don’t think the examples should change. They have been good at surfacing errant code. I discourage SIGINT handlers. That isn’t real-world coding. SIGINT is never in Windows real-or-synthetic. And SIGINT doesn’t help with app crashes and/or network goes down.

This isn’t about making the test+examples run pretty. I’ve written this above but my communication isn’t clear enough. My mistake. I’ll try to explain in another way…🤓

My goal here is to understand the ecosystem and how it can lead to a “device not found”. I don’t yet think there is a “bug” in code. So far, this seems more timing/configuration issue. ⚠ However, I can often reproduce it in the examples and in depthai_demo.py. These ‘device not found’ are not rare. They are often, and occurs across C++, Python, release, debug, compiled by me, and python wheel compiled by you. If it happens that often, then this is something that needs my investigation.

Real-world use cases from my customers… but substituting Luxonis devices that have this “not found” issue:

  • Multiple Luxonis devices are hanging 6 meters overhead during a live music performance… and the band’s app crashes… and on restart of the app gets “device not found” for all three devices. A person can’t physically go to the devices.
  • Multiple Luxonis devices are part of a interactive museum installation. The app crashes. The museum has paying customers there for that installation. Sending a tech onsite is hours downtime. And turning off the power to the whole installation is excessive.

These are likely scenarios by my customers. Most of my customers use Microsoft Kinect devices. To use Luxonis devices, they need from me a solution that if their app crashes, and they restart their app, their app can immediately find the Luxonis devices and continue quickly.

Give me some more time to investigate. ⌚👍