tt-metal: WH: Multi-core unpad hangs non-deterministically in post-commit
It hangs on the 6th iteration of the post-commit stress test (machine: t3005).
===================================================================================== test session starts =====================================================================================
platform linux -- Python 3.8.10, pytest-7.2.2, pluggy-1.3.0 -- /home/dcapalija/tt-metal/build/python_env/bin/python
cachedir: .pytest_cache
rootdir: /home/dcapalija/tt-metal, configfile: pytest.ini
plugins: dash-2.8.1
collected 2 items
tests/tt_eager/python_api_testing/sweep_tests/pytests/tt_dnn/test_unpad.py::test_run_unpad_test[input_shapes0-unpad_args0] Metal | INFO | Initializing device 0
CHECKING: [0, 0, 0, 0]
CHECKING: [1, 0, 0, 0]
Device | INFO | Network descriptor loaded /home/dcapalija/tt-metal/tt_metal/third_party/umd/device/../.umd/cluster_desc.yaml
2023-09-16 08:11:28.865 | INFO | SiliconDriver - Detected 1 PCI device
2023-09-16 08:11:28.927 | INFO | SiliconDriver - Using 1 Hugepages/NumHostMemChannels for TTDevice (pci_interface_id: 0 device_id: 0x401e revision: 1)
2023-09-16 08:11:28.953 | INFO | SiliconDriver - Disable PCIE DMA
LLRuntime | INFO | AI CLK for device 0 is: 1000 MHz
2023-09-16 08:11:29.401 | INFO | tests.tt_eager.python_api_testing.sweep_tests.run_pytorch_ci_tests:run_single_pytorch_test:66 - Running with shape: [[5, 5, 50, 50]] on device: 0
2023-09-16 08:11:29.401 | INFO | tests.tt_eager.python_api_testing.sweep_tests.common:run_tt_lib_test:42 - Running with args: {'dtype': [<DataType.BFLOAT16: 0>], 'layout': [<Layout.ROW_MAJOR: 0>], 'buffer_type': [<BufferType.DRAM: 0>], 'output_mem_config': tt::tt_metal::MemoryConfig(interleaved=true, buffer_type=BufferType::DRAM), 'output_tensor_start': [0, 0, 0, 0], 'output_tensor_end': [2, 1, 10, 27]}
Hanging while waiting for the program to finish:
(gdb) bt
#0 0x00007fa78414be60 in memcpy_from_device(void*, void const*, unsigned long) () from /home/dcapalija/tt-metal/build/lib/libdevice.so
#1 0x00007fa78414c04a in read_block(TTDevice*, unsigned int, unsigned int, unsigned long, unsigned int) () from /home/dcapalija/tt-metal/build/lib/libdevice.so
#2 0x00007fa784155845 in tt_SiliconDevice::read_device_memory(unsigned int*, tt_cxy_pair, unsigned int, unsigned int, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&) () from /home/dcapalija/tt-metal/build/lib/libdevice.so
#3 0x00007fa78415f908 in tt_SiliconDevice::read_from_device(unsigned int*, tt_cxy_pair, unsigned long, unsigned int, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&) () from /home/dcapalija/tt-metal/build/lib/libdevice.so
#4 0x00007fa7843e8cba in tt_cluster::read_dram_vec (this=<optimized out>, mem_ptr=0x6510840, dram_core=..., addr=8, size_in_bytes=4, small_access=<optimized out>)
at /usr/include/c++/9/bits/char_traits.h:300
#5 0x00007fa7843fb710 in tt::llrt::read_hex_vec_from_core (cluster=<optimized out>, chip=0, core=..., addr=<optimized out>, size=<optimized out>)
at /home/dcapalija/tt-metal/tt_metal/third_party/umd/device/tt_xy_pair.h:38
#6 0x00007fa7843fc5d6 in tt::llrt::internal_::<lambda(uint64_t)>::operator() (run_mailbox_address_=8, __closure=0x65108e0) at tt_metal/llrt/llrt.cpp:429
#7 std::_Function_handler<bool(long unsigned int), tt::llrt::internal_::check_if_riscs_on_specified_core_done(tt_cluster*, int, tt::llrt::TensixRiscsOptions, const CoreCoord&)::<lambda(uint64_t)> >::_M_invoke(const std::_Any_data &, unsigned long &&) (__functor=..., __args#0=<optimized out>) at /usr/include/c++/9/bits/std_function.h:285
#8 0x00007fa7843fc437 in std::function<bool (unsigned long)>::operator()(unsigned long) const (__args#0=<optimized out>, this=0x7ffc7b436eb0) at /usr/include/c++/9/bits/std_function.h:683
#9 tt::llrt::internal_::check_if_riscs_on_specified_core_done (cluster=<optimized out>, chip_id=<optimized out>, riscs_options=<optimized out>, core=...) at tt_metal/llrt/llrt.cpp:441
#10 0x00007fa78436b4cf in tt::tt_metal::LaunchProgram (device=device@entry=0x6495380, program=..., stagger_start=stagger_start@entry=false) at tt_metal/tt_metal.cpp:508
#11 0x00007fa7847c9ff8 in tt::tt_metal::operation::detail::run_without_program_cache (operation=..., input_tensors=std::vector of length 1, capacity 1 = {...},
optional_input_tensors=std::vector of length 0, capacity 0) at tt_eager/tt_dnn/op_library/run_operation.cpp:125
#12 0x00007fa7847ce631 in tt::tt_metal::operation::run (operation=..., input_tensors=std::vector of length 1, capacity 1 = {...}, optional_input_tensors=std::vector of length 0, capacity 0)
at tt_eager/tt_dnn/op_library/run_operation.cpp:204
#13 0x00007fa7847cfcd1 in tt::tt_metal::operation::run_without_autoformat (operation=..., input_tensors=std::vector of length 1, capacity 1 = {...},
optional_input_tensors=std::vector of length 0, capacity 0) at tt_eager/tt_dnn/op_library/run_operation.cpp:237
#14 0x00007fa7846c9f9c in tt::tt_metal::operation::run_without_autoformat<tt::tt_metal::Unpad> (optional_input_tensors=std::vector of length 0, capacity 0,
input_tensors=std::vector of length 1, capacity 1 = {...}, concrete_op=...) at /usr/include/c++/9/new:174
#15 tt::tt_metal::unpad (input_tensor_a=..., output_tensor_start=..., output_tensor_end=..., mem_config=...) at tt_eager/tt_dnn/op_library/unpad/unpad_op.cpp:130
#16 0x00007fa784597777 in tt::tt_metal::<lambda(const tt::tt_metal::Tensor&, const std::array<unsigned int, 4>&, const std::array<unsigned int, 4>&, const tt::tt_metal::MemoryConfig&)>::operator() (__closure=<optimized out>, output_mem_config=..., output_tensor_end=..., output_tensor_start=..., input_tensor=...) at tt_eager/tt_lib/csrc/tt_lib_bindings.cpp:2272
#17 pybind11::detail::argument_loader<tt::tt_metal::Tensor const&, std::array<unsigned int, 4> const&, std::array<unsigned int, 4> const&, tt::tt_metal::MemoryConfig const&>::call_impl<tt::tt_metal::Tensor, tt::tt_metal::TensorModule(pybind11::module&)::<lambda(const tt::tt_metal::Tensor&, const std::array<unsigned int, 4>&, const std::array<unsigned int, 4>&, const tt::tt_metal::MemoryConfig&)>&, 0, 1, 2, 3, pybind11::detail::void_type> (f=..., this=<optimized out>) at tt_metal/third_party/pybind11/include/pybind11/detail/../cast.h:1443
#18 pybind11::detail::argument_loader<tt::tt_metal::Tensor const&, std::array<unsigned int, 4> const&, std::array<unsigned int, 4> const&, tt::tt_metal::MemoryConfig const&>::call<tt::tt_metal::Tensor, pybind11::detail::void_type, tt::tt_metal::TensorModule(pybind11::module&)::<lambda(const tt::tt_metal::Tensor&, const std::array<unsigned int, 4>&, const std::array<unsigned int, 4>&, const tt::tt_metal::MemoryConfig&)>&> (f=..., this=<optimized out>) at tt_metal/third_party/pybind11/include/pybind11/detail/../cast.h:1412
#19 pybind11::cpp_function::<lambda(pybind11::detail::function_call&)>::operator()(pybind11::detail::function_call &) (call=..., this=<optimized out>)
at tt_metal/third_party/pybind11/include/pybind11/pybind11.h:248
#20 0x00007fa7846173a6 in pybind11::cpp_function::dispatcher (self=0x6510840, args_in=0x7fa78496bcc0, kwargs_in=0x1) at tt_metal/third_party/pybind11/include/pybind11/pybind11.h:939
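For context, the blocked frames are the host-side done-polling in LaunchProgram: the host repeatedly issues 4-byte PCIe reads of a per-core run mailbox (addr=8 in frame #4) until every RISC reports done, so a kernel that never writes the done value leaves the host spinning forever. A minimal sketch of that pattern is below; the constants, the CoreCoord stand-in, and the helper names are illustrative assumptions, not the exact tt-metal internals (the real loop is tt::llrt::internal_::check_if_riscs_on_specified_core_done in tt_metal/llrt/llrt.cpp).

// Illustrative sketch of the done-polling the backtrace is stuck in.
// Constants and helpers here are assumptions for illustration only.
#include <cstdint>
#include <vector>

struct CoreCoord { uint32_t x; uint32_t y; };   // minimal stand-in for the real CoreCoord

constexpr uint64_t RUN_MAILBOX_ADDR = 8;        // per-core run mailbox the host polls (addr=8 matches frames #4/#6)
constexpr uint32_t RUN_MSG_DONE     = 0;        // value the firmware writes when the kernel finishes (assumed)

// Stand-in for read_hex_vec_from_core in the trace: a 4-byte read over PCIe
// that ultimately goes through tt_SiliconDevice::read_from_device.
uint32_t read_word_from_core(int /*chip*/, CoreCoord /*core*/, uint64_t /*addr*/) {
    return RUN_MSG_DONE;  // stub so the sketch compiles; the real call hits the device
}

bool riscs_on_core_done(int chip, CoreCoord core) {
    // If a kernel never writes the done value, this read keeps returning the
    // stale run value and the host spins here -- the hang signature above.
    return read_word_from_core(chip, core, RUN_MAILBOX_ADDR) == RUN_MSG_DONE;
}

void wait_for_program_done(int chip, const std::vector<CoreCoord>& cores) {
    for (const auto& core : cores) {
        while (!riscs_on_core_done(chip, core)) {
            // busy-wait with no timeout, so a single stuck core blocks the whole launch
        }
    }
}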
FYI, this hang signature looks similar to the UMD-related hang seen on BBE as well: https://yyz-gitlab.local.tenstorrent.com/tenstorrent/budabackend/-/issues/1931
The same test hangs on the latest main after the fw-at-init changes (hash = 2229e00055af18b999a85df895bc94af098cc3d1). I haven't ported the polling simplification + debug-hanging-core change yet, so I don't have the hang pattern data.
It hung on the 4th iteration of the post-commit stress test; it does seem easier to make it hang as part of post-commit. It also crashed on the 414th iteration of a stand-alone test run. Not sure if that's related; filed a separate issue: https://github.com/tenstorrent-metal/tt-metal/issues/2700
Posting a longer test trace to see what ran just before, in case the system state turns out to be relevant. I'm guessing it matters, given the hang on the 4th iteration of post-commit vs. the 414th iteration standalone.
I think it would be great to have the ability to force single-core. We'd still want to run multi-core (as of now there's no reason it shouldn't work), but there could be a bug, and we need to decouple single-core debug from multi-core debug. A rough sketch of one way to gate this is below.
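As a hedged sketch of what "force single-core" could look like: gate the multi-core work split behind an environment variable so the same op falls back to a single-core layout for debug. The variable name and helper below (TT_METAL_FORCE_SINGLE_CORE, choose_num_cores) are hypothetical, not existing tt-metal switches.

// Hypothetical debug switch; not an existing tt-metal environment variable.
#include <algorithm>
#include <cstdint>
#include <cstdlib>

static bool force_single_core() {
    const char* env = std::getenv("TT_METAL_FORCE_SINGLE_CORE");
    return env != nullptr && env[0] == '1';
}

// Sketch of the core-count decision an op's multi-core factory could make:
// clamp to one core when the debug switch is set, otherwise split the work.
uint32_t choose_num_cores(uint32_t units_of_work, uint32_t max_cores_available) {
    if (force_single_core()) {
        return 1;  // run the whole op on a single core to decouple the debug paths
    }
    return std::min(units_of_work, max_cores_available);
}

Running with TT_METAL_FORCE_SINGLE_CORE=1 would then exercise the same op code path, just without the multi-core split.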
We should do progressive WH testing: