tt-metal: WH: Multi-core unpad hangs non-deterministically in post-commit

It hangs on the 6th iteration of the post-commit stress test (machine: t3005).

===================================================================================== test session starts =====================================================================================
platform linux -- Python 3.8.10, pytest-7.2.2, pluggy-1.3.0 -- /home/dcapalija/tt-metal/build/python_env/bin/python
cachedir: .pytest_cache
rootdir: /home/dcapalija/tt-metal, configfile: pytest.ini
plugins: dash-2.8.1
collected 2 items

tests/tt_eager/python_api_testing/sweep_tests/pytests/tt_dnn/test_unpad.py::test_run_unpad_test[input_shapes0-unpad_args0]                   Metal | INFO     | Initializing device 0
CHECKING: [0, 0, 0, 0]
CHECKING: [1, 0, 0, 0]
                 Device | INFO     | Network descriptor loaded /home/dcapalija/tt-metal/tt_metal/third_party/umd/device/../.umd/cluster_desc.yaml
2023-09-16 08:11:28.865 | INFO     | SiliconDriver   - Detected 1 PCI device
2023-09-16 08:11:28.927 | INFO     | SiliconDriver   - Using 1 Hugepages/NumHostMemChannels for TTDevice (pci_interface_id: 0 device_id: 0x401e revision: 1)
2023-09-16 08:11:28.953 | INFO     | SiliconDriver   - Disable PCIE DMA
              LLRuntime | INFO     | AI CLK for device 0 is:   1000 MHz
2023-09-16 08:11:29.401 | INFO     | tests.tt_eager.python_api_testing.sweep_tests.run_pytorch_ci_tests:run_single_pytorch_test:66 - Running with shape: [[5, 5, 50, 50]] on device: 0
2023-09-16 08:11:29.401 | INFO     | tests.tt_eager.python_api_testing.sweep_tests.common:run_tt_lib_test:42 - Running with args: {'dtype': [<DataType.BFLOAT16: 0>], 'layout': [<Layout.ROW_MAJOR: 0>], 'buffer_type': [<BufferType.DRAM: 0>], 'output_mem_config': tt::tt_metal::MemoryConfig(interleaved=true, buffer_type=BufferType::DRAM), 'output_tensor_start': [0, 0, 0, 0], 'output_tensor_end': [2, 1, 10, 27]}

Backtrace while hanging, waiting for the program to finish:

(gdb) bt
#0  0x00007fa78414be60 in memcpy_from_device(void*, void const*, unsigned long) () from /home/dcapalija/tt-metal/build/lib/libdevice.so
#1  0x00007fa78414c04a in read_block(TTDevice*, unsigned int, unsigned int, unsigned long, unsigned int) () from /home/dcapalija/tt-metal/build/lib/libdevice.so
#2  0x00007fa784155845 in tt_SiliconDevice::read_device_memory(unsigned int*, tt_cxy_pair, unsigned int, unsigned int, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&) () from /home/dcapalija/tt-metal/build/lib/libdevice.so
#3  0x00007fa78415f908 in tt_SiliconDevice::read_from_device(unsigned int*, tt_cxy_pair, unsigned long, unsigned int, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&) () from /home/dcapalija/tt-metal/build/lib/libdevice.so
#4  0x00007fa7843e8cba in tt_cluster::read_dram_vec (this=<optimized out>, mem_ptr=0x6510840, dram_core=..., addr=8, size_in_bytes=4, small_access=<optimized out>)
    at /usr/include/c++/9/bits/char_traits.h:300
#5  0x00007fa7843fb710 in tt::llrt::read_hex_vec_from_core (cluster=<optimized out>, chip=0, core=..., addr=<optimized out>, size=<optimized out>)
    at /home/dcapalija/tt-metal/tt_metal/third_party/umd/device/tt_xy_pair.h:38
#6  0x00007fa7843fc5d6 in tt::llrt::internal_::<lambda(uint64_t)>::operator() (run_mailbox_address_=8, __closure=0x65108e0) at tt_metal/llrt/llrt.cpp:429
#7  std::_Function_handler<bool(long unsigned int), tt::llrt::internal_::check_if_riscs_on_specified_core_done(tt_cluster*, int, tt::llrt::TensixRiscsOptions, const CoreCoord&)::<lambda(uint64_t)> >::_M_invoke(const std::_Any_data &, unsigned long &&) (__functor=..., __args#0=<optimized out>) at /usr/include/c++/9/bits/std_function.h:285
#8  0x00007fa7843fc437 in std::function<bool (unsigned long)>::operator()(unsigned long) const (__args#0=<optimized out>, this=0x7ffc7b436eb0) at /usr/include/c++/9/bits/std_function.h:683
#9  tt::llrt::internal_::check_if_riscs_on_specified_core_done (cluster=<optimized out>, chip_id=<optimized out>, riscs_options=<optimized out>, core=...) at tt_metal/llrt/llrt.cpp:441
#10 0x00007fa78436b4cf in tt::tt_metal::LaunchProgram (device=device@entry=0x6495380, program=..., stagger_start=stagger_start@entry=false) at tt_metal/tt_metal.cpp:508
#11 0x00007fa7847c9ff8 in tt::tt_metal::operation::detail::run_without_program_cache (operation=..., input_tensors=std::vector of length 1, capacity 1 = {...},
    optional_input_tensors=std::vector of length 0, capacity 0) at tt_eager/tt_dnn/op_library/run_operation.cpp:125
#12 0x00007fa7847ce631 in tt::tt_metal::operation::run (operation=..., input_tensors=std::vector of length 1, capacity 1 = {...}, optional_input_tensors=std::vector of length 0, capacity 0)
    at tt_eager/tt_dnn/op_library/run_operation.cpp:204
#13 0x00007fa7847cfcd1 in tt::tt_metal::operation::run_without_autoformat (operation=..., input_tensors=std::vector of length 1, capacity 1 = {...},
    optional_input_tensors=std::vector of length 0, capacity 0) at tt_eager/tt_dnn/op_library/run_operation.cpp:237
#14 0x00007fa7846c9f9c in tt::tt_metal::operation::run_without_autoformat<tt::tt_metal::Unpad> (optional_input_tensors=std::vector of length 0, capacity 0,
    input_tensors=std::vector of length 1, capacity 1 = {...}, concrete_op=...) at /usr/include/c++/9/new:174
#15 tt::tt_metal::unpad (input_tensor_a=..., output_tensor_start=..., output_tensor_end=..., mem_config=...) at tt_eager/tt_dnn/op_library/unpad/unpad_op.cpp:130
#16 0x00007fa784597777 in tt::tt_metal::<lambda(const tt::tt_metal::Tensor&, const std::array<unsigned int, 4>&, const std::array<unsigned int, 4>&, const tt::tt_metal::MemoryConfig&)>::operator() (__closure=<optimized out>, output_mem_config=..., output_tensor_end=..., output_tensor_start=..., input_tensor=...) at tt_eager/tt_lib/csrc/tt_lib_bindings.cpp:2272
#17 pybind11::detail::argument_loader<tt::tt_metal::Tensor const&, std::array<unsigned int, 4> const&, std::array<unsigned int, 4> const&, tt::tt_metal::MemoryConfig const&>::call_impl<tt::tt_metal::Tensor, tt::tt_metal::TensorModule(pybind11::module&)::<lambda(const tt::tt_metal::Tensor&, const std::array<unsigned int, 4>&, const std::array<unsigned int, 4>&, const tt::tt_metal::MemoryConfig&)>&, 0, 1, 2, 3, pybind11::detail::void_type> (f=..., this=<optimized out>) at tt_metal/third_party/pybind11/include/pybind11/detail/../cast.h:1443
#18 pybind11::detail::argument_loader<tt::tt_metal::Tensor const&, std::array<unsigned int, 4> const&, std::array<unsigned int, 4> const&, tt::tt_metal::MemoryConfig const&>::call<tt::tt_metal::Tensor, pybind11::detail::void_type, tt::tt_metal::TensorModule(pybind11::module&)::<lambda(const tt::tt_metal::Tensor&, const std::array<unsigned int, 4>&, const std::array<unsigned int, 4>&, const tt::tt_metal::MemoryConfig&)>&> (f=..., this=<optimized out>) at tt_metal/third_party/pybind11/include/pybind11/detail/../cast.h:1412
#19 pybind11::cpp_function::<lambda(pybind11::detail::function_call&)>::operator()(pybind11::detail::function_call &) (call=..., this=<optimized out>)
    at tt_metal/third_party/pybind11/include/pybind11/pybind11.h:248
#20 0x00007fa7846173a6 in pybind11::cpp_function::dispatcher (self=0x6510840, args_in=0x7fa78496bcc0, kwargs_in=0x1) at tt_metal/third_party/pybind11/include/pybind11/pybind11.h:939
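The backtrace shows the host stuck in `check_if_riscs_on_specified_core_done`, which polls a per-core run mailbox over PCIe until every RISC reports done. A minimal sketch of that polling pattern, with a timeout added so a hang like this one surfaces as a diagnosable error instead of blocking forever (`read_mailbox`, the `DONE` sentinel, and the function name are hypothetical illustrations, not the tt-metal API):

```python
import time

def wait_until_cores_done(read_mailbox, cores, timeout_s=30.0, poll_interval_s=0.01):
    """Poll each core's run mailbox until every core reports done.

    read_mailbox(core) is a caller-supplied function returning the mailbox
    value for one core; DONE is a hypothetical sentinel value (the real one
    lives in the device firmware).
    """
    DONE = 1
    deadline = time.monotonic() + timeout_s
    pending = set(cores)
    while pending:
        # Drop every core whose mailbox now reads the done sentinel.
        pending = {c for c in pending if read_mailbox(c) != DONE}
        if pending:
            if time.monotonic() > deadline:
                raise TimeoutError(f"cores never reported done: {sorted(pending)}")
            time.sleep(poll_interval_s)
```

With a timeout like this, the stress run would fail on the hanging iteration and name the stuck cores, rather than requiring someone to attach gdb to the wedged process.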

About this issue

  • State: closed
  • Created 9 months ago
  • Comments: 44 (18 by maintainers)

Most upvoted comments

FYI, this hang signature looks similar to a UMD-related hang seen on BBE as well: https://yyz-gitlab.local.tenstorrent.com/tenstorrent/budabackend/-/issues/1931

The same test hangs on the latest main after the fw-at-init changes, hash = 2229e00055af18b999a85df895bc94af098cc3d1. I haven’t ported the polling simplification + debug-hanging-core change yet, so I don’t have the hang pattern data yet.

It hung on the 4th iteration of the post-commit stress run; it does seem easier to make it hang as part of post-commit. It also crashed on the 414th iteration of a stand-alone test run – not sure if related; filed a separate case: https://github.com/tenstorrent-metal/tt-metal/issues/2700

Posting a longer test trace to see what ran just before, in case the system state turns out to be relevant. I’m guessing it matters, given the 4th iteration in post-commit vs. the 414th standalone.

2023-09-18 06:08:06.853 | INFO     | tests.tt_eager.python_api_testing.sweep_tests.run_pytorch_ci_tests:run_single_pytorch_test:66 - Running with shape: [[1, 1, 32, 32]] on device: 0
tests/tt_eager/python_api_testing/sweep_tests/pytests/tt_dnn/test_transpose.py::test_run_transpose_wh_test[input_shapes0]                   Metal | INFO     | Initializing device 0
CHECKING: [0, 0, 0, 0]
CHECKING: [1, 0, 0, 0]
                 Device | INFO     | Network descriptor loaded /home/dcapalija/tt-metal/tt_metal/third_party/umd/device/../.umd/cluster_desc.yaml
2023-09-18 06:08:06.335 | INFO     | SiliconDriver   - Detected 1 PCI device
2023-09-18 06:08:06.393 | INFO     | SiliconDriver   - Using 1 Hugepages/NumHostMemChannels for TTDevice (pci_interface_id: 0 device_id: 0x401e revision: 1)
2023-09-18 06:08:06.394 | WARNING  | SiliconDriver   - hwloc_set_area_membind(): failed for physical_device_id: 0 on NodeSet: {1} with errno: Input/output error (pid: 156649 tid: 140524614661952)
---- ttSiliconDevice::init_hugepage: bind_area_to_memory_nodeset() failed (physical_device_id: 0 ch: 0). Hugepage allocation is not on NumaNode matching TT Device. Side-Effect is decreased Device->Host perf (Issue #893).
2023-09-18 06:08:06.418 | INFO     | SiliconDriver   - Disable PCIE DMA
              LLRuntime | INFO     | AI CLK for device 0 is:   1000 MHz
2023-09-18 06:08:06.853 | INFO     | tests.tt_eager.python_api_testing.sweep_tests.run_pytorch_ci_tests:run_single_pytorch_test:66 - Running with shape: [[1, 1, 32, 32]] on device: 0
2023-09-18 06:08:06.853 | INFO     | tests.tt_eager.python_api_testing.sweep_tests.common:run_tt_lib_test:42 - Running with args: {'dtype': [<DataType.BFLOAT16: 0>], 'layout': [<Layout.TILE: 1>], 'buffer_type': [<BufferType.DRAM: 0>], 'output_mem_config': tt::tt_metal::MemoryConfig(interleaved=true, buffer_type=BufferType::DRAM)}
2023-09-18 06:08:07.912 | DEBUG    | tests.tt_eager.python_api_testing.sweep_tests.run_pytorch_ci_tests:run_single_pytorch_test:78 - Test pass/fail: True with Max ATOL Delta: 0.0, Max RTOL Delta: 0.0, PCC: 1.0
2023-09-18 06:08:07.913 | DEBUG    | tests.tt_eager.python_api_testing.sweep_tests.run_pytorch_ci_tests:run_single_pytorch_test:79 - Test args: {'dtype': [<DataType.BFLOAT16: 0>], 'layout': [<Layout.TILE: 1>], 'buffer_type': [<BufferType.DRAM: 0>], 'output_mem_config': tt::tt_metal::MemoryConfig(interleaved=true, buffer_type=BufferType::DRAM)}
2023-09-18 06:08:07.913 | INFO     | tests.tt_eager.python_api_testing.sweep_tests.run_pytorch_ci_tests:run_single_pytorch_test:88 - transpose-wh test passed with input shape [[1, 1, 32, 32]].
PASSED                     Op | INFO     | Program Cache: disabled and cleared.

tests/tt_eager/python_api_testing/sweep_tests/pytests/tt_dnn/test_transpose.py::test_run_transpose_hc_test[input_shapes0] SKIPPED (not working for Wormhole B0)
tests/tt_eager/python_api_testing/sweep_tests/pytests/tt_dnn/test_transpose.py::test_run_transpose_hc_test[input_shapes1] SKIPPED (not working for Wormhole B0)
tests/tt_eager/python_api_testing/sweep_tests/pytests/tt_dnn/test_transpose.py::test_run_transpose_cn_test[input_shapes0] 2023-09-18 06:08:07.918 | INFO     | tests.tt_eager.python_api_testing.sweep_tests.run_pytorch_ci_tests:run_single_pytorch_test:66 - Running with shape: [[1, 1, 32, 32]] on device: 0
2023-09-18 06:08:07.918 | INFO     | tests.tt_eager.python_api_testing.sweep_tests.common:run_tt_lib_test:42 - Running with args: {'dtype': [<DataType.BFLOAT16: 0>], 'layout': [<Layout.TILE: 1>], 'buffer_type': [<BufferType.DRAM: 0>], 'output_mem_config': tt::tt_metal::MemoryConfig(interleaved=true, buffer_type=BufferType::DRAM)}
2023-09-18 06:08:07.920 | DEBUG    | tests.tt_eager.python_api_testing.sweep_tests.run_pytorch_ci_tests:run_single_pytorch_test:78 - Test pass/fail: True with Max ATOL Delta: 0.0, Max RTOL Delta: 0.0, PCC: 1.0
2023-09-18 06:08:07.920 | DEBUG    | tests.tt_eager.python_api_testing.sweep_tests.run_pytorch_ci_tests:run_single_pytorch_test:79 - Test args: {'dtype': [<DataType.BFLOAT16: 0>], 'layout': [<Layout.TILE: 1>], 'buffer_type': [<BufferType.DRAM: 0>], 'output_mem_config': tt::tt_metal::MemoryConfig(interleaved=true, buffer_type=BufferType::DRAM)}
2023-09-18 06:08:07.920 | INFO     | tests.tt_eager.python_api_testing.sweep_tests.run_pytorch_ci_tests:run_single_pytorch_test:88 - transpose-cn test passed with input shape [[1, 1, 32, 32]].
PASSED                     Op | INFO     | Program Cache: disabled and cleared.

tests/tt_eager/python_api_testing/sweep_tests/pytests/tt_dnn/test_transpose.py::test_run_transpose_nh_test[input_shapes0] SKIPPED (not working for Wormhole B0)
tests/tt_eager/python_api_testing/sweep_tests/pytests/tt_dnn/test_transpose.py::test_run_transpose_nh_test[input_shapes1] SKIPPED (not working for Wormhole B0)
tests/tt_eager/python_api_testing/sweep_tests/pytests/tt_dnn/test_transpose.py::test_run_transpose_nw_test[input_shapes0] SKIPPED (not working for Wormhole B0)
tests/tt_eager/python_api_testing/sweep_tests/pytests/tt_dnn/test_transpose.py::test_run_transpose_nw_test[input_shapes1] SKIPPED (not working for Wormhole B0)
tests/tt_eager/python_api_testing/sweep_tests/pytests/tt_dnn/test_transpose.py::test_run_transpose_cw_test[input_shapes0] SKIPPED (not working for Wormhole B0)
tests/tt_eager/python_api_testing/sweep_tests/pytests/tt_dnn/test_transpose.py::test_run_transpose_cw_test[input_shapes1] SKIPPED (not working for Wormhole B0)                  Metal | INFO     | Closing device 0


=============================================================================================================== PASSES ================================================================================================================
======================================================================================================= short test summary info =======================================================================================================
PASSED tests/tt_eager/python_api_testing/sweep_tests/pytests/tt_dnn/test_transpose.py::test_run_transpose_wh_test[input_shapes0]
PASSED tests/tt_eager/python_api_testing/sweep_tests/pytests/tt_dnn/test_transpose.py::test_run_transpose_cn_test[input_shapes0]
SKIPPED [2] tests/tt_eager/python_api_testing/sweep_tests/pytests/tt_dnn/test_transpose.py:49: not working for Wormhole B0
SKIPPED [2] tests/tt_eager/python_api_testing/sweep_tests/pytests/tt_dnn/test_transpose.py:98: not working for Wormhole B0
SKIPPED [2] tests/tt_eager/python_api_testing/sweep_tests/pytests/tt_dnn/test_transpose.py:121: not working for Wormhole B0
SKIPPED [2] tests/tt_eager/python_api_testing/sweep_tests/pytests/tt_dnn/test_transpose.py:144: not working for Wormhole B0
==================================================================================================== 2 passed, 8 skipped in 3.02s =====================================================================================================
                  Metal | INFO     | Closing device driver
========================================================================================================= test session starts =========================================================================================================
platform linux -- Python 3.8.10, pytest-7.2.2, pluggy-1.3.0 -- /home/dcapalija/tt-metal/build/python_env/bin/python
cachedir: .pytest_cache
rootdir: /home/dcapalija/tt-metal, configfile: pytest.ini
plugins: dash-2.8.1
collected 2 items

tests/tt_eager/python_api_testing/sweep_tests/pytests/tt_dnn/test_unpad.py::test_run_unpad_test[input_shapes0-unpad_args0]                   Metal | INFO     | Initializing device 0
CHECKING: [0, 0, 0, 0]
CHECKING: [1, 0, 0, 0]
                 Device | INFO     | Network descriptor loaded /home/dcapalija/tt-metal/tt_metal/third_party/umd/device/../.umd/cluster_desc.yaml
2023-09-18 06:08:10.968 | INFO     | SiliconDriver   - Detected 1 PCI device
2023-09-18 06:08:11.025 | INFO     | SiliconDriver   - Using 1 Hugepages/NumHostMemChannels for TTDevice (pci_interface_id: 0 device_id: 0x401e revision: 1)
2023-09-18 06:08:11.025 | WARNING  | SiliconDriver   - hwloc_set_area_membind(): failed for physical_device_id: 0 on NodeSet: {1} with errno: Input/output error (pid: 157310 tid: 140332769195840)
---- ttSiliconDevice::init_hugepage: bind_area_to_memory_nodeset() failed (physical_device_id: 0 ch: 0). Hugepage allocation is not on NumaNode matching TT Device. Side-Effect is decreased Device->Host perf (Issue #893).
2023-09-18 06:08:11.050 | INFO     | SiliconDriver   - Disable PCIE DMA
              LLRuntime | INFO     | AI CLK for device 0 is:   1000 MHz
2023-09-18 06:08:11.476 | INFO     | tests.tt_eager.python_api_testing.sweep_tests.run_pytorch_ci_tests:run_single_pytorch_test:66 - Running with shape: [[5, 5, 50, 50]] on device: 0
2023-09-18 06:08:11.476 | INFO     | tests.tt_eager.python_api_testing.sweep_tests.common:run_tt_lib_test:42 - Running with args: {'dtype': [<DataType.BFLOAT16: 0>], 'layout': [<Layout.ROW_MAJOR: 0>], 'buffer_type': [<BufferType.DRAM: 0>], 'output_mem_config': tt::tt_metal::MemoryConfig(interleaved=true, buffer_type=BufferType::DRAM), 'output_tensor_start': [0, 0, 0, 0], 'output_tensor_end': [2, 1, 10, 27]}

Do we support multi-core tests on WH now? Or should I fix up and merge the single-core mode changes I was working on previously? (When enabled, and an op requests the grid size, it would return a 1x1 grid.) We could then force all the WH op tests to set this env var until multi-core is better tested, and selectively remove the flag later. It might also help increase WH test coverage: currently the WH tests only use specific shapes that trigger single-core (most are just single-tile tests, I think), but with the flag set you should be able to run all the original unit-test shapes, unless there is some other bug.

I think it would be great to have the ability to force single-core. We’d still want to run multi-core – as of now there’s no reason it shouldn’t work – but there could be a bug, and we need to decouple single-core debug from multi-core debug.
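The env-var gate described above could look something like this sketch – the variable name `TT_METAL_SINGLE_CORE_MODE` and the helper are hypothetical, not existing tt-metal code:

```python
import os

def get_compute_grid_size(device_grid):
    """Return the grid size an op should use for work distribution.

    device_grid is the full (rows, cols) compute grid of the device.
    TT_METAL_SINGLE_CORE_MODE is a hypothetical env var name used here
    purely for illustration.
    """
    if os.environ.get("TT_METAL_SINGLE_CORE_MODE") == "1":
        # Clamp to a 1x1 grid so single-core behavior can be debugged
        # independently of any multi-core issues.
        return (1, 1)
    return device_grid
```

Ops would call this instead of querying the device grid directly, so the same test shapes exercise both single-core and multi-core paths depending on the flag.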

We should do progressive WH testing:

  • run basic metal tests
  • run single-core OP tests
  • run multi-core OP tests
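The staged plan above could be driven by a small runner that stops at the first failing stage, so a multi-core hang cannot mask a basic or single-core regression. The stage commands below are placeholders, not the actual CI entry points:

```python
import subprocess

# Stages in order of increasing scope; commands are placeholders for the
# real test invocations.
STAGES = [
    ("basic metal tests",    ["true"]),
    ("single-core OP tests", ["true"]),
    ("multi-core OP tests",  ["true"]),
]

def run_stages(stages, runner=subprocess.run):
    """Run each stage in order; stop at the first failure.

    Returns the list of stage names that passed, so a hang or failure in a
    later stage is immediately attributable to that stage.
    """
    passed = []
    for name, cmd in stages:
        result = runner(cmd)
        if result.returncode != 0:
            print(f"stage failed: {name}")
            break
        passed.append(name)
    return passed
```

A real version would also want a per-stage timeout (e.g. `subprocess.run(cmd, timeout=...)`) so a hanging multi-core stage terminates rather than wedging CI.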