tt-metal: Investigate issue with a non-deterministic `_Map_base::at` error during FD post commit
We’re seeing this non-deterministically on pipelines with a recent Python environment upgrade:
2024-02-20T04:09:47.3596440Z tests/ttnn/unit_tests/operations/test_creation.py::test_zeros[input_shapes=[2, 1280, 4, 4]] Metal | INFO | Initializing device 0
2024-02-20T04:09:47.4838969Z Metal | INFO | AI CLK for device 0 is: 250 MHz
2024-02-20T04:09:47.5462416Z PASSED Metal | INFO | Closing device 0
2024-02-20T04:09:47.6723165Z Op | INFO | Program Cache: disabled and cleared.
2024-02-20T04:09:47.6727408Z terminate called after throwing an instance of 'std::out_of_range'
2024-02-20T04:09:47.6728190Z what(): _Map_base::at
2024-02-20T04:09:47.6728756Z Fatal Python error: Aborted
2024-02-20T04:09:47.6729041Z
2024-02-20T04:09:47.6729276Z Thread 0x00007fb4d72bc700 (most recent call first):
2024-02-20T04:09:47.6734831Z File "/usr/lib/python3.8/threading.py", line 306 in wait
2024-02-20T04:09:47.6737440Z File "/usr/lib/python3.8/threading.py", line 558 in wait
2024-02-20T04:09:47.6739498Z File "/home/ubuntu/actions-runner/_work/tt-metal/tt-metal/build/python_env/lib/python3.8/site-packages/tqdm/_monitor.py", line 60 in run
2024-02-20T04:09:47.6741288Z File "/usr/lib/python3.8/threading.py", line 932 in _bootstrap_inner
2024-02-20T04:09:47.6742222Z File "/usr/lib/python3.8/threading.py", line 890 in _bootstrap
2024-02-20T04:09:47.6742741Z
2024-02-20T04:09:47.6743031Z Thread 0x00007fb630605740 (most recent call first):
2024-02-20T04:09:47.6744513Z File "/home/ubuntu/actions-runner/_work/tt-metal/tt-metal/build/python_env/lib/python3.8/site-packages/_pytest/runner.py", line 137 in runtestprotocol
2024-02-20T04:09:47.6746471Z File "/home/ubuntu/actions-runner/_work/tt-metal/tt-metal/build/python_env/lib/python3.8/site-packages/_pytest/runner.py", line 112 in pytest_runtest_protocol
2024-02-20T04:09:47.6748479Z File "/home/ubuntu/actions-runner/_work/tt-metal/tt-metal/build/python_env/lib/python3.8/site-packages/pluggy/_callers.py", line 102 in _multicall
2024-02-20T04:09:47.6750238Z File "/home/ubuntu/actions-runner/_work/tt-metal/tt-metal/build/python_env/lib/python3.8/site-packages/pluggy/_manager.py", line 119 in _hookexec
2024-02-20T04:09:47.6751956Z File "/home/ubuntu/actions-runner/_work/tt-metal/tt-metal/build/python_env/lib/python3.8/site-packages/pluggy/_hooks.py", line 501 in __call__
2024-02-20T04:09:47.6753695Z File "/home/ubuntu/actions-runner/_work/tt-metal/tt-metal/build/python_env/lib/python3.8/site-packages/_pytest/main.py", line 349 in pytest_runtestloop
2024-02-20T04:09:47.6755476Z File "/home/ubuntu/actions-runner/_work/tt-metal/tt-metal/build/python_env/lib/python3.8/site-packages/pluggy/_callers.py", line 102 in _multicall
2024-02-20T04:09:47.6757270Z File "/home/ubuntu/actions-runner/_work/tt-metal/tt-metal/build/python_env/lib/python3.8/site-packages/pluggy/_manager.py", line 119 in _hookexec
2024-02-20T04:09:47.6759183Z File "/home/ubuntu/actions-runner/_work/tt-metal/tt-metal/build/python_env/lib/python3.8/site-packages/pluggy/_hooks.py", line 501 in __call__
2024-02-20T04:09:47.6760892Z File "/home/ubuntu/actions-runner/_work/tt-metal/tt-metal/build/python_env/lib/python3.8/site-packages/_pytest/main.py", line 324 in _main
2024-02-20T04:09:47.6762577Z File "/home/ubuntu/actions-runner/_work/tt-metal/tt-metal/build/python_env/lib/python3.8/site-packages/_pytest/main.py", line 270 in wrap_session
2024-02-20T04:09:47.6764592Z File "/home/ubuntu/actions-runner/_work/tt-metal/tt-metal/build/python_env/lib/python3.8/site-packages/_pytest/main.py", line 317 in pytest_cmdline_main
2024-02-20T04:09:47.6766399Z File "/home/ubuntu/actions-runner/_work/tt-metal/tt-metal/build/python_env/lib/python3.8/site-packages/pluggy/_callers.py", line 102 in _multicall
2024-02-20T04:09:47.6768195Z File "/home/ubuntu/actions-runner/_work/tt-metal/tt-metal/build/python_env/lib/python3.8/site-packages/pluggy/_manager.py", line 119 in _hookexec
2024-02-20T04:09:47.6769977Z File "/home/ubuntu/actions-runner/_work/tt-metal/tt-metal/build/python_env/lib/python3.8/site-packages/pluggy/_hooks.py", line 501 in __call__
2024-02-20T04:09:47.6771696Z File "/home/ubuntu/actions-runner/_work/tt-metal/tt-metal/build/python_env/lib/python3.8/site-packages/_pytest/config/__init__.py", line 167 in main
2024-02-20T04:09:47.6775406Z File "/home/ubuntu/actions-runner/_work/tt-metal/tt-metal/build/python_env/lib/python3.8/site-packages/_pytest/config/__init__.py", line 190 in console_main
2024-02-20T04:09:47.6777125Z File "/home/ubuntu/actions-runner/_work/tt-metal/tt-metal/build/python_env/bin/pytest", line 8 in <module>
2024-02-20T04:09:48.0018693Z ./tests/scripts/run_python_api_unit_tests.sh: line 57: 1729937 Aborted (core dumped) env pytest $TT_METAL_HOME/tests/ttnn/unit_tests
2024-02-20T04:09:48.0044425Z ##[error]Process completed with exit code 134.
About this issue
- State: closed
- Created 4 months ago
- Comments: 21 (8 by maintainers)
Commits related to this issue
- #5492: Assert before exiting completion queue thread to ensure users see that they did not end their tests with a blocking call. This is not a full fix, since we should not expect users to have to blo... — committed to tenstorrent/tt-metal by DrJessop 4 months ago
Although we have other deterministic issues in nightlies
Not in a while!
@TT-billteng it is a non-deterministic problem that happens if people don’t block at the end of a test. It seems people have been seeing it more often due to timing changes from the package update commit.
I do not have cycles to make a test for this right now, but I do have a PR that makes the failure easier to read when we do hit it: https://github.com/tenstorrent-metal/tt-metal/pull/5497/files
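For anyone trying to picture the failure mode described above, here is a minimal Python sketch of the general pattern. It is not tt-metal code. `ToyDevice`, `enqueue_read`, `finish`, `close`, and the pause/resume helpers are all hypothetical names made up for illustration, and the Python `KeyError` only stands in for the C++ `std::out_of_range` from `_Map_base::at`. The assumption is that a background thread looks up state that teardown clears, so skipping the blocking call lets teardown race the worker. The `close()` check mirrors the idea in #5492 of failing loudly with a readable message instead of crashing later.

```python
import queue
import threading


class ToyDevice:
    """Hypothetical device whose reads complete on a background thread."""

    def __init__(self):
        self.buffers = {1: "data"}        # state the worker looks up by key
        self.results = []
        self._pending = queue.Queue()
        self._lock = threading.Lock()
        self._outstanding = 0             # reads enqueued but not yet completed
        self._resume = threading.Event()  # demo-only switch to stall the worker
        self._resume.set()
        self._worker = threading.Thread(target=self._drain, daemon=True)
        self._worker.start()

    def pause_worker(self):
        self._resume.clear()

    def resume_worker(self):
        self._resume.set()

    def enqueue_read(self, key):
        """Non-blocking: returns before the read has completed."""
        with self._lock:
            self._outstanding += 1
        self._pending.put(key)

    def _drain(self):
        while True:
            key = self._pending.get()
            if key is None:               # shutdown sentinel
                return
            self._resume.wait()
            # Stand-in for unordered_map::at: raises KeyError if teardown
            # cleared the entry before the worker got here.
            self.results.append(self.buffers[key])
            with self._lock:
                self._outstanding -= 1
            self._pending.task_done()

    def finish(self):
        """Blocking call: returns only after every queued read has completed."""
        self._pending.join()

    def close(self):
        with self._lock:
            outstanding = self._outstanding
        if outstanding:
            # Readable failure instead of a crash inside the worker thread.
            raise RuntimeError(
                f"close() called with {outstanding} read(s) in flight; "
                "call finish() first"
            )
        self.buffers.clear()
        self._pending.put(None)
        self._worker.join()


if __name__ == "__main__":
    device = ToyDevice()
    device.enqueue_read(1)
    device.finish()   # the blocking call at the end of the test
    device.close()
    print(device.results)
```

In this toy version, dropping the `finish()` call makes the outcome depend on whether the worker happens to finish before `close()` runs, which is why a small timing change (such as a package upgrade) could make a latent race show up more often.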