tt-metal: Investigate issue with a non-deterministic `_Map_base::at` error during FD post commit

We’re seeing this non-deterministically on pipelines with a recent Python environment upgrade:

2024-02-20T04:09:47.3596440Z tests/ttnn/unit_tests/operations/test_creation.py::test_zeros[input_shapes=[2, 1280, 4, 4]] Metal | INFO | Initializing device 0
2024-02-20T04:09:47.4838969Z Metal | INFO | AI CLK for device 0 is:   250 MHz
2024-02-20T04:09:47.5462416Z PASSED Metal | INFO | Closing device 0
2024-02-20T04:09:47.6723165Z Op | INFO | Program Cache: disabled and cleared.
2024-02-20T04:09:47.6727408Z terminate called after throwing an instance of 'std::out_of_range'
2024-02-20T04:09:47.6728190Z   what():  _Map_base::at
2024-02-20T04:09:47.6728756Z Fatal Python error: Aborted
2024-02-20T04:09:47.6729041Z
2024-02-20T04:09:47.6729276Z Thread 0x00007fb4d72bc700 (most recent call first):
2024-02-20T04:09:47.6734831Z   File "/usr/lib/python3.8/threading.py", line 306 in wait
2024-02-20T04:09:47.6737440Z   File "/usr/lib/python3.8/threading.py", line 558 in wait
2024-02-20T04:09:47.6739498Z   File "/home/ubuntu/actions-runner/_work/tt-metal/tt-metal/build/python_env/lib/python3.8/site-packages/tqdm/_monitor.py", line 60 in run
2024-02-20T04:09:47.6741288Z   File "/usr/lib/python3.8/threading.py", line 932 in _bootstrap_inner
2024-02-20T04:09:47.6742222Z   File "/usr/lib/python3.8/threading.py", line 890 in _bootstrap
2024-02-20T04:09:47.6742741Z
2024-02-20T04:09:47.6743031Z Thread 0x00007fb630605740 (most recent call first):
2024-02-20T04:09:47.6744513Z   File "/home/ubuntu/actions-runner/_work/tt-metal/tt-metal/build/python_env/lib/python3.8/site-packages/_pytest/runner.py", line 137 in runtestprotocol
2024-02-20T04:09:47.6746471Z   File "/home/ubuntu/actions-runner/_work/tt-metal/tt-metal/build/python_env/lib/python3.8/site-packages/_pytest/runner.py", line 112 in pytest_runtest_protocol
2024-02-20T04:09:47.6748479Z   File "/home/ubuntu/actions-runner/_work/tt-metal/tt-metal/build/python_env/lib/python3.8/site-packages/pluggy/_callers.py", line 102 in _multicall
2024-02-20T04:09:47.6750238Z   File "/home/ubuntu/actions-runner/_work/tt-metal/tt-metal/build/python_env/lib/python3.8/site-packages/pluggy/_manager.py", line 119 in _hookexec
2024-02-20T04:09:47.6751956Z   File "/home/ubuntu/actions-runner/_work/tt-metal/tt-metal/build/python_env/lib/python3.8/site-packages/pluggy/_hooks.py", line 501 in __call__
2024-02-20T04:09:47.6753695Z   File "/home/ubuntu/actions-runner/_work/tt-metal/tt-metal/build/python_env/lib/python3.8/site-packages/_pytest/main.py", line 349 in pytest_runtestloop
2024-02-20T04:09:47.6755476Z   File "/home/ubuntu/actions-runner/_work/tt-metal/tt-metal/build/python_env/lib/python3.8/site-packages/pluggy/_callers.py", line 102 in _multicall
2024-02-20T04:09:47.6757270Z   File "/home/ubuntu/actions-runner/_work/tt-metal/tt-metal/build/python_env/lib/python3.8/site-packages/pluggy/_manager.py", line 119 in _hookexec
2024-02-20T04:09:47.6759183Z   File "/home/ubuntu/actions-runner/_work/tt-metal/tt-metal/build/python_env/lib/python3.8/site-packages/pluggy/_hooks.py", line 501 in __call__
2024-02-20T04:09:47.6760892Z   File "/home/ubuntu/actions-runner/_work/tt-metal/tt-metal/build/python_env/lib/python3.8/site-packages/_pytest/main.py", line 324 in _main
2024-02-20T04:09:47.6762577Z   File "/home/ubuntu/actions-runner/_work/tt-metal/tt-metal/build/python_env/lib/python3.8/site-packages/_pytest/main.py", line 270 in wrap_session
2024-02-20T04:09:47.6764592Z   File "/home/ubuntu/actions-runner/_work/tt-metal/tt-metal/build/python_env/lib/python3.8/site-packages/_pytest/main.py", line 317 in pytest_cmdline_main
2024-02-20T04:09:47.6766399Z   File "/home/ubuntu/actions-runner/_work/tt-metal/tt-metal/build/python_env/lib/python3.8/site-packages/pluggy/_callers.py", line 102 in _multicall
2024-02-20T04:09:47.6768195Z   File "/home/ubuntu/actions-runner/_work/tt-metal/tt-metal/build/python_env/lib/python3.8/site-packages/pluggy/_manager.py", line 119 in _hookexec
2024-02-20T04:09:47.6769977Z   File "/home/ubuntu/actions-runner/_work/tt-metal/tt-metal/build/python_env/lib/python3.8/site-packages/pluggy/_hooks.py", line 501 in __call__
2024-02-20T04:09:47.6771696Z   File "/home/ubuntu/actions-runner/_work/tt-metal/tt-metal/build/python_env/lib/python3.8/site-packages/_pytest/config/__init__.py", line 167 in main
2024-02-20T04:09:47.6775406Z   File "/home/ubuntu/actions-runner/_work/tt-metal/tt-metal/build/python_env/lib/python3.8/site-packages/_pytest/config/__init__.py", line 190 in console_main
2024-02-20T04:09:47.6777125Z   File "/home/ubuntu/actions-runner/_work/tt-metal/tt-metal/build/python_env/bin/pytest", line 8 in <module>
2024-02-20T04:09:48.0018693Z ./tests/scripts/run_python_api_unit_tests.sh: line 57: 1729937 Aborted                 (core dumped) env pytest $TT_METAL_HOME/tests/ttnn/unit_tests
2024-02-20T04:09:48.0044425Z ##[error]Process completed with exit code 134.
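
For readers unfamiliar with the message: `_Map_base::at` is the `what()` text that libstdc++ (in at least some versions) attaches to the `std::out_of_range` thrown by `std::unordered_map::at` when the key is missing. An uncaught exception then reaches `std::terminate`, which aborts the process; exit code 134 is 128 + 6 (SIGABRT). The log does not say which container is involved, so the sketch below is only an illustration of that mechanism. `DeviceTable`, `at_throws`, and `find_or_null` are hypothetical names, not tt-metal code.

```cpp
#include <stdexcept>
#include <string>
#include <unordered_map>

// Hypothetical table keyed by device id (an assumption for illustration,
// not a structure taken from tt-metal).
using DeviceTable = std::unordered_map<int, std::string>;

// Returns true if at() throws std::out_of_range for this id.
// The exact what() string is implementation-specific, so it is not checked.
bool at_throws(const DeviceTable& table, int id) {
    try {
        (void)table.at(id);
        return false;
    } catch (const std::out_of_range&) {
        return true;
    }
}

// Non-throwing lookup: returns nullptr when the id is absent,
// so the caller can report a readable error instead of aborting.
const std::string* find_or_null(const DeviceTable& table, int id) {
    auto it = table.find(id);
    return it == table.end() ? nullptr : &it->second;
}
```

Using `find` and checking the result is one way to turn a bare abort into a diagnosable error, at the cost of handling the missing-key case explicitly at each call site.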

About this issue

  • State: closed
  • Created 4 months ago
  • Comments: 21 (8 by maintainers)

Most upvoted comments

Although we have other deterministic issues in nightlies

Not in a while!

@TT-billteng it is a non-deterministic problem that happens if people don’t block at the end of a test. It seems people have been seeing it more often due to timing changes from the package update commit.

I do not have cycles to make a test for this right now, but I do have a PR that improves readability of the issue in cases where we do run into this here: https://github.com/tenstorrent-metal/tt-metal/pull/5497/files
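
The failure mode described in the comment above (work still in flight when a test ends without blocking) can be sketched roughly as follows. This is a single-threaded model under stated assumptions, not tt-metal's implementation: `FakeRuntime`, `enqueue`, `synchronize`, `close_device`, and `run_test` are all hypothetical names, and a plain queue stands in for asynchronous dispatch so that the bad ordering is reproducible rather than timing-dependent.

```cpp
#include <functional>
#include <queue>
#include <stdexcept>
#include <unordered_map>
#include <utility>

// Hypothetical stand-in for a device runtime; none of these are tt-metal APIs.
class FakeRuntime {
public:
    void open_device(int id) { completed_[id] = 0; }

    // Queues work instead of running it, mimicking asynchronous dispatch.
    void enqueue(int id) {
        pending_.push([this, id] { completed_.at(id) += 1; });
    }

    // "Blocks" until all queued work has run (here it simply runs it).
    void synchronize() {
        while (!pending_.empty()) {
            auto work = std::move(pending_.front());
            pending_.pop();
            work();  // throws std::out_of_range if the device entry is gone
        }
    }

    // Drops the device's state; pending work is NOT drained first.
    void close_device(int id) { completed_.erase(id); }

private:
    std::unordered_map<int, int> completed_;
    std::queue<std::function<void()>> pending_;
};

// Returns true if leftover work ran cleanly, false if it hit the missing entry.
bool run_test(bool block_before_close) {
    FakeRuntime runtime;
    runtime.open_device(0);
    runtime.enqueue(0);
    if (block_before_close) {
        runtime.synchronize();  // the "block at the end of a test" step
    }
    runtime.close_device(0);
    try {
        runtime.synchronize();  // stands in for background work catching up late
        return true;
    } catch (const std::out_of_range&) {
        return false;
    }
}
```

In this model `run_test(true)` drains the queue before the device entry is erased, while `run_test(false)` leaves work queued that later looks up an erased key. In a real multithreaded runtime the same exception would escape on a worker thread and end in `std::terminate`, and whether it fires would depend on timing, which would be consistent with the failure appearing more often after the package update.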