iree: Caching allocator causes wrong computation result
What happened?
Enabling the caching allocator causes numerical errors. In this branch I have modified the Python binding, per @benvanik's suggestion, to quickly enable the caching allocator while waiting for the Python API to catch up. The branch contains a test that reproduces the problem, though I am not convinced it will reproduce reliably on other platforms.
Steps to reproduce your issue
On the aforementioned branch, run:
python ./tests/e2e/models/stateful_model_test/stateful_model_test.py
What component(s) does this issue relate to?
Compiler, Runtime
Version information
No response
Additional context
https://discord.com/channels/689900678990135345/706175572920762449/1073493760429789254
Edit:
A more up-to-date version of the test can be found here.
About this issue
- Original URL
- State: closed
- Created a year ago
- Comments: 56 (28 by maintainers)
Commits related to this issue
- Upload initial_data when returning existing cached buffers. Previously the code was double initializing new buffers (stupid but correct) while not initializing old buffers (stupid _and_ incorrect). F... — committed to iree-org/iree by benvanik a year ago
- Upload initial_data when returning existing cached buffers. (#12660) Previously the code was double initializing new buffers (stupid but correct) while not initializing old buffers (stupid _and_ inc... — committed to iree-org/iree by benvanik a year ago
- Upload initial_data when returning existing cached buffers. (#12660) Previously the code was double initializing new buffers (stupid but correct) while not initializing old buffers (stupid _and_ inc... — committed to qedawkins/iree by benvanik a year ago
- Upload initial_data when returning existing cached buffers. (#12660) Previously the code was double initializing new buffers (stupid but correct) while not initializing old buffers (stupid _and_ inc... — committed to NatashaKnk/iree by benvanik a year ago
You are absolutely correct, and that’s the ticket! I’ll get this fixed now!
Thank you, Daniel - I think you’ve identified the bug. Thank you for taking the time to understand how all of this works through the Python layer.
Indeed, I’m not seeing where anything would be handling the initial data in the found case. This likely doesn’t happen much (or at all) in the steady state of operation but just at these edges (https://github.com/openxla/iree/blob/689e0fa332fda93404462636fa60fdaf96486aa8/runtime/src/iree/hal/utils/caching_allocator.c#L260).
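To make the "found case" concrete, here is a minimal toy model in Python. It is only a sketch of the failure mode described in this thread, not the IREE C implementation: `ToyCachingAllocator`, `upload_on_reuse`, and `roundtrip` are hypothetical names invented for illustration. With `upload_on_reuse=False`, a buffer returned from the cache keeps its stale contents; with `upload_on_reuse=True`, the initial data is written on both the hit and the miss paths.

```python
class ToyCachingAllocator:
    """Toy model of a caching allocator (hypothetical, not IREE's code)."""

    def __init__(self, upload_on_reuse):
        # Assumption for illustration: a flag that switches between the
        # buggy behavior (False) and the fixed behavior (True).
        self.upload_on_reuse = upload_on_reuse
        self.free_list = []  # released buffers available for reuse

    def allocate(self, size, initial_data=None):
        # "Found" case: reuse a released buffer of the requested size.
        for i, cached in enumerate(self.free_list):
            if len(cached) == size:
                buf = self.free_list.pop(i)
                if initial_data is not None and self.upload_on_reuse:
                    buf[:len(initial_data)] = initial_data
                return buf
        # "Miss" case: create a fresh buffer and initialize it.
        buf = bytearray(size)
        if initial_data is not None:
            buf[:len(initial_data)] = initial_data
        return buf

    def release(self, buf):
        # Keep the buffer (and its old contents) around for later reuse.
        self.free_list.append(buf)


def roundtrip(upload_on_reuse):
    """Allocate, release, then allocate again with different initial data."""
    alloc = ToyCachingAllocator(upload_on_reuse)
    first = alloc.allocate(4, initial_data=b"\x01\x02\x03\x04")
    alloc.release(first)
    second = alloc.allocate(4, initial_data=b"\x09\x09\x09\x09")
    return bytes(second)
```

In this model, `roundtrip(False)` hands back the contents of the first allocation, while `roundtrip(True)` hands back the requested initial data. That also matches the observation that the problem shows up only when a cached buffer is reused with initial data, which may be rare in steady-state operation.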
Do you mean the output tensors printed? I need to add a flag, but you can hack in a number larger than 1024 here: https://github.com/google/iree/blob/c1def3f696c3f8246249d2933b0d5532430cbd81/runtime/src/iree/tooling/trace_replay.c/#L1010-L1012