iree: Caching allocator causes wrong computation result

What happened?

Enabling the caching allocator causes numerical errors. In this branch I have modified the Python binding, per @benvanik's suggestion, to quickly enable the caching allocator while waiting for the Python API to catch up. The branch includes a test that reproduces the problem, though I am not convinced it will reproduce reliably on other platforms.

Steps to reproduce your issue

On the aforementioned branch, run:

python ./tests/e2e/models/stateful_model_test/stateful_model_test.py

What component(s) does this issue relate to?

Compiler, Runtime

Version information

No response

Additional context

https://discord.com/channels/689900678990135345/706175572920762449/1073493760429789254

Edit:

A more up-to-date version of the test can be found here.

About this issue

  • Original URL
  • State: closed
  • Created a year ago
  • Comments: 56 (28 by maintainers)

Most upvoted comments

You are absolutely correct, and that’s the ticket! I’ll get this fixed now!

Thank you, Daniel - I think you’ve identified the bug. Thank you for taking the time to understand how all of this works through the Python layer.

Indeed, I’m not seeing where anything would handle the initial data when a cached buffer is found. This likely doesn’t happen much (or at all) in the steady state of operation, only at these edges (https://github.com/openxla/iree/blob/689e0fa332fda93404462636fa60fdaf96486aa8/runtime/src/iree/hal/utils/caching_allocator.c#L260).
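
To make the failure mode in the comment above concrete, here is a toy model in Python. It is only a sketch of the general pattern (a cache hit returning a reused buffer without writing the caller's initial data), not the code in caching_allocator.c. The class name, the `copy_initial_data_on_hit` flag, and the size-keyed pool are all hypothetical and exist only for illustration.

```python
class ToyCachingAllocator:
    """Toy size-keyed buffer cache. Hypothetical; not IREE's implementation."""

    def __init__(self, copy_initial_data_on_hit):
        # Hypothetical switch: False models the suspected bug, True the fix.
        self.copy_initial_data_on_hit = copy_initial_data_on_hit
        self.free_buffers = {}  # size -> list of released bytearrays

    def allocate(self, size, initial_data=None):
        pool = self.free_buffers.get(size)
        if pool:
            # Cache hit: the reused buffer still holds its previous contents.
            buffer = pool.pop()
            if initial_data is not None and self.copy_initial_data_on_hit:
                buffer[:len(initial_data)] = initial_data
            return buffer
        # Cache miss: fresh zeroed buffer, initialized from initial_data.
        buffer = bytearray(size)
        if initial_data is not None:
            buffer[:len(initial_data)] = initial_data
        return buffer

    def release(self, buffer):
        # Return the buffer to the pool instead of freeing it.
        self.free_buffers.setdefault(len(buffer), []).append(buffer)


def run(copy_initial_data_on_hit):
    allocator = ToyCachingAllocator(copy_initial_data_on_hit)
    first = allocator.allocate(4, initial_data=b"\x01\x02\x03\x04")
    allocator.release(first)
    # Same size, so this request is served from the cache.
    second = allocator.allocate(4, initial_data=b"\x09\x09\x09\x09")
    return bytes(second)


buggy_result = run(copy_initial_data_on_hit=False)
fixed_result = run(copy_initial_data_on_hit=True)
print("without copy on hit:", buggy_result)
print("with copy on hit:   ", fixed_result)
```

In this model the first allocation is always correct because it misses the cache. Only a later same-size request that hits the cache sees stale contents, which would be consistent with the problem showing up at the edges rather than in steady-state operation.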

Do you mean the printed output tensors? I need to add a flag, but for now you can hack in a number larger than 1024 here: https://github.com/google/iree/blob/c1def3f696c3f8246249d2933b0d5532430cbd81/runtime/src/iree/tooling/trace_replay.c/#L1010-L1012