iree: Caching allocator causes wrong computation result

What happened?

Enabling the caching allocator causes numerical errors. In this branch I have modified the Python binding, per @benvanik's suggestion, to quickly enable the caching allocator while waiting for the Python API to catch up. There is a test there that reproduces the problem. I am not sure the test will reproduce the problem reliably on other platforms.

Steps to reproduce your issue

On the aforementioned branch, run:

python ./tests/e2e/models/stateful_model_test/stateful_model_test.py

What component(s) does this issue relate to?

Compiler, Runtime

Version information

No response

Additional context

https://discord.com/channels/689900678990135345/706175572920762449/1073493760429789254

Edit:

A more up-to-date version of the test can be found here.

About this issue

State: closed
Created a year ago
Comments: 56 (28 by maintainers)

Commits related to this issue

Upload initial_data when returning existing cached buffers. Previously the code was double initializing new buffers (stupid but correct) while not initializing old buffers (stupid _and_ incorrect). F... — committed to iree-org/iree by benvanik a year ago
Upload initial_data when returning existing cached buffers. (#12660) Previously the code was double initializing new buffers (stupid but correct) while not initializing old buffers (stupid _and_ inc... — committed to iree-org/iree by benvanik a year ago
Upload initial_data when returning existing cached buffers. (#12660) Previously the code was double initializing new buffers (stupid but correct) while not initializing old buffers (stupid _and_ inc... — committed to qedawkins/iree by benvanik a year ago
Upload initial_data when returning existing cached buffers. (#12660) Previously the code was double initializing new buffers (stupid but correct) while not initializing old buffers (stupid _and_ inc... — committed to NatashaKnk/iree by benvanik a year ago

Most upvoted comments

You are absolutely correct, and that’s the ticket! I’ll get this fixed now!

benvanik on Mar 16, 2023

Thank you, Daniel - I think you’ve identified the bug. Thank you for taking the time to understand how all of this works through the Python layer.

Indeed, I’m not seeing where anything would be handling the initial data in the found case. This likely doesn’t happen much (or at all) in the steady state of operation but just at these edges (https://github.com/openxla/iree/blob/689e0fa332fda93404462636fa60fdaf96486aa8/runtime/src/iree/hal/utils/caching_allocator.c#L260).

stellaraccident on Mar 16, 2023
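
The bug described in the comment above and in the related commit message (initial data not uploaded when an existing cached buffer is returned) can be illustrated with a toy model. This is only a sketch under stated assumptions: `ToyCachingAllocator`, its `fix_enabled` flag, and the `run` helper are hypothetical names invented for illustration, not IREE's API, and the real allocator in `caching_allocator.c` is considerably more involved. The sketch also does not model the redundant double initialization of new buffers that the commit message mentions.

```python
class ToyCachingAllocator:
    """Hypothetical stand-in for a caching allocator; not IREE's API."""

    def __init__(self, fix_enabled):
        # fix_enabled is an illustrative switch between the buggy and
        # the corrected behavior of the cache-hit path.
        self.fix_enabled = fix_enabled
        self.free_list = []  # released buffers waiting to be reused

    def allocate(self, size, initial_data=None):
        # Cache-hit ("found") path: reuse a released buffer of the same size.
        for i, buf in enumerate(self.free_list):
            if len(buf) == size:
                del self.free_list[i]
                # The buggy variant skips this upload, so the caller
                # receives whatever the previous user left in the buffer.
                if self.fix_enabled and initial_data is not None:
                    buf[:len(initial_data)] = initial_data
                return buf
        # Cache-miss path: a fresh buffer is always initialized.
        buf = bytearray(size)
        if initial_data is not None:
            buf[:len(initial_data)] = initial_data
        return buf

    def release(self, buf):
        # Keep the buffer, including its stale contents, for later reuse.
        self.free_list.append(buf)


def run(fix_enabled):
    allocator = ToyCachingAllocator(fix_enabled)
    first = allocator.allocate(4, initial_data=b"\x01\x02\x03\x04")
    allocator.release(first)
    # The second request hits the cache and gets the same storage back.
    second = allocator.allocate(4, initial_data=b"\x09\x09\x09\x09")
    return bytes(second)
```

With `fix_enabled=False`, the second allocation returns the stale contents of the first buffer, which is the kind of silent data corruption that would show up as numerical errors. With `fix_enabled=True`, the requested initial data is written on both paths. Because the bug only triggers on a cache hit that also carries initial data, it is plausible that it appears only in particular allocation patterns rather than in steady-state operation.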

Do you mean the printed output tensors? I need to add a flag, but you can hack a larger number than 1024 in here: https://github.com/google/iree/blob/c1def3f696c3f8246249d2933b0d5532430cbd81/runtime/src/iree/tooling/trace_replay.c/#L1010-L1012

benvanik on Feb 28, 2023