iree: Flaky memory error when running transform test

We’re seeing a flaky memory error when running iree/tests/transform_dialect/cpu/attention.mlir.test from the (not yet merged) https://github.com/openxla/iree/pull/13950. The error is the cryptic "corrupted size vs. prev_size", which comes from malloc and, I think, indicates that we’re writing to out-of-bounds memory. There’s no other context, and because that test pipes the compiler directly into the runtime, it’s not even clear which of the two the bug is in.

For background on the error, I found someone debugging it in https://stackoverflow.com/q/49628615

This initially failed in CI twice:

I was able to reproduce it by running the test up to 10,000 times locally:

DOCKER_HOST_WORKDIR=$PWD DOCKER_HOST_TMPDIR="$(mktemp -d)" \
  ./build_tools/docker/docker_run.sh \
  --env IREE_CUDA_DISABLE=1 \
  --env=IREE_CTEST_TESTS_REGEX=iree/tests/transform_dialect/cpu/attention.mlir.test \
  --env=IREE_CTEST_REPEAT_UNTIL_FAIL_COUNT=10000 \
  gcr.io/iree-oss/swiftshader@sha256:c9be5cbc8467499ae71ec80f3af87b72e746e8903cd52c0be9bb5f7261acc521 \
  ./build_tools/cmake/ctest_all.sh full-build-dir

and got a repro after 1.5 hours. We haven’t seen this on the ASan build.

The first step is to find a way to reproduce it that doesn’t require running the test serially 10,000 times.
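One possible way to shorten the search is to run several copies of the failing command concurrently and stop at the first non-zero exit. The sketch below is only an illustration of that idea, not something from the IREE repository: `run_until_failure` and its parameters are hypothetical names, and the command to run is left to the caller. Running copies concurrently also changes timing and may cause contention on shared files, so it is a starting point rather than a guaranteed faster repro.

```python
import concurrent.futures
import subprocess
import sys


def run_until_failure(cmd, attempts, workers):
    """Run `cmd` up to `attempts` times using `workers` concurrent threads.

    Returns (attempt_index, CompletedProcess) for the first failing run
    that is observed, or None if every run exits with status 0.
    """

    def run_once(index):
        # Capture output so the failing run's stderr can be inspected later.
        result = subprocess.run(cmd, capture_output=True, text=True)
        return index, result

    with concurrent.futures.ThreadPoolExecutor(max_workers=workers) as pool:
        futures = [pool.submit(run_once, i) for i in range(attempts)]
        try:
            for future in concurrent.futures.as_completed(futures):
                index, result = future.result()
                # On POSIX a negative return code means the process was
                # killed by a signal (an abort() shows up as -6).
                if result.returncode != 0:
                    return index, result
        finally:
            # Drop runs that have not started yet; runs already in
            # progress are allowed to finish when the pool shuts down.
            for future in futures:
                future.cancel()
    return None


# Hypothetical usage: replace the command with the real test invocation.
# failure = run_until_failure([sys.executable, "-c", "pass"], 10000, 8)
# if failure is not None:
#     print(failure[1].returncode, failure[1].stderr)
```

Threads are enough here because each worker only waits on a child process; the actual work happens in the subprocesses.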

About this issue

  • State: closed
  • Created a year ago
  • Comments: 21 (16 by maintainers)

Most upvoted comments

Fantastic triaging! Thanks, @GMNGeoffrey! This is perfect! It seems the issue is that I am trying to distribute the batch dimension (which has size 1) across 2 threads. If I change num_threads to 1 in the transform dialect script, I don’t get the error. I will update my patch. Thanks again for taking the time to nail this down!
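To see why distributing a size-1 dimension across 2 threads could lead to an out-of-bounds write, here is a toy model of the tiling arithmetic. It is an assumption-laden sketch, not how IREE or the transform dialect actually computes thread tiles: `thread_tiles` and `out_of_bounds` are hypothetical helpers, and the model assumes each thread gets an unclamped tile of `ceil(dim_size / num_threads)` elements.

```python
def thread_tiles(dim_size, num_threads):
    """Assign each thread an unclamped tile of ceil(dim_size / num_threads)."""
    tile = -(-dim_size // num_threads)  # ceiling division
    tiles = []
    for tid in range(num_threads):
        start = tid * tile
        stop = start + tile  # not clamped to dim_size (assumed naive mapping)
        tiles.append((tid, start, stop))
    return tiles


def out_of_bounds(dim_size, num_threads):
    """Return the (thread, start, stop) tiles that reach past dim_size."""
    return [
        (tid, start, stop)
        for tid, start, stop in thread_tiles(dim_size, num_threads)
        if stop > dim_size
    ]


# Batch dimension of size 1 split across 2 threads: in this model the
# second thread is handed index 1, which does not exist.
print(out_of_bounds(1, 2))
# With a single thread, every tile stays in range.
print(out_of_bounds(1, 1))
```

In this model the fix mirrors the one described in the comment: with num_threads set to 1, no thread is assigned indices past the end of the dimension.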