pocl: [Regression] Valgrind "invalid read" on relatively simple kernel

Running this script

  • on pocl 68273059 compiled against Debian’s LLVM/Clang 1:9~+rc2-1~exp2
  • on pocl 6f31aa89 compiled against Debian’s LLVM/Clang 1:8.0.1-3+b1

on a Haswell machine using

PYOPENCL_TEST=port valgrind --suppressions=python-simple.supp python pocl-trigger-bug.py

produces this output, which includes these warnings:

==6481== 
==6481== Invalid read of size 32
==6481==    at 0x6C91127A: _pocl_kernel_grudge_assign_0_workgroup (in /home/andreask_work/.cache/pocl/kcache/IB/DGOCMKFDFAGEGJKLHGHENBDDMIOFGBAMKFILN/grudge_assign_0/128-1-1-goffs0-smallgrid/grudge_assign_0.so)
==6481==    by 0x4A37FA2: start_thread (pthread_create.c:486)
==6481==    by 0x49684CE: clone (clone.S:95)
==6481==  Address 0x45bcd6f20 is 122,528 bytes inside a block of size 122,624 in arena "client"
==6481== 
==6481== Invalid read of size 32
==6481==    at 0x6C911280: _pocl_kernel_grudge_assign_0_workgroup (in /home/andreask_work/.cache/pocl/kcache/IB/DGOCMKFDFAGEGJKLHGHENBDDMIOFGBAMKFILN/grudge_assign_0/128-1-1-goffs0-smallgrid/grudge_assign_0.so)
==6481==    by 0x4A37FA2: start_thread (pthread_create.c:486)
==6481==    by 0x49684CE: clone (clone.S:95)
==6481==  Address 0x45bcd6f40 is 122,560 bytes inside a block of size 122,624 in arena "client"
==6481== 
==6481== Invalid read of size 32
==6481==    at 0x6C911286: _pocl_kernel_grudge_assign_0_workgroup (in /home/andreask_work/.cache/pocl/kcache/IB/DGOCMKFDFAGEGJKLHGHENBDDMIOFGBAMKFILN/grudge_assign_0/128-1-1-goffs0-smallgrid/grudge_assign_0.so)
==6481==    by 0x4A37FA2: start_thread (pthread_create.c:486)
==6481==    by 0x49684CE: clone (clone.S:95)
==6481==  Address 0x45bcd6f60 is 122,592 bytes inside a block of size 122,624 in arena "client"
==6481== 

The suppression file python-simple.supp can be found in this gist.

This was observed in the wild in a PDE solver, where it caused occasional crashes. The code included with this issue is a simplified version of that kernel.

Valgrind also highlights a suspicious use of an uninitialised value during compilation, which may be related:

==6481== Conditional jump or move depends on uninitialised value(s)
==6481==    at 0x731D9CF7: ??? (in /usr/lib/x86_64-linux-gnu/libLLVM-9.so.1)
==6481==    by 0x7312A367: llvm::MachineFunctionPass::runOnFunction(llvm::Function&) (in /usr/lib/x86_64-linux-gnu/libLLVM-9.so.1)
==6481==    by 0x72FA48A5: llvm::FPPassManager::runOnFunction(llvm::Function&) (in /usr/lib/x86_64-linux-gnu/libLLVM-9.so.1)
==6481==    by 0x72FA4B52: llvm::FPPassManager::runOnModule(llvm::Module&) (in /usr/lib/x86_64-linux-gnu/libLLVM-9.so.1)
==6481==    by 0x72FA4FFF: llvm::legacy::PassManagerImpl::run(llvm::Module&) (in /usr/lib/x86_64-linux-gnu/libLLVM-9.so.1)
==6481==    by 0x70DE9CB6: pocl_llvm_codegen (in /opt/pocl-master/lib/libpocl.so.2.4.0)
==6481==    by 0x70D89658: llvm_codegen (in /opt/pocl-master/lib/libpocl.so.2.4.0)
==6481==    by 0x70D8B34C: pocl_check_kernel_disk_cache (in /opt/pocl-master/lib/libpocl.so.2.4.0)
==6481==    by 0x70D8B926: pocl_check_kernel_dlhandle_cache (in /opt/pocl-master/lib/libpocl.so.2.4.0)
==6481==    by 0x70D5E135: program_compile_dynamic_wg_binaries (in /opt/pocl-master/lib/libpocl.so.2.4.0)
==6481==    by 0x70D71721: POclGetProgramInfo (in /opt/pocl-master/lib/libpocl.so.2.4.0)
==6481==    by 0x6DAD93C6: pyopencl::program::get_info(unsigned int) const (wrap_cl.hpp:3838)

About this issue

  • State: closed
  • Created 5 years ago
  • Comments: 40 (40 by maintainers)

Most upvoted comments

#850 should address this.

Can you at least file an LLVM bug for this and link it to this issue so we can keep track more easily?

https://bugs.llvm.org/show_bug.cgi?id=46666

We could have a build-time option to disable the workaround for builds that run against an LLVM with https://github.com/llvm/llvm-project/commit/411d31ad72456ba88c0b0bee0faba2b774add65f reverted.

The problem is that POCL applies llvm.loop.parallel_accesses metadata too aggressively. To handle the conditional in the loop, the loop vectorizer ought to generate masked loads and stores. However, adding llvm.loop.parallel_accesses metadata the way POCL does suppresses the generation of the masked loads, so the vectorizer emits an unmasked load, which leads to an out-of-bounds read.
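
To illustrate the mechanism, here is a minimal sketch (hypothetical names and structure; not the actual generated grudge_assign_0 code) of the kind of per-work-group loop pocl builds around a kernel body:

// Hypothetical sketch of the per-work-group loop pocl generates
// (names and structure simplified; not the actual grudge_assign_0 code).
void workgroup_loop(const float *src, float *dst,
                    int base, int wg_size, int n) {
  // pocl attaches llvm.loop.parallel_accesses metadata to this loop.
  for (int lid = 0; lid < wg_size; ++lid) {
    if (base + lid < n) {
      // When vectorized, this load should become a masked load honoring
      // the guard above. With the parallel-access metadata, the vectorizer
      // instead emits an unmasked vector load that can read past the end
      // of src -- consistent with the 32-byte (256-bit AVX) invalid reads
      // valgrind reports on Haswell.
      dst[base + lid] = src[base + lid];
    }
  }
}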

More specifically, for a parallel loop, conditionally executed loads are assumed to be safe to execute speculatively (and hence to execute as unmasked loads). This is confirmed by this comment in LoopVectorizationLegality.cpp, and it was stated in the language reference before that passage was removed. The removal of that information from the language reference seems like a bug.

The solution on POCL’s end is not to add llvm.access.group metadata for memory reads. (The llvm.access.group metadata is required for the loop to be regarded as parallel.) A crude version of this can be done by the following patch:

diff --git a/lib/llvmopencl/ParallelRegion.cc b/lib/llvmopencl/ParallelRegion.cc
index 187e8b52..4a635377 100644
--- a/lib/llvmopencl/ParallelRegion.cc
+++ b/lib/llvmopencl/ParallelRegion.cc
@@ -474,7 +474,14 @@ void ParallelRegion::AddParallelLoopMetadata(llvm::MDNode *Identifier) {
     for (BasicBlock::iterator ii = bb->begin(), ee = bb->end();
          ii != ee; ii++) {
 
-      if (ii->mayReadOrWriteMemory()) {
+      if (ii->mayReadFromMemory()) {
+        // Parallel loop metadata on memory reads also implies that
+        // if-conversion (i.e., speculative execution within a loop
+        // iteration) is safe. This isn't guaranteed.
+        continue;
+      }
+
+      if (ii->mayWriteToMemory()) {
         MDNode *NewMD = MDNode::get(bb->getContext(), Identifier);
         MDNode *OldMD = ii->getMetadata(PARALLEL_MD_NAME);
         if (OldMD != nullptr) {

You could obviously be a bit more selective if you can prove that speculative execution of loads is safe.
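
As a rough sketch of what being more selective could look like (a hypothetical helper, not part of the patch above), one could consult LLVM's speculation-safety analysis before skipping a read:

// Hypothetical refinement, not part of the actual patch: annotate a read
// as a parallel access only when LLVM can prove speculation is safe.
#include "llvm/Analysis/ValueTracking.h"
#include "llvm/IR/Instruction.h"

static bool safeToAnnotateRead(const llvm::Instruction &I) {
  if (!I.mayReadFromMemory())
    return true; // not a read; writes are handled as before
  // For loads, isSafeToSpeculativelyExecute() checks (among other things)
  // that the pointer is known to be dereferenceable, which is the condition
  // under which an if-converted, unmasked load cannot fault.
  return llvm::isSafeToSpeculativelyExecute(&I);
}

In practice this is conservative: loads through plain __global kernel pointers are usually not provably dereferenceable, so most reads would still lose the annotation unless pocl also emitted dereferenceability information.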

I took Isuru’s C++ reproducer and modified it slightly to get rid of a few extraneous uninitialized reads; see here. Running it with valgrind and POCL 1.5 produces the “Invalid read of size 32” error. With the above patch applied (and the kernel cache deleted), the invalid reads are gone. Examining parallel.bc before and after the patch confirms that masked loads are generated only with the patch applied.