opencv: out of bounds crashes in DNN swizzleWeight() and copyWeightsSwizzled corrupts OpenCL state

The template function OCL4DNNConvSpatial<Dtype>::swizzleWeight() and its opencl kernel program cv::ocl::dnn::conv_spatial_helper_oclsrc both have bugs in them that cause out of bounds memory access, OpenCL kernel crashes, and premature release of resources.

Originally, I was getting CL_OUT_OF_RESOURCES at unexpected places in the OpenCL allocator. I then learned that errors like this are usually not occurring at the location of the return code expressing an error. This is due to the queuing behavior of OpenCL and devices. It is more likely the actual error occurs before the place of the return code. And it could be significantly before. https://stackoverflow.com/questions/3988645/cl-out-of-resources-for-2-millions-floats-with-1gb-vram/13896943

On Nvidia devices, this bug corrupts state and in some later opencl api call returns CL_OUT_OF_RESOURCES
On newest drivers (opencl 3.0 capable) on Intel UHD integrated graphics, it corrupts state and causes later OpenCL api calls to block and never return. The app freezes.
On newest Intel Opencl CPU runtime, it causes Access violation reading location exceptions.

System information (version)

OpenCV => 3.x and likely 4.x
Operating System / Platform => all
Compiler => all

Detailed description

Isolating the area of this bug is too detailed and tedious to document here. So I’ll just jump to how to repro and then describe the errant code.

Use Windows 10 and the install the Intel OpenCL CPU runtime 2021.3.0.3372
Compile OpenCV from the current code on the 3.4 branch
Edit OCL4DNNConvSpatial<Dtype>::swizzleWeight() in ocl4dnn_conv_spatial.cpp like this

if (swizzled_weights_umat.empty())
        swizzled_weights_umat.create(1, (int)alignSize(num_output_, 16) * channels_ *
                                     kernel_h_ * (int)alignSize(kernel_w_, 2),
                                     (use_half_) ? CV_16SC1 : CV_32FC1);

// new code below
    if (
        kernel_w_ == 1 &&
        kernel_h_ == 1 &&
        (channels_ / group_) == 144 &&
        num_output_ == 24 &&
        swizzled_factor == 16
    )
        CV_LOG_INFO(NULL, "debug breakpoint here");
// new code above

    if (!interleave) {

Do a full build of OpenCV
OPENCV_TEST_DATA_PATH=xxxxxxx, OPENCV_OPENCL_DEVICE=:CPU:, OPENCV_DNN_OPENCL_ALLOW_ALL_DEVICES=TRUE
Set a breakpoint at the CV_LOG_INFO line above. Prepare the examine all the variables when it hits that breakpoint.
Run test opencv_test_dnnd.exe --gtest_filter=*DNNTestNetwork.MobileNet_SSD_v2_TensorFlow*

You will eventually get many Access violation reading location xxxxxx exceptions and the program will crash. As I write above, NVidia will throw CL_ errors, and Intel GPU will freeze.

The issue is errant code in several locations of this function.

at the bottom of this function is an errant weightMat.release();. That should be removed as this variable is local, it will be released when out of scope. Having this release here will cause double-release.
The logic above of calculating kernel arguments and the kernel program itself has wrong code. I can’t tell you the correct code because I do not understand the intention of this function. HOWEVER, I can demonstrate that when the code runs and passes arguments to the kernel…inside the kernel it attempts to access memory out of bounds. Which causes silent kernel failures and corruption of state that cascades and shows up later.

When your breakpoint on the CV_LOG_INFO breaks, examine all the variables that will directly or indirectly be used as arguments to the kernel created/run a few lines down from cv::ocl::dnn::conv_spatial_helper_oclsrc. Here is a summary

ocl::Kernel oclk_copy_weight
    made from dnn::conv_spatial_helper_oclsrc
    kernel function copyWeightsSwizzled_float
    -DDtype=float

With args...
OpenCV(OpenCL:0): clSetKernelArg('copyWeightsSwizzled_float', arg_index=0, cl_mem=0000021FD4D9B050)
OpenCV(OpenCL:0): clSetKernelArg('copyWeightsSwizzled_float', arg_index=1, cl_mem=0000021FD2C445C0)
OpenCV(OpenCL:0): clSetKernelArg('copyWeightsSwizzled_float', arg_index=2, size=4, value=1 / 1u / 0x00000001 / 0)
OpenCV(OpenCL:0): clSetKernelArg('copyWeightsSwizzled_float', arg_index=3, size=4, value=1 / 1u / 0x00000001 / 0)
OpenCV(OpenCL:0): clSetKernelArg('copyWeightsSwizzled_float', arg_index=4, size=4, value=144 / 144u / 0x00000090 / 0)
OpenCV(OpenCL:0): clSetKernelArg('copyWeightsSwizzled_float', arg_index=5, size=4, value=24 / 24u / 0x00000018 / 0)
OpenCV(OpenCL:0): clSetKernelArg('copyWeightsSwizzled_float', arg_index=6, size=4, value=16 / 16u / 0x00000010 / 0)

global_work_size_copy=	4608, 1, 1	
oclk_copy_weight.run(3, global_work_size_copy, NULL, false)	

Exception thrown at 0x0000021FC3250226 in opencv_test_dnnd.exe: 0xC0000005: Access violation reading location 0x0000021FC2864680. 
Exception thrown at 0x0000021FC3250226 in opencv_test_dnnd.exe: 0xC0000005: Access violation reading location 0x0000021FC2864380. 
Exception thrown at 0x0000021FC3250226 in opencv_test_dnnd.exe: 0xC0000005: Access violation reading location 0x0000021FC28642C0. 
Exception thrown at 0x0000021FC3250226 in opencv_test_dnnd.exe: 0xC0000005: Access violation reading location 0x0000021FC2864440. 
...

Now go into the kernel program itself. Here is that code with comments on each line and the math. The comments simulate get_global_id(0) returning 4607 and 4608. I choose these because the c++ code requested OpenCV use global_work_size_copy=4608, 1, 1. The actual global work size in the kernel may be 0…4607 or it could be 0…more than 4607 due to rounding up by OpenCV or the OpenCL runtime. So I choose both sides of that boundary because that is the place where bugs live. 🐛

You will see that the read and assign on the last line of the kernel program is errant. It will access memory out of bounds for both cases of get_global_id being 4607 or 4608.

weightOut is rows=1, cols=9216, CV_32FC1 = total of 9216 floats
weightIn is rows=24, cols=144, CV_32FC1 = total of 3456 floats

weightOut[4607] = weightIn[4607];  // when processing get_global_id(0) = 4607
weightOut[5120] = weightIn[4640];  // when processing get_global_id(0) = 4608
The code is accessing out of bounds of weightIn.

In addition to the mathematic out of bounds access, I am also concerned there is no boundary checking. In the kernel, get_global_id(0) may return a number that is larger than what you might expect. This can be due to OpenCV rounding up the size, or the Opencl runtime itself. Therefore, it is needed to always check boundaries before blindly operating on work items. if test can be used but perform poorly. Better is to use functions like min/max. It may be possible the modulus code in the kernel is an attempt to keep within bounds. But the attempt is errant. https://software.intel.com/content/www/us/en/develop/documentation/iocl-tec-opg/top/tips-and-tricks-for-kernel-development/avoid-handling-edge-conditions-in-kernels.html

#ifdef HALF_SUPPORT
#ifdef cl_khr_fp16
#pragma OPENCL EXTENSION cl_khr_fp16:enable
#endif
#endif
#define CONCAT(A,B) A##_##B
#define TEMPLATE(name,type) CONCAT(name,type)
__kernel void TEMPLATE(copyWeightsSwizzled, Dtype)
(
    __global Dtype* weightIn,   // UMat `weight` is rows=24, cols=144, CV_32FC1 = total of 3456 floats
    __global Dtype* weightOut,  // UMat `swizzled_weights_umat` is rows=1, cols=9216, CV_32FC1 = total of 9216 floats
    const int kernel_w,         // 1
    const int kernel_h,         // 1
    const int channels,         // 144
    const int outputs,          // 24
    const int swizzleFactor     // 16
) {
    unsigned int sX = get_global_id(0); // 0...4608 remember opencl runtime can also return higher numbers than size given on run()
    int outputSublayer = channels / swizzleFactor;           //   144/16 = 9 these are constants, should not calc at runtime
    int outputSublayerIndex = channels % swizzleFactor;      //   144%16 = 0 these are constants, should not calc at runtime
    int filter = sX / (kernel_w*kernel_h*channels);          //   4608/(1*1*144) = 32      -or- 4607/(1*1*144) = 31
    int kernel_X = sX % kernel_w;                            //   4608 % 1 = 0             -or- 4607 % 1 = 0
    int kernel_Y = (sX / kernel_w) % kernel_h;               //   (4608/1) % 1 = 0         -or- (4607/1) % 1 = 0
    int kernel_C = (sX / (kernel_w * kernel_h)) % channels;  //   (4608/(1*1)) % 144 = 32  -or- (4607/(1*1)) % 144 = 143
    int FP = filter / swizzleFactor;                         //   32 / 16 = 2              -or- 31 / 16 = 1
    int F1 = filter % swizzleFactor;                         //   32 % 16 = 0              -or- 31 % 16 = 15
    weightOut[
        // multiply has higher precedence than add
        FP * (kernel_w*kernel_h*channels*swizzleFactor) +    // 2 * (1 * 1 * 144 * 16) = 4608   -or- 1 * (1*1*144*16) = 2304
        kernel_C*(kernel_w*kernel_h*swizzleFactor) +         // 32 * (1 * 1 * 16) = 512         -or- 143 * (1*1*16) = 2288
        kernel_Y*(kernel_w*swizzleFactor) +                  // 0 * (1 * 16) = 0                -or- 0 * (1*16) = 0
        kernel_X*swizzleFactor +                             // 0 * 16 = 0                      -or- 0 * 16 = 0
        F1                                                   // 0                               -or- 15
    ]                                                        // = 4608 + 512 + 0 + 0 = 5120     -or-  = 2304+2288+0+0+15=4607
    = weightIn[
        // multiply has higher precedence than add
        filter * (kernel_w*kernel_h*channels) +              // 32 * (1*1*144) = 4608           -or- 31 * (1*1*144) = 4464
        kernel_C*(kernel_w*kernel_h) +                       // 32 * (1*1) = 32                 -or- 143 * (1*1) = 143
        kernel_Y*kernel_w +                                  // 0 * 1 = 0                       -or- 0 * 1 = 0
        kernel_X                                             // 0                               -or- 0
    ];                                                       // = 4608 + 32 + 0 + 0 = 4640      -or- 4464+143+0+0=4607
}

Issue submission checklist

I report the issue, it’s not a question
I checked the problem with documentation, FAQ, open issues, forum.opencv.org, Stack Overflow, etc and have not found solution
I updated to latest OpenCV version and the issue is still there
There is reproducer code and related data files: videos, images, onnx, etc

About this issue

Original URL
State: closed
Created 3 years ago
Reactions: 2
Comments: 15 (15 by maintainers)

Most upvoted comments

Moving issue/concern with https://github.com/opencv/opencv/pull/20648 here as it is getting lost

Does the swizzle weight code need to initialize or write to all entities in the output weight array? I ask because the current code does not write to all output entities. I wrote code to test this and there is a large number of entities never initialized never written. Therefore, those entities have random data in them and the random data is passed further down the DNN codepath.

diablodale on Sep 8, 2021