opencv: Carotene (ARM HAL) uses wrong rounding in some places

System Information

  • OpenCV version: 4.8.0
  • Operating System: macOS 13.3.1 (ARM64) / Ubuntu 20.04.4 (x86_64)
  • Compiler: Apple clang version 14.0.3 / g++ (Ubuntu 9.4.0-1ubuntu1~20.04.1) 9.4.0

Detailed description

Even a computation as simple as (1+0)/2 gives different results on macOS and Linux:

cv::Mat A = cv::Mat::ones(3, 3, CV_8UC1);
cv::Mat B = cv::Mat::zeros(3, 3, CV_8UC1);
cv::Mat mean = (A + B) / 2;
std::cout << "A = \n" << A << std::endl;
std::cout << "B = \n" << B << std::endl;
std::cout << "(A+B)/2 = \n" << mean << std::endl;

Steps to reproduce

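Build and run this program on both platforms:

#include <iostream>
#include <opencv2/core.hpp>

int main()
{
    cv::Mat A = cv::Mat::ones(3, 3, CV_8UC1);
    cv::Mat B = cv::Mat::zeros(3, 3, CV_8UC1);
    cv::Mat mean = (A + B) / 2;  // every element is exactly 0.5 before rounding
    std::cout << "A = \n" << A << std::endl;
    std::cout << "B = \n" << B << std::endl;
    std::cout << "(A+B)/2 = \n" << mean << std::endl;
}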

On macOS:

A =
[  1,   1,   1;
   1,   1,   1;
   1,   1,   1]
B =
[  0,   0,   0;
   0,   0,   0;
   0,   0,   0]
(A+B)/2 =
[  1,   1,   1;
   1,   1,   1;
   1,   1,   0]

On Linux:

A =
[  1,   1,   1;
   1,   1,   1;
   1,   1,   1]
B =
[  0,   0,   0;
   0,   0,   0;
   0,   0,   0]
(A+B)/2 =
[  0,   0,   0;
   0,   0,   0;
   0,   0,   0]
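
The two outputs are consistent with the halfway value (1+0)/2 = 0.5 being rounded in two different ways: to the nearest even integer (giving 0) on one code path and half-up (giving 1) on the other. The following is a minimal scalar sketch of that difference, not OpenCV code. The helper names `roundHalfEven` and `roundHalfUp` are made up for illustration, and the sketch assumes inputs in [0, 255] and the default floating-point rounding mode (FE_TONEAREST).

```cpp
#include <cmath>
#include <cstdint>

// Round to nearest, ties to even. std::nearbyint honours the current
// rounding mode, which is assumed here to be the default FE_TONEAREST.
// No saturation: the caller is assumed to pass values in [0, 255].
inline std::uint8_t roundHalfEven(float v)
{
    return static_cast<std::uint8_t>(std::nearbyint(v));
}

// Add 0.5 and truncate toward zero, so ties always go up (for v >= 0).
inline std::uint8_t roundHalfUp(float v)
{
    return static_cast<std::uint8_t>(v + 0.5f);
}
```

With these helpers, `roundHalfEven(0.5f)` yields 0 (the Linux result) while `roundHalfUp(0.5f)` yields 1 (most of the macOS result). The two only disagree on exact ties such as 0.5 or 2.5.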

Issue submission checklist

  • I report the issue, it’s not a question
  • I checked the problem with documentation, FAQ, open issues, forum.opencv.org, Stack Overflow, etc and have not found any solution
  • I updated to the latest OpenCV version and the issue is still there
  • There is reproducer code and related data files (videos, images, onnx, etc)

About this issue

  • State: closed
  • Created 10 months ago
  • Comments: 15 (14 by maintainers)

Most upvoted comments

The correct result is “all zeros”. It looks like there is some issue with the ARM optimizations on OSX.

Verification: cmake <...> -DCMAKE_BUILD_TYPE=Debug -DCV_DISABLE_OPTIMIZATION=ON

A = 
[  1,   1,   1;
   1,   1,   1;
   1,   1,   1]
B = 
[  0,   0,   0;
   0,   0,   0;
   0,   0,   0]
(A+B)/2 = 
[  0,   0,   0;
   0,   0,   0;
   0,   0,   0]

Thank you for the comment. This is only a trial patch; however, I think this fix will slow down processing. I’ll continue to investigate …

linaro@linaro-alip:~/work/opencv$ git --no-pager diff -c
diff --git a/3rdparty/carotene/src/add_weighted.cpp b/3rdparty/carotene/src/add_weighted.cpp
index 6559b9fe53..c56f95a4e3 100644
--- a/3rdparty/carotene/src/add_weighted.cpp
+++ b/3rdparty/carotene/src/add_weighted.cpp
@@ -150,7 +150,7 @@ template <> struct wAdd<u32>
     {
         valpha = vdupq_n_f32(_alpha);
         vbeta = vdupq_n_f32(_beta);
-        vgamma = vdupq_n_f32(_gamma + 0.5);
+        vgamma = vdupq_n_f32(_gamma);
     }

     void operator() (const VecTraits<u32>::vec128 & v_src0,
@@ -162,7 +162,7 @@ template <> struct wAdd<u32>

         vs1 = vmlaq_f32(vgamma, vs1, valpha);
         vs1 = vmlaq_f32(vs1, vs2, vbeta);
-        v_dst = vcvtq_u32_f32(vs1);
+        v_dst = round_u32_f32(vs1);
     }

     void operator() (const VecTraits<u32>::vec64 & v_src0,
@@ -174,7 +174,7 @@ template <> struct wAdd<u32>

         vs1 = vmla_f32(vget_low(vgamma), vs1, vget_low(valpha));
         vs1 = vmla_f32(vs1, vs2, vget_low(vbeta));
-        v_dst = vcvt_u32_f32(vs1);
+        v_dst = round_u32_f32(vs1);
     }

     void operator() (const u32 * src0, const u32 * src1, u32 * dst) const
diff --git a/3rdparty/carotene/src/vtransform.hpp b/3rdparty/carotene/src/vtransform.hpp
index 08841a2263..7ae38e95df 100644
--- a/3rdparty/carotene/src/vtransform.hpp
+++ b/3rdparty/carotene/src/vtransform.hpp
@@ -682,6 +682,31 @@ void vtransform(Size2D size,
     }
 }

+inline VecTraits<u32>::vec128 round_u32_f32(const float32x4_t val)
+{
+#if defined(__aarch64__) || defined(__aarch32__)
+    return vcvtnq_u32_f32(val);
+#else // armv7
+#if 1
+    static const float32x4_t f32_v0_5 = vdupq_n_f32(0.5);
+    static const uint32x4_t  u32_v1_0 = vdupq_n_u32(1);
+
+    const uint32x4_t round = vcvtq_u32_f32( vaddq_f32(val, f32_v0_5 ) );
+    const uint32x4_t isOdd = vandq_u32( round, u32_v1_0 );
+    const uint32x4_t isFrac0_5 = vceqq_f32(vsubq_f32(vcvtq_f32_u32(round),val), f32_v0_5 );
+    return vsubq_u32( round, vandq_u32( isOdd, isFrac0_5 ) );
+#else
+    static const float32x4_t f32_v0_5 = vdupq_n_f32(0.5);
+    return vcvtq_u32_f32( vaddq_f32(val, f32_v0_5) );
+#endif
+#endif
+}
+
+inline VecTraits<u32>::vec64 round_u32_f32(const float32x2_t val)
+{
+  return vcvt_u32_f32(val);
+}
+
 } }

 #endif // CAROTENE_NEON

Trace of the armv7 fallback for a few halfway values:

val = 10.5
-> round = 11
-> isOdd = 1
-> isFrac0_5 = 0xffffffff
-> ret = round - (isOdd and isFrac0_5) = 11 - 1 = 10

val = 11.5
-> round = 12
-> isOdd = 0
-> isFrac0_5 = 0xffffffff
-> ret = round - (isOdd and isFrac0_5) = 12 - 0 = 12

val = 12.5
-> round = 13
-> isOdd = 1
-> isFrac0_5 = 0xffffffff
-> ret = round - (isOdd and isFrac0_5) = 13 - 1 = 12
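
The same steps can be checked without NEON hardware. Below is a single-lane scalar emulation of the armv7 branch of `round_u32_f32` from the diff above. It is only a sketch: the function name `roundU32F32Scalar` is hypothetical, and it assumes `val >= 0` and that `val + 0.5` fits in 32 bits.

```cpp
#include <cstdint>

// Scalar, single-lane emulation of the armv7 fallback in round_u32_f32.
// Assumes val >= 0 and val + 0.5 representable as uint32_t.
inline std::uint32_t roundU32F32Scalar(float val)
{
    // vcvtq_u32_f32(vaddq_f32(val, 0.5)): add 0.5, then truncate toward zero.
    const std::uint32_t rounded = static_cast<std::uint32_t>(val + 0.5f);
    // vandq_u32(round, 1): 1 when the half-up result is odd.
    const std::uint32_t isOdd = rounded & 1u;
    // vceqq_f32(...): all-ones mask when val was exactly halfway.
    const std::uint32_t isFrac0_5 =
        (static_cast<float>(rounded) - val == 0.5f) ? 0xFFFFFFFFu : 0u;
    // Step back by one only when an exact tie was rounded up to an odd value,
    // which turns half-up rounding into ties-to-even.
    return rounded - (isOdd & isFrac0_5);
}
```

For the three traced inputs, `roundU32F32Scalar(10.5f)`, `roundU32F32Scalar(11.5f)` and `roundU32F32Scalar(12.5f)` return 10, 12 and 12. Values that are not exact ties (for example 2.3 or 2.7) keep their ordinary nearest-integer result.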

armv7 is still very much alive, so support for it is required.

This may be related to #24213 and the fix in #24215.

This issue still exists after the newest fix, #24215.

The result on macOS is weird: why is it all 1s except the last element?
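
One possible explanation, which is an assumption and not confirmed anywhere in this thread: a 3x3 CV_8UC1 matrix has 9 elements, so a SIMD loop that processes 8 lanes at a time would cover the first 8 and leave the ninth to a scalar tail. If the vector body rounds half-up while the tail rounds ties to even, the output is eight 1s followed by a single 0. The sketch below models only that hypothesis. `halveWithMixedRounding` and the lane count `kLanes` are made up for illustration and are not Carotene code.

```cpp
#include <array>
#include <cmath>
#include <cstddef>
#include <cstdint>

// Hypothetical model, not Carotene code: full groups of 8 elements go
// through a "vector" body that rounds half-up, and the leftover element
// goes through a scalar tail that rounds ties to even.
inline std::array<std::uint8_t, 9> halveWithMixedRounding(
    const std::array<std::uint8_t, 9>& src)
{
    constexpr std::size_t kLanes = 8;  // assumed vector width in u8 lanes
    std::array<std::uint8_t, 9> dst{};
    std::size_t i = 0;
    // "Vector" body: add 0.5 and truncate (round half-up).
    for (; i + kLanes <= src.size(); i += kLanes)
        for (std::size_t j = 0; j < kLanes; ++j)
            dst[i + j] = static_cast<std::uint8_t>(src[i + j] * 0.5f + 0.5f);
    // Scalar tail: round to nearest, ties to even (default rounding mode).
    for (; i < src.size(); ++i)
        dst[i] = static_cast<std::uint8_t>(std::nearbyint(src[i] * 0.5f));
    return dst;
}
```

Feeding nine 1s into this model reproduces the macOS pattern: elements 0 to 7 come out as 1 and element 8 comes out as 0.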