tfjs: [webgl] enabling WEBGL_USE_SHAPES_UNIFORMS results in model execution returning null values

user reported issue with my library and after a while it was traced down to WEBGL_USE_SHAPES_UNIFORMS
which i have enabled by default for over half a year now due to massive performance advantages

issue itself is that model execution shows no issues and it returns tensors in expected shape,
but tensor data is an array with all null values instead of expected float32 values
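
A check along the following lines can flag this symptom once the output has been read back into a plain array. It is a minimal sketch, not the code used by the reproduction page; `countInvalid` and `isValidOutput` are hypothetical helpers. Note that `JSON.stringify` turns NaN into null, so null entries in a logged or serialized result may simply be NaN values; that is an assumption about this report, not something confirmed in the thread.

```typescript
// Hypothetical helpers (not part of tfjs): validate values read back from a tensor.
type OutputValue = number | null | undefined;

// Counts entries that are not finite numbers (null, undefined, NaN, +/-Infinity).
function countInvalid(values: ArrayLike<OutputValue>): number {
  let invalid = 0;
  for (let i = 0; i < values.length; i++) {
    const v = values[i];
    if (typeof v !== 'number' || !Number.isFinite(v)) invalid++;
  }
  return invalid;
}

// An output is treated as valid only if it is non-empty and fully finite.
function isValidOutput(values: ArrayLike<OutputValue>): boolean {
  return values.length > 0 && countInvalid(values) === 0;
}
```

The same helper works on a `Float32Array` or on an array parsed back from JSON.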

what is strange is that on the majority of systems there are no issues (plus i cannot reproduce locally)
and there is only single (quite small) affected model out of 10+ used

affected model

very simple code reproduction (again, i cannot reproduce myself)
(it runs a model using predefined image input and checks output validity)

test has tfjs debug enabled and you can see all ops in the browser console

environment:

  • tfjs: 3.20.0 (older versions are also affected)
  • browser: chrome 104 and 105
  • os: windows 10 pro

it’s reported on two different systems using different graphics adapters (AMD Radeon and nVidia RTX)
both affected systems report no issues looking at chrome://gpu and webgl v2 is working fine

only other thing worth noting is that user is using non-latin os locale

link to original issue: https://github.com/vladmandic/human/issues/291

About this issue

  • State: closed
  • Created 2 years ago
  • Comments: 34 (12 by maintainers)

Most upvoted comments

@vladmandic @shurshilov I have verified that #6828 fixes this issue on my device. You can verify it in the next release or by building from source. If you still run into any problem, please feel free to reopen this issue or file a new bug. Thanks.

More information: the current issue is no longer that enabling WEBGL_USE_SHAPES_UNIFORMS makes model execution return null values. Rather, the faceres model could not produce correct results whether WEBGL_USE_SHAPES_UNIFORMS was true or false.

In fact, the fix is more of a workaround. I don’t know why isnan_custom doesn’t work for packed binary ops while it works for unpacked binary ops. I ran a bunch of experiments on CHECK_NAN_SNIPPET_PACKED, but none of them worked. I’m pasting my findings so far in case someone has more ideas. Tested GPU: Radeon™ RX 480 Graphics

  1. The incorrect isNaN comes from the input scalar b, not a. If we only check isnan_custom(a), we get the correct result.
  2. Printing b’s value instead of minimum’s result shows nothing wrong: it is the expected value of b.
  3. Printing vec4(isnan_custom(b)) also shows nothing abnormal: it is all zero. But the incorrect result of minimum implies it must have a non-zero value for the code to reach result.xxx = isNaN.xxx ? NAN : result.xxx;
  4. The weirdest thing is that the incorrect value (0.5) is none of a (1.84), b (6.0), or NAN.

I just found a device that can reproduce this issue. I will debug it to see if I can find out more. Please stay tuned.

@shurshilov Great! We have found the right place to fix this issue. Here is a quick summary of the reason, if you are interested.

  1. WEBGL_USE_SHAPES_UNIFORMS triggers this issue, but uniforms are not the cause; they work as expected. This can be observed with WEBGL_PACKED=false && WEBGL_USE_SHAPES_UNIFORMS=true: exactly the same uniforms are used, yet only the packed binary op fails.
  2. Comparing the packed and unpacked binary ops, the most suspicious difference is the way NAN is checked. That’s why our last attempt was to remove the NAN check or to replace it with a different one. Both changes give correct results. So, based on the points above, I think the code snippet below doesn’t work properly on @shurshilov’s device or has a precision issue.

vec4 isNaN = min(vec4(isnan(a)) + vec4(isnan(b)), vec4(1.0));

  result.r = isNaN.r > 0. ? NAN : result.r;
  result.g = isNaN.g > 0. ? NAN : result.g;
  result.b = isNaN.b > 0. ? NAN : result.b;
  result.a = isNaN.a > 0. ? NAN : result.a;

It seems that each component of isNaN is not exactly equal to zero. I suspect the value is very close to zero but larger than it, so every isNaN.xxx > 0. check is true and we get NAN as the result. On other devices the check behaves as expected, which is why we can’t reproduce it there.
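
The suspicion above can be illustrated on the CPU. This is a sketch in TypeScript, not the shader itself: `floatMask` and `selectWithFloatMask` are hypothetical stand-ins for the packed snippet, and the small positive `error` term is an assumed input, since the actual mask value on the affected GPU was never measured in this thread.

```typescript
// Emulates one component of the packed check: `mask > 0. ? NAN : result`.
function selectWithFloatMask(mask: number, result: number): number {
  return mask > 0 ? NaN : result;
}

// Emulates min(float(isnan(a)) + float(isnan(b)), 1.0) for one component.
// `error` models the suspected imprecision; it is an assumption, not a measurement.
function floatMask(a: number, b: number, error = 0): number {
  const sum = Number(Number.isNaN(a)) + Number(Number.isNaN(b)) + error;
  return Math.min(sum, 1);
}

// Mask is exactly 0: the real result passes through.
const exact = selectWithFloatMask(floatMask(1.84, 6.0), 1.84);
// Mask is slightly above 0: the result is replaced by NaN although no input is NaN.
const drifted = selectWithFloatMask(floatMask(1.84, 6.0, 1e-7), 1.84);
```

Any mask value above zero, however small, takes the NaN branch, which is why a float comparison is fragile here.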

After changing the code above to the code below, which uses bool instead of float comparison, we get the correct result.

vec4 nanValue = a;
bvec4 isNaN = isnan(a);
result.r = isNaN.r ? nanValue.r : result.r;
result.g = isNaN.g ? nanValue.g : result.g;
result.b = isNaN.b ? nanValue.b : result.b;
result.a = isNaN.a ? nanValue.a : result.a;

nanValue = b;
isNaN = isnan(b);
result.r = isNaN.r ? nanValue.r : result.r;
result.g = isNaN.g ? nanValue.g : result.g;
result.b = isNaN.b ? nanValue.b : result.b;
result.a = isNaN.a ? nanValue.a : result.a;

return result;
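
For checking shader output against a reference, the boolean variant can be mirrored on the CPU. This is a sketch under assumptions: `Vec4` and `minimumPacked` are hypothetical names, and the initial `result` is taken to be the component-wise minimum of a and b, which the snippet above does not show.

```typescript
type Vec4 = [number, number, number, number];

function minimumPacked(a: Vec4, b: Vec4): Vec4 {
  // Assumed starting point: plain component-wise minimum, ignoring NaN.
  const result = a.map((x, i) => (x < b[i] ? x : b[i])) as Vec4;
  // Mirror of the bool-based select: where an operand is NaN, copy it through.
  for (const operand of [a, b]) {
    for (let i = 0; i < 4; i++) {
      if (Number.isNaN(operand[i])) result[i] = operand[i];
    }
  }
  return result;
}
```

With a = 1.84 and b = 6.0 in a component, this reference gives 1.84, so a shader value such as 0.5 would stand out immediately.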

@vladmandic

  1. all uniforms enabled except binary shapes test completes with NaN values, but the values are incorrect! this is a strange one as model completes without errors, but values returned are waaay off.

I think the reason is that the NAN check is still there whether binary uniform shapes are enabled or disabled.

note that tfjs variations are cumulative as tfjs fails during execution if building just with patched kernel ops as updated glsl code returns bool when expected value in CHECK_NAN_SNIPPET is float:

Weird. I thought I had changed all the related places. Do you have a case that reproduces it?

I want to remind you that this error only occurs on my device if the hardware acceleration option is enabled in the browser settings. If it is turned off, then everything seems to be working)

if gpu hw acceleration is disabled, then entire webgl v2 codepath gets auto-disabled by default, so that is expected.

@vladmandic I also can’t reproduce it on my Windows machine. @shurshilov I need your help to narrow this down and see which op’s uniforms have the problem. Can you paste the console output from https://vladmandic.github.io/human/test/issue-faceres.html?default and https://vladmandic.github.io/human/test/issue-faceres.html?uniforms separately so that we can tell at which op things become abnormal? I am also curious whether all models are affected by WEBGL_USE_SHAPES_UNIFORMS in your environment, or only faceres.
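
Finding the op where things become abnormal can also be done programmatically if per-op outputs are collected. The sketch below assumes a hypothetical `OpRecord` shape with the op name and its output values in execution order; it is not the format of the tfjs debug output, which would need to be adapted first.

```typescript
// Hypothetical record of one executed op and the values it produced.
interface OpRecord {
  name: string;              // op or kernel name, e.g. 'Minimum'
  values: ArrayLike<number>; // output values read back for that op
}

// Returns the first op (in execution order) whose output contains NaN or Infinity.
function firstAbnormalOp(records: OpRecord[]): OpRecord | undefined {
  return records.find((r) =>
    Array.from(r.values).some((v) => !Number.isFinite(v)),
  );
}
```

Running this on the records from both pages would show whether the two runs first go wrong at the same op.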

And please also follow the steps below to verify whether the uniform APIs for webgl1 and webgl2 work well on your system:

  1. Open https://registry.khronos.org/webgl/sdk/tests/webgl-conformance-tests.html?version=1.0.4 for webgl1 and https://registry.khronos.org/webgl/sdk/tests/webgl-conformance-tests.html?version=2.0.1 for webgl2
  2. Search for all/conformance/uniforms and click the run button in front of it. Then check whether all tests pass.