tfjs: [webgl] enabling WEBGL_USE_SHAPES_UNIFORMS results in model execution returning null values
a user reported an issue with my library and after a while it was traced down to WEBGL_USE_SHAPES_UNIFORMS
which i have enabled by default for over half a year now due to massive performance advantages
issue itself is that model execution shows no issues and it returns tensors in expected shape,
but tensor data is an array with all null values instead of expected float32 values
what is strange is that on the majority of systems there are no issues (plus i cannot reproduce locally)
and there is only a single (quite small) affected model out of 10+ used
affected model
very simple code reproduction (again, i cannot reproduce myself)
(it runs a model using predefined image input and checks output validity)
- https://vladmandic.github.io/human/test/issue-faceres.html?default: returns arrays with correct float values
- https://vladmandic.github.io/human/test/issue-faceres.html?uniforms: returns array with all null values
test has tfjs debug enabled and you can see all ops in the browser console
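A minimal sketch of the kind of output validity check described above. It uses no tfjs calls and works on any numeric array, such as the typed array read back from an output tensor. The helper names `summarizeOutput` and `toPlainArray` are hypothetical and are not taken from the test page. It also shows one plausible reason the values appear as null, which is an assumption on my side: JSON has no NaN literal, so NaN values are written as null when the data is serialized.

```typescript
// Hypothetical helpers for illustration; not part of tfjs or the test page.

// Count non-finite entries (NaN, +Infinity, -Infinity) in model output data.
function summarizeOutput(data: ArrayLike<number>): { total: number; nonFinite: number; valid: boolean } {
  let nonFinite = 0;
  for (let i = 0; i < data.length; i++) {
    if (!isFinite(data[i])) nonFinite++;
  }
  return { total: data.length, nonFinite, valid: data.length > 0 && nonFinite === 0 };
}

// Copy a typed array into a plain array so that it serializes as a JSON list.
function toPlainArray(data: ArrayLike<number>): number[] {
  const out: number[] = [];
  for (let i = 0; i < data.length; i++) out.push(data[i]);
  return out;
}

// Stand-ins for data read back from an output tensor.
const goodOutput = new Float32Array([0.25, -1.5, 3]);
const badOutput = new Float32Array([NaN, NaN, NaN]);

console.log(summarizeOutput(goodOutput).valid); // true
console.log(summarizeOutput(badOutput).valid); // false
// Each NaN is serialized as null.
console.log(JSON.stringify(toPlainArray(badOutput))); // [null,null,null]
```

If the null values are in fact NaN, checking with `isFinite` on the raw typed array catches them before serialization hides the difference.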
environment:
- tfjs: 3.20.0 (older versions are also affected)
- browser: chrome 104 and 105
- os: windows 10 pro
it’s reported on two different systems using different graphics adapters (AMD Radeon and nVidia RTX)
both affected systems report no issues looking at chrome://gpu and webgl v2 is working fine
only other thing worth noting is that user is using non-latin os locale
link to original issue: https://github.com/vladmandic/human/issues/291
About this issue
- State: closed
- Created 2 years ago
- Comments: 34 (12 by maintainers)
Commits related to this issue
- webgl: Fix NaN issue (#6828) Fix #6822 Problem 1: On some GPUs, even if a and b are both non-NaN, the value of isNaN in vec4 isNaN = min(vec4(isnan(a)) + vec4(isnan(b)), vec4(1.0)); are still lar... — committed to tensorflow/tfjs by qjia7 2 years ago
- Pulling in updates to tfjs master repository (#2) (#40) * Customize setTimeout (#6694) If the setTimeout nesting level is greater than 5 and timeout is less than 4ms, timeout will be clamped to 4... — committed to CodeSmithDSMLProjects/tfjs by koyykdy 2 years ago
- Updating to tfjs master: latest (#41) * Update jasmine_util.ts (#6872) FIX * webgl: Fix NaN issue (#6828) Fix #6822 Problem 1: On some GPUs, even if a and b are both non-NaN, the value o... — committed to CodeSmithDSMLProjects/tfjs by koyykdy 2 years ago
- Updating Resizing to Latest TFJS Master Branch (#43) * Update jasmine_util.ts (#6872) FIX * webgl: Fix NaN issue (#6828) Fix #6822 Problem 1: On some GPUs, even if a and b are both non-N... — committed to CodeSmithDSMLProjects/tfjs by koyykdy 2 years ago
@vladmandic @shurshilov I have verified that #6828 fixes this issue on my device. You can verify it in the next release or build from source. If you still run into any problem, please feel free to reopen this issue or file a new bug. Thanks.
More information: The current issue is no longer that enabling WEBGL_USE_SHAPES_UNIFORMS results in model execution returning null values. Rather, the faceres model can't get a correct result no matter whether WEBGL_USE_SHAPES_UNIFORMS is true or false.
In fact, the fix is more like a workaround. I don't know why isnan_custom doesn't work for packed binary, but it works for unpacked binary. I did a bunch of experiments for CHECK_NAN_SNIPPET_PACKED, but none of them worked. Pasting some findings so far in case someone has more ideas. Tested GPU: Radeon™ RX 480 Graphics
- isNaN comes from the input scalar b, not a. If we only check isnan_custom(a), we can get the correct result.
- b's value instead of minimum's result. The result seems fine. I mean it's the expected b's value.
- vec4(isnan_custom(b))'s value. I also can't see anything abnormal. It's all zero. But the incorrect result of minimum shows it should have a non-zero value to let the code go to result.xxx = isNaN.xxx ? NAN : result.xxx;
- (0.5) is none of a (1.84), b (6.0) and NAN.
I just found a device which can reproduce this issue. I will debug it to see if I can find more. Please stay tuned.
@shurshilov Great! We have found the right place to fix this issue. Here is a quick summary if you are interested in the reason.
It seems that each component of isNaN is not exactly equal to 0. I suspect the value is very close to 0. but larger than it. So every isNaN.xxx > 0. check is true, and we get NAN as the result. On other devices the check behaves as expected, so we can't reproduce it there. After changing the code to use a bool comparison instead of a float comparison, we get the correct result.
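The suspicion above can be illustrated with a small CPU-side simulation. This is only a sketch under stated assumptions, not the tfjs shader code: the function names and the `maskError` parameter are hypothetical, and the tiny positive error stands in for whatever the affected GPU actually produces. It does not try to reproduce the exact values observed on the device.

```typescript
// CPU-side simulation only; all names here are hypothetical.

// Float-mask variant: mirrors the shape of
//   isNaN = min(float(isnan(a)) + float(isnan(b)), 1.0)
// followed by a "> 0." test. maskError models the assumed tiny
// positive value that the affected GPU yields instead of an exact 0.
function minimumWithFloatMask(a: number, b: number, maskError: number): number {
  const mask = Math.min((isNaN(a) ? 1 : 0) + (isNaN(b) ? 1 : 0) + maskError, 1);
  const result = Math.min(a, b);
  // Any value above zero, however small, is treated as "NaN present".
  return mask > 0 ? NaN : result;
}

// Bool variant: the NaN decision never passes through a float,
// so there is no rounding error to compare against zero.
function minimumWithBoolMask(a: number, b: number): number {
  const anyNaN = isNaN(a) || isNaN(b);
  return anyNaN ? NaN : Math.min(a, b);
}

console.log(minimumWithFloatMask(1.84, 6.0, 0)); // 1.84 (exact zero mask)
console.log(minimumWithFloatMask(1.84, 6.0, 1e-7)); // NaN (mask slightly above zero)
console.log(minimumWithBoolMask(1.84, 6.0)); // 1.84
```

With an exact zero mask both variants agree, which matches the observation that most devices show no problem. Only the float variant is sensitive to a mask that is slightly above zero.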
@vladmandic
I think the reason is that the NAN check is still there no matter whether binary uniform shapes are enabled or disabled.
Weird. I thought I had changed all the related places. Do you have a case to reproduce it?
if gpu hw acceleration is disabled, then the entire webgl v2 codepath gets auto-disabled by default, so that is expected.
I want to remind you that this error only occurs on my device if the hardware acceleration option is enabled in the browser settings. If it is turned off, then everything seems to be working)
@vladmandic I also can't reproduce it on my Windows machine. @shurshilov I need your help to narrow down this issue to see which op's uniforms have the problem. Can you paste the console info from https://vladmandic.github.io/human/test/issue-faceres.html?default and https://vladmandic.github.io/human/test/issue-faceres.html?uniforms separately so that we can see at which op it becomes abnormal? I am also curious whether all models are impacted by WEBGL_USE_SHAPES_UNIFORMS in your environment or only faceres.
Please also follow the steps below to verify whether the uniform APIs for webgl1 work well on your system:
all/conformance/uniforms and click the run button before it. Then see if all tests pass.