tensorflow: Invalid results when running TFLite + ruy computation within a NodeJS v11+ addon on ARMv7
System information
- Have I written custom code (as opposed to using a stock example script provided in TensorFlow): yes, libdeepspeech.so: https://github.com/mozilla/DeepSpeech/pull/2952
- OS Platform and Distribution (e.g., Linux Ubuntu 16.04): Raspbian Buster, Armbian Buster
- Mobile device (e.g. iPhone 8, Pixel 2, Samsung Galaxy) if the issue happens on mobile device: N/A
- TensorFlow installed from (source or binary): r2.2, master
- TensorFlow version (use command below): r2.2, master
- Python version: N/A
- Bazel version (if compiling from source): 2.0.0
- GCC/Compiler version (if compiling from source): GCC 6.5.0 (RPi toolchain integrated in TensorFlow), GCC 7.2.1 (Linaro toolchain custom-added to TensorFlow)
- CUDA/cuDNN version: N/A
- GPU model and memory: N/A
Describe the current behavior
Model computation differs when running the library inside a NodeJS process (v11.0.0+) on ARMv7 hardware.
Describe the expected behavior
Model computation should be the same regardless of the host process.
Standalone code to reproduce the issue
The reproduction environment is complicated for now (one needs to build libdeepspeech and the NodeJS addon, install and run them, and compare against the non-NodeJS result); I am working on a much smaller one.
How simple would this need to be? Our setup is a bit complicated.
Our model uses floats as input, so we need EvalHybrid to take the threaded fast path enabled by -DTFLITE_WITH_RUY_GEMV.
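For context on why that flag matters here: a hybrid kernel quantizes the float activations on the fly, runs an integer GEMV against the int8 weights, then scales the accumulators back to float. Below is a minimal conceptual sketch; all names and values are illustrative, not the actual EvalHybrid code.

```cpp
#include <algorithm>
#include <cmath>
#include <cstdint>
#include <cstdio>
#include <vector>

int main() {
  // Float activations (what our model feeds in) and pretend-prequantized
  // int8 weights with their dequantization scale.
  std::vector<float> input = {0.5f, -1.25f, 2.0f};
  std::vector<int8_t> weights = {40, -80, 100};
  const float weight_scale = 0.02f;

  // 1. Quantize the float input to int8 with a per-tensor scale.
  float max_abs = 0.0f;
  for (float v : input) max_abs = std::max(max_abs, std::fabs(v));
  const float input_scale = max_abs / 127.0f;
  std::vector<int8_t> q_input(input.size());
  for (size_t i = 0; i < input.size(); ++i) {
    q_input[i] = static_cast<int8_t>(std::lround(input[i] / input_scale));
  }

  // 2. Integer dot product: this is the GEMV that ruy runs (threaded when
  //    -DTFLITE_WITH_RUY_GEMV is in effect).
  int32_t acc = 0;
  for (size_t i = 0; i < input.size(); ++i) {
    acc += static_cast<int32_t>(q_input[i]) * static_cast<int32_t>(weights[i]);
  }

  // 3. Scale the int32 accumulator back to float.
  std::printf("result = %f\n", acc * input_scale * weight_scale);
  return 0;
}
```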
Building for Android:
PYTHON_BIN_PATH=/usr/bin/python PYTHON_LIB_PATH=/usr/local/lib/python2.7/dist-packages TF_ENABLE_XLA=0 TF_NEED_OPENCL_SYCL=0 TF_NEED_CUDA=0 TF_NEED_ROCM=0 TF_NEED_MPI=0 TF_DOWNLOAD_CLANG=0 CC_OPT_FLAGS="-march=native -Wno-sign-compare" TF_SET_ANDROID_WORKSPACE=1 ANDROID_NDK_HOME=$HOME/Documents/codaz/Mozilla/DeepSpeech/Android/android-ndk-r18b/ ANDROID_NDK_API_LEVEL=21 ANDROID_SDK_HOME=$HOME/Documents/codaz/Mozilla/DeepSpeech/Android/SDK/ ANDROID_API_LEVEL=27 ANDROID_BUILD_TOOLS_VERSION=28.0.3 ./configure && bazel clean && bazel build -s --verbose_failures --workspace_status_command="bash native_client/bazel_workspace_status_cmd.sh" --config=monolithic --config=android --config=android_arm --define=runtime=tflite --action_env ANDROID_NDK_API_LEVEL=21 --cxxopt=-std=c++11 --copt=-D_GLIBCXX_USE_C99 //native_client:libdeepspeech.so
Running on Android (Nokia 1.3, QM215 Cortex-A53 SoC):
DRX:/data/local/tmp $ LD_LIBRARY_PATH=$(pwd)/ ./deepspeech --model model_ldc93s1_16-2000.tflite --audio LDC93S1_pcms16le_1_16000.wav
TensorFlow: v2.2.0-rc3-31-ga6cee0345c
DeepSpeech: v0.7.0-30-gbb716efe
INFO: Initialized TensorFlow Lite runtime.
audio_format=1
num_channels=1
sample_rate=16000 (desired=16000)
bits_per_sample=16
res.buffer_size=93594
she had your dark suit in greasy wash water all year
Building for RPi3:
PYTHON_BIN_PATH=/usr/bin/python PYTHON_LIB_PATH=/usr/local/lib/python2.7/dist-packages TF_ENABLE_XLA=0 TF_NEED_OPENCL_SYCL=0 TF_NEED_CUDA=0 TF_NEED_ROCM=0 TF_NEED_MPI=0 TF_DOWNLOAD_CLANG=0 CC_OPT_FLAGS="-march=native -Wno-sign-compare" TF_SET_ANDROID_WORKSPACE=0 ./configure && bazel clean && bazel build -s --verbose_failures --workspace_status_command="bash native_client/bazel_workspace_status_cmd.sh" --config=monolithic --crosstool_top=@local_config_arm_compiler//:toolchain --cpu=armeabi --define=raspberry_pi_with_neon=true --host_crosstool_top=@bazel_tools//tools/cpp:toolchain --copt=-march=armv7-a --copt=-mfloat-abi=hard --copt=-mfpu=neon-fp-armv8 --copt=-DRASPBERRY_PI --copt=-D_GLIBCXX_USE_CXX11_ABI=0 --copt=-mno-unaligned-access --define=tensorflow_mkldnn_contraction_kernel=0 --define=runtime=tflite --copt=-funsafe-math-optimizations --copt=-ftree-vectorize --copt=-pipe --copt=-DTFLITE_WITH_RUY_GEMV --define=tflite_with_ruy=true -c opt --copt=-pthread --linkopt=-lpthread //native_client:libdeepspeech.so
Running (C++ binary) on RPi3:
$ ./deepspeech --model model_ldc93s1_16-2000.tflite --audio LDC93S1_pcms16le_1_16000.wav
TensorFlow: v2.2.0-rc3-31-ga6cee0345c
DeepSpeech: v0.7.0-30-gbb716efe
she had your dark suit in greasy wash water all year
Running (NodeJS binding) on RPi3:
$ ./node ~/node_modules/.bin/deepspeech --model model_ldc93s1_16-2000.tflite --audio LDC93S1_pcms16le_1_16000.wav
Loading model from file model_ldc93s1_16-2000.tflite
TensorFlow: v2.2.0-rc3-31-ga6cee0345c
DeepSpeech: v0.7.0-30-gbb716efe
static napi_value__* DeepSpeechNAPI::CreateModel(napi_env, napi_callback_info) ModelSate: 0x3287d98
static napi_value__* DeepSpeechNAPI::CreateModel(napi_env, napi_callback_info) ModelSate(int64_t): 52985240
Loaded model in 0.004686s.
Running inference.
static napi_value__* DeepSpeechNAPI::SpeechToText(napi_env, napi_callback_info) ModelSate(int64_t): 52985240
static napi_value__* DeepSpeechNAPI::SpeechToText(napi_env, napi_callback_info) ModelSate: 0x3287d98
she h yyour drk suit in greasy wash waer all year
Inference took 2.038s for 2.925s audio file.
Other info / logs
I have tested many hypotheses:
- changing the toolchain to the GCC 6.5.0 bundled by TensorFlow (we use a different one by default)
- re-writing the nodejs swig-generated wrapper with n-api, in a very basic form (see the N-API sketch after this list)
- repro on master (commit 5be613ef4f3ec2608deed653ab4815bbbcfbe7f8)
- repro on master with newer ruy (commit 808ff748e0c7dc746a413fe45fa022d63e6253e8)
- bisected tensorflow: first repro is when tflite + ruy get the ability to run threads (commit be369f57e9e46d03ccd62f1031f9dc484c1016de)
- bisected nodejs, the issue first arises in https://github.com/nodejs/node/pull/21983/commits (obviously, hard to act upon)
- repro with different model sizes (if the input size is not a multiple of 4, it works: we then somehow do not use threads, because of https://github.com/tensorflow/tensorflow/blob/2b96f3662bd776e277f86997659e61046b56c315/tensorflow/lite/kernels/internal/optimized/neon_tensor_utils.cc#L1210; see the guard sketch after this list)
- the same code and the same nodejs version run fine on ARM64 (Armbian on S905X); also excluded the SoC itself and the distro (repro under Armbian on S905X when running multilib armv7, repro on RPi3 and RPi4)
- unable to reproduce, or to get any indication of something weird happening, when running under valgrind on other platforms (valgrind on armv7/raspbian seems broken, valgrind on armv7/armbian dies because of an unsupported instruction produced by vfmaq_f32 in Eigen)
- disabling the kNeon path in ruy but keeping threads: the computation works (see the ruy sketch after this list)
- disabling threads with kNeon enabled: works
- obviously verified that the input of the model is correct (dumped mfcc vectors, input states and output logits, and verified they were different only under the nodejs runtime)
- input here: https://github.com/lissyx/DeepSpeech/blob/bb716efe1ead50fc822d4f5faf0f2fa757adb2d5/native_client/tflitemodelstate.cc#L293-L299
- output here: https://github.com/lissyx/DeepSpeech/blob/bb716efe1ead50fc822d4f5faf0f2fa757adb2d5/native_client/tflitemodelstate.cc#L308-L316
- verified by dumping the vector values (and also verified the copy function); see the dump sketch after this list
- we run several passes over the audio file, in small timesteps of 320ms, and the very first output is already broken
- no problem with the python bindings or java (android), even when running concurrent threads (c++)
- obviously also tried a debug build with no optimization at all
- the model was trained on r1.15 and used on r2.2 (we produced an r2.2-trained one and it had the same issue)
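N-API sketch: a very reduced form of the hand-written wrapper mentioned above. DS_SpeechToText and DS_FreeString are the real DeepSpeech C API, but argument handling is simplified and all error checking is omitted.

```cpp
#include <node_api.h>
#include <deepspeech.h>

static napi_value SpeechToText(napi_env env, napi_callback_info info) {
  size_t argc = 2;
  napi_value argv[2];
  napi_get_cb_info(env, info, &argc, argv, nullptr, nullptr);

  // argv[0]: external holding the ModelState*, argv[1]: 16-bit PCM buffer.
  void* model_state = nullptr;
  napi_get_value_external(env, argv[0], &model_state);

  void* buffer = nullptr;
  size_t buffer_len = 0;
  napi_get_buffer_info(env, argv[1], &buffer, &buffer_len);

  char* text = DS_SpeechToText(
      static_cast<ModelState*>(model_state),
      static_cast<const short*>(buffer),
      static_cast<unsigned int>(buffer_len / sizeof(short)));

  napi_value result;
  napi_create_string_utf8(env, text, NAPI_AUTO_LENGTH, &result);
  DS_FreeString(text);
  return result;
}

static napi_value Init(napi_env env, napi_value exports) {
  napi_value fn;
  napi_create_function(env, "SpeechToText", NAPI_AUTO_LENGTH, SpeechToText,
                       nullptr, &fn);
  napi_set_named_property(env, exports, "SpeechToText", fn);
  return exports;
}

NAPI_MODULE(NODE_GYP_MODULE_NAME, Init)
```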
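Guard sketch: the multiple-of-4 observation refers to the guard linked above in neon_tensor_utils.cc. The toy stand-in below only mirrors its shape; the function name is hypothetical.

```cpp
#include <cstdio>

// Hypothetical stand-in for the real TFLite check: the threaded ruy GEMV
// fast path is only taken when the row count is a multiple of 4.
bool TakesThreadedPath(int m_rows) {
  return (m_rows % 4) == 0;
}

int main() {
  for (int rows : {2048, 2049}) {
    std::printf("m_rows=%d -> %s\n", rows,
                TakesThreadedPath(rows) ? "threaded ruy GEMV" : "fallback");
  }
  return 0;
}
```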
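ruy sketch: one way to reproduce the kNeon experiment, written against ruy's public API as of mid-2020 (Matrix/MulParams/Context); the API was moving at the time, so treat the exact setter names as approximate. Restricting the runtime paths to kStandardCpp disables the hand-written NEON kernels while the thread pool stays active.

```cpp
#include "ruy/ruy.h"

int main() {
  const float lhs_data[] = {1, 2, 3, 4};
  const float rhs_data[] = {1, 0, 0, 1};
  float dst_data[4];

  ruy::Matrix<float> lhs;
  ruy::MakeSimpleLayout(2, 2, ruy::Order::kRowMajor, lhs.mutable_layout());
  lhs.set_data(lhs_data);
  ruy::Matrix<float> rhs;
  ruy::MakeSimpleLayout(2, 2, ruy::Order::kColMajor, rhs.mutable_layout());
  rhs.set_data(rhs_data);
  ruy::Matrix<float> dst;
  ruy::MakeSimpleLayout(2, 2, ruy::Order::kColMajor, dst.mutable_layout());
  dst.set_data(dst_data);

  ruy::Context context;
  context.set_max_num_threads(4);  // keep the thread pool active
  context.set_runtime_enabled_paths(ruy::Path::kStandardCpp);  // no NEON ASM

  ruy::MulParams<float, float> mul_params;
  ruy::Mul(lhs, rhs, mul_params, &context, &dst);
  return 0;
}
```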
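Dump sketch: the input/output dumping was along these lines, here as a self-contained sketch using the standard TFLite C++ API and assuming a single float input and output for illustration (the real code sits at the tflitemodelstate.cc links above).

```cpp
#include <cstdio>
#include <memory>

#include "tensorflow/lite/interpreter.h"
#include "tensorflow/lite/kernels/register.h"
#include "tensorflow/lite/model.h"

int main() {
  auto model = tflite::FlatBufferModel::BuildFromFile(
      "model_ldc93s1_16-2000.tflite");
  tflite::ops::builtin::BuiltinOpResolver resolver;
  std::unique_ptr<tflite::Interpreter> interpreter;
  tflite::InterpreterBuilder(*model, resolver)(&interpreter);
  interpreter->AllocateTensors();

  // Fill the first input with a deterministic pattern so two runs
  // (plain C++ binary vs NodeJS addon) are directly comparable.
  float* in = interpreter->typed_input_tensor<float>(0);
  const size_t in_count = interpreter->input_tensor(0)->bytes / sizeof(float);
  for (size_t i = 0; i < in_count; ++i) in[i] = 0.001f * (i % 1000);

  interpreter->Invoke();

  // Dump the first output values with enough digits to spot divergence.
  const float* out = interpreter->typed_output_tensor<float>(0);
  const size_t out_count =
      interpreter->output_tensor(0)->bytes / sizeof(float);
  for (size_t i = 0; i < out_count && i < 16; ++i) {
    std::printf("out[%zu] = %.9g\n", i, out[i]);
  }
  return 0;
}
```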
Current questions I am unable to answer:
- is running under NodeJS exposing a bug that we have everywhere, but that does not manifest otherwise?
- the v8 engine used by NodeJS uses both threads and NEON instructions; could that interact badly with ruy's ARM code, which also uses threads and NEON in hand-written ASM?
About this issue
- State: closed
- Created 4 years ago
- Comments: 18 (17 by maintainers)
Commits related to this issue
- Fix #39509: Invalid computations on ARMv7 running under NodeJS — committed to lissyx/tensorflow by deleted user 4 years ago
- Fix #39509: Invalid computations on ARMv7 running under NodeJS — committed to lissyx/tensorflow by lissyx 4 years ago
Thanks for the ruy PR. I am merging it now. We will also need to update TensorFlow's references to the ruy repo to point to this new commit. I'll take care of this in a few hours.
1b313682 is a 27-day-old commit.
I am preparing the update; it will move the reference to 1a8b7eab.
I hope it will be merged today.