ComputeLibrary: Examples benchmarking problem on Raspberry Pi 3B+

Output of 'strings libarm_compute.so | grep arm_compute_version':

arm_compute_version=v0.0-unreleased Build options: {'arch': 'arm64-v8a', 'debug': '0', 'benchmark': '1', 'benchmark_tests': '1', 'opencl': '0', 'neon': '1', 'cppthreads': '1', 'Werror': '0'} Git hash=b'05e5644715c678773abaf180222a33959ee0dadb'

Platform: Raspberry Pi 3B+
Architecture: aarch64
Byte Order: Little Endian
CPU(s): 4
On-line CPU(s) list: 0-3
Thread(s) per core: 1
Core(s) per socket: 4
Socket(s): 1
Vendor ID: ARM
Model: 4
Model name: Cortex-A53
Stepping: r0p4
CPU max MHz: 1400.0000
CPU min MHz: 600.0000
BogoMIPS: 38.40
Flags: fp asimd evtstrm crc32 cpuid

Operating System: https://github.com/sakaki-/gentoo-on-rpi3-64bit

Problem description: I am using the following build command.

scons arch=arm64-v8a benchmark=1 benchmark_tests=1 opencl=0 neon=1 cppthreads=1 -j3 Werror=0
export LD_LIBRARY_PATH=build/

Benchmarking alexnet

./build/tests/benchmark_graph_alexnet --pretty-file=alexnet.txt --iterations=20 --example_args="--threads=1" --instruments="wall_clock_timer_ms"

Version = arm_compute_version=v0.0-unreleased Build options: {'arch': 'arm64-v8a', 'debug': '0', 'benchmark': '1', 'benchmark_tests': '1', 'opencl': '0', 'neon': '1', 'cppthreads': '1', 'Werror': '0'} Git hash=b'05e5644715c678773abaf180222a33959ee0dadb'
CommandLine = ./build/tests/benchmark_graph_alexnet --pretty-file=alexnet.txt --iterations=20 --example_args=--threads=4 --instruments=wall_clock_timer_ms
Iterations = 20
Running [0] 'Examples/benchmark_graph_alexnet'
Threads : 4
Target : NEON
Data type : F32
Data layout : NHWC
Tuner enabled? : false
Tuner file :
Fast math enabled? : false

  Wall clock/Wall clock time:    AVG=173.9102 ms, STDDEV=0.29 %, MIN=173.0490 ms, MAX=175.0760 ms, MEDIAN=174.0070 ms
Executed 1 test(s) (1 passed, 0 expected failures, 0 failed, 0 crashed, 0 disabled) in 8 second(s)

And here are more benchmark results.

squeezenet_v1_1:      AVG=36.8231 ms, STDDEV=15.15 %, MIN=20.9510 ms, MAX=39.5240 ms, MEDIAN=38.8860 ms
alexnet:              AVG=173.9102 ms, STDDEV=0.29 %, MIN=173.0490 ms, MAX=175.0760 ms, MEDIAN=174.0070 ms
vgg16:                AVG=1107.0216 ms, STDDEV=2.69 %, MIN=1051.7200 ms, MAX=1156.0880 ms, MEDIAN=1110.2560 ms
mobilenet_v2:         AVG=90.0225 ms, STDDEV=13.61 %, MIN=49.3990 ms, MAX=95.8760 ms, MEDIAN=94.8530 ms
resnet50:             AVG=221.2754 ms, STDDEV=7.51 %, MIN=163.8950 ms, MAX=236.7930 ms, MEDIAN=222.1410 ms
googlenet:            AVG=92.1642 ms, STDDEV=0.67 %, MIN=91.3500 ms, MAX=94.0260 ms, MEDIAN=92.1320 ms
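
For reference, the STDDEV column is the standard deviation expressed as a percentage of the mean, which is why a value like squeezenet's 15.15 % flags runs with outlier iterations. A minimal Python sketch, using hypothetical per-iteration timings, of how such a summary line can be computed:

    import statistics

    # Hypothetical per-iteration wall-clock times in ms (20 iterations)
    samples_ms = [174.0, 173.0, 175.1, 173.9, 174.2, 173.8, 174.0, 173.5,
                  174.4, 173.7, 174.1, 173.9, 175.0, 173.1, 174.0, 173.8,
                  174.3, 173.6, 174.0, 173.9]

    avg = statistics.mean(samples_ms)
    rel_stddev = statistics.stdev(samples_ms) / avg * 100  # STDDEV as a % of AVG
    print("AVG=%.4f ms, STDDEV=%.2f %%, MIN=%.4f ms, MAX=%.4f ms, MEDIAN=%.4f ms"
          % (avg, rel_stddev, min(samples_ms), max(samples_ms),
             statistics.median(samples_ms)))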

The runtime results are way faster than expected and do not seem to be real. Any ideas, please? Thanks.

Deepak

About this issue

  • State: closed
  • Created 6 years ago
  • Comments: 22 (9 by maintainers)

Most upvoted comments

I understand what you're asking, and I understand our graph examples might not be reliable. However, as explained before: for validation of layers in isolation you have our arm_compute_validation test suites; if you want system tests (i.e. entire networks), then I believe these should be provided by the official graph-level API, which in our case is ArmNN / AndroidNN. (And it should be able to load and run networks coming from other frameworks, and therefore wouldn't require the creation of a new zoo.)

I'm not saying they currently provide this kind of suite; I'm just saying I believe it would be a better place for system-level validation.

If it were simple to release the weights and images we would do it; unfortunately, from a legal point of view it is far from straightforward.

In the meantime George is going to try to reproduce the issue internally and we’ll update this thread.

@AnthonyARM You need to make it easier, not harder, for your developer community to test and benchmark your inference engine on different platforms.

Our project (https://www.bonseyes.com/outcomes/) has already contributed a Winograd convolution optimization to ARMCL (a 1.4x speed-up for MobileNetV2), and there are more improvements we can contribute; however, the testing framework of your library is limiting our ability to contribute.

Obviously, releasing your regression tests would help, rather than having the community build their own. I am not referring to building a model zoo (https://github.com/onnx/models); as you point out, nobody needs another zoo. What is missing from your framework is a regression test suite to ensure that the combination of Device + OS + Drivers + Compiler has not introduced an accuracy regression. I think it's unreasonable to expect embedded developers to know the intricacies and nuances of model training and to figure out how to interpret the output of your API. You should be targeting a wider range of developers: those who "don't know" the difference in padding between Caffe and TensorFlow models, and why the results will differ given the "same" model architecture. By the way, this is a bug in your current release.
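
To make the padding point concrete, here is a minimal, hypothetical sketch (not tied to either framework's actual code) of the two conventions for a 3x3, stride-2 convolution on a 224-pixel dimension. Caffe pads symmetrically by an explicit pad value, while TensorFlow's SAME padding derives the amount from the target output size and places any odd pixel after (bottom/right), so the "same" model can produce shifted, numerically different outputs:

    import math

    def caffe_out_size(h, k, s, pad):
        # Caffe: explicit, symmetric padding on both sides
        return (h + 2 * pad - k) // s + 1

    def tf_same_padding(h, k, s):
        # TensorFlow SAME: output size is fixed, padding is derived and
        # applied asymmetrically (the extra pixel goes after, not before)
        out = math.ceil(h / s)
        pad_total = max((out - 1) * s + k - h, 0)
        return out, pad_total // 2, pad_total - pad_total // 2

    h, k, s = 224, 3, 2
    print(caffe_out_size(h, k, s, pad=1))   # 112 (1 pixel padded on each side)
    print(tf_same_padding(h, k, s))         # (112, 0, 1): nothing before, 1 after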

Currently we are blocked on the 64-bit issue for a publication we are due to publish this month under our H2020 project: we cannot get reliable numbers for your library on ARM64, and as it stands your library is on average 25% slower than the competition (NCNN).

Please consider releasing your internal regression tests; it would allow you to improve your project faster and accept more community contributions. What we need are the following (a minimal sketch of the resulting accuracy check appears after the list):

  • The model weight files
  • The val.txt file
  • The image files
  • The expected Top-1 and Top-5 accuracy for each test
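
As a rough illustration (not ACL's actual test harness), a regression check built from those four ingredients could look like the Python sketch below; run_inference, the val.txt format, and the accuracy thresholds are hypothetical placeholders:

    import random

    def run_inference(image_path):
        # Hypothetical hook: in practice this would run the graph example on
        # the image and return the per-class score vector
        return [random.random() for _ in range(1000)]

    def top_k_hit(scores, label, k):
        # True if the ground-truth label is among the k highest-scoring classes
        top_k = sorted(range(len(scores)), key=scores.__getitem__)[-k:]
        return label in top_k

    # Assumed val.txt format: "<image_path> <label_index>" per line
    with open("val.txt") as f:
        entries = [line.split() for line in f]

    results = [(run_inference(path), int(label)) for path, label in entries]
    top1 = sum(top_k_hit(s, l, 1) for s, l in results) / len(results)
    top5 = sum(top_k_hit(s, l, 5) for s, l in results) / len(results)
    print("Top-1: %.4f, Top-5: %.4f" % (top1, top5))
    assert top1 >= 0.55 and top5 >= 0.79, "accuracy regression"  # hypothetical expected values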


Thanks, Tim

Hello @bonseyes-admin, I have created this patch: https://review.mlplatform.org/#/c/ml/ComputeLibrary/+/390/ Can you check whether it solves any of the reliability issues you are seeing?