tensorflow: tflite GPU delegate create and load model with the V2 API is very slow compared with the V1 API (10x). Why?
System information
- Have I written custom code (as opposed to using a stock example script provided in TensorFlow):
- OS Platform and Distribution (e.g., Linux Ubuntu 16.04): android
- Mobile device (e.g. iPhone 8, Pixel 2, Samsung Galaxy) if the issue happens on mobile device:
- TensorFlow installed from (source or binary):
- TensorFlow version (use command below): 2.2 rc2
- Python version:
- Bazel version (if compiling from source):
- GCC/Compiler version (if compiling from source):
- CUDA/cuDNN version:
- GPU model and memory:
I compiled tflite.a (2.2 rc2) from source and use the NDK C++ API to run the tflite model as follows:

```cpp
#ifdef V2
  // V2 GPU delegate
  TfLiteGpuDelegateOptionsV2 tOptions = TfLiteGpuDelegateOptionsV2Default();
  if (m_bGPUAllowFP16) {
    tOptions.is_precision_loss_allowed = 1;  // allow FP16
  }
  // 1 == TFLITE_GPU_INFERENCE_PREFERENCE_SUSTAINED_SPEED
  tOptions.inference_preference = 1;
  m_pGPUDelegate = TfLiteGpuDelegateV2Create(&tOptions);
#else
  // V1 (OpenGL) GPU delegate
  TfLiteGpuDelegateOptions tOptions = {
      .metadata = nullptr,
      .compile_options = {
          .precision_loss_allowed = 0,
          .preferred_gl_object_type = TFLITE_GL_OBJECT_TYPE_FASTEST,
          .dynamic_batch_enabled = 0,
      },
  };
  if (m_bGPUAllowFP16) {
    tOptions.compile_options.precision_loss_allowed = 1;  // allow FP16
  }
  m_pGPUDelegate = TfLiteGpuDelegateCreate(&tOptions);
#endif

  auto iRetCode = m_pInterp->ModifyGraphWithDelegate(m_pGPUDelegate);
  if (iRetCode != kTfLiteOk)
  {
    return -1;
  }
```
However, the timing is very different: the V1 load time is only about 10% of the V2 load time. The model contains a Conv2DTranspose op, and with the V1 API the inference time is about 4x that of the V2 API. Why is there such a performance difference? (A small timing sketch is shown below.)
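Not part of the original report, but a minimal sketch of how one could time delegate creation and `ModifyGraphWithDelegate` separately, to see which step accounts for the 10x gap. It assumes the same members (`m_pInterp`, `m_pGPUDelegate`) as above; `TimedCall` is a hypothetical helper, not a TFLite API:

```cpp
#include <chrono>
#include <android/log.h>

// Hypothetical helper: runs a callable and logs its wall-clock duration.
template <typename Fn>
auto TimedCall(const char* tag, Fn&& fn) {
  const auto start = std::chrono::steady_clock::now();
  auto result = fn();
  const auto end = std::chrono::steady_clock::now();
  const auto ms =
      std::chrono::duration_cast<std::chrono::milliseconds>(end - start).count();
  __android_log_print(ANDROID_LOG_INFO, "tflite-timing", "%s took %lld ms",
                      tag, static_cast<long long>(ms));
  return result;
}

// Usage inside the init path shown above:
//   m_pGPUDelegate = TimedCall("TfLiteGpuDelegateV2Create", [&] {
//     return TfLiteGpuDelegateV2Create(&tOptions);
//   });
//   auto iRetCode = TimedCall("ModifyGraphWithDelegate", [&] {
//     return m_pInterp->ModifyGraphWithDelegate(m_pGPUDelegate);
//   });
```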
About this issue
- Original URL
- State: closed
- Created 4 years ago
- Comments: 34 (14 by maintainers)
Is there a way to precompile and cache the OpenCL kernels? Maybe as part of the tflite export? Or maybe the GPU delegate can cache the kernels locally on the device for subsequent runs?
+1
@impjdi hi impjdi, could you give an example? It would be very useful for saving init time!! My code is as follows, but I don't know how to use the Encode API in serialization.h.
Thanks!
While this functionality is present (it was added only about a month ago), it doesn't follow the usual paths of the GPU delegate, and there is a lot more plumbing involved in wiring it up. You can take a look at
tensorflow/lite/delegates/gpu/cl/serialization.h
and experiment with it (sorry, no official documentation or support). Note that the generated cache binary is not universal, i.e. it differs by mobile vendor, device, OS version, and GPU driver. So for each new model, you would have to run and generate it once on that particular user device and store it. You also need to know when to invalidate the cache after a new OS version, a new GPU driver, or a change to your ML model, so you'll need quite a lot of logic around this.
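As a side note (not part of this thread's TF 2.2 setup), later TensorFlow Lite releases exposed on-device kernel caching through the V2 delegate options themselves, which avoids calling serialization.h directly. A minimal sketch, assuming a TFLite version that ships the serialization fields; the cache directory and model token values are placeholders, and the cache is still specific to the device, driver, and model, as described above:

```cpp
TfLiteGpuDelegateOptionsV2 options = TfLiteGpuDelegateOptionsV2Default();
// Enable on-device caching of compiled GPU programs (later TFLite versions only).
options.experimental_flags |= TFLITE_GPU_EXPERIMENTAL_FLAGS_ENABLE_SERIALIZATION;
// App-writable directory where the delegate stores the compiled program cache.
options.serialization_dir = "/data/data/com.example.app/cache";  // placeholder path
// Token identifying the model; change it when the model changes so a stale
// cache is not reused.
options.model_token = "my_model_v1";  // placeholder token
TfLiteDelegate* delegate = TfLiteGpuDelegateV2Create(&options);
```

With this enabled, the first run still pays the full compilation cost, but subsequent runs on the same device can reuse the cached programs and initialize much faster.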