tensorflow: hang in recursive call of google::protobuf::DescriptorPool::FindFileByName

System information

  • Have I written custom code (as opposed to using a stock example script provided in TensorFlow): no
  • OS Platform and Distribution (e.g., Linux Ubuntu 16.04): Ubuntu 16.04
  • Mobile device (e.g. iPhone 8, Pixel 2, Samsung Galaxy) if the issue happens on mobile device: no
  • TensorFlow installed from (source or binary): source
  • TensorFlow version (use command below): 2.3.0
  • Python version: 3.8.0
  • Bazel version (if compiling from source): 3.1
  • GCC/Compiler version (if compiling from source): 5.4
  • CUDA/cuDNN version: 10.1
  • GPU model and memory: -

Describe the current behavior

TensorFlow hangs at import (import tensorflow). The only output I see is:

2020-08-04 17:42:26.693993: I tensorflow/stream_executor/platform/default/dso_loader.cc:48] Successfully opened dynamic library libcudart.so.10.1

Describe the expected behavior

TensorFlow should not hang at import.

Standalone code to reproduce the issue

import tensorflow

Other info / logs

I attached GDB, and the stacktrace is interesting. I posted it here.

The relevant part:

#0  __lll_lock_wait () at ../sysdeps/unix/sysv/linux/x86_64/lowlevellock.S:135
#1  0x00007ffff7620dbd in __GI___pthread_mutex_lock (mutex=0xe6cfb0) at ../nptl/pthread_mutex_lock.c:80
#2  0x00007fffd03f5d91 in google::protobuf::DescriptorPool::FindFileByName(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&) const ()
   from /work/tools/asr/python/3.8.0_tf_2.3-v1-generic+cuda10.1/lib/python3.8/site-packages/tensorflow/python/../libtensorflow_framework.so.2
#3  0x00007fffd044328e in google::protobuf::(anonymous namespace)::AssignDescriptorsImpl(google::protobuf::internal::DescriptorTable const*) ()
   from /work/tools/asr/python/3.8.0_tf_2.3-v1-generic+cuda10.1/lib/python3.8/site-packages/tensorflow/python/../libtensorflow_framework.so.2
#4  0x00007ffff7625a99 in __pthread_once_slow (
    once_control=0x7fffd0ee5d74 <descriptor_table_google_2fprotobuf_2fdescriptor_2eproto_once>, 
    init_routine=0x7fffed339ac0 <std::__once_proxy()>) at pthread_once.c:116
#5  0x00007fffd0436d46 in google::protobuf::internal::AssignDescriptors(google::protobuf::internal::DescriptorTable const*)
    ()
   from /work/tools/asr/python/3.8.0_tf_2.3-v1-generic+cuda10.1/lib/python3.8/site-packages/tensorflow/python/../libtensorflow_framework.so.2
#6  0x00007fffd0405590 in google::protobuf::FileOptions::GetMetadata() const ()
   from /work/tools/asr/python/3.8.0_tf_2.3-v1-generic+cuda10.1/lib/python3.8/site-packages/tensorflow/python/../libtensorflow_framework.so.2
#7  0x00007fffd0461e98 in google::protobuf::Message::GetTypeName[abi:cxx11]() const ()
   from /work/tools/asr/python/3.8.0_tf_2.3-v1-generic+cuda10.1/lib/python3.8/site-packages/tensorflow/python/../libtensorflow_framework.so.2
#8  0x00007fffd050a178 in google::protobuf::(anonymous namespace)::ByteSizeConsistencyError(unsigned long, unsigned long, unsigned long, google::protobuf::MessageLite const&) [clone .constprop.38] ()
   from /work/tools/asr/python/3.8.0_tf_2.3-v1-generic+cuda10.1/lib/python3.8/site-packages/tensorflow/python/../libtensorflow_framework.so.2
#9  0x00007fffd050b61c in google::protobuf::MessageLite::AppendPartialToString(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >*) const ()
   from /work/tools/asr/python/3.8.0_tf_2.3-v1-generic+cuda10.1/lib/python3.8/site-packages/tensorflow/python/../libtensorflow_framework.so.2
#10 0x00007fffd050b92e in google::protobuf::MessageLite::SerializeAsString[abi:cxx11]() const ()
   from /work/tools/asr/python/3.8.0_tf_2.3-v1-generic+cuda10.1/lib/python3.8/site-packages/tensorflow/python/../libtensorflow_framework.so.2
#11 0x00007fffd03ecd12 in void google::protobuf::DescriptorBuilder::AllocateOptionsImpl<google::protobuf::FileDescriptor>(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, google::protobuf::FileDescriptor::OptionsType const&, google::protobuf::FileDescriptor*, std::vector<int, std::allocator<int> > const&) ()
   from /work/tools/asr/python/3.8.0_tf_2.3-v1-generic+cuda10.1/lib/python3.8/site-packages/tensorflow/python/../libtensorflow_framework.so.2
#12 0x00007fffd03ed02c in google::protobuf::DescriptorBuilder::AllocateOptions(google::protobuf::FileOptions const&, google::protobuf::FileDescriptor*) ()
   from /work/tools/asr/python/3.8.0_tf_2.3-v1-generic+cuda10.1/lib/python3.8/site-packages/tensorflow/python/../libtensorflow_framework.so.2
#13 0x00007fffd03f495b in google::protobuf::DescriptorBuilder::BuildFileImpl(google::protobuf::FileDescriptorProto const&)
    ()
   from /work/tools/asr/python/3.8.0_tf_2.3-v1-generic+cuda10.1/lib/python3.8/site-packages/tensorflow/python/../libtensorflow_framework.so.2
#14 0x00007fffd03f57de in google::protobuf::DescriptorBuilder::BuildFile(google::protobuf::FileDescriptorProto const&) ()
   from /work/tools/asr/python/3.8.0_tf_2.3-v1-generic+cuda10.1/lib/python3.8/site-packages/tensorflow/python/../libtensorflow_framework.so.2
#15 0x00007fffd03f5be5 in google::protobuf::DescriptorPool::BuildFileFromDatabase(google::protobuf::FileDescriptorProto const&) const ()
   from /work/tools/asr/python/3.8.0_tf_2.3-v1-generic+cuda10.1/lib/python3.8/site-packages/tensorflow/python/../libtensorflow_framework.so.2
#16 0x00007fffd03f5d3b in google::protobuf::DescriptorPool::TryFindFileInFallbackDatabase(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&) const ()
   from /work/tools/asr/python/3.8.0_tf_2.3-v1-generic+cuda10.1/lib/python3.8/site-packages/tensorflow/python/../libtensorflow_framework.so.2
#17 0x00007fffd03f5e34 in google::protobuf::DescriptorPool::FindFileByName(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&) const ()
   from /work/tools/asr/python/3.8.0_tf_2.3-v1-generic+cuda10.1/lib/python3.8/site-packages/tensorflow/python/../libtensorflow_framework.so.2
#18 0x00007fffd044328e in google::protobuf::(anonymous namespace)::AssignDescriptorsImpl(google::protobuf::internal::DescriptorTable const*) ()
   from /work/tools/asr/python/3.8.0_tf_2.3-v1-generic+cuda10.1/lib/python3.8/site-packages/tensorflow/python/../libtensorflow_framework.so.2
#19 0x00007ffff7625a99 in __pthread_once_slow (
    once_control=0x7fffd0ee9354 <descriptor_table_tensorflow_2fcore_2fframework_2fop_5fdef_2eproto_once>, 
    init_routine=0x7fffed339ac0 <std::__once_proxy()>) at pthread_once.c:116
#20 0x00007fffd0436d46 in google::protobuf::internal::AssignDescriptors(google::protobuf::internal::DescriptorTable const*)
    ()
   from /work/tools/asr/python/3.8.0_tf_2.3-v1-generic+cuda10.1/lib/python3.8/site-packages/tensorflow/python/../libtensorflow_framework.so.2
#21 0x00007fffd059ea10 in tensorflow::OpList::GetMetadata() const ()

I noticed that google::protobuf::DescriptorPool::FindFileByName occurs twice in this stacktrace (frames #2 and #17): the std::call_once initialization of the op_def.proto descriptor table (frame #19) ends up triggering the initialization of descriptor.proto (frame #4), which calls FindFileByName again on the same thread. If the inner FindFileByName call tries to acquire the same (non-recursive) DescriptorPool mutex that the outer call already holds, the thread deadlocks on itself — which would match the __lll_lock_wait on mutex=0xe6cfb0 at the top of the trace and explain the hang. This is the only thread running at this point, so nothing can ever release the lock.

About this issue

  • Original URL
  • State: closed
  • Created 4 years ago
  • Comments: 19 (10 by maintainers)

Most upvoted comments

@mohantym Thanks for the response.

Yes, we still need this.

The issues you linked are not related to this one.

Also, from the current information, this might be a race condition that, purely by luck, is not triggered with newer GCC versions. That would be a serious issue and should be investigated.