tensorflow: tf.keras.estimator.model_to_estimator crashes with "Only TensorFlow native optimizers are supported with DistributionStrategy" when using tf.distribute.MirroredStrategy()
System information
- Have I written custom code (as opposed to using a stock example script provided in TensorFlow):
- OS Platform and Distribution (e.g., Linux Ubuntu 16.04): Mac OS
- Mobile device (e.g. iPhone 8, Pixel 2, Samsung Galaxy) if the issue happens on mobile device:
- TensorFlow installed from (source or binary):
- TensorFlow version (use command below): 2.0.0-alpha0
- Python version: 3.6
- Bazel version (if compiling from source):
- GCC/Compiler version (if compiling from source):
- CUDA/cuDNN version:
- GPU model and memory:
Describe the current behavior
I am using a Keras model with TF 2.0. I convert the model to an estimator with tf.keras.estimator.model_to_estimator, and estimator_train_model.train(...) then crashes when using tf.distribute.MirroredStrategy() (after migrating the code to TF 2.0, of course). It works fine when the strategy is None.
I tried to follow the instructions here: https://www.tensorflow.org/alpha/guide/distribute_strategy

Describe the expected behavior
The same code was working with TF 1.x.

Code to reproduce the issue
A work-in-progress notebook can be found here: http://localhost:8888/notebooks/proj_DL_models_and_pipelines_with_GCP/notebook/TF_2.0/08-Mnist_keras_estimator.ipynb
I am using a very basic Keras model:

```python
optimiser = tf.keras.optimizers.Adam(lr=0.01, beta_1=0.9, epsilon=1e-07)

# Compile model
model.compile(loss='categorical_crossentropy',
              optimizer=optimiser,
              metrics=['accuracy'])

strategy = None  # working
#strategy = tf.distribute.MirroredStrategy()  # crashing with TF 2.0 but working with TF 1.X

# Configure tf.estimator to use the given strategy
training_config = tf.estimator.RunConfig(train_distribute=strategy,
                                         model_dir=FLAGS.model_dir,
                                         save_summary_steps=1,
                                         save_checkpoints_steps=100,
                                         keep_checkpoint_max=3,
                                         log_step_count_steps=10)

# Transform the Keras model into an estimator
estimator_train_model = tf.keras.estimator.model_to_estimator(keras_model=model_opt_tf,
                                                              config=training_config)

# Fit the model (using estimator.train and tf.data.Dataset)
estimator_train_model.train(
    input_fn=lambda: mnist_v1.input_mnist_tfrecord_dataset_fn(path_train_tfrecords + '*',
                                                              FLAGS,
                                                              mode=tf.estimator.ModeKeys.TRAIN,
                                                              batch_size=FLAGS.batch_size),
    steps=1000)
```
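For completeness: mnist_v1.input_mnist_tfrecord_dataset_fn is a helper in my own project and is not reproduced here. Roughly, it is a tf.data-based input_fn along the lines of the sketch below (the feature keys, shapes and parsing are illustrative placeholders, not the real helper):

```python
import tensorflow as tf

def input_mnist_tfrecord_dataset_fn(file_pattern, flags, mode, batch_size):
    """Illustrative input_fn: reads TFRecord files and returns a (features, labels) dataset."""
    # flags is accepted only to mirror the call above; it is unused in this sketch.
    # Placeholder feature spec -- the real records may use different keys and shapes.
    feature_spec = {
        'image': tf.io.FixedLenFeature([28 * 28], tf.float32),
        'label': tf.io.FixedLenFeature([10], tf.float32),
    }

    def _parse(serialized):
        parsed = tf.io.parse_single_example(serialized, feature_spec)
        # The dict key must match the Keras input layer name for model_to_estimator.
        return {'image': parsed['image']}, parsed['label']

    dataset = tf.data.Dataset.list_files(file_pattern)
    dataset = dataset.interleave(tf.data.TFRecordDataset, cycle_length=4)
    dataset = dataset.map(_parse, num_parallel_calls=tf.data.experimental.AUTOTUNE)
    if mode == tf.estimator.ModeKeys.TRAIN:
        dataset = dataset.shuffle(10000).repeat()
    return dataset.batch(batch_size).prefetch(1)
```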
Other info / logs

```
I0409 22:06:27.293305 4531660224 model.py:211] input_dataset_fn: TRAIN, train
I0409 22:06:27.469371 123145492971520 estimator.py:1126] Calling model_fn.
I0409 22:06:27.475003 123145492971520 coordinator.py:219] Error reported to Coordinator: Only TensorFlow native optimizers are supported with DistributionStrategy.
Traceback (most recent call last):
  File "/Users/tarrade/anaconda3/envs/env_gcp_dl_2_0_alpha/lib/python3.6/site-packages/tensorflow/python/training/coordinator.py", line 297, in stop_on_exception
    yield
  File "/Users/tarrade/anaconda3/envs/env_gcp_dl_2_0_alpha/lib/python3.6/site-packages/tensorflow/python/distribute/mirrored_strategy.py", line 882, in run
    self.main_result = self.main_fn(*self.main_args, **self.main_kwargs)
  File "/Users/tarrade/anaconda3/envs/env_gcp_dl_2_0_alpha/lib/python3.6/site-packages/tensorflow_estimator/python/estimator/estimator.py", line 1127, in _call_model_fn
    model_fn_results = self._model_fn(features=features, **kwargs)
  File "/Users/tarrade/anaconda3/envs/env_gcp_dl_2_0_alpha/lib/python3.6/site-packages/tensorflow_estimator/python/estimator/keras.py", line 278, in model_fn
    raise ValueError('Only TensorFlow native optimizers are supported with '
ValueError: Only TensorFlow native optimizers are supported with DistributionStrategy.
```

```
ValueError                                Traceback (most recent call last)
<timed eval> in <module>

~/anaconda3/envs/env_gcp_dl_2_0_alpha/lib/python3.6/site-packages/tensorflow_estimator/python/estimator/estimator.py in train(self, input_fn, hooks, steps, max_steps, saving_listeners)
    357
    358     saving_listeners = _check_listeners_type(saving_listeners)
--> 359     loss = self._train_model(input_fn, hooks, saving_listeners)
    360     logging.info('Loss for final step: %s.', loss)
    361     return self

~/anaconda3/envs/env_gcp_dl_2_0_alpha/lib/python3.6/site-packages/tensorflow_estimator/python/estimator/estimator.py in _train_model(self, input_fn, hooks, saving_listeners)
   1135   def _train_model(self, input_fn, hooks, saving_listeners):
   1136     if self._train_distribution:
--> 1137      return self._train_model_distributed(input_fn, hooks, saving_listeners)
   1138     else:
   1139       return self._train_model_default(input_fn, hooks, saving_listeners)

~/anaconda3/envs/env_gcp_dl_2_0_alpha/lib/python3.6/site-packages/tensorflow_estimator/python/estimator/estimator.py in _train_model_distributed(self, input_fn, hooks, saving_listeners)
   1198       self._config._train_distribute.configure(self._config.session_config)
   1199     return self._actual_train_model_distributed(
--> 1200        self._config._train_distribute, input_fn, hooks, saving_listeners)
   1201     # pylint: enable=protected-access
   1202

~/anaconda3/envs/env_gcp_dl_2_0_alpha/lib/python3.6/site-packages/tensorflow_estimator/python/estimator/estimator.py in _actual_train_model_distributed(self, strategy, input_fn, hooks, saving_listeners)
   1267               labels,  # although this will be None it seems
   1268               ModeKeys.TRAIN,
--> 1269              self.config))
   1270       loss = strategy.reduce(reduce_util.ReduceOp.SUM,
   1271                              grouped_estimator_spec.loss)

~/anaconda3/envs/env_gcp_dl_2_0_alpha/lib/python3.6/site-packages/tensorflow/python/distribute/distribute_lib.py in call_for_each_replica(self, fn, args, kwargs)
   1076     if kwargs is None:
   1077       kwargs = {}
--> 1078      return self._call_for_each_replica(fn, args, kwargs)
   1079
   1080   def _call_for_each_replica(self, fn, args, kwargs):

~/anaconda3/envs/env_gcp_dl_2_0_alpha/lib/python3.6/site-packages/tensorflow/python/distribute/mirrored_strategy.py in _call_for_each_replica(self, fn, args, kwargs)
    663   def _call_for_each_replica(self, fn, args, kwargs):
    664     return _call_for_each_replica(self._container_strategy(), self._device_map,
--> 665                                   fn, args, kwargs)
    666
    667   def _configure(self,

~/anaconda3/envs/env_gcp_dl_2_0_alpha/lib/python3.6/site-packages/tensorflow/python/distribute/mirrored_strategy.py in _call_for_each_replica(distribution, device_map, fn, args, kwargs)
    191     for t in threads:
    192       t.should_run.set()
--> 193     coord.join(threads)
    194
    195   return values.regroup(device_map, tuple(t.main_result for t in threads))

~/anaconda3/envs/env_gcp_dl_2_0_alpha/lib/python3.6/site-packages/tensorflow/python/training/coordinator.py in join(self, threads, stop_grace_period_secs, ignore_live_threads)
    387       self._registered_threads = set()
    388       if self._exc_info_to_raise:
--> 389         six.reraise(*self._exc_info_to_raise)
    390       elif stragglers:
    391         if ignore_live_threads:

~/anaconda3/envs/env_gcp_dl_2_0_alpha/lib/python3.6/site-packages/six.py in reraise(tp, value, tb)
    691             if value.__traceback__ is not tb:
    692                 raise value.with_traceback(tb)
--> 693             raise value
    694         finally:
    695             value = None

~/anaconda3/envs/env_gcp_dl_2_0_alpha/lib/python3.6/site-packages/tensorflow/python/training/coordinator.py in stop_on_exception(self)
    295     """
    296     try:
--> 297       yield
    298     except:  # pylint: disable=bare-except
    299       self.request_stop(ex=sys.exc_info())

~/anaconda3/envs/env_gcp_dl_2_0_alpha/lib/python3.6/site-packages/tensorflow/python/distribute/mirrored_strategy.py in run(self)
    880             self._captured_var_scope, reuse=self.replica_id > 0),
    881             variable_scope.variable_creator_scope(self.variable_creator_fn):
--> 882           self.main_result = self.main_fn(*self.main_args, **self.main_kwargs)
    883         self.done = True
    884       finally:

~/anaconda3/envs/env_gcp_dl_2_0_alpha/lib/python3.6/site-packages/tensorflow_estimator/python/estimator/estimator.py in _call_model_fn(self, features, labels, mode, config)
   1125
   1126     logging.info('Calling model_fn.')
--> 1127    model_fn_results = self._model_fn(features=features, **kwargs)
   1128     logging.info('Done calling model_fn.')
   1129

~/anaconda3/envs/env_gcp_dl_2_0_alpha/lib/python3.6/site-packages/tensorflow_estimator/python/estimator/keras.py in model_fn(features, labels, mode)
    276         not isinstance(keras_model.optimizer,
    277                        (tf_optimizer_module.Optimizer, optimizers.TFOptimizer)):
--> 278       raise ValueError('Only TensorFlow native optimizers are supported with '
    279                        'DistributionStrategy.')
    280

ValueError: Only TensorFlow native optimizers are supported with DistributionStrategy.
```
About this issue
- State: closed
- Created 5 years ago
- Comments: 16 (4 by maintainers)
This appears to be a real issue. We flagged it internally and someone will hopefully get to it soon. Thanks for reporting.
If you are curious about fixing it, my guess is that
keras.optimizer_v2.OptimizerV2
needs to be added to the set of optimizer classes accepted right where that ValueError is thrown. It should work.

For anyone facing this issue, you can simply change the optimizer to one from tf.train instead. It is compatible with tf.keras code (worked for me on 1.13).
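A minimal sketch of that workaround, assuming model is the compiled Keras model from the reproduction snippet above and the TF 1.13-style tf.train API:

```python
import tensorflow as tf

# Workaround (TF 1.13): use a tf.train optimizer, which the Keras-to-estimator
# model_fn accepts under a DistributionStrategy, instead of tf.keras.optimizers.Adam.
optimiser = tf.train.AdamOptimizer(learning_rate=0.01, beta1=0.9, epsilon=1e-07)

model.compile(loss='categorical_crossentropy',
              optimizer=optimiser,
              metrics=['accuracy'])

# The rest (RunConfig with train_distribute, model_to_estimator, train) stays as in
# the reproduction snippet above.
```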
I found this post helpful on the subject:
https://medium.com/tensorflow/multi-gpu-training-with-estimators-tf-keras-and-tf-data-ba584c3134db
The fix is in estimator: https://github.com/tensorflow/estimator/commit/818c6163ba759dc257b7c27421bc46ef8259e217#diff-93ed941ec864a6ef78bd656293c99cea
Sometimes the estimator pip package can be stale. Would you mind updating the estimator package directly (e.g. pip install -U tf-estimator-nightly) and testing again?
https://pypi.org/project/tf-estimator-nightly/
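For reference, the idea behind the fix, going by the check visible in keras.py's model_fn in the traceback above, is to also accept OptimizerV2-based optimizers (i.e. tf.keras.optimizers.*). A rough sketch of the relaxed check, not the actual diff; the helper name is made up, and the module paths are the ones shown in the traceback:

```python
from tensorflow.python.keras import optimizers
from tensorflow.python.keras.optimizer_v2 import optimizer_v2
from tensorflow.python.training import optimizer as tf_optimizer_module

def _optimizer_supported_with_distribution_strategy(keras_optimizer):
    # Previously only tf.train Optimizer subclasses and TFOptimizer wrappers passed
    # this check; the fix also lets OptimizerV2 (the tf.keras.optimizers classes) through.
    return isinstance(keras_optimizer,
                      (tf_optimizer_module.Optimizer,
                       optimizers.TFOptimizer,
                       optimizer_v2.OptimizerV2))
```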
I have the same problem. On Colab, this is my code: https://colab.research.google.com/drive/1mf-PK0a20CkObnT0hCl9VPEje1szhHat#scrollTo=MMbPOC3f5ku3