tensorflow: Cannot use keras estimator_from_model() in distributed cluster
System information
- Have I written custom code (as opposed to using a stock example script provided in TensorFlow): yes
- OS Platform and Distribution (e.g., Linux Ubuntu 16.04): Linux Ubuntu 16.04
- TensorFlow installed from (source or binary): binary
- TensorFlow version (use command below): tf.VERSION = 1.4.0 tf.GIT_VERSION = v1.4.0-rc1-11-g130a514
- Python version: 2.7.12
- Bazel version (if compiling from source): N/A
- GCC/Compiler version (if compiling from source): N/A
- CUDA/cuDNN version: 8.0.61/6.0.21
- GPU model and memory: NVIDIA Tesla M60 8 GB
- Exact command to reproduce: See Below
Describe the problem
When I use an estimator created by tf.keras.estimator.model_to_estimator() and train it with tf.estimator.train_and_evaluate(), it works as expected in a standalone, non-distributed session. However, in a distributed training cluster where TF_CONFIG contains the cluster information, an op is explicitly assigned to a device that is not valid in the current cluster spec.
Below is code to reproduce this issue. When simulate_cluster is set to True, an error is thrown, as shown in the log below. When simulate_cluster is set to False, the network is constructed and trained as intended. Note that the error occurs when calling tf.keras.estimator.model_to_estimator(keras_model=model), not during training; the cluster config is only set because it is required for the distributed training to take place.
The TF_CONFIG that is set below is derived from calling the code using the gcloud SDK as follows:
gcloud ml-engine local train --distributed --parameter-server-count=1 --worker-count=2 --package-path=trainer --module-name=trainer.task --
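For readers who want to vary the simulated task, the TF_CONFIG value used in the minimal example can also be built with json.dumps rather than a hand-written string. This is only a sketch: the cluster layout and the "job" entry are copied from the minimal example, while make_tf_config is a hypothetical helper name introduced here for illustration and is not part of TensorFlow or the gcloud SDK.

```python
import json
import os

# Cluster layout copied from the minimal example; the ports are whatever
# the local gcloud run happened to pick.
CLUSTER = {
    "worker": ["localhost:27184", "localhost:27185"],
    "ps": ["localhost:27183"],
    "master": ["localhost:27182"],
}


def make_tf_config(task_type, task_index):
    """Hypothetical helper: serialize a TF_CONFIG value for one task."""
    return json.dumps({
        "environment": "cloud",
        "cluster": CLUSTER,
        "job": {"args": [""], "job_name": "trainer.task"},
        "task": {"index": task_index, "type": task_type},
    })


# Simulate the master task, as in the minimal example.
os.environ["TF_CONFIG"] = make_tf_config("master", 0)

# Round-trip the value to confirm it is valid JSON.
parsed = json.loads(os.environ["TF_CONFIG"])
print(parsed["task"])
```

Building the value from a dictionary avoids the backslash line continuations inside the string literal and makes it easy to simulate a worker or ps task by changing the two arguments.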
Source code / logs
Minimal example:
import os
import numpy as np
import tensorflow as tf
tf.logging.set_verbosity(tf.logging.INFO)
simulate_cluster = True
if simulate_cluster:
    os.environ["TF_CONFIG"] = '{"environment": "cloud", "cluster": {"worker": ["localhost:27184", "localhost:27185"], \
    "ps": ["localhost:27183"], "master": ["localhost:27182"]}, "job": {"args": [""], \
    "job_name": "trainer.task"}, "task": {"index": 0, "type": "master"}}'
else:
    os.environ["TF_CONFIG"] = ''
inputs = tf.keras.layers.Input(shape=(10,))
outputs = tf.keras.layers.Dense(10)(inputs)
model = tf.keras.models.Model(inputs, outputs)
model.compile(optimizer='Adam', loss='binary_crossentropy')
est_keras = tf.keras.estimator.model_to_estimator(keras_model=model) # InvalidArgumentError thrown here if simulate_cluster is True
input_name = model.input_names[0]
data = np.random.rand(1000,10).astype(np.float32)
train_input_fn = tf.estimator.inputs.numpy_input_fn({input_name:data}, data, batch_size=10, num_epochs=None, shuffle=False)
train_spec = tf.estimator.TrainSpec(input_fn=train_input_fn, max_steps=100)
eval_spec = tf.estimator.EvalSpec(input_fn=train_input_fn, steps=10)
tf.estimator.train_and_evaluate(est_keras, train_spec, eval_spec)
InvalidArgumentError emitted when simulate_cluster = True:
Traceback (most recent call last):
File "minimal.py", line 19, in <module>
est_keras = tf.keras.estimator.model_to_estimator(keras_model=model)
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/keras/_impl/keras/estimator.py", line 280, in model_to_estimator
_save_first_checkpoint(keras_model, est, custom_objects, keras_weights)
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/keras/_impl/keras/estimator.py", line 217, in _save_first_checkpoint
model.set_weights(keras_weights)
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/keras/_impl/keras/engine/topology.py", line 766, in set_weights
K.batch_set_value(tuples)
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/keras/_impl/keras/backend.py", line 2406, in batch_set_value
get_session().run(assign_ops, feed_dict=feed_dict)
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/keras/_impl/keras/backend.py", line 376, in get_session
_initialize_variables(session)
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/keras/_impl/keras/backend.py", line 554, in _initialize_variables
[variables_module.is_variable_initialized(v) for v in candidate_vars])
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/client/session.py", line 889, in run
run_metadata_ptr)
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/client/session.py", line 1120, in _run
feed_dict_tensor, options, run_metadata)
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/client/session.py", line 1317, in _do_run
options, run_metadata)
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/client/session.py", line 1336, in _do_call
raise type(e)(node_def, op, message)
tensorflow.python.framework.errors_impl.InvalidArgumentError: Cannot assign a device for operation 'loss/dense_1_loss/sub': Operation was explicitly assigned to /job:master/task:0 but available devices are [ /job:localhost/replica:0/task:0/device:CPU:0, /job:localhost/replica:0/task:0/device:GPU:0, /job:localhost/replica:0/task:0/device:GPU:1, /job:localhost/replica:0/task:0/device:GPU:2, /job:localhost/replica:0/task:0/device:GPU:3 ]. Make sure the device specification refers to a valid device.
[[Node: loss/dense_1_loss/sub = Sub[T=DT_FLOAT, _device="/job:master/task:0"](loss/dense_1_loss/sub/x, loss/dense_1_loss/Const)]]
Caused by op u'loss/dense_1_loss/sub', defined at:
File "minimal.py", line 19, in <module>
est_keras = tf.keras.estimator.model_to_estimator(keras_model=model)
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/keras/_impl/keras/estimator.py", line 280, in model_to_estimator
_save_first_checkpoint(keras_model, est, custom_objects, keras_weights)
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/keras/_impl/keras/estimator.py", line 209, in _save_first_checkpoint
custom_objects)
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/keras/_impl/keras/estimator.py", line 124, in _clone_and_build_model
target_tensors=target_tensors)
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/keras/_impl/keras/engine/training.py", line 840, in compile
output_loss = weighted_loss(y_true, y_pred, sample_weight, mask)
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/keras/_impl/keras/engine/training.py", line 444, in weighted
score_array = fn(y_true, y_pred)
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/keras/_impl/keras/losses.py", line 78, in binary_crossentropy
return K.mean(K.binary_crossentropy(y_true, y_pred), axis=-1)
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/keras/_impl/keras/backend.py", line 3027, in binary_crossentropy
output = clip_ops.clip_by_value(output, epsilon_, 1 - epsilon_)
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/ops/math_ops.py", line 910, in r_binary_op_wrapper
return func(x, y, name=name)
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/ops/gen_math_ops.py", line 4636, in _sub
"Sub", x=x, y=y, name=name)
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/framework/op_def_library.py", line 787, in _apply_op_helper
op_def=op_def)
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/framework/ops.py", line 2956, in create_op
op_def=op_def)
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/framework/ops.py", line 1470, in __init__
self._traceback = self._graph._extract_stack() # pylint: disable=protected-access
InvalidArgumentError (see above for traceback): Cannot assign a device for operation 'loss/dense_1_loss/sub': Operation was explicitly assigned to /job:master/task:0 but available devices are [ /job:localhost/replica:0/task:0/device:CPU:0, /job:localhost/replica:0/task:0/device:GPU:0, /job:localhost/replica:0/task:0/device:GPU:1, /job:localhost/replica:0/task:0/device:GPU:2, /job:localhost/replica:0/task:0/device:GPU:3 ]. Make sure the device specification refers to a valid device.
[[Node: loss/dense_1_loss/sub = Sub[T=DT_FLOAT, _device="/job:master/task:0"](loss/dense_1_loss/sub/x, loss/dense_1_loss/Const)]]
Full logs, tf_env, and more are here: https://gist.github.com/droidicus/2abd4ddad81a1e9169a1c7a100057b15
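To make the mismatch in the error message easier to see, here is a small sketch that compares the job name in the explicit assignment with the job names of the available devices. The device strings are copied from the log above. The job_of helper and the comparison are my own simplification for illustration; this is not how TensorFlow's placer is implemented, and it only looks at the /job: field.

```python
import re

# Device strings copied from the InvalidArgumentError message.
assigned = "/job:master/task:0"
available = [
    "/job:localhost/replica:0/task:0/device:CPU:0",
    "/job:localhost/replica:0/task:0/device:GPU:0",
    "/job:localhost/replica:0/task:0/device:GPU:1",
    "/job:localhost/replica:0/task:0/device:GPU:2",
    "/job:localhost/replica:0/task:0/device:GPU:3",
]


def job_of(device):
    """Hypothetical helper: return the job name in a device string, or None."""
    match = re.search(r"/job:([^/]+)", device)
    return match.group(1) if match else None


available_jobs = {job_of(d) for d in available}

# Simplified check: the op can only be placed if its job exists in the session.
can_place = job_of(assigned) in available_jobs

print("assigned job:", job_of(assigned))
print("available jobs:", sorted(available_jobs))
print("can place:", can_place)
```

The session that Keras opens inside model_to_estimator appears to be a plain local session, so it only knows about /job:localhost devices, while the loss op carries a /job:master assignment taken from the cluster config.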
About this issue
- State: closed
- Created 7 years ago
- Reactions: 1
- Comments: 19 (10 by maintainers)
Commits related to this issue
- Don't assign device for the keras part of _saved_first_checkpoint. Fix #14504. (#17231) PiperOrigin-RevId: 186526175 — committed to tensorflow/tensorflow by yifeif 6 years ago
- Don't assign device for the keras part of _saved_first_checkpoint. Fix #14504. (#17231) PiperOrigin-RevId: 186526175 — committed to StanislawAntol/tensorflow by yifeif 6 years ago
Any update on this? Having an example that uses keras + multi gpu + model_to_estimator + distributed cluster seems, in my opinion, pretty important for anyone who wants to do production work with google cloud ml.
We are seeing this same issue with model_to_estimator, and it is currently blocking a large project.
Any update?