tensorflow: Cannot use keras estimator_from_model() in distributed cluster

System information

Have I written custom code (as opposed to using a stock example script provided in TensorFlow): yes
OS Platform and Distribution (e.g., Linux Ubuntu 16.04): Linux Ubuntu 16.04
TensorFlow installed from (source or binary): binary
TensorFlow version (use command below): tf.VERSION = 1.4.0 tf.GIT_VERSION = v1.4.0-rc1-11-g130a514
Python version: 2.7.12
Bazel version (if compiling from source): N/A
GCC/Compiler version (if compiling from source): N/A
CUDA/cuDNN version: 8.0.61/6.0.21
GPU model and memory: NVIDIA Tesla M60 8 GB
Exact command to reproduce: See Below

Describe the problem

When trying to use an estimator that is derived from tf.keras.estimator.estimator_from_model() and training with tf.estimator.train_and_evaluate(), it will work as expected if in a standalone non-distributed session. However, when in a distributed training cluster and the TF_CONFIG has the cluster information set, there is a an explicit device assignment of an op to a device that is not valid in the current cluster spec.

Below is code to reproduce this issue. When simulate_cluster is set to True an error is throws as shown in the log below. When simulate_cluster is set to False the network is constructed and trained as intended. It should be noted that the error occurs when calling tf.keras.estimator.model_to_estimator(keras_model=model) and not when doing the training, the cluster config is required for the distributed training to take place.

The TF_CONFIG that is set below is derived from calling the code using the gcloud SDK as follows: gcloud ml-engine local train --distributed --parameter-server-count=1 --worker-count=2 --package-path=trainer --module-name=trainer.task --

Source code / logs

Minimal example:

import os
import numpy as np
import tensorflow as tf

tf.logging.set_verbosity(tf.logging.INFO)

simulate_cluster = True
if simulate_cluster:
    os.environ["TF_CONFIG"] = '{"environment": "cloud", "cluster": {"worker": ["localhost:27184", "localhost:27185"], \
               "ps": ["localhost:27183"], "master": ["localhost:27182"]}, "job": {"args": [""], \
               "job_name": "trainer.task"}, "task": {"index": 0, "type": "master"}}'
else:
    os.environ["TF_CONFIG"] = ''

inputs = tf.keras.layers.Input(shape=(10,))
outputs = tf.keras.layers.Dense(10)(inputs)
model = tf.keras.models.Model(inputs, outputs)
model.compile(optimizer='Adam', loss='binary_crossentropy')
est_keras = tf.keras.estimator.model_to_estimator(keras_model=model) # InvalidArgumentError thrown here if simulate_cluster is True

input_name = model.input_names[0]
data = np.random.rand(1000,10).astype(np.float32)
train_input_fn = tf.estimator.inputs.numpy_input_fn({input_name:data}, data, batch_size=10, num_epochs=None, shuffle=False)

train_spec = tf.estimator.TrainSpec(input_fn=train_input_fn, max_steps=100)
eval_spec = tf.estimator.EvalSpec(input_fn=train_input_fn, steps=10)
tf.estimator.train_and_evaluate(est_keras, train_spec, eval_spec)

InvalidArgumentError emitted when simulate_cluster = True:

Traceback (most recent call last):
  File "minimal.py", line 19, in <module>
    est_keras = tf.keras.estimator.model_to_estimator(keras_model=model)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/keras/_impl/keras/estimator.py", line 280, in model_to_estimator
    _save_first_checkpoint(keras_model, est, custom_objects, keras_weights)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/keras/_impl/keras/estimator.py", line 217, in _save_first_checkpoint
    model.set_weights(keras_weights)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/keras/_impl/keras/engine/topology.py", line 766, in set_weights
    K.batch_set_value(tuples)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/keras/_impl/keras/backend.py", line 2406, in batch_set_value
    get_session().run(assign_ops, feed_dict=feed_dict)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/keras/_impl/keras/backend.py", line 376, in get_session
    _initialize_variables(session)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/keras/_impl/keras/backend.py", line 554, in _initialize_variables
    [variables_module.is_variable_initialized(v) for v in candidate_vars])
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/client/session.py", line 889, in run
    run_metadata_ptr)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/client/session.py", line 1120, in _run
    feed_dict_tensor, options, run_metadata)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/client/session.py", line 1317, in _do_run
    options, run_metadata)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/client/session.py", line 1336, in _do_call
    raise type(e)(node_def, op, message)
tensorflow.python.framework.errors_impl.InvalidArgumentError: Cannot assign a device for operation 'loss/dense_1_loss/sub': Operation was explicitly assigned to /job:master/task:0 but available devices are [ /job:localhost/replica:0/task:0/device:CPU:0, /job:localhost/replica:0/task:0/device:GPU:0, /job:localhost/replica:0/task:0/device:GPU:1, /job:localhost/replica:0/task:0/device:GPU:2, /job:localhost/replica:0/task:0/device:GPU:3 ]. Make sure the device specification refers to a valid device.
         [[Node: loss/dense_1_loss/sub = Sub[T=DT_FLOAT, _device="/job:master/task:0"](loss/dense_1_loss/sub/x, loss/dense_1_loss/Const)]]

Caused by op u'loss/dense_1_loss/sub', defined at:
  File "minimal.py", line 19, in <module>
    est_keras = tf.keras.estimator.model_to_estimator(keras_model=model)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/keras/_impl/keras/estimator.py", line 280, in model_to_estimator
    _save_first_checkpoint(keras_model, est, custom_objects, keras_weights)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/keras/_impl/keras/estimator.py", line 209, in _save_first_checkpoint
    custom_objects)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/keras/_impl/keras/estimator.py", line 124, in _clone_and_build_model
    target_tensors=target_tensors)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/keras/_impl/keras/engine/training.py", line 840, in compile
    output_loss = weighted_loss(y_true, y_pred, sample_weight, mask)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/keras/_impl/keras/engine/training.py", line 444, in weighted
    score_array = fn(y_true, y_pred)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/keras/_impl/keras/losses.py", line 78, in binary_crossentropy
    return K.mean(K.binary_crossentropy(y_true, y_pred), axis=-1)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/keras/_impl/keras/backend.py", line 3027, in binary_crossentropy
    output = clip_ops.clip_by_value(output, epsilon_, 1 - epsilon_)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/ops/math_ops.py", line 910, in r_binary_op_wrapper
    return func(x, y, name=name)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/ops/gen_math_ops.py", line 4636, in _sub
    "Sub", x=x, y=y, name=name)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/framework/op_def_library.py", line 787, in _apply_op_helper
    op_def=op_def)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/framework/ops.py", line 2956, in create_op
    op_def=op_def)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/framework/ops.py", line 1470, in __init__
    self._traceback = self._graph._extract_stack()  # pylint: disable=protected-access

InvalidArgumentError (see above for traceback): Cannot assign a device for operation 'loss/dense_1_loss/sub': Operation was explicitly assigned to /job:master/task:0 but available devices are [ /job:localhost/replica:0/task:0/device:CPU:0, /job:localhost/replica:0/task:0/device:GPU:0, /job:localhost/replica:0/task:0/device:GPU:1, /job:localhost/replica:0/task:0/device:GPU:2, /job:localhost/replica:0/task:0/device:GPU:3 ]. Make sure the device specification refers to a valid device.
         [[Node: loss/dense_1_loss/sub = Sub[T=DT_FLOAT, _device="/job:master/task:0"](loss/dense_1_loss/sub/x, loss/dense_1_loss/Const)]]

Full logs, tf_env, and more are here: https://gist.github.com/droidicus/2abd4ddad81a1e9169a1c7a100057b15

About this issue

Original URL
State: closed
Created 7 years ago
Reactions: 1
Comments: 19 (10 by maintainers)

Commits related to this issue

Don't assign device for the keras part of _saved_first_checkpoint. Fix #14504. (#17231) PiperOrigin-RevId: 186526175 — committed to tensorflow/tensorflow by yifeif 6 years ago
Don't assign device for the keras part of _saved_first_checkpoint. Fix #14504. (#17231) PiperOrigin-RevId: 186526175 — committed to StanislawAntol/tensorflow by yifeif 6 years ago

Most upvoted comments

Any update on this? Having an example that use keras + multi gpu + model_to_estimator + distributed cluster seems in my opinion pretty important for anyone the want to do production work with google cloud ml.

zippeurfou on Mar 7, 2018

We are seeing this same issue with model_to_estimator, and it is currently blocking a large project.

Any update?

djrut on Jan 31, 2018