tensorflow: Cannot use keras estimator_from_model() in distributed cluster

System information

  • Have I written custom code (as opposed to using a stock example script provided in TensorFlow): yes
  • OS Platform and Distribution (e.g., Linux Ubuntu 16.04): Linux Ubuntu 16.04
  • TensorFlow installed from (source or binary): binary
  • TensorFlow version (use command below): tf.VERSION = 1.4.0 tf.GIT_VERSION = v1.4.0-rc1-11-g130a514
  • Python version: 2.7.12
  • Bazel version (if compiling from source): N/A
  • GCC/Compiler version (if compiling from source): N/A
  • CUDA/cuDNN version: 8.0.61/6.0.21
  • GPU model and memory: NVIDIA Tesla M60 8 GB
  • Exact command to reproduce: See Below

Describe the problem

When trying to use an estimator that is derived from tf.keras.estimator.estimator_from_model() and training with tf.estimator.train_and_evaluate(), it will work as expected if in a standalone non-distributed session. However, when in a distributed training cluster and the TF_CONFIG has the cluster information set, there is a an explicit device assignment of an op to a device that is not valid in the current cluster spec.

Below is code to reproduce this issue. When simulate_cluster is set to True an error is throws as shown in the log below. When simulate_cluster is set to False the network is constructed and trained as intended. It should be noted that the error occurs when calling tf.keras.estimator.model_to_estimator(keras_model=model) and not when doing the training, the cluster config is required for the distributed training to take place.

The TF_CONFIG that is set below is derived from calling the code using the gcloud SDK as follows: gcloud ml-engine local train --distributed --parameter-server-count=1 --worker-count=2 --package-path=trainer --module-name=trainer.task --

Source code / logs

Minimal example:

import os
import numpy as np
import tensorflow as tf

tf.logging.set_verbosity(tf.logging.INFO)

simulate_cluster = True
if simulate_cluster:
    os.environ["TF_CONFIG"] = '{"environment": "cloud", "cluster": {"worker": ["localhost:27184", "localhost:27185"], \
               "ps": ["localhost:27183"], "master": ["localhost:27182"]}, "job": {"args": [""], \
               "job_name": "trainer.task"}, "task": {"index": 0, "type": "master"}}'
else:
    os.environ["TF_CONFIG"] = ''

inputs = tf.keras.layers.Input(shape=(10,))
outputs = tf.keras.layers.Dense(10)(inputs)
model = tf.keras.models.Model(inputs, outputs)
model.compile(optimizer='Adam', loss='binary_crossentropy')
est_keras = tf.keras.estimator.model_to_estimator(keras_model=model) # InvalidArgumentError thrown here if simulate_cluster is True

input_name = model.input_names[0]
data = np.random.rand(1000,10).astype(np.float32)
train_input_fn = tf.estimator.inputs.numpy_input_fn({input_name:data}, data, batch_size=10, num_epochs=None, shuffle=False)

train_spec = tf.estimator.TrainSpec(input_fn=train_input_fn, max_steps=100)
eval_spec = tf.estimator.EvalSpec(input_fn=train_input_fn, steps=10)
tf.estimator.train_and_evaluate(est_keras, train_spec, eval_spec)

InvalidArgumentError emitted when simulate_cluster = True:

Traceback (most recent call last):
  File "minimal.py", line 19, in <module>
    est_keras = tf.keras.estimator.model_to_estimator(keras_model=model)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/keras/_impl/keras/estimator.py", line 280, in model_to_estimator
    _save_first_checkpoint(keras_model, est, custom_objects, keras_weights)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/keras/_impl/keras/estimator.py", line 217, in _save_first_checkpoint
    model.set_weights(keras_weights)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/keras/_impl/keras/engine/topology.py", line 766, in set_weights
    K.batch_set_value(tuples)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/keras/_impl/keras/backend.py", line 2406, in batch_set_value
    get_session().run(assign_ops, feed_dict=feed_dict)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/keras/_impl/keras/backend.py", line 376, in get_session
    _initialize_variables(session)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/keras/_impl/keras/backend.py", line 554, in _initialize_variables
    [variables_module.is_variable_initialized(v) for v in candidate_vars])
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/client/session.py", line 889, in run
    run_metadata_ptr)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/client/session.py", line 1120, in _run
    feed_dict_tensor, options, run_metadata)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/client/session.py", line 1317, in _do_run
    options, run_metadata)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/client/session.py", line 1336, in _do_call
    raise type(e)(node_def, op, message)
tensorflow.python.framework.errors_impl.InvalidArgumentError: Cannot assign a device for operation 'loss/dense_1_loss/sub': Operation was explicitly assigned to /job:master/task:0 but available devices are [ /job:localhost/replica:0/task:0/device:CPU:0, /job:localhost/replica:0/task:0/device:GPU:0, /job:localhost/replica:0/task:0/device:GPU:1, /job:localhost/replica:0/task:0/device:GPU:2, /job:localhost/replica:0/task:0/device:GPU:3 ]. Make sure the device specification refers to a valid device.
         [[Node: loss/dense_1_loss/sub = Sub[T=DT_FLOAT, _device="/job:master/task:0"](loss/dense_1_loss/sub/x, loss/dense_1_loss/Const)]]

Caused by op u'loss/dense_1_loss/sub', defined at:
  File "minimal.py", line 19, in <module>
    est_keras = tf.keras.estimator.model_to_estimator(keras_model=model)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/keras/_impl/keras/estimator.py", line 280, in model_to_estimator
    _save_first_checkpoint(keras_model, est, custom_objects, keras_weights)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/keras/_impl/keras/estimator.py", line 209, in _save_first_checkpoint
    custom_objects)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/keras/_impl/keras/estimator.py", line 124, in _clone_and_build_model
    target_tensors=target_tensors)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/keras/_impl/keras/engine/training.py", line 840, in compile
    output_loss = weighted_loss(y_true, y_pred, sample_weight, mask)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/keras/_impl/keras/engine/training.py", line 444, in weighted
    score_array = fn(y_true, y_pred)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/keras/_impl/keras/losses.py", line 78, in binary_crossentropy
    return K.mean(K.binary_crossentropy(y_true, y_pred), axis=-1)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/keras/_impl/keras/backend.py", line 3027, in binary_crossentropy
    output = clip_ops.clip_by_value(output, epsilon_, 1 - epsilon_)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/ops/math_ops.py", line 910, in r_binary_op_wrapper
    return func(x, y, name=name)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/ops/gen_math_ops.py", line 4636, in _sub
    "Sub", x=x, y=y, name=name)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/framework/op_def_library.py", line 787, in _apply_op_helper
    op_def=op_def)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/framework/ops.py", line 2956, in create_op
    op_def=op_def)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/framework/ops.py", line 1470, in __init__
    self._traceback = self._graph._extract_stack()  # pylint: disable=protected-access

InvalidArgumentError (see above for traceback): Cannot assign a device for operation 'loss/dense_1_loss/sub': Operation was explicitly assigned to /job:master/task:0 but available devices are [ /job:localhost/replica:0/task:0/device:CPU:0, /job:localhost/replica:0/task:0/device:GPU:0, /job:localhost/replica:0/task:0/device:GPU:1, /job:localhost/replica:0/task:0/device:GPU:2, /job:localhost/replica:0/task:0/device:GPU:3 ]. Make sure the device specification refers to a valid device.
         [[Node: loss/dense_1_loss/sub = Sub[T=DT_FLOAT, _device="/job:master/task:0"](loss/dense_1_loss/sub/x, loss/dense_1_loss/Const)]]

Full logs, tf_env, and more are here: https://gist.github.com/droidicus/2abd4ddad81a1e9169a1c7a100057b15

About this issue

  • Original URL
  • State: closed
  • Created 7 years ago
  • Reactions: 1
  • Comments: 19 (10 by maintainers)

Commits related to this issue

Most upvoted comments

Any update on this? Having an example that use keras + multi gpu + model_to_estimator + distributed cluster seems in my opinion pretty important for anyone the want to do production work with google cloud ml.

We are seeing this same issue with model_to_estimator, and it is currently blocking a large project.

Any update?