tensorflow: TF 2.0 regression: cloudpickle cannot serialize tf.keras.Sequential.

System information

  • Have I written custom code (as opposed to using a stock example script provided in TensorFlow): Yes (code included below in the issue)
  • OS Platform and Distribution (e.g., Linux Ubuntu 16.04): macOS 10.14.3
  • Mobile device (e.g. iPhone 8, Pixel 2, Samsung Galaxy) if the issue happens on mobile device: N/A
  • TensorFlow installed from (source or binary): pip
  • TensorFlow version (use command below): v2.0.0-beta1-5101-gc75bb66a99 2.0.0-rc0
  • Python version: Python 3.6.7 :: Anaconda, Inc.
  • Bazel version (if compiling from source): N/A
  • GCC/Compiler version (if compiling from source): N/A
  • CUDA/cuDNN version: N/A
  • GPU model and memory: N/A

Using cloudpickle to serialize a Python function that uses tf.keras.Sequential fails with a recursion error.

Note that this works with tensorflow==1.14.0.

I suspect this also fails for other tf attributes, not just tf.keras.Sequential.

import cloudpickle  # cloudpickle.__version__ == '1.2.1'
import tensorflow as tf  # tf.__version__ == '2.0.0-rc0'

def f():
    tf.keras.Sequential

cloudpickle.loads(cloudpickle.dumps(f))  # This fails.

The last line fails with

---------------------------------------------------------------------------
RecursionError                            Traceback (most recent call last)
<ipython-input-23-25cc307e6227> in <module>
----> 1 cloudpickle.loads(cloudpickle.dumps(f))

~/anaconda3/lib/python3.6/site-packages/tensorflow/__init__.py in __getattr__(self, item)
     48 
     49   def __getattr__(self, item):
---> 50     module = self._load()
     51     return getattr(module, item)
     52 

~/anaconda3/lib/python3.6/site-packages/tensorflow/__init__.py in _load(self)
     42   def _load(self):
     43     """Import the target module and insert it into the parent's namespace."""
---> 44     module = _importlib.import_module(self.__name__)
     45     self._parent_module_globals[self._local_name] = module
     46     self.__dict__.update(module.__dict__)

... last 2 frames repeated, from the frame below ...

~/anaconda3/lib/python3.6/site-packages/tensorflow/__init__.py in __getattr__(self, item)
     48 
     49   def __getattr__(self, item):
---> 50     module = self._load()
     51     return getattr(module, item)
     52 

RecursionError: maximum recursion depth exceeded while calling a Python object

See https://stackoverflow.com/questions/57750920/ray-tensorflow-gpu-2-0-recursionerror/57761034#57761034
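
The repeated __getattr__ / _load frames point at TF 2.0's lazy module loading. One way such a loop can arise: if an object that routes missing attributes through __getattr__ is ever reconstructed without the instance attributes that its loader method itself reads, every lookup re-enters __getattr__ and never terminates. Below is a minimal sketch of that mechanism using a toy class of my own (hypothetical; not TensorFlow's actual LazyLoader):

import importlib

class LazyModule:
    """Toy stand-in for a lazily loaded module (hypothetical; not
    TensorFlow's actual LazyLoader)."""

    def __init__(self, name):
        self.name = name  # instance attribute, stored in __dict__

    def _load(self):
        # Reads an instance attribute. If __dict__ is empty, the lookup of
        # self.name falls through to __getattr__, which calls _load again.
        return importlib.import_module(self.name)

    def __getattr__(self, item):
        # Invoked only for attributes missing from __dict__ and the class.
        module = self._load()
        return getattr(module, item)

# Simulate an object rebuilt without its instance __dict__, as can happen
# when pickle reconstructs an object it does not fully understand:
broken = LazyModule.__new__(LazyModule)
try:
    broken.anything  # __getattr__ -> _load -> self.name -> __getattr__ -> ...
except RecursionError as err:
    print("Same failure mode as above:", err)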

About this issue

  • State: closed
  • Created 5 years ago
  • Reactions: 1
  • Comments: 27 (3 by maintainers)

Most upvoted comments

I think this can be closed now as it has been solved and backported to 1.15 too.

Any updates on this?

No, but if you want to make a cherry-pick, we can merge it if and when we do a new patch release on 1.15.

The fix seems to work with Ray. However, if we use custom layers with functions decorated with @tf.function, there are still pickling issues. As a workaround, I figured one could save the model as a SavedModel on distributed storage and then have the Ray worker load the model from there, but this throws an error.

Note: removing the LSTM layer does not trigger the error, which suggests the problem is tied to the While operation the LSTM introduces (consistent with the error message).

LookupError: No gradient defined for operation 'while' (op type: While)

Code to reproduce

import tensorflow as tf
import ray
import numpy as np

ray.init()

def build_save_model():
    # Build a small LSTM model and export it as a SavedModel on storage
    # that the Ray workers can reach.
    lstm_in = tf.keras.Input(shape=(24, 1))
    lstm_out = tf.keras.layers.LSTM(6)(lstm_in)
    dense_out = tf.keras.layers.Dense(24)(lstm_out)
    model = tf.keras.Model([lstm_in], dense_out)
    model.save('/path/in/common/storage/lstm_model')

@ray.remote
class Worker:
    def __init__(self):
        # Load the SavedModel inside the actor instead of pickling it.
        self.model = tf.keras.models.load_model('/path/in/common/storage/lstm_model')
        self.model.compile(optimizer=tf.keras.optimizers.Adam(1e-1), loss=tf.keras.losses.mse)
        self.data = np.arange(24).reshape(1, 24, 1)
        self.label = np.arange(24).reshape(1, 24)

    def train(self):
        history = self.model.fit(self.data, self.label, epochs=10)
        return history.history

build_save_model()
lstm_worker = Worker.remote()
w = ray.get(lstm_worker.train.remote())

Error

---------------------------------------------------------------------------
RayTaskError                              Traceback (most recent call last)
<ipython-input-3-a18941ca631a> in <module>
     22 build_save_model()
     23 lstm_worker = Worker.remote()
---> 24 w = ray.get(lstm_worker.train.remote())

/opt/conda/lib/python3.6/site-packages/ray/worker.py in get(object_ids)
   2245             if isinstance(value, RayError):
   2246                 last_task_error_raise_time = time.time()
-> 2247                 raise value
   2248 
   2249         # Run post processors.

RayTaskError: ray_worker (pid=1397, host=thesis-clustering-7dfb7867df-pk5fc)
  File "/opt/conda/lib/python3.6/site-packages/tensorflow_core/python/framework/ops.py", line 2326, in get_attr
    c_api.TF_OperationGetAttrValueProto(self._c_op, name, buf)
tensorflow.python.framework.errors_impl.InvalidArgumentError: Operation 'StatefulPartitionedCall' has no attr named '_XlaCompile'.

During handling of the above exception, another exception occurred:

ray_worker (pid=1397, host=thesis-clustering-7dfb7867df-pk5fc)
  File "/opt/conda/lib/python3.6/site-packages/tensorflow_core/python/ops/gradients_util.py", line 331, in _MaybeCompile
    xla_compile = op.get_attr("_XlaCompile")
  File "/opt/conda/lib/python3.6/site-packages/tensorflow_core/python/framework/ops.py", line 2330, in get_attr
    raise ValueError(str(e))
ValueError: Operation 'StatefulPartitionedCall' has no attr named '_XlaCompile'.

During handling of the above exception, another exception occurred:

ray_worker (pid=1397, host=thesis-clustering-7dfb7867df-pk5fc)
  File "/opt/conda/lib/python3.6/site-packages/tensorflow_core/python/framework/ops.py", line 2326, in get_attr
    c_api.TF_OperationGetAttrValueProto(self._c_op, name, buf)
tensorflow.python.framework.errors_impl.InvalidArgumentError: Operation 'StatefulPartitionedCall' has no attr named '_XlaCompile'.

During handling of the above exception, another exception occurred:

ray_worker (pid=1397, host=thesis-clustering-7dfb7867df-pk5fc)
  File "/opt/conda/lib/python3.6/site-packages/tensorflow_core/python/ops/gradients_util.py", line 331, in _MaybeCompile
    xla_compile = op.get_attr("_XlaCompile")
  File "/opt/conda/lib/python3.6/site-packages/tensorflow_core/python/framework/ops.py", line 2330, in get_attr
    raise ValueError(str(e))
ValueError: Operation 'StatefulPartitionedCall' has no attr named '_XlaCompile'.

During handling of the above exception, another exception occurred:

ray_worker (pid=1397, host=thesis-clustering-7dfb7867df-pk5fc)
  File "/opt/conda/lib/python3.6/site-packages/tensorflow_core/python/ops/gradients_util.py", line 607, in _GradientsHelper
    grad_fn = ops.get_gradient_function(op)
  File "/opt/conda/lib/python3.6/site-packages/tensorflow_core/python/framework/ops.py", line 2495, in get_gradient_function
    return _gradient_registry.lookup(op_type)
  File "/opt/conda/lib/python3.6/site-packages/tensorflow_core/python/framework/registry.py", line 97, in lookup
    "%s registry has no entry for: %s" % (self._name, name))
LookupError: gradient registry has no entry for: While

During handling of the above exception, another exception occurred:

ray_worker (pid=1397, host=thesis-clustering-7dfb7867df-pk5fc)
  File "<ipython-input-3-a18941ca631a>", line 19, in train
  File "/opt/conda/lib/python3.6/site-packages/tensorflow_core/python/keras/engine/training.py", line 785, in fit
    use_multiprocessing=use_multiprocessing)
  File "/opt/conda/lib/python3.6/site-packages/tensorflow_core/python/keras/engine/training_v2.py", line 337, in fit
    total_epochs=epochs)
  File "/opt/conda/lib/python3.6/site-packages/tensorflow_core/python/keras/engine/training_v2.py", line 127, in run_one_epoch
    batch_outs = execution_function(iterator)
  File "/opt/conda/lib/python3.6/site-packages/tensorflow_core/python/keras/engine/training_v2_utils.py", line 86, in execution_function
    distributed_function(input_fn))
  File "/opt/conda/lib/python3.6/site-packages/tensorflow_core/python/eager/def_function.py", line 568, in __call__
    result = self._call(*args, **kwds)
  File "/opt/conda/lib/python3.6/site-packages/tensorflow_core/python/eager/def_function.py", line 615, in _call
    self._initialize(args, kwds, add_initializers_to=initializers)
  File "/opt/conda/lib/python3.6/site-packages/tensorflow_core/python/eager/def_function.py", line 497, in _initialize
    *args, **kwds))
  File "/opt/conda/lib/python3.6/site-packages/tensorflow_core/python/eager/function.py", line 2366, in _get_concrete_function_internal_garbage_collected
    graph_function, _, _ = self._maybe_define_function(args, kwargs)
  File "/opt/conda/lib/python3.6/site-packages/tensorflow_core/python/eager/function.py", line 2675, in _maybe_define_function
    graph_function = self._create_graph_function(args, kwargs)
  File "/opt/conda/lib/python3.6/site-packages/tensorflow_core/python/eager/function.py", line 2565, in _create_graph_function
    capture_by_value=self._capture_by_value),
  File "/opt/conda/lib/python3.6/site-packages/tensorflow_core/python/framework/func_graph.py", line 974, in func_graph_from_py_func
    func_outputs = python_func(*func_args, **func_kwargs)
  File "/opt/conda/lib/python3.6/site-packages/tensorflow_core/python/eager/def_function.py", line 439, in wrapped_fn
    return weak_wrapped_fn().__wrapped__(*args, **kwds)
  File "/opt/conda/lib/python3.6/site-packages/tensorflow_core/python/keras/engine/training_v2_utils.py", line 73, in distributed_function
    per_replica_function, args=(x, y, sample_weights))
  File "/opt/conda/lib/python3.6/site-packages/tensorflow_core/python/distribute/distribute_lib.py", line 763, in experimental_run_v2
    return self._extended.call_for_each_replica(fn, args=args, kwargs=kwargs)
  File "/opt/conda/lib/python3.6/site-packages/tensorflow_core/python/distribute/distribute_lib.py", line 1819, in call_for_each_replica
    return self._call_for_each_replica(fn, args, kwargs)
  File "/opt/conda/lib/python3.6/site-packages/tensorflow_core/python/distribute/distribute_lib.py", line 2164, in _call_for_each_replica
    return fn(*args, **kwargs)
  File "/opt/conda/lib/python3.6/site-packages/tensorflow_core/python/autograph/impl/api.py", line 292, in wrapper
    return func(*args, **kwargs)
  File "/opt/conda/lib/python3.6/site-packages/tensorflow_core/python/keras/engine/training_v2_utils.py", line 264, in train_on_batch
    output_loss_metrics=model._output_loss_metrics)
  File "/opt/conda/lib/python3.6/site-packages/tensorflow_core/python/keras/engine/training_eager.py", line 312, in train_on_batch
    output_loss_metrics=output_loss_metrics))
  File "/opt/conda/lib/python3.6/site-packages/tensorflow_core/python/keras/engine/training_eager.py", line 269, in _process_single_batch
    grads = tape.gradient(scaled_total_loss, trainable_weights)
  File "/opt/conda/lib/python3.6/site-packages/tensorflow_core/python/eager/backprop.py", line 1029, in gradient
    unconnected_gradients=unconnected_gradients)
  File "/opt/conda/lib/python3.6/site-packages/tensorflow_core/python/eager/imperative_grad.py", line 77, in imperative_grad
    compat.as_str(unconnected_gradients.value))
  File "/opt/conda/lib/python3.6/site-packages/tensorflow_core/python/eager/function.py", line 766, in _backward_function
    return self._rewrite_forward_and_call_backward(call_op, *args)
  File "/opt/conda/lib/python3.6/site-packages/tensorflow_core/python/eager/function.py", line 685, in _rewrite_forward_and_call_backward
    forward_function, backwards_function = self.forward_backward(len(doutputs))
  File "/opt/conda/lib/python3.6/site-packages/tensorflow_core/python/eager/function.py", line 594, in forward_backward
    forward, backward = self._construct_forward_backward(num_doutputs)
  File "/opt/conda/lib/python3.6/site-packages/tensorflow_core/python/eager/function.py", line 642, in _construct_forward_backward
    func_graph=backwards_graph)
  File "/opt/conda/lib/python3.6/site-packages/tensorflow_core/python/framework/func_graph.py", line 974, in func_graph_from_py_func
    func_outputs = python_func(*func_args, **func_kwargs)
  File "/opt/conda/lib/python3.6/site-packages/tensorflow_core/python/eager/function.py", line 632, in _backprop_function
    src_graph=self._func_graph)
  File "/opt/conda/lib/python3.6/site-packages/tensorflow_core/python/ops/gradients_util.py", line 669, in _GradientsHelper
    lambda: grad_fn(op, *out_grads))
  File "/opt/conda/lib/python3.6/site-packages/tensorflow_core/python/ops/gradients_util.py", line 336, in _MaybeCompile
    return grad_fn()  # Exit early
  File "/opt/conda/lib/python3.6/site-packages/tensorflow_core/python/ops/gradients_util.py", line 669, in <lambda>
    lambda: grad_fn(op, *out_grads))
  File "/opt/conda/lib/python3.6/site-packages/tensorflow_core/python/eager/function.py", line 685, in _rewrite_forward_and_call_backward
    forward_function, backwards_function = self.forward_backward(len(doutputs))
  File "/opt/conda/lib/python3.6/site-packages/tensorflow_core/python/eager/function.py", line 594, in forward_backward
    forward, backward = self._construct_forward_backward(num_doutputs)
  File "/opt/conda/lib/python3.6/site-packages/tensorflow_core/python/eager/function.py", line 642, in _construct_forward_backward
    func_graph=backwards_graph)
  File "/opt/conda/lib/python3.6/site-packages/tensorflow_core/python/framework/func_graph.py", line 974, in func_graph_from_py_func
    func_outputs = python_func(*func_args, **func_kwargs)
  File "/opt/conda/lib/python3.6/site-packages/tensorflow_core/python/eager/function.py", line 632, in _backprop_function
    src_graph=self._func_graph)
  File "/opt/conda/lib/python3.6/site-packages/tensorflow_core/python/ops/gradients_util.py", line 669, in _GradientsHelper
    lambda: grad_fn(op, *out_grads))
  File "/opt/conda/lib/python3.6/site-packages/tensorflow_core/python/ops/gradients_util.py", line 336, in _MaybeCompile
    return grad_fn()  # Exit early
  File "/opt/conda/lib/python3.6/site-packages/tensorflow_core/python/ops/gradients_util.py", line 669, in <lambda>
    lambda: grad_fn(op, *out_grads))
  File "/opt/conda/lib/python3.6/site-packages/tensorflow_core/python/eager/function.py", line 685, in _rewrite_forward_and_call_backward
    forward_function, backwards_function = self.forward_backward(len(doutputs))
  File "/opt/conda/lib/python3.6/site-packages/tensorflow_core/python/eager/function.py", line 594, in forward_backward
    forward, backward = self._construct_forward_backward(num_doutputs)
  File "/opt/conda/lib/python3.6/site-packages/tensorflow_core/python/eager/function.py", line 642, in _construct_forward_backward
    func_graph=backwards_graph)
  File "/opt/conda/lib/python3.6/site-packages/tensorflow_core/python/framework/func_graph.py", line 974, in func_graph_from_py_func
    func_outputs = python_func(*func_args, **func_kwargs)
  File "/opt/conda/lib/python3.6/site-packages/tensorflow_core/python/eager/function.py", line 632, in _backprop_function
    src_graph=self._func_graph)
  File "/opt/conda/lib/python3.6/site-packages/tensorflow_core/python/ops/gradients_util.py", line 623, in _GradientsHelper
    (op.name, op.type))
LookupError: No gradient defined for operation 'while' (op type: While)
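
(One thing that may sidestep the restored-SavedModel training path entirely, offered as an untested suggestion rather than something confirmed here: rebuild the architecture in Python inside the worker and transfer only the weights via save_weights/load_weights, so Keras traces fresh training functions instead of differentiating the While op deserialized from the SavedModel. A sketch under that assumption, with WeightsWorker a name of my own:)

import tensorflow as tf
import ray
import numpy as np

ray.init()

def build_model():
    # Same architecture as above; rebuilding it in Python gives Keras
    # freshly traced training functions.
    lstm_in = tf.keras.Input(shape=(24, 1))
    lstm_out = tf.keras.layers.LSTM(6)(lstm_in)
    dense_out = tf.keras.layers.Dense(24)(lstm_out)
    return tf.keras.Model([lstm_in], dense_out)

# Save only the weights (TF checkpoint format), not the full SavedModel.
build_model().save_weights('/path/in/common/storage/lstm_weights')

@ray.remote
class WeightsWorker:
    def __init__(self):
        self.model = build_model()
        self.model.load_weights('/path/in/common/storage/lstm_weights')
        self.model.compile(optimizer=tf.keras.optimizers.Adam(1e-1),
                           loss=tf.keras.losses.mse)
        self.data = np.arange(24).reshape(1, 24, 1)
        self.label = np.arange(24).reshape(1, 24)

    def train(self):
        return self.model.fit(self.data, self.label, epochs=10).history

w = ray.get(WeightsWorker.remote().train.remote())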

@jharaldson the easiest workaround might be the one described in https://github.com/ray-project/ray/issues/5614#issuecomment-527292289.

Another workaround is described in https://stackoverflow.com/a/57761034/7858504
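
(And for anyone still on a TF build without the serialization fix: a mitigation that should work independently of the linked answers, assuming only standard Ray/cloudpickle behavior, is to keep the top-level tensorflow module out of the task's captured globals by importing it inside the remote function. A sketch, with fit_once a hypothetical example task:)

import numpy as np
import ray

ray.init()

@ray.remote
def fit_once(data, labels):
    # Importing TensorFlow inside the task keeps the lazily loaded module
    # object out of the function's globals, so cloudpickle never has to
    # serialize it.
    import tensorflow as tf
    model = tf.keras.Sequential([
        tf.keras.layers.Dense(8, activation="relu", input_shape=(4,)),
        tf.keras.layers.Dense(1),
    ])
    model.compile(optimizer="adam", loss="mse")
    # Return plain Python containers rather than Keras objects.
    return model.fit(data, labels, epochs=2, verbose=0).history

data = np.random.rand(32, 4).astype("float32")
labels = np.random.rand(32, 1).astype("float32")
print(ray.get(fit_once.remote(data, labels)))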

I have tried this on Colab with TF 1.14 and was able to execute the code. However, I am able to reproduce the issue with TF 2.0.0-rc0 and the 2.0 nightly versions. Please find the gist here. Thanks!