tensorflow: TF 2.0 regression: cloudpickle cannot serialize tf.keras.Sequential.

System information

  • Have I written custom code (as opposed to using a stock example script provided in TensorFlow): Yes (code included below in the issue)
  • OS Platform and Distribution (e.g., Linux Ubuntu 16.04): macOS 10.14.3
  • Mobile device (e.g. iPhone 8, Pixel 2, Samsung Galaxy) if the issue happens on mobile device: N/A
  • TensorFlow installed from (source or binary): pip
  • TensorFlow version (use command below): v2.0.0-beta1-5101-gc75bb66a99 2.0.0-rc0
  • Python version: Python 3.6.7 :: Anaconda, Inc.
  • Bazel version (if compiling from source): N/A
  • GCC/Compiler version (if compiling from source): N/A
  • CUDA/cuDNN version: N/A
  • GPU model and memory: N/A

Using cloudpickle to serialize a Python function that uses tf.keras.Sequential fails with a recursion error.

Note that this works with tensorflow==1.14.0.

I suspect this also fails for other tf attributes, not just tf.keras.Sequential.

import cloudpickle  # cloudpickle.__version__ == '1.2.1'
import tensorflow as tf  # tf.__version__ == '2.0.0-rc0'

def f():
    tf.keras.Sequential

cloudpickle.loads(cloudpickle.dumps(f))  # This fails.

The last line fails with

---------------------------------------------------------------------------
RecursionError                            Traceback (most recent call last)
<ipython-input-23-25cc307e6227> in <module>
----> 1 cloudpickle.loads(cloudpickle.dumps(f))

~/anaconda3/lib/python3.6/site-packages/tensorflow/__init__.py in __getattr__(self, item)
     48 
     49   def __getattr__(self, item):
---> 50     module = self._load()
     51     return getattr(module, item)
     52 

~/anaconda3/lib/python3.6/site-packages/tensorflow/__init__.py in _load(self)
     42   def _load(self):
     43     """Import the target module and insert it into the parent's namespace."""
---> 44     module = _importlib.import_module(self.__name__)
     45     self._parent_module_globals[self._local_name] = module
     46     self.__dict__.update(module.__dict__)

... last 2 frames repeated, from the frame below ...

~/anaconda3/lib/python3.6/site-packages/tensorflow/__init__.py in __getattr__(self, item)
     48 
     49   def __getattr__(self, item):
---> 50     module = self._load()
     51     return getattr(module, item)
     52 

RecursionError: maximum recursion depth exceeded while calling a Python object

See https://stackoverflow.com/questions/57750920/ray-tensorflow-gpu-2-0-recursionerror/57761034#57761034
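
The repeated __getattr__ / _load frames point at TF 2.0's lazy module loading. One way such a loop can arise: if an object that routes missing attributes through __getattr__ is ever reconstructed without the instance attributes that its loader method itself reads, every lookup re-enters __getattr__ and never terminates. Below is a minimal sketch of that mechanism using a toy class of my own (hypothetical; not TensorFlow's actual LazyLoader):

import importlib

class LazyModule:
    """Toy stand-in for a lazily loaded module (hypothetical; not
    TensorFlow's actual LazyLoader)."""

    def __init__(self, name):
        self.name = name  # instance attribute, stored in __dict__

    def _load(self):
        # Reads an instance attribute. If __dict__ is empty, the lookup of
        # self.name falls through to __getattr__, which calls _load again.
        return importlib.import_module(self.name)

    def __getattr__(self, item):
        # Invoked only for attributes missing from __dict__ and the class.
        module = self._load()
        return getattr(module, item)

# Simulate an object rebuilt without its instance __dict__, as can happen
# when pickle reconstructs an object it does not fully understand:
broken = LazyModule.__new__(LazyModule)
try:
    broken.anything  # __getattr__ -> _load -> self.name -> __getattr__ -> ...
except RecursionError as err:
    print("Same failure mode as above:", err)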

About this issue

  • State: closed
  • Created 5 years ago
  • Reactions: 1
  • Comments: 27 (3 by maintainers)

Most upvoted comments

I think this can be closed now as it has been solved and backported to 1.15 too.

Any updates on this?

No, but if you want to make a cherry-pick, we can merge it if and when we do a new patch release on 1.15.

The fix seems to work with Ray. However, if we use custom layers with functions decorated with @tf.function, there are still pickling issues. As a workaround, I figured one could save the model as a SavedModel on distributed storage and then have the Ray worker load the model from there, but this throws an error.

Note: removing the LSTM layer does not trigger the error, which suggests the problem is tied to the While operation the LSTM introduces (consistent with the error message).

LookupError: No gradient defined for operation 'while' (op type: While)

Code to reproduce

import tensorflow as tf
import ray
import numpy as np

ray.init()

def build_save_model():
    # Build a small LSTM model and export it as a SavedModel on storage
    # that the Ray workers can reach.
    lstm_in = tf.keras.Input(shape=(24, 1))
    lstm_out = tf.keras.layers.LSTM(6)(lstm_in)
    dense_out = tf.keras.layers.Dense(24)(lstm_out)
    model = tf.keras.Model([lstm_in], dense_out)
    model.save('/path/in/common/storage/lstm_model')

@ray.remote
class Worker:
    def __init__(self):
        # Load the SavedModel inside the actor instead of pickling it.
        self.model = tf.keras.models.load_model('/path/in/common/storage/lstm_model')
        self.model.compile(optimizer=tf.keras.optimizers.Adam(1e-1), loss=tf.keras.losses.mse)
        self.data = np.arange(24).reshape(1, 24, 1)
        self.label = np.arange(24).reshape(1, 24)

    def train(self):
        history = self.model.fit(self.data, self.label, epochs=10)
        return history.history

build_save_model()
lstm_worker = Worker.remote()
w = ray.get(lstm_worker.train.remote())

Error

---------------------------------------------------------------------------
RayTaskError                              Traceback (most recent call last)
<ipython-input-3-a18941ca631a> in <module>
     22 build_save_model()
     23 lstm_worker = Worker.remote()
---> 24 w = ray.get(lstm_worker.train.remote())

/opt/conda/lib/python3.6/site-packages/ray/worker.py in get(object_ids)
   2245             if isinstance(value, RayError):
   2246                 last_task_error_raise_time = time.time()
-> 2247                 raise value
   2248 
   2249         # Run post processors.

RayTaskError: ray_worker (pid=1397, host=thesis-clustering-7dfb7867df-pk5fc)
  File "/opt/conda/lib/python3.6/site-packages/tensorflow_core/python/framework/ops.py", line 2326, in get_attr
    c_api.TF_OperationGetAttrValueProto(self._c_op, name, buf)
tensorflow.python.framework.errors_impl.InvalidArgumentError: Operation 'StatefulPartitionedCall' has no attr named '_XlaCompile'.

During handling of the above exception, another exception occurred:

ray_worker (pid=1397, host=thesis-clustering-7dfb7867df-pk5fc)
  File "/opt/conda/lib/python3.6/site-packages/tensorflow_core/python/ops/gradients_util.py", line 331, in _MaybeCompile
    xla_compile = op.get_attr("_XlaCompile")
  File "/opt/conda/lib/python3.6/site-packages/tensorflow_core/python/framework/ops.py", line 2330, in get_attr
    raise ValueError(str(e))
ValueError: Operation 'StatefulPartitionedCall' has no attr named '_XlaCompile'.

During handling of the above exception, another exception occurred:

ray_worker (pid=1397, host=thesis-clustering-7dfb7867df-pk5fc)
  File "/opt/conda/lib/python3.6/site-packages/tensorflow_core/python/framework/ops.py", line 2326, in get_attr
    c_api.TF_OperationGetAttrValueProto(self._c_op, name, buf)
tensorflow.python.framework.errors_impl.InvalidArgumentError: Operation 'StatefulPartitionedCall' has no attr named '_XlaCompile'.

During handling of the above exception, another exception occurred:

ray_worker (pid=1397, host=thesis-clustering-7dfb7867df-pk5fc)
  File "/opt/conda/lib/python3.6/site-packages/tensorflow_core/python/ops/gradients_util.py", line 331, in _MaybeCompile
    xla_compile = op.get_attr("_XlaCompile")
  File "/opt/conda/lib/python3.6/site-packages/tensorflow_core/python/framework/ops.py", line 2330, in get_attr
    raise ValueError(str(e))
ValueError: Operation 'StatefulPartitionedCall' has no attr named '_XlaCompile'.

During handling of the above exception, another exception occurred:

ray_worker (pid=1397, host=thesis-clustering-7dfb7867df-pk5fc)
  File "/opt/conda/lib/python3.6/site-packages/tensorflow_core/python/ops/gradients_util.py", line 607, in _GradientsHelper
    grad_fn = ops.get_gradient_function(op)
  File "/opt/conda/lib/python3.6/site-packages/tensorflow_core/python/framework/ops.py", line 2495, in get_gradient_function
    return _gradient_registry.lookup(op_type)
  File "/opt/conda/lib/python3.6/site-packages/tensorflow_core/python/framework/registry.py", line 97, in lookup
    "%s registry has no entry for: %s" % (self._name, name))
LookupError: gradient registry has no entry for: While

During handling of the above exception, another exception occurred:

ray_worker (pid=1397, host=thesis-clustering-7dfb7867df-pk5fc)
  File "<ipython-input-3-a18941ca631a>", line 19, in train
  File "/opt/conda/lib/python3.6/site-packages/tensorflow_core/python/keras/engine/training.py", line 785, in fit
    use_multiprocessing=use_multiprocessing)
  File "/opt/conda/lib/python3.6/site-packages/tensorflow_core/python/keras/engine/training_v2.py", line 337, in fit
    total_epochs=epochs)
  File "/opt/conda/lib/python3.6/site-packages/tensorflow_core/python/keras/engine/training_v2.py", line 127, in run_one_epoch
    batch_outs = execution_function(iterator)
  File "/opt/conda/lib/python3.6/site-packages/tensorflow_core/python/keras/engine/training_v2_utils.py", line 86, in execution_function
    distributed_function(input_fn))
  File "/opt/conda/lib/python3.6/site-packages/tensorflow_core/python/eager/def_function.py", line 568, in __call__
    result = self._call(*args, **kwds)
  File "/opt/conda/lib/python3.6/site-packages/tensorflow_core/python/eager/def_function.py", line 615, in _call
    self._initialize(args, kwds, add_initializers_to=initializers)
  File "/opt/conda/lib/python3.6/site-packages/tensorflow_core/python/eager/def_function.py", line 497, in _initialize
    *args, **kwds))
  File "/opt/conda/lib/python3.6/site-packages/tensorflow_core/python/eager/function.py", line 2366, in _get_concrete_function_internal_garbage_collected
    graph_function, _, _ = self._maybe_define_function(args, kwargs)
  File "/opt/conda/lib/python3.6/site-packages/tensorflow_core/python/eager/function.py", line 2675, in _maybe_define_function
    graph_function = self._create_graph_function(args, kwargs)
  File "/opt/conda/lib/python3.6/site-packages/tensorflow_core/python/eager/function.py", line 2565, in _create_graph_function
    capture_by_value=self._capture_by_value),
  File "/opt/conda/lib/python3.6/site-packages/tensorflow_core/python/framework/func_graph.py", line 974, in func_graph_from_py_func
    func_outputs = python_func(*func_args, **func_kwargs)
  File "/opt/conda/lib/python3.6/site-packages/tensorflow_core/python/eager/def_function.py", line 439, in wrapped_fn
    return weak_wrapped_fn().__wrapped__(*args, **kwds)
  File "/opt/conda/lib/python3.6/site-packages/tensorflow_core/python/keras/engine/training_v2_utils.py", line 73, in distributed_function
    per_replica_function, args=(x, y, sample_weights))
  File "/opt/conda/lib/python3.6/site-packages/tensorflow_core/python/distribute/distribute_lib.py", line 763, in experimental_run_v2
    return self._extended.call_for_each_replica(fn, args=args, kwargs=kwargs)
  File "/opt/conda/lib/python3.6/site-packages/tensorflow_core/python/distribute/distribute_lib.py", line 1819, in call_for_each_replica
    return self._call_for_each_replica(fn, args, kwargs)
  File "/opt/conda/lib/python3.6/site-packages/tensorflow_core/python/distribute/distribute_lib.py", line 2164, in _call_for_each_replica
    return fn(*args, **kwargs)
  File "/opt/conda/lib/python3.6/site-packages/tensorflow_core/python/autograph/impl/api.py", line 292, in wrapper
    return func(*args, **kwargs)
  File "/opt/conda/lib/python3.6/site-packages/tensorflow_core/python/keras/engine/training_v2_utils.py", line 264, in train_on_batch
    output_loss_metrics=model._output_loss_metrics)
  File "/opt/conda/lib/python3.6/site-packages/tensorflow_core/python/keras/engine/training_eager.py", line 312, in train_on_batch
    output_loss_metrics=output_loss_metrics))
  File "/opt/conda/lib/python3.6/site-packages/tensorflow_core/python/keras/engine/training_eager.py", line 269, in _process_single_batch
    grads = tape.gradient(scaled_total_loss, trainable_weights)
  File "/opt/conda/lib/python3.6/site-packages/tensorflow_core/python/eager/backprop.py", line 1029, in gradient
    unconnected_gradients=unconnected_gradients)
  File "/opt/conda/lib/python3.6/site-packages/tensorflow_core/python/eager/imperative_grad.py", line 77, in imperative_grad
    compat.as_str(unconnected_gradients.value))
  File "/opt/conda/lib/python3.6/site-packages/tensorflow_core/python/eager/function.py", line 766, in _backward_function
    return self._rewrite_forward_and_call_backward(call_op, *args)
  File "/opt/conda/lib/python3.6/site-packages/tensorflow_core/python/eager/function.py", line 685, in _rewrite_forward_and_call_backward
    forward_function, backwards_function = self.forward_backward(len(doutputs))
  File "/opt/conda/lib/python3.6/site-packages/tensorflow_core/python/eager/function.py", line 594, in forward_backward
    forward, backward = self._construct_forward_backward(num_doutputs)
  File "/opt/conda/lib/python3.6/site-packages/tensorflow_core/python/eager/function.py", line 642, in _construct_forward_backward
    func_graph=backwards_graph)
  File "/opt/conda/lib/python3.6/site-packages/tensorflow_core/python/framework/func_graph.py", line 974, in func_graph_from_py_func
    func_outputs = python_func(*func_args, **func_kwargs)
  File "/opt/conda/lib/python3.6/site-packages/tensorflow_core/python/eager/function.py", line 632, in _backprop_function
    src_graph=self._func_graph)
  File "/opt/conda/lib/python3.6/site-packages/tensorflow_core/python/ops/gradients_util.py", line 669, in _GradientsHelper
    lambda: grad_fn(op, *out_grads))
  File "/opt/conda/lib/python3.6/site-packages/tensorflow_core/python/ops/gradients_util.py", line 336, in _MaybeCompile
    return grad_fn()  # Exit early
  File "/opt/conda/lib/python3.6/site-packages/tensorflow_core/python/ops/gradients_util.py", line 669, in <lambda>
    lambda: grad_fn(op, *out_grads))
  File "/opt/conda/lib/python3.6/site-packages/tensorflow_core/python/eager/function.py", line 685, in _rewrite_forward_and_call_backward
    forward_function, backwards_function = self.forward_backward(len(doutputs))
  File "/opt/conda/lib/python3.6/site-packages/tensorflow_core/python/eager/function.py", line 594, in forward_backward
    forward, backward = self._construct_forward_backward(num_doutputs)
  File "/opt/conda/lib/python3.6/site-packages/tensorflow_core/python/eager/function.py", line 642, in _construct_forward_backward
    func_graph=backwards_graph)
  File "/opt/conda/lib/python3.6/site-packages/tensorflow_core/python/framework/func_graph.py", line 974, in func_graph_from_py_func
    func_outputs = python_func(*func_args, **func_kwargs)
  File "/opt/conda/lib/python3.6/site-packages/tensorflow_core/python/eager/function.py", line 632, in _backprop_function
    src_graph=self._func_graph)
  File "/opt/conda/lib/python3.6/site-packages/tensorflow_core/python/ops/gradients_util.py", line 669, in _GradientsHelper
    lambda: grad_fn(op, *out_grads))
  File "/opt/conda/lib/python3.6/site-packages/tensorflow_core/python/ops/gradients_util.py", line 336, in _MaybeCompile
    return grad_fn()  # Exit early
  File "/opt/conda/lib/python3.6/site-packages/tensorflow_core/python/ops/gradients_util.py", line 669, in <lambda>
    lambda: grad_fn(op, *out_grads))
  File "/opt/conda/lib/python3.6/site-packages/tensorflow_core/python/eager/function.py", line 685, in _rewrite_forward_and_call_backward
    forward_function, backwards_function = self.forward_backward(len(doutputs))
  File "/opt/conda/lib/python3.6/site-packages/tensorflow_core/python/eager/function.py", line 594, in forward_backward
    forward, backward = self._construct_forward_backward(num_doutputs)
  File "/opt/conda/lib/python3.6/site-packages/tensorflow_core/python/eager/function.py", line 642, in _construct_forward_backward
    func_graph=backwards_graph)
  File "/opt/conda/lib/python3.6/site-packages/tensorflow_core/python/framework/func_graph.py", line 974, in func_graph_from_py_func
    func_outputs = python_func(*func_args, **func_kwargs)
  File "/opt/conda/lib/python3.6/site-packages/tensorflow_core/python/eager/function.py", line 632, in _backprop_function
    src_graph=self._func_graph)
  File "/opt/conda/lib/python3.6/site-packages/tensorflow_core/python/ops/gradients_util.py", line 623, in _GradientsHelper
    (op.name, op.type))
LookupError: No gradient defined for operation 'while' (op type: While)
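
(One thing that may sidestep the restored-SavedModel training path entirely, offered as an untested suggestion rather than something confirmed here: rebuild the architecture in Python inside the worker and transfer only the weights via save_weights/load_weights, so Keras traces fresh training functions instead of differentiating the While op deserialized from the SavedModel. A sketch under that assumption, with WeightsWorker a name of my own:)

import tensorflow as tf
import ray
import numpy as np

ray.init()

def build_model():
    # Same architecture as above; rebuilding it in Python gives Keras
    # freshly traced training functions.
    lstm_in = tf.keras.Input(shape=(24, 1))
    lstm_out = tf.keras.layers.LSTM(6)(lstm_in)
    dense_out = tf.keras.layers.Dense(24)(lstm_out)
    return tf.keras.Model([lstm_in], dense_out)

# Save only the weights (TF checkpoint format), not the full SavedModel.
build_model().save_weights('/path/in/common/storage/lstm_weights')

@ray.remote
class WeightsWorker:
    def __init__(self):
        self.model = build_model()
        self.model.load_weights('/path/in/common/storage/lstm_weights')
        self.model.compile(optimizer=tf.keras.optimizers.Adam(1e-1),
                           loss=tf.keras.losses.mse)
        self.data = np.arange(24).reshape(1, 24, 1)
        self.label = np.arange(24).reshape(1, 24)

    def train(self):
        return self.model.fit(self.data, self.label, epochs=10).history

w = ray.get(WeightsWorker.remote().train.remote())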

@jharaldson the easiest workaround might be the one described in https://github.com/ray-project/ray/issues/5614#issuecomment-527292289.

Another workaround is described in https://stackoverflow.com/a/57761034/7858504
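
(And for anyone still on a TF build without the serialization fix: a mitigation that should work independently of the linked answers, assuming only standard Ray/cloudpickle behavior, is to keep the top-level tensorflow module out of the task's captured globals by importing it inside the remote function. A sketch, with fit_once a hypothetical example task:)

import numpy as np
import ray

ray.init()

@ray.remote
def fit_once(data, labels):
    # Importing TensorFlow inside the task keeps the lazily loaded module
    # object out of the function's globals, so cloudpickle never has to
    # serialize it.
    import tensorflow as tf
    model = tf.keras.Sequential([
        tf.keras.layers.Dense(8, activation="relu", input_shape=(4,)),
        tf.keras.layers.Dense(1),
    ])
    model.compile(optimizer="adam", loss="mse")
    # Return plain Python containers rather than Keras objects.
    return model.fit(data, labels, epochs=2, verbose=0).history

data = np.random.rand(32, 4).astype("float32")
labels = np.random.rand(32, 1).astype("float32")
print(ray.get(fit_once.remote(data, labels)))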

I have tried this on Colab with TF 1.14 and was able to execute the code. However, I am able to reproduce the issue with TF 2.0.0-rc0 and the 2.0 nightly versions. Please find the gist here. Thanks!