tensorflow: Distributed training fails when I use CollectiveAllReduceStrategy

System information

  • Have I written custom code (as opposed to using a stock example script provided in TensorFlow): I slightly updated the mnist.py example so that it uses CollectiveAllReduceStrategy. The updated version is here.

  • OS Platform and Distribution (e.g., Linux Ubuntu 16.04): Mac OS 10.13.3

  • TensorFlow installed from (source or binary): Binary

  • TensorFlow version (use command below): 1.11.0-dev20180913

  • Python version: 3.6.3

  • Exact command to reproduce: See the updated example.

Describe the problem

Hi,

I’m trying to update the MNIST model from the official repository so that it uses CollectiveAllReduceStrategy, as shown in keras_model_to_estimator_client.py. You can find the updated example here. Unfortunately, it fails on a deepcopy of the run config.

Source code / logs

Traceback (most recent call last):
  File "mnist.py", line 286, in <module>
    absl_app.run(main)
  File "/Users/antondmitriev/anaconda3/lib/python3.6/site-packages/absl/app.py", line 274, in run
    _run_main(main, args)
  File "/Users/antondmitriev/anaconda3/lib/python3.6/site-packages/absl/app.py", line 238, in _run_main
    sys.exit(main(argv))
  File "mnist.py", line 280, in main
    run_mnist(flags.FLAGS)
  File "mnist.py", line 226, in run_mnist
    'data_format': data_format,
  File "/Users/antondmitriev/anaconda3/lib/python3.6/site-packages/tensorflow/python/estimator/estimator.py", line 190, in __init__
    model_dir)
  File "/Users/antondmitriev/anaconda3/lib/python3.6/site-packages/tensorflow/python/estimator/estimator.py", line 1591, in maybe_overwrite_model_dir_and_session_config
    config = run_config.RunConfig.replace(config, session_config=session_config)
  File "/Users/antondmitriev/anaconda3/lib/python3.6/site-packages/tensorflow/python/estimator/run_config.py", line 849, in replace
    copy.deepcopy(self),
  File "/Users/antondmitriev/anaconda3/lib/python3.6/copy.py", line 180, in deepcopy
    y = _reconstruct(x, memo, *rv)
  File "/Users/antondmitriev/anaconda3/lib/python3.6/copy.py", line 280, in _reconstruct
    state = deepcopy(state, memo)
  File "/Users/antondmitriev/anaconda3/lib/python3.6/copy.py", line 150, in deepcopy
    y = copier(x, memo)
  File "/Users/antondmitriev/anaconda3/lib/python3.6/copy.py", line 240, in _deepcopy_dict
    y[deepcopy(key, memo)] = deepcopy(value, memo)
  File "/Users/antondmitriev/anaconda3/lib/python3.6/copy.py", line 180, in deepcopy
    y = _reconstruct(x, memo, *rv)
  File "/Users/antondmitriev/anaconda3/lib/python3.6/copy.py", line 280, in _reconstruct
    state = deepcopy(state, memo)
  File "/Users/antondmitriev/anaconda3/lib/python3.6/copy.py", line 150, in deepcopy
    y = copier(x, memo)
  File "/Users/antondmitriev/anaconda3/lib/python3.6/copy.py", line 240, in _deepcopy_dict
    y[deepcopy(key, memo)] = deepcopy(value, memo)
  File "/Users/antondmitriev/anaconda3/lib/python3.6/copy.py", line 180, in deepcopy
    y = _reconstruct(x, memo, *rv)
  File "/Users/antondmitriev/anaconda3/lib/python3.6/copy.py", line 280, in _reconstruct
    state = deepcopy(state, memo)
  File "/Users/antondmitriev/anaconda3/lib/python3.6/copy.py", line 150, in deepcopy
    y = copier(x, memo)
  File "/Users/antondmitriev/anaconda3/lib/python3.6/copy.py", line 240, in _deepcopy_dict
    y[deepcopy(key, memo)] = deepcopy(value, memo)
  File "/Users/antondmitriev/anaconda3/lib/python3.6/copy.py", line 169, in deepcopy
    rv = reductor(4)
TypeError: can't pickle _thread._local objects
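
The failure at the bottom of the traceback can be reproduced in isolation: `copy.deepcopy` falls back to the pickle protocol for objects it has no copier for, and `_thread._local` instances refuse to be pickled. The `Config` class below is a hypothetical stand-in for an object that, like the RunConfig here, ends up holding a `threading.local` somewhere in its state:

```python
import copy
import threading

class Config:
    """Hypothetical stand-in for an object whose state contains
    a threading.local, as the RunConfig does in this issue."""
    def __init__(self):
        self._local = threading.local()

try:
    copy.deepcopy(Config())
except TypeError as exc:
    # The exact message wording varies by Python version
    # (e.g. "can't pickle _thread._local objects" on 3.6).
    print(type(exc).__name__)  # TypeError
```

This is why `RunConfig.replace` (which deep-copies `self`) blows up as soon as the distribution strategy stashes thread-local state on the config.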

About this issue

  • Original URL
  • State: closed
  • Created 6 years ago
  • Comments: 17 (14 by maintainers)

Most upvoted comments

@dmitrievanthony But you haven’t specified remote_cluster in your code?

Hi, @khaitranvan96kt. The issue was closed, so I removed the examples. If you are interested in similar use cases, you can have a look at the example I’ve prepared and tried to merge into the “ecosystem” module: https://github.com/tensorflow/ecosystem/pull/101.

@dmitrievanthony A bug was introduced after the 1.11 release. I will fix it soon. If you are using a nightly build, you can comment out this line: https://github.com/tensorflow/tensorflow/blob/master/tensorflow/contrib/distribute/python/collective_all_reduce_strategy.py#L232. Anyway, my PR should be merged soon.

@yuefengz, @harshini-gadige, yes I use standalone client mode.

I’ve prepared an example based on keras_model_to_estimator_client.py. Please take a look here. It uses FixedLengthRecordDataset and CollectiveAllReduceStrategy.

I don’t use Kubernetes, so to reproduce this problem I do the following:

  1. Start worker1.py.
  2. Start worker2.py.
  3. Set TF_CONFIG='{"cluster":{"worker":["localhost:1111", "localhost:1112"],"chief":["localhost:1113"]}, "task":{"type":"chief","index":0}}'
  4. Start keras_model_to_estimator_client.py. Get the exception.
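
Steps 3 and 4 can also be done in-process: setting TF_CONFIG via `os.environ` before constructing the estimator is equivalent to exporting it in the shell. A small sketch using the exact cluster spec from step 3, with a sanity check that the JSON is well-formed before launching the client script:

```python
import json
import os

# The TF_CONFIG value from step 3: two workers plus a chief,
# with this process taking the chief role.
os.environ["TF_CONFIG"] = (
    '{"cluster":{"worker":["localhost:1111", "localhost:1112"],'
    '"chief":["localhost:1113"]}, "task":{"type":"chief","index":0}}'
)

# Verify the cluster spec parses before starting
# keras_model_to_estimator_client.py (step 4).
tf_config = json.loads(os.environ["TF_CONFIG"])
print(tf_config["task"]["type"])            # chief
print(len(tf_config["cluster"]["worker"]))  # 2
```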
