tensorflow: Distributed training fails when I use CollectiveAllReduceStrategy
System information
- Have I written custom code (as opposed to using a stock example script provided in TensorFlow): I slightly updated the mnist.py example so that it uses CollectiveAllReduceStrategy. The updated version is here.
- OS Platform and Distribution (e.g., Linux Ubuntu 16.04): macOS 10.13.3
- TensorFlow installed from (source or binary): binary
- TensorFlow version (use command below): 1.11.0-dev20180913
- Python version: 3.6.3
- Exact command to reproduce: see the updated example.
Describe the problem
Hi,
I'm trying to update the mnist model from the official repository so that it uses CollectiveAllReduceStrategy, as shown in keras_model_to_estimator_client.py. You can find the updated example here. Unfortunately, it fails on a deepcopy of the run config.
Source code / logs
```
Traceback (most recent call last):
  File "mnist.py", line 286, in <module>
    absl_app.run(main)
  File "/Users/antondmitriev/anaconda3/lib/python3.6/site-packages/absl/app.py", line 274, in run
    _run_main(main, args)
  File "/Users/antondmitriev/anaconda3/lib/python3.6/site-packages/absl/app.py", line 238, in _run_main
    sys.exit(main(argv))
  File "mnist.py", line 280, in main
    run_mnist(flags.FLAGS)
  File "mnist.py", line 226, in run_mnist
    'data_format': data_format,
  File "/Users/antondmitriev/anaconda3/lib/python3.6/site-packages/tensorflow/python/estimator/estimator.py", line 190, in __init__
    model_dir)
  File "/Users/antondmitriev/anaconda3/lib/python3.6/site-packages/tensorflow/python/estimator/estimator.py", line 1591, in maybe_overwrite_model_dir_and_session_config
    config = run_config.RunConfig.replace(config, session_config=session_config)
  File "/Users/antondmitriev/anaconda3/lib/python3.6/site-packages/tensorflow/python/estimator/run_config.py", line 849, in replace
    copy.deepcopy(self),
  File "/Users/antondmitriev/anaconda3/lib/python3.6/copy.py", line 180, in deepcopy
    y = _reconstruct(x, memo, *rv)
  File "/Users/antondmitriev/anaconda3/lib/python3.6/copy.py", line 280, in _reconstruct
    state = deepcopy(state, memo)
  File "/Users/antondmitriev/anaconda3/lib/python3.6/copy.py", line 150, in deepcopy
    y = copier(x, memo)
  File "/Users/antondmitriev/anaconda3/lib/python3.6/copy.py", line 240, in _deepcopy_dict
    y[deepcopy(key, memo)] = deepcopy(value, memo)
  File "/Users/antondmitriev/anaconda3/lib/python3.6/copy.py", line 180, in deepcopy
    y = _reconstruct(x, memo, *rv)
  File "/Users/antondmitriev/anaconda3/lib/python3.6/copy.py", line 280, in _reconstruct
    state = deepcopy(state, memo)
  File "/Users/antondmitriev/anaconda3/lib/python3.6/copy.py", line 150, in deepcopy
    y = copier(x, memo)
  File "/Users/antondmitriev/anaconda3/lib/python3.6/copy.py", line 240, in _deepcopy_dict
    y[deepcopy(key, memo)] = deepcopy(value, memo)
  File "/Users/antondmitriev/anaconda3/lib/python3.6/copy.py", line 180, in deepcopy
    y = _reconstruct(x, memo, *rv)
  File "/Users/antondmitriev/anaconda3/lib/python3.6/copy.py", line 280, in _reconstruct
    state = deepcopy(state, memo)
  File "/Users/antondmitriev/anaconda3/lib/python3.6/copy.py", line 150, in deepcopy
    y = copier(x, memo)
  File "/Users/antondmitriev/anaconda3/lib/python3.6/copy.py", line 240, in _deepcopy_dict
    y[deepcopy(key, memo)] = deepcopy(value, memo)
  File "/Users/antondmitriev/anaconda3/lib/python3.6/copy.py", line 169, in deepcopy
    rv = reductor(4)
TypeError: can't pickle _thread._local objects
About this issue
- Original URL
- State: closed
- Created 6 years ago
- Comments: 17 (14 by maintainers)
@dmitrievanthony But you haven't specified `remote_cluster` in your code?

Hi, @khaitranvan96kt. The issue was closed, so I removed the examples. If you are interested in similar use cases, you can have a look at the example I prepared and tried to merge into the "ecosystem" module: https://github.com/tensorflow/ecosystem/pull/101.
This should fix the problem: https://github.com/tensorflow/tensorflow/commit/f10b00558de87020554c9c0512537dab96dba918
@dmitrievanthony A bug was introduced after the 1.11 release; I will fix it soon. If you are using a nightly build, you can comment out this line: https://github.com/tensorflow/tensorflow/blob/master/tensorflow/contrib/distribute/python/collective_all_reduce_strategy.py#L232. In any case, my PR should be merged soon.
@yuefengz, @harshini-gadige, yes I use standalone client mode.
I've prepared an example based on keras_model_to_estimator_client.py. Please take a look here. It uses `FixedLengthRecordDataset` and the `CollectiveAllReduceStrategy` strategy.

I don't use Kubernetes, so to reproduce this problem I do the following:

```
TF_CONFIG='{"cluster":{"worker":["localhost:1111", "localhost:1112"],"chief":["localhost:1113"]}, "task":{"type":"chief","index":0}}'
```
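The `TF_CONFIG` variable is plain JSON: a cluster spec plus this process's own task assignment. A small stdlib-only sketch (values copied from the reproduction above; ports are arbitrary) showing how a multi-worker process would resolve its role and address from it:

```python
import json
import os

# Same cluster definition as in the TF_CONFIG line above:
# two workers plus one chief, all on localhost.
os.environ["TF_CONFIG"] = json.dumps({
    "cluster": {
        "worker": ["localhost:1111", "localhost:1112"],
        "chief": ["localhost:1113"],
    },
    "task": {"type": "chief", "index": 0},
})

tf_config = json.loads(os.environ["TF_CONFIG"])
cluster = tf_config["cluster"]
task = tf_config["task"]

# Each process looks up its own address by task type and index.
my_address = cluster[task["type"]][task["index"]]
num_tasks = len(cluster["worker"]) + len(cluster["chief"])

print(my_address)  # -> localhost:1113
print(num_tasks)   # -> 3
```

To reproduce the multi-worker setup locally, each process is launched with the same `cluster` section but a different `task` entry (chief index 0, worker index 0, worker index 1).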