tensorflow: Unable to train a LinearClassifier with categorical columns and CollectiveAllReduce
System information
- Have I written custom code (as opposed to using a stock example script provided in TensorFlow): Yes
- OS Platform and Distribution (e.g., Linux Ubuntu 16.04): CentOS 7
- Mobile device (e.g. iPhone 8, Pixel 2, Samsung Galaxy) if the issue happens on mobile device: No
- TensorFlow installed from (source or binary): binary
- TensorFlow version (use command below): v1.12.0-0-ga6d8ffae09 1.12.0
- Python version: 3.6.6
- Bazel version (if compiling from source): N/A
- GCC/Compiler version (if compiling from source): N/A
- CUDA/cuDNN version: N/A
- GPU model and memory: N/A
I’m trying to train a simple LinearClassifier model (tf.estimator.LinearClassifier) with different distribution strategies. I’ve successfully trained it with parameter servers using both numeric columns (tf.feature_column.numeric_column) and categorical columns (tf.feature_column.categorical_column_with_*). I’ve also successfully trained it with CollectiveAllReduce using only numeric columns. Unfortunately, the same model fails under CollectiveAllReduce as soon as one numeric column is replaced by a categorical column:
ValueError: `IndexSlices` is not supported for Collective All-Reduce.
See below for the full traceback and logs.
Here is a part of the code I am running:
estimator = tf.estimator.LinearClassifier(
    feature_columns=[
        tf.feature_column.categorical_column_with_hash_bucket("partnerid", 13, dtype=tf.int64),
        tf.feature_column.numeric_column("campaignid", dtype=tf.int64)
    ],
    model_dir="my_path",
    n_classes=2,
    optimizer="Adam",
    config=tf.estimator.RunConfig(
        experimental_distribute=tf.contrib.distribute.DistributeConfig(
            train_distribute=tf.contrib.distribute.CollectiveAllReduceStrategy(),
            remote_cluster=cluster_spec
        )
    )
)
tf.estimator.train_and_evaluate(
    estimator,
    tf.estimator.TrainSpec(
        input_fn_train,
        max_steps=training_steps
    ),
    tf.estimator.EvalSpec(
        input_fn_test,
        steps=evaluation_steps,
        start_delay_secs=0,
        throttle_secs=evaluation_throttle_secs
    )
)
Is this a known issue? Are there known limitations of CollectiveAllReduce?
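One possible workaround, sketched below and not verified on this setup: wrap the categorical column in tf.feature_column.indicator_column. That densifies the one-hot lookup, so the gradient with respect to the linear weights should be a dense tensor rather than IndexedSlices, which Collective All-Reduce can handle. This trades memory for compatibility and is only reasonable for small vocabularies such as the 13-bucket hash here.

```python
import tensorflow as tf

partner_cat = tf.feature_column.categorical_column_with_hash_bucket(
    "partnerid", 13, dtype=tf.int64)

feature_columns = [
    # indicator_column turns the sparse lookup into a dense one-hot
    # input, so the weight gradient is dense (assumption: this avoids
    # the IndexedSlices all-reduce error).
    tf.feature_column.indicator_column(partner_cat),
    tf.feature_column.numeric_column("campaignid", dtype=tf.int64),
]
```

The rest of the LinearClassifier / RunConfig setup would stay as in the snippet above.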
INFO:tensorflow:Waiting for worker:0/init
INFO:tensorflow:Waiting for worker:1/init
cluster_spec: ClusterSpec({'worker': ['10.188.17.14:42897', '10.188.50.21:48063']})
INFO:tensorflow:CollectiveAllReduceStrategy with local_devices = ['/device:CPU:0']
INFO:tensorflow:Initializing RunConfig with distribution strategies.
INFO:tensorflow:RunConfig initialized for Distribute Coordinator with STANDALONE_CLIENT mode
INFO:tensorflow:Using config: {'_model_dir': 'hdfs://root/user/username/model_dir', '_tf_random_seed': None, '_save_summary_steps': 100, '_save_checkpoints_steps': None, '_save_checkpoints_secs': 600, '_session_config': allow_soft_placement: true
graph_options {
rewrite_options {
meta_optimizer_iterations: ONE
}
}
, '_keep_checkpoint_max': 5, '_keep_checkpoint_every_n_hours': 10000, '_log_step_count_steps': 100, '_train_distribute': <tensorflow.contrib.distribute.python.collective_all_reduce_strategy.CollectiveAllReduceStrategy object at 0x7f1c4a7bef60>, '_device_fn': None, '_protocol': None, '_eval_distribute': None, '_experimental_distribute': DistributeConfig(train_distribute=<tensorflow.contrib.distribute.python.collective_all_reduce_strategy.CollectiveAllReduceStrategy object at 0x7f1c4a7bef60>, eval_distribute=None, remote_cluster=<tensorflow.python.training.server_lib.ClusterSpec object at 0x7f1c4a7d3208>), '_service': None, '_cluster_spec': <tensorflow.python.training.server_lib.ClusterSpec object at 0x7f1c4a7d3208>, '_task_type': 'worker', '_task_id': 0, '_global_id_in_cluster': 0, '_master': '', '_evaluation_master': '', '_is_chief': True, '_num_ps_replicas': 0, '_num_worker_replicas': 1, '_distribute_coordinator_mode': 'standalone_client'}
INFO:tensorflow:Running `train_and_evaluate` with Distribute Coordinator.
INFO:tensorflow:Running Distribute Coordinator with mode = 'standalone_client', cluster_spec = {'worker': ['10.188.17.14:42897', '10.188.50.21:48063']}, task_type = None, task_id = None, environment = None, rpc_layer = 'grpc'
WARNING:tensorflow:`eval_strategy` is not passed in. No distribution strategy will be used for evaluation.
INFO:tensorflow:Multi-worker CollectiveAllReduceStrategy with cluster_spec = {'worker': ['10.188.17.14:42897', '10.188.50.21:48063']}, task_type = 'worker', task_id = 0, num_workers = 2, local_devices = ['/job:worker/task:0']
INFO:tensorflow:Multi-worker CollectiveAllReduceStrategy with cluster_spec = {'worker': ['10.188.17.14:42897', '10.188.50.21:48063']}, task_type = 'worker', task_id = 1, num_workers = 2, local_devices = ['/job:worker/task:1']
INFO:tensorflow:Updated config: {'_model_dir': 'hdfs://root/user/username/model_dir', '_tf_random_seed': None, '_save_summary_steps': 100, '_save_checkpoints_steps': None, '_save_checkpoints_secs': 600, '_session_config': allow_soft_placement: true
graph_options {
rewrite_options {
meta_optimizer_iterations: ONE
}
}
, '_keep_checkpoint_max': 5, '_keep_checkpoint_every_n_hours': 10000, '_log_step_count_steps': 100, '_train_distribute': <tensorflow.contrib.distribute.python.collective_all_reduce_strategy.CollectiveAllReduceStrategy object at 0x7f1c4a7d3710>, '_device_fn': None, '_protocol': None, '_eval_distribute': None, '_experimental_distribute': DistributeConfig(train_distribute=<tensorflow.contrib.distribute.python.collective_all_reduce_strategy.CollectiveAllReduceStrategy object at 0x7f1c4a7601d0>, eval_distribute=None, remote_cluster=<tensorflow.python.training.server_lib.ClusterSpec object at 0x7f1c4a760438>), '_service': None, '_cluster_spec': <tensorflow.python.training.server_lib.ClusterSpec object at 0x7f1c4a7603c8>, '_task_type': 'worker', '_task_id': 0, '_global_id_in_cluster': 0, '_master': 'grpc://10.188.17.14:42897', '_evaluation_master': 'grpc://10.188.17.14:42897', '_is_chief': True, '_num_ps_replicas': 0, '_num_worker_replicas': 2, '_distribute_coordinator_mode': 'standalone_client'}
INFO:tensorflow:Updated config: {'_model_dir': 'hdfs://root/user/username/model_dir', '_tf_random_seed': None, '_save_summary_steps': 100, '_save_checkpoints_steps': None, '_save_checkpoints_secs': 600, '_session_config': allow_soft_placement: true
graph_options {
rewrite_options {
meta_optimizer_iterations: ONE
}
}
, '_keep_checkpoint_max': 5, '_keep_checkpoint_every_n_hours': 10000, '_log_step_count_steps': 100, '_train_distribute': <tensorflow.contrib.distribute.python.collective_all_reduce_strategy.CollectiveAllReduceStrategy object at 0x7f1c4a7d3cc0>, '_device_fn': None, '_protocol': None, '_eval_distribute': None, '_experimental_distribute': DistributeConfig(train_distribute=<tensorflow.contrib.distribute.python.collective_all_reduce_strategy.CollectiveAllReduceStrategy object at 0x7f1c4a760630>, eval_distribute=None, remote_cluster=<tensorflow.python.training.server_lib.ClusterSpec object at 0x7f1c4a760898>), '_service': None, '_cluster_spec': <tensorflow.python.training.server_lib.ClusterSpec object at 0x7f1c4a760828>, '_task_type': 'worker', '_task_id': 1, '_global_id_in_cluster': 1, '_master': 'grpc://10.188.50.21:48063', '_evaluation_master': 'grpc://10.188.50.21:48063', '_is_chief': False, '_num_ps_replicas': 0, '_num_worker_replicas': 2, '_distribute_coordinator_mode': 'standalone_client'}
2018-11-27 13:25:56.606318: I tensorflow/core/platform/cpu_feature_guard.cc:141] Your CPU supports instructions that this TensorFlow binary was not compiled to use: AVX2 AVX512F FMA
INFO:tensorflow:Calling model_fn.
WARNING:tensorflow:Partitioned variables are disabled when using DistributionStrategy.
INFO:tensorflow:Calling model_fn.
DEBUG:tensorflow:Transforming feature_column _NumericColumn(key='campaignid', shape=(1,), default_value=None, dtype=tf.int64, normalizer_fn=None).
DEBUG:tensorflow:Transforming feature_column _NumericColumn(key='campaignid', shape=(1,), default_value=None, dtype=tf.int64, normalizer_fn=None).
DEBUG:tensorflow:Transforming feature_column _HashedCategoricalColumn(key='partnerid', hash_bucket_size=13, dtype=tf.int64).
DEBUG:tensorflow:Transforming feature_column _HashedCategoricalColumn(key='partnerid', hash_bucket_size=13, dtype=tf.int64).
INFO:tensorflow:Error reported to Coordinator: `IndexSlices` is not supported for Collective All-Reduce.
Traceback (most recent call last):
File "/home/username/miniconda3/envs/explorer2/lib/python3.6/site-packages/tensorflow/python/training/coordinator.py", line 297, in stop_on_exception
yield
File "/home/username/miniconda3/envs/explorer2/lib/python3.6/site-packages/tensorflow/contrib/distribute/python/mirrored_strategy.py", line 177, in _call_for_each_tower
**merge_kwargs)
File "/home/username/miniconda3/envs/explorer2/lib/python3.6/site-packages/tensorflow/python/training/optimizer.py", line 661, in _distributed_apply
variable_scope.VariableAggregation.SUM, grads_and_vars)
File "/home/username/miniconda3/envs/explorer2/lib/python3.6/site-packages/tensorflow/python/training/distribute.py", line 776, in batch_reduce
return self._batch_reduce(aggregation, value_destination_pairs)
File "/home/username/miniconda3/envs/explorer2/lib/python3.6/site-packages/tensorflow/contrib/distribute/python/mirrored_strategy.py", line 628, in _batch_reduce
value_destination_pairs)
File "/home/username/miniconda3/envs/explorer2/lib/python3.6/site-packages/tensorflow/contrib/distribute/python/cross_tower_ops.py", line 243, in batch_reduce
return self._batch_reduce(aggregation, value_destination_pairs)
File "/home/username/miniconda3/envs/explorer2/lib/python3.6/site-packages/tensorflow/contrib/distribute/python/cross_tower_ops.py", line 829, in _batch_reduce
"`IndexSlices` is not supported for Collective All-Reduce.")
ValueError: `IndexSlices` is not supported for Collective All-Reduce.
Exception in thread Thread-5:
Traceback (most recent call last):
File "/home/username/miniconda3/envs/explorer2/lib/python3.6/threading.py", line 916, in _bootstrap_inner
self.run()
File "/home/username/miniconda3/envs/explorer2/lib/python3.6/threading.py", line 864, in run
self._target(*self._args, **self._kwargs)
File "/home/username/miniconda3/envs/explorer2/lib/python3.6/site-packages/tensorflow/python/distribute/distribute_coordinator.py", line 344, in _run_single_worker
worker_fn(strategy)
File "/home/username/miniconda3/envs/explorer2/lib/python3.6/site-packages/tensorflow/python/distribute/estimator_training.py", line 246, in _worker_fn
hooks=hooks)
File "/home/username/miniconda3/envs/explorer2/lib/python3.6/site-packages/tensorflow/python/estimator/estimator.py", line 354, in train
loss = self._train_model(input_fn, hooks, saving_listeners)
File "/home/username/miniconda3/envs/explorer2/lib/python3.6/site-packages/tensorflow/python/estimator/estimator.py", line 1205, in _train_model
return self._train_model_distributed(input_fn, hooks, saving_listeners)
File "/home/username/miniconda3/envs/explorer2/lib/python3.6/site-packages/tensorflow/python/estimator/estimator.py", line 1316, in _train_model_distributed
self.config)
File "/home/username/miniconda3/envs/explorer2/lib/python3.6/site-packages/tensorflow/python/training/distribute.py", line 721, in call_for_each_tower
return self._call_for_each_tower(fn, *args, **kwargs)
File "/home/username/miniconda3/envs/explorer2/lib/python3.6/site-packages/tensorflow/contrib/distribute/python/mirrored_strategy.py", line 556, in _call_for_each_tower
return _call_for_each_tower(self, fn, *args, **kwargs)
File "/home/username/miniconda3/envs/explorer2/lib/python3.6/site-packages/tensorflow/contrib/distribute/python/mirrored_strategy.py", line 183, in _call_for_each_tower
coord.join(threads)
File "/home/username/miniconda3/envs/explorer2/lib/python3.6/site-packages/tensorflow/python/training/coordinator.py", line 389, in join
six.reraise(*self._exc_info_to_raise)
File "/home/username/miniconda3/envs/explorer2/lib/python3.6/site-packages/six.py", line 693, in reraise
raise value
File "/home/username/miniconda3/envs/explorer2/lib/python3.6/site-packages/tensorflow/python/training/coordinator.py", line 297, in stop_on_exception
yield
File "/home/username/miniconda3/envs/explorer2/lib/python3.6/site-packages/tensorflow/contrib/distribute/python/mirrored_strategy.py", line 177, in _call_for_each_tower
**merge_kwargs)
File "/home/username/miniconda3/envs/explorer2/lib/python3.6/site-packages/tensorflow/python/training/optimizer.py", line 661, in _distributed_apply
variable_scope.VariableAggregation.SUM, grads_and_vars)
File "/home/username/miniconda3/envs/explorer2/lib/python3.6/site-packages/tensorflow/python/training/distribute.py", line 776, in batch_reduce
return self._batch_reduce(aggregation, value_destination_pairs)
File "/home/username/miniconda3/envs/explorer2/lib/python3.6/site-packages/tensorflow/contrib/distribute/python/mirrored_strategy.py", line 628, in _batch_reduce
value_destination_pairs)
File "/home/username/miniconda3/envs/explorer2/lib/python3.6/site-packages/tensorflow/contrib/distribute/python/cross_tower_ops.py", line 243, in batch_reduce
return self._batch_reduce(aggregation, value_destination_pairs)
File "/home/username/miniconda3/envs/explorer2/lib/python3.6/site-packages/tensorflow/contrib/distribute/python/cross_tower_ops.py", line 829, in _batch_reduce
"`IndexSlices` is not supported for Collective All-Reduce.")
ValueError: `IndexSlices` is not supported for Collective All-Reduce.
INFO:tensorflow:Error reported to Coordinator: `IndexSlices` is not supported for Collective All-Reduce.
Traceback (most recent call last):
File "/home/username/miniconda3/envs/explorer2/lib/python3.6/site-packages/tensorflow/python/training/coordinator.py", line 297, in stop_on_exception
yield
File "/home/username/miniconda3/envs/explorer2/lib/python3.6/site-packages/tensorflow/contrib/distribute/python/mirrored_strategy.py", line 177, in _call_for_each_tower
**merge_kwargs)
File "/home/username/miniconda3/envs/explorer2/lib/python3.6/site-packages/tensorflow/python/training/optimizer.py", line 661, in _distributed_apply
variable_scope.VariableAggregation.SUM, grads_and_vars)
File "/home/username/miniconda3/envs/explorer2/lib/python3.6/site-packages/tensorflow/python/training/distribute.py", line 776, in batch_reduce
return self._batch_reduce(aggregation, value_destination_pairs)
File "/home/username/miniconda3/envs/explorer2/lib/python3.6/site-packages/tensorflow/contrib/distribute/python/mirrored_strategy.py", line 628, in _batch_reduce
value_destination_pairs)
File "/home/username/miniconda3/envs/explorer2/lib/python3.6/site-packages/tensorflow/contrib/distribute/python/cross_tower_ops.py", line 243, in batch_reduce
return self._batch_reduce(aggregation, value_destination_pairs)
File "/home/username/miniconda3/envs/explorer2/lib/python3.6/site-packages/tensorflow/contrib/distribute/python/cross_tower_ops.py", line 829, in _batch_reduce
"`IndexSlices` is not supported for Collective All-Reduce.")
ValueError: `IndexSlices` is not supported for Collective All-Reduce.
Exception in thread Thread-4:
Traceback (most recent call last):
File "/home/username/miniconda3/envs/explorer2/lib/python3.6/threading.py", line 916, in _bootstrap_inner
self.run()
File "/home/username/miniconda3/envs/explorer2/lib/python3.6/threading.py", line 864, in run
self._target(*self._args, **self._kwargs)
File "/home/username/miniconda3/envs/explorer2/lib/python3.6/site-packages/tensorflow/python/distribute/distribute_coordinator.py", line 344, in _run_single_worker
worker_fn(strategy)
File "/home/username/miniconda3/envs/explorer2/lib/python3.6/site-packages/tensorflow/python/distribute/estimator_training.py", line 246, in _worker_fn
hooks=hooks)
File "/home/username/miniconda3/envs/explorer2/lib/python3.6/site-packages/tensorflow/python/estimator/estimator.py", line 354, in train
loss = self._train_model(input_fn, hooks, saving_listeners)
File "/home/username/miniconda3/envs/explorer2/lib/python3.6/site-packages/tensorflow/python/estimator/estimator.py", line 1205, in _train_model
return self._train_model_distributed(input_fn, hooks, saving_listeners)
File "/home/username/miniconda3/envs/explorer2/lib/python3.6/site-packages/tensorflow/python/estimator/estimator.py", line 1316, in _train_model_distributed
self.config)
File "/home/username/miniconda3/envs/explorer2/lib/python3.6/site-packages/tensorflow/python/training/distribute.py", line 721, in call_for_each_tower
return self._call_for_each_tower(fn, *args, **kwargs)
File "/home/username/miniconda3/envs/explorer2/lib/python3.6/site-packages/tensorflow/contrib/distribute/python/mirrored_strategy.py", line 556, in _call_for_each_tower
return _call_for_each_tower(self, fn, *args, **kwargs)
File "/home/username/miniconda3/envs/explorer2/lib/python3.6/site-packages/tensorflow/contrib/distribute/python/mirrored_strategy.py", line 183, in _call_for_each_tower
coord.join(threads)
File "/home/username/miniconda3/envs/explorer2/lib/python3.6/site-packages/tensorflow/python/training/coordinator.py", line 389, in join
six.reraise(*self._exc_info_to_raise)
File "/home/username/miniconda3/envs/explorer2/lib/python3.6/site-packages/six.py", line 693, in reraise
raise value
File "/home/username/miniconda3/envs/explorer2/lib/python3.6/site-packages/tensorflow/python/training/coordinator.py", line 297, in stop_on_exception
yield
File "/home/username/miniconda3/envs/explorer2/lib/python3.6/site-packages/tensorflow/contrib/distribute/python/mirrored_strategy.py", line 177, in _call_for_each_tower
**merge_kwargs)
File "/home/username/miniconda3/envs/explorer2/lib/python3.6/site-packages/tensorflow/python/training/optimizer.py", line 661, in _distributed_apply
variable_scope.VariableAggregation.SUM, grads_and_vars)
File "/home/username/miniconda3/envs/explorer2/lib/python3.6/site-packages/tensorflow/python/training/distribute.py", line 776, in batch_reduce
return self._batch_reduce(aggregation, value_destination_pairs)
File "/home/username/miniconda3/envs/explorer2/lib/python3.6/site-packages/tensorflow/contrib/distribute/python/mirrored_strategy.py", line 628, in _batch_reduce
value_destination_pairs)
File "/home/username/miniconda3/envs/explorer2/lib/python3.6/site-packages/tensorflow/contrib/distribute/python/cross_tower_ops.py", line 243, in batch_reduce
return self._batch_reduce(aggregation, value_destination_pairs)
File "/home/username/miniconda3/envs/explorer2/lib/python3.6/site-packages/tensorflow/contrib/distribute/python/cross_tower_ops.py", line 829, in _batch_reduce
"`IndexSlices` is not supported for Collective All-Reduce.")
ValueError: `IndexSlices` is not supported for Collective All-Reduce.
About this issue
- Original URL
- State: closed
- Created 6 years ago
- Comments: 19 (9 by maintainers)
SparseTensor and IndexedSlices are currently not supported, since they require an all-gather. @poxvoculi @dubey
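The maintainer's point can be illustrated with a small plain-Python sketch (no TensorFlow; the function names here are made up for illustration). A collective all-reduce sums tensors of identical shape elementwise, one slot per worker per element. An IndexedSlices-style gradient instead carries a different set of row indices on each worker, so before any reduction the workers would first need an all-gather of the (index, value) pairs:

```python
# Dense gradients: every worker holds a tensor of the same shape,
# so an all-reduce is a simple elementwise sum.
def allreduce_dense(grads_per_worker):
    return [sum(vals) for vals in zip(*grads_per_worker)]

# Sparse (IndexedSlices-style) gradients: each worker only touched the
# rows its batch hit, so the index sets differ across workers.  The
# pairs must first be gathered from all workers, then summed per row.
def allreduce_sparse(slices_per_worker, num_rows):
    dense = [0.0] * num_rows
    for slices in slices_per_worker:      # the "all-gather" step
        for idx, val in slices:
            dense[idx] += val             # then reduce per row
    return dense

print(allreduce_dense([[1.0, 2.0], [3.0, 4.0]]))   # [4.0, 6.0]
# Worker 0 updated rows 0 and 2; worker 1 updated rows 2 and 3.
print(allreduce_sparse([[(0, 1.0), (2, 2.0)],
                        [(2, 3.0), (3, 4.0)]], num_rows=4))
# [1.0, 0.0, 5.0, 4.0]
```

The fixed-shape assumption is what the collective kernels rely on, and it is exactly what the variable-length index lists of IndexedSlices break.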