tensorflow: Unable to train a LinearClassifier with categorical columns and CollectiveAllReduce

System information

  • Have I written custom code (as opposed to using a stock example script provided in TensorFlow): Yes
  • OS Platform and Distribution (e.g., Linux Ubuntu 16.04): CentOS 7
  • Mobile device (e.g. iPhone 8, Pixel 2, Samsung Galaxy) if the issue happens on mobile device: No
  • TensorFlow installed from (source or binary): binary
  • TensorFlow version (use command below): v1.12.0-0-ga6d8ffae09 1.12.0
  • Python version: 3.6.6
  • Bazel version (if compiling from source): N/A
  • GCC/Compiler version (if compiling from source): N/A
  • CUDA/cuDNN version: N/A
  • GPU model and memory: N/A

I’m trying to train a simple LinearClassifier model (tf.estimator.LinearClassifier) with different distribution strategies. I’ve successfully trained a model with parameter servers using numeric columns (tf.feature_column.numeric_column) and categorical columns (tf.feature_column.categorical_column_with_*), and I’ve also successfully trained a model with CollectiveAllReduce using only numeric columns. Unfortunately, the same model under CollectiveAllReduce fails with the following error as soon as one numeric column is replaced by a categorical column:

ValueError: IndexSlices is not supported for Collective All-Reduce.

See below for the full traceback and logs.

Here is the relevant part of the code I am running:

import tensorflow as tf

# cluster_spec, input_fn_train, input_fn_test, training_steps,
# evaluation_steps and evaluation_throttle_secs are defined elsewhere.
estimator = tf.estimator.LinearClassifier(
    feature_columns=[
        tf.feature_column.categorical_column_with_hash_bucket("partnerid", 13, dtype=tf.int64),
        tf.feature_column.numeric_column("campaignid", dtype=tf.int64)
    ],
    model_dir="my_path",
    n_classes=2,
    optimizer="Adam",
    config=tf.estimator.RunConfig(
        experimental_distribute=tf.contrib.distribute.DistributeConfig(
            train_distribute=tf.contrib.distribute.CollectiveAllReduceStrategy(),
            remote_cluster=cluster_spec
        )
    )
)

tf.estimator.train_and_evaluate(
    estimator,
    tf.estimator.TrainSpec(
        input_fn_train,
        max_steps=training_steps
    ),
    tf.estimator.EvalSpec(
        input_fn_test,
        steps=evaluation_steps,
        start_delay_secs=0,
        throttle_secs=evaluation_throttle_secs
    )
)
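
The input functions, cluster spec and step counts referenced above are defined elsewhere and are not shown here. For illustration only, a hypothetical minimal stand-in for input_fn_train (feature names kept, values invented) could look like the following; input_fn_test would be analogous:

def input_fn_train():
    # Hypothetical toy data; the real pipeline is not part of this issue.
    features = {
        "partnerid": tf.constant([[1], [2], [3]], dtype=tf.int64),
        "campaignid": tf.constant([[10], [20], [30]], dtype=tf.int64),
    }
    labels = tf.constant([[0], [1], [0]], dtype=tf.int64)
    # Distribution strategies expect input_fn to return a tf.data.Dataset.
    return tf.data.Dataset.from_tensor_slices((features, labels)).repeat().batch(2)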

Is this a known issue? Are there current limitations with CollectiveAllReduce?
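
For comparison, the variant that does train successfully under CollectiveAllReduce differs only in the feature columns; a minimal sketch of that working list (same assumed feature names, numeric columns only) is:

# Trains under CollectiveAllReduce: dense numeric columns only.
feature_columns = [
    tf.feature_column.numeric_column("partnerid", dtype=tf.int64),
    tf.feature_column.numeric_column("campaignid", dtype=tf.int64),
]

Full traceback and logs: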

INFO:tensorflow:Waiting for worker:0/init
tensorflow - Waiting for worker:0/init
INFO:tensorflow:Waiting for worker:1/init
tensorflow - Waiting for worker:1/init
cluster_spec: ClusterSpec({'worker': ['10.188.17.14:42897', '10.188.50.21:48063']})
INFO:tensorflow:CollectiveAllReduceStrategy with local_devices = ['/device:CPU:0']
tensorflow - CollectiveAllReduceStrategy with local_devices = ['/device:CPU:0']
INFO:tensorflow:Initializing RunConfig with distribution strategies.
tensorflow - Initializing RunConfig with distribution strategies.
INFO:tensorflow:RunConfig initialized for Distribute Coordinator with STANDALONE_CLIENT mode
tensorflow - RunConfig initialized for Distribute Coordinator with STANDALONE_CLIENT mode
INFO:tensorflow:Using config: {'_model_dir': 'hdfs://root/user/username/model_dir', '_tf_random_seed': None, '_save_summary_steps': 100, '_save_checkpoints_steps': None, '_save_checkpoints_secs': 600, '_session_config': allow_soft_placement: true
graph_options {
  rewrite_options {
    meta_optimizer_iterations: ONE
  }
}
, '_keep_checkpoint_max': 5, '_keep_checkpoint_every_n_hours': 10000, '_log_step_count_steps': 100, '_train_distribute': <tensorflow.contrib.distribute.python.collective_all_reduce_strategy.CollectiveAllReduceStrategy object at 0x7f1c4a7bef60>, '_device_fn': None, '_protocol': None, '_eval_distribute': None, '_experimental_distribute': DistributeConfig(train_distribute=<tensorflow.contrib.distribute.python.collective_all_reduce_strategy.CollectiveAllReduceStrategy object at 0x7f1c4a7bef60>, eval_distribute=None, remote_cluster=<tensorflow.python.training.server_lib.ClusterSpec object at 0x7f1c4a7d3208>), '_service': None, '_cluster_spec': <tensorflow.python.training.server_lib.ClusterSpec object at 0x7f1c4a7d3208>, '_task_type': 'worker', '_task_id': 0, '_global_id_in_cluster': 0, '_master': '', '_evaluation_master': '', '_is_chief': True, '_num_ps_replicas': 0, '_num_worker_replicas': 1, '_distribute_coordinator_mode': 'standalone_client'}
tensorflow - Using config: {'_model_dir': 'hdfs://root/user/username/model_dir', '_tf_random_seed': None, '_save_summary_steps': 100, '_save_checkpoints_steps': None, '_save_checkpoints_secs': 600, '_session_config': allow_soft_placement: true
graph_options {
  rewrite_options {
    meta_optimizer_iterations: ONE
  }
}
, '_keep_checkpoint_max': 5, '_keep_checkpoint_every_n_hours': 10000, '_log_step_count_steps': 100, '_train_distribute': <tensorflow.contrib.distribute.python.collective_all_reduce_strategy.CollectiveAllReduceStrategy object at 0x7f1c4a7bef60>, '_device_fn': None, '_protocol': None, '_eval_distribute': None, '_experimental_distribute': DistributeConfig(train_distribute=<tensorflow.contrib.distribute.python.collective_all_reduce_strategy.CollectiveAllReduceStrategy object at 0x7f1c4a7bef60>, eval_distribute=None, remote_cluster=<tensorflow.python.training.server_lib.ClusterSpec object at 0x7f1c4a7d3208>), '_service': None, '_cluster_spec': <tensorflow.python.training.server_lib.ClusterSpec object at 0x7f1c4a7d3208>, '_task_type': 'worker', '_task_id': 0, '_global_id_in_cluster': 0, '_master': '', '_evaluation_master': '', '_is_chief': True, '_num_ps_replicas': 0, '_num_worker_replicas': 1, '_distribute_coordinator_mode': 'standalone_client'}
INFO:tensorflow:Running `train_and_evaluate` with Distribute Coordinator.
tensorflow - Running `train_and_evaluate` with Distribute Coordinator.
INFO:tensorflow:Running Distribute Coordinator with mode = 'standalone_client', cluster_spec = {'worker': ['10.188.17.14:42897', '10.188.50.21:48063']}, task_type = None, task_id = None, environment = None, rpc_layer = 'grpc'
tensorflow - Running Distribute Coordinator with mode = 'standalone_client', cluster_spec = {'worker': ['10.188.17.14:42897', '10.188.50.21:48063']}, task_type = None, task_id = None, environment = None, rpc_layer = 'grpc'
WARNING:tensorflow:`eval_strategy` is not passed in. No distribution strategy will be used for evaluation.
tensorflow - `eval_strategy` is not passed in. No distribution strategy will be used for evaluation.
INFO:tensorflow:Multi-worker CollectiveAllReduceStrategy with cluster_spec = {'worker': ['10.188.17.14:42897', '10.188.50.21:48063']}, task_type = 'worker', task_id = 0, num_workers = 2, local_devices = ['/job:worker/task:0']
tensorflow - Multi-worker CollectiveAllReduceStrategy with cluster_spec = {'worker': ['10.188.17.14:42897', '10.188.50.21:48063']}, task_type = 'worker', task_id = 0, num_workers = 2, local_devices = ['/job:worker/task:0']
INFO:tensorflow:Multi-worker CollectiveAllReduceStrategy with cluster_spec = {'worker': ['10.188.17.14:42897', '10.188.50.21:48063']}, task_type = 'worker', task_id = 1, num_workers = 2, local_devices = ['/job:worker/task:1']
tensorflow - Multi-worker CollectiveAllReduceStrategy with cluster_spec = {'worker': ['10.188.17.14:42897', '10.188.50.21:48063']}, task_type = 'worker', task_id = 1, num_workers = 2, local_devices = ['/job:worker/task:1']
INFO:tensorflow:Updated config: {'_model_dir': 'hdfs://root/user/username/model_dir', '_tf_random_seed': None, '_save_summary_steps': 100, '_save_checkpoints_steps': None, '_save_checkpoints_secs': 600, '_session_config': allow_soft_placement: true
graph_options {
  rewrite_options {
    meta_optimizer_iterations: ONE
  }
}
, '_keep_checkpoint_max': 5, '_keep_checkpoint_every_n_hours': 10000, '_log_step_count_steps': 100, '_train_distribute': <tensorflow.contrib.distribute.python.collective_all_reduce_strategy.CollectiveAllReduceStrategy object at 0x7f1c4a7d3710>, '_device_fn': None, '_protocol': None, '_eval_distribute': None, '_experimental_distribute': DistributeConfig(train_distribute=<tensorflow.contrib.distribute.python.collective_all_reduce_strategy.CollectiveAllReduceStrategy object at 0x7f1c4a7601d0>, eval_distribute=None, remote_cluster=<tensorflow.python.training.server_lib.ClusterSpec object at 0x7f1c4a760438>), '_service': None, '_cluster_spec': <tensorflow.python.training.server_lib.ClusterSpec object at 0x7f1c4a7603c8>, '_task_type': 'worker', '_task_id': 0, '_global_id_in_cluster': 0, '_master': 'grpc://10.188.17.14:42897', '_evaluation_master': 'grpc://10.188.17.14:42897', '_is_chief': True, '_num_ps_replicas': 0, '_num_worker_replicas': 2, '_distribute_coordinator_mode': 'standalone_client'}
tensorflow - Updated config: {'_model_dir': 'hdfs://root/user/username/model_dir', '_tf_random_seed': None, '_save_summary_steps': 100, '_save_checkpoints_steps': None, '_save_checkpoints_secs': 600, '_session_config': allow_soft_placement: true
graph_options {
  rewrite_options {
    meta_optimizer_iterations: ONE
  }
}
, '_keep_checkpoint_max': 5, '_keep_checkpoint_every_n_hours': 10000, '_log_step_count_steps': 100, '_train_distribute': <tensorflow.contrib.distribute.python.collective_all_reduce_strategy.CollectiveAllReduceStrategy object at 0x7f1c4a7d3710>, '_device_fn': None, '_protocol': None, '_eval_distribute': None, '_experimental_distribute': DistributeConfig(train_distribute=<tensorflow.contrib.distribute.python.collective_all_reduce_strategy.CollectiveAllReduceStrategy object at 0x7f1c4a7601d0>, eval_distribute=None, remote_cluster=<tensorflow.python.training.server_lib.ClusterSpec object at 0x7f1c4a760438>), '_service': None, '_cluster_spec': <tensorflow.python.training.server_lib.ClusterSpec object at 0x7f1c4a7603c8>, '_task_type': 'worker', '_task_id': 0, '_global_id_in_cluster': 0, '_master': 'grpc://10.188.17.14:42897', '_evaluation_master': 'grpc://10.188.17.14:42897', '_is_chief': True, '_num_ps_replicas': 0, '_num_worker_replicas': 2, '_distribute_coordinator_mode': 'standalone_client'}
INFO:tensorflow:Updated config: {'_model_dir': 'hdfs://root/user/username/model_dir', '_tf_random_seed': None, '_save_summary_steps': 100, '_save_checkpoints_steps': None, '_save_checkpoints_secs': 600, '_session_config': allow_soft_placement: true
graph_options {
  rewrite_options {
    meta_optimizer_iterations: ONE
  }
}
, '_keep_checkpoint_max': 5, '_keep_checkpoint_every_n_hours': 10000, '_log_step_count_steps': 100, '_train_distribute': <tensorflow.contrib.distribute.python.collective_all_reduce_strategy.CollectiveAllReduceStrategy object at 0x7f1c4a7d3cc0>, '_device_fn': None, '_protocol': None, '_eval_distribute': None, '_experimental_distribute': DistributeConfig(train_distribute=<tensorflow.contrib.distribute.python.collective_all_reduce_strategy.CollectiveAllReduceStrategy object at 0x7f1c4a760630>, eval_distribute=None, remote_cluster=<tensorflow.python.training.server_lib.ClusterSpec object at 0x7f1c4a760898>), '_service': None, '_cluster_spec': <tensorflow.python.training.server_lib.ClusterSpec object at 0x7f1c4a760828>, '_task_type': 'worker', '_task_id': 1, '_global_id_in_cluster': 1, '_master': 'grpc://10.188.50.21:48063', '_evaluation_master': 'grpc://10.188.50.21:48063', '_is_chief': False, '_num_ps_replicas': 0, '_num_worker_replicas': 2, '_distribute_coordinator_mode': 'standalone_client'}

tensorflow - Updated config: {'_model_dir': 'hdfs://root/user/username/model_dir', '_tf_random_seed': None, '_save_summary_steps': 100, '_save_checkpoints_steps': None, '_save_checkpoints_secs': 600, '_session_config': allow_soft_placement: true
graph_options {
  rewrite_options {
    meta_optimizer_iterations: ONE
  }
}
, '_keep_checkpoint_max': 5, '_keep_checkpoint_every_n_hours': 10000, '_log_step_count_steps': 100, '_train_distribute': <tensorflow.contrib.distribute.python.collective_all_reduce_strategy.CollectiveAllReduceStrategy object at 0x7f1c4a7d3cc0>, '_device_fn': None, '_protocol': None, '_eval_distribute': None, '_experimental_distribute': DistributeConfig(train_distribute=<tensorflow.contrib.distribute.python.collective_all_reduce_strategy.CollectiveAllReduceStrategy object at 0x7f1c4a760630>, eval_distribute=None, remote_cluster=<tensorflow.python.training.server_lib.ClusterSpec object at 0x7f1c4a760898>), '_service': None, '_cluster_spec': <tensorflow.python.training.server_lib.ClusterSpec object at 0x7f1c4a760828>, '_task_type': 'worker', '_task_id': 1, '_global_id_in_cluster': 1, '_master': 'grpc://10.188.50.21:48063', '_evaluation_master': 'grpc://10.188.50.21:48063', '_is_chief': False, '_num_ps_replicas': 0, '_num_worker_replicas': 2, '_distribute_coordinator_mode': 'standalone_client'}
2018-11-27 13:25:56.606318: I tensorflow/core/platform/cpu_feature_guard.cc:141] Your CPU supports instructions that this TensorFlow binary was not compiled to use: AVX2 AVX512F FMA
INFO:tensorflow:Calling model_fn.
tensorflow - Calling model_fn.
WARNING:tensorflow:Partitioned variables are disabled when using DistributionStrategy.
tensorflow - Partitioned variables are disabled when using DistributionStrategy.
INFO:tensorflow:Calling model_fn.
tensorflow - Calling model_fn.
DEBUG:tensorflow:Transforming feature_column _NumericColumn(key='campaignid', shape=(1,), default_value=None, dtype=tf.int64, normalizer_fn=None).
tensorflow - Transforming feature_column _NumericColumn(key='campaignid', shape=(1,), default_value=None, dtype=tf.int64, normalizer_fn=None).
DEBUG:tensorflow:Transforming feature_column _NumericColumn(key='campaignid', shape=(1,), default_value=None, dtype=tf.int64, normalizer_fn=None).
tensorflow - Transforming feature_column _NumericColumn(key='campaignid', shape=(1,), default_value=None, dtype=tf.int64, normalizer_fn=None).
DEBUG:tensorflow:Transforming feature_column _HashedCategoricalColumn(key='partnerid', hash_bucket_size=13, dtype=tf.int64).
tensorflow - Transforming feature_column _HashedCategoricalColumn(key='partnerid', hash_bucket_size=13, dtype=tf.int64).
DEBUG:tensorflow:Transforming feature_column _HashedCategoricalColumn(key='partnerid', hash_bucket_size=13, dtype=tf.int64).
tensorflow - Transforming feature_column _HashedCategoricalColumn(key='partnerid', hash_bucket_size=13, dtype=tf.int64).
INFO:tensorflow:Error reported to Coordinator: `IndexSlices` is not supported for Collective All-Reduce.
Traceback (most recent call last):
  File "/home/username/miniconda3/envs/explorer2/lib/python3.6/site-packages/tensorflow/python/training/coordinator.py", line 297, in stop_on_exception
    yield
  File "/home/username/miniconda3/envs/explorer2/lib/python3.6/site-packages/tensorflow/contrib/distribute/python/mirrored_strategy.py", line 177, in _call_for_each_tower
    **merge_kwargs)
  File "/home/username/miniconda3/envs/explorer2/lib/python3.6/site-packages/tensorflow/python/training/optimizer.py", line 661, in _distributed_apply
    variable_scope.VariableAggregation.SUM, grads_and_vars)
  File "/home/username/miniconda3/envs/explorer2/lib/python3.6/site-packages/tensorflow/python/training/distribute.py", line 776, in batch_reduce
    return self._batch_reduce(aggregation, value_destination_pairs)
  File "/home/username/miniconda3/envs/explorer2/lib/python3.6/site-packages/tensorflow/contrib/distribute/python/mirrored_strategy.py", line 628, in _batch_reduce
    value_destination_pairs)
  File "/home/username/miniconda3/envs/explorer2/lib/python3.6/site-packages/tensorflow/contrib/distribute/python/cross_tower_ops.py", line 243, in batch_reduce
    return self._batch_reduce(aggregation, value_destination_pairs)
  File "/home/username/miniconda3/envs/explorer2/lib/python3.6/site-packages/tensorflow/contrib/distribute/python/cross_tower_ops.py", line 829, in _batch_reduce
    "`IndexSlices` is not supported for Collective All-Reduce.")
ValueError: `IndexSlices` is not supported for Collective All-Reduce.
tensorflow - Error reported to Coordinator: `IndexSlices` is not supported for Collective All-Reduce.
Traceback (most recent call last):
  File "/home/username/miniconda3/envs/explorer2/lib/python3.6/site-packages/tensorflow/python/training/coordinator.py", line 297, in stop_on_exception
    yield
  File "/home/username/miniconda3/envs/explorer2/lib/python3.6/site-packages/tensorflow/contrib/distribute/python/mirrored_strategy.py", line 177, in _call_for_each_tower
    **merge_kwargs)
  File "/home/username/miniconda3/envs/explorer2/lib/python3.6/site-packages/tensorflow/python/training/optimizer.py", line 661, in _distributed_apply
    variable_scope.VariableAggregation.SUM, grads_and_vars)
  File "/home/username/miniconda3/envs/explorer2/lib/python3.6/site-packages/tensorflow/python/training/distribute.py", line 776, in batch_reduce
    return self._batch_reduce(aggregation, value_destination_pairs)
  File "/home/username/miniconda3/envs/explorer2/lib/python3.6/site-packages/tensorflow/contrib/distribute/python/mirrored_strategy.py", line 628, in _batch_reduce
    value_destination_pairs)
  File "/home/username/miniconda3/envs/explorer2/lib/python3.6/site-packages/tensorflow/contrib/distribute/python/cross_tower_ops.py", line 243, in batch_reduce
    return self._batch_reduce(aggregation, value_destination_pairs)
  File "/home/username/miniconda3/envs/explorer2/lib/python3.6/site-packages/tensorflow/contrib/distribute/python/cross_tower_ops.py", line 829, in _batch_reduce
    "`IndexSlices` is not supported for Collective All-Reduce.")
ValueError: `IndexSlices` is not supported for Collective All-Reduce.
Exception in thread Thread-5:
Traceback (most recent call last):
  File "/home/username/miniconda3/envs/explorer2/lib/python3.6/threading.py", line 916, in _bootstrap_inner
    self.run()
  File "/home/username/miniconda3/envs/explorer2/lib/python3.6/threading.py", line 864, in run
    self._target(*self._args, **self._kwargs)
  File "/home/username/miniconda3/envs/explorer2/lib/python3.6/site-packages/tensorflow/python/distribute/distribute_coordinator.py", line 344, in _run_single_worker
    worker_fn(strategy)
  File "/home/username/miniconda3/envs/explorer2/lib/python3.6/site-packages/tensorflow/python/distribute/estimator_training.py", line 246, in _worker_fn
    hooks=hooks)
  File "/home/username/miniconda3/envs/explorer2/lib/python3.6/site-packages/tensorflow/python/estimator/estimator.py", line 354, in train
    loss = self._train_model(input_fn, hooks, saving_listeners)
  File "/home/username/miniconda3/envs/explorer2/lib/python3.6/site-packages/tensorflow/python/estimator/estimator.py", line 1205, in _train_model
    return self._train_model_distributed(input_fn, hooks, saving_listeners)
  File "/home/username/miniconda3/envs/explorer2/lib/python3.6/site-packages/tensorflow/python/estimator/estimator.py", line 1316, in _train_model_distributed
    self.config)
  File "/home/username/miniconda3/envs/explorer2/lib/python3.6/site-packages/tensorflow/python/training/distribute.py", line 721, in call_for_each_tower
    return self._call_for_each_tower(fn, *args, **kwargs)
  File "/home/username/miniconda3/envs/explorer2/lib/python3.6/site-packages/tensorflow/contrib/distribute/python/mirrored_strategy.py", line 556, in _call_for_each_tower
    return _call_for_each_tower(self, fn, *args, **kwargs)
  File "/home/username/miniconda3/envs/explorer2/lib/python3.6/site-packages/tensorflow/contrib/distribute/python/mirrored_strategy.py", line 183, in _call_for_each_tower
    coord.join(threads)
  File "/home/username/miniconda3/envs/explorer2/lib/python3.6/site-packages/tensorflow/python/training/coordinator.py", line 389, in join
    six.reraise(*self._exc_info_to_raise)
  File "/home/username/miniconda3/envs/explorer2/lib/python3.6/site-packages/six.py", line 693, in reraise
    raise value
  File "/home/username/miniconda3/envs/explorer2/lib/python3.6/site-packages/tensorflow/python/training/coordinator.py", line 297, in stop_on_exception
    yield
  File "/home/username/miniconda3/envs/explorer2/lib/python3.6/site-packages/tensorflow/contrib/distribute/python/mirrored_strategy.py", line 177, in _call_for_each_tower
    **merge_kwargs)
  File "/home/username/miniconda3/envs/explorer2/lib/python3.6/site-packages/tensorflow/python/training/optimizer.py", line 661, in _distributed_apply
    variable_scope.VariableAggregation.SUM, grads_and_vars)
  File "/home/username/miniconda3/envs/explorer2/lib/python3.6/site-packages/tensorflow/python/training/distribute.py", line 776, in batch_reduce
    return self._batch_reduce(aggregation, value_destination_pairs)
  File "/home/username/miniconda3/envs/explorer2/lib/python3.6/site-packages/tensorflow/contrib/distribute/python/mirrored_strategy.py", line 628, in _batch_reduce
    value_destination_pairs)
  File "/home/username/miniconda3/envs/explorer2/lib/python3.6/site-packages/tensorflow/contrib/distribute/python/cross_tower_ops.py", line 243, in batch_reduce
    return self._batch_reduce(aggregation, value_destination_pairs)
  File "/home/username/miniconda3/envs/explorer2/lib/python3.6/site-packages/tensorflow/contrib/distribute/python/cross_tower_ops.py", line 829, in _batch_reduce
    "`IndexSlices` is not supported for Collective All-Reduce.")
ValueError: `IndexSlices` is not supported for Collective All-Reduce.

INFO:tensorflow:Error reported to Coordinator: `IndexSlices` is not supported for Collective All-Reduce.
Traceback (most recent call last):
  File "/home/username/miniconda3/envs/explorer2/lib/python3.6/site-packages/tensorflow/python/training/coordinator.py", line 297, in stop_on_exception
    yield
  File "/home/username/miniconda3/envs/explorer2/lib/python3.6/site-packages/tensorflow/contrib/distribute/python/mirrored_strategy.py", line 177, in _call_for_each_tower
    **merge_kwargs)
  File "/home/username/miniconda3/envs/explorer2/lib/python3.6/site-packages/tensorflow/python/training/optimizer.py", line 661, in _distributed_apply
    variable_scope.VariableAggregation.SUM, grads_and_vars)
  File "/home/username/miniconda3/envs/explorer2/lib/python3.6/site-packages/tensorflow/python/training/distribute.py", line 776, in batch_reduce
    return self._batch_reduce(aggregation, value_destination_pairs)
  File "/home/username/miniconda3/envs/explorer2/lib/python3.6/site-packages/tensorflow/contrib/distribute/python/mirrored_strategy.py", line 628, in _batch_reduce
    value_destination_pairs)
  File "/home/username/miniconda3/envs/explorer2/lib/python3.6/site-packages/tensorflow/contrib/distribute/python/cross_tower_ops.py", line 243, in batch_reduce
    return self._batch_reduce(aggregation, value_destination_pairs)
  File "/home/username/miniconda3/envs/explorer2/lib/python3.6/site-packages/tensorflow/contrib/distribute/python/cross_tower_ops.py", line 829, in _batch_reduce
    "`IndexSlices` is not supported for Collective All-Reduce.")
ValueError: `IndexSlices` is not supported for Collective All-Reduce.
tensorflow - Error reported to Coordinator: `IndexSlices` is not supported for Collective All-Reduce.
Traceback (most recent call last):
  File "/home/username/miniconda3/envs/explorer2/lib/python3.6/site-packages/tensorflow/python/training/coordinator.py", line 297, in stop_on_exception
    yield
  File "/home/username/miniconda3/envs/explorer2/lib/python3.6/site-packages/tensorflow/contrib/distribute/python/mirrored_strategy.py", line 177, in _call_for_each_tower
    **merge_kwargs)
  File "/home/username/miniconda3/envs/explorer2/lib/python3.6/site-packages/tensorflow/python/training/optimizer.py", line 661, in _distributed_apply
    variable_scope.VariableAggregation.SUM, grads_and_vars)
  File "/home/username/miniconda3/envs/explorer2/lib/python3.6/site-packages/tensorflow/python/training/distribute.py", line 776, in batch_reduce
    return self._batch_reduce(aggregation, value_destination_pairs)
  File "/home/username/miniconda3/envs/explorer2/lib/python3.6/site-packages/tensorflow/contrib/distribute/python/mirrored_strategy.py", line 628, in _batch_reduce
    value_destination_pairs)
  File "/home/username/miniconda3/envs/explorer2/lib/python3.6/site-packages/tensorflow/contrib/distribute/python/cross_tower_ops.py", line 243, in batch_reduce
    return self._batch_reduce(aggregation, value_destination_pairs)
  File "/home/username/miniconda3/envs/explorer2/lib/python3.6/site-packages/tensorflow/contrib/distribute/python/cross_tower_ops.py", line 829, in _batch_reduce
    "`IndexSlices` is not supported for Collective All-Reduce.")
ValueError: `IndexSlices` is not supported for Collective All-Reduce.
Exception in thread Thread-4:
Traceback (most recent call last):
  File "/home/username/miniconda3/envs/explorer2/lib/python3.6/threading.py", line 916, in _bootstrap_inner
    self.run()
  File "/home/username/miniconda3/envs/explorer2/lib/python3.6/threading.py", line 864, in run
    self._target(*self._args, **self._kwargs)
  File "/home/username/miniconda3/envs/explorer2/lib/python3.6/site-packages/tensorflow/python/distribute/distribute_coordinator.py", line 344, in _run_single_worker
    worker_fn(strategy)
  File "/home/username/miniconda3/envs/explorer2/lib/python3.6/site-packages/tensorflow/python/distribute/estimator_training.py", line 246, in _worker_fn
    hooks=hooks)
  File "/home/username/miniconda3/envs/explorer2/lib/python3.6/site-packages/tensorflow/python/estimator/estimator.py", line 354, in train
    loss = self._train_model(input_fn, hooks, saving_listeners)
  File "/home/username/miniconda3/envs/explorer2/lib/python3.6/site-packages/tensorflow/python/estimator/estimator.py", line 1205, in _train_model
    return self._train_model_distributed(input_fn, hooks, saving_listeners)
  File "/home/username/miniconda3/envs/explorer2/lib/python3.6/site-packages/tensorflow/python/estimator/estimator.py", line 1316, in _train_model_distributed
    self.config)
  File "/home/username/miniconda3/envs/explorer2/lib/python3.6/site-packages/tensorflow/python/training/distribute.py", line 721, in call_for_each_tower
    return self._call_for_each_tower(fn, *args, **kwargs)
  File "/home/username/miniconda3/envs/explorer2/lib/python3.6/site-packages/tensorflow/contrib/distribute/python/mirrored_strategy.py", line 556, in _call_for_each_tower
    return _call_for_each_tower(self, fn, *args, **kwargs)
  File "/home/username/miniconda3/envs/explorer2/lib/python3.6/site-packages/tensorflow/contrib/distribute/python/mirrored_strategy.py", line 183, in _call_for_each_tower
    coord.join(threads)
  File "/home/username/miniconda3/envs/explorer2/lib/python3.6/site-packages/tensorflow/python/training/coordinator.py", line 389, in join
    six.reraise(*self._exc_info_to_raise)
  File "/home/username/miniconda3/envs/explorer2/lib/python3.6/site-packages/six.py", line 693, in reraise
    raise value
  File "/home/username/miniconda3/envs/explorer2/lib/python3.6/site-packages/tensorflow/python/training/coordinator.py", line 297, in stop_on_exception
    yield
  File "/home/username/miniconda3/envs/explorer2/lib/python3.6/site-packages/tensorflow/contrib/distribute/python/mirrored_strategy.py", line 177, in _call_for_each_tower
    **merge_kwargs)
  File "/home/username/miniconda3/envs/explorer2/lib/python3.6/site-packages/tensorflow/python/training/optimizer.py", line 661, in _distributed_apply
    variable_scope.VariableAggregation.SUM, grads_and_vars)
  File "/home/username/miniconda3/envs/explorer2/lib/python3.6/site-packages/tensorflow/python/training/distribute.py", line 776, in batch_reduce
    return self._batch_reduce(aggregation, value_destination_pairs)
  File "/home/username/miniconda3/envs/explorer2/lib/python3.6/site-packages/tensorflow/contrib/distribute/python/mirrored_strategy.py", line 628, in _batch_reduce
    value_destination_pairs)
  File "/home/username/miniconda3/envs/explorer2/lib/python3.6/site-packages/tensorflow/contrib/distribute/python/cross_tower_ops.py", line 243, in batch_reduce
    return self._batch_reduce(aggregation, value_destination_pairs)
  File "/home/username/miniconda3/envs/explorer2/lib/python3.6/site-packages/tensorflow/contrib/distribute/python/cross_tower_ops.py", line 829, in _batch_reduce
    "`IndexSlices` is not supported for Collective All-Reduce.")
ValueError: `IndexSlices` is not supported for Collective All-Reduce.

About this issue

  • State: closed
  • Created 6 years ago
  • Comments: 19 (9 by maintainers)

Most upvoted comments

SparseTensor and IndexSlices are currently not supported since they require all-gather. @poxvoculi @dubey
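
In other words, the weights behind a categorical column are updated through a sparse lookup, and the gradient of such a lookup (e.g. tf.gather) is an IndexedSlices value rather than a dense tensor, which the collective all-reduce path in TF 1.12 rejects. A standalone sketch (hypothetical variable and ids, TF 1.x graph mode) illustrating the sparse gradient:

import tensorflow as tf

# Hypothetical weight matrix, shaped like the hashed column's weights (13 buckets x 1).
weights = tf.get_variable("weights", shape=[13, 1])
ids = tf.constant([0, 3, 7], dtype=tf.int64)   # hypothetical hashed feature ids
gathered = tf.gather(weights, ids)             # sparse lookup, as a linear model performs
loss = tf.reduce_sum(gathered)
grad = tf.gradients(loss, weights)[0]
print(type(grad))  # <class 'tensorflow.python.framework.ops.IndexedSlices'>

A purely numeric-column model only produces dense gradients, which is why it trains under CollectiveAllReduce while the categorical variant does not.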