DeepReg: error while using "gmi" for the loss

Subject of the issue

Getting tensorflow.python.framework.errors_impl.InvalidArgumentError: Nan in summary histogram for: local_net/down_sample_resnet_block/conv3d_block/conv3d/conv3d_1/kernel_0 [Op:WriteHistogramSummary] when trying to use "gmi" as the image dissimilarity loss in several scenarios (e.g. in the demos)

If the bug is confirmed, would you be willing to submit a PR? (Help can be provided if you need assistance submitting a PR)

No

Your environment

  • DeepReg version (commit hash or tag): 0.1.0b1 (from git rev-parse HEAD: 8b8d75fdaaf89be2dfefc1d5c3c37e3ef26fd7d1)

  • OS: Linux 4.15.0-112-generic #113-Ubuntu x86_64 x86_64 x86_64 GNU/Linux

  • Python Version: 3.7.9

  • TensorFlow: 2.2.0

Steps to reproduce

Modified the grouped_mr_heart demo YAML file to use "gmi" instead of "lncc", then ran:

deepreg_train --gpu "3" --config_path demos/grouped_mr_heart/grouped_mr_heart.yaml --log_dir grouped_mr_heart
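For reference, the only config edit is the image dissimilarity name; a minimal sketch of the relevant excerpt, with the field layout taken from the full config posted further below:

train:
  loss:
    dissimilarity:
      image:
        name: "gmi" # was "lncc" in the original demo config
        weight: 1.0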

Log

1/9 [==>...........................] - ETA: 0s - loss/weighted_regularization: 0.0000e+00 - loss: nan - loss/weighted_image_dissimilarity: nan - loss/regularization: 0.0000e+00 - loss/image_dissimilarity: nan
2020-10-15 08:42:22.326944: E tensorflow/core/profiler/internal/gpu/cupti_tracer.cc:1430] function cupti_interface_->EnableCallback( 0 , subscriber_, CUPTI_CB_DOMAIN_DRIVER_API, cbid) failed with error CUPTI_ERROR_INVALID_PARAMETER
2020-10-15 08:42:22.330619: I tensorflow/core/profiler/internal/gpu/device_tracer.cc:216]  GpuTracer has collected 0 callback api events and 0 activity events.
2020-10-15 08:42:22.349700: I tensorflow/core/profiler/rpc/client/save_profile.cc:168] Creating directory: logs/grouped_mr_heart/train/plugins/profile/2020_10_15_08_42_22
2020-10-15 08:42:22.352329: I tensorflow/core/profiler/rpc/client/save_profile.cc:174] Dumped gzipped tool data for trace.json.gz to logs/grouped_mr_heart/train/plugins/profile/2020_10_15_08_42_22/MMIV-DGX-Station2.trace.json.gz
2020-10-15 08:42:22.353773: I tensorflow/core/profiler/utils/event_span.cc:288] Generation of step-events took 0.001 ms

2020-10-15 08:42:22.355437: I tensorflow/python/profiler/internal/profiler_wrapper.cc:87] Creating directory: logs/grouped_mr_heart/train/plugins/profile/2020_10_15_08_42_22
Dumped tool data for overview_page.pb to logs/grouped_mr_heart/train/plugins/profile/2020_10_15_08_42_22/MMIV-DGX-Station2.overview_page.pb
Dumped tool data for input_pipeline.pb to logs/grouped_mr_heart/train/plugins/profile/2020_10_15_08_42_22/MMIV-DGX-Station2.input_pipeline.pb
Dumped tool data for tensorflow_stats.pb to logs/grouped_mr_heart/train/plugins/profile/2020_10_15_08_42_22/MMIV-DGX-Station2.tensorflow_stats.pb
Dumped tool data for kernel_stats.pb to logs/grouped_mr_heart/train/plugins/profile/2020_10_15_08_42_22/MMIV-DGX-Station2.kernel_stats.pb

9/9 [==============================] - ETA: 0s - loss/weighted_regularization: nan - loss: nan - loss/weighted_image_dissimilarity: nan - loss/regularization: nan - loss/image_dissimilarity: nan
2020-10-15 08:42:34.992438: W tensorflow/core/framework/op_kernel.cc:1753] OP_REQUIRES failed at summary_kernels.cc:242 : Invalid argument: Nan in summary histogram for: local_net/down_sample_resnet_block/conv3d_block/conv3d/conv3d_1/kernel_0
Traceback (most recent call last):
  File "/home/charlie/anaconda3/envs/deepreg/lib/python3.7/site-packages/tensorflow/python/ops/gen_summary_ops.py", line 464, in write_histogram_summary
    tld.op_callbacks, writer, step, tag, values)
tensorflow.python.eager.core._FallbackException: This function does not handle the case of the path where all inputs are not already EagerTensors.

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/home/charlie/anaconda3/envs/deepreg/bin/deepreg_train", line 33, in <module>
    sys.exit(load_entry_point('deepreg', 'console_scripts', 'deepreg_train')())
  File "/home/charlie/3DREG-tests/DeepReg/deepreg/train.py", line 227, in main
    log_dir=args.log_dir,
  File "/home/charlie/3DREG-tests/DeepReg/deepreg/train.py", line 154, in train
    callbacks=callbacks,
  File "/home/charlie/anaconda3/envs/deepreg/lib/python3.7/site-packages/tensorflow/python/keras/engine/training.py", line 66, in _method_wrapper
    return method(self, *args, **kwargs)
  File "/home/charlie/anaconda3/envs/deepreg/lib/python3.7/site-packages/tensorflow/python/keras/engine/training.py", line 876, in fit
    callbacks.on_epoch_end(epoch, epoch_logs)
  File "/home/charlie/anaconda3/envs/deepreg/lib/python3.7/site-packages/tensorflow/python/keras/callbacks.py", line 365, in on_epoch_end
    callback.on_epoch_end(epoch, logs)
  File "/home/charlie/anaconda3/envs/deepreg/lib/python3.7/site-packages/tensorflow/python/keras/callbacks.py", line 2000, in on_epoch_end
    self._log_weights(epoch)
  File "/home/charlie/anaconda3/envs/deepreg/lib/python3.7/site-packages/tensorflow/python/keras/callbacks.py", line 2119, in _log_weights
    summary_ops_v2.histogram(weight_name, weight, step=epoch)
  File "/home/charlie/anaconda3/envs/deepreg/lib/python3.7/site-packages/tensorflow/python/ops/summary_ops_v2.py", line 830, in histogram
    return summary_writer_function(name, tensor, function, family=family)
  File "/home/charlie/anaconda3/envs/deepreg/lib/python3.7/site-packages/tensorflow/python/ops/summary_ops_v2.py", line 759, in summary_writer_function
    should_record_summaries(), record, _nothing, name="")
  File "/home/charlie/anaconda3/envs/deepreg/lib/python3.7/site-packages/tensorflow/python/framework/smart_cond.py", line 54, in smart_cond
    return true_fn()
  File "/home/charlie/anaconda3/envs/deepreg/lib/python3.7/site-packages/tensorflow/python/ops/summary_ops_v2.py", line 752, in record
    with ops.control_dependencies([function(tag, scope)]):
  File "/home/charlie/anaconda3/envs/deepreg/lib/python3.7/site-packages/tensorflow/python/ops/summary_ops_v2.py", line 828, in function
    name=scope)
  File "/home/charlie/anaconda3/envs/deepreg/lib/python3.7/site-packages/tensorflow/python/ops/gen_summary_ops.py", line 469, in write_histogram_summary
    writer, step, tag, values, name=name, ctx=_ctx)
  File "/home/charlie/anaconda3/envs/deepreg/lib/python3.7/site-packages/tensorflow/python/ops/gen_summary_ops.py", line 490, in write_histogram_summary_eager_fallback
    attrs=_attrs, ctx=ctx, name=name)
  File "/home/charlie/anaconda3/envs/deepreg/lib/python3.7/site-packages/tensorflow/python/eager/execute.py", line 60, in quick_execute
    inputs, attrs, num_outputs)
tensorflow.python.framework.errors_impl.InvalidArgumentError: Nan in summary histogram for: local_net/down_sample_resnet_block/conv3d_block/conv3d/conv3d_1/kernel_0 [Op:WriteHistogramSummary]

About this issue

  • State: closed
  • Created 4 years ago
  • Reactions: 1
  • Comments: 17 (11 by maintainers)

Most upvoted comments

Hi @ciphercharly, the fix has been integrated into the main branch now, feel free to test again 😉 Please reopen this ticket if there is still an error!

tested quickly, seems to run without errors with custom model/data too 👍

Great! I will try to run the mr_heart demo as soon as possible, and then my custom one too (I have limited access to the machine where I can run this … likely Monday). I am working with multi-modal pelvic MRI images from a set of patients (all acquired on the same machine), and I need to register the different channels intra-subject, so I use the 'paired' scenario (for now I am only using the T2 and VIBE channels).
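(For context, a hypothetical sketch of how the dataset section might look for that paired T2/VIBE use case, mirroring the grouped config posted further down; the directory paths are placeholders based on the demos/paired_MMIV path that appears in the prediction traceback, and the moving_image_shape / fixed_image_shape keys are an assumption to be checked against the paired demos.)

dataset:
  dir:
    train: "demos/paired_MMIV/dataset/train" # placeholder paths
    valid: "demos/paired_MMIV/dataset/valid"
    test: "demos/paired_MMIV/dataset/test"
  format: "nifti"
  type: "paired" # intra-subject registration, e.g. T2 (moving) to VIBE (fixed)
  labeled: false
  moving_image_shape: [32, 32, 28] # assumed keys and example shape, adjust to the real data
  fixed_image_shape: [32, 32, 28]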

Sure, we might merge the fix into main soon, but I will re-open this ticket if it gets closed automatically 😉, as the bug is currently only solved using the demo data.

Update: trying to run predictions with the model trained with gmi results in this error:

coreClock: 1.53GHz coreCount: 80 deviceMemorySize: 31.75GiB deviceMemoryBandwidth: 836.37GiB/s
2020-10-16 09:52:42.484648: I tensorflow/stream_executor/platform/default/dso_loader.cc:48] Successfully opened dynamic library libcudart.so.10.1
2020-10-16 09:52:42.484686: I tensorflow/stream_executor/platform/default/dso_loader.cc:48] Successfully opened dynamic library libcublas.so.10
2020-10-16 09:52:42.484713: I tensorflow/stream_executor/platform/default/dso_loader.cc:48] Successfully opened dynamic library libcufft.so.10
2020-10-16 09:52:42.484740: I tensorflow/stream_executor/platform/default/dso_loader.cc:48] Successfully opened dynamic library libcurand.so.10
2020-10-16 09:52:42.484766: I tensorflow/stream_executor/platform/default/dso_loader.cc:48] Successfully opened dynamic library libcusolver.so.10
2020-10-16 09:52:42.484792: I tensorflow/stream_executor/platform/default/dso_loader.cc:48] Successfully opened dynamic library libcusparse.so.10
2020-10-16 09:52:42.484818: I tensorflow/stream_executor/platform/default/dso_loader.cc:48] Successfully opened dynamic library libcudnn.so.7
2020-10-16 09:52:42.505351: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1858] Adding visible gpu devices: 0, 1, 2
2020-10-16 09:52:42.505395: I tensorflow/stream_executor/platform/default/dso_loader.cc:48] Successfully opened dynamic library libcudart.so.10.1
2020-10-16 09:52:43.898745: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1257] Device interconnect StreamExecutor with strength 1 edge matrix:
2020-10-16 09:52:43.898791: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1263]      0 1 2 
2020-10-16 09:52:43.898801: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1276] 0:   N Y Y 
2020-10-16 09:52:43.898807: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1276] 1:   Y N Y 
2020-10-16 09:52:43.898814: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1276] 2:   Y Y N 
2020-10-16 09:52:43.906845: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1402] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with 29461 MB memory) -> physical GPU (device: 0, name: Tesla V100-DGXS-32GB, pci bus id: 0000:07:00.0, compute capability: 7.0)
2020-10-16 09:52:43.909464: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1402] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:1 with 29682 MB memory) -> physical GPU (device: 1, name: Tesla V100-DGXS-32GB, pci bus id: 0000:08:00.0, compute capability: 7.0)
2020-10-16 09:52:43.911828: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1402] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:2 with 30129 MB memory) -> physical GPU (device: 2, name: Tesla V100-DGXS-32GB, pci bus id: 0000:0e:00.0, compute capability: 7.0)
2020-10-16 09:52:52.785402: I tensorflow/stream_executor/platform/default/dso_loader.cc:48] Successfully opened dynamic library libcudnn.so.7
2020-10-16 09:52:54.375699: I tensorflow/stream_executor/platform/default/dso_loader.cc:48] Successfully opened dynamic library libcublas.so.10
2020-10-16 09:53:07.676152: W tensorflow/core/framework/op_kernel.cc:1767] OP_REQUIRES failed at conv_grad_ops_3d.cc:1170 : Invalid argument: Conv3DBackpropInputOp: input and out_backprop must have the same batch sizeinput batch: 3outbackprop batch: 1 batch_dim: 0
Traceback (most recent call last):
  File "demos/paired_MMIV/demo_predict.py", line 26, in <module>
    save_png=True,
  File "/home/charlie/3DREG-tests/DeepReg/deepreg/predict.py", line 334, in predict
    save_png=save_png,
  File "/home/charlie/3DREG-tests/DeepReg/deepreg/predict.py", line 81, in predict_on_dataset
    outputs_dict = model.predict(x=inputs_dict)
  File "/home/charlie/anaconda3/envs/deepreg/lib/python3.7/site-packages/tensorflow/python/keras/engine/training.py", line 130, in _method_wrapper
    return method(self, *args, **kwargs)
  File "/home/charlie/anaconda3/envs/deepreg/lib/python3.7/site-packages/tensorflow/python/keras/engine/training.py", line 1599, in predict
    tmp_batch_outputs = predict_function(iterator)
  File "/home/charlie/anaconda3/envs/deepreg/lib/python3.7/site-packages/tensorflow/python/eager/def_function.py", line 780, in __call__
    result = self._call(*args, **kwds)
  File "/home/charlie/anaconda3/envs/deepreg/lib/python3.7/site-packages/tensorflow/python/eager/def_function.py", line 814, in _call
    results = self._stateful_fn(*args, **kwds)
  File "/home/charlie/anaconda3/envs/deepreg/lib/python3.7/site-packages/tensorflow/python/eager/function.py", line 2829, in __call__
    return graph_function._filtered_call(args, kwargs)  # pylint: disable=protected-access
  File "/home/charlie/anaconda3/envs/deepreg/lib/python3.7/site-packages/tensorflow/python/eager/function.py", line 1848, in _filtered_call
    cancellation_manager=cancellation_manager)
  File "/home/charlie/anaconda3/envs/deepreg/lib/python3.7/site-packages/tensorflow/python/eager/function.py", line 1924, in _call_flat
    ctx, args, cancellation_manager=cancellation_manager))
  File "/home/charlie/anaconda3/envs/deepreg/lib/python3.7/site-packages/tensorflow/python/eager/function.py", line 550, in call
    ctx=ctx)
  File "/home/charlie/anaconda3/envs/deepreg/lib/python3.7/site-packages/tensorflow/python/eager/execute.py", line 60, in quick_execute
    inputs, attrs, num_outputs)
tensorflow.python.framework.errors_impl.InvalidArgumentError:  Input to reshape is a tensor with 1769472 values, but the requested shape has 5308416
         [[node DDFRegistrationModelWithoutLabel/tf_op_layer_Reshape/Reshape (defined at /home/charlie/3DREG-tests/DeepReg/deepreg/predict.py:81) ]] [Op:__inference_predict_function_7614]

Function call stack:
predict_function

but this could be related to my setup/data (the exact same setup/code for training and prediction works when the only change is using 'lncc' instead, though); I still have to try with one of the demos
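One hedged observation on the numbers in that trace (just arithmetic, not a diagnosis): 5308416 = 3 × 1769472, i.e. the requested reshape is exactly three times larger than the tensor provided, which matches the batch-size mismatch logged a few lines earlier by Conv3DBackpropInputOp (input batch: 3 vs out_backprop batch: 1). So the reshape failure may be the same batch-size discrepancy surfacing in a different op.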

Hi @ciphercharly, I created a new conda env from scratch and did find a missing package, scipy; after installing it, everything works fine, including the prediction.

My commands are:

python demos/grouped_mr_heart/demo_data.py

deepreg_train --gpu "" --config_path demos/grouped_mr_heart/grouped_mr_heart.yaml --log_dir grouped_mr_heart

deepreg_predict --gpu "" --config_path demos/grouped_mr_heart/grouped_mr_heart.yaml --ckpt_path logs/grouped_mr_heart/save/weights-epoch6.ckpt --save_png --mode test

with the following config:

dataset:
  dir:
    train: "demos/grouped_mr_heart/dataset/train"
    valid: "demos/grouped_mr_heart/dataset/val"
    test: "demos/grouped_mr_heart/dataset/test"
  format: "nifti"
  type: "grouped" # paired / unpaired / grouped
  labeled: false
  intra_group_prob: 1
  intra_group_option: "unconstrained" # forward / backward / unconstrained
  sample_image_in_group: true
  image_shape: [32, 32, 28]

train:
  # define neural network structure
  model:
    method: "ddf" # the registration method, value should be ddf / dvf / conditional
    backbone: "local" # value should be local / global / unet
    local:
      num_channel_initial: 16 # number of initial channel in local net, controls the size of the network
      extract_levels: [0, 1, 2, 3]

  # define the loss function for training
  loss:
    dissimilarity:
      image:
        name: "gmi"
        weight: 1.0
      label:
        weight: 0.0
        name: "multi_scale"
        multi_scale:
          loss_type: "dice"
          loss_scales: [0, 1, 2, 4, 8, 16]
        single_scale:
          loss_type: "cross-entropy"
    regularization:
      weight: 100 # weight of regularization loss
      energy_type: "bending" # value should be bending / gradient-l1 / gradient-l2

  # define the optimizer
  optimizer:
    name: "adam" # value should be adam / sgd / rms
    adam:
      learning_rate: 1.0e-4

  preprocess:
    batch_size: 4
    shuffle_buffer_num_batch: 1 # shuffle_buffer_size = batch_size * shuffle_buffer_num_batch

  # other training hyper-parameters
  epochs: 6 # number of training epochs
  save_period: 2 # the model will be saved every `save_period` epochs.

If you are able to reproduce the error, could you please share your config and commands? By the way, which data are you using?

I haven't saved the message, but the pip error (which, by the way, contained a line about the recent changes in the package dependency system, so it could be due to that) stated that dataclasses was missing and required by another package… fslpy, if I am not mistaken? I am sure it started with an f… the only other option is flake8.

Sure, will create a new conda env or use Docker to check this. Super thx!
