DeepReg: error while using "gmi" for the loss
Subject of the issue
I get tensorflow.python.framework.errors_impl.InvalidArgumentError: Nan in summary histogram for: local_net/down_sample_resnet_block/conv3d_block/conv3d/conv3d_1/kernel_0 [Op:WriteHistogramSummary]
when trying to use ‘gmi’ as the loss in several scenarios (e.g. in the demos).
If the bug is confirmed, would you be willing to submit a PR? (Help can be provided if you need assistance submitting a PR)
No
Your environment
- DeepReg version (commit hash or tag): 0.1.0b1 (from git rev-parse HEAD: 8b8d75fdaaf89be2dfefc1d5c3c37e3ef26fd7d1)
- OS: Linux 4.15.0-112-generic #113-Ubuntu x86_64 x86_64 x86_64 GNU/Linux
- Python Version: 3.7.9
- TensorFlow: 2.2.0
Steps to reproduce
Modify the grouped_mr_heart demo YAML file to use ‘gmi’ instead of ‘lncc’, then run:
deepreg_train --gpu "3" --config_path demos/grouped_mr_heart/grouped_mr_heart.yaml --log_dir grouped_mr_heart
log
1/9 [==>...........................] - ETA: 0s - loss/weighted_regularization: 0.0000e+00 - loss: nan - loss/weighted_image_dissimilarity: nan - loss/regularization: 0.0000e+00 - loss/image_dissimilarity: nan
2020-10-15 08:42:22.326944: E tensorflow/core/profiler/internal/gpu/cupti_tracer.cc:1430] function cupti_interface_->EnableCallback( 0 , subscriber_, CUPTI_CB_DOMAIN_DRIVER_API, cbid)failed with error CUPTI_ERROR_INVALID_PARAMETER
2020-10-15 08:42:22.330619: I tensorflow/core/profiler/internal/gpu/device_tracer.cc:216] GpuTracer has collected 0 callback api events and 0 activity events.
2020-10-15 08:42:22.349700: I tensorflow/core/profiler/rpc/client/save_profile.cc:168] Creating directory: logs/grouped_mr_heart/train/plugins/profile/2020_10_15_08_42_22
2020-10-15 08:42:22.352329: I tensorflow/core/profiler/rpc/client/save_profile.cc:174] Dumped gzipped tool data for trace.json.gz to logs/grouped_mr_heart/train/plugins/profile/2020_10_15_08_42_22/MMIV-DGX-Station2.trace.json.gz
2020-10-15 08:42:22.353773: I tensorflow/core/profiler/utils/event_span.cc:288] Generation of step-events took 0.001 ms
2020-10-15 08:42:22.355437: I tensorflow/python/profiler/internal/profiler_wrapper.cc:87] Creating directory: logs/grouped_mr_heart/train/plugins/profile/2020_10_15_08_42_22
Dumped tool data for overview_page.pb to logs/grouped_mr_heart/train/plugins/profile/2020_10_15_08_42_22/MMIV-DGX-Station2.overview_page.pb
Dumped tool data for input_pipeline.pb to logs/grouped_mr_heart/train/plugins/profile/2020_10_15_08_42_22/MMIV-DGX-Station2.input_pipeline.pb
Dumped tool data for tensorflow_stats.pb to logs/grouped_mr_heart/train/plugins/profile/2020_10_15_08_42_22/MMIV-DGX-Station2.tensorflow_stats.pb
Dumped tool data for kernel_stats.pb to logs/grouped_mr_heart/train/plugins/profile/2020_10_15_08_42_22/MMIV-DGX-Station2.kernel_stats.pb
2/9 [=====>........................] - ETA: 2s - loss/weighted_regularization: nan - loss: nan - loss/weighted_image_dissimilar
3/9 [=========>....................] - ETA: 3s - loss/weighted_regularization: nan - loss: nan - loss/weighted_image_dissimilar
4/9 [============>.................] - ETA: 3s - loss/weighted_regularization: nan - loss: nan - loss/weighted_image_dissimilar
5/9 [===============>..............] - ETA: 2s - loss/weighted_regularization: nan - loss: nan - loss/weighted_image_dissimilar
6/9 [===================>..........] - ETA: 2s - loss/weighted_regularization: nan - loss: nan - loss/weighted_image_dissimilar
7/9 [======================>.......] - ETA: 1s - loss/weighted_regularization: nan - loss: nan - loss/weighted_image_dissimilar
8/9 [=========================>....] - ETA: 0s - loss/weighted_regularization: nan - loss: nan - loss/weighted_image_dissimilar
9/9 [==============================] - ETA: 0s - loss/weighted_regularization: nan - loss: nan - loss/weighted_image_dissimilarity: nan - loss/regularization: nan - loss/image_dissimilarity: nan
2020-10-15 08:42:34.992438: W tensorflow/core/framework/op_kernel.cc:1753] OP_REQUIRES failed at summary_kernels.cc:242 : Invalid argument: Nan in summary histogram for: local_net/down_sample_resnet_block/conv3d_block/conv3d/conv3d_1/kernel_0
Traceback (most recent call last):
File "/home/charlie/anaconda3/envs/deepreg/lib/python3.7/site-packages/tensorflow/python/ops/gen_summary_ops.py", line 464, in write_histogram_summary
tld.op_callbacks, writer, step, tag, values)
tensorflow.python.eager.core._FallbackException: This function does not handle the case of the path where all inputs are not already EagerTensors.
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "/home/charlie/anaconda3/envs/deepreg/bin/deepreg_train", line 33, in <module>
sys.exit(load_entry_point('deepreg', 'console_scripts', 'deepreg_train')())
File "/home/charlie/3DREG-tests/DeepReg/deepreg/train.py", line 227, in main
log_dir=args.log_dir,
File "/home/charlie/3DREG-tests/DeepReg/deepreg/train.py", line 154, in train
callbacks=callbacks,
File "/home/charlie/anaconda3/envs/deepreg/lib/python3.7/site-packages/tensorflow/python/keras/engine/training.py", line 66, in _method_wrapper
return method(self, *args, **kwargs)
File "/home/charlie/anaconda3/envs/deepreg/lib/python3.7/site-packages/tensorflow/python/keras/engine/training.py", line 876, in fit
callbacks.on_epoch_end(epoch, epoch_logs)
File "/home/charlie/anaconda3/envs/deepreg/lib/python3.7/site-packages/tensorflow/python/keras/callbacks.py", line 365, in on_epoch_end
callback.on_epoch_end(epoch, logs)
File "/home/charlie/anaconda3/envs/deepreg/lib/python3.7/site-packages/tensorflow/python/keras/callbacks.py", line 2000, in on_epoch_end
self._log_weights(epoch)
File "/home/charlie/anaconda3/envs/deepreg/lib/python3.7/site-packages/tensorflow/python/keras/callbacks.py", line 2119, in _log_weights
summary_ops_v2.histogram(weight_name, weight, step=epoch)
File "/home/charlie/anaconda3/envs/deepreg/lib/python3.7/site-packages/tensorflow/python/ops/summary_ops_v2.py", line 830, in histogram
return summary_writer_function(name, tensor, function, family=family)
File "/home/charlie/anaconda3/envs/deepreg/lib/python3.7/site-packages/tensorflow/python/ops/summary_ops_v2.py", line 759, in summary_writer_function
should_record_summaries(), record, _nothing, name="")
File "/home/charlie/anaconda3/envs/deepreg/lib/python3.7/site-packages/tensorflow/python/framework/smart_cond.py", line 54, in smart_cond
return true_fn()
File "/home/charlie/anaconda3/envs/deepreg/lib/python3.7/site-packages/tensorflow/python/ops/summary_ops_v2.py", line 752, in record
with ops.control_dependencies([function(tag, scope)]):
File "/home/charlie/anaconda3/envs/deepreg/lib/python3.7/site-packages/tensorflow/python/ops/summary_ops_v2.py", line 828, in function
name=scope)
File "/home/charlie/anaconda3/envs/deepreg/lib/python3.7/site-packages/tensorflow/python/ops/gen_summary_ops.py", line 469, in write_histogram_summary
writer, step, tag, values, name=name, ctx=_ctx)
File "/home/charlie/anaconda3/envs/deepreg/lib/python3.7/site-packages/tensorflow/python/ops/gen_summary_ops.py", line 490, in write_histogram_summary_eager_fallback
attrs=_attrs, ctx=ctx, name=name)
File "/home/charlie/anaconda3/envs/deepreg/lib/python3.7/site-packages/tensorflow/python/eager/execute.py", line 60, in quick_execute
inputs, attrs, num_outputs)
tensorflow.python.framework.errors_impl.InvalidArgumentError: Nan in summary histogram for: local_net/down_sample_resnet_block/conv3d_block/conv3d/conv3d_1/kernel_0 [Op:WriteHistogramSummary]
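Note that the log shows loss: nan from the very first step, while the exception is only raised at the end of the epoch, when the TensorBoard callback writes weight histograms. When debugging this kind of failure it can help to find out which weights have already become non-finite. The sketch below is a minimal, framework-free illustration of that check; the function name find_non_finite and the toy weight values are hypothetical and are not part of DeepReg or TensorFlow. In a real Keras run, the built-in tf.keras.callbacks.TerminateOnNaN callback can stop training as soon as the loss becomes NaN.

```python
import math


def find_non_finite(named_weights):
    """Return the names of weights that contain NaN or +/-inf.

    `named_weights` maps a weight name to a flat list of floats.
    This is a hypothetical helper for illustration only.
    """
    bad = []
    for name, values in named_weights.items():
        if any(not math.isfinite(v) for v in values):
            bad.append(name)
    return bad


# Toy values (made up); the first name echoes the one in the traceback.
weights = {
    "conv3d_1/kernel_0": [0.1, float("nan"), -0.3],
    "conv3d_1/bias_0": [0.0, 0.0],
}
print(find_non_finite(weights))
```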
About this issue
- State: closed
- Created 4 years ago
- Reactions: 1
- Comments: 17 (11 by maintainers)
Commits related to this issue
- Issue #452: use eps instead of div_no_nan for division — committed to DeepRegNet/DeepReg by mathpluscode 4 years ago
- Issue #452: update EPS in logarithms and denominators (not in numerators) — committed to DeepRegNet/DeepReg by acasamitjana 4 years ago
- Issue #452: fix tests related to EPS changes 0/0 = 1 now and increase an error tolerance of one test — committed to DeepRegNet/DeepReg by mathpluscode 4 years ago
- Issue #452: add scipy into requirement and ignore demo dataset — committed to DeepRegNet/DeepReg by mathpluscode 4 years ago
- Issue #452: add eps to both numerator and denominator — committed to DeepRegNet/DeepReg by mathpluscode 4 years ago
- Issue #452: update changelog about fix on division by zero — committed to DeepRegNet/DeepReg by mathpluscode 4 years ago
- Merge pull request #454 from DeepRegNet/452-err-while-using-gmi-for-the-loss Issue #452: use eps instead of div_no_nan for division — committed to DeepRegNet/DeepReg by mathpluscode 4 years ago
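The commit titles above describe replacing div_no_nan with an EPS term in logarithms and divisions, so that 0/0 now evaluates to 1. The sketch below illustrates that idea in plain Python. It is not DeepReg's actual implementation: the EPS value, the function names, and the single mutual-information term are assumptions made for illustration. It also shows one plausible way a "safe" division that returns 0 can still lead to NaN: the zero quotient is passed to a logarithm, and 0 * (-inf) is NaN.

```python
import math

EPS = 1e-7  # assumed value, for illustration only


def div_no_nan(a, b):
    """Mimic a helper such as tf.math.divide_no_nan: return 0 when b == 0."""
    return 0.0 if b == 0 else a / b


def div_eps(a, b, eps=EPS):
    """Add eps to both numerator and denominator, so 0/0 evaluates to 1."""
    return (a + eps) / (b + eps)


def mi_term_unsafe(p_ab, p_a, p_b):
    """One term p_ab * log(p_ab / (p_a * p_b)) using the zero-returning division."""
    ratio = div_no_nan(p_ab, p_a * p_b)
    # math.log(0) raises in Python; emulate the IEEE result log(0) = -inf.
    log_ratio = math.log(ratio) if ratio > 0 else float("-inf")
    return p_ab * log_ratio  # 0 * (-inf) is NaN


def mi_term_eps(p_ab, p_a, p_b, eps=EPS):
    """The same term with eps in numerator and denominator."""
    return p_ab * math.log(div_eps(p_ab, p_a * p_b, eps))


print(mi_term_unsafe(0.0, 0.0, 0.0))  # not finite
print(mi_term_eps(0.0, 0.0, 0.0))     # finite
```

With eps in both places, an empty histogram bin contributes a finite value instead of poisoning the loss and, through the gradients, the network weights.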
Hi @ciphercharly, the fix has been integrated into the main branch now, feel free to test again 😉 Please reopen this ticket if there’s still an error!
Tested quickly, seems to run without errors with custom model/data too 👍
Sure, we might merge the fix into main soon, but I will re-open this ticket if it gets closed automatically 😉 as the bug is solved using demo data.
Hi @ciphercharly, I created a new conda env from scratch and did find a missing package, scipy; with that installed, it all works fine, including the prediction. My commands are:
with the following config
If you are able to reproduce the error, could you please share your config and commands? By the way which data are you using?
Sure, I will create a new conda env or use Docker to check this. Super thx!
I haven’t saved the message, but the pip error (which, by the way, contained a line about the recent changes in the package dependency system, so it could be due to that) stated that dataclasses was missing and required by another package… fslpy, if I am not mistaken? I am sure the name started with f… the only other option is flake8.
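Since this thread mentions both scipy and dataclasses as possibly missing, a quick way to check what the active environment can actually import is sketched below. It uses only the standard library; the helper name missing_packages is hypothetical, and the package list simply repeats the names mentioned in this thread.

```python
import importlib.util


def missing_packages(names):
    """Return the top-level module names that cannot be found in this environment."""
    return [name for name in names if importlib.util.find_spec(name) is None]


# Names taken from the discussion above; the result depends on your environment.
print(missing_packages(["scipy", "dataclasses"]))
```

An empty list means every listed module can be located by the current interpreter; it does not verify versions or that the import succeeds without errors.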