DeepReg: error after latest git pull (main branch)
I'm getting this error when launching a training script that was previously working (with the same configuration file):
Traceback (most recent call last):
  File "demos/paired_MMIV/demo_train.py", line 29, in <module>
    log_dir=log_dir,
  File "/home/charlie/3DREG-tests/DeepReg/deepreg/train.py", line 94, in train
    max_epochs=max_epochs,
  File "/home/charlie/3DREG-tests/DeepReg/deepreg/train.py", line 45, in build_config
    config = config_parser.load_configs(config_path)
  File "/home/charlie/3DREG-tests/DeepReg/deepreg/parser.py", line 41, in load_configs
    config_sanity_check(config)
  File "/home/charlie/3DREG-tests/DeepReg/deepreg/parser.py", line 90, in config_sanity_check
    if config["train"]["method"] == "conditional":
KeyError: 'method'
the script is:
from deepreg.train import train
import sys

######## TRAINING ########
gpu = "2,3"
gpu_allow_growth = True
log_dir = "paired_logs_train"
config_path = [
    r"demos/paired_MMIV/paired_MMIV_train.yaml",
    r"demos/paired_MMIV/paired_MMIV.yaml",
]

if len(sys.argv) != 2:
    print('using training log dir ', log_dir)
else:
    log_dir = "paired_logs_train_" + sys.argv[1]
    print('using training log dir ', log_dir)

# ckpt_path = "logs/paired_logs_train_10000epochs_gmi_unet_lw1rw1_learnrate1e-5/save/weights-epoch10000.ckpt"  # restart from saved model
ckpt_path = ""

train(
    gpu=gpu,
    config_path=config_path,
    gpu_allow_growth=gpu_allow_growth,
    ckpt_path=ckpt_path,
    log_dir=log_dir,
)
the config file I use:
train:
  model:
    method: "ddf" # ddf / dvf / conditional
    backbone: "unet" # options include "local", "unet" and "global" - use "global" when method=="affine"
    local:
      num_channel_initial: 4 # 4
      extract_levels: [0, 1, 2, 3] # [0, 1, 2, 3]
    unet:
      num_channel_initial: 16
      depth: 3 # original 3
      pooling: true
      concat_skip: true
  loss:
    dissimilarity:
      image:
        name: "gmi" # "lncc" (local normalised cross correlation), "ssd" (sum of squared distance) and "gmi" (differentiable global mutual information loss via Parzen windowing method)
        weight: 1.0
      label:
        weight: 0.0
        name: "multi_scale"
        multi_scale:
          loss_type: "dice" # options include "dice", "cross-entropy", "mean-squared", "generalised_dice" and "jaccard"
          loss_scales: [0, 1, 2, 4, 8, 16]
    regularization:
      weight: 1.0
      energy_type: "gradient-l2" # "bending" # options include "bending", "gradient-l1" and "gradient-l2"
  preprocess:
    batch_size: 2 # original 2
    shuffle_buffer_num_batch: 1
  optimizer:
    name: "adam"
    adam:
      learning_rate: 1.0e-5 # 1.0e-2 bad
    sgd:
      learning_rate: 1.0e-4
      momentum: 0.9
    rms:
      learning_rate: 1.0e-4
      momentum: 0.9
  epochs: 10000
  save_period: 500
About this issue
- State: closed
- Created 3 years ago
- Comments: 41 (20 by maintainers)
🤦 I had picked the wrong ones ofc, the indexes are 0,1,2,3 and I was assuming 1,2,3,4 without thinking… it’s running on 2 GPUs with batch_size 2 now and no errors
ok, I checked with
nvidia-smi
to pick the freest ones, but it may just be that there's not enough free memory available right now. Thanks a lot for the check!

Sure. I think there's a bug: the OOM might be because it is trying to use all GPUs :) Will fix it.
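For reference, here is a minimal sketch (not DeepReg's internal code) of how TensorFlow can be limited to specific GPUs with on-demand memory allocation, which is roughly the behaviour the gpu and gpu_allow_growth arguments are meant to control:

import os
import tensorflow as tf

# Expose only GPUs 2 and 3 (indexes as reported by nvidia-smi);
# this must be set before TensorFlow initialises the devices.
os.environ["CUDA_VISIBLE_DEVICES"] = "2,3"

# Allocate GPU memory on demand instead of reserving each whole card up front.
for gpu in tf.config.list_physical_devices("GPU"):
    tf.config.experimental.set_memory_growth(gpu, True)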
Hi @ciphercharly, with the latest main we just added backward compatibility for the config; hope this helps ^^ https://github.com/DeepRegNet/DeepReg/blob/main/deepreg/config/parser.py#L101
happy new year to you too! will do, hopefully on Monday 18th
no problem! on the contrary, it's good from my perspective to be using something that is being actively developed and improved
For the optimizer: in the current main it has not been refactored yet, so for now you do not need to change that. It will be refactored in the near future.
@ciphercharly Yes, any verified/tagged release should do.
thanks a lot for the answer! so if for now I want to have working code, would
git checkout 499-release-v010
do the job?

Hi @ciphercharly, this is not a problem with your code but a problem with the backward compatibility of our code.
This is related to issues #525, #567, and #568.
We updated DeepReg in PR #568 to use a Registry-type class to allow flexibility in the definition of new losses, architectures, etc., while still using the train function. This brought changes to the config file, which explains why your otherwise functional code stopped working after a pull from main. It also means that from those PRs onwards the code is not backward compatible (as you are experiencing), which is why we are aiming for a release sometime in the next two weeks that should resolve this issue. For the moment there are two hot fixes:
a) modify the config file you are using, which should allow you to proceed as normal (a minimal sketch follows after this list) - I am writing docs in #525 but am missing one argument to complete them (@mathpluscode will be able to advise on how you should modify your config)
b) if you are working with no modifications to the DeepReg code, you can check out the commit before the merge of #567, which will allow you to work without hitting this bug.
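For (a), the failing check (config["train"]["method"]) suggests that method now sits directly under train rather than under train.model; a minimal, illustrative sketch of that one change, with the rest of the new schema left to the docs being written in #525 rather than guessed here:

train:
  method: "ddf"  # ddf / dvf / conditional -- moved up from train.model.method (illustrative only)
  # ... remaining sections to be adjusted per the new schema (see the docs in progress in #525) ...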