DeepReg: error after latest git pull (main branch)

I'm getting this error when launching a training script that was previously working (with the same configuration file):

Traceback (most recent call last):
  File "demos/paired_MMIV/demo_train.py", line 29, in <module>
    log_dir=log_dir,
  File "/home/charlie/3DREG-tests/DeepReg/deepreg/train.py", line 94, in train
    max_epochs=max_epochs,
  File "/home/charlie/3DREG-tests/DeepReg/deepreg/train.py", line 45, in build_config
    config = config_parser.load_configs(config_path)
  File "/home/charlie/3DREG-tests/DeepReg/deepreg/parser.py", line 41, in load_configs
    config_sanity_check(config)
  File "/home/charlie/3DREG-tests/DeepReg/deepreg/parser.py", line 90, in config_sanity_check
    if config["train"]["method"] == "conditional":
KeyError: 'method'
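
For context, the failing lookup suggests the parser now expects method directly under train, while my config nests it under train.model. A minimal sketch of the mismatch (key paths taken from the traceback and from the config further down, nothing else assumed):

# key paths come from the traceback and my config file
config = {"train": {"model": {"method": "ddf"}}}  # layout used in my YAML

print(config["train"]["model"]["method"])  # "ddf" - where my config defines the key
print("method" in config["train"])         # False - so config["train"]["method"] raises KeyError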

The script is:

from deepreg.train import train
import sys

######## TRAINING ########

gpu = "2,3"
gpu_allow_growth = True
log_dir = "paired_logs_train"

config_path = [
    r"demos/paired_MMIV/paired_MMIV_train.yaml",
    r"demos/paired_MMIV/paired_MMIV.yaml",
]

# optionally append a suffix from the command line to the log directory name
if len(sys.argv) == 2:
    log_dir = "paired_logs_train_" + sys.argv[1]
print("using training log dir", log_dir)

#ckpt_path = "logs/paired_logs_train_10000epochs_gmi_unet_lw1rw1_learnrate1e-5/save/weights-epoch10000.ckpt" # restart from saved model
ckpt_path = ""

train(
    gpu=gpu,
    config_path=config_path,
    gpu_allow_growth=gpu_allow_growth,
    ckpt_path=ckpt_path,
    log_dir=log_dir,
)
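
For reference, the script takes one optional command-line argument that is appended to the log directory name, so it is launched as, for example, python demos/paired_MMIV/demo_train.py somesuffix (the suffix here is illustrative).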

The config file I use:

train:
  model:
    method: "ddf" # ddf / dvf / conditional
    backbone: "unet" # options include "local", "unet" and "global" - use "global" when method=="affine"
    local:
      num_channel_initial: 4 # 4
      extract_levels: [0, 1, 2, 3] # [0, 1, 2, 3]
    unet:
      num_channel_initial: 16
      depth: 3 # original 3
      pooling: true
      concat_skip: true

  loss:
    dissimilarity:
      image:
        name: "gmi"  # "lncc" (local normalised cross correlation), "ssd" (sum of squared distance) and "gmi" (differentiable global mutual information loss via Parzen windowing method)
        weight: 1.0
      label:
        weight: 0.0
        name: "multi_scale"
        multi_scale:
          loss_type: "dice" # options include "dice", "cross-entropy", "mean-squared", "generalised_dice" and "jaccard"
          loss_scales: [0, 1, 2, 4, 8, 16]
    regularization:
      weight: 1.0
      energy_type: "gradient-l2" # options include "bending", "gradient-l1" and "gradient-l2"

  preprocess:
    batch_size: 2 #original 2
    shuffle_buffer_num_batch: 1

  optimizer:
    name: "adam"
    adam:
      learning_rate: 1.0e-5 #1.0e-2 bad
    sgd:
      learning_rate: 1.0e-4
      momentum: 0.9
    rms:
      learning_rate: 1.0e-4
      momentum: 0.9

  epochs: 10000
  save_period: 500

About this issue

  • State: closed
  • Created 3 years ago
  • Comments: 41 (20 by maintainers)

Most upvoted comments

🤦 I had picked the wrong ones, of course: the GPU indices are 0,1,2,3 and I was assuming 1,2,3,4 without thinking… it's running on 2 GPUs with batch_size 2 now and no errors.

OK, I had checked with nvidia-smi to pick the freest ones, but it may just be that there isn't enough free memory available right now. Thanks a lot for checking!

4 Teslas, but I can't use all of them together; I need to leave some room for other people's tasks…

Sure. I think there's a bug: the OOM might be because it is trying to use all the GPUs :) Will fix it.
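
A quick sketch of the index convention being discussed, assuming DeepReg's gpu argument is ultimately exported as CUDA_VISIBLE_DEVICES (the usual TensorFlow mechanism; an assumption here, not something confirmed in this thread):

import os

# Assumption: the gpu string passed to train() ends up in CUDA_VISIBLE_DEVICES,
# so indices are 0-based and follow nvidia-smi's ordering ("2,3" = third and fourth GPU).
os.environ["CUDA_VISIBLE_DEVICES"] = "2,3"

# Import TensorFlow only after setting the variable so it sees just those devices.
import tensorflow as tf

print(tf.config.list_physical_devices("GPU"))  # should list exactly two GPUs on the 4-GPU box above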

Hi @ciphercharly, with the latest main we just added backward compatibility for the config; hope this helps ^^

https://github.com/DeepRegNet/DeepReg/blob/main/deepreg/config/parser.py#L101

Happy new year to you too! Will do, hopefully on Monday the 18th.

No problem! On the contrary, from my perspective it's good to be using something that is being actively developed and improved.

Thanks! I will try ASAP. Is a similar simplification required for the optimizer part too?

  optimizer:
    name: "adam"
    adam:
      learning_rate: 1.0e-5 #1.0e-2 bad
    sgd:
      learning_rate: 1.0e-4
      momentum: 0.9
    rms:
      learning_rate: 1.0e-4
      momentum: 0.9

The optimizer has not been refactored in the current main yet, so for now you do not need to change that part.

It will likely be refactored in a similar way in the near future.

@ciphercharly Yes, any verified/tagged release should do.

Thanks a lot for the answer! So if I want working code for now, would git checkout 499-release-v010 do the job?

Hi @ciphercharly, this is not a problem with your code but with the backward compatibility of ours.

This is related to issues #525, #567 and #568.

We updated DeepReg in PR #568 to use a Registry-type class, which allows flexibility in defining new losses, architectures, etc. whilst still using the train function. This changed the config file format, which explains why your otherwise functional code stopped working after a pull from main. It also means that from these PRs onwards the config is not backward compatible (as you are experiencing), which is why we are aiming for a release sometime in the next two weeks that should resolve this issue. For the moment there are two hot fixes:

a) modify the config file you are using, which should allow you to proceed as normal (a sketch of the minimal change is given after these options) - I am writing docs in #525 but am missing one argument to complete them (@mathpluscode will be able to advise on how you should modify your config)

b) if you are working with no modifications to the DeepReg code, you can check out the commit before the merge of #567, which will allow you to work without this bug.
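
For illustration of hot fix a), the only change implied by the traceback itself is that method has to sit directly under train rather than under train.model; the sketch below shows just that minimal move and treats the rest of the layout as an assumption until the docs in #525 are out (it is not the final schema):

train:
  method: "ddf" # moved up from train.model.method; this is the key read by config_sanity_check
  model:
    backbone: "unet"
    # ... remaining model settings as before (these may also need updating per #525)
  # loss / preprocess / optimizer / epochs sections unchanged for now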