OCNet.pytorch: Segmentation fault

The model fails on the forward pass in the training step. The only error reported is “Segmentation fault”:

dataset          cityscapes_train
batch_size       1
data_dir         ./dataset/cityscapes
data_list        ./dataset/list/cityscapes/train.lst
ignore_label     255
input_size       769,769
is_training      False
learning_rate    0.01
momentum         0.9
not_restore_last False
num_classes      19
start_iters      0
num_steps        40000
power            0.9
random_mirror    True
random_scale     True
random_seed      304
restore_from     ./pretrained_model/resnet101-imagenet.pth
save_num_images  2
save_pred_every  5000
snapshot_dir     checkpoint/snapshots_resnet101_asp_oc_dsn_1e-2_5e-4_8_40000/
weight_decay     0.0005
gpu              0,3,4
ohem_thres       0.7
ohem_thres1      0.8
ohem_thres2      0.5
use_weight       True
use_val          False
use_extra        False
ohem             False
ohem_keep        0
network          resnet101
method           asp_oc_dsn
reduce           True
ohem_single      False
use_parallel     False
dsn_weight       0.4
pair_weight      1
seed             304
output_path      ./seg_output_eval_set
store_output     False
use_flip         False
use_ms           False
predict_choice   whole
whole_scale      1
start_epochs     0
end_epochs       120
save_epoch       20
criterion        ce
eval             False
fix_lr           False
log_file         
use_normalize_transform False
/data/graphics/toyota-pytorch/OCNet/network/../oc_module/base_oc_block.py:69: UserWarning: nn.init.constant is now deprecated in favor of nn.init.constant_.
  nn.init.constant(self.W.weight, 0)
/data/graphics/toyota-pytorch/OCNet/network/../oc_module/base_oc_block.py:70: UserWarning: nn.init.constant is now deprecated in favor of nn.init.constant_.
  nn.init.constant(self.W.bias, 0)
/afs/csail.mit.edu/u/s/smadan/miniconda3/envs/py_36_tens_gpu/lib/python3.6/site-packages/torch/nn/parallel/data_parallel.py:24: UserWarning: 
    There is an imbalance between your GPUs. You may want to exclude GPU 3 which
    has less than 75% of the memory or cores of GPU 1. You can do so by setting
    the device_ids argument to DataParallel, or by setting the CUDA_VISIBLE_DEVICES
    environment variable.
  warnings.warn(imbalance_warn.format(device_ids[min_pos], device_ids[max_pos]))
w/ class balance
41650 images are loaded!
learning_rate: 0.01
torch.Size([1, 3, 769, 769])
Segmentation fault

I added a bunch of print statements and saw that the error happens at this step:

preds = model(images)

I checked the GPU usage: over 11 GB of GPU memory was free when the error occurred, so it’s not a memory issue. Also, when I first ran the .sh file, it reported errors because the directories for log/log_train and log_test had not been created. I created them manually, and that error was resolved. But now the forward pass fails in the very first iteration. Any leads?
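
A bare “Segmentation fault” gives no Python traceback, so it can help to turn on the standard-library `faulthandler` module before the forward pass. The sketch below is only an illustration of that debugging step: `forward_with_traceback` is a hypothetical helper name, not part of the OCNet code, and `model` and `images` stand for whatever the training script already passes to the forward call.

```python
import faulthandler

# Ask the interpreter to dump the Python traceback to stderr if the
# process receives a fatal signal such as SIGSEGV.
faulthandler.enable()


def forward_with_traceback(model, images):
    """Hypothetical wrapper: run one forward pass with faulthandler active."""
    # Flush so these messages are not lost if the process dies abruptly.
    print("entering forward pass", flush=True)
    preds = model(images)
    print("forward pass finished", flush=True)
    return preds
```

If the dumped traceback ends inside a compiled extension rather than in pure Python code, that extension is the likely place to look. The same effect is available without code changes by starting the interpreter with `python -X faulthandler`.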

About this issue

State: open
Created 6 years ago
Comments: 25

Most upvoted comments

@Spandan-Madan Hi, I solved the problem by replacing ‘inplace_abn’ with the inplace-abn module from https://github.com/liutinglt/CE2P.

Best,

lyxlynn on Oct 17, 2018
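
The replacement described above could be scripted roughly as follows. This is a minimal sketch under stated assumptions: it treats ‘inplace_abn’ as a directory in the local checkout, `replace_module_dir` is a hypothetical helper, and both paths in the usage comment are placeholders that have to be adjusted to wherever the two repositories were actually cloned.

```python
import shutil
from pathlib import Path


def replace_module_dir(src_dir, dst_dir):
    """Replace dst_dir with a copy of src_dir; return the backup path (or None)."""
    src, dst = Path(src_dir), Path(dst_dir)
    if not src.is_dir():
        raise FileNotFoundError(f"source directory not found: {src}")
    backup = None
    if dst.exists():
        # Keep the old module next to the new one so the change can be undone.
        backup = dst.with_name(dst.name + ".bak")
        if backup.exists():
            shutil.rmtree(backup)
        dst.rename(backup)
    shutil.copytree(src, dst)
    return backup


# Hypothetical usage; both paths are placeholders, not confirmed locations:
# replace_module_dir("path/to/CE2P/inplace_abn_dir", "path/to/OCNet/inplace_abn")
```

Keeping the `.bak` copy makes it easy to compare the two versions afterwards or to roll back if the replacement does not help.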