OCNet.pytorch: Segmentation fault
The model fails during the forward pass in the training step. The only error reported is “Segmentation fault”:
dataset cityscapes_train
batch_size 1
data_dir ./dataset/cityscapes
data_list ./dataset/list/cityscapes/train.lst
ignore_label 255
input_size 769,769
is_training False
learning_rate 0.01
momentum 0.9
not_restore_last False
num_classes 19
start_iters 0
num_steps 40000
power 0.9
random_mirror True
random_scale True
random_seed 304
restore_from ./pretrained_model/resnet101-imagenet.pth
save_num_images 2
save_pred_every 5000
snapshot_dir checkpoint/snapshots_resnet101_asp_oc_dsn_1e-2_5e-4_8_40000/
weight_decay 0.0005
gpu 0,3,4
ohem_thres 0.7
ohem_thres1 0.8
ohem_thres2 0.5
use_weight True
use_val False
use_extra False
ohem False
ohem_keep 0
network resnet101
method asp_oc_dsn
reduce True
ohem_single False
use_parallel False
dsn_weight 0.4
pair_weight 1
seed 304
output_path ./seg_output_eval_set
store_output False
use_flip False
use_ms False
predict_choice whole
whole_scale 1
start_epochs 0
end_epochs 120
save_epoch 20
criterion ce
eval False
fix_lr False
log_file
use_normalize_transform False
/data/graphics/toyota-pytorch/OCNet/network/../oc_module/base_oc_block.py:69: UserWarning: nn.init.constant is now deprecated in favor of nn.init.constant_.
nn.init.constant(self.W.weight, 0)
/data/graphics/toyota-pytorch/OCNet/network/../oc_module/base_oc_block.py:70: UserWarning: nn.init.constant is now deprecated in favor of nn.init.constant_.
nn.init.constant(self.W.bias, 0)
/afs/csail.mit.edu/u/s/smadan/miniconda3/envs/py_36_tens_gpu/lib/python3.6/site-packages/torch/nn/parallel/data_parallel.py:24: UserWarning:
There is an imbalance between your GPUs. You may want to exclude GPU 3 which
has less than 75% of the memory or cores of GPU 1. You can do so by setting
the device_ids argument to DataParallel, or by setting the CUDA_VISIBLE_DEVICES
environment variable.
warnings.warn(imbalance_warn.format(device_ids[min_pos], device_ids[max_pos]))
w/ class balance
41650 images are loaded!
learning_rate: 0.01
torch.Size([1, 3, 769, 769])
Segmentation fault
I added a bunch of print statements and saw that the error happens at this step:
preds = model(images)
I checked the GPU usage: over 11 GB of GPU memory was free when the error occurred, so it does not look like a memory issue. Also, when I first ran the .sh file, it reported errors because the directories for log/log_train and log_test had not been created. I created them manually, and that error was resolved. But now the forward pass fails in the very first iteration. Any leads?
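A bare “Segmentation fault” gives no Python traceback, so print statements only narrow it down to the call site. One possible way to get more detail is the standard-library `faulthandler` module, which dumps the Python stack when the process receives a fatal signal. The sketch below is only an illustration: `enable_crash_traceback`, `run_forward`, and the log file name are hypothetical names that are not part of OCNet, and the crash itself most likely happens inside native code, so the traceback shows only which Python line triggered it.

```python
import faulthandler


def enable_crash_traceback(path="segfault_traceback.log"):
    """Write the Python traceback of all threads to `path` on a fatal signal.

    Hypothetical helper, not part of OCNet. The returned file object must
    stay open for as long as the handler is enabled.
    """
    log_file = open(path, "w")
    faulthandler.enable(file=log_file, all_threads=True)
    return log_file


def run_forward(model, images):
    """Hypothetical wrapper around the failing call from the post."""
    # flush=True so the message is not lost in a buffer if the process dies.
    print("entering forward pass", flush=True)
    preds = model(images)
    print("forward pass finished", flush=True)
    return preds
```

Calling `enable_crash_traceback()` once at the top of the training script should be enough. The same handler can also be switched on without code changes by starting the interpreter with `python -X faulthandler` or by setting the `PYTHONFAULTHANDLER` environment variable.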
About this issue
- State: open
- Created 6 years ago
- Comments: 25
@Spandan-Madan Hi, I solved the problem by replacing the 'inplace_abn' files with the inplace-abn module from https://github.com/liutinglt/CE2P.
Best,
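After swapping the module as suggested above, it may be worth confirming which copy Python actually picks up, since a stale copy elsewhere on the import path would leave the crash unchanged. The following is a minimal sketch using only the standard library. It assumes the module is importable under the top-level name `inplace_abn` from the directory where training is launched, which may not match the repository layout; `module_origin` is a hypothetical helper.

```python
import importlib.util


def module_origin(name):
    """Return the file a top-level module would be imported from, or None.

    Hypothetical helper: it only locates the module and does not import it,
    so it is safe to run even if importing the module would crash.
    """
    spec = importlib.util.find_spec(name)
    if spec is None:
        return None
    return spec.origin


# Assumed module name, taken from the comment above.
print(module_origin("inplace_abn"))
```

If this prints `None`, the name is not visible from the current working directory and the import path used by the training script needs a closer look.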