apex: Segmentation fault
I’m getting a segmentation fault when trying to train a model with amp.
torch version ‘1.0.1.post2’, cuDNN version 7.4.2
@mcarilli What could cause those issues?
Defaults for this optimization level are:
enabled : True
opt_level : O1
cast_model_type : None
patch_torch_functions : True
keep_batchnorm_fp32 : None
master_weights : None
loss_scale : dynamic
Processing user overrides (additional kwargs that are not None)...
After processing overrides, optimization options are:
enabled : True
opt_level : O1
cast_model_type : None
patch_torch_functions : True
keep_batchnorm_fp32 : None
master_weights : None
loss_scale : dynamic
Segmentation fault (core dumped)
:[System Logs]:
:Mar 29 12:23:22 server kernel: python3.6[95957]: segfault at b ip 00007f6e08bc664c sp 00007ffe63a06200 error 6 in amp_C.cpython-36m-x86_64-linux-gnu.so[7f6e08bb3000+64000]
:Mar 29 12:23:22 server abrt-hook-ccpp[96042]: Process 95957 (python3.6) of user 992434 killed by SIGSEGV - dumping core
:Mar 29 12:26:09 server kernel: python3.6[99444]: segfault at b ip 00007f502c4b764c sp 00007ffd04c510a0 error 6 in amp_C.cpython-36m-x86_64-linux-gnu.so[7f502c4a4000+64000]
:Mar 29 12:26:09 server abrt-hook-ccpp[99522]: Process 99444 (python3.6) of user 992434 killed by SIGSEGV - dumping core
:Mar 29 12:44:58 server kernel: python3.6[106167]: segfault at b ip 00007fa96640c64c sp 00007ffce94225f0 error 6 in amp_C.cpython-36m-x86_64-linux-gnu.so[7fa9663f9000+64000]
:Mar 29 12:44:58 server abrt-hook-ccpp[106244]: Process 106167 (python3.6) of user 992434 killed by SIGSEGV - dumping core
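A quick way to read the kernel lines above: subtracting the mapping base (the first number in the brackets) from the instruction pointer gives the offset of the faulting instruction inside `amp_C`. The sketch below does that arithmetic for the three log lines from this report. The regex and the helper name `fault_offset` are my own, not part of apex or any tool. On x86, `error 6` usually decodes as a user-mode write to an unmapped page, which would fit the tiny fault address `b`, but treat that as a hint rather than a diagnosis.

```python
import re

# Matches the standard kernel segfault message:
#   segfault at <addr> ip <ip> sp <sp> error <n> in <module>[<base>+<size>]
SEGFAULT_RE = re.compile(
    r"segfault at (?P<addr>[0-9a-f]+) ip (?P<ip>[0-9a-f]+) sp (?P<sp>[0-9a-f]+) "
    r"error (?P<err>\d+) in (?P<module>[^\[\s]+)\[(?P<base>[0-9a-f]+)\+(?P<size>[0-9a-f]+)\]"
)

# The three kernel lines from the system logs in this report.
LOG_LINES = [
    "python3.6[95957]: segfault at b ip 00007f6e08bc664c sp 00007ffe63a06200 "
    "error 6 in amp_C.cpython-36m-x86_64-linux-gnu.so[7f6e08bb3000+64000]",
    "python3.6[99444]: segfault at b ip 00007f502c4b764c sp 00007ffd04c510a0 "
    "error 6 in amp_C.cpython-36m-x86_64-linux-gnu.so[7f502c4a4000+64000]",
    "python3.6[106167]: segfault at b ip 00007fa96640c64c sp 00007ffce94225f0 "
    "error 6 in amp_C.cpython-36m-x86_64-linux-gnu.so[7fa9663f9000+64000]",
]


def fault_offset(line):
    """Return (module, ip - base) for a kernel segfault line, or None if it does not match."""
    match = SEGFAULT_RE.search(line)
    if match is None:
        return None
    ip = int(match.group("ip"), 16)
    base = int(match.group("base"), 16)
    return match.group("module"), ip - base


for line in LOG_LINES:
    module, offset = fault_offset(line)
    print(module, hex(offset))  # offset is 0x1364c for all three crashes
```

The offset is identical across all three runs even though the load address changes, so the crash is at the same instruction in `amp_C` every time. If the extension was built with debug symbols, that offset could be handed to a tool such as `addr2line` to find the source line.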
About this issue
- State: closed
- Created 5 years ago
- Comments: 21 (8 by maintainers)
@che85 You saved my day and my job!! I changed gcc from 4.8.5 to 5.4.0 and, magically, everything worked. Thank you!
You know what? I think I just figured it out. I recompiled apex with gcc 5.3.1 (previously 4.8.5) and now it’s training! If everything keeps working, I will close this issue.
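For anyone landing here with the same crash, a minimal sketch for checking which gcc is on your PATH before rebuilding. The threshold of 5.0.0 is only an assumption drawn from this thread (4.8.5 crashed, 5.3.1 and 5.4.0 worked), not a documented apex requirement, and the helper names are made up for illustration. Also note that the gcc on PATH is not necessarily the compiler your build actually used.

```python
import re
import shutil
import subprocess

# Assumption based only on the reports in this thread:
# builds with gcc 4.8.5 crashed, builds with gcc 5.3.1 and 5.4.0 worked.
ASSUMED_MIN_GCC = (5, 0, 0)


def parse_gcc_version(version_output):
    """Return (major, minor, patch) from `gcc --version` output, or None if not found."""
    lines = version_output.splitlines()
    if not lines:
        return None
    match = re.search(r"(\d+)\.(\d+)\.(\d+)", lines[0])
    if match is None:
        return None
    return tuple(int(part) for part in match.groups())


def looks_too_old(version, minimum=ASSUMED_MIN_GCC):
    """True if a parsed version is below the assumed minimum; False for None."""
    return version is not None and version < minimum


def check_local_gcc():
    """Run `gcc --version` if gcc is on PATH and return the parsed version, else None."""
    gcc_path = shutil.which("gcc")
    if gcc_path is None:
        return None
    result = subprocess.run(
        [gcc_path, "--version"],
        stdout=subprocess.PIPE,
        stderr=subprocess.STDOUT,
        universal_newlines=True,
        check=False,
    )
    return parse_gcc_version(result.stdout)


if __name__ == "__main__":
    version = check_local_gcc()
    if version is None:
        print("Could not determine a gcc version from PATH.")
    elif looks_too_old(version):
        print("gcc %d.%d.%d is older than the versions reported to work here." % version)
    else:
        print("gcc %d.%d.%d is at least as new as the assumed minimum." % version)
```

If the check reports an old compiler, switching to a newer gcc and recompiling the extension is what resolved the segfault for the two posters above.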