ROCm: rocm 2.10: clinfo generates segfault in /opt/rocm/hsa/lib/libhsa-ext-image64.so.1:amd::GpuAgent::GetInfo()
OS: Linux ryzendev2 5.0.0-37-generic #40~18.04.1-Ubuntu
HW: UDOO Bolt V8 - IOMMU is enabled in the BIOS
(lspci and lsmod are listed below the stack trace)
I am attempting to run clinfo. I get a segfault in amd::GpuAgent::GetInfo() in /opt/rocm/hsa/lib/libhsa-ext-image64.so.1.
But libhsa-ext-image64.so.1 is stripped and so the stack from the core file is less useful than hoped:
gdb /usr/bin/clinfo -c ~/core
Reading symbols from /usr/bin/clinfo…(no debugging symbols found)…done.
[New LWP 1531]
[New LWP 1534]
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib/x86_64-linux-gnu/libthread_db.so.1".
Core was generated by `clinfo'.
Program terminated with signal SIGSEGV, Segmentation fault.
#0 0x00007f7cf8f3b850 in ?? () from /opt/rocm/hsa/lib/libhsa-ext-image64.so.1
[Current thread is 1 (Thread 0x7f7cfc44e740 (LWP 1531))]
(gdb) bt
#0 0x00007f7cf8f3b850 in ?? () from /opt/rocm/hsa/lib/libhsa-ext-image64.so.1
#1 0x00007f7cfae924d9 in amd::GpuAgent::GetInfo(hsa_agent_info_t, void*) const () from /opt/rocm/hsa/lib/libhsa-runtime64.so.1
#2 0x00007f7cfaea6e68 in HSA::hsa_agent_get_info(hsa_agent_s, hsa_agent_info_t, void*) () from /opt/rocm/hsa/lib/libhsa-runtime64.so.1
#3 0x00007f7cfb270733 in ?? () from /opt/rocm/opencl/lib/x86_64/libamdocl64.so
#4 0x00007f7cfb270e32 in ?? () from /opt/rocm/opencl/lib/x86_64/libamdocl64.so
#5 0x00007f7cfb27245a in ?? () from /opt/rocm/opencl/lib/x86_64/libamdocl64.so
#6 0x00007f7cfb23f28f in ?? () from /opt/rocm/opencl/lib/x86_64/libamdocl64.so
#7 0x00007f7cfb23a297 in ?? () from /opt/rocm/opencl/lib/x86_64/libamdocl64.so
#8 0x00007f7cfb20dad5 in ?? () from /opt/rocm/opencl/lib/x86_64/libamdocl64.so
#9 0x00007f7cfb3853c9 in ?? () from /opt/rocm/opencl/lib/x86_64/libamdocl64.so
#10 0x00007f7cfb20dc0c in clIcdGetPlatformIDsKHR () from /opt/rocm/opencl/lib/x86_64/libamdocl64.so
#11 0x00007f7cfc4563c5 in ?? () from /opt/rocm/opencl/lib/x86_64/libOpenCL.so.1
#12 0x00007f7cfc45818f in ?? () from /opt/rocm/opencl/lib/x86_64/libOpenCL.so.1
#13 0x00007f7cfb4a1827 in __pthread_once_slow (once_control=0x7f7cfc45c0d8, init_routine=0x7f7cfc457fb0) at pthread_once.c:116
#14 0x00007f7cfc4568f1 in clGetPlatformIDs () from /opt/rocm/opencl/lib/x86_64/libOpenCL.so.1
#15 0x000055fbf38ff722 in ?? ()
#16 0x00007f7cfbc78b97 in __libc_start_main (main=0x55fbf38ff5d0, argc=1, argv=0x7fff359e5e98, init=<optimized out>, fini=<optimized out>, rtld_fini=<optimized out>, stack_end=0x7fff359e5e88) at …/csu/libc-start.c:310
#17 0x000055fbf38ffb3a in ?? ()
(gdb)
lspci:
00:00.0 Host bridge: Advanced Micro Devices, Inc. [AMD] Device 15d0
00:00.2 IOMMU: Advanced Micro Devices, Inc. [AMD] Device 15d1
00:01.0 Host bridge: Advanced Micro Devices, Inc. [AMD] Family 17h (Models 00h-0fh) PCIe Dummy Host Bridge
00:01.1 PCI bridge: Advanced Micro Devices, Inc. [AMD] Device 15d3
00:01.2 PCI bridge: Advanced Micro Devices, Inc. [AMD] Device 15d3
00:01.6 PCI bridge: Advanced Micro Devices, Inc. [AMD] Device 15d3
00:01.7 PCI bridge: Advanced Micro Devices, Inc. [AMD] Device 15d3
00:08.0 Host bridge: Advanced Micro Devices, Inc. [AMD] Family 17h (Models 00h-0fh) PCIe Dummy Host Bridge
00:08.1 PCI bridge: Advanced Micro Devices, Inc. [AMD] Device 15db
00:08.2 PCI bridge: Advanced Micro Devices, Inc. [AMD] Device 15dc
00:14.0 SMBus: Advanced Micro Devices, Inc. [AMD] FCH SMBus Controller (rev 61)
00:14.3 ISA bridge: Advanced Micro Devices, Inc. [AMD] FCH LPC Bridge (rev 51)
00:18.0 Host bridge: Advanced Micro Devices, Inc. [AMD] Device 15e8
00:18.1 Host bridge: Advanced Micro Devices, Inc. [AMD] Device 15e9
00:18.2 Host bridge: Advanced Micro Devices, Inc. [AMD] Device 15ea
00:18.3 Host bridge: Advanced Micro Devices, Inc. [AMD] Device 15eb
00:18.4 Host bridge: Advanced Micro Devices, Inc. [AMD] Device 15ec
00:18.5 Host bridge: Advanced Micro Devices, Inc. [AMD] Device 15ed
00:18.6 Host bridge: Advanced Micro Devices, Inc. [AMD] Device 15ee
00:18.7 Host bridge: Advanced Micro Devices, Inc. [AMD] Device 15ef
04:00.0 Ethernet controller: Realtek Semiconductor Co., Ltd. RTL8111/8168/8411 PCI Express Gigabit Ethernet Controller (rev 0c)
05:00.0 VGA compatible controller: Advanced Micro Devices, Inc. [AMD/ATI] Raven Ridge [Radeon Vega Series / Radeon Vega Mobile Series] (rev 83)
05:00.1 Audio device: Advanced Micro Devices, Inc. [AMD/ATI] Device 15de
05:00.2 Encryption controller: Advanced Micro Devices, Inc. [AMD] Device 15df
05:00.3 USB controller: Advanced Micro Devices, Inc. [AMD] Device 15e0
05:00.4 USB controller: Advanced Micro Devices, Inc. [AMD] Device 15e1
05:00.5 Multimedia controller: Advanced Micro Devices, Inc. [AMD] Device 15e2
05:00.6 Audio device: Advanced Micro Devices, Inc. [AMD] Device 15e3
05:00.7 Non-VGA unclassified device: Advanced Micro Devices, Inc. [AMD] Device 15e6
06:00.0 SATA controller: Advanced Micro Devices, Inc. [AMD] FCH SATA Controller [AHCI mode] (rev 61)
lsmod:
Module Size Used by
binfmt_misc 24576 1
nls_iso8859_1 16384 1
input_leds 16384 0
hid_generic 16384 0
usbhid 53248 0
edac_mce_amd 28672 0
kvm_amd 90112 0
ccp 86016 1 kvm_amd
kvm 647168 1 kvm_amd
snd_hda_codec_realtek 114688 1
irqbypass 16384 1 kvm
amdgpu 3915776 11
snd_hda_codec_generic 77824 1 snd_hda_codec_realtek
ledtrig_audio 16384 2 snd_hda_codec_generic,snd_hda_codec_realtek
snd_hda_codec_hdmi 53248 1
snd_hda_intel 49152 5
snd_hda_codec 135168 4 snd_hda_codec_generic,snd_hda_codec_hdmi,snd_hda_intel,snd_hda_codec_realtek
snd_hda_core 86016 5 snd_hda_codec_generic,snd_hda_codec_hdmi,snd_hda_intel,snd_hda_codec,snd_hda_codec_realtek
snd_hwdep 20480 1 snd_hda_codec
snd_pcm 102400 4 snd_hda_codec_hdmi,snd_hda_intel,snd_hda_codec,snd_hda_core
snd_seq_midi 20480 0
amdkcl 28672 1 amdgpu
snd_seq_midi_event 16384 1 snd_seq_midi
amd_iommu_v2 20480 1 amdgpu
amdttm 102400 1 amdgpu
snd_rawmidi 36864 1 snd_seq_midi
crct10dif_pclmul 16384 1
amd_sched 32768 1 amdgpu
crc32_pclmul 16384 0
drm_kms_helper 180224 1 amdgpu
cdc_acm 36864 0
ghash_clmulni_intel 16384 0
drm 483328 16 drm_kms_helper,amd_sched,amdttm,amdgpu,amdkcl
snd_seq 69632 2 snd_seq_midi,snd_seq_midi_event
aesni_intel 372736 0
i2c_algo_bit 16384 1 amdgpu
fb_sys_fops 16384 1 drm_kms_helper
snd_seq_device 16384 3 snd_seq,snd_seq_midi,snd_rawmidi
aes_x86_64 20480 1 aesni_intel
syscopyarea 16384 1 drm_kms_helper
crypto_simd 16384 1 aesni_intel
cryptd 24576 3 crypto_simd,ghash_clmulni_intel,aesni_intel
glue_helper 16384 1 aesni_intel
snd_timer 36864 2 snd_seq,snd_pcm
sysfillrect 16384 1 drm_kms_helper
snd 86016 21 snd_hda_codec_generic,snd_seq,snd_seq_device,snd_hda_codec_hdmi,snd_hwdep,snd_hda_intel,snd_hda_codec,snd_hda_codec_realtek,snd_timer,snd_pcm,snd_rawmidi
sysimgblt 16384 1 drm_kms_helper
snd_pci_acp3x 16384 0
k10temp 16384 0
soundcore 16384 1 snd
mac_hid 16384 0
sch_fq_codel 20480 2
parport_pc 36864 0
ppdev 24576 0
lp 20480 0
parport 53248 3 parport_pc,lp,ppdev
ip_tables 32768 0
x_tables 40960 1 ip_tables
autofs4 45056 2
ahci 40960 2
libahci 32768 1 ahci
r8169 86016 0
i2c_amd_mp2_pci 20480 0
i2c_piix4 28672 0
realtek 20480 0
video 49152 0
sdhci_acpi 24576 0
sdhci 57344 1 sdhci_acpi
i2c_hid 28672 0
hid 126976 3 i2c_hid,usbhid,hid_generic
Any help with this would be greatly appreciated! Thank you very much in advance!
John Utz
Pensar Development
About this issue
- State: closed
- Created 5 years ago
- Comments: 32
Oh hey! Just the person I was hoping to hear from! Thank you for your efforts, Mr. Skeely!
I have an update: Bad news: it still repros in 3.00.
Good news: if I move libhsa-ext-image64.so* out of the way, the crash no longer happens, and clinfo, OpenCV::DNN, etc. all work.
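A minimal sketch of that workaround, for anyone who wants to try it. The function name `sideline_image_lib` and the backup directory are illustrative choices, not anything ROCm provides; the library directory is taken from the backtrace above, and moving files under /opt/rocm will normally need root.

```shell
# Move the HSA image extension library aside so the runtime cannot load it.
# sideline_image_lib is a hypothetical helper, not part of ROCm.
#   $1 = directory holding the library (e.g. /opt/rocm/hsa/lib)
#   $2 = backup directory to move the files into (created if missing)
sideline_image_lib() {
    lib_dir="$1"
    backup_dir="$2"
    mkdir -p "$backup_dir" || return 1
    moved=0
    for f in "$lib_dir"/libhsa-ext-image64.so*; do
        # Skip the unexpanded pattern when nothing matches; keep symlinks.
        [ -e "$f" ] || [ -L "$f" ] || continue
        mv "$f" "$backup_dir"/ || return 1
        moved=$((moved + 1))
    done
    echo "moved $moved file(s) to $backup_dir"
}
```

Run as root, something like `sideline_image_lib /opt/rocm/hsa/lib /opt/rocm/hsa/lib/disabled` followed by `clinfo` should show whether the crash goes away. Moving the files back restores the original state. Presumably this also disables image support, so it is a diagnostic step rather than a fix.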
Analysis: As near as I can tell, the crash is happening in hsa_amd_image_get_info_max_dim_impl in libhsa-ext-image64.so.1.1.9. I say this because I think I am single-stepping into this function before the segfault is raised.
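Since the library is stripped, one way to get a bit more out of the crash without symbols is to rerun under gdb in batch mode and dump the loaded-library list, the instructions at the faulting address, and the registers. This is only a sketch: `crash_report` is a hypothetical helper name, and the `GDB` override exists only so the command line can be inspected without running gdb.

```shell
# Rerun a crashing program under gdb and print state that is still useful
# when the faulting library has no symbols.
# crash_report is a hypothetical helper; set GDB to use a different binary.
crash_report() {
    "${GDB:-gdb}" -batch \
        -ex run \
        -ex bt \
        -ex 'info sharedlibrary' \
        -ex 'x/8i $pc' \
        -ex 'info registers' \
        --args "$@"
}
```

Usage would be `crash_report /usr/bin/clinfo`. The `info sharedlibrary` output shows where libhsa-ext-image64.so.1 is mapped, which helps turn the frame #0 address into something comparable between runs, and the disassembly plus registers may hint at which pointer was bad.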
This failure is on a Ryzen CPU / Raven Ridge GPU. I assume the problem doesn't repro on desktop-class GPUs.
What I realized this morning is that I don't know which of these query args is triggering it. Do any of them look like they would be immediately broken on Raven Ridge: