DCGM: dcgm-exporter crashes hostengine.
Running a 3.3.5-3.4.0 exporter against a 3.3.5 host engine, as shipped via the NVIDIA Ubuntu repos, segfaults the host engine.
Is there something I can do? Should this be reported to the exporter project instead?
Logs:
dmesg crash info
[Feb28 16:22] nvidia-nvswitch5: open (major=510)
[ +0,042810] nvidia-nvswitch4: open (major=510)
[ +0,042606] nvidia-nvswitch0: open (major=510)
[ +0,042409] nvidia-nvswitch2: open (major=510)
[ +0,042448] nvidia-nvswitch1: open (major=510)
[ +0,042372] nvidia-nvswitch3: open (major=510)
[Feb28 16:29] nv-hostengine[1280071]: segfault at 28 ip 00007f09f65c74b2 sp 00007f09f61e2ba0 error 6 in libdcgmmodulenvswitch.so.3.3.5[7f09f658c000+f8000]
[ +0,000008] Code: 7d b8 44 88 6d b0 e8 7d 0a ff ff 48 8b 45 a8 48 8b 73 18 48 89 45 c0 48 3b 73 20 0f 84 df 00 00 00 66 0f 6f 45 b0 48 83 c6 18 <0f> 11 46 e8 48 8b 45 c0 48 89 46 f8 48 89 73 18 48 8d 65 d8 5b 41
[ +0,155916] nvidia-nvswitch3: release (major=510)
[ +0,000005] nvidia-nvswitch1: release (major=510)
[ +0,000002] nvidia-nvswitch2: release (major=510)
[ +0,000003] nvidia-nvswitch0: release (major=510)
[ +0,000002] nvidia-nvswitch4: release (major=510)
[ +0,000002] nvidia-nvswitch5: release (major=510)
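For anyone triaging this, the segfault line above can be decoded with a few lines of Python. This is only a sketch: the helper name `decode_segfault` is made up for illustration, the error-code bits are the standard x86 page-fault flags, and the computed offset is only meaningful if the mapping shown in brackets starts at the library's load base.

```python
import re

# Matches the kernel's "segfault at ..." message; all numbers are hexadecimal.
SEGFAULT_RE = re.compile(
    r"segfault at (?P<addr>[0-9a-f]+) ip (?P<ip>[0-9a-f]+) sp (?P<sp>[0-9a-f]+) "
    r"error (?P<err>[0-9a-f]+) in (?P<lib>\S+)\[(?P<base>[0-9a-f]+)\+(?P<size>[0-9a-f]+)\]"
)


def decode_segfault(line):
    """Hypothetical helper: extract fault details from one dmesg segfault line."""
    m = SEGFAULT_RE.search(line)
    if m is None:
        raise ValueError("not a segfault line")
    ip = int(m["ip"], 16)
    base = int(m["base"], 16)
    err = int(m["err"], 16)
    return {
        "library": m["lib"],
        "fault_address": int(m["addr"], 16),
        # Offset of the faulting instruction inside the mapped region.
        "offset": ip - base,
        # x86 page-fault error code bits.
        "page_present": bool(err & 0x1),
        "write_access": bool(err & 0x2),
        "user_mode": bool(err & 0x4),
    }


line = (
    "nv-hostengine[1280071]: segfault at 28 ip 00007f09f65c74b2 "
    "sp 00007f09f61e2ba0 error 6 in "
    "libdcgmmodulenvswitch.so.3.3.5[7f09f658c000+f8000]"
)
info = decode_segfault(line)
print(info["library"], hex(info["offset"]), hex(info["fault_address"]))
```

Here the fault is a user-mode write to a non-present page at address 0x28, which looks like a write through a near-null pointer inside the NvSwitch module. If debug symbols for the library are available, the offset could be passed to something like `addr2line -e libdcgmmodulenvswitch.so.3.3.5 <offset>` to locate the source line.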
journal for exporter and hostengine
Feb 28 16:21:57 gx01 systemd[1]: Started NVIDIA DCGM service.
Feb 28 16:21:58 gx01 nv-hostengine[1280055]: DCGM initialized
Feb 28 16:21:58 gx01 nv-hostengine[1280055]: Started host engine version 3.3.5 using port number: 5555
Feb 28 16:29:08 gx01 systemd[1]: Started DCGM Exporter.
Feb 28 16:29:08 gx01 dcgm-exporter[1298060]: time="2024-02-28T16:29:08+01:00" level=info msg="Starting dcgm-exporter"
Feb 28 16:29:08 gx01 dcgm-exporter[1298060]: time="2024-02-28T16:29:08+01:00" level=info msg="Attemping to connect to remote hostengine at localhost:5555"
Feb 28 16:29:08 gx01 dcgm-exporter[1298060]: time="2024-02-28T16:29:08+01:00" level=info msg="DCGM successfully initialized!"
Feb 28 16:29:09 gx01 dcgm-exporter[1298060]: time="2024-02-28T16:29:09+01:00" level=info msg="Collecting DCP Metrics"
Feb 28 16:29:09 gx01 dcgm-exporter[1298060]: time="2024-02-28T16:29:09+01:00" level=info msg="Falling back to metric file '/net/mgmtdelab/pool/html/dcgm/current/counters.csv'"
Feb 28 16:29:09 gx01 dcgm-exporter[1298060]: time="2024-02-28T16:29:09+01:00" level=info msg="Initializing system entities of type: GPU"
Feb 28 16:29:11 gx01 dcgm-exporter[1298060]: time="2024-02-28T16:29:11+01:00" level=info msg="Initializing system entities of type: NvSwitch"
Feb 28 16:29:11 gx01 dcgm-exporter[1298060]: time="2024-02-28T16:29:11+01:00" level=info msg="Initializing system entities of type: NvLink"
Feb 28 16:29:11 gx01 dcgm-exporter[1298060]: time="2024-02-28T16:29:11+01:00" level=info msg="Initializing system entities of type: CPU"
Feb 28 16:29:11 gx01 dcgm-exporter[1298060]: time="2024-02-28T16:29:11+01:00" level=info msg="Not collecting CPU metrics; Error retrieving DCGM MIG hierarchy: This request is serviced by a module of DCGM that is not currently loaded"
Feb 28 16:29:11 gx01 dcgm-exporter[1298060]: time="2024-02-28T16:29:11+01:00" level=info msg="Initializing system entities of type: CPU Core"
Feb 28 16:29:11 gx01 dcgm-exporter[1298060]: time="2024-02-28T16:29:11+01:00" level=info msg="Not collecting CPU Core metrics; Error retrieving DCGM MIG hierarchy: This request is serviced by a module of DCGM that is not currently loaded"
Feb 28 16:29:14 gx01 dcgm-exporter[1298060]: time="2024-02-28T16:29:14+01:00" level=warning msg="can not destroy group" error="Error destroying group: Host engine connection invalid/disconnected" groupID="{21}"
Feb 28 16:29:14 gx01 dcgm-exporter[1298060]: time="2024-02-28T16:29:14+01:00" level=warning msg="Cannot destroy field group." error="Host engine connection invalid/disconnected"
Feb 28 16:29:14 gx01 dcgm-exporter[1298060]: time="2024-02-28T16:29:14+01:00" level=fatal msg="Failed to watch metrics: Error watching fields: Host engine connection invalid/disconnected"
Feb 28 16:29:14 gx01 systemd[1]: dcgm-exporter.service: Main process exited, code=exited, status=1/FAILURE
Feb 28 16:29:14 gx01 systemd[1]: dcgm-exporter.service: Failed with result 'exit-code'.
Feb 28 16:29:14 gx01 systemd[1]: nvidia-dcgm.service: Main process exited, code=killed, status=11/SEGV
Feb 28 16:29:14 gx01 systemd[1]: nvidia-dcgm.service: Failed with result 'signal'.
Versions
# dcgm-exporter -v --debug
DCGM Exporter version 3.3.5-3.4.0
# dcgmi -v
Version : 3.3.5
Build ID : 14
Build Date : 2024-02-24
Build Type : Release
Commit ID : 93088b0e1286c6e7723af1930251298870e26c19
Branch Name : rel_dcgm_3_3
CPU Arch : x86_64
Build Platform : Linux 4.15.0-180-generic #189-Ubuntu SMP Wed May 18 14:13:57 UTC 2022 x86_64
CRC : 08a0d9624b562a1342bf5f8828939294
apt-cache policy datacenter-gpu-manager
# apt-cache policy datacenter-gpu-manager
datacenter-gpu-manager:
Installed: 1:3.3.5
Candidate: 1:3.3.5
Version table:
*** 1:3.3.5 580
580 https://developer.download.nvidia.com/compute/cuda/repos/ubuntu2004/x86_64 Packages
100 /var/lib/dpkg/status
1:3.3.3 580
580 https://developer.download.nvidia.com/compute/cuda/repos/ubuntu2004/x86_64 Packages
1:3.3.1 580
580 https://developer.download.nvidia.com/compute/cuda/repos/ubuntu2004/x86_64 Packages
1:3.3.0 580
580 https://developer.download.nvidia.com/compute/cuda/repos/ubuntu2004/x86_64 Packages
1:3.2.6 580
580 https://developer.download.nvidia.com/compute/cuda/repos/ubuntu2004/x86_64 Packages
1:3.2.5 580
580 https://developer.download.nvidia.com/compute/cuda/repos/ubuntu2004/x86_64 Packages
1:3.2.3 580
580 https://developer.download.nvidia.com/compute/cuda/repos/ubuntu2004/x86_64 Packages
1:3.1.8 580
580 https://developer.download.nvidia.com/compute/cuda/repos/ubuntu2004/x86_64 Packages
1:3.1.7 580
580 https://developer.download.nvidia.com/compute/cuda/repos/ubuntu2004/x86_64 Packages
1:3.1.6 580
580 https://developer.download.nvidia.com/compute/cuda/repos/ubuntu2004/x86_64 Packages
1:3.1.3 580
580 https://developer.download.nvidia.com/compute/cuda/repos/ubuntu2004/x86_64 Packages
1:3.0.4 580
580 https://developer.download.nvidia.com/compute/cuda/repos/ubuntu2004/x86_64 Packages
1:2.4.8 600
600 https://repo.download.nvidia.com/baseos/ubuntu/focal/x86_64 focal-updates/common amd64 Packages
1:2.4.7 600
580 https://developer.download.nvidia.com/compute/cuda/repos/ubuntu2004/x86_64 Packages
600 https://repo.download.nvidia.com/baseos/ubuntu/focal/x86_64 focal-updates/common amd64 Packages
1:2.4.6 580
580 https://developer.download.nvidia.com/compute/cuda/repos/ubuntu2004/x86_64 Packages
1:2.4.5 600
580 https://developer.download.nvidia.com/compute/cuda/repos/ubuntu2004/x86_64 Packages
600 https://repo.download.nvidia.com/baseos/ubuntu/focal/x86_64 focal-updates/common amd64 Packages
1:2.3.6 600
580 https://developer.download.nvidia.com/compute/cuda/repos/ubuntu2004/x86_64 Packages
600 https://repo.download.nvidia.com/baseos/ubuntu/focal/x86_64 focal-updates/common amd64 Packages
1:2.3.5 580
580 https://developer.download.nvidia.com/compute/cuda/repos/ubuntu2004/x86_64 Packages
1:2.3.4 580
580 https://developer.download.nvidia.com/compute/cuda/repos/ubuntu2004/x86_64 Packages
1:2.3.2 600
580 https://developer.download.nvidia.com/compute/cuda/repos/ubuntu2004/x86_64 Packages
600 https://repo.download.nvidia.com/baseos/ubuntu/focal/x86_64 focal-updates/common amd64 Packages
1:2.3.1 600
580 https://developer.download.nvidia.com/compute/cuda/repos/ubuntu2004/x86_64 Packages
600 https://repo.download.nvidia.com/baseos/ubuntu/focal/x86_64 focal-updates/common amd64 Packages
1:2.2.9 600
580 https://developer.download.nvidia.com/compute/cuda/repos/ubuntu2004/x86_64 Packages
600 https://repo.download.nvidia.com/baseos/ubuntu/focal/x86_64 focal-updates/common amd64 Packages
1:2.2.8 580
580 https://developer.download.nvidia.com/compute/cuda/repos/ubuntu2004/x86_64 Packages
1:2.2.3 580
580 https://developer.download.nvidia.com/compute/cuda/repos/ubuntu2004/x86_64 Packages
1:2.1.8 580
580 https://developer.download.nvidia.com/compute/cuda/repos/ubuntu2004/x86_64 Packages
1:2.1.7 580
580 https://developer.download.nvidia.com/compute/cuda/repos/ubuntu2004/x86_64 Packages
1:2.1.4 580
580 https://developer.download.nvidia.com/compute/cuda/repos/ubuntu2004/x86_64 Packages
1:2.0.15 580
580 https://developer.download.nvidia.com/compute/cuda/repos/ubuntu2004/x86_64 Packages
1:2.0.14 600
600 https://repo.download.nvidia.com/baseos/ubuntu/focal/x86_64 focal-updates/common amd64 Packages
1:2.0.13 600
580 https://developer.download.nvidia.com/compute/cuda/repos/ubuntu2004/x86_64 Packages
600 https://repo.download.nvidia.com/baseos/ubuntu/focal/x86_64 focal/common amd64 Packages
OS info
# cat /etc/dgx-release
DGX_NAME="DGX Server"
DGX_PRETTY_NAME="NVIDIA DGX Server"
DGX_SWBUILD_DATE="2020-10-26-11-53-11"
DGX_SWBUILD_VERSION="5.0.0"
DGX_COMMIT_ID="7501dff"
DGX_PLATFORM="DGX Server for DGX A100"
DGX_SERIAL_NUMBER="XXXXXXXXXXXX"
DGX_OTA_VERSION="5.0.5"
DGX_OTA_DATE="XXXXXXXXXXXXXXXXX"
DGX_OTA_VERSION="5.1.1"
DGX_OTA_DATE="XXXXXXXXXXXXXXXXX"
DGX_OTA_VERSION="5.2.0"
DGX_OTA_DATE="XXXXXXXXXXXXXXXXX"
DGX_OTA_VERSION="5.3.1"
DGX_OTA_DATE="XXXXXXXXXXXXXXXXX"
DGX_OTA_VERSION="5.5.1"
DGX_OTA_DATE="XXXXXXXXXXXXXXXXX"
About this issue
- Original URL
- State: open
- Created 4 months ago
- Comments: 15
@krono, I apologize for the long wait. We’ve managed to reproduce the issue on our side. While our call stack is different, the source of the problem is very likely the same, and your observations of a std::vector<> containing garbage support that. I believe the fix we’re working on will resolve it.