bladebit: Periodic panic when cuda plotting with 3.1.0-rc2

When running cudaplot and producing c07 plots I get a panic fairly regularly. Here’s what I see:

Seed used: 0xe323c0f230a83863a37cb136b4db4c88d600c1cbff549e5907eaec678b02d71e
Proofs requested/fetched: 35 / 100 ( 35.000% )
Proof fetches failed    : 60 ( 60.000% )

WARNING: Deleting plot '/mnt/plots/plot-k32-c07-2023-09-24-01-03-e451f4bc253ca772d5c941fb7ed71cbad1907c5710c67048c2d880675bf2256d.plot.tmp' as it failed to fetch some proofs. This might indicate corrupt plot file.

Completed writing plot in 72.75 seconds
Generating plot 8: 703a66d5a19a1173b863fe3d2ed1fe562aaf8d8ca74826846ddaa81c60088f6e
Plot temporary file: /mnt/plots/plot-k32-c07-2023-09-24-01-10-703a66d5a19a1173b863fe3d2ed1fe562aaf8d8ca74826846ddaa81c60088f6e.plot.tmp

CUDA error: 700 (0x2bc) cudaErrorIllegalAddress : an illegal memory access was encountered

*** Panic!!! *** Fatal Error:
CUDA error cudaErrorIllegalAddress : an illegal memory access was encountered.
/home/llowrey/bladebit_cuda(_ZN7SysHost14DumpStackTraceEv+0x3b)[0x4c7a4b]
/home/llowrey/bladebit_cuda(_Z9PanicExitv+0x9)[0x6450b9]
/home/llowrey/bladebit_cuda[0x47675b]
/home/llowrey/bladebit_cuda(_ZN14CudaK32Plotter3RunERK11PlotRequest+0x5f3)[0x47b8d3]
/home/llowrey/bladebit_cuda(main+0xa67)[0x473ea7]
/lib64/libc.so.6(+0x27510)[0x7fb72b224510]
/lib64/libc.so.6(__libc_start_main+0x89)[0x7fb72b2245c9]

I also see this in dmesg:

[33389.606189] NVRM: GPU at PCI:0000:06:00: GPU-24106a8f-6cbb-0623-97ed-f00643abf6ac
[33389.606226] NVRM: Xid (PCI:0000:06:00): 31, pid=13181, name=bladebit_cuda, Ch 00000008, intr 10000000. MMU Fault: ENGINE GRAPHICS GPCCLIENT_T1_1 faulted @ 0x7fb7_2bba8000. Fault is of type FAULT_PDE ACCESS_TYPE_READ
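
NVIDIA's Xid documentation lists Xid 31 as a GPU memory page fault, which matches the cudaErrorIllegalAddress reported by the plotter. When collecting these events across several panics, a small parser can help. The sketch below assumes the line format shown in the dmesg output above; the function name `parse_xid` and the field names are illustrative only and not part of any NVIDIA or bladebit tooling.

```python
import re

# Pattern for the NVRM Xid line format seen in the dmesg output above.
# Other Xid codes use different message layouts and will not match.
XID_RE = re.compile(
    r"NVRM: Xid \(PCI:(?P<pci>[0-9a-fA-F:.]+)\): (?P<xid>\d+), "
    r"pid=(?P<pid>\d+), name=(?P<name>[^,]+),"
    r".*?faulted @ (?P<addr>0x[0-9a-fA-F_]+)\. "
    r"Fault is of type (?P<fault>\S+) (?P<access>\S+)"
)

def parse_xid(line):
    """Return the fields of an MMU-fault Xid line as a dict, or None."""
    m = XID_RE.search(line)
    if m is None:
        return None
    info = m.groupdict()
    info["xid"] = int(info["xid"])
    info["pid"] = int(info["pid"])
    # The driver prints the address with an underscore separator.
    info["addr"] = int(info["addr"].replace("_", ""), 16)
    return info

line = (
    "[33389.606226] NVRM: Xid (PCI:0000:06:00): 31, pid=13181, "
    "name=bladebit_cuda, Ch 00000008, intr 10000000. MMU Fault: ENGINE "
    "GRAPHICS GPCCLIENT_T1_1 faulted @ 0x7fb7_2bba8000. "
    "Fault is of type FAULT_PDE ACCESS_TYPE_READ"
)
info = parse_xid(line)
print(info["xid"], info["name"], hex(info["addr"]), info["fault"], info["access"])
```

Feeding it each matching line from `dmesg` makes it easy to see whether the faulting process, fault type, and access type are the same across panics.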

I’m running with --check 100 --check-threshold 0.8. I do see plots deleted periodically, but the panic occurs only when the line starting with “Proof fetches failed” is printed, and then it follows immediately.
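
To make the relationship between the check output and the threshold concrete, here is a minimal sketch that parses the "Proofs requested/fetched" line from the output above and compares the fetch rate with the threshold. It rests on two assumptions that are not confirmed against the bladebit source: that the two numbers are printed as fetched / requested (inferred from --check 100 and the 35% figure), and that a plot is kept when fetched / requested is at least the --check-threshold value. The helper names are hypothetical.

```python
import re

# Matches the "Proofs requested/fetched: 35 / 100" line from the check output.
CHECK_RE = re.compile(r"Proofs requested/fetched:\s*(\d+)\s*/\s*(\d+)")

def fetch_rate(log_text):
    """Return fetched / requested from the check output, or None if absent."""
    m = CHECK_RE.search(log_text)
    if m is None:
        return None
    # Assumption: the first number is fetched, the second is requested.
    fetched, requested = int(m.group(1)), int(m.group(2))
    return fetched / requested

def passes_check(rate, threshold):
    # Assumption: the plot is kept when the fetch rate meets the threshold.
    return rate >= threshold

log = """Proofs requested/fetched: 35 / 100 ( 35.000% )
Proof fetches failed    : 60 ( 60.000% )"""

rate = fetch_rate(log)
print(rate, passes_check(rate, 0.8))
```

Under those assumptions, the plot in the log above falls well short of the 0.8 threshold, which is consistent with the deletion warning that follows it.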

I have two identical systems (described below), and both panic at about the same frequency: once every 10 to 20 plots. When I plotted with the alphas and then 3.0.0, about 7.75% of plots turned out to be bad, or about 1 in 13. That is consistent with the panics I’m seeing every 10-20 plots.
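
The arithmetic behind that comparison can be sketched as follows. This assumes each plot fails independently at a fixed 7.75% rate (a geometric model), which is a simplification for illustration and not something measured here.

```python
# Observed bad-plot rate from the alpha and 3.0.0 runs.
bad_rate = 0.0775

# Mean number of plots per failure under a fixed, independent failure rate.
mean_plots_per_failure = 1 / bad_rate

def prob_first_failure_between(p, lo, hi):
    """P(lo <= N <= hi), where N is the plot number of the first failure.

    For a geometric distribution, P(N >= n) = (1 - p) ** (n - 1).
    """
    return (1 - p) ** (lo - 1) - (1 - p) ** hi

print(round(mean_plots_per_failure, 1))
print(round(prob_first_failure_between(bad_rate, 10, 20), 3))
```

The mean works out to about 12.9 plots per failure, which sits inside the observed 10-20 plot interval between panics.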

  • CPU: Opteron 32c
  • RAM: 256GB DDR3 ECC
  • GPU: 1070 (PCIe2 x16 due to old Opteron platform)
  • OS: Fedora 37
  • Kernel: 6.4.9-100.fc37.x86_64
  • Driver: 535.86.10
  • CUDA: 12.2
  • Bladebit: 3.1.0-rc2

About this issue

  • State: open
  • Created 9 months ago
  • Comments: 16 (3 by maintainers)

Most upvoted comments

The CI artifacts might have issues, so we can try a different executable. I will post here when I have one ready for testing, to see whether that is the cause of the discrepancy.