dlib: Training gets stuck on GeForce RTX 2080 Ti

I’m training semantic-segmentation networks, along the lines of #288.

The code works great on GeForce GTX products, but with the new RTX hardware training randomly gets stuck. It looks like a race condition or something similar, because the freeze happens only after multiple iterations, and usually after a different number of iterations from run to run.

Using CUDA 10 and cuDNN 7.3 on 64-bit Windows (the same issue occurs with CUDA 8 and cuDNN 5). Latest master from GitHub. The MSVC debugger shows the code is waiting on this line.

Because the steps to reproduce include acquiring RTX hardware, I’d rather spend time trying to debug the issue than write a complete set of steps – at least at this point.

About this issue

  • State: closed
  • Created 6 years ago
  • Comments: 24 (23 by maintainers)

Most upvoted comments

I was running into a similar issue after I upgraded my older GTX 770 GPU to an RTX 4090, which sits alongside my RTX 2070 Super. I believe I had been using the drivers for the 770 back when the 2070 and the 770 were installed side by side with an older version of CUDA (10.2).

In my particular case, the upgrade resulted in memory access violations at cudaStreamSynchronize. Changing the source code to use the polling method introduced in this comment thread also resulted in access violations, so I went down a pretty terrible rabbit hole, which I will try to document in this comment in the hope that somebody finds an actual fix. The TL;DR is that calling get_net() before train_one_step(batch_in, batch_target) in my training loop finally avoided all issues.
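For anyone reading this without the earlier context: the polling method referred to above boils down to replacing a blocking cudaStreamSynchronize() with a loop over cudaStreamQuery(). The sketch below is illustrative only – it is not the exact code from dlib or #1514, and the function name is mine.

```cpp
// Rough sketch of a "polling" stream synchronization, as opposed to a
// blocking cudaStreamSynchronize().  Illustration of the idea discussed in
// this thread, not dlib's actual patch.
#include <cuda_runtime.h>
#include <stdexcept>
#include <thread>

void wait_for_stream_by_polling(cudaStream_t stream)
{
    for (;;)
    {
        const cudaError_t err = cudaStreamQuery(stream);
        if (err == cudaSuccess)
            return;                        // all work on the stream has finished
        if (err != cudaErrorNotReady)      // anything else is a real error
            throw std::runtime_error(cudaGetErrorString(err));
        std::this_thread::yield();         // give the scheduler a chance, then poll again
    }
}
```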

To begin, I am on Windows 11 (regrettably) using CUDA 12.1 on the RTX 4090 and 2070 cards. I believe having the 4090 has forced me onto 12.1, since according to the compute-capability compatibility matrix (say that fast 20 times) it is the only version that supports compute capability 8.9.

I made sure, in my case, to install CUDA 12.1 (with the most recent NVIDIA drivers) and to build dlib from source with the correct compute capabilities (8.9 and 7.5 for the 4090 and 2070). I installed Nsight Compute and got a trace:

Before getting to the memory error, I had a series of 40-60 error 209s (error 209 in the CUDA driver API is CUDA_ERROR_NO_BINARY_FOR_GPU). Oddly, only some of the cuLibraryGetModule API calls returned error 209, and the commonality across every error was that the first parameter passed was a memory address with no offset. (screenshot omitted)

After clicking through these error 209s, I eventually get to a memory access violation: (screenshot omitted)

After going through the dlib source code and not really finding anything, I got really frustrated and needed to understand where the issue was coming from. I have a similar implementation of my code in PyTorch, so I spent considerable time building it from source (and upgrading, since the CMake files are not compatible with the migration of NVIDIA nvToolsExt to header-only in CUDA 12.1).

Again, I made sure I was using the same CUDA libraries and compute capabilities, to eliminate version differences as an explanation for success or failure across PyTorch and dlib.

What I observed in PyTorch with CUDA 12.1 on my RTX 4090 and RTX 2070 was very similar to what I was seeing in dlib. It was a bit more stable in PyTorch, in the sense that once the application got past a certain point it was almost guaranteed to complete, whereas in dlib it would always crash by iteration ~20000. But I most definitely observed the same errors in PyTorch. (No pictures of the memory error trace yet; it’s a bit of a pain to step through all those error 209s, especially when it does not happen every time. I’m also not really motivated to do so, since I found a fix, even if it is a terrible one.)

For your consideration, here is the PyTorch error 209: (screenshot omitted)

The thing that finally worked for me was calling get_net before train_one_step in my training loop. It does not seem to be a requirement based on examples like dnn_introduction2_ex, so I’m not sure why adding that particular line of code finally got things working for dlib / CUDA 12.1 on Windows 11. @davisking, I have a hunch why this worked, but would love other people’s input.
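For concreteness, here is roughly what that ordering looks like. This is a minimal sketch assuming a dnn_trainer set up along the lines of dnn_introduction2_ex; the toy network definition and the batch-loading step are placeholders, and the only point is calling get_net() before train_one_step().

```cpp
// Sketch of the workaround: call get_net() before each train_one_step().
// The network and data here are placeholders -- substitute your own.
#include <dlib/dnn.h>
#include <vector>

using namespace dlib;

// Hypothetical toy network, not the segmentation net from this thread.
using net_type = loss_multiclass_log<fc<10, relu<fc<32, input<matrix<float>>>>>>;

int main()
{
    net_type net;
    dnn_trainer<net_type> trainer(net, sgd());
    trainer.set_learning_rate(0.01);
    trainer.be_verbose();

    std::vector<matrix<float>>  batch_in;
    std::vector<unsigned long>  batch_target;

    while (trainer.get_learning_rate() >= 1e-6)
    {
        // ... fill batch_in / batch_target with the next mini-batch ...

        // The workaround: get_net() blocks until the trainer's internal
        // thread has finished the previous step before the next one is queued.
        trainer.get_net();
        trainer.train_one_step(batch_in, batch_target);
    }
    return 0;
}
```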

BONUS CONTENT: Before the final fix, I had things semi-stable for about a week (it would complete a few runs before the memory access violations). But then last Friday a storm knocked the power out, and once it came back on, the application would consistently give memory errors after ~20000 iterations. That led me to check the core GPU voltage, thinking maybe something got fried. The core voltage was fine, though: 0.88 V under no load, 0.965 V under load (the card also seemed fine; I can play Skyrim on it with no issue). At that point I got frustrated and went down the rabbit hole described above.

All signs point to a race condition somewhere in the CUDA libraries, or perhaps a change in the CUDA API that dlib has not caught up with yet?

Just wanted to give you all an update on this. I ran into this problem 3 times out of about 57 trials, using the same architecture for each trial. The only difference now is that I’m seeing it on CUDA 9.0, on an IBM Power8 machine running RHEL7 with four P100s. This was using dlib-19.15. I’m going to try the patch, but with only a ~5% chance of the stall happening, I might not catch it.

I’ve only run across the error on Windows 10. I haven’t tested any Win7 or Win8.1 machines, and I also haven’t upgraded any Ubuntu machines to CUDA Toolkit v10.0.

I was able to incorporate #1514 on my Quadro machine and completed a training session (~5 hrs). Before making the change on my GTX 1080 machine, I upgraded the driver to 416.34, but that did not help; training still stalled within the first few minutes. I then made the #1514 change and was able to complete the training (~5 hrs).