TorchSharp: TorchScript execution failures (hangs, access violations, "Fatal error. Internal CLR fatal error. (0x80131506)")
Repeated execution of TorchScript modules consistently fails in one of these ways within about 30 seconds of runtime, apparently due to memory corruption.
This has now been narrowed down to a fairly minimal repro. It hangs or produces a CLR fatal error within about 20 seconds while repeatedly running a medium-sized (26 million parameter) neural network and interleaving the calls with code that exercises the .NET garbage collector.
This is running the latest version of TorchSharp (0.100.3) on Windows 11 with CUDA 12.2. The TorchScript file is too large to upload easily (56 MB), and possibly this would happen with any TorchScript module.
public static void ReplicateFailure()
{
    // Load the TorchScript module onto the GPU and convert it to fp16.
    var module = torch.jit.load<Tensor, (Tensor, Tensor, Tensor, Tensor)>(@"c:\temp\fail.ts", DeviceType.CUDA, 0).to(ScalarType.Float16);

    for (int bs = 1; bs < 133; bs += 3)
    {
        // Build an fp16 input tensor of shape (bs, 64, 135) on the GPU.
        Tensor input = torch.tensor(new float[bs * 64 * 135], ScalarType.Float16, new Device("cuda:0"), false)
            .reshape(new long[] { bs, 64, 135 });

        for (int j = 0; j < 100; j++)
        {
            // Run inference, disposing the outputs via the scope when it exits.
            using (var dx = torch.NewDisposeScope())
            {
                module.call(input);
            }

            // Interleave inference with managed allocations to provoke GC activity.
            ExerciseGC(j);
        }

        Console.WriteLine(bs + " done loop this batch size ");
    }
}

public static void ExerciseGC(int index)
{
    Console.WriteLine("in " + index);

    // Allocate a spread of byte arrays to generate garbage and trigger collections.
    object[] objs = new object[20_000];
    for (int i = 0; i < 20_000; i += 999)
    {
        objs[i] = new byte[i * 10];
    }

    Console.WriteLine("out");
}
About this issue
- State: closed
- Created a year ago
- Comments: 29 (17 by maintainers)
Commits related to this issue
- Fixed issue #1047. — committed to NiklasGustafsson/TorchSharp by NiklasGustafsson 8 months ago
- Merge pull request #1133 from NiklasGustafsson/bugs Addressing issue #1047 — committed to dotnet/TorchSharp by NiklasGustafsson 8 months ago
Okay, so by allocating and freeing memory in native code instead of pinning managed objects, it is both faster and can run your repro case (and a few others) 1,000,000 times.
Running that many iterations takes a very long time (15 minutes), so I'm going to use a lower iteration count for the checked-in unit test.
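To make the distinction concrete, here is a rough sketch of the two interop patterns being contrasted: pinning a managed buffer versus allocating from the native heap. This is illustrative only and is not TorchSharp's actual implementation; the buffer size and method names are made up for the example.

```csharp
using System;
using System.Runtime.InteropServices;

static class InteropSketch
{
    // Old pattern: pin a managed array so the GC cannot move it while native code
    // holds a pointer to it. Long-lived pins constrain the GC and fragment the heap.
    public static void UsePinnedManagedBuffer()
    {
        float[] managed = new float[1024];
        GCHandle handle = GCHandle.Alloc(managed, GCHandleType.Pinned);
        try
        {
            IntPtr ptr = handle.AddrOfPinnedObject();
            // ... pass ptr to native code ...
        }
        finally
        {
            handle.Free(); // unpin so the GC can move or collect the array again
        }
    }

    // New pattern: allocate the buffer on the native heap, so the GC never sees it
    // and the pointer stays valid until it is explicitly freed.
    public static void UseNativeBuffer()
    {
        IntPtr buffer = Marshal.AllocHGlobal(1024 * sizeof(float));
        try
        {
            // ... pass buffer to native code; Marshal.Copy can move data in and out ...
        }
        finally
        {
            Marshal.FreeHGlobal(buffer);
        }
    }
}
```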
Good news!
Regarding your challenge, GPT-4 produced a plausible solution, but I'm not qualified to judge whether it's just a hallucination…
… Building cross-platform C++ projects can indeed be challenging due to differences in compiler behavior and available features. To address your issue with setting compiler options for C++ files only, you can use the $<COMPILE_LANGUAGE:CXX> generator expression in CMake. This expression evaluates to true for C++ sources.
You can set the C++ standard for your target using the target_compile_features function, and for setting specific compiler options, you can use target_compile_options. Here is an example of how you might adjust your CMakeLists.txt:
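(A sketch of the snippet being described; your_target_name is a placeholder, C++17 reflects the libtorch requirement discussed in this thread, and -std=c++17 merely stands in for whatever C++-only option needs guarding.)

```cmake
# Replace your_target_name with the actual target name.
set_target_properties(your_target_name PROPERTIES
    CXX_STANDARD 17           # require C++17
    CXX_STANDARD_REQUIRED ON  # fail configuration if C++17 is not supported
)

# The generator expression applies the option only when compiling C++ sources,
# so C sources in the same target never see it.
target_compile_options(your_target_name PRIVATE
    "$<$<COMPILE_LANGUAGE:CXX>:-std=c++17>"
)
```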
In this example:
- Replace your_target_name with the actual name of your target.
- CXX_STANDARD is used to set the required C++ standard.
- CXX_STANDARD_REQUIRED ensures that CMake will fail configuration if the required standard is not supported.
- target_compile_options is used to add compile options conditionally for C++ files.
These commands ensure that the C++ standard and compiler options are applied only to C++ files, not to C files. Adjust the options as needed for your project's requirements.
Status update on the 2.1.0 work: I have it working on Windows with CUDA, but I'm having a challenging time getting the Linux and macOS builds to work. libtorch has moved to C++17, and enabling it creates problems because Clang (unlike MSVC) doesn't simply ignore the C++-only command-line option for the one file that is C rather than C++; it rejects it.
The problem is in the src/Native/CMakeList.txt file, where I don’t know how to add the compiler option for just C++, excluding C.
So if anyone has some bright ideas on that one, I’m all ears.
Thank you for the update! I was also going to ask about this; it is definitely an impediment for my work too.
I’m sorry it’s proving to be such a mystery. Here’s one radical thought: maybe you could try upgrading to more recent PyTorch (2.1) and CUDA (12.x) versions. This could have two benefits: it might fix the problem if it was caused by an old bug, or at least behave differently and give more clues. Additionally, a package upgrade would be very helpful in its own right, especially since the current one does not support modern hardware such as the H100, reporting “no kernel found for device” when TorchSharp is run.