TorchSharp: training runs normally on version 0.99.6 but errors on 0.100.3

err [W C:\actions-runner\_work\pytorch\pytorch\builder\windows\pytorch\third_party\nvfuser\csrc\graph_fuser.cpp:108] Warning: operator () profile_node %987 : int[] = prim::profile_ivalue(%dims.39) does not have profile information (function operator ())

[W C:\actions-runner\_work\pytorch\pytorch\builder\windows\pytorch\third_party\nvfuser\csrc\manager.cpp:340] Warning: FALLBACK path has been taken inside: torch::jit::fuser::cuda::runCudaFusionGroup. This is an indication that codegen Failed for some reason. To debug try disable codegen fallback path via setting the env variable export PYTORCH_NVFUSER_DISABLE=fallback (function runCudaFusionGroup)

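(Side note: the warning's own suggestion, export PYTORCH_NVFUSER_DISABLE=fallback, can be applied from C# before TorchSharp initializes. A minimal sketch, assuming libtorch reads the variable when it is set from managed code rather than from the shell:

    // Must run before the first TorchSharp call loads the native libraries.
    // The variable name comes from the warning text above; whether libtorch
    // honors it when set this way, instead of via the shell, is an assumption.
    Environment.SetEnvironmentVariable("PYTORCH_NVFUSER_DISABLE", "fallback");

With the fallback disabled, the nvfuser codegen failure should surface as a hard error instead of a silent FALLBACK warning.)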

2023/7/3 0:23:21 epoch: 0/80000 : loss:0.048035335 acc:0.9519647 model directory: 【D:\Inno.LabelAssistantDatas\电路板\电路板\AIModel\model.pt】

D:\C#\Inno.ImgProcess.UI\Inno.AIFrameWork.TrainTest\bin\Debug\net6.0-windows\Inno.AIFrameWork.TrainTest.exe (process 22168) exited with code -1073741819. To automatically close the console when debugging stops, enable Tools->Options->Debugging->Automatically close the console when debugging stops. Press any key to close this window . . .

About this issue

  • State: closed
  • Created a year ago
  • Comments: 23 (8 by maintainers)

Most upvoted comments

An interesting observation: without breakpoints it crashes within 3 epochs, but with a large number of breakpoints set it can run for at least 15 epochs.

I suspect it has something to do with GC.
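One way to test the GC hypothesis without breakpoints is to force a full collection at the end of every training iteration; breakpoints slow execution enough to change GC timing, so making collection explicit should reproduce (or remove) the same effect in a debugger-free run. A sketch using only standard .NET calls, placed at the bottom of the batch loop:

    // Force a deterministic GC each iteration. If native tensor handles are
    // being freed by finalizers while still in use, collecting here should
    // change when (or whether) the access violation occurs.
    GC.Collect();
    GC.WaitForPendingFinalizers();
    GC.Collect();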

@lindadamama, can you prepare a simple example and dataset so that everyone can reproduce the problem? You currently provide too little information, which makes it difficult to analyze.

The process exited with code -1073741819, which is 0xC0000005, a native access violation. My guess is that Torch 2.0.1 uses more memory than Torch 1.13, causing memory to be exhausted. A second problem is that some tensors coming out of the dataloader are released for unknown reasons during training; nothing similar happened on 0.99.6. The main problem right now is the premature release of ResultMasks in var loss = criterion.call(result, item.ResultMasks); @GeorgeS2019 @NiklasGustafsson, here is the code where the tensor is lost:

    foreach (var item in dataloader)
    {
        // One scope per batch: every tensor created inside is disposed
        // automatically when the scope closes at the end of the iteration.
        using (var d = torch.NewDisposeScope())
        {
            var output = this.module.call(item.Images);
            if (output is Tensor result)
            {
                var loss = criterion.call(result, item.ResultMasks);
                optimizer.zero_grad();
                loss.backward();
                var opt = optimizer.step();
                var loss_val = loss.to(CPU).item<float>();
                lossArry.Add(loss_val);
                result?.Dispose();
            }
        }
    }
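If the batch tensors really are being invalidated mid-iteration, a defensive variant is to copy them at the top of the loop body, so the iteration only works on tensors whose lifetime it owns. This is a sketch of a workaround, not a confirmed fix; it only adds clone() calls, on the assumption that the dataloader's own tensors are what is being freed early:

    foreach (var item in dataloader)
    {
        using (var d = torch.NewDisposeScope())
        {
            // Own copies of the batch: the rest of the iteration no longer
            // depends on the lifetime of the dataloader's internal tensors.
            var images = item.Images.clone();
            var masks = item.ResultMasks.clone();

            var output = this.module.call(images);
            if (output is Tensor result)
            {
                var loss = criterion.call(result, masks);
                optimizer.zero_grad();
                loss.backward();
                optimizer.step();
                lossArry.Add(loss.to(CPU).item<float>());
            }
        }
    }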

Is there a problem with the code written above, or does it need to be changed to the following form, where the dispose scope wraps the whole loop?

    using (var d = torch.NewDisposeScope())
    {
        using var dataset = new SegDataset(datadir, labelDir, numClass);
        using var dataloader = new torch.utils.data.DataLoader<DataSetItem, BatchItem>(
            dataset, batchSize, CollateFn.Collate, doShuffle, device, num_worker: 4);

        // A single scope around the whole loop: nothing created inside is
        // disposed until every batch has been processed.
        foreach (var item in dataloader)
        {
            var output = this.module.call(item.Images);
            if (output is Tensor result)
            {
                var loss = criterion.call(result, item.ResultMasks);
                optimizer.zero_grad();
                loss.backward();
                var opt = optimizer.step();
                var loss_val = loss.to(CPU).item<float>();
                lossArry.Add(loss_val);
                result?.Dispose();
            }
        }
    }
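For what it's worth, the two forms are not equivalent: the second keeps one scope open across the whole epoch, so every intermediate tensor from every batch stays alive until the loop finishes, which increases rather than reduces memory pressure. The per-iteration scope of the first form is the usual pattern; when a tensor genuinely has to outlive its iteration, it can be lifted out explicitly. A short sketch (MoveToOuterDisposeScope is part of TorchSharp's dispose-scope API; SomeComputation is a hypothetical placeholder):

    using (var d = torch.NewDisposeScope())
    {
        var t = SomeComputation();               // registered in scope d
        var kept = t.MoveToOuterDisposeScope();  // survives after d closes
    }   // everything still registered in d is disposed here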

I’m sorry, I have modified it.