deeplearning4j: cudaStreamSynchronize(...) failed
Issue Description
Possible duplicate of #6892.
When running an LSTM network on the GPU, training crashes with a RuntimeException:
CUDA error at /mnt/jenkins/workspace/deeplearning4j-master-linux-x86_64-cuda-10.0/libnd4j/blas/cuda/NativeOps.cu:1904 code=77(<unknown>) "dZ"
CUDA error at /mnt/jenkins/workspace/deeplearning4j-master-linux-x86_64-cuda-10.0/libnd4j/blas/cuda/NativeOps.cu:1904 code=77(<unknown>) "dZ"
CUDA error at /mnt/jenkins/workspace/deeplearning4j-master-linux-x86_64-cuda-10.0/libnd4j/blas/cuda/NativeOps.cu:1904 code=77(<unknown>) "dZ"
Exception in thread "main" java.lang.RuntimeException: cudaStreamSynchronize(...) failed
at org.nd4j.nativeblas.Nd4jCuda$NativeOps.streamSynchronize(Native Method)
at org.nd4j.linalg.jcublas.context.CudaContext.syncOldStream(CudaContext.java:131)
at org.nd4j.linalg.jcublas.context.CudaContext.syncOldStream(CudaContext.java:121)
at org.nd4j.linalg.jcublas.ops.executioner.CudaExecutioner.commit(CudaExecutioner.java:1868)
at org.nd4j.linalg.memory.abstracts.Nd4jWorkspace.close(Nd4jWorkspace.java:602)
at org.deeplearning4j.nn.graph.ComputationGraph.calcBackpropGradients(ComputationGraph.java:2725)
at org.deeplearning4j.nn.graph.ComputationGraph.computeGradientAndScore(ComputationGraph.java:1378)
at org.deeplearning4j.nn.graph.ComputationGraph.computeGradientAndScore(ComputationGraph.java:1338)
at org.deeplearning4j.optimize.solvers.BaseOptimizer.gradientAndScore(BaseOptimizer.java:160)
at org.deeplearning4j.optimize.solvers.StochasticGradientDescent.optimize(StochasticGradientDescent.java:63)
at org.deeplearning4j.optimize.Solver.optimize(Solver.java:52)
at org.deeplearning4j.nn.graph.ComputationGraph.doTruncatedBPTT(ComputationGraph.java:3612)
at org.deeplearning4j.nn.graph.ComputationGraph.fitHelper(ComputationGraph.java:1153)
at org.deeplearning4j.nn.graph.ComputationGraph.fit(ComputationGraph.java:1112)
at org.deeplearning4j.nn.graph.ComputationGraph.fit(ComputationGraph.java:1079)
at org.deeplearning4j.nn.graph.ComputationGraph.fit(ComputationGraph.java:1015)
at com.luiwammes.pc.trainer.Application.main(Application.java:71)
Suppressed: java.lang.RuntimeException: cudaStreamSynchronize(...) failed
at org.nd4j.nativeblas.Nd4jCuda$NativeOps.streamSynchronize(Native Method)
at org.nd4j.linalg.jcublas.context.CudaContext.syncOldStream(CudaContext.java:131)
at org.nd4j.linalg.jcublas.context.CudaContext.syncOldStream(CudaContext.java:121)
at org.nd4j.linalg.jcublas.ops.executioner.CudaExecutioner.commit(CudaExecutioner.java:1868)
at org.nd4j.linalg.memory.abstracts.Nd4jWorkspace.close(Nd4jWorkspace.java:602)
at org.deeplearning4j.nn.graph.ComputationGraph.computeGradientAndScore(ComputationGraph.java:1419)
... 10 more
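Both the primary and the suppressed exception surface while a workspace is being closed (Nd4jWorkspace.close) after the backward pass. As a narrowing step rather than a fix, the same network could be rebuilt with workspaces disabled; the sketch below is hypothetical (not the reporter's code) and only shows the two workspace-mode calls that would differ from the configuration reported further down:

```java
import org.deeplearning4j.nn.conf.NeuralNetConfiguration;
import org.deeplearning4j.nn.conf.WorkspaceMode;

public class WorkspaceDiagnostics {
    /**
     * Returns a configuration builder with workspaces turned off. If training then
     * completes, the cudaStreamSynchronize failure is likely tied to the
     * workspace/stream-synchronization path rather than to the LSTM computation itself.
     */
    public static NeuralNetConfiguration.Builder noWorkspaceBuilder() {
        return new NeuralNetConfiguration.Builder()
                .trainingWorkspaceMode(WorkspaceMode.NONE)   // issue log reports ENABLED
                .inferenceWorkspaceMode(WorkspaceMode.NONE); // issue log reports ENABLED
    }
}
```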
Additional output:
10:19:35.276 [main] INFO org.nd4j.linalg.factory.Nd4jBackend - Loaded [JCublasBackend] backend
10:19:40.482 [main] INFO org.nd4j.nativeblas.NativeOpsHolder - Number of threads used for NativeOps: 32
10:19:48.892 [main] INFO org.nd4j.nativeblas.Nd4jBlas - Number of threads used for BLAS: 0
10:19:48.909 [main] INFO o.n.l.a.o.e.DefaultOpExecutioner - Backend used: [CUDA]; OS: [Linux]
10:19:48.909 [main] INFO o.n.l.a.o.e.DefaultOpExecutioner - Cores: [4]; Memory: [4.9GB];
10:19:48.909 [main] INFO o.n.l.a.o.e.DefaultOpExecutioner - Blas vendor: [CUBLAS]
10:19:48.911 [main] INFO o.n.l.j.o.e.CudaExecutioner - Device Name: [Tesla K80]; CC: [3.7]; Total/free memory: [11996954624]
10:19:48.983 [main] INFO o.d.nn.graph.ComputationGraph - Starting ComputationGraph with WorkspaceModes set to [training: ENABLED; inference: ENABLED], cacheMode set to [NONE]
10:19:59.515 [main] INFO com.luiwammes.pc.trainer.Application -
========================================================================================================================
VertexName (VertexType) nIn,nOut TotalParams ParamsShape Vertex Inputs
========================================================================================================================
input (InputVertex) -,- - - -
layer-0 (LSTM) 80,512 1214464 W:{80,2048}, RW:{512,2048}, b:{1,2048} [input]
layer-1 (LSTM) 512,512 2099200 W:{512,2048}, RW:{512,2048}, b:{1,2048} [layer-0]
outputLayer-merge (MergeVertex) -,- - - [layer-0, layer-1]
outputLayer (RnnOutputLayer) 1024,80 82000 W:{1024,80}, b:{1,80} [outputLayer-merge]
------------------------------------------------------------------------------------------------------------------------
Total Parameters: 3395664
Trainable Parameters: 3395664
Frozen Parameters: 0
========================================================================================================================
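For reference, below is a minimal, hedged reconstruction of a graph that produces the summary above: the vertex names, layer sizes, and parameter shapes match the printout, while the activations, loss function, updater, weight initialization, and TBPTT lengths are assumptions rather than the reporter's actual values.

```java
import org.deeplearning4j.nn.conf.BackpropType;
import org.deeplearning4j.nn.conf.ComputationGraphConfiguration;
import org.deeplearning4j.nn.conf.NeuralNetConfiguration;
import org.deeplearning4j.nn.conf.WorkspaceMode;
import org.deeplearning4j.nn.conf.graph.MergeVertex;
import org.deeplearning4j.nn.conf.layers.LSTM;
import org.deeplearning4j.nn.conf.layers.RnnOutputLayer;
import org.deeplearning4j.nn.graph.ComputationGraph;
import org.deeplearning4j.nn.weights.WeightInit;
import org.nd4j.linalg.activations.Activation;
import org.nd4j.linalg.learning.config.Adam;
import org.nd4j.linalg.lossfunctions.LossFunctions;

public class LstmGraphSketch {
    public static ComputationGraph build() {
        ComputationGraphConfiguration conf = new NeuralNetConfiguration.Builder()
                .updater(new Adam(1e-3))                       // assumed
                .weightInit(WeightInit.XAVIER)                 // assumed
                .trainingWorkspaceMode(WorkspaceMode.ENABLED)  // as reported in the log
                .inferenceWorkspaceMode(WorkspaceMode.ENABLED) // as reported in the log
                .graphBuilder()
                .addInputs("input")
                // layer-0: 80 -> 512; W:{80,2048}, RW:{512,2048}, b:{1,2048} = 1,214,464 params
                .addLayer("layer-0", new LSTM.Builder().nIn(80).nOut(512)
                        .activation(Activation.TANH).build(), "input")
                // layer-1: 512 -> 512 = 2,099,200 params
                .addLayer("layer-1", new LSTM.Builder().nIn(512).nOut(512)
                        .activation(Activation.TANH).build(), "layer-0")
                // skip connection: concatenate layer-0 and layer-1 -> 1024 features
                .addVertex("outputLayer-merge", new MergeVertex(), "layer-0", "layer-1")
                // outputLayer: 1024 -> 80 = 82,000 params
                .addLayer("outputLayer", new RnnOutputLayer.Builder(LossFunctions.LossFunction.MCXENT)
                        .activation(Activation.SOFTMAX).nIn(1024).nOut(80).build(), "outputLayer-merge")
                .setOutputs("outputLayer")
                // the stack trace goes through doTruncatedBPTT; the lengths here are assumed
                .backpropType(BackpropType.TruncatedBPTT)
                .tBPTTForwardLength(50)
                .tBPTTBackwardLength(50)
                .build();

        ComputationGraph graph = new ComputationGraph(conf);
        graph.init();
        return graph;
    }
}
```

Calling graph.summary() on this sketch yields the same total of 3,395,664 parameters as the table above (1,214,464 + 2,099,200 + 82,000).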
Version Information
Using the latest snapshots for DL4J/ND4J.
NVIDIA details:
$ nvcc --version
nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2018 NVIDIA Corporation
Built on Sat_Aug_25_21:08:01_CDT_2018
Cuda compilation tools, release 10.0, V10.0.130
$ nvidia-smi
Fri Apr 19 10:25:30 2019
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 418.40.04 Driver Version: 418.40.04 CUDA Version: 10.1 |
|-------------------------------+----------------------+----------------------+
| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
|===============================+======================+======================|
| 0 Tesla K80 On | 00000000:00:04.0 Off | 0 |
| N/A 54C P8 31W / 149W | 0MiB / 11441MiB | 0% Default |
+-------------------------------+----------------------+----------------------+
+-----------------------------------------------------------------------------+
| Processes: GPU Memory |
| GPU PID Type Process name Usage |
|=============================================================================|
| No running processes found |
+-----------------------------------------------------------------------------+
OS
$ uname -a
Linux gpu-instance 4.15.0-1029-gcp #31-Ubuntu SMP Thu Mar 21 09:40:28 UTC 2019 x86_64 x86_64 x86_64 GNU/Linux
About this issue
- Original URL
- State: closed
- Created 5 years ago
- Comments: 31 (15 by maintainers)
@saudet Confirmed and fixed in my branch. Will merge with the other fixes later today.