djl: MXNet GPU CUDA mismatch problem?

Hey, I'm new here. Today I successfully installed the IJava kernel in Google Colab and Java runs fine, BUT when I train I get this ERROR:

```
ai.djl.engine.EngineException: MXNet engine call failed: MXNetError: Compile with USE_CUDA=1 to enable GPU usage
Stack trace:
  File "src/storage/storage.cc", line 119
    at ai.djl.mxnet.jna.JnaUtils.checkCall(JnaUtils.java:1909)
    at ai.djl.mxnet.jna.JnaUtils.createNdArray(JnaUtils.java:349)
    at ai.djl.mxnet.engine.MxNDManager.create(MxNDManager.java:91)
    at ai.djl.mxnet.engine.MxNDManager.create(MxNDManager.java:34)
    at ai.djl.ndarray.NDManager.create(NDManager.java:526)
    at ai.djl.mxnet.engine.MxNDArray.duplicate(MxNDArray.java:184)
    at ai.djl.mxnet.engine.MxNDArray.toDevice(MxNDArray.java:197)
    at ai.djl.training.ParameterStore.getValue(ParameterStore.java:110)
    at ai.djl.training.Trainer.lambda$initialize$1(Trainer.java:120)
    at java.base/java.lang.Iterable.forEach(Iterable.java:75)
    at ai.djl.training.Trainer.initialize(Trainer.java:117)
    at .(#76:1)
```

Here's my training phase:

```java
int batchSize = 32;
int limit = Integer.MAX_VALUE; // change this to a small value for a dry run
// int limit = 160; // limit to 160 records in the dataset for a dry run

Pipeline pipeline = new Pipeline(
        new ToTensor(),
        new Normalize(new float[] {0.4914f, 0.4822f, 0.4465f},
                      new float[] {0.2023f, 0.1994f, 0.2010f}));

Cifar10 trainDataset = Cifar10.builder()
        .setSampling(batchSize, true)
        .optUsage(Dataset.Usage.TRAIN)
        .optLimit(limit)
        .optPipeline(pipeline)
        .build();
trainDataset.prepare(new ProgressBar());

DefaultTrainingConfig config = new DefaultTrainingConfig(Loss.softmaxCrossEntropyLoss())
        // softmaxCrossEntropyLoss is a standard loss for classification problems
        .addEvaluator(new Accuracy()) // use accuracy so we humans can understand how accurate the model is
        .optDevices(new Device[] {Device.gpu(0)}) // limit to one GPU; using more GPUs can actually slow down convergence
        .addTrainingListeners(TrainingListener.Defaults.logging());

// Now that we have our training configuration, we create a new trainer for our model
Trainer trainer = model.newTrainer(config);

int epoch = 10;
Shape inputShape = new Shape(1, 3, 32, 32);
trainer.initialize(inputShape);

for (int i = 0; i < epoch; ++i) {
    for (Batch batch : trainer.iterateDataset(trainDataset)) {
        EasyTrain.trainBatch(trainer, batch);
        trainer.step();
        batch.close();
    }
    // reset training and validation evaluators at the end of the epoch
    trainer.notifyListeners(listener -> listener.onEpoch(trainer));
}
```
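Before training, it can help to confirm what the engine actually reports. Here's a minimal sketch using DJL's Engine API (this check is an addition for debugging, not part of my original notebook):

```java
import ai.djl.engine.Engine;

// If the MXNet native library was built without CUDA, getGpuCount()
// returns 0 and any attempt to place tensors on Device.gpu(0) fails
// with the "Compile with USE_CUDA=1" error above.
Engine engine = Engine.getInstance();
System.out.println("Engine:      " + engine.getEngineName() + " " + engine.getVersion());
System.out.println("GPU count:   " + engine.getGpuCount());
System.out.println("Default dev: " + engine.defaultDevice());
```

If the GPU count comes back as 0 even though the Colab runtime has a GPU attached, the MXNet native binary that DJL loaded is a CPU-only build.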

I know it's a CUDA-related error. In Google Colab it shows cuda-10.0 installed. I have also tried installing mxnet-cu90 with this command: `!pip install mxnet-cu90`, but it's still not working… Please help me through this?
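For reference, mxnet-cu90 is the CUDA 9.0 build, so it wouldn't match a CUDA 10.x runtime anyway (and DJL typically manages its own MXNet native binaries, so a pip wheel may not even be what DJL loads). A sketch of how one could check what Colab actually provides; the exact package suffix to install depends on what the version check reports:

```
# Check which CUDA toolkit and driver Colab provides
!nvcc --version
!nvidia-smi

# Install the MXNet build matching the reported toolkit,
# e.g. for CUDA 10.1:
!pip install mxnet-cu101
```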

About this issue

  • Original URL: https://github.com/awslabs/djl/issues/824
  • State: closed
  • Created 3 years ago
  • Comments: 28 (12 by maintainers)

Most upvoted comments

@aksrajvanshi as a further action, can you update the Colab instructions in the D2L book so we can automate this process next time?

Awesome, thanks! Please just let me know if you change or do anything special with Colab so that I can adapt quickly… I'm gladly looking forward to seeing it.

On Fri 9 Apr, 2021, 8:57 PM aksrajvanshi, @.***> wrote:

@nikkisingh111333 https://github.com/nikkisingh111333 Great! So this wasn't exactly your problem; it's more of a Colab problem. First of all, MXNet needs CUDA 10.1 or 10.2 to work.

Secondly, DJL needs the libcudart library to be on the path given by the $LD_LIBRARY_PATH environment variable, and it wasn't there. Initially LD_LIBRARY_PATH pointed to /usr/lib64-nvidia, so we created a symbolic link there that allowed DJL to locate the libcudart file.
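In practice the workaround looks something like the following (a sketch only; the exact libcudart version and paths are assumptions and depend on the Colab image):

```
# See where the dynamic loader is told to look
!echo $LD_LIBRARY_PATH        # e.g. /usr/lib64-nvidia

# Find the libcudart shipped with the installed CUDA toolkit
!ls /usr/local/cuda/lib64/libcudart*

# Link it into the directory on LD_LIBRARY_PATH so DJL can find it
# (adjust the file name to match the ls output above)
!ln -s /usr/local/cuda/lib64/libcudart.so /usr/lib64-nvidia/libcudart.so
```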

We can try to do something that would make it easy for users to run DJL with a GPU on Colab 😃

Also, if you’re starting with Deep Learning, you can start with this book. (https://d2l.djl.ai/)
