pytorch_dlprim: OpenCL is 3x slower than CPU

I tried running the provided mnist.py. My main use case is training on a laptop, where I want it to train faster and use less battery. What I did not expect is that training was about 3x slower on the GPU than on the CPU. I am on an Intel(R) Core(TM) i7-1065G7 CPU @ 1.30GHz with intel-opencl-icd. Furthermore, the GPU run made the laptop fans spin up, while the CPU run did not.

Expected behaviour

(venv) home@daniel-tablet1:~/PycharmProjects/pytorch_dlprim$ python3 mnist.py --device cpu
/home/home/PycharmProjects/whisper/venv/lib/python3.10/site-packages/torchvision/io/image.py:13: UserWarning: Failed to load image Python extension: libc10_cuda.so: cannot open shared object file: No such file or directory
  warn(f"Failed to load image Python extension: {e}")
Train Epoch: 1 [0/60000 (0%)]   Loss: 2.337603
Train Epoch: 1 [640/60000 (1%)] Loss: 1.000659
Train Epoch: 1 [1280/60000 (2%)]        Loss: 0.534252
Train Epoch: 1 [1920/60000 (3%)]        Loss: 0.306895
Train Epoch: 1 [2560/60000 (4%)]        Loss: 0.328317
Train Epoch: 1 [3200/60000 (5%)]        Loss: 0.169952
Train Epoch: 1 [3840/60000 (6%)]        Loss: 0.152536
Train Epoch: 1 [4480/60000 (7%)]        Loss: 0.228139
Train Epoch: 1 [5120/60000 (9%)]        Loss: 0.211874
Train Epoch: 1 [5760/60000 (10%)]       Loss: 0.113282
Train Epoch: 1 [6400/60000 (11%)]       Loss: 0.173121
Train Epoch: 1 [7040/60000 (12%)]       Loss: 0.139788
Train Epoch: 1 [7680/60000 (13%)]       Loss: 0.186882
Train Epoch: 1 [8320/60000 (14%)]       Loss: 0.099785
Train Epoch: 1 [8960/60000 (15%)]       Loss: 0.147153
Train Epoch: 1 [9600/60000 (16%)]       Loss: 0.190826
Train Epoch: 1 [10240/60000 (17%)]      Loss: 0.385145
Train Epoch: 1 [10880/60000 (18%)]      Loss: 0.154965
Train Epoch: 1 [11520/60000 (19%)]      Loss: 0.258187
Train Epoch: 1 [12160/60000 (20%)]      Loss: 0.147772
Train Epoch: 1 [12800/60000 (21%)]      Loss: 0.122823
Train Epoch: 1 [13440/60000 (22%)]      Loss: 0.150513
Train Epoch: 1 [14080/60000 (23%)]      Loss: 0.090943
Train Epoch: 1 [14720/60000 (25%)]      Loss: 0.208224
Train Epoch: 1 [15360/60000 (26%)]      Loss: 0.074682
Train Epoch: 1 [16000/60000 (27%)]      Loss: 0.091023
Train Epoch: 1 [16640/60000 (28%)]      Loss: 0.193498
Train Epoch: 1 [17280/60000 (29%)]      Loss: 0.048429
Train Epoch: 1 [17920/60000 (30%)]      Loss: 0.114691
Train Epoch: 1 [18560/60000 (31%)]      Loss: 0.103097
Train Epoch: 1 [19200/60000 (32%)]      Loss: 0.111526
Train Epoch: 1 [19840/60000 (33%)]      Loss: 0.026821
Train Epoch: 1 [20480/60000 (34%)]      Loss: 0.018942
Train Epoch: 1 [21120/60000 (35%)]      Loss: 0.079938
Train Epoch: 1 [21760/60000 (36%)]      Loss: 0.014885
Train Epoch: 1 [22400/60000 (37%)]      Loss: 0.042647
Train Epoch: 1 [23040/60000 (38%)]      Loss: 0.215288
Train Epoch: 1 [23680/60000 (39%)]      Loss: 0.138436
Train Epoch: 1 [24320/60000 (41%)]      Loss: 0.011650
Train Epoch: 1 [24960/60000 (42%)]      Loss: 0.028758
Train Epoch: 1 [25600/60000 (43%)]      Loss: 0.033963
Train Epoch: 1 [26240/60000 (44%)]      Loss: 0.026172
Train Epoch: 1 [26880/60000 (45%)]      Loss: 0.119763
Train Epoch: 1 [27520/60000 (46%)]      Loss: 0.122162
Train Epoch: 1 [28160/60000 (47%)]      Loss: 0.073711
Train Epoch: 1 [28800/60000 (48%)]      Loss: 0.031891
Train Epoch: 1 [29440/60000 (49%)]      Loss: 0.032309
Train Epoch: 1 [30080/60000 (50%)]      Loss: 0.058146
Train Epoch: 1 [30720/60000 (51%)]      Loss: 0.044536
Train Epoch: 1 [31360/60000 (52%)]      Loss: 0.023220
Train Epoch: 1 [32000/60000 (53%)]      Loss: 0.093438
Train Epoch: 1 [32640/60000 (54%)]      Loss: 0.022575
Train Epoch: 1 [33280/60000 (55%)]      Loss: 0.056749
Train Epoch: 1 [33920/60000 (57%)]      Loss: 0.030043
Train Epoch: 1 [34560/60000 (58%)]      Loss: 0.022908
Train Epoch: 1 [35200/60000 (59%)]      Loss: 0.084108
Train Epoch: 1 [35840/60000 (60%)]      Loss: 0.185571
Train Epoch: 1 [36480/60000 (61%)]      Loss: 0.017673
Train Epoch: 1 [37120/60000 (62%)]      Loss: 0.084662
Train Epoch: 1 [37760/60000 (63%)]      Loss: 0.080484
Train Epoch: 1 [38400/60000 (64%)]      Loss: 0.117529
Train Epoch: 1 [39040/60000 (65%)]      Loss: 0.003176
Train Epoch: 1 [39680/60000 (66%)]      Loss: 0.071565
Train Epoch: 1 [40320/60000 (67%)]      Loss: 0.108479
Train Epoch: 1 [40960/60000 (68%)]      Loss: 0.092688
Train Epoch: 1 [41600/60000 (69%)]      Loss: 0.048416
Train Epoch: 1 [42240/60000 (70%)]      Loss: 0.009381
Train Epoch: 1 [42880/60000 (71%)]      Loss: 0.038555
Train Epoch: 1 [43520/60000 (72%)]      Loss: 0.089673
Train Epoch: 1 [44160/60000 (74%)]      Loss: 0.020524
Train Epoch: 1 [44800/60000 (75%)]      Loss: 0.092968
Train Epoch: 1 [45440/60000 (76%)]      Loss: 0.068793
Train Epoch: 1 [46080/60000 (77%)]      Loss: 0.094527
Train Epoch: 1 [46720/60000 (78%)]      Loss: 0.154815
Train Epoch: 1 [47360/60000 (79%)]      Loss: 0.066463
Train Epoch: 1 [48000/60000 (80%)]      Loss: 0.037426
Train Epoch: 1 [48640/60000 (81%)]      Loss: 0.030952
Train Epoch: 1 [49280/60000 (82%)]      Loss: 0.013815
Train Epoch: 1 [49920/60000 (83%)]      Loss: 0.043523
Train Epoch: 1 [50560/60000 (84%)]      Loss: 0.044266
Train Epoch: 1 [51200/60000 (85%)]      Loss: 0.176199
Train Epoch: 1 [51840/60000 (86%)]      Loss: 0.024092
Train Epoch: 1 [52480/60000 (87%)]      Loss: 0.014346
Train Epoch: 1 [53120/60000 (88%)]      Loss: 0.038723
Train Epoch: 1 [53760/60000 (90%)]      Loss: 0.073435
Train Epoch: 1 [54400/60000 (91%)]      Loss: 0.017709
Train Epoch: 1 [55040/60000 (92%)]      Loss: 0.019962
Train Epoch: 1 [55680/60000 (93%)]      Loss: 0.106418
Train Epoch: 1 [56320/60000 (94%)]      Loss: 0.010950
Train Epoch: 1 [56960/60000 (95%)]      Loss: 0.023096
Train Epoch: 1 [57600/60000 (96%)]      Loss: 0.033030
Train Epoch: 1 [58240/60000 (97%)]      Loss: 0.007997
Train Epoch: 1 [58880/60000 (98%)]      Loss: 0.000659
Train Epoch: 1 [59520/60000 (99%)]      Loss: 0.001612
Epoch in  34.0s

Actual behaviour

(venv) home@daniel-tablet1:~/PycharmProjects/pytorch_dlprim$ python3 mnist.py --device opencl:0
/home/home/PycharmProjects/whisper/venv/lib/python3.10/site-packages/torchvision/io/image.py:13: UserWarning: Failed to load image Python extension: libc10_cuda.so: cannot open shared object file: No such file or directory
  warn(f"Failed to load image Python extension: {e}")
Accessing device #0:Intel(R) Iris(R) Plus Graphics [0x8a52] on Intel(R) OpenCL HD Graphics
Train Epoch: 1 [0/60000 (0%)]   Loss: 2.326378
Train Epoch: 1 [640/60000 (1%)] Loss: 1.373419
Train Epoch: 1 [1280/60000 (2%)]        Loss: 0.674224
Train Epoch: 1 [1920/60000 (3%)]        Loss: 0.342615
Train Epoch: 1 [2560/60000 (4%)]        Loss: 0.282575
Train Epoch: 1 [3200/60000 (5%)]        Loss: 0.321835
Train Epoch: 1 [3840/60000 (6%)]        Loss: 0.117600
Train Epoch: 1 [4480/60000 (7%)]        Loss: 0.174937
Train Epoch: 1 [5120/60000 (9%)]        Loss: 0.295922
Train Epoch: 1 [5760/60000 (10%)]       Loss: 0.179234
Train Epoch: 1 [6400/60000 (11%)]       Loss: 0.148632
Train Epoch: 1 [7040/60000 (12%)]       Loss: 0.247433
Train Epoch: 1 [7680/60000 (13%)]       Loss: 0.097251
Train Epoch: 1 [8320/60000 (14%)]       Loss: 0.170669
Train Epoch: 1 [8960/60000 (15%)]       Loss: 0.099438
Train Epoch: 1 [9600/60000 (16%)]       Loss: 0.183732
Train Epoch: 1 [10240/60000 (17%)]      Loss: 0.096929
Train Epoch: 1 [10880/60000 (18%)]      Loss: 0.091889
Train Epoch: 1 [11520/60000 (19%)]      Loss: 0.056076
Train Epoch: 1 [12160/60000 (20%)]      Loss: 0.081981
Train Epoch: 1 [12800/60000 (21%)]      Loss: 0.137648
Train Epoch: 1 [13440/60000 (22%)]      Loss: 0.124434
Train Epoch: 1 [14080/60000 (23%)]      Loss: 0.038791
Train Epoch: 1 [14720/60000 (25%)]      Loss: 0.150997
Train Epoch: 1 [15360/60000 (26%)]      Loss: 0.082680
Train Epoch: 1 [16000/60000 (27%)]      Loss: 0.044054
Train Epoch: 1 [16640/60000 (28%)]      Loss: 0.147787
Train Epoch: 1 [17280/60000 (29%)]      Loss: 0.047737
Train Epoch: 1 [17920/60000 (30%)]      Loss: 0.056453
Train Epoch: 1 [18560/60000 (31%)]      Loss: 0.023077
Train Epoch: 1 [19200/60000 (32%)]      Loss: 0.036574
Train Epoch: 1 [19840/60000 (33%)]      Loss: 0.011139
Train Epoch: 1 [20480/60000 (34%)]      Loss: 0.027549
Train Epoch: 1 [21120/60000 (35%)]      Loss: 0.028380
Train Epoch: 1 [21760/60000 (36%)]      Loss: 0.131590
Train Epoch: 1 [22400/60000 (37%)]      Loss: 0.192181
Train Epoch: 1 [23040/60000 (38%)]      Loss: 0.070133
Train Epoch: 1 [23680/60000 (39%)]      Loss: 0.124290
Train Epoch: 1 [24320/60000 (41%)]      Loss: 0.114533
Train Epoch: 1 [24960/60000 (42%)]      Loss: 0.011495
Train Epoch: 1 [25600/60000 (43%)]      Loss: 0.031055
Train Epoch: 1 [26240/60000 (44%)]      Loss: 0.058615
Train Epoch: 1 [26880/60000 (45%)]      Loss: 0.112524
Train Epoch: 1 [27520/60000 (46%)]      Loss: 0.029194
Train Epoch: 1 [28160/60000 (47%)]      Loss: 0.047580
Train Epoch: 1 [28800/60000 (48%)]      Loss: 0.022058
Train Epoch: 1 [29440/60000 (49%)]      Loss: 0.064951
Train Epoch: 1 [30080/60000 (50%)]      Loss: 0.081404
Train Epoch: 1 [30720/60000 (51%)]      Loss: 0.072505
Train Epoch: 1 [31360/60000 (52%)]      Loss: 0.096956
Train Epoch: 1 [32000/60000 (53%)]      Loss: 0.106381
Train Epoch: 1 [32640/60000 (54%)]      Loss: 0.018265
Train Epoch: 1 [33280/60000 (55%)]      Loss: 0.061221
Train Epoch: 1 [33920/60000 (57%)]      Loss: 0.070425
Train Epoch: 1 [34560/60000 (58%)]      Loss: 0.089722
Train Epoch: 1 [35200/60000 (59%)]      Loss: 0.151525
Train Epoch: 1 [35840/60000 (60%)]      Loss: 0.068132
Train Epoch: 1 [36480/60000 (61%)]      Loss: 0.011085
Train Epoch: 1 [37120/60000 (62%)]      Loss: 0.111000
Train Epoch: 1 [37760/60000 (63%)]      Loss: 0.040008
Train Epoch: 1 [38400/60000 (64%)]      Loss: 0.012150
Train Epoch: 1 [39040/60000 (65%)]      Loss: 0.059965
Train Epoch: 1 [39680/60000 (66%)]      Loss: 0.042966
Train Epoch: 1 [40320/60000 (67%)]      Loss: 0.109453
Train Epoch: 1 [40960/60000 (68%)]      Loss: 0.099907
Train Epoch: 1 [41600/60000 (69%)]      Loss: 0.073859
Train Epoch: 1 [42240/60000 (70%)]      Loss: 0.049867
Train Epoch: 1 [42880/60000 (71%)]      Loss: 0.033700
Train Epoch: 1 [43520/60000 (72%)]      Loss: 0.006360
Train Epoch: 1 [44160/60000 (74%)]      Loss: 0.051153
Train Epoch: 1 [44800/60000 (75%)]      Loss: 0.113450
Train Epoch: 1 [45440/60000 (76%)]      Loss: 0.008563
Train Epoch: 1 [46080/60000 (77%)]      Loss: 0.046368
Train Epoch: 1 [46720/60000 (78%)]      Loss: 0.089523
Train Epoch: 1 [47360/60000 (79%)]      Loss: 0.008030
Train Epoch: 1 [48000/60000 (80%)]      Loss: 0.237780
Train Epoch: 1 [48640/60000 (81%)]      Loss: 0.091529
Train Epoch: 1 [49280/60000 (82%)]      Loss: 0.022425
Train Epoch: 1 [49920/60000 (83%)]      Loss: 0.017645
Train Epoch: 1 [50560/60000 (84%)]      Loss: 0.022220
Train Epoch: 1 [51200/60000 (85%)]      Loss: 0.057755
Train Epoch: 1 [51840/60000 (86%)]      Loss: 0.016291
Train Epoch: 1 [52480/60000 (87%)]      Loss: 0.061722
Train Epoch: 1 [53120/60000 (88%)]      Loss: 0.046042
Train Epoch: 1 [53760/60000 (90%)]      Loss: 0.089375
Train Epoch: 1 [54400/60000 (91%)]      Loss: 0.017928
Train Epoch: 1 [55040/60000 (92%)]      Loss: 0.006611
Train Epoch: 1 [55680/60000 (93%)]      Loss: 0.012605
Train Epoch: 1 [56320/60000 (94%)]      Loss: 0.153086
Train Epoch: 1 [56960/60000 (95%)]      Loss: 0.037731
Train Epoch: 1 [57600/60000 (96%)]      Loss: 0.119136
Train Epoch: 1 [58240/60000 (97%)]      Loss: 0.029190
Train Epoch: 1 [58880/60000 (98%)]      Loss: 0.007807
Train Epoch: 1 [59520/60000 (99%)]      Loss: 0.051748
Epoch in  95.5s

About this issue

  • Original URL
  • State: open
  • Created 2 years ago
  • Comments: 27 (13 by maintainers)

Most upvoted comments

The pointwise/broadcast kernels are a special kind of kernel that serve as shortcuts for common simple operations: activations, the various reductions needed for normalization, loss functions, etc.

Here the actual line of code is taken and embedded into a kernel, like here: https://github.com/artyom-beilis/pytorch_dlprim/blob/master/src/pointwise_ops.cpp#L155
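Very roughly, the idea looks like the pyopencl sketch below (not the project's actual C++; make_pointwise_kernel is a made-up name for illustration): the elementwise expression is spliced as text into a generated kernel, compiled once, and then launched with one work-item per element.

```python
import numpy as np
import pyopencl as cl

def make_pointwise_kernel(ctx, expr):
    # 'expr' is a C expression over x[i], pasted verbatim into the kernel source
    src = f"""
    __kernel void pointwise(__global const float *x, __global float *y) {{
        int i = get_global_id(0);
        y[i] = {expr};
    }}
    """
    return cl.Program(ctx, src).build().pointwise

ctx = cl.create_some_context()
queue = cl.CommandQueue(ctx)
relu = make_pointwise_kernel(ctx, "fmax(x[i], 0.0f)")   # e.g. a ReLU activation

a = np.random.randn(1 << 20).astype(np.float32)
mf = cl.mem_flags
x_buf = cl.Buffer(ctx, mf.READ_ONLY | mf.COPY_HOST_PTR, hostbuf=a)
y_buf = cl.Buffer(ctx, mf.WRITE_ONLY, a.nbytes)

relu(queue, (a.size,), None, x_buf, y_buf)              # one work-item per element
out = np.empty_like(a)
cl.enqueue_copy(queue, out, y_buf)                      # blocking copy back to host
assert np.allclose(out, np.maximum(a, 0))
```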

Now, I agree that this code can be optimised, but it is also important to remember that GPU and CPU execution are asynchronous: as long as you can push kernels onto the execution queue faster than the GPU can run them, you don't bottleneck your system.
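To see what "pushing kernels faster than the GPU can run them" means in practice, here is a small pyopencl timing sketch (again only an illustration, not pytorch_dlprim code): the enqueue loop returns almost immediately, and the host only pays the full cost when it synchronizes.

```python
import time
import numpy as np
import pyopencl as cl

ctx = cl.create_some_context()
queue = cl.CommandQueue(ctx)
prg = cl.Program(ctx, """
__kernel void scale(__global float *x) {
    int i = get_global_id(0);
    x[i] *= 1.0001f;
}
""").build()

a = np.random.randn(1 << 22).astype(np.float32)
buf = cl.Buffer(ctx, cl.mem_flags.READ_WRITE | cl.mem_flags.COPY_HOST_PTR, hostbuf=a)

t0 = time.perf_counter()
for _ in range(200):
    prg.scale(queue, (a.size,), None, buf)   # enqueue only: returns right away
t_enqueue = time.perf_counter() - t0
queue.finish()                               # now wait for the GPU to drain the queue
t_total = time.perf_counter() - t0
print(f"enqueue: {t_enqueue*1e3:.1f} ms  total: {t_total*1e3:.1f} ms")
```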

Dear Artyom, let me show you something more interesting, on an AMD 7900XTX. I freshly built dlprim_flops on Windows and got the following result: 7900XTX.txt

Best wishes, Jinchuan

Cool. It looks like my FLOPS estimate is way off; according to Wikipedia it should be a much more capable GPU. That is why I get more than 100% of peak FLOPS for some GEMM ops.
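Roughly, the efficiency percentages in the tables below are computed like this (a back-of-the-envelope sketch; the measured time is an illustrative placeholder, not one of the benchmark results):

```python
# How a figure like "136.9 GFlops (34.09%)" in the table below is derived.
m = n = k = 512
flop = 2 * m * n * k              # each of the m*n*k multiply-adds counts as 2 FLOPs
seconds = 1.96e-3                 # hypothetical measured kernel time, for illustration only
gflops = flop / seconds / 1e9     # ~137 GFlops with this time
peak_gflops = 401.413             # theoretical peak quoted for the UHD 630 below
print(f"{gflops:.1f} GFlops ({100 * gflops / peak_gflops:.2f}%)")
# If peak_gflops is underestimated, the reported efficiency can exceed 100%.
```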

First of all, I'm not sure that clblast has much of an advantage over dlprimitives. For example, for GEMM with m=n=k=512 both give ~130-140 GFlops.

For example, my GEMM on Intel® UHD Graphics 630 [0x3e9b] on Intel® OpenCL HD Graphics, where this GPU's theoretical peak is around 401.413 GFLOPS:

dlprimitives sgemm

GEMM
  NN  0:  512,  512,  512      136.9 GFlops (34.09%)      1.6 GB/s (11.07%) limited by gflops 34.09%
  NT  0:  512,  512,  512      104.4 GFlops (26.01%)      1.2 GB/s ( 8.44%) limited by gflops 26.01%
  TN  0:  512,  512,  512      190.9 GFlops (47.56%)      2.2 GB/s (15.44%) limited by gflops 47.56%
  TT  0:  512,  512,  512      162.0 GFlops (40.35%)      1.9 GB/s (13.10%) limited by gflops 40.35%

clblast sgemm

NN: 143.772 GFLOPS
TN: 154.547 GFLOPS
NT: 129.747 GFLOPS
TT: 143.407 GFLOPS

So it is not necessarily clear-cut that clblast has better performance. Also note that when running convolution, the im2col step is actually integrated into the GEMM itself, so switching to clblast isn't trivial at all.
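To illustrate why that coupling matters, here is a naive, unfused NumPy sketch of convolution as im2col followed by a single GEMM, which is the route an external BLAS like clblast would need. It is purely illustrative; dlprimitives folds the im2col lowering into its GEMM kernel rather than materializing the matrix like this:

```python
import numpy as np

def im2col(x, kh, kw):
    # x: (C, H, W) -> matrix of shape (C*kh*kw, out_h*out_w); stride 1, no padding
    C, H, W = x.shape
    out_h, out_w = H - kh + 1, W - kw + 1
    cols = np.empty((C * kh * kw, out_h * out_w), dtype=x.dtype)
    row = 0
    for c in range(C):
        for i in range(kh):
            for j in range(kw):
                cols[row] = x[c, i:i + out_h, j:j + out_w].reshape(-1)
                row += 1
    return cols

def conv2d_im2col(x, w):
    # w: (K, C, kh, kw); the whole convolution becomes one (K x C*kh*kw) GEMM
    K, C, kh, kw = w.shape
    cols = im2col(x, kh, kw)
    out = w.reshape(K, -1) @ cols               # the GEMM step an external BLAS would do
    out_h, out_w = x.shape[1] - kh + 1, x.shape[2] - kw + 1
    return out.reshape(K, out_h, out_w)

x = np.random.randn(3, 8, 8).astype(np.float32)
w = np.random.randn(4, 3, 3, 3).astype(np.float32)
print(conv2d_im2col(x, w).shape)                # (4, 6, 6)
```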

On the same note, I should mention that Intel's own deep learning library does not run well for channels-first convolution, since they just don't optimise for it: https://github.com/oneapi-src/oneDNN/issues/1194. Actually, dlprimitives performs better for channels-first layout.
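For readers unfamiliar with the terminology, "channels first" vs "channels last" refers to the memory layout of the tensor; a minimal PyTorch snippet showing the two (the strides in the comments assume a contiguous 8x3x32x32 tensor):

```python
import torch

x = torch.randn(8, 3, 32, 32)                   # NCHW: PyTorch's default, "channels first"
print(x.stride())                               # (3072, 1024, 32, 1): each channel's HxW plane is contiguous
y = x.to(memory_format=torch.channels_last)     # same logical NCHW shape, NHWC ("channels last") memory order
print(y.shape, y.stride())                      # torch.Size([8, 3, 32, 32]) (3072, 1, 96, 3)
```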