OpenBLAS: Multi-Thread & Speed Problem

Hi everyone, I am not sure whether the speed I got is the fastest possible result, so I want to check here; in my opinion it could be faster. I ran a test.c like this:

#include <stdio.h>
#include "/usr/local/openblas_a9/include/cblas.h"
#include <time.h>
#include <stdlib.h>
#define random(x)(rand()%x)

int main()
{
   int M=160*160, K=32*9, N=64;
   int i;
   float *A=(float*)malloc(sizeof(float)*M*K);
   for(i=0;i<M*K;i++)
   {
      A[i]=(float)random(10);   /* was A[K]: only one element was ever initialized */
   }

   float *B=(float*)malloc(sizeof(float)*N*K);
   for(i=0;i<N*K;i++)
   {
      B[i]=(float)random(10);   /* was B[K] */
   }
   float *C=(float*)malloc(sizeof(float)*N*M);

   float alpha=1;
   float beta=0;
   int lda=K;
   int ldb=N;
   int ldc=N;

   clock_t start1,end1;
   start1=clock();   /* note: clock() measures CPU time, summed over all threads */
   cblas_sgemm(CblasRowMajor,CblasNoTrans,CblasNoTrans,M,N,K,alpha,A,lda,B,ldb,beta,C,ldc);
   end1=clock();
   printf("timeusing:%lf",((double)end1 - (double)start1)/CLOCKS_PER_SEC);
   printf("\n");

   free(A);
   free(B);
   free(C);
   return 0;
}

Before compiling, I tried to enable multi-threading by entering this in my Ubuntu terminal: export OPENBLAS_NUM_THREADS=4. Then I compiled test.c with this command: arm-linux-gnueabihf-gcc -static -o test test.c -I /usr/local/openblas_a9/include/ -L /usr/local/openblas_a9/lib /usr/local/openblas_a9/lib/libopenblas.a -lpthread -lgfortran -std=c99 -mcpu=cortex-a9 -mfpu=neon-fp16 -mfloat-abi=hard -O3 -ffast-math

The CPU that runs the executable is an Exynos 4412 (Cortex-A9).

The time cblas_sgemm takes is around 0.48 s. Without export OPENBLAS_NUM_THREADS=4, the time is 0.68 s.

I want to know whether I did something wrong in how I use OpenBLAS to compute the matrix product, and whether the speed is reasonable; if not, how can I speed it up?

Thanks for your attention!

About this issue

  • State: closed
  • Created 4 years ago
  • Comments: 27 (3 by maintainers)

Most upvoted comments

Pretty sure now that by using clock() you are measuring the sum of processor time across all threads. When I rewrite it to calculate elapsed time rather than cpu time, single thread is still at 0.29s but with four threads it completes in 0.09s (0.16 for two threads, 0.12 with three threads). This is with the small changes from the old “optimized for deep learning” branch, i.e. https://github.com/xianyi/OpenBLAS/commit/92058a75e2cc1e46e73e2784f691a2bcb2f9aef9 (I believe I can only test with softfp on my Tinkerboard running Linaro, but I see no reason why hardfp would be different).