OpenBLAS: Performance: speed up GEMM by pre-faulting memory

I’ve found that, at least in some situations, GEMM can be particularly slow if the output array is not already backed by physical pages, and it can be 4x faster to memset the output array first (the time for the memset is included in the measured time). I don’t know exactly why this should be - my guesses are that either multiple threads page-faulting concurrently causes contention on some kernel lock, or the zero-filling the kernel does to provide fresh pages is thrashing the cache. The effect is more significant when using huge (2MB) pages, which numpy requests by default.
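
For what it’s worth, one way to check that page faults really are involved is to count minor faults around the timed region with getrusage() - just a diagnostic sketch that could be bolted onto the test program below, not something I’ve included in it:

// Diagnostic only: count the process's minor page faults so far, to confirm
// that the slow case is taking the fault-heavy path.
#include <sys/resource.h>

static long minor_faults()
{
    struct rusage ru;
    getrusage(RUSAGE_SELF, &ru);
    return ru.ru_minflt;
}

// Usage: record minor_faults() before and after the GEMM loop and print the
// difference; a large count with MEMSET_IN_LOOP=0 would support this theory.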

I haven’t experimented with other matrix shapes, types, transposition options, etc., or with other BLAS functions, so I’m not sure whether this is widespread or peculiar to my use case. But at least for compute-dense level 3 operations it seems worth being aware of.

Here’s some sample code to demonstrate the problem. It requires about 8GB of memory; if that’s too much, try reducing rep. Edit the #defines at the top of the file to try the different options. It requires C++11 and Linux (to get control over huge pages; let me know if you’d like a more portable version). I’m compiling with g++ -Wall -g -O3 -std=c++14 -march=native; I’m using OpenBLAS a6c2cb8 with USE_OPENMP=1.

I get (on a 6-core / 12-thread i7-9750H):

  • USE_HUGE_PAGES=1, MEMSET_IN_LOOP=0: 6.8s
  • USE_HUGE_PAGES=1, MEMSET_IN_LOOP=1: 1.7s
  • USE_HUGE_PAGES=0, MEMSET_IN_LOOP=0: 6.2s
  • USE_HUGE_PAGES=0, MEMSET_IN_LOOP=1: 4.0s

On the other hand, with OMP_NUM_THREADS=1, I get

  • USE_HUGE_PAGES=1, MEMSET_IN_LOOP=0: 3.9s
  • USE_HUGE_PAGES=1, MEMSET_IN_LOOP=1: 4.2s

So it’s not guaranteed that doing the memset will help. Possibly it should only be done when multithreading is going to happen (and touching just one byte per 4096-byte page, as in the sketch below, would probably be cheaper and cause less cache thrashing than a full memset).
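
As a rough illustration of that cheaper variant (my own sketch, not part of the benchmark; it uses the base page size reported by sysconf, which is 4KB here):

// Sketch of the cheaper pre-fault: write one byte per page so the kernel maps
// in the backing memory without streaming the whole buffer through the cache.
#include <unistd.h>
#include <cstddef>

static void prefault(void *data, std::size_t bytes)
{
    const std::size_t page = static_cast<std::size_t>(sysconf(_SC_PAGESIZE));
    volatile char *p = static_cast<volatile char *>(data);
    for (std::size_t i = 0; i < bytes; i += page)
        p[i] = 0;   // one write per page is enough to trigger the fault
}

// This could replace the memset inside the loop, e.g.
// prefault(c.get() + i * (n * m), n * m * sizeof(cplex));
// which is safe here because beta is zero anyway.

The full test program is below.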

#include <iostream>
#include <memory>
#include <chrono>
#include <complex>
#include <algorithm>
#include <cstring>   // memset
#include <cblas.h>
#include <sys/mman.h>

// Use madvise to request huge pages
#define USE_HUGE_PAGES 1
// Memset the output before entering the timing loop
#define MEMSET_FIRST 0
// Memset the output for each GEMM, inside the timing loop
#define MEMSET_IN_LOOP 0

static const size_t n = 1024;
static const size_t m = 1024;
static const size_t k = 32;
static const int rep = 1000;

struct cplex
{
    float re;
    float im;
};

class munmap_deleter
{
private:
    const std::size_t size;

public:
    explicit munmap_deleter(std::size_t size) : size(size) {}

    void operator()(cplex *data)
    {
        munmap(data, size);
    }
};

typedef std::unique_ptr<cplex[], munmap_deleter> data_ptr;

static data_ptr allocate(std::size_t elements)
{
    void *raw = mmap(nullptr, elements * sizeof(cplex), PROT_READ | PROT_WRITE,
                     MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (raw == MAP_FAILED)
        throw std::bad_alloc();
    // The deleter needs the mapping size in bytes, not the element count.
    data_ptr out{(cplex *) raw, munmap_deleter(elements * sizeof(cplex))};
#if USE_HUGE_PAGES
    madvise(out.get(), elements * sizeof(cplex), MADV_HUGEPAGE);
#endif
    return out;
}

int main()
{
    auto a = allocate(n * k * rep);
    auto b = allocate(k * m * rep);
    auto c = allocate(n * m * rep);
    std::fill(a.get(), a.get() + n * k * rep, cplex{1.0f, 0.5f});
    std::fill(b.get(), b.get() + k * m * rep, cplex{1.0f, 2.0f});
#if MEMSET_FIRST
    memset(c.get(), 0, n * m * rep * sizeof(cplex));
#endif
    const cplex alpha = {1.0f, 0.0f};
    const cplex beta = {0.0f, 0.0f};

    auto start = std::chrono::high_resolution_clock::now();
    for (int i = 0; i < rep; i++)
    {
#if MEMSET_IN_LOOP
        memset(c.get() + i * (n * m), 0, n * m * sizeof(cplex));
#endif
        cblas_cgemm(
            CblasRowMajor, CblasNoTrans, CblasNoTrans,
            n, m, k, &alpha,
            a.get() + i * (n * k), k,
            b.get() + i * (k * m), m,
            &beta,
            c.get() + i * (n * m), m);
    }
    auto stop = std::chrono::high_resolution_clock::now();
    std::chrono::duration<double> elapsed = stop - start;
    std::cout << "USE_HUGE_PAGES: " << USE_HUGE_PAGES << '\n';
    std::cout << "MEMSET_FIRST: " << MEMSET_FIRST << '\n';
    std::cout << "MEMSET_IN_LOOP: " << MEMSET_IN_LOOP << '\n';
    std::cout << elapsed.count() << '\n';
    return 0;
}
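
For completeness, here is a variant of allocate() that pre-faults at mmap time instead of via memset - a sketch I haven’t benchmarked; MAP_POPULATE is Linux-specific, like the rest of the program:

// Alternative pre-fault strategy: MAP_POPULATE asks the kernel to populate the
// page tables up front, so the first touch of each page never takes a fault.
// Reuses cplex, data_ptr and munmap_deleter from the listing above.
static data_ptr allocate_prefaulted(std::size_t elements)
{
    void *raw = mmap(nullptr, elements * sizeof(cplex), PROT_READ | PROT_WRITE,
                     MAP_PRIVATE | MAP_ANONYMOUS | MAP_POPULATE, -1, 0);
    if (raw == MAP_FAILED)
        throw std::bad_alloc();
    return data_ptr{(cplex *) raw, munmap_deleter(elements * sizeof(cplex))};
}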

About this issue

  • State: open
  • Created 3 years ago
  • Comments: 22 (12 by maintainers)

Most upvoted comments

Interesting, but I wonder how big the parameter space is here - architecture, kernel version, libc version

… number of threads, threading library, BLAS function, data type, matrix sizes, transpositions… it is rather intimidating, and unfortunately too big for me to consider volunteering to explore it with any thoroughness, particularly since I know very little about OpenBLAS and only interact with it via numpy.

You can disable vm overcommitment in various ways.

Sure, and I’m working around this problem by pre-faulting. I filed this because in an ideal world (in which OpenBLAS developers have infinite time), users wouldn’t need to make these decisions, particularly if the optimal choices depend on OpenBLAS internals like which problem sizes will be multi-threaded. I don’t really know how close to that ideal world OpenBLAS is, though.