llama.cpp: Error: inlining failed in call to ‘always_inline’ ‘_mm256_cvtph_ps’ on x86_64 - better support for different x86_64 CPU instruction extensions

When I compile with make, the following error occurs:

inlining failed in call to ‘always_inline’ ‘_mm256_cvtph_ps’: target specific option mismatch
   52 | _mm256_cvtph_ps (__m128i __A)

Running cc -I. -O3 -DNDEBUG -std=c11 -fPIC -pthread -mavx -msse3 -c ggml.c -o ggml.o reports the error, but running cc -I. -O3 -DNDEBUG -std=c11 -fPIC -pthread -msse3 -c ggml.c -o ggml.o does not. Must -mavx be used together with -mf16c?


OS: Arch Linux x86_64 Kernel: 6.1.18-1-lts

About this issue

  • State: closed
  • Created a year ago
  • Comments: 35

Most upvoted comments

We should set DEFINES for each featureflag and decide which code to use inside ggml.c on a more granular level.
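A rough sketch of what such granular selection could look like, assuming the standard GCC/Clang predefined macros (__SSE3__, __AVX__, __AVX2__, __F16C__). The FEAT_* bits, compiled_features and pick_path are hypothetical names for illustration and are not taken from ggml.c:

```c
/* Hypothetical feature bits; these names are not used by ggml. */
enum {
    FEAT_SSE3 = 1 << 0,
    FEAT_AVX  = 1 << 1,
    FEAT_AVX2 = 1 << 2,
    FEAT_F16C = 1 << 3
};

/* Collect the features the compiler was told to target
 * (for example via -msse3, -mavx, -mavx2, -mf16c). */
static unsigned compiled_features(void) {
    unsigned f = 0;
#if defined(__SSE3__)
    f |= FEAT_SSE3;
#endif
#if defined(__AVX__)
    f |= FEAT_AVX;
#endif
#if defined(__AVX2__)
    f |= FEAT_AVX2;
#endif
#if defined(__F16C__)
    f |= FEAT_F16C;
#endif
    return f;
}

/* Pick a code path from a feature set. AVX without F16C gets its own
 * branch instead of being treated like AVX with F16C. */
static const char *pick_path(unsigned f) {
    if ((f & FEAT_AVX2) && (f & FEAT_F16C)) return "avx2+f16c";
    if ((f & FEAT_AVX) && (f & FEAT_F16C))  return "avx+f16c";
    if (f & FEAT_AVX)                       return "avx, scalar fp16 conversion";
    if (f & FEAT_SSE3)                      return "sse3";
    return "scalar";
}
```

Calling pick_path(compiled_features()) reports the path for the current build flags. On the CPU from this issue (AVX but no F16C) it would land in the third branch.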

I made a patch, and with it make completes normally:

diff --git a/Makefile b/Makefile
index 1601079..cf4a536 100644
--- a/Makefile
+++ b/Makefile
@@ -90,6 +90,8 @@ ifeq ($(UNAME_M),$(filter $(UNAME_M),x86_64 i686))
                F16C_M := $(shell grep "f16c " /proc/cpuinfo)
                ifneq (,$(findstring f16c,$(F16C_M)))
                        CFLAGS += -mf16c
+               else ifneq (,$(findstring avx,$(AVX1_M)))
+                       CFLAGS := $(filter-out -mavx,$(CFLAGS))
                endif
                SSE3_M := $(shell grep "sse3 " /proc/cpuinfo)
                ifneq (,$(findstring sse3,$(SSE3_M)))

It would be great if @xiliuya and @polkovnikov could work together to both create a pull request with your patches so we can support a wider range of CPUs.

_mm256_cvtph_ps requires the F16C extension(?)

You need to add -mf16c to the build command

I believe that CPU supports only AVX, not AVX2. llama.cpp requires AVX2.

No. When I run cc -I. -O3 -DNDEBUG -std=c11 -fPIC -pthread -msse3 -c ggml.c -o ggml.o to generate the .o file and then run make, it works normally.

@gjmulder @xiliuya

I have the reported issue on my CPU. Apparently it has AVX, but no F16C (and no AVX2). It is a fairly old Intel laptop CPU, 10-15 years old.

Probably it is the case that some old CPUs have AVX while having no F16C.

I hit this compilation issue on Windows with the latest Clang 16 when passing -march=native. As you know, -march=native tells the compiler to use all features of the current CPU, and here it enables AVX but not F16C.

Compilation was fixed and the program worked (although not very fast) after I implemented these conversion functions myself and placed the following code inside the #elif defined(__AVX__) section of ggml.c:

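#if defined(__AVX__) && !defined(__F16C__)
// Scalar fallbacks for CPUs that have AVX but lack F16C.
static inline __m256 _mm256_cvtph_ps(__m128i x) {
    ggml_fp16_t src[8];
    float dst[8];
    _mm_storeu_si128((__m128i *)src, x);   // spill the 8 half values to memory
    for (int i = 0; i < 8; ++i)
        dst[i] = GGML_FP16_TO_FP32(src[i]);
    return _mm256_loadu_ps(dst);
}
static inline __m128i _mm256_cvtps_ph(__m256 x, int imm) {
    (void)imm;                             // rounding mode is ignored by this fallback
    float src[8];
    ggml_fp16_t dst[8];
    _mm256_storeu_ps(src, x);              // spill the 8 floats to memory
    for (int i = 0; i < 8; ++i)
        dst[i] = GGML_FP32_TO_FP16(src[i]);
    return _mm_loadu_si128((const __m128i *)dst);
}
#endif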

If any C/C++ gurus know a faster implementation of these functions for AVX, please share it here.

For now, I suggest that a volunteer put the fix above into the main branch, if the code is alright.
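For reference, the per-element conversion that a fallback like the one above depends on can be sketched in portable C. This is a generic IEEE 754 half-to-float bit conversion, not ggml's GGML_FP16_TO_FP32, which may be implemented differently. The names half_to_float and halves_to_floats are made up for this example:

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Convert one IEEE 754 binary16 value (given as raw bits) to float. */
static float half_to_float(uint16_t h) {
    uint32_t sign = (uint32_t)(h & 0x8000u) << 16;
    uint32_t exp  = (h >> 10) & 0x1Fu;
    uint32_t mant = h & 0x3FFu;
    uint32_t bits;

    if (exp == 0) {
        if (mant == 0) {
            bits = sign;                       /* signed zero */
        } else {
            /* Subnormal half: shift until the implicit bit appears. */
            uint32_t shift = 0;
            while ((mant & 0x400u) == 0) {
                mant <<= 1;
                ++shift;
            }
            mant &= 0x3FFu;
            bits = sign | ((127u - 15u + 1u - shift) << 23) | (mant << 13);
        }
    } else if (exp == 0x1Fu) {
        bits = sign | 0x7F800000u | (mant << 13);   /* infinity or NaN */
    } else {
        bits = sign | ((exp + (127u - 15u)) << 23) | (mant << 13);
    }

    float f;
    memcpy(&f, &bits, sizeof f);               /* bit copy without aliasing issues */
    return f;
}

/* Convert n half values, one element at a time. */
static void halves_to_floats(const uint16_t *src, float *dst, size_t n) {
    for (size_t i = 0; i < n; ++i)
        dst[i] = half_to_float(src[i]);
}
```

For example, half_to_float(0x3C00) gives 1.0f and half_to_float(0xC000) gives -2.0f. A loop like this is much slower than the F16C instruction, which matches the "not very fast" observation above.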

Try running:


grep -o 'avx2' /proc/cpuinfo

If it doesn’t print avx2 then AVX2 is not supported.
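One caveat with substring checks like this: "avx" also matches "avx2", and "sse3" also matches "ssse3", which is why the Makefile greps for "f16c " with a trailing space. A minimal sketch of a whole-word check in C, assuming the flags line has already been read from /proc/cpuinfo into a string (has_cpu_flag is a hypothetical helper, not part of llama.cpp):

```c
#include <string.h>

/* Return 1 if `flag` appears in `flags` as a whole whitespace-separated
 * token, 0 otherwise. A plain substring search is not enough here. */
static int has_cpu_flag(const char *flags, const char *flag) {
    size_t n = strlen(flag);
    if (n == 0)
        return 0;

    const char *p = flags;
    while ((p = strstr(p, flag)) != NULL) {
        /* The match must start at a token boundary... */
        int starts = (p == flags) || p[-1] == ' ' || p[-1] == '\t';
        /* ...and end at one as well. */
        int ends = p[n] == '\0' || p[n] == ' ' || p[n] == '\t' || p[n] == '\n';
        if (starts && ends)
            return 1;
        p += 1;   /* keep searching after this partial match */
    }
    return 0;
}
```

With this, has_cpu_flag("fpu ssse3 avx2 f16c", "avx") is 0 even though "avx" occurs inside "avx2".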

My CPU does not support AVX2, but it runs normally with the method above.

Looks like a duplicate of #107. Can you please confirm you’re running on native x86_64 and not emulated?

Yes, it is native, not a virtual environment such as Docker. I also get the error when I run cc -I. -O3 -DNDEBUG -std=c11 -fPIC -pthread -mavx -msse3 -c ggml.c -o ggml.o on other machines.

CFLAGS := $(filter-out -mavx,$(CFLAGS))

This patch allowed me to successfully run the make command.

@RiccaDS you can try merging #617; that should significantly boost AVX1 performance.

Good work guys. I am not a C++ programmer…

I am however interested in performance. I’d ideally want the most performant CPU code for any arch.

If it is Arch I’m guessing you’re using a very recent g++ version. I know it compiles with g++10 under Debian and Ubuntu. We haven’t collected data on other g++ versions.