llama.cpp: Docker issue "Illegal instruction"

I'm trying to run the Docker version on Unraid.

I run this as Post Arguments: --run -m /models/7B/ggml-model-q4_0.bin -p "This is a test" -n 512

I got this error: /app/.devops/tools.sh: line 40: 7 Illegal instruction ./main $arg2

Log:

main: seed = 1679843913
llama_model_load: loading model from '/models/7B/ggml-model-q4_0.bin' - please wait ...
llama_model_load: n_vocab = 32000
llama_model_load: n_ctx   = 512
llama_model_load: n_embd  = 4096
llama_model_load: n_mult  = 256
llama_model_load: n_head  = 32
llama_model_load: n_layer = 32
llama_model_load: n_rot   = 128
llama_model_load: f16     = 2
llama_model_load: n_ff    = 11008
llama_model_load: n_parts = 1
llama_model_load: type    = 1
llama_model_load: ggml ctx size = 4273.34 MB
llama_model_load: mem required  = 6065.34 MB (+ 1026.00 MB per state)
/app/.devops/tools.sh: line 40:     7 Illegal instruction     ./main $arg2

I have run this without any issues: --all-in-one "/models/" 7B

About this issue

  • Original URL
  • State: closed
  • Created a year ago
  • Reactions: 2
  • Comments: 24 (12 by maintainers)

Most upvoted comments

On a project with a million dependencies and libraries this might be a problem, but this project has no dependencies and builds on anything, so compilation shouldn't pose a problem nor take more than a few seconds.

A case for runtime detection:

In any reasonable, modern cloud deployment, llama.cpp would end up inside a container. In fact, being CPU-only, llama.cpp enables deploying your ML inference to something like AWS Lambda/GCP Cloud Run, providing very simple, huge scalability for inference. All these systems use containerization and expect you to have pre-built binaries ready to go. Compiling at container launch is not really an option, as that significantly increases cold-start/scale-up latencies (a few seconds is too long).

However, the higher up the serverless stack you go, the less control you have over the CPU platform underneath. GCP, for example, has machines from the Haswell era onwards all intermingled, and they don't even document what to expect for Cloud Functions or Cloud Run.

I'm not a C expert by any means, so it's not my wheelhouse to offer up a PR, but the case for this is pretty strong IMO.

@slaren That explains why the binaries are so inconsistent. We don't know what CPUs the GitHub runners are using, which makes the binaries unusable.
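A quick way to check which AVX variants a given machine (a GitHub runner, the Unraid host, or a container) actually exposes is to look at the CPU flags the kernel reports, for example:

# List the unique AVX-related feature flags the CPU reports (Linux only).
grep -o 'avx[0-9a-z_]*' /proc/cpuinfo | sort -u

If that list stops at plain avx but the binary was compiled with AVX2 enabled, you get exactly this kind of Illegal instruction crash.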

I didn't say every configuration set. Unless the runtime detection has only a negligible impact on performance, I think it's better for consumers to just get an image optimized for their architecture. Obviously there is a point of diminishing returns. But even Intel provides images optimized for AVX-512 [0].

[0] hub.docker.com/r/intel/intel-optimized-tensorflow-avx512

Sure, you could do automatic separate images for AVX / AVX2 / AVX512 like the Windows releases by just editing the action; no code change necessary.
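As a rough sketch of that idea (LLAMA_CMAKE_ARGS is a hypothetical build argument the Dockerfile would have to accept and forward to cmake; LLAMA_AVX2 and LLAMA_AVX512 are the existing CMake feature toggles), the action could loop over the instruction sets and tag one image per variant:

#!/bin/sh
# Hypothetical sketch: build one Docker image per instruction set by
# forwarding CMake feature toggles through a made-up build argument.
for isa in avx avx2 avx512; do
	case "$isa" in
		avx)    flags="-DLLAMA_AVX2=OFF -DLLAMA_AVX512=OFF" ;;
		avx2)   flags="-DLLAMA_AVX2=ON -DLLAMA_AVX512=OFF" ;;
		avx512) flags="-DLLAMA_AVX2=ON -DLLAMA_AVX512=ON" ;;
	esac
	docker build --build-arg LLAMA_CMAKE_ARGS="$flags" -t llama-cpp:"$isa" .
done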

Or, as the binaries are rather small, you could pack all of them into one image and add a simple script for "launch-time" detection if you will, something like this:

#!/bin/sh
# Pick the most capable binary the host CPU supports, based on the
# feature flags the kernel reports in /proc/cpuinfo.
cpuinfo="$(cat /proc/cpuinfo)"
if echo "$cpuinfo" | grep -q avx512; then
	./llama_avx512 "$@"
elif echo "$cpuinfo" | grep -q avx2; then
	./llama_avx2 "$@"
else
	./llama_avx "$@"
fi
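Inside the image, a script like this would presumably be set as the entrypoint (ENTRYPOINT in the Dockerfile) so that docker run transparently picks the right binary; the three llama_* binaries themselves would of course each have to be built beforehand with the matching compile flags.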

@gaby Yes, they can vary, but for compilation it doesn't matter which CPU the runner has, only for tests. As you can see in the discussion, the Windows builds always build AVX-512 but only test it when that's possible. If the Docker builder looks at its own features when compiling the binaries, then it's misconfigured. If I compile something for the Game Boy Advance on my x86 PC, it's not the features of my PC that I should choose when compiling. I'm not too familiar with Docker, but I suppose there has to be an option which would not ship precompiled binaries but rather keep the sources inside the container and compile them as the first step of installation.

But idk, the whole raison d'être for Docker containers is to deal with the huge mess of interconnected dependencies in the Linux world, which are hard to manage. This project doesn't contain any dependencies or libraries and can simply be built on any machine, so I don't understand the value proposition of Docker for this project at all, except the negative value of constantly having to deal with issues related to it.

If you are an absolute fan of Docker and you just absolutely, positively have to have it, the container could literally have a single .sh script which would do

git clone https://github.com/ggerganov/llama.cpp.git
cd llama.cpp
make

And that's it, lol. The beauty of having no libraries and dependencies.

For precompiled binaries the only option currently is to build packages for the different options, like the Windows releases. In the future a better option would be to detect the features at runtime, unless that can't be done without a performance penalty, but probably it can. It has to be researched a bit though, because it would affect inlining, which cannot be done when the code paths aren't static. If inlining gives a performance benefit then we've got to stick with the multiple builds, as speed > everything else.

It’s clear that as long as CPU features are determined at compile time, distributing binaries is going to cause problems like this.

Illegal instruction sounds like using an instruction which your processor does not support. I've touched on the issue in this discussion: