llama.cpp: [User] Failed to execute any models on s390x

Prerequisites

Please answer the following questions for yourself before submitting an issue.

  • I am running the latest code. Development is very rapid so there are no tagged versions as of now.
  • I carefully followed the README.md.
  • I searched using keywords relevant to my issue to make sure that I am creating a new issue that is not already open (or closed).
  • I reviewed the Discussions, and have a new bug or useful enhancement to share.

Expected Behavior

Allow llama.cpp to be executed on the s390x architecture.

I am curious whether there is a big-endian/little-endian issue with the gguf model format. My system is big endian.

BTW, if you can point me to how support for a new set of SIMD instructions is added, I can try to add s390x SIMD support myself.

Thank you.

Current Behavior

I can compile this program on s390x by commenting out line 50 of k_quants.c (the immintrin.h include):

    #if !defined(__riscv)
    //#include <immintrin.h>
    #endif

And I can execute ./main -h

But if I execute it with a real model, I get an invalid magic number error. Is there an endianness issue?

[root@aiu llama.cpp]# ./main -m ./models/ggml-vocab-llama.gguf
Log start
main: build = 1265 (324f340)
main: built with cc (GCC) 10.2.1 20201112 (Red Hat 10.2.1-8) for s390x-redhat-linux
main: seed  = 1695309361
gguf_init_from_file: invalid magic number 47475546
error loading model: llama_model_loader: failed to load model from ./models/ggml-vocab-llama.gguf

llama_load_model_from_file: failed to load model
llama_init_from_gpt_params: error: failed to load model './models/ggml-vocab-llama.gguf'
main: error: unable to load model

Environment and Context

Please provide detailed information about your computer setup. This is important in case the issue is not reproducible except for under certain specific conditions.

  • Physical (or virtual) hardware you are using, e.g. for Linux:

$ lscpu

Architecture:        s390x
CPU op-mode(s):      32-bit, 64-bit
Byte Order:          Big Endian
CPU(s):              4
On-line CPU(s) list: 0-3
Thread(s) per core:  1
Core(s) per socket:  8
Socket(s) per book:  2
Book(s) per drawer:  4
Drawer(s):           4
NUMA node(s):        1
Vendor ID:           IBM/S390
Machine type:        3931
CPU dynamic MHz:     5200
CPU static MHz:      5200
BogoMIPS:            3331.00
Hypervisor:          PR/SM
Hypervisor vendor:   IBM
Virtualization type: full
Dispatching mode:    horizontal
L1d cache:           128K
L1i cache:           128K
L2 cache:            32768K
L3 cache:            262144K
NUMA node0 CPU(s):   0-3
Flags:               esan3 zarch stfle msa ldisp eimm dfp edat etf3eh highgprs te vx vxd vxe gs vxe2 vxp sort dflt sie
  • Operating System, e.g. for Linux:

$ uname -a

Linux 4.18.0-305.el8.s390x #1 SMP Thu Apr 29 09:06:01 EDT 2021 s390x s390x s390x GNU/Linux

  • SDK version, e.g. for Linux:
$ python3 --version
$ make --version
$ g++ --version

Python 3.9.2
GNU Make 4.2.1, built for s390x-ibm-linux-gnu
g++ (GCC) 10.2.1 20201112 (Red Hat 10.2.1-8)

About this issue

  • Original URL
  • State: closed
  • Created 9 months ago
  • Comments: 18 (8 by maintainers)

Most upvoted comments

Cool, people have IBM mainframes at home now 😃

The model magic is reversed, so that is indeed an endianness problem. This is either going to be really easy, or particularly painful to solve.
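For a quick sanity check (plain Python, assuming the magic is the four ASCII bytes "GGUF" and the loader compares against its little-endian interpretation), the reported 47475546 is exactly what a big-endian read of those bytes produces:

    import struct

    magic = b"GGUF"
    print(hex(struct.unpack("<I", magic)[0]))  # 0x46554747 -- little-endian interpretation (what the loader expects)
    print(hex(struct.unpack(">I", magic)[0]))  # 0x47475546 -- what a big-endian host reads from the same bytes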

You could try downloading a raw model, and converting/quantizing it directly on that particular machine.

@KerfuffleV2 Do you happen to know if the convert scripts are endian-aware? Supposedly the pytorch library is, so in case convert can't do that, maybe "reconverting" the pytorch model would solve this? What do you think?

I finally fixed it.

@KerfuffleV2 Many thanks for the help. Along the journey I tried a lot of wrong approaches, but got to the right direction in the end.

I was using convert.py instead of convert-baichuan-hf-to-gguf.py. What is the difference between them?

This script does not call add_tensor, but it does the same thing in its write_all function:

    @staticmethod
    def write_all(fname_out: Path, ftype: GGMLFileType, params: Params, model: LazyModel, vocab: Vocab, svocab: gguf.SpecialVocab, concurrency: int = DEFAULT_CONCURRENCY) -> None:
        check_vocab_size(params, vocab)

        of = OutputFile(fname_out)

        # meta data
        of.add_meta_arch(params)
        of.add_meta_vocab(vocab)
        of.add_meta_special_vocab(svocab)

        # tensor info
        for name, lazy_tensor in model.items():
            of.add_tensor_info(name, lazy_tensor)
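(The quote above is truncated mid-function.) For illustration only, here is a hedged sketch of what an equivalent byteswap could look like in the write_all path, assuming the tensors arrive as little-endian numpy arrays; the helper name is made up and is not part of convert.py:

    import numpy as np

    def to_host_byte_order(arr: np.ndarray) -> np.ndarray:
        # Hypothetical helper: return the array in the host's native byte order,
        # swapping bytes only when the dtype's order disagrees with the host's.
        if arr.dtype.byteorder in ('=', '|'):
            return arr                                   # already native, or byte order not applicable
        return arr.astype(arr.dtype.newbyteorder('='))   # astype performs the actual byte swap

With the tensor data in native order, the rest of the writer can serialize it unchanged.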

The result after the fix:

system_info: n_threads = 4 / 4 | AVX = 0 | AVX2 = 0 | AVX512 = 0 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | FMA = 0 | NEON = 0 | ARM_FMA = 0 | F16C = 0 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 0 | SSE3 = 0 | SSSE3 = 0 | VSX = 0 |
sampling: repeat_last_n = 64, repeat_penalty = 1.100000, presence_penalty = 0.000000, frequency_penalty = 0.000000, top_k = 40, tfs_z = 1.000000, top_p = 0.950000, typical_p = 1.000000, temp = 0.800000, mirostat = 0, mirostat_lr = 0.100000, mirostat_ent = 5.000000
generate: n_ctx = 512, n_batch = 512, n_predict = 400, n_keep = 0


 Build web site with 10 steps: first step, make sure the software program will be compatible using your computer or laptop.
Download our free demo to see how we can help you get more customers and increase sales! This is a very easy-to follow guide that shows even those who have never done anything like this before how it_s done. Just click on _free download_, enter the information, then sit back as your brand new website comes up fast with all of its features included for free _ no credit card necessary or any other payment plan required!
As you can see from our web site that we are one of the best SEO companies around and have been helping businesses get

I think the difference is asserts probably don’t run when compiled normally with optimization, so calculations just produce weird values instead of failing outright. There are two possible causes I can think of, the first is that the tensor data is just incorrect in the actual model file. The other is that the way the GGML operations are implemented just doesn’t work on big endian for some reason.

Hmm… I don’t know how much time you want to put into trying various random stuff but one thing to try would be just printing out some fairly small tensor. Maybe that could even be done from the conversion script. blk.0.attn_norm.weight looks like a good candidate, since it’s 1 dimensional.

You could possibly try something like:

    def add_tensor(self, name: str, tensor: np.ndarray[Any, Any], raw_shape: Sequence[int] | None = None, raw_dtype: GGMLQuantizationType | None = None):
        if name == 'blk.0.attn_norm.weight':
            print('BEFORE', tensor)
        tensor.byteswap(inplace=True)
        if name == 'blk.0.attn_norm.weight':
            print('AFTER', tensor)

That just dumps the tensor before and after the byteswap stuff. If byteswapping was needed, I’d expect to see a bunch of crazy values in the “before” and more normal values in “after”.

The other place to do something similar would be in the main example or something, to just look up a small tensor like that one and dump some of the values right after loading the model. Waste time following my advice at your own peril. 😃

@KerfuffleV2 It seems there is an endian conversion issue.

I printed the first 16 words of the data member of model layer 0's attn_norm with gdb.

The first float32 is 0x3d8d0000 (value 0.0688476563) on x86. But it is 0x8d3d0000 on s390, while in fact the big-endian representation on s390 should be 0x00008d3d.

x86:

(gdb) x/16xw model.layers[0].attn_norm.data
0x7ffce5df99a0: 0x3d8d0000      0x3d250000      0x3de70000      0x3d630000
0x7ffce5df99b0: 0x3d3b0000      0x3d590000      0x3d3b0000      0x3d120000
0x7ffce5df99c0: 0x3d460000      0x3d4d0000      0x3d380000      0x3d270000
0x7ffce5df99d0: 0x3d3d0000      0x3dae0000      0x3d310000      0x3d040000

s390:

(gdb) x/16xw model.layers[0].attn_norm.data
0x3fcec14b9a0:  0x8d3d0000      0x253d0000      0xe73d0000      0x633d0000
0x3fcec14b9b0:  0x3b3d0000      0x593d0000      0x3b3d0000      0x123d0000
0x3fcec14b9c0:  0x463d0000      0x4d3d0000      0x383d0000      0x273d0000
0x3fcec14b9d0:  0x3d3d0000      0xae3d0000      0x313d0000      0x043d0000
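For reference, a small Python illustration (not from the thread) of how that first float's bytes land under each byte order:

    import struct

    bits = 0x3d8d0000
    value = struct.unpack("<f", struct.pack("<I", bits))[0]  # 0.06884765625
    le_bytes = struct.pack("<f", value)   # 00 00 8d 3d -- how a little-endian GGUF file stores it
    be_bytes = struct.pack(">f", value)   # 3d 8d 00 00 -- the correct in-memory layout for s390x
    print(value, le_bytes.hex(), be_bytes.hex())

A correctly byte-swapped model would display as 0x3d8d0000 in gdb on s390 too (x/xw prints the word in the target's own byte order), a raw little-endian read without any swap would display as 0x00008d3d, and the observed 0x8d3d0000 matches neither, which is consistent with the endian conversion issue described above.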

but then the pytorch model would have to be converted to gguf?

Unfortunately, that still wouldn't help you. The gguf Python code explicitly saves everything as little endian, so things like the lengths of items, etc. are all going to be in little-endian.

I think probably the easiest "fix", if one can convert the file themselves, is to edit gguf-py/gguf/gguf.py and change all the struct formats to use either native (=) or BE (>) byte order. Also, don't use convert.py to quantize to q8_0, or change the struct formats there as well.

gguf-py/gguf/gguf.py
486:        self.fout.write(struct.pack("<I", GGUF_MAGIC))
487:        self.fout.write(struct.pack("<I", GGUF_VERSION))
488:        self.fout.write(struct.pack("<Q", self.ti_data_count))
489:        self.fout.write(struct.pack("<Q", self.kv_data_count))
562:        GGUFValueType.UINT8:   "<B",
563:        GGUFValueType.INT8:    "<b",
564:        GGUFValueType.UINT16:  "<H",
565:        GGUFValueType.INT16:   "<h",
566:        GGUFValueType.UINT32:  "<I",
567:        GGUFValueType.INT32:   "<i",
568:        GGUFValueType.FLOAT32: "<f",
569:        GGUFValueType.UINT64:  "<Q",
570:        GGUFValueType.INT64:   "<q",
571:        GGUFValueType.FLOAT64: "<d",
579:            self.kv_data += struct.pack("<I", vtype)
587:            self.kv_data += struct.pack("<Q", len(encoded_val))
593:            self.kv_data += struct.pack("<I", ltype)
594:            self.kv_data += struct.pack("<Q", len(val))
608:        self.ti_data += struct.pack("<Q", len(encoded_name))
611:        self.ti_data += struct.pack("<I", n_dims)
613:            self.ti_data += struct.pack("<Q", tensor_shape[n_dims - 1 - i])
618:        self.ti_data += struct.pack("<I", dtype)
619:        self.ti_data += struct.pack("<Q", self.offset_tensor)

Not too hard to replace, since they all start with "<".
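As a rough sketch of that edit (using the GGUF magic as an example; the actual constants live in gguf.py), switching to native or big-endian order is a one-character change per format string:

    import struct

    GGUF_MAGIC = 0x46554747  # "GGUF" interpreted as a little-endian uint32

    header_le     = struct.pack("<I", GGUF_MAGIC)  # current behaviour: always little-endian, b"GGUF" on disk
    header_native = struct.pack("=I", GGUF_MAGIC)  # native order: big-endian on s390x, little-endian on x86
    header_be     = struct.pack(">I", GGUF_MAGIC)  # forced big-endian, b"FUGG" on disk

A file written this way is only readable by a llama.cpp build whose native order matches, which is exactly the hack being described.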

The current situation undoubtedly isn't ideal though; this is just a hack to get it working by any means.

Basically, I found this: https://github.com/pytorch/pytorch/issues/65300

In the comments they explain that pytorch can open models created on a machine with different endianness, which made me think: perhaps using pytorch on the target machine to load a model and write it out again to another file would "fix" the endianness, but then the pytorch model would have to be converted to gguf? I found #707 mentioning use of the old convert script for pytorch to ggml, which could then be converted to gguf.
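For what it's worth, the idea being floated is just a load-and-resave round trip; a minimal sketch, assuming a plain state-dict checkpoint (the file names are placeholders):

    import torch

    # Load the checkpoint on the big-endian target machine, then write it back out.
    state = torch.load("pytorch_model.bin", map_location="cpu")
    torch.save(state, "pytorch_model.resaved.bin")

Whether the re-saved file then converts cleanly is the open question in the rest of the thread.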

As long as the model is in the correct endianness for the host, it shouldn't matter which endianness the host uses; all file and memory reads/writes will be symmetric.

Even if the code uses 0xNNNN literals, like for the model magic, as long as the binary writing it has the same endianness as the one reading it, the value will be correct during execution.

The only thing that could break things is union-like downcasting, but I have no idea whether it's used anywhere.

Edit: Ok, this will be a problem: https://github.com/ggerganov/llama.cpp/blob/36b904e20003017f50108ae68359ef87a192dae2/ggml.c#L374-L390
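Not the actual ggml code, but a rough numpy illustration of the general concern: half-precision bit patterns written little-endian read back as garbage if a big-endian host reinterprets the raw bytes without swapping them:

    import numpy as np

    x = np.array([1.5], dtype='<f2')                 # float16 in little-endian file layout (bytes 00 3e)
    naive = np.frombuffer(x.tobytes(), dtype='>f2')  # a big-endian host reinterpreting the same bytes
    fixed = x.astype(x.dtype.newbyteorder('>'))      # an explicit byte swap preserves the value
    print(x[0], naive[0], fixed[0])                  # 1.5  ~3.7e-06  1.5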

Do you happen to know if the convert scripts are endian-aware?

Pretty sure they are; they use numpy and Python's struct with the correct format for little-endian. (Going by memory here.)

What I’d wonder about is actually loading the model. llama.cpp just mmaps the file, which is going to be full of little-endian data. I don’t think there’s any conversion or special handling to convert to big endian after loading, so… I’d expect that won’t work so great, even if you could get past the file magic being wrong.

You can try just hex-editing it to what is expected (or hacking the code to ignore the error and proceed). I doubt it'll work though; I think all the quantizations use at least f16s, which will be wonky on big endian? Not 100% sure though.

To be honest, I really wanted this to work 😃 I’m absolutely in love with the concept of putting LLM on a museum grade hardware, I’ve been hunting for a green phosphor CRT on eBay for months, for that exact reason 😃 Fallout style terminal and stuff 😃