distributed-llama: Master process crashes running out of memory on an 8 GB RPi 5

I set up a single master-worker pair to experiment with distributed-llama. The master is an RPi 5 with 8 GB RAM, and the only worker is an RPi 4 with the same amount of memory. When I run the inference, the master crashes after a while with a segmentation fault, and the worker then quits because the socket connection is closed. Any idea why? I tried the smallest model, llama-2-7b.

Terminal capture from the master:

segabor@bigfive:~/src/distributed-llama $ ./run.sh 
💡 dim: 4096
💡 hiddenDim: 11008
💡 nLayers: 32
💡 nHeads: 32
💡 nKvHeads: 32
💡 vocabSize: 32000
💡 seqLen: 2048
💡 nSlices: 2
./run.sh: line 9: 268004 Segmentation fault      sudo nice -n -20 ./main inference --model /mnt/data/llama-2-7b/dllama_llama-2-7b_q40.bin --tokenizer ./tokenizer.bin --weights-float-type q40 --buffer-float-type q80 --prompt "Hello world" --steps 16 --nthreads 4 --workers 30.0.0.12:9998

Worker capture:

^Csegabor@lohere:~/src/distributed-llama$ ./run.sh 
Listening on 0.0.0.0:9998...
Client connected
💡 sliceIndex: 1
💡 nSlices: 2
⏩ Received 56918016 bytes for block 0 (83092 kB/s)
⏩ Received 56918016 bytes for block 1 (112709 kB/s)
⏩ Received 56918016 bytes for block 2 (112486 kB/s)
⏩ Received 56918016 bytes for block 3 (112709 kB/s)
⏩ Received 56918016 bytes for block 4 (91069 kB/s)
⏩ Received 56918016 bytes for block 5 (114986 kB/s)
⏩ Received 56918016 bytes for block 6 (103865 kB/s)
⏩ Received 56918016 bytes for block 7 (106190 kB/s)
⏩ Received 56918016 bytes for block 8 (112709 kB/s)
⏩ Received 56918016 bytes for block 9 (63172 kB/s)
⏩ Received 56918016 bytes for block 10 (63172 kB/s)
⏩ Received 56918016 bytes for block 11 (63313 kB/s)
⏩ Received 56918016 bytes for block 12 (63313 kB/s)
⏩ Received 56918016 bytes for block 13 (63172 kB/s)
⏩ Received 56918016 bytes for block 14 (60810 kB/s)
⏩ Received 56918016 bytes for block 15 (64097 kB/s)
⏩ Received 56918016 bytes for block 16 (60551 kB/s)
⏩ Received 56918016 bytes for block 17 (60358 kB/s)
⏩ Received 56918016 bytes for block 18 (60423 kB/s)
⏩ Received 56918016 bytes for block 19 (61600 kB/s)
⏩ Received 56918016 bytes for block 20 (62205 kB/s)
⏩ Received 56918016 bytes for block 21 (61136 kB/s)
⏩ Received 56918016 bytes for block 22 (62138 kB/s)
⏩ Received 56918016 bytes for block 23 (64753 kB/s)
⏩ Received 56918016 bytes for block 24 (100208 kB/s)
⏩ Received 56918016 bytes for block 25 (112486 kB/s)
⏩ Received 56918016 bytes for block 26 (112486 kB/s)
⏩ Received 56918016 bytes for block 27 (114064 kB/s)
⏩ Received 56918016 bytes for block 28 (111823 kB/s)
⏩ Received 56918016 bytes for block 29 (111168 kB/s)
Error receiving data: socket closed
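
Worth noting: the master output ends with "Segmentation fault" rather than "Killed", so the kernel's OOM killer probably isn't what stopped the process. A quick way to double-check on the master, using only standard Linux tools (nothing specific to distributed-llama), would be something like:

# the kernel log records it when the OOM killer terminates a process
dmesg | grep -i -E "out of memory|oom-killer|killed process"
# watch memory headroom in a second terminal while the model loads
free -h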

About this issue

  • State: closed
  • Created 5 months ago
  • Comments: 15 (7 by maintainers)

Most upvoted comments

I think your results are correct. The problem here is that you are using devices with different processor speeds.

Basically, your result is limited by the slowest device. The CM4 is basically a RasPi4 (1.5 GHz), so in my tests I got 793.69 ms for 2x RasPi4B, and you have almost the same result. Distributed Llama doesn't split the load according to processor speed.

You should see much better results with 2x RasPi5.

Cool! The performance is slightly poor, though; a single RasPi4B reaches 1312.50 ms per token. Did you start the inference with --nthreads 4?

Here’s the listing of the llama-2-7b model and the converted weights file:

segabor@bigfive:~/src/distributed-llama $ ls -l /mnt/data/llama-2-7b/
total 17024788
-rw-r--r-- 1 segabor segabor         100 Jan 25 07:09 checklist.chk
-rw-r--r-- 1 segabor segabor 13476925163 Jan 25 07:09 consolidated.00.pth
-rw-r--r-- 1 segabor segabor  3956441088 Jan 25 13:26 dllama_llama-2-7b_q40.bin
-rw-r--r-- 1 segabor segabor         105 Jan 25 09:46 params.json

Thanks! Apparently the size of my weights doesn’t match the right file on your list. It’s the 70b one! I’m going to close this ticket, no error!

Could you confirm the size of your weights file?

b4rtaz@b4rtazs-MacBook-Pro converter % ls -l
total 267075104
drwxr-xr-x@ 3 b4rtaz  staff           96 Dec  9 00:40 __pycache__
-rw-r--r--@ 1 b4rtaz  staff         6310 Jan  7 22:12 converter.py
-rw-r--r--@ 1 b4rtaz  staff   7887097884 Jan  8 13:09 dllama_llama-2-13b_q40.bin
-rw-r--r--@ 1 b4rtaz  staff  39706066972 Jan  8 01:05 dllama_llama-2-70b_q40.bin
-rw-r--r--@ 1 b4rtaz  staff   4242882588 Jan  7 22:23 dllama_llama-2-7b_q40.bin
...

In your root node logs I don’t see this part:

...
💡 nSlices: 1
⏩ Loaded 4242882560 bytes

So it looks like the weights file doesn’t contain all of its bytes.
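
Comparing the two listings: 4242882588 - 3956441088 = 286441500 bytes, so roughly 286 MB of the converted file appear to be missing. A quick way to verify this on the master; the path and the expected size are just the values already quoted in this thread:

stat -c %s /mnt/data/llama-2-7b/dllama_llama-2-7b_q40.bin
# expected for a complete llama-2-7b q40 conversion (per the listing above): 4242882588
# a smaller number means the conversion or the copy was cut short; re-run converter.py and check again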

  1. Could you run a single instance on your RPi 4?
  2. Could you run a single instance on your RPi 5?
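
A single-instance run presumably just means the same command as in run.sh with the --workers argument dropped; the root node should then report nSlices: 1 and the loaded byte count, as in the snippet above. For example (flags copied verbatim from the master capture at the top of this issue):

sudo nice -n -20 ./main inference --model /mnt/data/llama-2-7b/dllama_llama-2-7b_q40.bin --tokenizer ./tokenizer.bin --weights-float-type q40 --buffer-float-type q80 --prompt "Hello world" --steps 16 --nthreads 4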