grok-1: Segmentation fault in K8s Pod (8x H100s)
Hi, I am trying to run grok-1, but the python3 ./run.py process eventually exits with a segmentation fault after running for about 10 minutes at 800% CPU usage.
I am running it in a K8s pod (with /dev/shm of 640Gi; 58 CPU threads [AMD EPYC 9554]; 1280 Gi RAM) with 8x H100 GPUs.
There is not much in the logs:
I can quickly restart the process now as I am in the pod:
pkill gotty
cd /grok-1
gotty -w python3 ./run.py
Ideas?
Commands used to deploy it
apt-get update ; apt-get upgrade -y ;
apt-get install pip wget git -y;
pip install dm_haiku==0.0.12;
pip install jax[cuda12_pip]==0.4.25 -f https://storage.googleapis.com/jax-releases/jax_cuda_releases.html
pip install numpy==1.26.4;
pip install sentencepiece==0.2.0;
pip install -U "huggingface_hub[cli]";
git clone https://github.com/xai-org/grok-1;
wget https://github.com/yudai/gotty/releases/download/v2.0.0-alpha.3/gotty_2.0.0-alpha.3_linux_amd64.tar.gz;
tar -zxvf gotty_2.0.0-alpha.3_linux_amd64.tar.gz ; chmod +x gotty ; rm -rf gotty_2.0.0-alpha.3_linux_amd64.tar.gz ; mv gotty /usr/local/bin/;
huggingface-cli download xai-org/grok-1 --repo-type model --include ckpt-0/tensor* --local-dir /grok-1/checkpoints --local-dir-use-symlinks False;
mv /grok-1/checkpoints/ckpt /grok-1/checkpoints/ckpt-0;
mkdir /root/shm;
sed -i "s;/dev/shm/;/root/shm/;g" /grok-1/checkpoint.py;
cd /grok-1 && gotty -w python3 ./run.py;
Update 1
I’m now trying to run it directly with python3 ./run.py (without gotty).
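Since the segfault leaves almost nothing in the logs, one thing that might help is Python's built-in faulthandler module, which dumps the Python traceback of every thread when the interpreter receives SIGSEGV. This is only a minimal sketch: the helper name enable_crash_tracebacks is hypothetical, and it assumes you are willing to call it at the very top of run.py.

```python
import faulthandler
import sys


def enable_crash_tracebacks(stream=None):
    """Hypothetical helper: dump Python tracebacks on fatal signals.

    `stream` must be a real file object with a file descriptor;
    it defaults to the process's stderr.
    """
    if stream is None:
        stream = sys.stderr
    # Installs handlers for SIGSEGV, SIGFPE, SIGABRT, SIGBUS and SIGILL
    # and prints the traceback of all threads when one of them fires.
    faulthandler.enable(file=stream, all_threads=True)
    return faulthandler.is_enabled()
```

The same effect should be available without editing the script by running python3 -X faulthandler ./run.py. Note that this only shows the Python frames; a crash inside a native library would still need a core dump or a debugger to see the native stack.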
About this issue
- State: open
- Created 3 months ago
- Comments: 17
Wow, so even people who have the right hardware can’t run this normally? Haha, that is a FAIL!
Zblocker64 (on Discord) suggested increasing the stack size (ulimit -s), which is set to 8192 (KiB, i.e. 8 MiB) by default. I’ll try doubling it to 16384 to see if that helps with Python’s segmentation fault.
FWIW: on Windows the default stack size is typically 1 MB (the linker's default reserve size, for both 32-bit and 64-bit applications). macOS typically defaults to 8 MB (but this depends on the macOS version and the way the app was compiled).
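For anyone who wants to try the stack-size suggestion from inside Python instead of via ulimit -s, here is a minimal sketch using the standard resource module (Unix only). The helper name raise_stack_limit and the 16384 KiB target are illustrative assumptions, and whether a larger stack actually avoids the segfault is untested.

```python
import resource


def raise_stack_limit(target_kib=16384):
    """Hypothetical helper: raise the soft stack limit to target_kib.

    Returns the (soft, hard) limits in bytes after the attempt.
    """
    soft, hard = resource.getrlimit(resource.RLIMIT_STACK)
    target = target_kib * 1024  # ulimit -s is in KiB, rlimits are in bytes

    # The soft limit can never exceed the hard limit.
    if hard != resource.RLIM_INFINITY:
        target = min(target, hard)

    # Only ever raise the limit; leave it alone if it is already large enough.
    if soft == resource.RLIM_INFINITY or soft >= target:
        return soft, hard

    resource.setrlimit(resource.RLIMIT_STACK, (target, hard))
    return resource.getrlimit(resource.RLIMIT_STACK)
```

Threads created by native libraries may still use a default stack size that was computed when the process started, so running ulimit -s 16384 in the shell that launches python3 ./run.py is probably the safer way to test this.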
Sorry, I just realised the other issue was actually in a linked repository. It just showed up on this PR, but it’s a different repo.
Thanks, I am well aware of that and have used it eventually. But I had to use the code someone else pasted, as I’ve been asking to review it first, hence I had to use those pip install's.
The / disk has 1 TiB of space and only about 300 GiB are used; /dev/shm is set to 640 GiB (tmpfs).
Also, not sure where you found the “link with the traceback” you referred to?
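To double-check the space figures above from inside the pod, a short sketch with the standard shutil.disk_usage call could look like this. The report helper and its default paths are assumptions for illustration; adjust them to wherever the checkpoint is actually staged (for example /root/shm after the sed change).

```python
import shutil

GIB = 1024 ** 3


def report(paths=("/", "/dev/shm")):
    """Hypothetical helper: one summary line per mount point."""
    lines = []
    for path in paths:
        usage = shutil.disk_usage(path)  # named tuple: total, used, free (bytes)
        lines.append(
            f"{path}: {usage.used / GIB:.1f} GiB used of "
            f"{usage.total / GIB:.1f} GiB"
        )
    return lines


if __name__ == "__main__":
    for line in report():
        print(line)
```

Keep in mind that a tmpfs such as /dev/shm is backed by RAM, so whatever is stored there also counts against the pod's memory.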
The user is pulling our leg when they say “model should be called SIGINT”; they are just making fun of it crashing for them and not adding anything of value to the ticket.