grok-1: Segmentation fault in K8s Pod (8x H100's)

Hi, I am trying to run it, but the python3 ./run.py process exits eventually after it’s been running for about 10 minutes at 800% cpu usage. I am running it in a K8s pod (with /dev/shm of 640Gi; 58 CPU threads [AMD EPYC 9554]; 1280 Gi RAM) with 8x h100 GPUs.

image

Not much of the logs: image

I can quickly restart the process now as I am in the pod:

pkill gotty
cd /grok-1
gotty -w python3 ./run.py

Ideas?

Commands used to deploy it

        apt-get update ; apt-get upgrade -y ;
        apt-get install pip wget git -y;
        pip install dm_haiku==0.0.12;
        pip install jax[cuda12_pip]==0.4.25 -f https://storage.googleapis.com/jax-releases/jax_cuda_releases.html
        pip install numpy==1.26.4;
        pip install sentencepiece==0.2.0;
        pip install -U "huggingface_hub[cli]";
        git clone https://github.com/xai-org/grok-1;
        wget https://github.com/yudai/gotty/releases/download/v2.0.0-alpha.3/gotty_2.0.0-alpha.3_linux_amd64.tar.gz;
        tar -zxvf gotty_2.0.0-alpha.3_linux_amd64.tar.gz ; chmod +x gotty ; rm -rf gotty_2.0.0-alpha.3_linux_amd64.tar.gz ; mv gotty /usr/local/bin/;
        huggingface-cli download xai-org/grok-1 --repo-type model --include ckpt-0/tensor* --local-dir /grok-1/checkpoints --local-dir-use-symlinks False;
        mv /grok-1/checkpoints/ckpt /grok-1/checkpoints/ckpt-0;
        mkdir /root/shm;
        sed -i "s;/dev/shm/;/root/shm/;g" /grok-1/checkpoint.py;
        cd /grok-1 && gotty -w python3 ./run.py;

Update 1

I’m trying to run it directly with python3 ./run.py (without gotty right now)

About this issue

  • Original URL
  • State: open
  • Created 3 months ago
  • Comments: 17

Most upvoted comments

wow, so people that even have the right hardware can’t even run this normally? haha that is a FAIL!

Zblocker64 (on Discord) suggested increasing stack size (ulimit -s) which is set to 8192 by default. I’ll try doubling it up to 16384 to see if that helps with the python’s segmentation fault.

FWIW: WIndow’s the default stack size is 1 MB for 32-bit applications and 4 MB for 64-bit applications. macOS typically defaults to 8 MB (but this depends on the macOS version / and the way app was compiled)

I just realised the other issue was actually a linked repository sorry. It just showed on this PR but it’s a different repo.

but you should really deploy from the requirements.txt because it will get updated.

thanks, I am well aware of that and have used it eventually. But had to use the code someone else pasted as I’ve been asking to review it first, hence I had to use those pip install's

# pip install -r requirements.txt
Requirement already satisfied: dm_haiku==0.0.12 in /usr/local/lib/python3.12/site-packages (from -r requirements.txt (line 1)) (0.0.12)
Requirement already satisfied: jax==0.4.25 in /usr/local/lib/python3.12/site-packages (from -r requirements.txt (line 2)) (0.4.25)
Requirement already satisfied: numpy==1.26.4 in /usr/local/lib/python3.12/site-packages (from -r requirements.txt (line 3)) (1.26.4)
Requirement already satisfied: sentencepiece==0.2.0 in /usr/local/lib/python3.12/site-packages (from -r requirements.txt (line 4)) (0.2.0)
Requirement already satisfied: absl-py>=0.7.1 in /usr/local/lib/python3.12/site-packages (from dm_haiku==0.0.12->-r requirements.txt (line 1)) (2.1.0)
Requirement already satisfied: jmp>=0.0.2 in /usr/local/lib/python3.12/site-packages (from dm_haiku==0.0.12->-r requirements.txt (line 1)) (0.0.4)
Requirement already satisfied: tabulate>=0.8.9 in /usr/local/lib/python3.12/site-packages (from dm_haiku==0.0.12->-r requirements.txt (line 1)) (0.9.0)
Requirement already satisfied: flax>=0.7.1 in /usr/local/lib/python3.12/site-packages (from dm_haiku==0.0.12->-r requirements.txt (line 1)) (0.8.2)
Requirement already satisfied: ml-dtypes>=0.2.0 in /usr/local/lib/python3.12/site-packages (from jax==0.4.25->-r requirements.txt (line 2)) (0.3.2)
Requirement already satisfied: opt-einsum in /usr/local/lib/python3.12/site-packages (from jax==0.4.25->-r requirements.txt (line 2)) (3.3.0)
Requirement already satisfied: scipy>=1.9 in /usr/local/lib/python3.12/site-packages (from jax==0.4.25->-r requirements.txt (line 2)) (1.12.0)
Requirement already satisfied: msgpack in /usr/local/lib/python3.12/site-packages (from flax>=0.7.1->dm_haiku==0.0.12->-r requirements.txt (line 1)) (1.0.8)
Requirement already satisfied: optax in /usr/local/lib/python3.12/site-packages (from flax>=0.7.1->dm_haiku==0.0.12->-r requirements.txt (line 1)) (0.2.1)
Requirement already satisfied: orbax-checkpoint in /usr/local/lib/python3.12/site-packages (from flax>=0.7.1->dm_haiku==0.0.12->-r requirements.txt (line 1)) (0.5.6)
Requirement already satisfied: tensorstore in /usr/local/lib/python3.12/site-packages (from flax>=0.7.1->dm_haiku==0.0.12->-r requirements.txt (line 1)) (0.1.56)
Requirement already satisfied: rich>=11.1 in /usr/local/lib/python3.12/site-packages (from flax>=0.7.1->dm_haiku==0.0.12->-r requirements.txt (line 1)) (13.7.1)
Requirement already satisfied: typing-extensions>=4.2 in /usr/local/lib/python3.12/site-packages (from flax>=0.7.1->dm_haiku==0.0.12->-r requirements.txt (line 1)) (4.10.0)
Requirement already satisfied: PyYAML>=5.4.1 in /usr/local/lib/python3.12/site-packages (from flax>=0.7.1->dm_haiku==0.0.12->-r requirements.txt (line 1)) (6.0.1)
Requirement already satisfied: markdown-it-py>=2.2.0 in /usr/local/lib/python3.12/site-packages (from rich>=11.1->flax>=0.7.1->dm_haiku==0.0.12->-r requirements.txt (line 1)) (3.0.0)
Requirement already satisfied: pygments<3.0.0,>=2.13.0 in /usr/local/lib/python3.12/site-packages (from rich>=11.1->flax>=0.7.1->dm_haiku==0.0.12->-r requirements.txt (line 1)) (2.17.2)
Requirement already satisfied: chex>=0.1.7 in /usr/local/lib/python3.12/site-packages (from optax->flax>=0.7.1->dm_haiku==0.0.12->-r requirements.txt (line 1)) (0.1.85)
Requirement already satisfied: jaxlib>=0.1.37 in /root/.local/lib/python3.12/site-packages (from optax->flax>=0.7.1->dm_haiku==0.0.12->-r requirements.txt (line 1)) (0.4.25+cuda12.cudnn89)
Requirement already satisfied: etils[epath,epy] in /usr/local/lib/python3.12/site-packages (from orbax-checkpoint->flax>=0.7.1->dm_haiku==0.0.12->-r requirements.txt (line 1)) (1.7.0)
Requirement already satisfied: nest_asyncio in /usr/local/lib/python3.12/site-packages (from orbax-checkpoint->flax>=0.7.1->dm_haiku==0.0.12->-r requirements.txt (line 1)) (1.6.0)
Requirement already satisfied: protobuf in /usr/local/lib/python3.12/site-packages (from orbax-checkpoint->flax>=0.7.1->dm_haiku==0.0.12->-r requirements.txt (line 1)) (5.26.0)
Requirement already satisfied: toolz>=0.9.0 in /usr/local/lib/python3.12/site-packages (from chex>=0.1.7->optax->flax>=0.7.1->dm_haiku==0.0.12->-r requirements.txt (line 1)) (0.12.1)
Requirement already satisfied: setuptools in /usr/local/lib/python3.12/site-packages (from chex>=0.1.7->optax->flax>=0.7.1->dm_haiku==0.0.12->-r requirements.txt (line 1)) (69.1.1)
Requirement already satisfied: mdurl~=0.1 in /usr/local/lib/python3.12/site-packages (from markdown-it-py>=2.2.0->rich>=11.1->flax>=0.7.1->dm_haiku==0.0.12->-r requirements.txt (line 1)) (0.1.2)
Requirement already satisfied: fsspec in /usr/local/lib/python3.12/site-packages (from etils[epath,epy]->orbax-checkpoint->flax>=0.7.1->dm_haiku==0.0.12->-r requirements.txt (line 1)) (2024.3.0)
Requirement already satisfied: importlib_resources in /usr/local/lib/python3.12/site-packages (from etils[epath,epy]->orbax-checkpoint->flax>=0.7.1->dm_haiku==0.0.12->-r requirements.txt (line 1)) (6.3.1)
Requirement already satisfied: zipp in /usr/local/lib/python3.12/site-packages (from etils[epath,epy]->orbax-checkpoint->flax>=0.7.1->dm_haiku==0.0.12->-r requirements.txt (line 1)) (3.18.1)

And your link with the traceback is super obvious, it’s in the last line: “No space left on device”.

The / disk has 1TiB of space and only about 300 GiB are used; /dev/shm is set to 640 GiB (tmpfs).

Also, not sure where did you find the “link with the traceback” you referred to?

This model should be called SIGINT, because thats what will happen when you try to run it

I’m just following the original readme. There is no word about that the model should be called SIGINT and why do you think would it need to be interrupted anyway?

The first issue is that it exits prematurely before it would finish printing the complete output. (Even when running directly on the host, not the K8s container)

The second issue is that it can’t seem to run well in K8s pod, sometimes it runs sometimes it won’t. And it seems to always fail when grok-1 (and checkpoints) directory is on overlay FS.

The user is pulling our leg when they are saying “model should be called SIGINT” they are just making fun of it crashing for them, but not adding anything of value to the ticket.