llama2.c: Stuck on training: Created a PretokDataset with rng seed 42

When I try to train the model, I run into a problem. I don’t know if anyone else has hit this or how I should solve it.

When I run the training command (below), the log always gets stuck after printing Created a PretokDataset with rng seed 42, with no further output for several hours.

Below are some key steps I performed along with my device information.

python train.py

The corresponding output is roughly as follows:

(base) jupyter@instance-20230817-103839:~/llama2/llama2.c$ python train.py
tokens per iteration will be: 131,072
breaks down as: 4 grad accum steps * 1 processes * 128 batch size * 256 max seq len
Initializing a new model from scratch
num decayed parameter tensors: 43, with 15,187,968 parameters
num non-decayed parameter tensors: 13, with 3,744 parameters
using fused AdamW: True
Created a PretokDataset with rng seed 42

The GPU information of my machine is roughly as follows:

Thu Aug 17 05:21:51 2023       
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 525.105.17   Driver Version: 525.105.17   CUDA Version: 12.0     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  NVIDIA A100-SXM...  On   | 00000000:00:04.0 Off |                    0 |
| N/A   32C    P0    52W / 400W |   1045MiB / 40960MiB |      0%      Default |
|                               |                      |             Disabled |
+-------------------------------+----------------------+----------------------+
                                                                               
+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
|    0   N/A  N/A     40510      C   python                           1042MiB |
+-----------------------------------------------------------------------------+

The CPU information of my machine is roughly as follows: 12 vCPUs, 85 GB RAM

Most upvoted comments

@CatTimson @madroidmaq can you try setting pin_memory=False in this line?
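
For context, here is a minimal sketch of what that change looks like, assuming the batch iterator wraps PretokDataset in a standard torch DataLoader (the iter_batches function below is illustrative, not the exact code in the repo):

from torch.utils.data import DataLoader

def iter_batches(dataset, batch_size, device, num_workers=0):
    # Sketch only: the relevant change is pin_memory=False instead of True,
    # which avoids the pinned-memory path that can hang on some CUDA setups.
    dl = DataLoader(
        dataset,
        batch_size=batch_size,
        pin_memory=False,
        num_workers=num_workers,
    )
    for x, y in dl:
        x = x.to(device, non_blocking=True)
        y = y.to(device, non_blocking=True)
        yield x, y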

@RahulSChand Following your suggestion, my problem disappeared; I can now train and see the detailed output for each training step, thank you very much. It looks like this:

...
10405 | loss 1.2478 | lr 4.889483e-04 | 345.92ms | mfu 11.71%
10406 | loss 1.2216 | lr 4.889459e-04 | 346.45ms | mfu 11.71%
10407 | loss 1.2321 | lr 4.889436e-04 | 346.29ms | mfu 11.71%
10408 | loss 1.2726 | lr 4.889413e-04 | 345.98ms | mfu 11.71%
10409 | loss 1.2325 | lr 4.889389e-04 | 346.04ms | mfu 11.71%
...

I’d like to know why this tweak works, or what sources I should be looking at for this information. Looking forward to your reply.

@madroidmaq Since training gets stuck in the dataloader, I searched the PyTorch issue tracker and found a similar issue that other people reported when pin_memory=True (link). You can read that thread to learn more about why it happens.

@madroidmaq You can do apt install sentencepiece and the spm_train command should then work. I don’t think this change to the script is necessary.

@madroidmaq Can you remove the time.time() calls from the following lines? https://github.com/karpathy/llama2.c/blob/bd182289c596fa6059eb7b3b7c8ccd04b5c90fc3/train.py#L249 https://github.com/karpathy/llama2.c/blob/bd182289c596fa6059eb7b3b7c8ccd04b5c90fc3/train.py#L322

Just put t0=0 and t1=0 & check?
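
In case it helps, a hedged sketch of that debugging tweak; the timing pattern below is a simplification of the per-step timing in train.py, not the exact code:

import time

USE_WALLCLOCK = False  # False mimics the "just put t0=0 and t1=0" suggestion

def run_step():
    # stand-in for one training iteration (forward, backward, optimizer step)
    pass

t0 = time.time() if USE_WALLCLOCK else 0
run_step()
t1 = time.time() if USE_WALLCLOCK else 0
dt = t1 - t0  # always 0 with the constants; the goal is only to check whether
              # time.time() is where the run appears to hang
print(f"step time: {dt*1000:.2f}ms")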

Also, you can change the print below to print(..., flush=True) to see if the issue occurs after the first loss.backward: https://github.com/karpathy/llama2.c/blob/bd182289c596fa6059eb7b3b7c8ccd04b5c90fc3/train.py#L262
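
For reference, the idea is just to force the step log out of Python’s stdout buffer right away. A standalone sketch with made-up values, in the same format as the log above:

# Illustrative values only; in train.py they come from the training loop.
iter_num, lossf, lr, dt, running_mfu = 10405, 1.2478, 4.889483e-4, 0.34592, 0.1171

# flush=True writes the line immediately instead of leaving it in the stdout
# buffer, so you can see exactly how far the run gets before it hangs.
print(
    f"{iter_num} | loss {lossf:.4f} | lr {lr:e} "
    f"| {dt*1000:.2f}ms | mfu {running_mfu*100:.2f}%",
    flush=True,
)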

I can’t reproduce the error, so there is no way for me to check whether the suggestion is actually correct, but it’s worth a try.