llama2.c: Stuck on training: Created a PretokDataset with rng seed 42
When I try to train the model, I run into a problem; I don't know whether anyone else has hit the same issue or how I should solve it.
When I execute the training command (below), the log always gets stuck on the output Created a PretokDataset with rng seed 42, with no change for several hours.
Below are some key steps I performed along with my device information.
python train.py
The corresponding output is roughly as follows:
(base) jupyter@instance-20230817-103839:~/llama2/llama2.c$ python train.py
tokens per iteration will be: 131,072
breaks down as: 4 grad accum steps * 1 processes * 128 batch size * 256 max seq len
Initializing a new model from scratch
num decayed parameter tensors: 43, with 15,187,968 parameters
num non-decayed parameter tensors: 13, with 3,744 parameters
using fused AdamW: True
Created a PretokDataset with rng seed 42
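As a sanity check, the 131,072 tokens per iteration reported in the log is simply the product of the four numbers in the breakdown line; the snippet below just re-derives it (values copied from the log above):

```python
# Values copied from the training log above.
grad_accum_steps, processes, batch_size, max_seq_len = 4, 1, 128, 256
tokens_per_iter = grad_accum_steps * processes * batch_size * max_seq_len
assert tokens_per_iter == 131_072
print(f"tokens per iteration: {tokens_per_iter:,}")
```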
The GPU information of my machine is roughly as follows:
Thu Aug 17 05:21:51 2023
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 525.105.17 Driver Version: 525.105.17 CUDA Version: 12.0 |
|-------------------------------+----------------------+----------------------+
| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|===============================+======================+======================|
| 0 NVIDIA A100-SXM... On | 00000000:00:04.0 Off | 0 |
| N/A 32C P0 52W / 400W | 1045MiB / 40960MiB | 0% Default |
| | | Disabled |
+-------------------------------+----------------------+----------------------+
+-----------------------------------------------------------------------------+
| Processes: |
| GPU GI CI PID Type Process name GPU Memory |
| ID ID Usage |
|=============================================================================|
| 0 N/A N/A 40510 C python 1042MiB |
+-----------------------------------------------------------------------------+
The CPU information of my machine is roughly as follows: 12 vCPUs, 85 GB RAM
About this issue
- Original URL
- State: open
- Created 10 months ago
- Comments: 22 (16 by maintainers)
@madroidmaq since training gets stuck at the dataloader, I looked up PyTorch issues for it & found a similar issue other people reported when pin_memory=True (link). You can read that thread to know more about why it happens.
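If that is the cause, a quick way to test it is to build the DataLoader around the dataset with pin_memory=False (and num_workers=0, to also rule out worker-process deadlocks). The sketch below is not the actual llama2.c code; DummyPretokDataset is a hypothetical stand-in for PretokDataset, used only to show where the flags go:

```python
# Minimal sketch, assuming the hang comes from the DataLoader settings.
# DummyPretokDataset is a made-up stand-in for llama2.c's PretokDataset.
import torch
from torch.utils.data import DataLoader, IterableDataset


class DummyPretokDataset(IterableDataset):
    def __iter__(self):
        while True:
            # Fake (input, target) token sequences of max_seq_len = 256.
            yield torch.zeros(256, dtype=torch.long), torch.zeros(256, dtype=torch.long)


dl = DataLoader(
    DummyPretokDataset(),
    batch_size=128,
    pin_memory=False,  # try False if training hangs right after the dataset is created
    num_workers=0,     # 0 also rules out worker-process deadlocks
)
x, y = next(iter(dl))
print(x.shape, y.shape)  # torch.Size([128, 256]) torch.Size([128, 256])
```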
@madroidmaq You can do apt install sentencepiece and the spm_train command should work. I don't think this change in the script is necessary.
@madroidmaq can you remove this time.time() from the following lines? https://github.com/karpathy/llama2.c/blob/bd182289c596fa6059eb7b3b7c8ccd04b5c90fc3/train.py#L249 https://github.com/karpathy/llama2.c/blob/bd182289c596fa6059eb7b3b7c8ccd04b5c90fc3/train.py#L322 Just put t0=0 and t1=0 & check?
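For illustration only, the shape of that change might look like the sketch below; DISABLE_TIMING is a made-up switch, not something that exists in train.py:

```python
import time

DISABLE_TIMING = True  # hypothetical debugging switch, not part of train.py

# Roughly the shape of the timing code around a training step: with the switch
# on, time.time() is never called, which is what the suggestion above tests.
t0 = 0 if DISABLE_TIMING else time.time()
# ... one training step would run here ...
t1 = 0 if DISABLE_TIMING else time.time()
dt = t1 - t0  # stays 0 while debugging, so the reported step time is meaningless
print(f"step time: {dt * 1000:.2f}ms")
```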
Also you can change the line below to print(..., flush=True) to see if the issue is occurring after the first loss.backward: https://github.com/karpathy/llama2.c/blob/bd182289c596fa6059eb7b3b7c8ccd04b5c90fc3/train.py#L262 I can't reproduce the error, so there is no way for me to check whether the suggestion is actually correct, but it's worth a try.
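For reference, the suggested change is just adding flush=True to the existing print; the snippet below uses made-up values rather than the real variables at that line:

```python
# Illustrative values only; train.py prints its own step/loss statistics here.
step, lossf, dt = 0, 9.4321, 1.23

# flush=True forces the line out of the stdout buffer immediately, so you can
# tell whether the loop ever gets past the first loss.backward().
print(f"{step} | loss {lossf:.4f} | {dt * 1000:.2f}ms", flush=True)
```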