DeepSpeed: [BUG] Very strange error while running LLaMa-2 with DeepSpeed-Chat

I am trying to run DeepSpeed-Chat almost out of the box, albeit with some custom data and minor modifications to the code for logging, data loading, and model checkpointing. From the model's perspective, however, we are using DeepSpeed as-is. I am now hitting a strange error in a run that previously worked fine. The error is catalogued below (I am only pasting the parts that I think are relevant):

```
Loading checkpoint shards:  50%|█████     | 1/2 [00:02<00:02, 2.30s/it]
Loading checkpoint shards: 100%|██████████| 2/2 [00:03<00:00, 1.37s/it]
Loading checkpoint shards: 100%|██████████| 2/2 [00:03<00:00, 1.51s/it]
You are resizing the embedding layer without providing a pad_to_multiple_of parameter. This means that the new embedding dimension will be 32008. This might induce some performance reduction as Tensor Cores will not be available. For more details about this, or help on choosing the correct value for resizing, refer to this guide: https://docs.nvidia.com/deeplearning/performance/dl-performance-matrix-multiplication/index.html#requirements-tc
[2023-08-28 17:41:40,445] [INFO] [partition_parameters.py:332:__exit__] finished initializing model - num_params = 292, num_elems = 6.87B
/.local/torch/nn/init.py:412: UserWarning: Initializing zero-element tensors is a no-op
  warnings.warn("Initializing zero-element tensors is a no-op")
```


```
  File "/.local/lib/python3.8/site-packages/torch/utils/_contextlib.py", line 115, in decorate_context
    return func(*args, **kwargs)
  File "/.local/lib/python3.8/site-packages/deepspeed/runtime/zero/partitioned_param_coordinator.py", line 306, in fetch_sub_module
    assert param.ds_status == ZeroParamStatus.AVAILABLE, param.ds_summary()
AssertionError: {'id': 292, 'status': 'NOT_AVAILABLE', 'numel': 0, 'ds_numel': 0, 'shape': (0,), 'ds_shape': (0, 4096), 'requires_grad': True, 'grad_shape': None, 'persist': True, 'active_sub_modules': {453}, 'ds_tensor.shape': torch.Size([0])}
```
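In case it helps with triage, below is a small debug helper we could drop into the training script after model initialization to map the failing parameter id back to a module name. The helper itself is hypothetical (it is not part of DeepSpeed-Chat); it only relies on the `ds_id` / `ds_numel` / `ds_summary()` attributes that ZeRO stage 3 attaches to parameters:

```python
# Hypothetical debug helper (not from DeepSpeed-Chat): list every ZeRO-3 managed
# parameter that still has zero elements, so the failing id (292 above) can be
# traced back to a named module such as the resized embedding.
def dump_zero_size_params(model):
    for name, param in model.named_parameters():
        # ds_summary() / ds_id only exist on parameters managed by ZeRO stage 3
        if hasattr(param, "ds_summary") and getattr(param, "ds_numel", 1) == 0:
            print(f"{name} (ds_id={param.ds_id}): {param.ds_summary()}")
```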

The DS parameters are as follows (a hedged sketch of the ZeRO-3 config dict they imply follows the command):

```
ZERO_STAGE=3
deepspeed main.py \
   --sft_only_data_path local/jsonfile \
   --data_split 10,0,0 \
   --model_name_or_path meta-llama/Llama-2-7b-hf \
   --per_device_train_batch_size 8 \
   --per_device_eval_batch_size 8 \
   --gradient_checkpointing \
   --max_seq_len 512 \
   --learning_rate 4e-4 \
   --weight_decay 0. \
   --num_train_epochs 4 \
   --gradient_accumulation_steps 1 \
   --lr_scheduler_type cosine \
   --num_warmup_steps 100 \
   --print_loss \
   --project 'deepspeed-eval' \
   --seed 1234 \
   --zero_stage $ZERO_STAGE \
   --deepspeed \
   --output_dir $OUTPUT \
   &> $OUTPUT/training.log
```
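For completeness, the ZeRO-3 configuration that the script ends up handing to `deepspeed.initialize` should be roughly equivalent to the dict below. This is a minimal sketch reconstructed from the flags above, not a dump from our run, so the exact fields and values are assumptions:

```python
import deepspeed

# Minimal sketch of the ZeRO stage-3 config implied by --zero_stage 3 and the
# batch-size flags above; the values here are illustrative assumptions, not a
# dump from our run.
ds_config = {
    "train_micro_batch_size_per_gpu": 8,
    "gradient_accumulation_steps": 1,
    "zero_optimization": {
        "stage": 3,
        "stage3_param_persistence_threshold": 1e4,
        "offload_param": {"device": "none"},
        "offload_optimizer": {"device": "none"},
    },
    "fp16": {"enabled": True},
    "gradient_clipping": 1.0,
}

# The script would then pass this dict to DeepSpeed along with the model, e.g.:
# engine, optimizer, _, lr_scheduler = deepspeed.initialize(
#     model=model, optimizer=optimizer, lr_scheduler=lr_scheduler, config=ds_config)
```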

The versions of the essential libraries are as follows (a quick way to re-collect most of them is sketched after the list):

  • CUDA driver version: 530.30.02
  • CUDA / nvcc version: 12.1
  • deepspeed version: 0.10.1
  • transformers version: 4.29.2
  • torch version: 2.1.0.dev20230828+cu121
  • python version: 3.8
  • datasets version: 2.12.0
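The snippet below is just an illustration of how the library versions can be re-collected on another machine; DeepSpeed's own `ds_report` command prints a similar environment summary:

```python
# Illustrative snippet to re-collect the versions listed above (the CUDA driver
# version has to be read from nvidia-smi separately).
import sys

import datasets
import deepspeed
import torch
import transformers

print("python      :", sys.version.split()[0])
print("torch       :", torch.__version__)
print("CUDA (torch):", torch.version.cuda)
print("deepspeed   :", deepspeed.__version__)
print("transformers:", transformers.__version__)
print("datasets    :", datasets.__version__)
```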

Please let us know how to proceed with debugging this strange error.

About this issue

  • State: open
  • Created 10 months ago
  • Comments: 21 (3 by maintainers)

Most upvoted comments

I reinstalled transformers==4.31.0, but step 3 still failed as before.

I tried this and it does not work for me.

I tried it too and it does not work for me either.