transformers: Shared tensors not correctly saved.
System Info
- transformers version: 4.36.0.dev0
- Platform: Linux-4.19.0-25-cloud-amd64-x86_64-with-glibc2.28
- Python version: 3.9.17
- Huggingface_hub version: 0.16.4
- Safetensors version: 0.3.2
- Accelerate version: 0.24.1
- Accelerate config: not found
- PyTorch version (GPU?): 2.0.1+cu118 (True)
- Tensorflow version (GPU?): not installed (NA)
- Flax version (CPU?/GPU?/TPU?): not installed (NA)
- Jax version: not installed
- JaxLib version: not installed
- Using GPU in script?: 8*A100
- Using distributed or parallel set-up in script?: accelerate + deepspeed zero3
Who can help?
No response
Information
- The official example scripts
- My own modified scripts
Tasks
- An officially supported task in the examples folder (such as GLUE/SQuAD, …)
- My own task or dataset (give details below)
Reproduction
I am finetuning Fuyu-8B and found that calling the model.save_pretrained method runs into an error after upgrading to 4.36.0.
The error shows:
Removed shared tensor {'language_model.model.layers.12.self_attn.dense.weight', 'language_model.model.layers.22.self_attn.k_layernorm.weight', 'language_model.model.layers.24.mlp.dense_h_to_4h.bias', 'language_model.model.layers.15.mlp.dense_h_to_4h.weight', 'language_model.model.layers.22.input_layernorm.weight', 'language_model.model.layers.25.self_attn.q_layernorm.weight', 'language_model.model.layers.8.self_attn.query_key_value.bias', 'language_model.model.layers.33.mlp.dense_4h_to_h.bias', 'language_model.model.layers.6.post_attention_layernorm.weight', 'language_model.model.layers.30.self_attn.query_key_value.weight', 'language_model.model.layers.5.self_attn.query_key_value.weight', 'language_model.model.layers.10.mlp.dense_h_to_4h.bias', 'language_model.model.layers.5.post_attention_layernorm.weight', 'language_model.model.layers.15.mlp.dense_4h_to_h.bias', 'language_model.model.layers.2.self_attn.query_key_value.bias', 'language_model.model.layers.4.input_layernorm.bias', 'language_model.model.layers.25.self_attn.k_layernorm.weight', 'language_model.model.layers.29.self_attn.query_key_value.weight', 'language_model.model.layers.13.self_attn.query_key_value.bias', 'language_model.lm_head.weight', 'language_model.model.layers.6.mlp.dense_h_to_4h.weight', 'language_model.model.layers.13.mlp.dense_4h_to_h.weight', 'language_model.model.layers.14.mlp.dense_h_to_4h.weight', 'language_model.model.layers.31.mlp.dense_h_to_4h.weight', 'language_model.model.layers.32.input_layernorm.weight', 'language_model.model.layers.19.mlp.dense_4h_to_h.bias', 'language_model.model.layers.24.self_attn.dense.bias', 'language_model.model.layers.5.self_attn.query_key_value.bias', 'language_model.model.layers.7.mlp.dense_4h_to_h.bias', 'language_model.model.layers.10.self_attn.query_key_value.bias', 'language_model.model.layers.18.mlp.dense_h_to_4h.weight', 'language_model.model.layers.29.post_attention_layernorm.bias', 'language_model.model.layers.11.self_attn.dense.weight', 
'language_model.model.layers.28.self_attn.query_key_value.weight', 'language_model.model.layers.14.mlp.dense_4h_to_h.weight', 'language_model.model.layers.15.mlp.dense_4h_to_h.weight', 'language_model.model.layers.35.mlp.dense_4h_to_h.weight', 'language_model.model.layers.17.post_attention_layernorm.bias', 'language_model.model.layers.23.mlp.dense_h_to_4h.bias', 'language_model.model.layers.15.mlp.dense_h_to_4h.bias', 'language_model.model.final_layernorm.weight', 'language_model.model.layers.6.mlp.dense_4h_to_h.weight', 'language_model.model.layers.29.input_layernorm.weight', 'language_model.model.layers.13.self_attn.q_layernorm.bias', 'language_model.model.layers.6.self_attn.dense.weight', 'language_model.model.layers.22.self_attn.query_key_value.weight', 'language_model.model.layers.35.post_attention_layernorm.bias', 'language_model.model.layers.23.self_attn.dense.bias', 'language_model.model.layers.16.self_attn.k_layernorm.weight', 'language_model.model.layers.32.self_attn.dense.weight', 'language_model.model.layers.25.self_attn.dense.bias', 'language_model.model.layers.9.self_attn.query_key_value.bias', 'language_model.model.layers.25.self_attn.k_layernorm.bias', 'language_model.model.layers.3.mlp.dense_h_to_4h.weight', 'language_model.model.layers.21.self_attn.q_layernorm.weight', 'language_model.model.layers.32.post_attention_layernorm.bias', 'language_model.model.layers.33.self_attn.q_layernorm.weight', 'language_model.model.layers.2.post_attention_layernorm.bias', 'language_model.model.layers.20.mlp.dense_4h_to_h.bias', 'language_model.model.layers.4.self_attn.k_layernorm.bias', 'language_model.model.layers.29.mlp.dense_4h_to_h.weight', 'language_model.model.layers.32.self_attn.dense.bias', 'language_model.model.layers.8.mlp.dense_h_to_4h.weight', 'language_model.model.layers.34.self_attn.query_key_value.bias', 'language_model.model.layers.35.self_attn.k_layernorm.bias', 'language_model.model.layers.4.post_attention_layernorm.bias', 
'language_model.model.layers.28.mlp.dense_4h_to_h.bias', 'language_model.model.layers.8.self_attn.q_layernorm.bias', 'language_model.model.layers.32.self_attn.k_layernorm.weight', 'language_model.model.layers.28.self_attn.dense.weight', 'language_model.model.layers.31.mlp.dense_4h_to_h.bias', 'language_model.model.layers.0.mlp.dense_4h_to_h.weight', 'language_model.model.layers.11.mlp.dense_h_to_4h.weight', 'language_model.model.layers.29.mlp.dense_4h_to_h.bias', 'language_model.model.layers.19.mlp.dense_h_to_4h.weight', 'language_model.model.layers.12.post_attention_layernorm.weight', 'language_model.model.layers.7.self_attn.query_key_value.weight', 'language_model.model.layers.13.input_layernorm.weight', 'language_model.model.layers.31.mlp.dense_h_to_4h.bias', 'language_model.model.layers.0.self_attn.k_layernorm.bias', 'language_model.model.layers.34.self_attn.q_layernorm.bias', 'language_model.model.layers.1.self_attn.k_layernorm.weight', 'language_model.model.layers.35.self_attn.q_layernorm.weight', 'language_model.model.layers.29.self_attn.k_layernorm.bias', 'language_model.model.layers.34.mlp.dense_4h_to_h.weight', 'language_model.model.layers.30.mlp.dense_h_to_4h.bias', 'language_model.model.layers.0.input_layernorm.bias', 'language_model.model.layers.18.self_attn.query_key_value.weight', 'language_model.model.layers.1.mlp.dense_h_to_4h.bias', 'language_model.model.layers.26.mlp.dense_h_to_4h.weight', 'language_model.model.layers.8.post_attention_layernorm.weight', 'language_model.model.layers.18.self_attn.dense.bias', 'language_model.model.layers.30.mlp.dense_4h_to_h.bias', 'language_model.model.layers.7.mlp.dense_h_to_4h.bias', 'language_model.model.layers.31.self_attn.dense.weight', 'language_model.model.layers.9.self_attn.query_key_value.weight', 'language_model.model.layers.12.input_layernorm.bias', 'language_model.model.layers.14.self_attn.q_layernorm.weight', 'language_model.model.layers.28.self_attn.dense.bias', 
'language_model.model.layers.6.self_attn.q_layernorm.bias', 'language_model.model.layers.30.self_attn.query_key_value.bias', 'language_model.model.layers.11.self_attn.q_layernorm.weight', 'language_model.model.layers.33.self_attn.dense.bias', 'language_model.model.layers.14.mlp.dense_h_to_4h.bias', 'language_model.model.layers.14.mlp.dense_4h_to_h.bias', 'language_model.model.layers.12.mlp.dense_h_to_4h.weight', 'language_model.model.layers.10.self_attn.dense.weight', 'language_model.model.layers.5.self_attn.k_layernorm.weight', 'language_model.model.layers.33.mlp.dense_h_to_4h.weight', 'language_model.model.layers.17.mlp.dense_4h_to_h.weight', 'language_model.model.layers.19.self_attn.dense.bias', 'language_model.model.layers.4.mlp.dense_4h_to_h.bias', 'language_model.model.layers.19.self_attn.query_key_value.weight', 'language_model.model.layers.8.input_layernorm.bias', 'language_model.model.layers.6.self_attn.k_layernorm.bias', 'language_model.model.layers.31.self_attn.dense.bias', 'language_model.model.layers.25.self_attn.query_key_value.bias', 'language_model.model.layers.34.self_attn.q_layernorm.weight', 'language_model.model.layers.7.input_layernorm.bias', 'language_model.model.layers.2.self_attn.k_layernorm.bias', 'language_model.model.layers.29.self_attn.q_layernorm.bias', 'language_model.model.layers.16.self_attn.query_key_value.bias', 'language_model.model.layers.35.mlp.dense_h_to_4h.weight', 'language_model.model.layers.35.post_attention_layernorm.weight', 'language_model.model.layers.1.self_attn.dense.weight', 'language_model.model.layers.4.mlp.dense_h_to_4h.bias', 'language_model.model.layers.15.input_layernorm.bias', 'language_model.model.layers.4.post_attention_layernorm.weight', 'language_model.model.layers.14.input_layernorm.weight', 'language_model.model.layers.22.mlp.dense_4h_to_h.bias', 'language_model.model.layers.11.input_layernorm.weight', 'language_model.model.layers.27.self_attn.k_layernorm.bias', 
'language_model.model.layers.18.mlp.dense_4h_to_h.bias', 'language_model.model.layers.25.mlp.dense_h_to_4h.bias', 'language_model.model.layers.32.input_layernorm.bias', 'language_model.model.layers.10.mlp.dense_h_to_4h.weight', 'language_model.model.layers.14.self_attn.k_layernorm.weight', 'language_model.model.layers.8.post_attention_layernorm.bias', 'language_model.model.layers.27.self_attn.dense.bias', 'language_model.model.layers.21.self_attn.k_layernorm.weight', 'language_model.model.layers.27.self_attn.q_layernorm.weight', 'language_model.model.layers.30.self_attn.dense.weight', 'language_model.model.layers.23.mlp.dense_4h_to_h.bias', 'language_model.model.layers.18.post_attention_layernorm.weight', 'language_model.model.layers.22.self_attn.q_layernorm.weight', 'language_model.model.layers.13.self_attn.dense.bias', 'language_model.model.layers.14.self_attn.query_key_value.bias', 'language_model.model.layers.10.self_attn.k_layernorm.bias', 'language_model.model.layers.34.input_layernorm.bias', 'language_model.model.layers.3.post_attention_layernorm.bias', 'language_model.model.layers.5.input_layernorm.weight', 'language_model.model.layers.8.self_attn.query_key_value.weight', 'language_model.model.layers.27.post_attention_layernorm.bias', 'language_model.model.layers.28.mlp.dense_h_to_4h.weight', 'language_model.model.layers.28.self_attn.q_layernorm.weight', 'language_model.model.layers.5.mlp.dense_4h_to_h.weight', 'language_model.model.layers.19.self_attn.dense.weight', 'language_model.model.layers.21.input_layernorm.weight', 'language_model.model.layers.14.post_attention_layernorm.bias', 'language_model.model.layers.35.self_attn.query_key_value.bias', 'language_model.model.layers.10.mlp.dense_4h_to_h.weight', 'language_model.model.layers.17.self_attn.q_layernorm.bias', 'language_model.model.layers.25.input_layernorm.bias', 'language_model.model.layers.34.self_attn.dense.weight', 'language_model.model.layers.34.input_layernorm.weight', 
'language_model.model.layers.5.self_attn.k_layernorm.bias', 'language_model.model.layers.2.mlp.dense_4h_to_h.weight', 'language_model.model.layers.11.self_attn.dense.bias', 'language_model.model.layers.17.mlp.dense_4h_to_h.bias', 'language_model.model.layers.13.mlp.dense_4h_to_h.bias', 'language_model.model.layers.21.self_attn.query_key_value.weight', 'language_model.model.lay
207 Saved checkpoint at epoch 1.
When I set safe_serialization=False, the warning disappears, but the saved pytorch_model.bin is only 2 MB, compared to around 18 GB originally (using 4.35.0).
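For context, safetensors refuses to serialize aliased tensors, so before saving, tensors that share underlying storage are collapsed to a single name. A minimal sketch of that dedup idea — this uses plain Python objects as stand-ins for tensors and is NOT the actual transformers implementation:

```python
# Hypothetical sketch of storage-based deduplication. Each "tensor" is
# modeled as a plain object; aliases are names bound to the same object.
def drop_aliases(state_dict):
    """Keep one name per underlying storage; report the names removed."""
    seen = {}       # id(storage) -> first name that claimed it
    removed = set()
    for name, tensor in state_dict.items():
        key = id(tensor)            # stand-in for a storage-pointer check
        if key in seen:
            removed.add(name)       # alias of an already-kept tensor
        else:
            seen[key] = name
    kept = {n: t for n, t in state_dict.items() if n not in removed}
    return kept, removed

shared = object()                   # e.g. embedding weight tied to lm_head
sd = {"embed.weight": shared, "lm_head.weight": shared, "ln.weight": object()}
kept, removed = drop_aliases(sd)
print(sorted(removed))              # → ['lm_head.weight']
```

Dropping genuinely tied weights (like lm_head sharing the embedding matrix) is harmless, since they are re-tied on load; the bug reported here is that tensors which are *not* actually shared end up in the removed set.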
Expected behavior
See above
About this issue
- Original URL
- State: open
- Created 8 months ago
- Comments: 34 (11 by maintainers)
Agree. I hit the same issue when I ran it on my 8-GPU instance with DeepSpeed. I even downgraded to 4.35.0 and still have the same issue.
Basically, my code saves a BERT module in one folder and the overall model in another folder. I hypothesize that when saving with safetensors, if it notices that you are saving duplicate weights and biases, it saves the full thing once, and when you try re-saving it, it removes the shared modules (to save disk space, I guess). In my case, it was removing all of my layers except the tail Embedding layer.
Luckily for me, setting safe_serialization=False fixed it. I hope you can figure out how to fix yours too @Luodian
I encountered the same issue (transformers==4.38.2, accelerate==0.27.2). This issue appears to be caused by the logic for identifying aliased (shared) tensors.
During debugging, I discovered:
- In the save_pretrained method, parameters are identified as shared (aliased) tensors even though they are not actually shared, which produces the "Removed shared tensor ..." log message.
- The shared tensors are only removed when safe_serialization=True; thus, we can avoid this issue by setting safe_serialization=False.
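One plausible explanation — an assumption on my part, but consistent with the DeepSpeed ZeRO-3 setups reported above — is that partitioned parameters are left as zero-size placeholders whose storage pointers coincide, so a pointer-based alias check lumps nearly every parameter into one "shared" group. A toy model of that failure mode, using (pointer, size) pairs as stand-ins for tensor storage:

```python
# Toy reproduction of the suspected false-aliasing. Under ZeRO-3,
# gathered-away parameters can all present the same (e.g. null)
# storage pointer, which a naive alias check treats as sharing.
def find_shared_groups(tensors):
    """Group parameter names by storage pointer, as an alias check might."""
    groups = {}
    for name, (ptr, size) in tensors.items():
        groups.setdefault(ptr, []).append(name)
    return [sorted(names) for names in groups.values() if len(names) > 1]

# Normal case: every parameter owns distinct storage -> no shared groups.
normal = {"w1": (0x1000, 64), "w2": (0x2000, 64)}
# ZeRO-3-like case: placeholders all report pointer 0 and size 0.
zero3 = {"w1": (0, 0), "w2": (0, 0), "w3": (0, 0)}

print(find_shared_groups(normal))   # → []
print(find_shared_groups(zero3))    # → [['w1', 'w2', 'w3']]
```

If this is the mechanism, gathering the full parameters (or saving from a consolidated state dict) before calling save_pretrained would sidestep the false grouping, which would also explain why plain single-GPU saves are unaffected.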