safetensors: `safetensors.torch.save_file()` throws `RuntimeError` - any recommended way to force the save anyway?
I was confronted with `RuntimeError: Some tensors share memory, this will lead to duplicate memory on disk and potential differences when loading them again`.
Can we explicitly disregard “potential differences”?
About this issue
- State: closed
- Created a year ago
- Comments: 24 (3 by maintainers)
As said on the issue in Transformers, if `safetensors` wants to take over the world, it needs to be less absolute and provide flexibility to its users. In Transformers, when you save and reload weights, we always take care of re-tying the weights; yes, they may be saved twice if the proper variables are not set, but that doesn't mean the workflow of saving and reloading does not work.
No, this is a serious issue regarding the file, and it's always better to confront it upfront rather than save erroneous files.
This means that you have two tensors using the same buffer:
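For instance (a minimal sketch; the names `A` and `B` are just illustrative):

```python
import torch

A = torch.zeros((10, 10))
B = A[:5]  # B is a view into A's storage; no new memory is allocated
```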
A & B now share the same buffer, and B is really indexing into A, which means modifications in A will modify B and vice versa.
Now this is a problem for saving on disk since after reload you won’t have that property depending on how you load them and if you load them on different devices for instance.
As a result, shared tensors are not allowed on disk. If you want that property, you need to save only `A` and recreate `B` after reload (which is essentially free, since `A` will already be loaded). This is how it's done within `transformers` for sharing the token embeddings and the last LM layer (which are the same tensor).
It really depends on what those shared tensors do: sometimes they are actually just duplicates, and you can save only one of them. More complicated situations can exist, for instance three tensors referring to each other in non-overlapping ways.
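A minimal sketch of that save-only-the-owner pattern (tensor names and file name are illustrative):

```python
import torch
from safetensors.torch import save_file, load_file

A = torch.zeros((10, 10))
B = A[:5]  # view into A

# Save only the tensor that owns the storage; the view is not written to disk.
save_file({"A": A}, "model.safetensors")

# After reload, recreate the view; this is essentially free since A is already loaded.
loaded = load_file("model.safetensors")
A = loaded["A"]
B = A[:5]
```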
For reference, we recently had trouble with some llama weights being 37Gb instead of 26Gb, and it was linked to something like saving `B` to disk: torch by default saves the entire `A` buffer, not just the `B` view, leading to a massively larger file than necessary (and no way to access `A` again from that file).
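To illustrate that gotcha (a small sketch, not taken from the original report): `torch.save` serializes the whole underlying storage of a view, so the file size reflects `A`, not `B`:

```python
import os
import torch

A = torch.zeros(1_000_000)  # ~4 MB of float32 storage
B = A[:10]                  # tiny view into A's storage

# torch.save writes the entire underlying storage of the view,
# so the file is roughly the size of A, not of 10 floats.
torch.save(B, "view.pt")
print(os.path.getsize("view.pt"))
```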
Does this information help?
If you're willing to share a few more details about those tensors, I would be happy to help think through solutions.
It works here because there’s no real reason for the sharing (afaik the weights are actually disjoint). But it’s really hard to infer that from this library, and even then there might be good reasons why the sharing was done.
The same problem occurred for me while trying to convert PyTorch `state_dict` tensors to safetensors; the error message was the same `RuntimeError` quoted at the top of this issue.
It turns out that `model.layers.22.mlp.gate_proj.weight` and `model.layers.22.mlp.up_proj.weight` are both slices of the same tensor. A tensor slice in PyTorch is a view operation, which means the slices share the same memory storage. When we save only the sliced tensors, without the original, unsliced tensor, `_remove_duplicate_names` in safetensors raises the error above; it complains that it cannot find the unsliced tensor (I have not understood why safetensors needs it). A quick way to fix this is to use `tensor.clone()` on the shared tensors in the original `state_dict`, which breaks the memory sharing between the views (a minimal sketch follows below).
Just for future reference: I also get the mentioned `RuntimeError` (`Some tensors share memory, this will lead to duplicate memory on disk and potential differences when loading them again`) when using DeepSpeed multi-node. For now I will not use safetensors during training and will do the conversion after training instead.
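A minimal sketch of that `clone()` workaround, assuming the weights come from a standard PyTorch checkpoint (file names are illustrative):

```python
import torch
from safetensors.torch import save_file

# Load the original state_dict (illustrative path).
state_dict = torch.load("pytorch_model.bin", map_location="cpu")

# Clone every tensor so views no longer share storage with one another.
state_dict = {name: tensor.clone().contiguous() for name, tensor in state_dict.items()}

save_file(state_dict, "model.safetensors")
```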
I resolved the CUDA out-of-memory issue by modifying the `compute_metrics()` and `preprocess_logits_for_metrics()` functions. Now I can successfully execute the script without using DeepSpeed.
The error occurred when I executed this script: run_speech_recognition_ctc_adapter.py
Hi all - I’m facing this error when saving a transformers model inside a thread pool.
My code looks roughly like this
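Roughly, a hypothetical sketch of that setup (the model, paths, and pool size are mine for illustration, not the original code):

```python
from concurrent.futures import ThreadPoolExecutor

from transformers import CLIPTextModel

def save_model(model, path):
    # Save with safetensors serialization from a worker thread.
    model.save_pretrained(path, safe_serialization=True)

text_encoder = CLIPTextModel.from_pretrained("openai/clip-vit-large-patch14")

with ThreadPoolExecutor(max_workers=4) as pool:
    pool.submit(save_model, text_encoder, "out/text_encoder").result()
```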
Interestingly, the issue only occurs for the `text_encoder`, which is of type `CLIPTextModel`. The diffusers models (unet and vae) don't complain.
@Narsil Do you have suggestions how to fix this? I've tried to create a deep copy of the weights before shipping them off to the thread, but no luck…
I'm using `safetensors==0.3.0` and `transformers==4.28.1`. This is the full trace:
EDIT: Much later on I found the cause of the above; maybe it can help someone in a similar situation.
As it turns out, the problem was introduced already at load time: I had multiple threads loading a safetensors file from disk at the same time. Every once in a while this caused weights not to be loaded properly, and when saving the models later on they would raise the above error. To mitigate the issue I created a threading lock:
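A minimal sketch of that kind of lock (the loader function, model class, and path handling are illustrative, not the original code):

```python
import threading

from transformers import CLIPTextModel

# Only one thread at a time may read the safetensors file from disk.
_load_lock = threading.Lock()

def load_text_encoder(path: str) -> CLIPTextModel:
    with _load_lock:
        return CLIPTextModel.from_pretrained(path)
```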
This properly fixed things. Anyway, hopefully this can help people.