safetensors: `safetensors.torch.save_file()` throws `RuntimeError` - any recommended way to force saving anyway?

I was confronted with `RuntimeError: Some tensors share memory, this will lead to duplicate memory on disk and potential differences when loading them again`. Can we explicitly disregard the “potential differences”?

About this issue

  • State: closed
  • Created a year ago
  • Comments: 24 (3 by maintainers)

Most upvoted comments

As said on the issue in Transformers, if safetensors wants to take over the world, it needs to be less absolute and provide flexibility to its users. In Transformers, when you save and reload weights, we always take care of re-tying the weights, and yes, they may be saved twice if the proper variables are not set, but that doesn’t mean the workflow of saving and reloading does not work.

Can we explicitly disregard “potential differences”?

No, this is a serious issue regarding the file, and it’s always better to confront it upfront rather than save erroneous files.

Some tensors share memory,

This means that you have two tensors using the same buffer:

import torch

A = torch.zeros((10, 10))
B = A[:2]

A & B now share the same buffer, and B is really indexing into A, which means modifications in A will modify B and vice versa.

Now this is a problem for saving to disk, since after reload you won’t necessarily keep that property, depending on how you load the tensors and whether you load them onto different devices, for instance.

As a result, shared tensors are not allowed on disk. If you want that property, you need to save only A and recreate B after reload (it’s essentially free since A will already be loaded). This is how it’s done within transformers for sharing the token embeddings and the last LM layer (which are the same tensors).
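
For illustration, a minimal sketch of that pattern (not the exact transformers code; the filename is arbitrary):

import torch
from safetensors.torch import save_file, load_file

A = torch.zeros((10, 10))
B = A[:2]                                  # B is a view into A

save_file({"A": A}, "model.safetensors")   # persist only the owning tensor

loaded = load_file("model.safetensors")
A2 = loaded["A"]
B2 = A2[:2]                                # re-create the view: essentially free
assert B2.data_ptr() == A2.data_ptr()      # sharing is restored in memory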

It really depends on what those shared tensors do; sometimes they are actually just duplicates and you can save only one of them. More complicated situations can exist where, say, three tensors refer to each other in non-overlapping ways.

For reference, we recently had trouble with some llama weights being 37 GB instead of 26 GB, and it was linked to something like saving B to disk: torch by default will save the entire A buffer, not just the B view, leading to a massively larger file than necessary (and no way to access A again from this file).
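
As a rough illustration of that blow-up (made-up sizes, but the same mechanism): torch.save serializes the whole underlying storage of a view, not just the viewed slice.

import os
import torch

A = torch.zeros((10_000, 10_000))    # ~400 MB of float32 data
B = A[:2]                            # a tiny 2 x 10_000 view into A

torch.save({"B": B}, "b_view.pt")    # pickles B's entire underlying storage
print(os.path.getsize("b_view.pt"))  # roughly the size of A, not of B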

Does this information help?

If you’re willing to share a bit more details about those tensors, I would be happy to help thinking of solutions

A quick way to fix this is to use tensor.clone() on the shared tensors in the original state_dict, which removes the memory sharing between the view tensors.

It works here because there’s no real reason for the sharing (afaik the weights are actually disjoint). But it’s really hard for this library to infer that, and even then there might be good reasons why the sharing was done.

Like @sgugger said, it should be up to the user to decide.

We can add a function like:

weights = remove_duplicate(weights)
weights = {k: v.contiguous() for k, v in weights.items()}
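
A hypothetical sketch of such a remove_duplicate helper (not an existing safetensors API; it assumes torch >= 2.0 for untyped_storage() and keeps the first name seen per storage, which is only one possible policy):

import torch

def remove_duplicate(weights):
    """Keep only one tensor per underlying storage (first name wins)."""
    seen_storages = set()
    kept = {}
    for name, tensor in weights.items():
        ptr = tensor.untyped_storage().data_ptr()
        if ptr in seen_storages:
            continue  # another tensor already covers this storage
        seen_storages.add(ptr)
        kept[name] = tensor
    return kept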

There are a lot of ways to deal with those shared buffers, and there isn’t just one obvious solution, which is why such a thing is not already done.

You can even have two tensors sharing a buffer that don’t share any data:

import torch
from safetensors.torch import save_file

A = torch.zeros((10, 10))
B = A[::2, :]                           # even rows of A
C = A[1::2, :]                          # odd rows: same storage, no common element
save_file({"B": B, "C": C}, filename)   # raises, since B and C share storage

The same problem occurred for me while trying to convert PyTorch state_dict tensors to safetensors. Error message:

RuntimeError: Error while trying to find names to remove to save state dict, but found no suitable name to keep for saving amongst: {'model.layers.22.mlp.gate_proj.weight', 'model.layers.22.mlp.up_proj.weight'}. None is covering the entire storage.Refusing to save/load the model since you could be storing much more memory than needed. Please refer to https://huggingface.co/docs/safetensors/torch_shared_tensors for more information. Or open an issue.

It turns out that model.layers.22.mlp.gate_proj.weight and model.layers.22.mlp.up_proj.weight are both slices of the same tensor.

A tensor slice in PyTorch is a view operation, which means the slices share the same memory storage. When we save only the sliced tensors, without the original unsliced tensor, _remove_duplicate_names in safetensors raises the error above. The error complains that it cannot find a tensor covering the entire storage (I have not understood why safetensors needs the unsliced tensor).
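
As a side note, a quick way to confirm that two state_dict entries really are views of the same storage (plain PyTorch, torch >= 2.0; assumes the state_dict is already loaded):

import torch

def share_storage(a: torch.Tensor, b: torch.Tensor) -> bool:
    # Views of the same buffer report the same storage base pointer,
    # even when the tensors themselves start at different offsets.
    return a.untyped_storage().data_ptr() == b.untyped_storage().data_ptr()

print(share_storage(state_dict["model.layers.22.mlp.gate_proj.weight"],
                    state_dict["model.layers.22.mlp.up_proj.weight"]))  # True here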

A quick way to fix this is to use tensor.clone() on the shared tensors in the original state_dict, which removes the memory sharing between the view tensors.
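
A minimal sketch of that workaround (the checkpoint path is just an example; note that cloning materializes every view, so the resulting file can be larger than strictly necessary, which is exactly the trade-off discussed above):

import torch
from safetensors.torch import save_file

state_dict = torch.load("pytorch_model.bin", map_location="cpu")  # example path

# clone() gives each tensor its own storage, breaking the sharing that
# save_file refuses to serialize; contiguous() ensures a dense layout.
state_dict = {k: v.clone().contiguous() for k, v in state_dict.items()}
save_file(state_dict, "model.safetensors")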

Just for future reference: I also get the mentioned RuntimeError (Some tensors share memory, this will lead to duplicate memory on disk and potential differences when loading them again) when using DeepSpeed multi-node. For now I will not use safetensors during training and will do the conversion after training instead.

I resolved the CUDA out of memory issue by modifying the compute_metrics() and preprocess_logits_for_metrics() functions. Now, I can successfully execute the script without using DeepSpeed.

import numpy as np
import torch
from transformers import Trainer

# tokenizer and eval_metrics are defined elsewhere in the script

def compute_metrics(pred):
    labels_ids = pred.label_ids
    pred_ids = pred.predictions[0]

    preds = np.where(pred_ids != -100, pred_ids, tokenizer.pad_token_id)
    pred_str = tokenizer.batch_decode(preds, skip_special_tokens=True)
    labels_ids[labels_ids == -100] = tokenizer.pad_token_id
    label_str = tokenizer.batch_decode(labels_ids, skip_special_tokens=True)
    metrics = {k: v.compute(predictions=pred_str, references=label_str) for k, v in eval_metrics.items()}
    return metrics

def preprocess_logits_for_metrics(logits, labels):
    """
    Original Trainer may have a memory leak.
    This is a workaround to avoid storing too many tensors that are not needed.
    """
    pred_ids = torch.argmax(logits, dim=-1)
    return pred_ids, labels

trainer = Trainer(
    compute_metrics=compute_metrics,
    preprocess_logits_for_metrics=preprocess_logits_for_metrics,
    ....
)

How can I reproduce this?

The error occurred when I executed this script: run_speech_recognition_ctc_adapter.py

python run_speech_recognition_ctc_adapter.py \
    --dataset_name="common_voice" \
    --model_name_or_path="facebook/mms-1b-all" \
    --dataset_config_name="zh-TW" \
    --output_dir="./wav2vec2-common_voice-tr-mms-demo" \
    --num_train_epochs="4" \
    --per_device_train_batch_size="32" \
    --learning_rate="1e-3" \
    --warmup_steps="100" \
    --evaluation_strategy="steps" \
    --text_column_name="sentence" \
    --length_column_name="input_length" \
    --save_steps="200" \
    --eval_steps="100" \
    --save_total_limit="3" \
    --target_language="zh-TW" \
    --gradient_checkpointing \
    --chars_to_ignore , ? . ! - \; \: \" “ % ‘ ” � \
    --fp16 \
    --group_by_length \
    --do_train --do_eval

Hi all - I’m facing this error when saving a transformers model inside a thread pool.

My code looks roughly like this

from safetensors.torch import save_file

def convert(vae, unet, text_encoder):
    # This method is run by a thread

    # convert models here...
    state_dict = {**unet.state_dict(), **vae.state_dict(), **text_encoder.state_dict()}
    
    # trying to apply the solution described above
    state_dict = {k: v.clone().contiguous() for k, v in state_dict.items()}
    
    # save as safetensors
    save_file(state_dict, out_dir)

Interestingly, the issue only occurs for the text_encoder which is of type CLIPTextModel. The diffusers models (unet and vae) don’t complain.

@Narsil Do you have suggestions on how to fix this? I’ve tried to create a deep copy of the weights before shipping them off to the thread, but no luck…

I’m using safetensors==0.3.0 and transformers==4.28.1.

This is the full trace:

line 292, in convert
    save_file(state_dict, out_dir)
  File "/usr/local/lib/python3.8/site-packages/safetensors/torch.py", line 72, in save_file
    serialize_file(_flatten(tensors), filename, metadata=metadata)
  File "/usr/local/lib/python3.8/site-packages/safetensors/torch.py", line 233, in _flatten
    raise RuntimeError(
RuntimeError: Some tensors share memory, this will lead to duplicate memory on disk and potential differences when loading them again: [{'cond_stage_model.transformer.text_model.encoder.layers.3.self_attn.k_proj.weight', 'cond_stage_model.transformer.text_model.encoder.layers.4.self_attn.q_proj.bias', 'cond_stage_model.transformer.text_model.encoder.layers.0.layer_norm1.bias', 'cond_stage_model.transformer.text_model.encoder.layers.3.layer_norm1.bias', 'cond_stage_model.transformer.text_model.encoder.layers.2.layer_norm2.weight', 'cond_stage_model.transformer.text_model.encoder.layers.0.mlp.fc2.weight', 'cond_stage_model.transformer.text_model.encoder.layers.3.layer_norm1.weight', 'cond_stage_model.transformer.text_model.encoder.layers.3.mlp.fc2.weight', 'cond_stage_model.transformer.text_model.encoder.layers.4.self_attn.k_proj.bias', 'cond_stage_model.transformer.text_model.encoder.layers.2.self_attn.out_proj.weight', 'cond_stage_model.transformer.text_model.encoder.layers.0.layer_norm2.weight', 'cond_stage_model.transformer.text_model.encoder.layers.3.mlp.fc1.weight', 'cond_stage_model.transformer.text_model.encoder.layers.3.layer_norm2.weight', 'cond_stage_model.transformer.text_model.encoder.layers.5.layer_norm2.weight', 'cond_stage_model.transformer.text_model.encoder.layers.2.mlp.fc2.weight', 'cond_stage_model.transformer.text_model.encoder.layers.6.self_attn.v_proj.bias', 'cond_stage_model.transformer.text_model.encoder.layers.3.self_attn.v_proj.weight', 'cond_stage_model.transformer.text_model.encoder.layers.1.self_attn.k_proj.weight', 'cond_stage_model.transformer.text_model.encoder.layers.1.self_attn.q_proj.bias', 'cond_stage_model.transformer.text_model.encoder.layers.2.self_attn.v_proj.bias', 'cond_stage_model.transformer.text_model.encoder.layers.4.self_attn.k_proj.weight', 'cond_stage_model.transformer.text_model.encoder.layers.3.self_attn.q_proj.weight', 'cond_stage_model.transformer.text_model.encoder.layers.7.self_attn.out_proj.weight', 'cond_stage_model.transformer.text_model.encoder.layers.6.layer_norm2.weight', 'cond_stage_model.transformer.text_model.encoder.layers.6.self_attn.out_proj.bias', 'cond_stage_model.transformer.text_model.encoder.layers.1.self_attn.out_proj.bias', 'cond_stage_model.transformer.text_model.encoder.layers.2.self_attn.v_proj.weight', 'cond_stage_model.transformer.text_model.encoder.layers.1.self_attn.k_proj.bias', 'cond_stage_model.transformer.text_model.encoder.layers.3.mlp.fc2.bias', 'cond_stage_model.transformer.text_model.encoder.layers.6.self_attn.out_proj.weight', 'cond_stage_model.transformer.text_model.encoder.layers.7.self_attn.k_proj.bias', 'cond_stage_model.transformer.text_model.encoder.layers.0.layer_norm1.weight', 'cond_stage_model.transformer.text_model.encoder.layers.6.self_attn.k_proj.weight', 'cond_stage_model.transformer.text_model.encoder.layers.7.mlp.fc2.weight', 'cond_stage_model.transformer.text_model.encoder.layers.6.layer_norm2.bias', 'cond_stage_model.transformer.text_model.encoder.layers.5.mlp.fc1.weight', 'cond_stage_model.transformer.text_model.encoder.layers.5.mlp.fc1.bias', 'cond_stage_model.transformer.text_model.encoder.layers.6.self_attn.v_proj.weight', 'cond_stage_model.transformer.text_model.encoder.layers.5.self_attn.k_proj.bias', 'cond_stage_model.transformer.text_model.encoder.layers.6.mlp.fc1.weight', 'cond_stage_model.transformer.text_model.encoder.layers.1.layer_norm2.bias', 'cond_stage_model.transformer.text_model.encoder.layers.5.self_attn.out_proj.bias', 
'cond_stage_model.transformer.text_model.encoder.layers.2.layer_norm1.weight', 'cond_stage_model.transformer.text_model.encoder.layers.2.mlp.fc1.bias', 'cond_stage_model.transformer.text_model.encoder.layers.4.self_attn.q_proj.weight', 'cond_stage_model.transformer.text_model.encoder.layers.7.mlp.fc1.bias', 'cond_stage_model.transformer.text_model.encoder.layers.4.layer_norm2.weight', 'cond_stage_model.transformer.text_model.encoder.layers.4.self_attn.v_proj.bias', 'cond_stage_model.transformer.text_model.encoder.layers.3.self_attn.out_proj.weight', 'cond_stage_model.transformer.text_model.encoder.layers.8.self_attn.k_proj.weight', 'cond_stage_model.transformer.text_model.encoder.layers.1.mlp.fc2.bias', 'cond_stage_model.transformer.text_model.encoder.layers.8.self_attn.q_proj.bias', 'cond_stage_model.transformer.text_model.encoder.layers.2.self_attn.out_proj.bias', 'cond_stage_model.transformer.text_model.encoder.layers.2.layer_norm2.bias', 'cond_stage_model.transformer.text_model.encoder.layers.2.self_attn.q_proj.weight', 'cond_stage_model.transformer.text_model.encoder.layers.4.mlp.fc2.bias', 'cond_stage_model.transformer.text_model.encoder.layers.2.layer_norm1.bias', 'cond_stage_model.transformer.text_model.encoder.layers.7.layer_norm2.bias', 'cond_stage_model.transformer.text_model.encoder.layers.1.layer_norm1.bias', 'cond_stage_model.transformer.text_model.encoder.layers.4.self_attn.v_proj.weight', 'cond_stage_model.transformer.text_model.encoder.layers.1.layer_norm1.weight', 'cond_stage_model.transformer.text_model.encoder.layers.3.self_attn.out_proj.bias', 'cond_stage_model.transformer.text_model.encoder.layers.7.self_attn.k_proj.weight', 'cond_stage_model.transformer.text_model.encoder.layers.0.self_attn.k_proj.bias', 'cond_stage_model.transformer.text_model.encoder.layers.6.layer_norm1.bias', 'cond_stage_model.transformer.text_model.encoder.layers.6.mlp.fc2.bias', 'cond_stage_model.transformer.text_model.encoder.layers.7.mlp.fc1.weight', 'cond_stage_model.transformer.text_model.encoder.layers.8.self_attn.v_proj.weight', 'cond_stage_model.transformer.text_model.encoder.layers.2.self_attn.k_proj.bias', 'cond_stage_model.transformer.text_model.encoder.layers.4.mlp.fc1.bias', 'cond_stage_model.transformer.text_model.encoder.layers.1.self_attn.q_proj.weight', 'cond_stage_model.transformer.text_model.encoder.layers.7.self_attn.v_proj.bias', 'cond_stage_model.transformer.text_model.encoder.layers.5.mlp.fc2.weight', 'cond_stage_model.transformer.text_model.encoder.layers.0.mlp.fc1.bias', 'cond_stage_model.transformer.text_model.encoder.layers.1.mlp.fc2.weight', 'cond_stage_model.transformer.text_model.encoder.layers.8.self_attn.v_proj.bias', 'cond_stage_model.transformer.text_model.encoder.layers.1.self_attn.v_proj.bias', 'cond_stage_model.transformer.text_model.encoder.layers.0.self_attn.v_proj.bias', 'cond_stage_model.transformer.text_model.encoder.layers.0.mlp.fc1.weight', 'cond_stage_model.transformer.text_model.encoder.layers.0.self_attn.out_proj.weight', 'cond_stage_model.transformer.text_model.encoder.layers.6.mlp.fc1.bias', 'cond_stage_model.transformer.text_model.encoder.layers.0.mlp.fc2.bias', 'cond_stage_model.transformer.text_model.encoder.layers.5.self_attn.q_proj.weight', 'cond_stage_model.transformer.text_model.encoder.layers.5.self_attn.v_proj.weight', 'cond_stage_model.transformer.text_model.encoder.layers.7.mlp.fc2.bias', 'cond_stage_model.transformer.text_model.encoder.layers.7.layer_norm2.weight', 
'cond_stage_model.transformer.text_model.encoder.layers.1.layer_norm2.weight', 'cond_stage_model.transformer.text_model.encoder.layers.5.self_attn.q_proj.bias', 'cond_stage_model.transformer.text_model.encoder.layers.7.self_attn.out_proj.bias', 'cond_stage_model.transformer.text_model.encoder.layers.2.self_attn.k_proj.weight', 'cond_stage_model.transformer.text_model.encoder.layers.7.layer_norm1.bias', 'cond_stage_model.transformer.text_model.encoder.layers.3.layer_norm2.bias', 'cond_stage_model.transformer.text_model.encoder.layers.5.layer_norm1.bias', 'cond_stage_model.transformer.text_model.encoder.layers.6.self_attn.q_proj.bias', 'cond_stage_model.transformer.text_model.encoder.layers.3.self_attn.k_proj.bias', 'cond_stage_model.transformer.text_model.encoder.layers.4.self_attn.out_proj.bias', 'cond_stage_model.transformer.text_model.encoder.layers.2.self_attn.q_proj.bias', 'cond_stage_model.transformer.text_model.encoder.layers.4.mlp.fc2.weight', 'cond_stage_model.transformer.text_model.encoder.layers.5.self_attn.v_proj.bias', 'cond_stage_model.transformer.text_model.encoder.layers.1.self_attn.out_proj.weight', 'cond_stage_model.transformer.text_model.encoder.layers.7.layer_norm1.weight', 'cond_stage_model.transformer.text_model.encoder.layers.8.self_attn.k_proj.bias', 'cond_stage_model.transformer.text_model.encoder.layers.4.layer_norm1.weight', 'cond_stage_model.transformer.text_model.encoder.layers.4.mlp.fc1.weight', 'cond_stage_model.transformer.text_model.encoder.layers.0.self_attn.out_proj.bias', 'cond_stage_model.transformer.text_model.encoder.layers.7.self_attn.v_proj.weight', 'cond_stage_model.transformer.text_model.encoder.layers.6.layer_norm1.weight', 'cond_stage_model.transformer.text_model.encoder.layers.6.mlp.fc2.weight', 'cond_stage_model.transformer.text_model.encoder.layers.0.layer_norm2.bias', 'cond_stage_model.transformer.text_model.encoder.layers.5.self_attn.out_proj.weight', 'cond_stage_model.transformer.text_model.encoder.layers.5.mlp.fc2.bias', 'cond_stage_model.transformer.text_model.encoder.layers.3.self_attn.q_proj.bias', 'cond_stage_model.transformer.text_model.encoder.layers.7.self_attn.q_proj.bias', 'cond_stage_model.transformer.text_model.encoder.layers.2.mlp.fc1.weight', 'cond_stage_model.transformer.text_model.encoder.layers.2.mlp.fc2.bias', 'cond_stage_model.transformer.text_model.encoder.layers.4.layer_norm1.bias', 'cond_stage_model.transformer.text_model.encoder.layers.5.self_attn.k_proj.weight', 'cond_stage_model.transformer.text_model.encoder.layers.4.layer_norm2.bias', 'cond_stage_model.transformer.text_model.encoder.layers.0.self_attn.q_proj.bias', 'cond_stage_model.transformer.text_model.encoder.layers.7.self_attn.q_proj.weight', 'cond_stage_model.transformer.text_model.encoder.layers.5.layer_norm1.weight', 'cond_stage_model.transformer.text_model.encoder.layers.0.self_attn.q_proj.weight', 'cond_stage_model.transformer.text_model.encoder.layers.5.layer_norm2.bias', 'cond_stage_model.transformer.text_model.encoder.layers.3.self_attn.v_proj.bias', 'cond_stage_model.transformer.text_model.encoder.layers.3.mlp.fc1.bias', 'cond_stage_model.transformer.text_model.encoder.layers.6.self_attn.q_proj.weight', 'cond_stage_model.transformer.text_model.encoder.layers.0.self_attn.k_proj.weight', 'cond_stage_model.transformer.text_model.encoder.layers.8.self_attn.q_proj.weight', 'cond_stage_model.transformer.text_model.embeddings.token_embedding.weight', 'cond_stage_model.transformer.text_model.encoder.layers.1.mlp.fc1.weight', 
'cond_stage_model.transformer.text_model.encoder.layers.0.self_attn.v_proj.weight', 'cond_stage_model.transformer.text_model.embeddings.position_embedding.weight', 'cond_stage_model.transformer.text_model.encoder.layers.1.mlp.fc1.bias', 'cond_stage_model.transformer.text_model.encoder.layers.6.self_attn.k_proj.bias', 'cond_stage_model.transformer.text_model.encoder.layers.1.self_attn.v_proj.weight', 'cond_stage_model.transformer.text_model.encoder.layers.4.self_attn.out_proj.weight'}]

EDIT: Much later on I found what caused the above; maybe it can help someone in a similar situation:

As it turns out, the problem was already introduced at load time: I had multiple threads loading a safetensors file from disk at the same time.

from diffusers import UNet2DConditionModel
model = UNet2DConditionModel.from_pretrained(unet_dir)

But every once in a while this caused weights to not be loaded properly, and when saving the models later on they would raise the above error. To mitigate the issue I created a threading lock:

import threading

from diffusers import UNet2DConditionModel

load_from_disk_lock = threading.Lock()

with load_from_disk_lock:
    model = UNet2DConditionModel.from_pretrained(unet_dir)

This properly fixed things. Anyway, hopefully this can help people.