DeepSpeed: Enabling deepspeed.zero.Init causes very strange spikes in PPO policy_loss

Hi, we’re the OpenRLHF team. We build our RLHF framework heavily on DeepSpeed and really appreciate your great work.

Recently we encountered a problem with deepspeed.zero.Init. We want to enable HfDeepSpeedConfig to support larger models (70B+) with ZeRO-3. But after enabling HfDeepSpeedConfig, we see some very strange spikes in the PPO policy_loss. If HfDeepSpeedConfig is disabled, the policy_loss is normal and the same as with ZeRO-2.

Below are the wandb metrics from Llama-7B PPO training:

  • red([ppo_0104T10:37]): ZeRO-3 w/o HfDeepSpeedConfig
  • purple([ppo_0105T13:01]): ZeRO-3 w/ HfDeepSpeedConfig

The only difference is whether HfDeepSpeedConfig is enabled or not. The reproduction script is here: https://github.com/OpenLLMAI/OpenRLHF/blob/main/examples/scripts/train_ppo_llama_ray_70b.sh

BTW, I debugged the parameters’ sharded tensors via parameter.ds_tensor, and they’re the same whether HfDeepSpeedConfig is enabled or not! So we really don’t know why. Can you offer some clues so we can keep debugging?

cc @stas00

About this issue

  • State: open
  • Created 6 months ago
  • Reactions: 2
  • Comments: 68 (33 by maintainers)

Most upvoted comments

@stas00, amazing debugging as always!

Thank you, Tunji.

As promised I started to write down the debug methodologies I developed in this repo:

https://github.com/stas00/the-art-of-debugging

and meanwhile we reduced the problem further to discover that it’s div that produces different results depending on the device it’s run on:

import torch, struct
def binary_double(num):
    print(''.join(f'{c:0>8b}' for c in struct.pack('!d', num)))
binary_double(torch.tensor(9, device='cuda')/10)
binary_double(torch.tensor(9, device='cpu' )/10)
Output (the two results differ by a single mantissa bit):

0011111111101100110011001100110011100000000000000000000000000000
0011111111101100110011001100110011000000000000000000000000000000

Hi everyone 👋 FYI, we’re having a look at this issue from the transformers side, including potentially forcing the computations of RoPE to be done in FP32. This issue will track it.

Glad to see you have implemented a workaround that solves your problem, @wuxibin89 and @hijkzzz

The other, cleaner workaround that I suggested would be to init a tiny model on CPU first and then copy the buffers to the GPU-inited model. The only requirement for that model is that the dims match. You can use from_config() to do that. See https://github.com/stas00/ml-engineering/blob/fcc3ebe2e09d5f019d4a779ef6678fcac0690b00/transformers/make-tiny-models.md#L36-L52 as an example.
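For reference, a rough, untested sketch of that workaround could look like the following. It assumes the Llama module layout quoted later in this thread (model.layers.N.self_attn.rotary_emb.{inv_freq,cos_cached,sin_cached}); pretrain_path and the one-layer shrink are illustrative, and `model` stands for the zero.Init'ed model from the repro snippets above:

import torch
from transformers import AutoConfig, AutoModelForCausalLM

pretrain_path = "meta-llama/Llama-2-70b-hf"  # placeholder

# 1. Tiny reference model on CPU: only the dims the RoPE buffers depend on
#    (head dim, max_position_embeddings) must match, so one layer is enough.
tiny_cfg = AutoConfig.from_pretrained(pretrain_path)
tiny_cfg.num_hidden_layers = 1
tiny_model = AutoModelForCausalLM.from_config(tiny_cfg)  # instantiated on CPU, fp32

# RoPE buffers are identical for every layer, so take them from layer 0.
cpu_rope = {k: v.clone() for k, v in
            tiny_model.model.layers[0].self_attn.rotary_emb.named_buffers()}

# 2. `model` is the zero.Init'ed (GPU-initialized) model. Buffers are not
#    partitioned by ZeRO-3, so they can simply be overwritten in place.
for layer in model.model.layers:
    for name, buf in layer.self_attn.rotary_emb.named_buffers():
        buf.copy_(cpu_rope[name])  # copy_ casts to the buffer's device/dtype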

I think the correct solution here is for HF to store the ground-truth buffer values in the model’s checkpoint on the HF hub, so that when the model is loaded it doesn’t matter whether the math’s outcome differs from the values the model was trained with, as it’ll be overwritten by the stored buffers from the checkpoint.
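As a small illustration of the mechanism such a proposal would rely on (purely a sketch, not transformers code): buffers registered as persistent end up in the module's state_dict, so they would travel with the checkpoint instead of being recomputed at load time.

import torch
import torch.nn as nn

class RoPECache(nn.Module):
    def __init__(self, dim=8, base=10000):
        super().__init__()
        inv_freq = 1.0 / (base ** (torch.arange(0, dim, 2, dtype=torch.int64).float() / dim))
        # persistent=True (the default) puts the buffer into state_dict(),
        # so the exact trained values would be serialized with the checkpoint
        self.register_buffer("inv_freq", inv_freq, persistent=True)

print(RoPECache().state_dict().keys())  # odict_keys(['inv_freq'])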

I think the issue you have discovered may impact many other users, so you should at least bring it up as an Issue at HF Transformers, and possibly propose the long-term solution I suggested in the previous paragraph.

Here is the final repro w/o deepspeed involved:

import torch

def do_math(device):
    inv_freq = (10 ** (torch.arange(0, 10, device=device) / 10))
    print(f"{inv_freq[9]:.20f}")
    return inv_freq.cpu()

a = do_math(torch.device("cpu"))
b = do_math(torch.device("cuda"))

torch.testing.assert_close(a, b, rtol=0.0, atol=0.0)
Output:

7.94328212738037109375
7.94328308105468750000
Traceback (most recent call last):
  File "test.py", line 11, in <module>
    torch.testing.assert_close(a, b, rtol=0.0, atol=0.0)
  File "/home/stas/anaconda3/envs/py39-pt21/lib/python3.9/site-packages/torch/testing/_comparison.py", line 1520, in assert_close
    raise error_metas[0].to_error(msg)
AssertionError: Tensor-likes are not equal!

Mismatched elements: 2 / 10 (20.0%)
Greatest absolute difference: 9.5367431640625e-07 at index (9,)
Greatest relative difference: 1.200604771156577e-07 at index (9,)

Ross, do you have resources to take this to HF Transformers? It doesn’t look like @hijkzzz and @wuxibin89 are planning to take it there.

I was helping with the diagnostics but I’m no longer at HF to take the lead.

Or could you perhaps recommend who should be tagged from HF so that they could take the lead?

I scanned and found ~66 code blocks across many models that use float dtypes in position-embedding-style enumerations that should be left as torch.long. I’ve submitted this and will likely create an issue for it; it’s already posted in Slack there.

Note that GPT-NeoX and the original Llama do this correctly:

https://github.com/EleutherAI/gpt-neox/blame/63991555ec082c8f80c475f851d008193b10008c/megatron/model/positional_embeddings.py#L27

https://github.com/facebookresearch/llama/blob/ef351e9cd9496c579bf9f2bb036ef11bdc5ca3d2/llama/model.py#L100-L103

And Transformers does not. So I feel Transformers needs some fixes; this isn’t the only instance of the pattern where a float dtype has been used in the arange for an int enumeration …

If I compare just “model.layers.0.self_attn.rotary_emb.sin_cached”:

Mismatched elements: 324808 / 524288 (62.0%)
Greatest absolute difference: 2.0 at index (1211, 0)
Greatest relative difference: 2128.0 at index (3305, 26)

So yeah, that’s a huge difference, and I feel it is definitely the arange issue pointed out above. Any use of the zero.Init() context manager with a low-precision dtype will trash the range > 256 for bfloat16 or > 2048 for float16. The dtype for arange should always be torch.long / torch.int, cast to a float dtype as late as possible.

EDIT: The div issue really should have little to no impact compared to the above.
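To make the arange point concrete, here is a tiny standalone illustration (not the transformers code itself) of why a low-precision position enumeration breaks above those thresholds:

import torch

# positions created directly in a low-precision float dtype (what an init-time
# dtype patch can cause) lose exactness above 256 for bf16 / 2048 for fp16
t_bad = torch.arange(0, 4096, dtype=torch.bfloat16)
t_good = torch.arange(0, 4096, dtype=torch.int64).float()  # enumerate as ints, cast late

print(t_bad[257])   # not 257.0 -- 257 is not exactly representable in bfloat16
print(t_good[257])  # tensor(257.)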

@stas00 Thanks for your amazing work, that’s a very impressive debugging process. To stay compatible with our previous ZeRO-2 experiments, we prefer the CPU way, which is to initialize the RoPE buffers on CPU first and then copy them to GPU.

As a quick proof, I did a nasty hack to modeling_llama.py.

diff --git a/src/transformers/models/llama/modeling_llama.py b/src/transformers/models/llama/modeling_llama.py
index 8ceee2d1d..f6d108986 100644
--- a/src/transformers/models/llama/modeling_llama.py
+++ b/src/transformers/models/llama/modeling_llama.py
@@ -124,6 +124,10 @@ class LlamaRotaryEmbedding(nn.Module):
     def __init__(self, dim, max_position_embeddings=2048, base=10000, device=None):
         super().__init__()
 
+        import deepspeed
+        hooked_arange = torch.arange
+        torch.arange = deepspeed.runtime.zero.partition_parameters._orig_torch_arange
+
         self.dim = dim
         self.max_position_embeddings = max_position_embeddings
         self.base = base
@@ -134,6 +138,10 @@ class LlamaRotaryEmbedding(nn.Module):
         self._set_cos_sin_cache(
             seq_len=max_position_embeddings, device=self.inv_freq.device, dtype=torch.get_default_dtype()
         )
+        self.cos_cached = self.cos_cached.to("cuda")
+        self.sin_cached = self.sin_cached.to("cuda")
+
+        torch.arange = hooked_arange
 
     def _set_cos_sin_cache(self, seq_len, device, dtype):
         self.max_seq_len_cached = seq_len

This hack forces the RoPE buffers to be initialized on CPU and then copied to GPU. With it, the difference in my reproduction script is gone.

Moreover, I quickly ran our e2e PPO training script with ZeRO-3 and deepspeed.zero.Init. The first 10 steps show that the spike in policy_loss is gone! I will do some more experiments later.


@hijkzzz and @wuxibin89, so now that we understand well what’s going on how do we support your need?

How do you tell which of the 2 behaviours is the preferred (or correct?) one?

If you want the CPU way as a short-term hack, you could update the model’s buffers with their values calculated separately on CPU. I think since those buffers are invariant to the model size, you could even create a tiny model on CPU and then copy them over to the zero.Init’ed model.

I’m also thinking that in this situation perhaps Llama-2 should store its exact buffers, as they were when it was trained, in the distributed model files, so that the user always gets the exact weights - since clearly this leads to ambiguity and possibly invalid inference outcomes. So you may want to consider opening a request at HF Transformers?

It’s not a CUDA bug, it’s just how the hardware works. There are discrepancies even between 2 NVIDIA GPUs; e.g. V100 and A100 won’t always produce the same math output.

You can find a ton of threads discussing this: https://www.google.ca/search?q=pytorch+math+discrepancies+cuda+cpu

This gotcha should just be documented with a pointer to this Issue for those who want to understand it better.

Now which mode is better for the inference user will depend on how the model was instantiated for training. If it was on cpu, then loading it on cpu will lead to the closest outcome. If on gpu, then loading on gpu will be the closest (i.e. with zero.Init)

@stas00, amazing debugging as always! I am stumped as to the solution here. Shall we file this as a cuda bug?

Thank you for making a super-easy to setup and use repro script, @hijkzzz and @wuxibin89

So first I tested that the issue has nothing to do with HfDeepSpeedConfig - I can repro the problem by removing it completely and just wrapping from_pretrained in zero.Init:

    kwargs = dict(
        trust_remote_code=True,
        attn_implementation="flash_attention_2",
        torch_dtype=torch.bfloat16,
    )

    if args.enable_hf_deespeed and ds_config["zero_optimization"]["stage"] == 3:
        print("==========enable zero Init ============")
        with deepspeed.zero.Init(config_dict_or_path=ds_config):
            model = AutoModelForCausalLM.from_pretrained(args.pretrain, **kwargs)
    else:
        model = AutoModelForCausalLM.from_pretrained(args.pretrain,**kwargs)

I also reduced it to just one sample and a sub-section of it:

    start, end = (600, 900)
    input_ids = input_ids[0][start:end].unsqueeze(0)
    attention_mask = attention_mask[0][start:end].unsqueeze(0)

already shows a largish diff, and it doesn’t fail if I make the slice smaller - like 200-250 tokens. It seems to start diverging at ~300 tokens.

Finally, I validated that it has to do with half-precision. If I switch to fp32 in ds config, the problem goes away (tiny difference of 1.e-5). Both bf16 and fp16 manifest this.

I thought that perhaps it might have had to do with the RNG, but resetting the seed after model loading didn’t seem to make a difference.

Oh, and I had to hack transformers to deal with the removal of HfDeepSpeedConfig, otherwise it’d fail to load the weights.

diff --git a/src/transformers/modeling_utils.py b/src/transformers/modeling_utils.py
index 9c4639b47..ce3a3e1c4 100644
--- a/src/transformers/modeling_utils.py
+++ b/src/transformers/modeling_utils.py
@@ -605,7 +605,7 @@ def _load_state_dict_into_model(model_to_load, state_dict, start_prefix):
         # Parameters of module and children will start with prefix. We can exit early if there are none in this
         # state_dict
         if len([key for key in state_dict if key.startswith(prefix)]) > 0:
-            if is_deepspeed_zero3_enabled():
+            if 1 or is_deepspeed_zero3_enabled():
                 import deepspeed

                 # In sharded models, each shard has only part of the full state_dict, so only gather

So it has something to do with some precision issue that gets aggravated with more tokens.

@stas00 Some of the issues, like having the calc move to the GPU or be done in lower precision, might be helped by that, but it doesn’t solve the need to recalc on the fly or at init for a different seq len. Also, thinking about it more, the embed should really be applied to q/k in float32; having them in buffers means DeepSpeed will downcast them, and I’m not sure that can easily be avoided unless you prevent them from being buffers of any sort…

Yeah, it could be silently impacting many people using this sort of init-patching context manager. Is DS the only common one?

I don’t know if FSDP does it, but they probably should, as loading a huge model on FSDP is a huge problem right now.

I think that if a user sets torch.cuda.set_device("cuda") it will have the same effect.

Also if the user or a framework calls torch.set_default_dtype() it could also have a problem unless an explicit dtype is used during the op. And we know HF transformers does that here:

https://github.com/huggingface/transformers/blob/f40b87de0ca234df61f76928956c4a2118c0b548/src/transformers/modeling_utils.py#L1430

so I think this problem again isn’t DS-specific but a much wider one.

What do you think about my proposal to store the buffers with the model weights in the checkpoint? Then it’s even easier because it’ll always be correct.

low_cpu_mem_usage=True will cause OOM for 70B models

OK, so the problem is this: the math on CPU and on CUDA isn’t the same!

the trigger for this problem is this:

https://github.com/microsoft/DeepSpeed/blob/e62a47e2e83647cc119c39cc7cf98fcfb5451cba/deepspeed/runtime/zero/partition_parameters.py#L241-L243

If I add

kwargs['device'] = None

this forces the code in the repro to be calculated on the same device (cpu), and then the math is exact.
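As a toy illustration of that trigger (a mimic of the device-injection idea, not DeepSpeed's actual wrapper), forcing a device into arange moves the whole downstream computation to that device:

import torch

_orig_arange = torch.arange

def patched_arange(*args, **kwargs):
    # mimic a zero.Init-style wrapper: inject a device if none was given
    kwargs.setdefault("device", "cuda")  # requires a CUDA device
    return _orig_arange(*args, **kwargs)

torch.arange = patched_arange
gpu_val = (10 ** (torch.arange(0, 10) / 10))[9].item()  # computed on CUDA

torch.arange = _orig_arange  # device injection removed (the kwargs['device'] = None experiment)
cpu_val = (10 ** (torch.arange(0, 10) / 10))[9].item()  # computed on CPU

print(gpu_val == cpu_val)  # False on the hardware tested earlier in this thread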

So we are moving to the next level here: the discrepancy comes from the devices, which don’t do math in exactly the same way, and the outcomes aren’t guaranteed to be identical.

  1. w/ zero Init the model is created on CUDA
  2. w/o zero Init the model is created on CPU

So since math(CUDA) != math(CPU), that is why you get different results.

A one-time small math-outcome discrepancy makes little difference at first, but it gets compounded as it goes through dozens of layers doing hundreds of matmuls, spreading and increasing the discrepancy wider and wider.

So if my diagnosis makes sense, IMO there are no bugs in DeepSpeed ZeRO wrt this outcome-discrepancy issue.

But this nuance should be documented as it’s very important.

It’s the buffers! Llama-2 has 3 buffers per layer, e.g. for layer 28:

model.layers.28.self_attn.rotary_emb.inv_freq
model.layers.28.self_attn.rotary_emb.cos_cached
model.layers.28.self_attn.rotary_emb.sin_cached

Here is the proof:

    kwargs = dict(
        trust_remote_code=True,
        torch_dtype=torch.bfloat16,
    )

    model_a = AutoModelForCausalLM.from_pretrained(args.pretrain, **kwargs)
    buffers_a = dict(model_a.named_buffers())
    del model_a

    with deepspeed.zero.Init(config_dict_or_path=ds_config):
        model_b = AutoModelForCausalLM.from_pretrained(args.pretrain, **kwargs)
    buffers_b = {k: v.cpu() for k, v in dict(model_b.named_buffers()).items()}

    torch.testing.assert_close(buffers_a, buffers_b, rtol=0.0, atol=0.0)

with bf16 there is a 63% difference:

Mismatched elements: 329332 / 524288 (62.8%)
Greatest absolute difference: 2.0 at index (1148, 2)
Greatest relative difference: 8320.0 at index (3657, 11)

with fp32 there is almost no difference, but it’s not 100% exact!

Mismatched elements: 88 / 524288 (0.0%)
Greatest absolute difference: 0.00390625 at index (2308, 47)
Greatest relative difference: 0.00762939453125 at index (4060, 22)

@tjruwase, now that we know the source of the problem, passing this to you now.

It must be something about how buffers are treated differently than params under zero.Init

@tjruwase @stas00 I created a reproduction script here with our very first micro-batch of data: https://gist.github.com/wuxibin89/7d4801d62a743d5bbc72b507e8f0e326

We can see a significant difference in the logits with .max()/.min():

>>> import torch
>>> s0 = torch.load("result_False.pt", map_location="cpu")
>>> s1 = torch.load("result_True.pt", map_location="cpu")
>>> (s0["log_probs"] - s1["log_probs"]).sum()
tensor(51.7910)
>>> (s0["log_probs"] - s1["log_probs"]).mean()
tensor(0.0020)
>>> (s0["log_probs"] - s1["log_probs"]).max()
tensor(4.3943)
>>> (s0["log_probs"] - s1["log_probs"]).min()
tensor(-2.8594)


OK, the whole area of having more than 1 model is new and not well tested. When I designed HfDeepSpeedConfig’s side that controls whether we enclose zero.Init around model loading in HF Transformers, it was always with just one model in mind, as DeepSpeed didn’t support multiple models until around last summer. And back then it was clear that if you wanted to use stage 3 you’d want zero.Init always on.

The HF Accelerate/DeepSpeed integration, which was done relatively recently, already includes a flag to turn zero.Init on/off. It shouldn’t be complicated to extend HfDeepSpeedConfig with an optional flag for whether zero.Init should be on or off, but it’d also require tweaks to HF Transformers’ modeling_utils.py: currently it just checks is_deepspeed_zero3_enabled, which would need to be changed to a new helper, is_deepspeed_zero_init_enabled, which doesn’t currently exist.

You might be able to hack this function to always return False, which would disable zero.Init: https://github.com/huggingface/transformers/blob/29a2b1420633d322140062d7c76b807f41fb90aa/src/transformers/integrations/deepspeed.py#L261 I don’t remember if it’s being used elsewhere, but you can just hack https://github.com/huggingface/transformers/blob/main/src/transformers/modeling_utils.py to replace it with False wherever it’s called, just in that file.
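If you go that route, a minimal, version-dependent sketch of the monkey-patch could look like this; it assumes modeling_utils imports is_deepspeed_zero3_enabled by name, as in the file linked above, so verify against your transformers version:

import transformers.modeling_utils as modeling_utils

# modeling_utils calls the imported name, so overriding that reference disables
# the zero.Init-specific code paths just in that file
modeling_utils.is_deepspeed_zero3_enabled = lambda: False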

HF Accelerate, for example, still doesn’t support multiple DeepSpeed models (I’m not sure about other frameworks), so when we needed more than one model we used the workaround of wrapping 2 models into a super-model and using the super-model’s forward to do this or that on each model. This works well if it fits your training mode.
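For context, a minimal sketch of that super-model pattern; the actor/critic names and the forward contract are illustrative assumptions, not OpenRLHF code:

import torch.nn as nn

class SuperModel(nn.Module):
    """Wrap two models so a single DeepSpeed engine owns both sets of parameters."""

    def __init__(self, actor: nn.Module, critic: nn.Module):
        super().__init__()
        self.actor = actor
        self.critic = critic

    def forward(self, input_ids, attention_mask=None):
        # run both sub-models; the training loop decides how to combine the outputs
        actor_out = self.actor(input_ids=input_ids, attention_mask=attention_mask)
        critic_out = self.critic(input_ids=input_ids, attention_mask=attention_mask)
        return actor_out, critic_out

# engine, _, _, _ = deepspeed.initialize(model=SuperModel(actor, critic), config=ds_config)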

So chances are that you’re running into this problem because you have one DS engine and two models, and I don’t think that can work - or at least it shouldn’t.

So you can do 2 things:

  1. Try the super-model approach if it fits the workflow
  2. If you need 2 separate engines - do not rely on HfDeepSpeedConfig; manually add zero.Init around each model, instantiate 2 DeepSpeed engines - one for each model - and drive each of them separately (see the sketch below).
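A rough sketch of option 2, with placeholder config/checkpoint names (ds_config_actor, ds_config_critic, actor_path, critic_path); note the earlier caveat in this thread that loading pretrained weights under zero.Init without HfDeepSpeedConfig may require hacking modeling_utils.py:

import deepspeed
import torch
from transformers import AutoModelForCausalLM

# one zero.Init context and one engine per model, each driven by its own config;
# pass your optimizer / model_parameters to deepspeed.initialize as usual for training
with deepspeed.zero.Init(config_dict_or_path=ds_config_actor):
    actor = AutoModelForCausalLM.from_pretrained(actor_path, torch_dtype=torch.bfloat16)
actor_engine, *_ = deepspeed.initialize(model=actor, config=ds_config_actor)

with deepspeed.zero.Init(config_dict_or_path=ds_config_critic):
    critic = AutoModelForCausalLM.from_pretrained(critic_path, torch_dtype=torch.bfloat16)
critic_engine, *_ = deepspeed.initialize(model=critic, config=ds_config_critic)

# the training loop then steps actor_engine and critic_engine independently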

Please let me know if it’s helpful.