DeepSpeed: Enabling deepspeed.zero.Init causes very strange spikes in PPO policy_loss
Hi, we're the OpenRLHF team. We build our RLHF framework heavily on DeepSpeed and really appreciate your great work.
Recently we encountered a problem with deepspeed.zero.Init. We want to enable HfDeepSpeedConfig to support larger models (70B+) with ZeRO-3. But after enabling HfDeepSpeedConfig, we see some very strange spikes in the PPO policy_loss. If HfDeepSpeedConfig is disabled, the policy_loss is normal and matches ZeRO-2.
Below are the wandb metrics for Llama-7B PPO training:
- red([ppo_0104T10:37]): ZeRO-3 w/o HfDeepSpeedConfig
- purple([ppo_0105T13:01]): ZeRO-3 w/ HfDeepSpeedConfig
The only difference is whether HfDeepSpeedConfig is enabled. The reproduction script is here: https://github.com/OpenLLMAI/OpenRLHF/blob/main/examples/scripts/train_ppo_llama_ray_70b.sh
BTW, I debugged the parameters' sharded tensors via `parameter.ds_tensor`, and they're the same whether HfDeepSpeedConfig is enabled or not! So we really don't know why. Can you offer some clues so we can keep debugging?
cc @stas00
About this issue
- State: open
- Created 6 months ago
- Reactions: 2
- Comments: 68 (33 by maintainers)
Thank you, Tunji.
As promised, I started to write down the debugging methodologies I developed in this repo:
https://github.com/stas00/the-art-of-debugging
and meanwhile we reduced the problem further and discovered it's `div` that's non-deterministic w.r.t. the device it runs on.

Hi everyone 👋 FYI, we're having a look at this issue from the `transformers` side, including potentially forcing the computations of RoPE to be done in FP32. This issue will track it.

Glad to see you have implemented a workaround that solves your problem, @wuxibin89 and @hijkzzz
The other cleaner workaround that I suggested would be to init a tiny model on cpu first and then copy the buffers to the gpu-inited model. The only requirement for that model is that `dim` matches. You can use `from_config()` to do that. See https://github.com/stas00/ml-engineering/blob/fcc3ebe2e09d5f019d4a779ef6678fcac0690b00/transformers/make-tiny-models.md#L36-L52 as an example (a sketch of the buffer copy follows below).

I think the correct solution here is for HF to store the ground-truth buffer values in the model's checkpoint on the HF hub, so that when the model is loaded it doesn't matter if the math's outcome was different from the values the model was trained on, as it'll be overwritten by the stored buffers in the checkpoint.

I think the issue you have discovered may impact many other users, so you at least want to bring it up as an Issue at HF Transformers, and possibly propose the long-term solution I suggested in the previous paragraph.
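For readers who want to try that cleaner workaround, here is a minimal sketch of the buffer copy (my illustration, not code from this thread; the checkpoint name and the buffer-matching heuristic are assumptions):

```python
import torch
from transformers import AutoConfig, AutoModelForCausalLM

# Build a tiny model on cpu whose head dim matches the real one; the RoPE-style
# buffers only depend on the head dim / max positions, not on the number of layers.
name = "meta-llama/Llama-2-7b-hf"            # assumption: your actual checkpoint
config = AutoConfig.from_pretrained(name)
config.num_hidden_layers = 1                 # keep it tiny
tiny_cpu_model = AutoModelForCausalLM.from_config(config)
cpu_buffers = dict(tiny_cpu_model.named_buffers())

def overwrite_buffers_from_cpu(gpu_model):
    """Copy the cpu-computed buffer values into the zero.Init'ed / gpu model."""
    for name_, buf in gpu_model.named_buffers():
        suffix = name_.split(".")[-1]        # e.g. "inv_freq", repeated per layer
        for cpu_name, cpu_buf in cpu_buffers.items():
            if cpu_name.endswith(suffix) and cpu_buf.shape == buf.shape:
                with torch.no_grad():
                    buf.copy_(cpu_buf.to(buf.device, buf.dtype))  # downcast only at the end
                break
```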
Here is the final repro w/o deepspeed involved:
The problem comes from:
https://github.com/microsoft/DeepSpeed/blob/e62a47e2e83647cc119c39cc7cf98fcfb5451cba/deepspeed/runtime/zero/partition_parameters.py#L374
commenting it out removes the mismatch in the repro above
I scanned and found ~66 code blocks across many models that use float dtypes in position-embedding-style enumerations that should be left as `torch.long`. I've submitted this and will likely create an issue for it; it's already posted in their Slack.

Note that GPT-NeoX and the original Llama do this correctly:
https://github.com/EleutherAI/gpt-neox/blame/63991555ec082c8f80c475f851d008193b10008c/megatron/model/positional_embeddings.py#L27
https://github.com/facebookresearch/llama/blob/ef351e9cd9496c579bf9f2bb036ef11bdc5ca3d2/llama/model.py#L100-L103

And Transformers does not. So I feel Transformers needs some fixes; this isn't the only instance of the pattern where a float dtype has been used in the arange for an integer enumeration…

So yeah, that's a huge difference, and I feel it is definitely the arange issue pointed out above: any use of the `zero.Init()` context manager with a low-precision dtype will trash the range > 256 for bfloat16 or > 2048 for float16. The dtype for arange should always be `torch.long` / `torch.int` and then cast to a float dtype as late as possible (see the small demonstration below).

EDIT: The div issue really should have little to no impact compared to the above.
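To make the arange point concrete, here is a tiny self-contained demonstration (mine, not from the thread):

```python
import torch

# bfloat16 has an 8-bit effective significand and float16 an 11-bit one, so integers
# above 256 (bf16) / 2048 (fp16) are no longer exactly representable.
pos_bf16 = torch.arange(0, 4096, dtype=torch.bfloat16)
pos_fp16 = torch.arange(0, 4096, dtype=torch.float16)
pos_long = torch.arange(0, 4096)                       # torch.int64, always exact

print(pos_bf16[250:260])    # exact up to 256, then the positions start rounding
print(pos_fp16[2040:2060])  # exact up to 2048, then only every other integer survives
print((pos_bf16.float() - pos_long).abs().max())       # large absolute error near 4096

# Why it matters for RoPE: a position error of a few units scrambles sin/cos completely,
# while rounding the final sin/cos values (bounded in [-1, 1]) to bf16 is harmless.
ref   = torch.cos(pos_long.double())
late  = torch.cos(pos_long.float()).to(torch.bfloat16).double()  # enumerate exactly, cast late
early = torch.cos(pos_bf16.float()).double()                     # enumeration already broken
print((late - ref).abs().max())    # small: just bf16 rounding of a value in [-1, 1]
print((early - ref).abs().max())   # large: wrong angles fed into cos
```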
@stas00 Thanks for your amazing work, that's a very impressive debugging process. To stay compatible with our previous experiments with ZeRO-2, we prefer the cpu way, which is to initialize the RoPE buffers on the CPU first and then copy them to the GPU.

For a quick proof, I did a nasty hack on `modeling_llama.py`. The hack forces the RoPE buffers to be initialized on the CPU and then copied to the GPU, roughly as sketched below. After this hack, the difference in my reproduction script is gone.
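The hack is roughly of this shape (a sketch from memory against the older `LlamaRotaryEmbedding`; the exact attribute/buffer names in current `transformers` may differ):

```python
import torch

def build_rope_cache_on_cpu(dim, max_position_embeddings, base=10000.0,
                            device="cuda", dtype=torch.bfloat16):
    """Compute the RoPE caches in fp32 on the cpu, moving/downcasting only at the end."""
    inv_freq = 1.0 / (base ** (torch.arange(0, dim, 2, dtype=torch.int64).float() / dim))
    t = torch.arange(max_position_embeddings, dtype=torch.int64).float()  # exact positions
    freqs = torch.outer(t, inv_freq)                  # fp32 math, on cpu
    emb = torch.cat((freqs, freqs), dim=-1)
    cos_cached = emb.cos().to(device=device, dtype=dtype)  # cast as late as possible
    sin_cached = emb.sin().to(device=device, dtype=dtype)
    return inv_freq.to(device), cos_cached, sin_cached
```

These values can then be registered over the buffers that `zero.Init` produced on the gpu (e.g. via `register_buffer`), which is essentially what forcing cpu init and copying to gpu amounts to.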
Moreover, I quickly ran our e2e PPO training script with ZeRO-3 and deepspeed.zero.Init. The first 10 steps show that the spike in policy_loss is gone! I will do some more experiments later.
@hijkzzz and @wuxibin89, so now that we understand well what's going on, how do we support your need?
How do you tell which of the 2 behaviours is the preferred (or correct?) one?
If you want the cpu way as a short-term hack, you could update the model's buffers with their values calculated separately on cpu. I think, since those buffers are invariant to the model size, you could even create a tiny model on cpu and then copy them over to the zero.Init'ed model.
I'm also thinking that in this situation perhaps Llama-2 should store its exact buffers, as they were when the model was trained, in the distributed model files, so that the user always gets the exact weights, since clearly this leads to ambiguity and possibly invalid inference outcomes. So you may want to consider opening a request at HF Transformers?
It's not a cuda bug, it's just how hardware works. There are discrepancies even between 2 NVIDIA gpus, e.g. V100 and A100 won't always produce the same math output.
You can find a ton of threads discussing this: https://www.google.ca/search?q=pytorch+math+discrepancies+cuda+cpu
This gotcha should just be documented with a pointer to this Issue for those who want to understand it better.
Now which mode is better for the inference user will depend on how the model was instantiated for training. If it was on cpu, then loading it on cpu will lead to the closest outcome. If on gpu, then loading on gpu will be the closest (i.e. with `zero.Init`).

@stas00, amazing debugging as always! I am stumped as to the solution here. Shall we file this as a cuda bug?
Thank you for making a super easy to set up and use repro script, @hijkzzz and @wuxibin89.
So first I tested that the issue has nothing to do with HfDeepSpeedConfig - I can repro the problem by removing it completely and just wrapping `from_pretrained` in `zero.Init`.

I also reduced it to just one sample and a sub-section of it, which already shows a largish diff, and it doesn't fail if I make the slice smaller, like 200-250 tokens. It seems to start diverging at ~300 tokens.

Finally, I validated that it has to do with half precision. If I switch to fp32 in the ds config, the problem goes away (tiny difference of 1e-5). Both bf16 and fp16 manifest this.

I thought that perhaps it might have had to do with the RNG, but resetting the seed after model loading didn't seem to make a difference.

Oh, and I had to hack transformers to deal with the removal of HfDeepSpeedConfig, otherwise it'd fail to load the weights.

So it has something to do with some precision issue that gets aggravated with more tokens.
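For readers following along, the wrapping under test is roughly of this shape (a sketch only, not the actual repro script: the checkpoint and ds config are placeholders, it needs a gpu and the `deepspeed` launcher, and, as noted above, loading weights under `zero.Init` without `HfDeepSpeedConfig` needs that small transformers hack):

```python
import deepspeed
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

name = "meta-llama/Llama-2-7b-hf"     # placeholder checkpoint
ds_config = {
    "zero_optimization": {"stage": 3},
    "bf16": {"enabled": True},
    "train_micro_batch_size_per_gpu": 1,
    "train_batch_size": 1,
}

tok = AutoTokenizer.from_pretrained(name)
batch = tok("a prompt of a few hundred tokens ...", return_tensors="pt").to("cuda")

# reference path: weights materialized on cpu, then moved to the gpu
ref = AutoModelForCausalLM.from_pretrained(name, torch_dtype=torch.bfloat16).cuda()

# suspect path: weights materialized directly on the gpu under zero.Init
with deepspeed.zero.Init(config_dict_or_path=ds_config):
    model = AutoModelForCausalLM.from_pretrained(name, torch_dtype=torch.bfloat16)
engine, *_ = deepspeed.initialize(model=model, config=ds_config)

with torch.no_grad():
    diff = (ref(**batch).logits - engine(**batch).logits).abs().max()
print(diff)   # largish under bf16/fp16 once the input passes ~300 tokens; tiny under fp32
```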
@stas00 some of the issues, like having the calc move to the GPU or be done in lower precision, might be helped by that, but it doesn't solve the need to recalc on the fly or at init for different seq lengths. Also, thinking about it more, the embed should really be applied to q/k in float32; having them in buffers means deepspeed will downcast them, and I'm not sure that can easily be avoided unless you prevent them from being buffers of any sort…

I don't know if FSDP does it, but they probably should, as loading a huge model with FSDP is a huge problem right now.
I think that if a user sets `torch.cuda.set_device("cuda")` it will have the same effect.

Also, if the user or a framework calls `torch.set_default_dtype()` it could also have this problem unless an explicit dtype is used during the op. And we know HF transformers does that here: https://github.com/huggingface/transformers/blob/f40b87de0ca234df61f76928956c4a2118c0b548/src/transformers/modeling_utils.py#L1430

So I think this problem again isn't DS-specific but a much wider one.

What do you think about my proposal to store the buffers with the model weights in the checkpoint? Then it's even easier because it'll always be correct.
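A small illustration of the `torch.set_default_dtype()` footgun (mine, not from the thread):

```python
import torch

# Under a low-precision default dtype, any float-flavoured arange silently inherits it.
torch.set_default_dtype(torch.bfloat16)
t = torch.arange(0.0, 4096.0)           # float arguments -> default dtype -> bfloat16
print(t.dtype)                          # torch.bfloat16
print(t[257].item(), t[300].item())     # integers above 256 are no longer exact

# An integral enumeration side-steps the default dtype entirely.
t_long = torch.arange(0, 4096)          # integer arguments -> torch.int64
print(t_long.dtype)                     # torch.int64

torch.set_default_dtype(torch.float32)  # restore
```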
`low_cpu_mem_usage=True` will cause OOM for 70B models.

OK, so the problem is this: the math on cpu and on cuda isn't the same!

The trigger for this problem is this:
https://github.com/microsoft/DeepSpeed/blob/e62a47e2e83647cc119c39cc7cf98fcfb5451cba/deepspeed/runtime/zero/partition_parameters.py#L241-L243
If I add a change there that forces the code in the repro to be calculated on the same device (`cpu`), the math is exact.

So we are moving to the next level here: the discrepancy comes from the devices, which don't do math in the same way, and the outcomes aren't guaranteed to be identical.

So if math(CUDA) != math(CPU), this is why you get different results.

A one-time small math outcome discrepancy makes little difference at first, but it gets compounded as it goes through dozens of layers doing hundreds of `matmul`s, spreading and increasing the discrepancy wider and wider.

So if my diagnosis makes sense, IMO there are no bugs in Deepspeed ZeRO wrt this outcome discrepancy issue.
But this nuance should be documented as it’s very important.
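Here is a tiny self-contained illustration of that device discrepancy (mine, not the repro from this thread; it needs a gpu):

```python
import torch

torch.manual_seed(0)
a = torch.randn(1024, 1024)
b = torch.randn(1024, 1024)

# The same bf16 matmul on cpu vs cuda: close, but not bit-identical, because the
# accumulation order and kernels differ between the two backends.
cpu_out  = (a.bfloat16() @ b.bfloat16()).float()
cuda_out = (a.cuda().bfloat16() @ b.cuda().bfloat16()).float().cpu()
print((cpu_out - cuda_out).abs().max())      # small but non-zero

# An inv_freq-style computation (pow + div, as in RoPE) can also differ by a few
# ulps between devices, which is the kind of `div` non-determinism reduced earlier.
e = torch.arange(0, 128, 2).float() / 128
print((1.0 / (10000.0 ** e) - (1.0 / (10000.0 ** e.cuda())).cpu()).abs().max())
```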
It’s the buffers! Llama-2 has 3 buffers per layer, e.g. for layer 28:
Here is the proof:
With bf16 there is a 63% difference:
With fp32 there is almost no difference, but it's not 100% exact!
@tjruwase, now that we know the source of the problem, passing this to you now.
It must be something about how buffers are treated differently than params under `zero.Init`.
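The check that exposes this is essentially a buffer-by-buffer comparison between a cpu-materialized model and a `zero.Init`'ed one; a generic sketch (mine):

```python
import torch

def compare_buffers(cpu_model, zero_init_model):
    """Print the max relative difference of every buffer between the two loads.

    Buffers are not partitioned by ZeRO-3, so they can be read directly,
    unlike parameters, which would need deepspeed.zero.GatheredParameters.
    """
    cpu_bufs = dict(cpu_model.named_buffers())
    for name, buf in zero_init_model.named_buffers():
        ref = cpu_bufs[name].float().to(buf.device)
        rel = (buf.float() - ref).abs().max() / ref.abs().max().clamp_min(1e-12)
        print(f"{name}: max rel diff {rel.item():.2%}")
```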
We can see a significant logits difference with `.max()`/`.min()`.

@tjruwase @stas00 I created a reproduction script here with our very first micro-batch of data: https://gist.github.com/wuxibin89/7d4801d62a743d5bbc72b507e8f0e326
OK, the whole area of having more than 1 model is new and not well tested. When I designed `HfDeepSpeedConfig`'s side that controls whether we enclose `zero.Init` around model loading or not in HF Transformers, it was always with just one model, as Deepspeed didn't support multiple models until around last summer. And back then it was clear that if you wanted to use stage3 you'd want `zero.Init` always on.

The HF Accelerate/Deepspeed integration that was done relatively recently already includes a flag to turn `zero.Init` on/off. It shouldn't be complicated to extend `HfDeepSpeedConfig` to optionally have a flag for whether `zero.Init` should be on or off, but it'd also require tweaks to HF Transformers' `modeling_utils.py`, as currently it just checks `is_deepspeed_zero3_enabled` but would need to be changed to a new helper, `is_deepspeed_zero_init_enabled`, which doesn't currently exist.

You might be able to hack this function to always return False, which would disable `zero.Init`: https://github.com/huggingface/transformers/blob/29a2b1420633d322140062d7c76b807f41fb90aa/src/transformers/integrations/deepspeed.py#L261 I don't remember if it's being used elsewhere, but you can just hack https://github.com/huggingface/transformers/blob/main/src/transformers/modeling_utils.py to replace it with `False` wherever it's called, just in that file. A hedged sketch of that hack as a runtime monkeypatch is right below.
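As a runtime variant of that file hack, something like the following might work (an untested sketch; it assumes `modeling_utils` imports `is_deepspeed_zero3_enabled` into its own namespace, which can differ across transformers versions):

```python
import transformers.modeling_utils as modeling_utils

# Make from_pretrained believe ZeRO-3 zero.Init is off, so it skips the
# zero.Init wrapping; apply this before calling from_pretrained.
modeling_utils.is_deepspeed_zero3_enabled = lambda: False
```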
HF Accelerate, for example, still doesn't support multiple Deepspeed models. I'm not sure about other frameworks, so when we needed more than one model, we used the workaround of wrapping 2 models into a super-model and using the super-model's forward to do this or that on each model. This works well if it fits your training mode.
So chances are that you're running into this problem where you have one DS engine and two models, and I don't think that can work, or at least it shouldn't work.
So you can do 2 things: either stick with the super-model wrapping workaround described above, or skip `HfDeepSpeedConfig`, manually add `zero.Init` around loading each model, and instantiate 2 deepspeed engines, one for each model, and drive each of them separately (a sketch of the latter is below).

Please let me know if it's helpful.
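A minimal sketch of the second option (illustrative checkpoint names and ds config, not from this thread; note the weight-loading caveat mentioned earlier about using `zero.Init` without `HfDeepSpeedConfig`):

```python
import deepspeed
from transformers import AutoModelForCausalLM

ds_config = {
    "zero_optimization": {"stage": 3},
    "bf16": {"enabled": True},
    "optimizer": {"type": "AdamW", "params": {"lr": 1e-6}},
    "train_micro_batch_size_per_gpu": 1,
    "gradient_accumulation_steps": 8,
    "train_batch_size": 8,   # micro_batch * grad_accum * world_size (here a single gpu)
}

def load_with_own_engine(name):
    # each model gets its own zero.Init context and its own DeepSpeed engine
    with deepspeed.zero.Init(config_dict_or_path=ds_config):
        model = AutoModelForCausalLM.from_pretrained(name)
    engine, optimizer, _, _ = deepspeed.initialize(
        model=model, model_parameters=model.parameters(), config=ds_config)
    return engine, optimizer

actor_engine,  _ = load_with_own_engine("my-org/actor-sft-checkpoint")   # hypothetical name
critic_engine, _ = load_with_own_engine("my-org/critic-rm-checkpoint")   # hypothetical name

# each engine is then driven separately in the PPO loop, e.g.
# actor_engine.backward(actor_loss);  actor_engine.step()
# critic_engine.backward(critic_loss); critic_engine.step()
```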