trl: KTO hangs when using DeepSpeed
Hey there,
I’ve run into an issue where KTO training hangs when running with DeepSpeed. It runs for a couple of steps, and at first I just get loss and reward values of 0 or NaN:
{'loss': 0.5, 'learning_rate': 9.99e-05, 'rewards/chosen': 0.0, 'rewards/rejected': 0.0, 'rewards/accuracies': 0.0, 'rewards/margins': 0.0, 'logps/rejected': -202.1331024169922, 'logps/chosen': -336.5407409667969, 'logits/rejected': -116.42853546142578, 'logits/chosen': -97.47003173828125, 'kl': 0.0, 'epoch': 0.0}
{'loss': 0.6534, 'learning_rate': 9.98e-05, 'rewards/chosen': -0.4333324432373047, 'rewards/rejected': nan, 'rewards/accuracies': 0.0, 'rewards/margins': nan, 'logps/rejected': nan, 'logps/chosen': -61.395660400390625, 'logits/rejected': nan, 'logits/chosen': -104.99060821533203, 'kl': 0.0, 'epoch': 0.01}
{'loss': 0.0, 'learning_rate': 9.970000000000001e-05, 'rewards/chosen': -0.6865890622138977, 'rewards/rejected': -0.010017603635787964, 'rewards/accuracies': 0.0, 'rewards/margins': -0.6765714883804321, 'logps/rejected': -70.71566772460938, 'logps/chosen': -198.41256713867188, 'logits/rejected': -107.62744140625, 'logits/chosen': -102.08055114746094, 'kl': nan, 'epoch': 0.01}
{'loss': 0.0, 'learning_rate': 9.960000000000001e-05, 'rewards/chosen': nan, 'rewards/rejected': nan, 'rewards/accuracies': 0.0, 'rewards/margins': nan, 'logps/rejected': nan, 'logps/chosen': nan, 'logits/rejected': nan, 'logits/chosen': nan, 'kl': nan, 'epoch': 0.02}
{'loss': 0.0, 'learning_rate': 9.95e-05, 'rewards/chosen': nan, 'rewards/rejected': nan, 'rewards/accuracies': 0.0, 'rewards/margins': nan, 'logps/rejected': nan, 'logps/chosen': nan, 'logits/rejected': nan, 'logits/chosen': nan, 'kl': nan, 'epoch': 0.02}
and then it hangs. When I Ctrl+C to kill it, it just shows the process waiting on the PID:
Traceback (most recent call last):
  File "/home/fsuser/.local/bin/deepspeed", line 6, in <module>
    main()
  File "/home/fsuser/.local/lib/python3.9/site-packages/deepspeed/launcher/runner.py", line 584, in main
    result.wait()
  File "/usr/lib/python3.9/subprocess.py", line 1189, in wait
    return self._wait(timeout=timeout)
  File "/usr/lib/python3.9/subprocess.py", line 1933, in _wait
    (pid, sts) = self._try_wait(0)
  File "/usr/lib/python3.9/subprocess.py", line 1891, in _try_wait
    (pid, sts) = os.waitpid(self.pid, wait_flags)
KeyboardInterrupt
When I do not use DeepSpeed, everything looks fine.
To Reproduce
- Run the examples/kto.py script
- Use these args:
export CUDA_VISIBLE_DEVICES=0,1,2,3
deepspeed --master_port 6000 kto.py \
    --model_name_or_path "gpt2" \
    --per_device_train_batch_size 2 \
    --gradient_accumulation_steps 1 \
    --learning_rate 1e-4 \
    --max_steps 1000 \
    --report_to "wandb" \
    --gradient_checkpointing True \
    --output_dir "./test" \
    --evaluation_strategy "steps" \
    --eval_steps 10 \
    --logging_first_step True \
    --logging_steps 1 \
    --beta 0.1 \
    --deepspeed ds_config.json
ds_config.json is:
{
    "fp16": {
        "enabled": "auto"
    },
    "bf16": {
        "enabled": "auto"
    },
    "zero_optimization": {
        "stage": 2,
        "allgather_partitions": true,
        "allgather_bucket_size": 5e8,
        "overlap_comm": true,
        "reduce_scatter": true,
        "reduce_bucket_size": 5e8,
        "contiguous_gradients": true,
        "round_robin_gradients": true,
        "stage3_gather_16bit_weights_on_model_save": true
    },
    "gradient_accumulation_steps": "auto",
    "gradient_clipping": "auto",
    "train_batch_size": "auto",
    "train_micro_batch_size_per_gpu": "auto",
    "steps_per_print": 500,
    "wall_clock_breakdown": false
}
However, it doesn’t seem to matter which DeepSpeed stage I use or whether I use CPU offloading; I run into the same issue either way.
Versions
trl==0.7.12.dev0
transformers==4.36.2
accelerate==0.27.2
deepspeed==0.13.2
torch==2.0.0+cu118
Tagging @kashif for now since he authored the KTOTrainer. Thanks in advance!
About this issue
- State: open
- Created 4 months ago
- Comments: 24 (6 by maintainers)
Even if you hotfix this with a larger batch size, the training will be worse because of the higher variance in the loss estimates (e.g., the KL term could be estimated from as few as one (x, y’) pair, which could be way off).
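A toy illustration of that variance point (made-up numbers, not the KTO estimator): an average over a single pair fluctuates far more than an average over a full batch.

import torch

torch.manual_seed(0)
per_pair_kl = torch.rand(6400) * 2.0                 # stand-in for per-pair KL estimates, true mean 1.0

single_pair = per_pair_kl.view(-1, 1).mean(dim=1)    # estimates based on 1 pair each
full_batch = per_pair_kl.view(-1, 64).mean(dim=1)    # estimates based on 64 pairs each

print(single_pair.std().item())   # large spread around the true mean
print(full_batch.std().item())    # roughly 1/sqrt(64) of the spread above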
@dblakely I’m working on a PR to fix this. Will comment here once it’s up and merged.
Looked into this: run accelerate with --debug or ACCELERATE_DEBUG_MODE=1; it’ll give a clearer error about what is going on.
Hmm, so this appears to happen whenever you try to accelerate.gather on any tensor that is not a 1x1 tensor, which is weird because torch.gather doesn’t have this requirement. Whenever you have a tensor that is not 1x1, accelerate may or may not hang. I’ve patched this problem by just taking the nanmean before the gather.
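As a rough sketch of that workaround (illustrative only, not the actual trl code; the helper name gather_scalar_metric is made up): reduce each per-example metric to a scalar with nanmean before gathering, so accelerate.gather only ever sees a one-element tensor per process.

import torch
from accelerate import Accelerator

accelerator = Accelerator()

def gather_scalar_metric(values: torch.Tensor) -> float:
    # values: a per-example metric on this process (may contain NaNs)
    local = values.nanmean().reshape(1)    # scalar per process, NaN-safe
    gathered = accelerator.gather(local)   # shape: (num_processes,)
    return gathered.nanmean().item()       # average across processes, still NaN-safe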
@hbin0701 Can you see if the code on my fork works for you? https://github.com/kawine/trl
The only functions I changed were kto_loss and get_batch_loss_metrics, in case you just want to copy-paste.
@dblakely That happens because each microbatch contains a random mix of positive/negative examples, so if by chance the microbatch contains all negative or all positive examples, the rejected or chosen rewards (respectively) can be NaN.
It happens quite frequently in the current implementation because the reported stats come only from the main process (i.e., from one microbatch, not the whole batch). Gathering these metrics across all microbatches should reduce the odds of a NaN, but it also causes the hanging issue in DeepSpeed.
So the NaNs should not affect the quality of your training, but I’ll nonetheless see if I can get rid of them.
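To illustrate the first point with a toy example (not the trl implementation): if a microbatch happens to contain only chosen examples, the rejected-side metric is the mean of an empty slice, which is NaN.

import torch

rewards = torch.tensor([0.2, -0.1, 0.3])
is_chosen = torch.tensor([True, True, True])   # this microbatch happened to contain only chosen examples

chosen_reward = rewards[is_chosen].mean()      # tensor(0.1333)
rejected_reward = rewards[~is_chosen].mean()   # mean over an empty tensor -> tensor(nan)
print(chosen_reward, rejected_reward)          # the nan is what shows up as rewards/rejected in the logs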