trl: KTO hangs when using DeepSpeed

Hey there,

I’ve run into an issue where KTO training hangs when I run it with DeepSpeed - it runs for a couple of steps and then starts logging loss and reward values that are 0 or NaN:

{'loss': 0.5, 'learning_rate': 9.99e-05, 'rewards/chosen': 0.0, 'rewards/rejected': 0.0, 'rewards/accuracies': 0.0, 'rewards/margins': 0.0, 'logps/rejected': -202.1331024169922, 'logps/chosen': -336.5407409667969, 'logits/rejected': -116.42853546142578, 'logits/chosen': -97.47003173828125, 'kl': 0.0, 'epoch': 0.0}                             
{'loss': 0.6534, 'learning_rate': 9.98e-05, 'rewards/chosen': -0.4333324432373047, 'rewards/rejected': nan, 'rewards/accuracies': 0.0, 'rewards/margins': nan, 'logps/rejected': nan, 'logps/chosen': -61.395660400390625, 'logits/rejected': nan, 'logits/chosen': -104.99060821533203, 'kl': 0.0, 'epoch': 0.01}                                                
{'loss': 0.0, 'learning_rate': 9.970000000000001e-05, 'rewards/chosen': -0.6865890622138977, 'rewards/rejected': -0.010017603635787964, 'rewards/accuracies': 0.0, 'rewards/margins': -0.6765714883804321, 'logps/rejected': -70.71566772460938, 'logps/chosen': -198.41256713867188, 'logits/rejected': -107.62744140625, 'logits/chosen': -102.08055114746094, 'kl': nan, 'epoch': 0.01}
{'loss': 0.0, 'learning_rate': 9.960000000000001e-05, 'rewards/chosen': nan, 'rewards/rejected': nan, 'rewards/accuracies': 0.0, 'rewards/margins': nan, 'logps/rejected': nan, 'logps/chosen': nan, 'logits/rejected': nan, 'logits/chosen': nan, 'kl': nan, 'epoch': 0.02}
{'loss': 0.0, 'learning_rate': 9.95e-05, 'rewards/chosen': nan, 'rewards/rejected': nan, 'rewards/accuracies': 0.0, 'rewards/margins': nan, 'logps/rejected': nan, 'logps/chosen': nan, 'logits/rejected': nan, 'logits/chosen': nan, 'kl': nan, 'epoch': 0.02}

and then it hangs. When I Ctrl+C to kill it, it just shows the launcher process waiting on the child PID:

Traceback (most recent call last):
  File "/home/fsuser/.local/bin/deepspeed", line 6, in <module>
    main()
  File "/home/fsuser/.local/lib/python3.9/site-packages/deepspeed/launcher/runner.py", line 584, in main
    result.wait()
  File "/usr/lib/python3.9/subprocess.py", line 1189, in wait
    return self._wait(timeout=timeout)
  File "/usr/lib/python3.9/subprocess.py", line 1933, in _wait
    (pid, sts) = self._try_wait(0)
  File "/usr/lib/python3.9/subprocess.py", line 1891, in _try_wait
    (pid, sts) = os.waitpid(self.pid, wait_flags)
KeyboardInterrupt

When I do not use DeepSpeed, everything looks fine.

To Reproduce

export CUDA_VISIBLE_DEVICES=0,1,2,3

deepspeed --master_port 6000 kto.py \
    --model_name_or_path "gpt2" \
    --per_device_train_batch_size 2 \
    --gradient_accumulation_steps 1 \
    --learning_rate 1e-4 \
    --max_steps 1000 \
    --report_to "wandb" \
    --gradient_checkpointing True  \
    --output_dir "./test" \
    --evaluation_strategy "steps" \
    --eval_steps 10 \
    --logging_first_step True \
    --logging_steps 1 \
    --beta 0.1 \
    --deepspeed ds_config.json

ds_config.json is:
{
    "fp16": {
        "enabled": "auto"
    },
    "bf16": {
        "enabled": "auto"
    },
    "zero_optimization": {
        "stage": 2,
        "allgather_partitions": true,
        "allgather_bucket_size": 5e8,
        "overlap_comm": true,
        "reduce_scatter": true,
        "reduce_bucket_size": 5e8,
        "contiguous_gradients": true,
        "round_robin_gradients": true,
        "stage3_gather_16bit_weights_on_model_save": true
    },
    "gradient_accumulation_steps": "auto",
    "gradient_clipping": "auto",
    "train_batch_size": "auto",
    "train_micro_batch_size_per_gpu": "auto",
    "steps_per_print": 500,
    "wall_clock_breakdown": false
}

However, it doesn’t seem to matter which DeepSpeed ZeRO stage I use or whether I use CPU offloading; I run into the same issue.

Versions

trl==0.7.12.dev0
transformers==4.36.2
accelerate==0.27.2
deepspeed==0.13.2
torch==2.0.0+cu118

Tagging @kashif for now since he authored the KTOTrainer. Thanks in advance!

Most upvoted comments

Even if you hotfix this with a larger batch size, the training will be worse because of the higher variance in the loss estimates (e.g., the KL term could be estimated with as little as one (x, y') pair, which could be way off).
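To make the variance point concrete, here is a toy sketch (synthetic numbers, nothing from the KTO implementation): a per-step mean estimated from a single pair is far noisier than one estimated from several pairs.

import torch

torch.manual_seed(0)
# pretend per-pair KL estimates with true mean ~1.0 plus noise (purely synthetic)
per_pair_kl = torch.randn(8_000) * 2.0 + 1.0
one_pair_steps = per_pair_kl.reshape(-1, 1).mean(dim=1)    # 1 pair per step
eight_pair_steps = per_pair_kl.reshape(-1, 8).mean(dim=1)  # 8 pairs per step
print(one_pair_steps.std(), eight_pair_steps.std())        # ~2.0 vs ~0.7: the 1-pair estimate is far noisier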

@dblakely I’m working on a PR to fix this; will comment here once it’s up and merged.

Bumped into this as well: run accelerate with --debug or ACCELERATE_DEBUG_MODE=1 and it’ll give a clearer error about what is going on.

Hmm, so this appears to happen whenever you try to accelerate.gather any tensor that is not a 1x1 tensor, which is weird because torch.gather doesn’t have this requirement. Whenever the tensor is not 1x1, accelerate may or may not hang. I’ve patched the problem by just taking the nanmean before the gather.
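For reference, a minimal sketch of the nanmean-before-gather workaround described above (names are illustrative, not the actual KTOTrainer code): reduce the per-process metric to a single scalar first, so every rank hands accelerate a same-shaped 1-element tensor to gather.

import torch
from accelerate import Accelerator

accelerator = Accelerator()

def gather_scalar_metric(values: torch.Tensor) -> torch.Tensor:
    # collapse the (possibly NaN-containing, possibly differently sized)
    # per-microbatch tensor to one scalar on this rank...
    scalar = values.nanmean().reshape(1)
    # ...then gather the 1-element tensors across ranks and average,
    # ignoring ranks whose microbatch produced a NaN
    return accelerator.gather(scalar).nanmean()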

@hbin0701 can you see if the code on my fork works for you? https://github.com/kawine/trl

The only functions I changed were kto_loss and get_batch_loss_metrics, in case you just want to copy-paste.

@dblakely that happens because each microbatch contains a random mix of positive/negative examples, so if by chance the microbatch contains all negative/all positive, the rejected/chosen rewards can be NaN respectively
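A quick way to see that failure mode in isolation (a standalone toy, not the trainer code): slicing the rewards with a mask that selects nothing gives an empty tensor, and the mean of an empty tensor is NaN.

import torch

rewards = torch.tensor([0.25, -0.10, 0.40])
is_chosen = torch.tensor([True, True, True])   # microbatch that happens to be all "chosen"
rejected_rewards = rewards[~is_chosen]         # empty tensor
print(rejected_rewards.mean())                 # tensor(nan) -> shows up as rewards/rejected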

It happens quite frequently in the current implementation because the reported stats come only from the main process (i.e., from one microbatch, not the whole batch). Gathering these metrics across all microbatches should reduce the odds of a NaN, but it also causes the hanging issue in DeepSpeed.

So the NaNs should not affect the quality of your training, but I’ll nonetheless see if I can get rid of them.