DeepSpeed: AsyncIO error with Stage 3 + NVMe offload

Hi,

When trying to use ZeRO Stage 3 with NVMe offloading, which I need in order to fit large models into memory, I am seeing the following error:

/nvme/zero_stage_3/fp16params/rank24/0_param.tensor.swp: buffer nbytes != file bytes 28824000 != 28311552
python: /usr/local/lib/python3.6/dist-packages/deepspeed/ops/csrc/aio/py_lib/deepspeed_py_aio_handle.cpp:223: int deepspeed_aio_handle_t::pread(const at::Tensor&, const char*, bool, bool): \
Assertion `static_cast<long long int>(buffer.nbytes()) == num_file_bytes' failed.

I have inserted some debug print() statements at both the write and read Python call sites, here: https://github.com/microsoft/DeepSpeed/blob/master/deepspeed/runtime/swap_tensor/utils.py#L19-L26

I observe that on the write side the tensor really does try to write 28824000 bytes (it has 14412000 fp16 elements). However, when I run ls -l /nvme/zero_stage_3/fp16params/rank24/0_param.tensor.swp, the file only has 28311552 bytes, matching the error message. So it seems the async write is somehow failing to write the full contents of the tensor.
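For reference, this is roughly the kind of check I added (a minimal sketch; the function names and call sites are illustrative, not the actual DeepSpeed internals):

import os

def log_swap_write(tensor, swap_path):
    # bytes the aio write is being asked to persist
    print(f"[swap write] {swap_path}: tensor bytes = {tensor.numel() * tensor.element_size()}")

def log_swap_read(buffer, swap_path):
    # compare the read buffer against what actually landed on disk
    buffer_bytes = buffer.numel() * buffer.element_size()
    file_bytes = os.path.getsize(swap_path)
    print(f"[swap read] {swap_path}: buffer bytes = {buffer_bytes}, file bytes = {file_bytes}")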

Any idea why this would happen? Or suggestions for how to debug further?

I have tried looking at the kernel logs via dmesg, but nothing turns up. I have also tried running the program with strace -e trace=io_submit,io_getevents,io_setup, but I only see the io_setup syscall, not io_submit or io_getevents.
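(One thing I have not ruled out: if the aio handle submits I/O from worker threads spawned by the C++ extension, plain strace only traces the main Python thread. Re-running with thread following enabled might reveal the missing syscalls; the script name here is just a placeholder:)

strace -f -e trace=io_setup,io_submit,io_getevents -o aio.strace python train_script.py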

I do have libaio-dev installed.

My deepspeed config looks like this:

zero_optimization:
  stage: 3
  stage3_prefetch_bucket_size: 1e9
  stage3_param_persistence_threshold: 1e6
  stage3_max_live_parameters: 1e9
  overlap_comm: true
  contiguous_gradients: true
  offload_param:
    device: nvme
    nvme_path: /nvme
    pin_memory: false
    max_in_cpu: 1e9
    buffer_size: 1e9
    buffer_count: 5
  offload_optimizer:
    device: nvme
    nvme_path: /nvme
    pin_memory: false

Thanks, Stephen


Most upvoted comments

@tjruwase yes, I have tested #1086 via commit 38d46848f450a080c8ab96427d9da00c2e5b6327, and it works for me now. Thanks for the quick fix!

Actually, everything works fine if I use a single GPU. I think the underlying problem is these lines: https://github.com/microsoft/DeepSpeed/blob/master/deepspeed/runtime/zero/partition_parameters.py#L681-L682

i.e.:

tensor_size = self._aligned_size(param)
partition_size = tensor_size // self.world_size

While tensor_size is aligned properly, partition_size is no longer aligned after the division by the world size.

And the get_buffer() call here uses the compute_buffer, not the aligned swap_buffer: https://github.com/microsoft/DeepSpeed/blob/master/deepspeed/runtime/zero/partition_parameters.py#L689
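To make the misalignment concrete, here is the arithmetic with illustrative numbers (the world size and the 512-byte sector alignment are assumptions on my part; only the 14412000-element / 28824000-byte partition comes from my actual error):

world_size = 32                        # illustrative, not my actual job size
aio_alignment_bytes = 512              # typical O_DIRECT / aio sector alignment (assumption)

numel = 14412000 * world_size          # parameter padded so it splits evenly across ranks
partition_numel = numel // world_size  # 14412000 elements per rank
partition_bytes = partition_numel * 2  # fp16 -> 28824000 bytes, as in my error message

print(partition_bytes % aio_alignment_bytes)  # 448 -> the per-rank partition is not sector-aligned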

@tjruwase I have identified that the problem is here: https://github.com/microsoft/DeepSpeed/blob/master/deepspeed/runtime/zero/stage3.py#L1027-L1029

Basically, all of the aio read calls made directly in partitioned_param_swapper.py are properly aligned after the PR you pointed me to, because those swap-in buffers are explicitly aligned.

However, in the code I pointed to above, there is still a code path that calls swap_in() directly with an explicit (and unaligned) swap_in_buffers argument.
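Presumably that call site needs the same rounding the swapper applies elsewhere. A minimal sketch of the kind of rounding I mean (the helper and the 256-element granularity are mine, purely illustrative, not DeepSpeed code):

def round_up_to_alignment(numel, alignment_numel):
    # round a buffer length up to the next multiple of the swap alignment
    remainder = numel % alignment_numel
    return numel if remainder == 0 else numel + alignment_numel - remainder

# e.g. 14412000 fp16 elements rounded up to a 256-element (512-byte) boundary
print(round_up_to_alignment(14412000, 256))  # 14412032 elements -> 28824064 bytes, sector-aligned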

Ah, I see the PR you mentioned was just merged yesterday. I tried again, using DeepSpeed commit d88d92799553cb75af536fd4af766ec56e4018cd.

But I still see the same problem:

/ssd1/zero_stage_3/fp16params/rank26/0_param.tensor.swp: buffer nbytes != file bytes 57648000 != 57648128
python: /usr/local/lib/python3.6/dist-packages/deepspeed/ops/csrc/aio/py_lib/deepspeed_py_aio_handle.cpp:223: int deepspeed_aio_handle_t::pread(const at::Tensor&, const char*, bool, bool): Assertion `static_cast<long long int>(buffer.nbytes()) == num_file_bytes' failed.

Do you currently have a compelling use of the aio library outside of ZeRO-Infinity?

I do not. I simply tried to use it to create a minimal reproduction script that demonstrates the bug I am seeing, because running the full training code takes a long time before it reports the error.

EDIT:

Actually, upon closer inspection, the error is different now. The aio write is now successfully writing out the aligned number of bytes: it tries to write 57648128 bytes and does so. However, the swap-in buffer it is trying to read the parameter back into is not aligned: 57648000 is the true (unpadded) size of the tensor in bytes, which is not divisible by 512, so it does not match the aligned write size, and the assertion above triggers.
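For concreteness, the arithmetic behind the new mismatch, assuming the usual 512-byte sector alignment:

def round_up(nbytes, alignment=512):
    # round up to the next multiple of the aio sector size
    return ((nbytes + alignment - 1) // alignment) * alignment

print(57648000 % 512)      # 384 -> the unpadded tensor size is not sector-aligned
print(round_up(57648000))  # 57648128 -> matches the file size in the new error message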