DeepSpeed: AsyncIO error with Stage 3 + NVMe offload
Hi,
When trying to use ZeRO Stage 3 with NVMe offloading, which is required for fitting large models into memory, I am seeing the following error:
/nvme/zero_stage_3/fp16params/rank24/0_param.tensor.swp: buffer nbytes != file bytes 28824000 != 28311552
python: /usr/local/lib/python3.6/dist-packages/deepspeed/ops/csrc/aio/py_lib/deepspeed_py_aio_handle.cpp:223: int deepspeed_aio_handle_t::pread(const at::Tensor&, const char*, bool, bool): \
Assertion `static_cast<long long int>(buffer.nbytes()) == num_file_bytes' failed.
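For context, the failing check is just comparing the size of the in-memory buffer against the size of the swap file on disk. In Python terms it is roughly the following (a sketch using the numbers from the error above, not DeepSpeed's actual code):

```python
import os
import torch

# Partition size taken from the error message: 14412000 fp16 elements = 28824000 bytes.
buffer = torch.empty(14412000, dtype=torch.float16)
swap_path = "/nvme/zero_stage_3/fp16params/rank24/0_param.tensor.swp"

buffer_nbytes = buffer.numel() * buffer.element_size()  # 28824000
file_nbytes = os.path.getsize(swap_path)                 # 28311552 according to ls -l

# This mirrors the C++ assertion in deepspeed_py_aio_handle.cpp.
assert buffer_nbytes == file_nbytes, \
    f"buffer nbytes != file bytes {buffer_nbytes} != {file_nbytes}"
```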
I have inserted some debug print() statements at both the write and read Python call sites, here: https://github.com/microsoft/DeepSpeed/blob/master/deepspeed/runtime/swap_tensor/utils.py#L19-L26 (a sketch of that instrumentation is shown below). I observe that when writing, the tensor really is trying to write 28824000 bytes (because it has 14412000 elements). However, when I do an ls -l /nvme/zero_stage_3/fp16params/rank24/0_param.tensor.swp, I see that the file only has 28311552 bytes, as mentioned in the error message. So it seems that somehow the async write command is failing to properly write the full contents of the tensor.
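Roughly, the instrumentation looks like this (a sketch from memory, assuming the swap_in_tensors / swap_out_tensors helpers in utils.py that loop over buffer/path pairs; the exact code at that commit may differ):

```python
import os

def swap_out_tensors(swap_handle, tensors, swap_paths):
    for buffer, path in zip(tensors, swap_paths):
        # debug: how many bytes are we asking aio to write?
        print(f"[swap_out] {path}: buffer bytes = {buffer.numel() * buffer.element_size()}")
        swap_handle.async_pwrite(buffer, path)

def swap_in_tensors(swap_handle, tensor_buffers, swap_paths):
    for buffer, path in zip(tensor_buffers, swap_paths):
        # debug: compare the read buffer size against the file size on disk
        file_bytes = os.path.getsize(path) if os.path.exists(path) else -1
        print(f"[swap_in] {path}: buffer bytes = {buffer.numel() * buffer.element_size()}, "
              f"file bytes = {file_bytes}")
        swap_handle.async_pread(buffer, path)
```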
Any idea why this would happen? Or suggestions for how to debug further?
I have tried looking at kernel logs via dmesg, but nothing turns up. I have also tried running the program with strace -e trace=io_submit,io_getevents,io_setup, but I only see the io_setup syscall and not the io_submit or io_getevents syscalls. I do have libaio-dev installed.
My deepspeed config looks like this:
zero_optimization:
  stage: 3
  stage3_prefetch_bucket_size: 1e9
  stage3_param_persistence_threshold: 1e6
  stage3_max_live_parameters: 1e9
  overlap_comm: true
  contiguous_gradients: true
  offload_param:
    device: nvme
    nvme_path: /nvme
    pin_memory: false
    max_in_cpu: 1e9
    buffer_size: 1e9
    buffer_count: 5
  offload_optimizer:
    device: nvme
    nvme_path: /nvme
    pin_memory: false
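For completeness, this is roughly how the config gets passed in on my side (a minimal sketch, assuming the dict form of deepspeed.initialize's config argument; the placeholder model, batch size, and fp16 flag are just to make the snippet self-contained and are not taken from the original report):

```python
import deepspeed
import torch

# Tiny placeholder model so the sketch is self-contained; the real model is much larger.
model = torch.nn.Linear(1024, 1024)

ds_config = {
    "train_micro_batch_size_per_gpu": 1,  # placeholder value
    "fp16": {"enabled": True},            # the swap files in the error are fp16 params
    "zero_optimization": {
        "stage": 3,
        "stage3_prefetch_bucket_size": 1e9,
        "stage3_param_persistence_threshold": 1e6,
        "stage3_max_live_parameters": 1e9,
        "overlap_comm": True,
        "contiguous_gradients": True,
        "offload_param": {
            "device": "nvme",
            "nvme_path": "/nvme",
            "pin_memory": False,
            "max_in_cpu": 1e9,
            "buffer_size": 1e9,
            "buffer_count": 5,
        },
        "offload_optimizer": {
            "device": "nvme",
            "nvme_path": "/nvme",
            "pin_memory": False,
        },
    },
}

# `config` accepts a dict or a path to a JSON file; older releases use the
# `config_params` keyword instead.
model_engine, optimizer, _, _ = deepspeed.initialize(
    model=model,
    model_parameters=model.parameters(),
    config=ds_config,
)
```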
Thanks, Stephen
About this issue
- State: closed
- Created 3 years ago
- Comments: 20 (9 by maintainers)
@tjruwase yes, I have tested #1086 via commit 38d46848f450a080c8ab96427d9da00c2e5b6327, and it works for me now. Thanks for the quick fix!
Actually, everything works fine if I use a single GPU. I think the underlying problem is this line: https://github.com/microsoft/DeepSpeed/blob/master/deepspeed/runtime/zero/partition_parameters.py#L681-L682

i.e.: while the tensor size is aligned properly, after dividing by the world size the partition_size is not aligned (a small numeric illustration is below). And the get_buffer() call here uses the compute_buffer, not the aligned swap_buffer: https://github.com/microsoft/DeepSpeed/blob/master/deepspeed/runtime/zero/partition_parameters.py#L689
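To make that concrete, here is a small sketch of the arithmetic using the partition size from the original error (the 512-byte granularity is the aio/O_DIRECT alignment this thread assumes):

```python
ALIGNMENT_BYTES = 512        # aio/O_DIRECT sector alignment assumed throughout this thread
FP16_BYTES = 2

partition_numel = 14412000                      # per-rank partition from the original error
partition_bytes = partition_numel * FP16_BYTES  # 28824000, the "buffer nbytes" in the assert

# The full (padded) tensor is aligned, but the per-rank slice is not:
print(partition_bytes % ALIGNMENT_BYTES)        # 448 -> not a multiple of 512
```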
@tjruwase I have identified that the problem is here: https://github.com/microsoft/DeepSpeed/blob/master/deepspeed/runtime/zero/stage3.py#L1027-L1029

Basically, all the aio read calls made directly in partitioned_param_swapper.py are properly aligned after the PR you pointed me to, because the swap-in buffers are explicitly aligned. However, in the code I pointed to above, there is still a code path that directly calls the swap_in() function with an explicit (and unaligned) swap_in_buffers parameter.

Ah, I see the PR you mentioned was just merged yesterday. I tried again, using DeepSpeed commit d88d92799553cb75af536fd4af766ec56e4018cd, but I still see the same problem.
I do not. I simply tried to use it to create a minimal reproducing script that demonstrates the bug I am seeing, because running the full training code takes a long time before it reports an error.
EDIT: Actually, upon closer reflection, the error is different now. The aio write is now successfully writing out the aligned number of bytes: it tries to write 57648128 bytes and does so successfully. However, the swap-in buffer it is trying to read the parameter into is not aligned, i.e. 57648000 is the true number of elements for the tensor, but it is not divisible by 512, so it does not equal the aligned write size, and the assertion above triggers.
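For reference, the round-up arithmetic that produces this mismatch looks roughly like the following (a sketch; the align_up helper is hypothetical, not DeepSpeed's, and 512 is the aio alignment assumed above):

```python
ALIGNMENT = 512

def align_up(n, alignment=ALIGNMENT):
    """Round n up to the next multiple of alignment (hypothetical helper)."""
    return ((n + alignment - 1) // alignment) * alignment

true_size = 57648000             # unaligned size reported in the EDIT above
written = align_up(true_size)    # 57648128, the aligned amount the aio write produced

print(written)                   # 57648128
print(true_size == written)      # False -> trips the "buffer nbytes != file bytes" assertion
```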