rmm: [BUG] Fragmentation with cuda_async_memory_resource
**Describe the bug**
rmm::mr::cuda_async_memory_resource is showing signs of pretty severe fragmentation after allocating / deallocating ~300GB on an 80GB GPU. I was able to perform a series of allocations, all on a single thread, using the synchronous cuda_memory_resource without running out of memory. With cuda_async_memory_resource it runs out of memory at a point where ~30GB of memory is still available but a ~110MB allocation cannot be satisfied.
Querying the pool at the point of failure:

- `cudaMemPoolGetAttribute(memPool, cudaMemPoolAttrUsedMemCurrent, &size_used)` gives a usage of 53514136688 bytes.
- `cudaMemPoolGetAttribute(memPool, cudaMemPoolAttrReservedMemCurrent, &size_reserved)` shows a reservation of 84490059776 bytes.
- A `tracking_resource_adaptor`'s `get_allocated_bytes()` also shows 53514136688 bytes.
- The allocation that fails is 132870728 bytes.
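For completeness, this is roughly how those numbers are gathered (a minimal sketch; `memPool` is assumed to be the `cudaMemPool_t` handle of the pool backing the async resource, e.g. obtained with `cudaDeviceGetDefaultMemPool` when using the device's default pool):

```cpp
#include <cuda_runtime_api.h>

#include <cstddef>
#include <cstdio>

// Print used vs. reserved bytes for a CUDA memory pool. A large gap between the
// two while a comparatively small allocation still fails is what suggests
// fragmentation here. Both attributes report 64-bit byte counts.
void report_pool_usage(cudaMemPool_t memPool)
{
  std::size_t size_used{};
  std::size_t size_reserved{};

  cudaMemPoolGetAttribute(memPool, cudaMemPoolAttrUsedMemCurrent, &size_used);
  cudaMemPoolGetAttribute(memPool, cudaMemPoolAttrReservedMemCurrent, &size_reserved);

  std::printf("used: %zu, reserved: %zu, reserved but unused: %zu\n",
              size_used, size_reserved, size_reserved - size_used);
}
```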
**Steps/Code to reproduce bug**
rmm_test.zip — there is now a simpler test found in rmm_simple_test.cpp; it just performs allocations and deallocations.
I have attached a test which is similar to rmm_replay, except that when it runs out of memory it first synchronizes and tries again. If it still fails, it queues the allocation, along with the frees for the corresponding allocations, until we free up 8x the size of the last allocation failure. At that point it resumes by pushing the queued allocations to the front of the allocation list and picking back up. If the amount of queued allocations grows above a certain threshold, currently set to 20GB via kMaxAllocationBytesQueue, it just fails and stops trying. The test consists of a single cpp file, a CMake file, and the rmm_log file which I am replaying to generate the error.
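To give an idea of the retry part without the full attachment, here is a heavily condensed sketch (names such as `try_allocate` are just for illustration; the real test also tracks the queued frees and the 20GB kMaxAllocationBytesQueue give-up threshold):

```cpp
#include <rmm/mr/device/device_memory_resource.hpp>

#include <cuda_runtime_api.h>

#include <cstddef>
#include <new>  // std::bad_alloc (RMM's allocation failure exception derives from it)

// First attempt the allocation; if it throws, synchronize the device so that any
// frees still in flight are visible to the pool, then retry once. If the retry
// also fails, the caller queues the request (and the frees belonging to it) until
// roughly 8x the failed size has been released, or gives up once the queued total
// exceeds kMaxAllocationBytesQueue.
void* try_allocate(rmm::mr::device_memory_resource* mr, std::size_t bytes)
{
  try {
    return mr->allocate(bytes);
  } catch (std::bad_alloc const&) {
    cudaDeviceSynchronize();
    return mr->allocate(bytes);  // may throw again; the caller handles queueing
  }
}
```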
The binary is launched using:

```
./rmm_test {path to log file} {cuda_async | bin_small_async_large}
```
When it fails to allocate it will output the measurements I showed above in the bug description section.
**Expected behavior**
I would expect the allocator to be able to handle fragmentation, particularly when so much of the pool is still free. I also expect that it should be able to handle fragmentation under the hood. From this blog post we can read:

> If a memory allocation request made using cudaMallocAsync can’t be serviced due to fragmentation of the corresponding memory pool, the CUDA driver defragments the pool by remapping unused memory in the pool to a contiguous portion of the GPU’s virtual address space. Remapping existing pool memory instead of allocating new memory from the OS also helps keep the application’s memory footprint low.

Because of this, I would not expect to hit fragmentation issues with my current use case.
**Environment details**
- Environment location: Bare Metal
- Method of RMM install: conda
**Additional context**
There are two variants I tried using cuda_async_memory_resource: one that bins all allocations smaller than 4194304 bytes (4 MiB) into a pool_memory_resource<cuda_memory_resource> and uses a cuda_async_memory_resource for larger allocations, and one that just uses cuda_async_memory_resource for all allocations.
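For reference, the binning variant looks roughly like this (a sketch only; constructor arguments such as initial pool sizes vary between RMM versions):

```cpp
#include <rmm/mr/device/binning_memory_resource.hpp>
#include <rmm/mr/device/cuda_async_memory_resource.hpp>
#include <rmm/mr/device/cuda_memory_resource.hpp>
#include <rmm/mr/device/per_device_resource.hpp>
#include <rmm/mr/device/pool_memory_resource.hpp>

int main()
{
  // Small allocations come out of a pool built on plain cudaMalloc.
  rmm::mr::cuda_memory_resource cuda_mr;
  rmm::mr::pool_memory_resource<rmm::mr::cuda_memory_resource> pool_mr{&cuda_mr};

  // Everything larger than the bin falls through to cudaMallocAsync.
  rmm::mr::cuda_async_memory_resource async_mr;

  rmm::mr::binning_memory_resource<rmm::mr::cuda_async_memory_resource> binned_mr{&async_mr};
  binned_mr.add_bin(4194304, &pool_mr);  // allocations up to 4194304 bytes use pool_mr

  rmm::mr::set_current_device_resource(&binned_mr);
  // ... replay the rmm_log allocations here ...
}
```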
I figured removing the smaller allocations would reduce the incidence of fragmentation, but this doesn’t seem to have been enough. I am trying to figure out if there is anything I can do at run time to make sure the pool gets defragmented.
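One thing I can try at run time is a device synchronize followed by cudaMemPoolTrimTo, which releases reserved-but-unused memory in the pool back to the OS, though it is unclear whether it defragments what stays reserved (a minimal sketch, assuming `pool` is the active cudaMemPool_t handle):

```cpp
#include <cuda_runtime_api.h>

// Make sure pending frees have completed, then ask the driver to release all
// reserved-but-unused memory in the pool back to the OS. Passing 0 as the
// minimum bytes to keep means the pool retains only what is currently allocated.
void trim_pool(cudaMemPool_t pool)
{
  cudaDeviceSynchronize();
  cudaMemPoolTrimTo(pool, 0);
}
```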
About this issue
- State: closed
- Created a year ago
- Comments: 36 (36 by maintainers)
The driver team reported they were able to reproduce with the new example and haven’t identified the root cause yet.
Also note that in a real use case where you’re allocating on multiple threads all using PTDS, you may need to synchronize every thread’s PTDS when an allocation fails in order to ensure the maximum amount of memory is available.
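To illustrate that point (a minimal sketch, not part of the attached test): cudaStreamPerThread only refers to the default stream of the thread that calls into it, so each allocating thread has to run the synchronize itself, or the failing thread can fall back to a full cudaDeviceSynchronize().

```cpp
#include <cuda_runtime_api.h>

// Synchronizes only the per-thread default stream of the thread executing this
// call; other threads' PTDS are unaffected, so every allocating thread needs to
// do this (or use cudaDeviceSynchronize(), which waits on all streams).
void synchronize_my_ptds()
{
  cudaStreamSynchronize(cudaStreamPerThread);
}
```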
The arena memory resource does worse than pool (it makes the least progress in the allocation list). Pool and cuda_async_memory_resource are not too far off from one another.