intel-extension-for-pytorch: Arrays larger than 4 GB crash
Describe the bug
The Intel compute runtime does not allow allocating a single buffer larger than 4 GB:
https://github.com/intel/compute-runtime/issues/627
When you allocate an array larger than 4 GB with intel-extension-for-pytorch on an A770 16GB, it crashes:
x = torch.rand(46000, 46000, dtype=torch.float32, device='xpu')
Is it possible to allocate multiple buffers for an array instead of allocating one buffer for one array?
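For what it's worth, a purely illustrative user-side sketch of that chunking idea (this is not something IPEX does automatically; the chunk count here is arbitrary): splitting along one dimension keeps each individual XPU allocation well under 4 GB, at the cost of managing the pieces manually.
import torch
import intel_extension_for_pytorch  # noqa: F401  # registers the 'xpu' device

# A single 46000 x 46000 float32 tensor needs ~8.5 GB, i.e. one buffer > 4 GB.
rows, cols, n_chunks = 46000, 46000, 4
chunks = [torch.rand(rows // n_chunks, cols, dtype=torch.float32, device='xpu')
          for _ in range(n_chunks)]

# Operations then have to be applied chunk by chunk, e.g. an element-wise op:
chunks = [c * 2.0 for c in chunks]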
Versions
Collecting environment information...
PyTorch version: 1.13.0a0+gitb1dde16
PyTorch CXX11 ABI: Yes
IPEX version: 1.13.10+xpu
IPEX commit: 7d85b0e92
Build type: Release
OS: Ubuntu 22.04.1 LTS (x86_64)
GCC version: (Ubuntu 11.3.0-1ubuntu1~22.04) 11.3.0
Clang version: N/A
IGC version: N/A
CMake version: N/A
Libc version: glibc-2.35
Python version: 3.10.6 (main, Nov 14 2022, 16:10:14) [GCC 11.3.0] (64-bit runtime)
Python platform: Linux-6.3.0-1-x86_64-with-glibc2.35
Is XPU available: True
DPCPP runtime version: N/A
MKL version: N/A
GPU models and configuration:
[0] _DeviceProperties(name='Intel(R) Graphics [0x56a0]', platform_name='Intel(R) Level-Zero', dev_type='gpu, support_fp64=0, total_memory=15473MB, max_compute_units=512)
Intel OpenCL ICD version: 22.43.24595.35+i538~22.04
Level Zero version: 1.3.24595.35+i538~22.04
CPU:
Architecture: x86_64
CPU op-mode(s): 32-bit, 64-bit
Address sizes: 46 bits physical, 48 bits virtual
Byte Order: Little Endian
CPU(s): 20
On-line CPU(s) list: 0-19
Vendor ID: GenuineIntel
BIOS Vendor ID: Intel(R) Corporation
Model name: 13th Gen Intel(R) Core(TM) i5-13600K
BIOS Model name: 13th Gen Intel(R) Core(TM) i5-13600K
CPU family: 6
Model: 183
Thread(s) per core: 2
Core(s) per socket: 14
Socket(s): 1
Stepping: 1
CPU max MHz: 5100.0000
CPU min MHz: 800.0000
BogoMIPS: 6991.00
Flags: fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall nx pdpe1gb rdtscp lm constant_tsc art arch_perfmon pebs bts rep_good nopl xtopology nonstop_tsc cpuid aperfmperf tsc_known_freq pni pclmulqdq dtes64 monitor ds_cpl vmx smx est tm2 ssse3 sdbg fma cx16 xtpr pdcm pcid sse4_1 sse4_2 x2apic movbe popcnt tsc_deadline_timer aes xsave avx f16c rdrand lahf_lm abm 3dnowprefetch cpuid_fault epb invpcid_single ssbd ibrs ibpb stibp ibrs_enhanced tpr_shadow vnmi flexpriority ept vpid ept_ad fsgsbase tsc_adjust bmi1 avx2 smep bmi2 erms invpcid rdseed adx smap clflushopt clwb intel_pt sha_ni xsaveopt xsavec xgetbv1 xsaves split_lock_detect avx_vnni dtherm ida arat pln pts hwp hwp_notify hwp_act_window hwp_epp hwp_pkg_req hfi umip pku ospke waitpkg gfni vaes vpclmulqdq tme rdpid movdiri movdir64b fsrm md_clear serialize pconfig arch_lbr ibt flush_l1d arch_capabilities
Virtualization: VT-x
L1d cache: 544 KiB (14 instances)
L1i cache: 704 KiB (14 instances)
L2 cache: 20 MiB (8 instances)
L3 cache: 24 MiB (1 instance)
NUMA node(s): 1
NUMA node0 CPU(s): 0-19
Vulnerability Itlb multihit: Not affected
Vulnerability L1tf: Not affected
Vulnerability Mds: Not affected
Vulnerability Meltdown: Not affected
Vulnerability Mmio stale data: Not affected
Vulnerability Retbleed: Not affected
Vulnerability Spec store bypass: Mitigation; Speculative Store Bypass disabled via prctl
Vulnerability Spectre v1: Mitigation; usercopy/swapgs barriers and __user pointer sanitization
Vulnerability Spectre v2: Mitigation; Enhanced / Automatic IBRS, IBPB conditional, RSB filling, PBRSB-eIBRS SW sequence
Vulnerability Srbds: Not affected
Vulnerability Tsx async abort: Not affected
Versions of relevant libraries:
[pip3] intel-extension-for-pytorch==1.13.10+xpu
[pip3] numpy==1.24.1
[pip3] torch==1.13.0a0+gitb1dde16
[pip3] torchvision==0.14.1a0+0504df5
[conda] N/A
@tye1
Using an image larger than 768x512 in Stable Diffusion 1.5 results in a blank or garbled image, even though PyTorch isn't even using all of the A770's 16 GB: https://github.com/comfyanonymous/ComfyUI/issues/556
Every LLM is bigger than 4GB, and they all fail to load on the A770 even though they would fit in its VRAM.
Other huge models and datasets bigger than 4GB run out of memory. https://github.com/intel/intel-extension-for-pytorch/issues/421
Single >4GB VRAM allocations are possible on Arc, but currently they require 2 small workarounds in the application. For OpenCL, these are:
- For the cl::Buffer / clCreateBuffer allocation, you have to set the buffer flag bit (1<<23), which in the driver is called CL_MEM_ALLOW_UNRESTRICTED_SIZE_INTEL.
- For cl::Program::build / clBuildProgram, you have to set the compiler option "-cl-intel-greater-than-4GB-buffer-required" (see the sketch after this list).
I've added this to my OpenCL-Wrapper in this commit, so anything built on top of it works on Arc out-of-the-box.
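For illustration, a hedged sketch of those two OpenCL workarounds from Python via pyopencl (an assumption on my part: that pyopencl is available and passes the extra flag bit through to clCreateBuffer unchanged; the buffer size and kernel are placeholders).
import pyopencl as cl

ctx = cl.Context(dev_type=cl.device_type.GPU)

# Workaround 1: OR the vendor-specific flag bit (1<<23),
# CL_MEM_ALLOW_UNRESTRICTED_SIZE_INTEL, into the buffer creation flags.
CL_MEM_ALLOW_UNRESTRICTED_SIZE_INTEL = 1 << 23
size_bytes = 6 * 1024 ** 3  # 6 GB, i.e. above the default 4 GB limit
buf = cl.Buffer(ctx, cl.mem_flags.READ_WRITE | CL_MEM_ALLOW_UNRESTRICTED_SIZE_INTEL,
                size=size_bytes)

# Workaround 2: pass the Intel-specific compiler option when building any
# kernel that accesses such a buffer.
src = "__kernel void fill(__global float* x) { x[get_global_id(0)] = 1.0f; }"
prg = cl.Program(ctx, src).build(options=["-cl-intel-greater-than-4GB-buffer-required"])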
For Level Zero, the workarounds are similar: https://github.com/intel/compute-runtime/blob/master/programmers-guide/ALLOCATIONS_GREATER_THAN_4GB.md
I agree that >4GB allocations should be enabled by default. Such a limitation is out of place at a time when AI and simulation models routinely need far more VRAM. Using the full 16GB VRAM capacity of a 16GB GPU has to work no matter what. ISVs should not have to manually add patches just for Arc to enable basic functionality. Better to eliminate this complication and just make it work, and provide an option to disable >4GB allocations for optimization.
Exactly. I mean these 4GB limits (IIRC) have been variously mentioned here (wrt. pytorch programmers), for the OpenCL implementation (OCL programmers), etc.
Ok, so it (the limitation) is something that directly affects GPU/HPC/ML programmers.
As a group that writes HPC / GPGPU code, I think we’re especially used to benchmarking / analyzing / optimizing our code wrt. a myriad of trade-offs as to capability vs. speed vs. complexity etc.
“Oh look I’m going beyond {L1, L2, L3} cache size / cache line / page size – significant performance drop” ok, expected, but often necessary / desired if one needs the added RAM size.
Same thing using RAM vs registers or accessing RAM non sequentially, or about 50 other cases where real world code must / should deviate from the ideal best case performance strategy and must have the flexibility to do it as the programmer decides best at design time or even run-time.
I’d rather the most flexible / capable possibility “just work” easily, and if I have to optimize things somehow (if even possible) then I’ll spend the time to optimize the DSA I used or choose new speed / capability trade-offs if that’s even appropriate.
I'm hoping our RAM / VRAM sizes will keep increasing substantially every generation (16G Arcs now; hopefully 32-48G ECC'd B / C / D / NV / AMD / whatever cards in the months or year to come), so it seems key to be able to actually use ("it just works" style) all the VRAM one has paid for, particularly since, as mentioned above, it already "just works" on the CPU execution device while the GPU device has this unusual limit.
Again, thank you for the great work on this project.
But is there any possibility of still having it enabled, with a warning that comes up when an allocation exceeds 4GB and notes that performance would be significantly reduced? I imagine it's still better than CPU processing, which is the only alternative I (and I'm sure others too) have available.
Again, the only reason I bought this 16GB card was the potential for machine learning, so only being able to use 4/16GB is really rather frustrating, I hope you can see where I’m coming from.
I also understand if there's absolutely nothing you can do; it would just be really rather disappointing. If that is the case, perhaps this is something the ARC/IPEX team could work on to make possible, if it isn't directly achievable in this extension.
Thank you
Sorry for the late response. We disabled >4GB memory allocation on the ARC A770 because there are hardware limitations on ARC, and enabling it incurs a significant performance drop as a trade-off. This is not acceptable in IPEX's usage scenarios, hence we have disabled it.
The thing is, if all they're concerned about is slowdowns, then wouldn't it be easy enough to embed a warning that these slowdowns occur when transferring data in chunks greater than 4GB in size?
Significant slowdowns would mean at least it still works. Some functionality is better than no functionality. I’m sure a lot of people would agree with that. @tye1
I’m ok with >4GB allocation to cause some slowdowns. But if this cannot be implemented at all that means your GPUs are practically (not theoretically or technically) useless for Stable Diffusion and I’m going to sell my A770.
@tye1 This is also an issue I have. For highly complex models and long sequence lengths, even a batch size of 1 has the possibility of being larger than 4GB. Such limits should be determined by the VRAM capacity of the GPU, rather than in software I would have thought.
Same here. Machine learning is the only reason I paid extra for a 16gb card.
@cchheennhhaaoo That is not a fix at all tho…
I am still working my way through the Intel ARC documentation with respect to how the global / process / surface / execution unit / physical / virtual etc. addressing works at the architecture level. I have no full picture yet of how the multiple Intel driver / compute software layers above the HW affect the memory limitations, but I'd like to better understand where these limitations sit in the HW / driver / compute SW stack.
It is disappointing for the ARC A770-16 to be a 16GBy VRAM GPU and yet not let me (as a programmer) easily access as much data as desired anywhere in the card's VRAM (and also in my host side's application data RAM, and ideally beyond even those limits, as I exemplify below).
It makes me concerned for Battlemage, Celestial, Druid as well since apparently the programmer’s model of memory access for the nvidia GPUs has been (IMO) so much better even on their consumer GPUs for several past GPU generations.
I gladly got the ARC A770-16 to use its 16GBy of RAM for GPGPU, and I can see from several Intel documents that at least parts of the supported architecture can access 64-bit addresses and 48-bit virtual addresses, so at first glance I don't see why there is such a limitation now. I certainly hope that as the Intel GPU line progresses (Battlemage, Celestial, Druid, ...) the VRAM size per offered card model will increase into the 24-64+ GBy range, and that Intel consumer motherboard platforms will evolve to support wider, 256-512GBy RAM configurations. In such cases it seems only natural to hope / expect that GPU / CPU virtual addressing can become seamless and extend to the system's VM size, encompassing all system physical RAM and I/O if desired.
From Intel documentation showing mostly hopeful capabilities (though maybe SW is turning some things into SW limitations?):
Please see the below just for contrast in terms of what I’d consider (as a developer) a most ideal programming model and therefore the implied capabilities of the SW / HW architecture that goes below it to make it work so seamlessly.
In contrast to the above Intel architecture, looking at this below exemplified case (working already on several generations of consumer NVIDIA GPUs), the developer is able to seamlessly access data anywhere in the VRAM of any of the GPUs in a system, but also CPU memory anywhere in their application’s CPU address space, and in fact also CPU virtual addresses that greatly exceed the physical VRAM of any system GPU / CPU attached RAM, all seamlessly and with efficient / high performance reference to such memory from either CPU application software or GPGPU kernels executing on any of the system’s GPUs.
Here’s the citations about the programmer’s view of memory (as I understand it to be relevant) in competitive (i.e. both for consumer use gaming cards as well as enterprise ones) NVIDIA GPUs. The following citations / sources are about CUDA/NVIDIA GPU’s “Unified Memory” and “Heterogeneous Memory” models on GPUs including their consumer GPUs:
https://developer.nvidia.com/blog/simplifying-gpu-application-development-with-heterogeneous-memory-management/
https://developer.nvidia.com/blog/unified-memory-cuda-beginners/
Here are small relevant excerpts
I'm not sure why we can't have such a programming-model capability mapped to efficient HW operations for Arcanist, but I would have naturally expected that, by now, GPUs with NN GBy VRAM and CPUs with NNN GBy RAM and TBy-scale VM could simply access data in physical RAM / VRAM / virtual VM pretty flexibly, as exemplified above.
IMO it would be nice to see Battlemage, Celestial, Druid, Arcanist improve this aspect of the programming model, and also finally implement SR-IOV so at least we can easily run graphics / compute in a few VMs (after all consumer desktops already virtualize / MMU / IOMMU everything else and have 128+ GBy RAM with 16+ core CPUs etc.).
Unfortunately that's the issue: with long sequences and large model sizes when using transformer encoders, my use case requires being able to move more than 4GB in one go. Unless there is a way built into this extension that automatically splits the model into chunks before loading it into memory (same with samples and/or batches)?
P.S. Or even a manual way to do this?
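On the "manual way" question: a hypothetical sketch of sub-batching (the helper name and chunk size are made up, it assumes intel_extension_for_pytorch has been imported and the model itself already fits on the device, and it only helps when the oversized allocation comes from batched inputs/activations rather than a single weight tensor over 4GB).
import torch

def forward_in_chunks(model, batch, max_chunk_rows):
    # Hypothetical helper: run a large batch through an XPU model in sub-batches,
    # so no single input transfer or activation has to exceed the 4GB limit.
    outputs = []
    with torch.no_grad():
        for sub in torch.split(batch, max_chunk_rows, dim=0):
            outputs.append(model(sub.to('xpu')).to('cpu'))
    return torch.cat(outputs, dim=0)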
@tye1 Pretty much what @BA8F0D39 said, plus the fact that you need to use workarounds that you don't need with Nvidia GPUs, for example using a smaller batch size and loading multiple separate batches on the GPU, ...
The main problem, I would say, is that a lot of PyTorch code you can find around the internet simply assumes that you can allocate more than 4GB, since that's supported on Nvidia GPUs.
I implemented a W/A in stable-diffusion-webui for scaled_dot_product_attention, which is memory intensive (so easily triggers the 4GB limitation on Arc), by slicing large-batch SDPA into smaller chunks: https://github.com/AUTOMATIC1111/stable-diffusion-webui/pull/14353. I'm wondering whether such a mechanism could be implemented at the IPEX framework level. Adding IPEX workarounds in upper-level applications is just not scalable.
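For illustration only (this is not the code from that PR), a minimal sketch of the slicing idea, assuming PyTorch 2.x where torch.nn.functional.scaled_dot_product_attention is available; the chunk size is arbitrary.
import torch
import torch.nn.functional as F

def sliced_sdpa(q, k, v, chunk_size=4):
    # Split the batch dimension so each SDPA call keeps its attention
    # intermediates well under the 4 GB allocation limit, then re-concatenate.
    outs = [F.scaled_dot_product_attention(qs, ks, vs)
            for qs, ks, vs in zip(q.split(chunk_size), k.split(chunk_size), v.split(chunk_size))]
    return torch.cat(outs, dim=0)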
Same here, it's the ONLY reason I bought this card; it's the first Intel product I've bought in 15 years, and it will be the last.
@tye1 All modern transformer/GAN models are larger than 4GB and they all fail with IPEX https://github.com/intel/intel-extension-for-pytorch/issues/492
Are there any updates on this or is the stance still “we don’t plan to support this”? Asking, since if it’s the latter, I’d be looking to sell my gpu sooner rather than later.
@BA8F0D39 @fredlarochelle We don't plan to support this. You can still allocate >4GB with 2.0.110+xpu, because we disabled the allocation in master, not in the previously released drop. Could you please provide the justification for why >4GB allocation is required?
@fredlarochelle @BA8F0D39 After the next code synchronization, memory allocation greater than 4GB will be disabled on Arc, and an error message will be raised when a user requests it.