intel-extension-for-pytorch: Arrays larger than 4 GB crash

Describe the bug

Intel compute runtime doesn’t allow allocating a buffer bigger than 4 GB.

https://github.com/intel/compute-runtime/issues/627

When you allocate an array bigger than 4 GB with intel-extension-for-pytorch on an A770 16 GB, it crashes:

x = torch.rand(46000, 46000, dtype=torch.float32, device='xpu')

Is it possible to allocate multiple buffers for an array instead of allocating one buffer for one array?
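
To make the question concrete, here is a rough sketch of what I mean (the block size of 4600 rows is arbitrary; the full 46000 x 46000 float32 tensor needs roughly 8.5 GB, while each row block here stays well under 4 GB):

import torch
import intel_extension_for_pytorch  # noqa: F401  (registers the 'xpu' device)

rows, cols, block = 46000, 46000, 4600
# One buffer would be 46000 * 46000 * 4 bytes ~= 8.5 GB; each block is ~0.85 GB.
chunks = [torch.rand(block, cols, dtype=torch.float32, device='xpu')
          for _ in range(rows // block)]
# Downstream code then has to work block by block; torch.cat(chunks) would again
# ask the driver for a single contiguous >4 GB buffer and fail.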

Versions

Collecting environment information...
PyTorch version: 1.13.0a0+gitb1dde16
PyTorch CXX11 ABI: Yes
IPEX version: 1.13.10+xpu
IPEX commit: 7d85b0e92
Build type: Release

OS: Ubuntu 22.04.1 LTS (x86_64)
GCC version: (Ubuntu 11.3.0-1ubuntu1~22.04) 11.3.0
Clang version: N/A
IGC version: N/A
CMake version: N/A
Libc version: glibc-2.35

Python version: 3.10.6 (main, Nov 14 2022, 16:10:14) [GCC 11.3.0] (64-bit runtime)
Python platform: Linux-6.3.0-1-x86_64-with-glibc2.35
Is XPU available: True
DPCPP runtime version: N/A
MKL version: N/A
GPU models and configuration: 
[0] _DeviceProperties(name='Intel(R) Graphics [0x56a0]', platform_name='Intel(R) Level-Zero', dev_type='gpu, support_fp64=0, total_memory=15473MB, max_compute_units=512)
Intel OpenCL ICD version: 22.43.24595.35+i538~22.04
Level Zero version: 1.3.24595.35+i538~22.04

CPU:
Architecture:                    x86_64
CPU op-mode(s):                  32-bit, 64-bit
Address sizes:                   46 bits physical, 48 bits virtual
Byte Order:                      Little Endian
CPU(s):                          20
On-line CPU(s) list:             0-19
Vendor ID:                       GenuineIntel
BIOS Vendor ID:                  Intel(R) Corporation
Model name:                      13th Gen Intel(R) Core(TM) i5-13600K
BIOS Model name:                 13th Gen Intel(R) Core(TM) i5-13600K
CPU family:                      6
Model:                           183
Thread(s) per core:              2
Core(s) per socket:              14
Socket(s):                       1
Stepping:                        1
CPU max MHz:                     5100.0000
CPU min MHz:                     800.0000
BogoMIPS:                        6991.00
Flags:                           fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall nx pdpe1gb rdtscp lm constant_tsc art arch_perfmon pebs bts rep_good nopl xtopology nonstop_tsc cpuid aperfmperf tsc_known_freq pni pclmulqdq dtes64 monitor ds_cpl vmx smx est tm2 ssse3 sdbg fma cx16 xtpr pdcm pcid sse4_1 sse4_2 x2apic movbe popcnt tsc_deadline_timer aes xsave avx f16c rdrand lahf_lm abm 3dnowprefetch cpuid_fault epb invpcid_single ssbd ibrs ibpb stibp ibrs_enhanced tpr_shadow vnmi flexpriority ept vpid ept_ad fsgsbase tsc_adjust bmi1 avx2 smep bmi2 erms invpcid rdseed adx smap clflushopt clwb intel_pt sha_ni xsaveopt xsavec xgetbv1 xsaves split_lock_detect avx_vnni dtherm ida arat pln pts hwp hwp_notify hwp_act_window hwp_epp hwp_pkg_req hfi umip pku ospke waitpkg gfni vaes vpclmulqdq tme rdpid movdiri movdir64b fsrm md_clear serialize pconfig arch_lbr ibt flush_l1d arch_capabilities
Virtualization:                  VT-x
L1d cache:                       544 KiB (14 instances)
L1i cache:                       704 KiB (14 instances)
L2 cache:                        20 MiB (8 instances)
L3 cache:                        24 MiB (1 instance)
NUMA node(s):                    1
NUMA node0 CPU(s):               0-19
Vulnerability Itlb multihit:     Not affected
Vulnerability L1tf:              Not affected
Vulnerability Mds:               Not affected
Vulnerability Meltdown:          Not affected
Vulnerability Mmio stale data:   Not affected
Vulnerability Retbleed:          Not affected
Vulnerability Spec store bypass: Mitigation; Speculative Store Bypass disabled via prctl
Vulnerability Spectre v1:        Mitigation; usercopy/swapgs barriers and __user pointer sanitization
Vulnerability Spectre v2:        Mitigation; Enhanced / Automatic IBRS, IBPB conditional, RSB filling, PBRSB-eIBRS SW sequence
Vulnerability Srbds:             Not affected
Vulnerability Tsx async abort:   Not affected

Versions of relevant libraries:
[pip3] intel-extension-for-pytorch==1.13.10+xpu
[pip3] numpy==1.24.1
[pip3] torch==1.13.0a0+gitb1dde16
[pip3] torchvision==0.14.1a0+0504df5
[conda] N/A

About this issue

  • State: open
  • Created a year ago
  • Reactions: 1
  • Comments: 48 (4 by maintainers)

Most upvoted comments

@tye1

Using an image larger than 768x512 in Stable Diffusion 1.5 results in a blank or garbled image, even though PyTorch doesn’t come close to using all 16 GB of the A770: https://github.com/comfyanonymous/ComfyUI/issues/556

Every LLM is bigger than 4 GB, and they all fail to load on the A770 even though they would fit in its VRAM.

Other huge models and datasets bigger than 4 GB run out of memory. https://github.com/intel/intel-extension-for-pytorch/issues/421

Single >4GB VRAM allocations are possible on Arc, but currently they require two small workarounds in the application. For OpenCL (a Python sketch follows the list below), these are:

  1. In every cl::Buffer/clCreateBuffer allocation, you have to set the buffer flag bit (1<<23), which in the driver is called CL_MEM_ALLOW_UNRESTRICTED_SIZE_INTEL.
  2. In cl::Program::build/clBuildProgram, you have to set the compiler option "-cl-intel-greater-than-4GB-buffer-required".
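
For illustration, a rough pyopencl sketch of both steps (pyopencl, the kernel source, and the 6 GB size are just my example; the flag bit 1<<23 and the compiler option are the ones named above):

import pyopencl as cl

CL_MEM_ALLOW_UNRESTRICTED_SIZE_INTEL = 1 << 23   # driver-specific flag bit

ctx = cl.create_some_context()

# 1. Pass the unrestricted-size flag on every buffer that may exceed 4 GB.
nbytes = 6 * 1024**3
buf = cl.Buffer(ctx,
                cl.mem_flags.READ_WRITE | CL_MEM_ALLOW_UNRESTRICTED_SIZE_INTEL,
                size=nbytes)

# 2. Build kernels that touch such buffers with the matching compiler option.
src = "__kernel void fill(__global float* x) { x[get_global_id(0)] = 1.0f; }"
prg = cl.Program(ctx, src).build(
    options=["-cl-intel-greater-than-4GB-buffer-required"])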

I’ve added this to my OpenCL-Wrapper in this commit, so anything built on top of it works on Arc out-of-the-box.

For Level Zero, the workarounds are similar: https://github.com/intel/compute-runtime/blob/master/programmers-guide/ALLOCATIONS_GREATER_THAN_4GB.md

I agree that >4GB allocations should be enabled by default. Such a limitation is out of place at a time when AI and simulation models routinely need far more VRAM than that. Using the full 16GB VRAM capacity of a 16GB GPU has to work no matter what. ISVs should not have to manually add patches just for Arc to enable basic functionality. Better to eliminate this complication and just make it work, and provide the option to disable >4GB allocations as an optimization.

The thing is, if all they’re concerned about is slowdowns, then wouldn’t it be easy enough to embed a warning that these slowdowns occur when transferring data in chunks greater than 4 GB in size?

Significant slowdowns would mean at least it still works. Some functionality is better than no functionality. I’m sure a lot of people would agree with that. @tye1
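
Conceptually, something as simple as a size check at allocation time would do (purely illustrative; the wrapper name and the threshold check below are made up, not IPEX API):

import math
import warnings
import torch
import intel_extension_for_pytorch  # noqa: F401

def empty_xpu_with_warning(*shape, dtype=torch.float32):
    # Hypothetical wrapper: warn about the slowdown instead of refusing the allocation.
    nbytes = math.prod(shape) * torch.tensor([], dtype=dtype).element_size()
    if nbytes > 4 * 1024**3:
        warnings.warn(f"Single {nbytes / 2**30:.1f} GiB allocation exceeds 4 GiB; "
                      "expect a significant performance drop on Arc.")
    return torch.empty(*shape, dtype=dtype, device='xpu')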

Exactly. I mean these 4GB limits (IIRC) have been variously mentioned here (wrt. pytorch programmers), for the OpenCL implementation (OCL programmers), etc.

Ok, so it (the limitation) is something that directly affects GPU/HPC/ML programmers.

As a group that writes HPC / GPGPU code, I think we’re especially used to benchmarking / analyzing / optimizing our code wrt. a myriad of trade-offs as to capability vs. speed vs. complexity etc.

“Oh look I’m going beyond {L1, L2, L3} cache size / cache line / page size – significant performance drop” ok, expected, but often necessary / desired if one needs the added RAM size.

Same thing with using RAM vs. registers, or accessing RAM non-sequentially, or about 50 other cases where real-world code must / should deviate from the ideal best-case performance strategy and must have the flexibility to do so as the programmer decides best at design time or even run time.

I’d rather the most flexible / capable possibility “just work” easily, and if I have to optimize things somehow (if even possible) then I’ll spend the time to optimize the DSA I used or choose new speed / capability trade-offs if that’s even appropriate.

I’m hoping our RAM / VRAM sizes will keep increasing substantially every generation (16G Arcs now, hopefully 32-48G “ECCd” B / C / D / NV / AMD / whatever cards in the months / year or so to come), so it seems key to be able to actually use (“it just works” style) all the VRAM one has paid for, particularly since, IIRC, as mentioned above, it “just works” on the CPU execution device, while the GPU device has this unusual limit.

Sorry for the late response. We disabled >4GB memory allocation on the Arc A770 because there are some hardware limitations on Arc, and there would be a significant performance drop as the trade-off penalty. This is not acceptable in IPEX’s usage scenarios, hence we have disabled it.

Again, thank you for the great work on this project,

But is there any possibility of still having it enabled, with a warning that comes up when an allocation exceeds 4GB and notes that performance will be significantly reduced? I imagine it’s still better than CPU processing, which is the only alternative I (and I’m sure others too) have available.

Again, the only reason I bought this 16GB card was the potential for machine learning, so only being able to use 4 of the 16GB is really rather frustrating; I hope you can see where I’m coming from.

I also understand if there’s absolutely nothing you can do; it would just be really rather disappointing. If that is the case, perhaps this is something the Arc/IPEX team could work on to make possible, if it isn’t directly possible in this extension?

Thank you

Sorry for the late response. We disabled >4GB memory allocation on the Arc A770 because there are some hardware limitations on Arc, and there would be a significant performance drop as the trade-off penalty. This is not acceptable in IPEX’s usage scenarios, hence we have disabled it.

@tye1 This is not acceptable. Is a solution that allows >4GB allocations possible?

The advertised VRAM on these cards is 16GB. It’s unacceptable to advertise that memory capacity, have it functional, and then later disable it, leaving the customer with a GPU they can no longer use.

I was very excited to build a server with 7xA770s for machine learning work. In my previous work with A770 I had no issues, which caused me to invest in $2500 worth of GPUs. Think about how frustrating this is from the customer’s perspective please.

The thing is, if all they’re concerned about is slowdowns, then wouldn’t it be easy enough to embed a warning that these slowdowns occur when transferring data in chunks greater than 4 GB in size?

Significant slowdowns would mean at least it still works. Some functionality is better than no functionality. I’m sure a lot of people would agree with that. @tye1

I’m OK with >4GB allocations causing some slowdowns. But if this cannot be implemented at all, that means your GPUs are practically (not theoretically or technically) useless for Stable Diffusion, and I’m going to sell my A770.

@tye1 This is also an issue I have. For highly complex models and long sequence lengths, even a batch size of 1 can be larger than 4GB. Such limits should be determined by the VRAM capacity of the GPU rather than in software, I would have thought.

Are there any updates on this or is the stance still “we don’t plan to support this”? Asking, since if it’s the latter, I’d be looking to sell my gpu sooner rather than later.

Same here. Machine learning is the only reason I paid extra for a 16gb card.

@cchheennhhaaoo That is not a fix at all tho…

I am still working my way through the Intel Arc documentation on how the global / process / surface / execution-unit / physical / virtual etc. addressing works at the architecture level. I have no full picture yet of how the multiple Intel driver / compute software layers above the HW affect the memory limitations, but I’d like to better understand where these limitations sit in the HW / driver / compute SW stack.

It is disappointing for the Arc A770-16 to be a 16GBy VRAM GPU and yet not let the programmer easily access as much data as desired anywhere in the card’s VRAM (and also in the host side’s application data RAM, ideally beyond even those limits, as I exemplify below).

It makes me concerned for Battlemage, Celestial, Druid as well since apparently the programmer’s model of memory access for the nvidia GPUs has been (IMO) so much better even on their consumer GPUs for several past GPU generations.

I gladly got the Arc A770-16 to use its 16GBy of RAM for GPGPU, and I can see from several Intel documents that at least parts of the supported architecture can access 64-bit addresses and 48-bit virtual addresses, so at first glance I don’t see why such a limitation exists now. I certainly hope that as the Intel GPU line progresses (Battlemage, Celestial, Druid, …), the VRAM size per card model will increase into the 24-64+ GBy range, and that the Intel consumer motherboard platforms will evolve to support wider, 256-512GBy RAM. In such cases it seems only natural to hope / expect that GPU / CPU virtual addressing can become seamless and extend to the system’s VM size, encompassing all system physical RAM and I/O if desired.

From Intel documentation showing mostly hopeful capabilities (though maybe SW is turning some things into SW limitations?):

Graphics Virtual Memory
...Although the range of supported graphics virtual addresses varies, most GPU commands and GPU
instructions use a common 64 bit definition for a graphics virtual address.

Per-Process GTT with 48b VA
The GPU typically operates on behalf of user-level processes (applications), each of which has its own
"Per-Process" virtual address space. The size of this space is 256TB (48b address width).
...
Shared virtual global memory (SVM)
Accessible and visible to all work items on any GPU and the host.
...
NP STATE_BASE_ADDRESS Base addresses for the Instruction, General State, Surface State, and Bindless Surface State memory heaps. GPGPU kernels reference these memory areas using 32b offsets from the 64b base addresses.
...
Address Models
Data structures accessed by the data port are called "surfaces". There are four different Address Models used by the data port to access these surfaces:
...
64-bit Stateless model (A64).
A64 Flat/Stateless Model
This model is primarily intended for programmable shader programs.

Please see below, just for contrast, what I’d consider (as a developer) a most ideal programming model, and therefore the implied capabilities of the SW / HW architecture beneath it that make it work so seamlessly.

In contrast to the Intel architecture above, in the case exemplified below (already working on several generations of consumer NVIDIA GPUs), the developer can seamlessly access data anywhere in the VRAM of any of the GPUs in a system, but also CPU memory anywhere in the application’s CPU address space, and in fact even CPU virtual addresses that greatly exceed the physical VRAM of any system GPU or the CPU-attached RAM, all seamlessly and with efficient / high-performance access to such memory from either CPU application software or GPGPU kernels executing on any of the system’s GPUs.

Here are the citations about the programmer’s view of memory (as I understand it to be relevant) in competitive (i.e. both consumer gaming cards and enterprise ones) NVIDIA GPUs. The following citations / sources are about the CUDA/NVIDIA GPUs’ “Unified Memory” and “Heterogeneous Memory” models, including on their consumer GPUs:

https://developer.nvidia.com/blog/simplifying-gpu-application-development-with-heterogeneous-memory-management/

https://developer.nvidia.com/blog/unified-memory-cuda-beginners/

Here are small relevant excerpts

The Benefits of Unified Memory on Pascal and Later GPUs

Starting with the Pascal GPU architecture, Unified Memory functionality is significantly improved with 49-bit virtual addressing and on-demand page migration. 49-bit virtual addresses are sufficient to enable GPUs to access the entire system memory plus the memory of all GPUs in the system. The Page Migration engine allows GPU threads to fault on non-resident memory accesses so the system can migrate pages on demand from anywhere in the system to the GPU’s memory for efficient processing.

In other words, Unified Memory transparently enables oversubscribing GPU memory, enabling out-of-core computations for any code that is using Unified Memory for allocations (e.g. cudaMallocManaged()). It “just works” without any modifications to the application, whether running on one GPU or multiple GPUs.

Also, Pascal and Volta GPUs support system-wide atomic memory operations. That means you can atomically operate on values anywhere in the system from multiple GPUs. This is useful in writing efficient multi-GPU cooperative algorithms.

Demand paging can be particularly beneficial to applications that access data with a sparse pattern. In some applications, it’s not known ahead of time which specific memory addresses a particular processor will access. Without hardware page faulting, applications can only pre-load whole arrays, or suffer the cost of high-latency off-device accesses (also known as “Zero Copy”). But page faulting means that only the pages the kernel accesses need to be migrated.
Heterogeneous Memory Management (HMM) is a CUDA memory management feature that extends the simplicity and productivity of the [CUDA Unified Memory](https://developer.nvidia.com/blog/unified-memory-in-cuda-6/) programming model to include system allocated memory on systems with PCIe-connected NVIDIA GPUs. System allocated memory refers to memory that is ultimately allocated by the operating system; for example, through malloc, mmap, the C++ new operator (which of course uses the preceding mechanisms), or related system routines that set up CPU-accessible memory for the application. 

Previously, on PCIe-based machines, system allocated memory was not directly accessible by the GPU. The GPU could only access memory that came from special allocators such as cudaMalloc or cudaMallocManaged. 

With HMM enabled, all application threads (GPU or CPU) can directly access all of the application’s system allocated memory. As with Unified Memory (which can be thought of as a subset of, or precursor to HMM), there is no need to manually copy system allocated memory between processors. This is because it is automatically placed on the CPU or GPU, based on processor usage.
...
Atomic memory operations and synchronization primitives

HMM supports all memory operations, which includes atomic memory operations. That is, programmers may use atomic memory operations to synchronize GPU and CPU threads with flags. 
...
Leverage memory-mapped I/O for fast development 

One of the interesting features that HMM provides is memory-mapped file I/O directly from the GPU. It enables developers to directly read files from supported storage or /disk without staging them in system memory and without copying the data to the high bandwidth GPU memory. 
...
The ERA5 dataset consists of hourly estimates of several atmospheric variables. In the dataset, total precipitation data for each month is stored in a separate file. We used 40 years of total precipitation data from 1981–2020, which sum to 480 input files aggregating to ~1.3 TB total input data size. See Figure 1 for example results.
...
Using the Unix mmap API, input files can be mapped to a contiguous virtual address space. With HMM, this virtual address can be passed as input to a CUDA kernel which can then directly access the values to build a histogram of total precipitation for each hour for all the days in a year. 
...
Enabling and detecting HMM
A GPU with one of the following supported architectures: NVIDIA Turing, NVIDIA Ampere, NVIDIA Ada Lovelace, NVIDIA Hopper, or newer.
...

I’m not sure why we can’t have such a programming-model capability, mapped to efficient HW operations, for Arcanist, but I would have expected by now that GPUs with NN GBy of VRAM and CPUs with NNN GBy of RAM and TBy-scale VM could simply access data in physical RAM / VRAM / virtual VM pretty flexibly, as exemplified above.

IMO it would be nice to see Battlemage, Celestial, Druid, Arcanist improve this aspect of the programming model, and also finally implement SR-IOV so at least we can easily run graphics / compute in a few VMs (after all, consumer desktops already virtualize / MMU / IOMMU everything else and have 128+ GBy RAM with 16+ core CPUs, etc.).

Again, the only reason I bought this 16GB card was the potential for machine learning, so only being able to use 4 of the 16GB is really rather frustrating; I hope you can see where I’m coming from.

I guess this only impacts cases where you have to allocate single memory chunks larger than 4GB. If your workload doesn’t need such a big chunk, you can still allocate up to 16GB in total (maybe a little less than that, due to the runtime/driver’s own needs)?

Unfortunately, that’s the issue: with long sequences and large model sizes when using transformer encoders, my use case requires being able to move more than 4GB in one go. Unless there is a way built into this extension that automatically splits the model into chunks before loading it into memory (and the same for samples and/or batches)?

P.S. Or even a manual way to do this?
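
One manual route that might work is to keep every transfer under 4 GB, e.g. by splitting an oversized batch before it ever reaches the device (a sketch only; forward_in_chunks and chunk_size are made up, and it only helps if each chunk, its activations, and every individual weight tensor stay under 4 GB):

import torch
import intel_extension_for_pytorch  # noqa: F401

def forward_in_chunks(model, batch, chunk_size=8):
    # Hypothetical helper: run one oversized batch as several smaller ones,
    # so no single host->device copy or activation buffer exceeds 4 GB.
    outputs = []
    for piece in torch.split(batch, chunk_size, dim=0):
        outputs.append(model(piece.to('xpu')).cpu())
    return torch.cat(outputs, dim=0)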

@tye1 Pretty much what @BA8F0D39 said, plus that you need to use workarounds that you don’t need with Nvidia GPUs. For example, using a smaller batch size and loading multiple separate batches on the GPU, …

The main problem, I would say, is that a lot of PyTorch code you can find around the internet simply assumes you can allocate more than 4GB, since that’s supported on Nvidia GPUs.

I implemented a workaround in stable-diffusion-webui for scaled_dot_product_attention, which is memory-intensive (so it easily triggers the 4GB limitation on Arc), by slicing large-batch SDPA into smaller chunks: https://github.com/AUTOMATIC1111/stable-diffusion-webui/pull/14353

I’m wondering whether such a mechanism could be implemented at the IPEX framework level. Adding IPEX workarounds in upper-level applications is just not scalable.
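
Roughly the idea, as a minimal sketch (the function name and chunk size are arbitrary; the linked PR is the actual implementation):

import torch
import torch.nn.functional as F

def sliced_sdpa(q, k, v, chunk=4):
    # Split along the batch dimension so each scaled_dot_product_attention call
    # materializes an attention buffer well under the 4 GB limit.
    outs = [F.scaled_dot_product_attention(qs, ks, vs)
            for qs, ks, vs in zip(q.split(chunk), k.split(chunk), v.split(chunk))]
    return torch.cat(outs)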

Are there any updates on this or is the stance still “we don’t plan to support this”? Asking, since if it’s the latter, I’d be looking to sell my gpu sooner rather than later.

Same here. Machine learning is the only reason I paid extra for a 16gb card.

Same here it’s the ONLY reason I bought this card, first Intel product I’ve bought in 15 years, and it will be the last

@tye1 All modern transformer/GAN models are larger than 4GB and they all fail with IPEX https://github.com/intel/intel-extension-for-pytorch/issues/492

Are there any updates on this or is the stance still “we don’t plan to support this”? Asking, since if it’s the latter, I’d be looking to sell my gpu sooner rather than later.

@BA8F0D39 @fredlarochelle We don’t plan to support this. You can still allocate >4GB with 2.0.110+xpu, because we disabled the allocation in master, not in the previously released drop. Could you please provide justification for why >4GB allocation is required?

@fredlarochelle @BA8F0D39 After the next code synchronization, memory allocation greater than 4GB will be disabled on Arc, and an error message will be raised when a user requests it.