intel-extension-for-pytorch: F32 Example Training Gets Stuck after One Iteration of For Loop

(base) tedliosu@victus-ted:~/Documents/all_git/intel-extension-for-pytorch/docker$ sudo -H ./build.sh
[a bunch of output here]
(base) tedliosu@victus-ted:~/Documents/all_git/intel-extension-for-pytorch/docker$ IMAGE_NAME=intel-extension-for-pytorch:gpu
(base) tedliosu@victus-ted:~/Documents/all_git/intel-extension-for-pytorch/docker$ VIDEO=$(getent group video | sed -E 's,^video:[^:]*:([^:]*):.*$,\1,')
(base) tedliosu@victus-ted:~/Documents/all_git/intel-extension-for-pytorch/docker$ RENDER=$(getent group render | sed -E 's,^render:[^:]*:([^:]*):.*$,\1,')
(base) tedliosu@victus-ted:~/Documents/all_git/intel-extension-for-pytorch/docker$ test -z "$RENDER" || RENDER_GROUP="--group-add ${RENDER}"
(base) tedliosu@victus-ted:~/Documents/all_git/intel-extension-for-pytorch/docker$ sudo -H docker run --rm -v /home/tedliosu/intel_pytorch_workspace:/workspace --group-add ${VIDEO} ${RENDER_GROUP} --device=/dev/dri --ipc=host -it $IMAGE_NAME bash
[sudo] password for tedliosu: 
groups: cannot find name for group ID 109
root@8e852a62c8b4:/# cd workspace/
root@d4958d53cb7c:/workspace# python3 -m trace -t ipex_f32_example.py 2>&1 | tee ipex_f32_example_py_trace.txt | grep ipex_f32_example
 --- modulename: ipex_f32_example, funcname: <module>
ipex_f32_example.py(1): import torch
<frozen importlib._bootstrap>(186): <frozen importlib._bootstrap>(187): <frozen importlib._bootstrap>(191): <frozen importlib._bootstrap>(192): <frozen importlib._bootstrap>(194): ipex_f32_example.py(2): import torchvision
<frozen importlib._bootstrap>(186): <frozen importlib._bootstrap>(187): <frozen importlib._bootstrap>(191): <frozen importlib._bootstrap>(192): <frozen importlib._bootstrap>(194): ipex_f32_example.py(4): import intel_extension_for_pytorch as ipex
<frozen importlib._bootstrap>(186): <frozen importlib._bootstrap>(187): <frozen importlib._bootstrap>(191): <frozen importlib._bootstrap>(192): <frozen importlib._bootstrap>(194): ipex_f32_example.py(7): LR = 0.001
ipex_f32_example.py(8): DOWNLOAD = True
ipex_f32_example.py(9): DATA = 'datasets/cifar10/'
ipex_f32_example.py(11): transform = torchvision.transforms.Compose([
ipex_f32_example.py(12):     torchvision.transforms.Resize((224, 224)),
ipex_f32_example.py(13):     torchvision.transforms.ToTensor(),
ipex_f32_example.py(14):     torchvision.transforms.Normalize((0.5, 0.5, 0.5), (0.5, 0.5, 0.5))
ipex_f32_example.py(11): transform = torchvision.transforms.Compose([
ipex_f32_example.py(16): train_dataset = torchvision.datasets.CIFAR10(
ipex_f32_example.py(17):         root=DATA,
ipex_f32_example.py(18):         train=True,
ipex_f32_example.py(19):         transform=transform,
ipex_f32_example.py(20):         download=DOWNLOAD,
ipex_f32_example.py(16): train_dataset = torchvision.datasets.CIFAR10(
ipex_f32_example.py(22): train_loader = torch.utils.data.DataLoader(
ipex_f32_example.py(23):         dataset=train_dataset,
ipex_f32_example.py(24):         batch_size=128
ipex_f32_example.py(22): train_loader = torch.utils.data.DataLoader(
ipex_f32_example.py(27): model = torchvision.models.resnet50()
ipex_f32_example.py(28): criterion = torch.nn.CrossEntropyLoss().to("xpu")
ipex_f32_example.py(29): optimizer = torch.optim.SGD(model.parameters(), lr = LR, momentum=0.9)
ipex_f32_example.py(30): model.train()
ipex_f32_example.py(32): model = model.to("xpu")
ipex_f32_example.py(33): model, optimizer = ipex.optimize(model, optimizer=optimizer, dtype=torch.float32)
ipex_f32_example.py(36): for batch_idx, (data, target) in enumerate(train_loader):
ipex_f32_example.py(37):     print("Begin 1 loop iteration")
ipex_f32_example.py(39):     data = data.to("xpu")
ipex_f32_example.py(40):     print("Moved data onto XPU")
ipex_f32_example.py(41):     target = target.to("xpu")
ipex_f32_example.py(42):     print("Moved target onto XPU")
ipex_f32_example.py(44):     optimizer.zero_grad()
ipex_f32_example.py(45):     print("About to apply model to data")
ipex_f32_example.py(46):     output = model(data)
ipex_f32_example.py(47):     print("Finished applying model to data")
ipex_f32_example.py(48):     loss = criterion(output, target)
ipex_f32_example.py(49):     print("About to execute loss.backward()")
ipex_f32_example.py(50):     loss.backward()
ipex_f32_example.py(51):     print("About to execute optimizer.step()")
ipex_f32_example.py(52):     optimizer.step()
ipex_f32_example.py(53):     print("Current batch id : %d" % (batch_idx))
ipex_f32_example.py(54):     data = None
ipex_f32_example.py(55):     target = None
ipex_f32_example.py(36): for batch_idx, (data, target) in enumerate(train_loader):
[I killed the process after ***90 minutes*** of being stuck here]
root@d4958d53cb7c:/workspace# tail -n35 ipex_f32_example_py_trace.txt
 --- modulename: collate, funcname: <genexpr>
collate.py(81):         if not all(len(elem) == elem_size for elem in it):
 --- modulename: collate, funcname: <genexpr>
collate.py(81):         if not all(len(elem) == elem_size for elem in it):
 --- modulename: collate, funcname: <genexpr>
collate.py(81):         if not all(len(elem) == elem_size for elem in it):
 --- modulename: collate, funcname: <genexpr>
collate.py(81):         if not all(len(elem) == elem_size for elem in it):
 --- modulename: collate, funcname: <genexpr>
collate.py(81):         if not all(len(elem) == elem_size for elem in it):
 --- modulename: collate, funcname: <genexpr>
collate.py(81):         if not all(len(elem) == elem_size for elem in it):
 --- modulename: collate, funcname: <genexpr>
collate.py(81):         if not all(len(elem) == elem_size for elem in it):
 --- modulename: collate, funcname: <genexpr>
collate.py(81):         if not all(len(elem) == elem_size for elem in it):
 --- modulename: collate, funcname: <genexpr>
collate.py(81):         if not all(len(elem) == elem_size for elem in it):
 --- modulename: collate, funcname: <genexpr>
collate.py(81):         if not all(len(elem) == elem_size for elem in it):
 --- modulename: collate, funcname: <genexpr>
collate.py(81):         if not all(len(elem) == elem_size for elem in it):
 --- modulename: collate, funcname: <genexpr>
collate.py(81):         if not all(len(elem) == elem_size for elem in it):
 --- modulename: collate, funcname: <genexpr>
collate.py(81):         if not all(len(elem) == elem_size for elem in it):
 --- modulename: collate, funcname: <genexpr>
collate.py(81):         if not all(len(elem) == elem_size for elem in it):
 --- modulename: collate, funcname: <genexpr>
collate.py(81):         if not all(len(elem) == elem_size for elem in it):
 --- modulename: collate, funcname: <genexpr>
collate.py(81):         if not all(len(elem) == elem_size for elem in it):
 --- modulename: collate, funcname: <genexpr>
collate.py(81):         if not all(len(elem) == elem_size for elem in it):
 --- modulename: collate, funcname: <genexpr>
root@d4958d53cb7c:/workspace# pip list
Package                     Version            
--------------------------- -------------------
contourpy                   1.0.6              
cycler                      0.11.0             
fonttools                   4.38.0             
intel-extension-for-pytorch 1.10.200+gpu       
kiwisolver                  1.4.4              
matplotlib                  3.6.1              
numpy                       1.23.4             
packaging                   21.3               
Pillow                      9.3.0              
pip                         20.0.2             
pyparsing                   3.0.9              
python-dateutil             2.8.2              
setuptools                  45.2.0             
six                         1.16.0             
torch                       1.10.0a0+git3d5f2d4
torchvision                 0.11.3             
typing-extensions           4.4.0              
wheel                       0.34.2

Contents of ipex_f32_example.py (as you can see it’s basically the Float32 example from here):

import torch
import torchvision
############# code changes ###############
import intel_extension_for_pytorch as ipex
############# code changes ###############

LR = 0.001
DOWNLOAD = True
DATA = 'datasets/cifar10/'

transform = torchvision.transforms.Compose([
    torchvision.transforms.Resize((224, 224)),
    torchvision.transforms.ToTensor(),
    torchvision.transforms.Normalize((0.5, 0.5, 0.5), (0.5, 0.5, 0.5))
])
train_dataset = torchvision.datasets.CIFAR10(
        root=DATA,
        train=True,
        transform=transform,
        download=DOWNLOAD,
)
train_loader = torch.utils.data.DataLoader(
        dataset=train_dataset,
        batch_size=128
)

model = torchvision.models.resnet50()
criterion = torch.nn.CrossEntropyLoss().to("xpu")
optimizer = torch.optim.SGD(model.parameters(), lr = LR, momentum=0.9)
model.train()
#################################### code changes ################################
model = model.to("xpu")
model, optimizer = ipex.optimize(model, optimizer=optimizer, dtype=torch.float32)
#################################### code changes ################################

for batch_idx, (data, target) in enumerate(train_loader):
    print("Begin 1 loop iteration")
    ########## code changes ##########
    data = data.to("xpu")
    print("Moved data onto XPU")
    target = target.to("xpu")
    print("Moved target onto XPU")
    ########## code changes ##########
    optimizer.zero_grad()
    print("About to apply model to data")
    output = model(data)
    print("Finished applying model to data")
    loss = criterion(output, target)
    print("About to execute loss.backward()")
    loss.backward()
    print("About to execute optimizer.step()")
    optimizer.step()
    print("Current batch id : %d" % (batch_idx))
    data = None
    target = None
torch.save({
     'model_state_dict': model.state_dict(),
     'optimizer_state_dict': optimizer.state_dict(),
     }, 'checkpoint.pth')

As you can see in the command line output I noted that the ipex_f32_example.py script basically froze for 90 minutes when I was running it after it reached the for batch_idx, (data, target) in enumerate(train_loader): line; when I was running it without the tracing it froze at data = data.to("xpu") for over 8 hours before I had to simply kill the process. I have no idea if this is a driver issue or a torchvision issue or whatever, but this is really annoying and I’d be more than happy to provide extra info about my system to help solve this freezing problem. Also note that tail -n35 ipex_f32_example_py_trace.txt displays the last 35 lines of the trace I ran on the script to see exactly where the execution of the script is freezing.

P.S. since I already mentioned this issue in here before I made this separate issue, I saw the reply here to my initial comment about this issue but I have no idea how to apply that person’s comment to help solve this issue 😕

About this issue

Original URL
State: open
Created 2 years ago
Comments: 20 (9 by maintainers)

Most upvoted comments

Hi @tedliosu! Thanks again for the info! We’ll investigate this issue while enabling Intel Extension for PyTorch for iGPUs. iGPUs are currently unsupported.

sanchitintel on Nov 7, 2022

Awesome! Thanks, @tedliosu! Looks like your iGPU is indeed tgllp! 😃

sanchitintel on Nov 1, 2022

Hi @tedliosu, thanks for checking it out! While the link you provided does not have the latest info (it doesn’t even list discrete GPUs), you’re right that you might have to use tgllp! Since we don’t currently have official support for iGPUs, we haven’t gotten a chance to check them out.

Can you please use dpcpp instead of icpx for your example, BTW? Are you using the latest oneAPI Basekit, BTW? If so, ocloc compile --help would show you even more device targets.

Thanks to you taking this initiative, your solution here would also help others! 😃

sanchitintel on Nov 1, 2022

Thanks for your interest in Intel Extension for PyTorch, @tedliosu! We look forward to your response!

As @jingxu10 also mentioned, the current whls are for Flex Series 170 GPUs (which are discrete GPUs similar to Intel Arc Alchemist series GPUs). Just FYI, AOT (or USE_AOT_DEVLIST) builds GPU kernels for the target device (in this case, the whls were generated for Discrete GPUs), so they’d not work with your iGPU, and you’d have to build from source for your own GPU.

sanchitintel on Nov 1, 2022

There are several things involved.

Computation power of iGPU is not compatible with that of Flex Series 170. It is expected performance is slow.
The prebuilt wheel files are AOTed for Flex Series 170. AOT is a technique that generates executable binaries in the wheel file during compilation time. On Non-Flex Series 170 gpus, IPEX will generate executable binaries in runtime first and then executes them. This also contributes to a “freezing” when the script is executed.
oneDNN JIT technique, that I described in your previous thread, also takes time generating executable binaries in runtime.

As Sanchit mentioned in the reply above, you can try compiling ipex from source with AOT configured for your graphics card. What needs to mention is that the official support to IPEX GPU currently is only on Flex Series 170.

jingxu10 on Nov 1, 2022

This current release is for Discrete Graphics cards. While it only mentions Flex Series 170 GPU, it also supports the Intel Arc Alchemist series GPUs.

Intel Extension for PyTorch is currently not officially supported for integrated GPUs. We may support them in the near future, though. However, if you’d like, you’d be able to build from source with these instructions, except that you’d have to set the environment variable USE_AOT_DEVLIST as xe, or modify USE_AOT_DEVLIST in CMakeLists.txt:

set(USE_AOT_DEVLIST "xe" CACHE STRING "Set device list for AOT build (for example, skl,ats,...)")

sanchitintel on Nov 1, 2022

There are several things involved.
1. Computation power of iGPU is not compatible with that of Flex Series 170. It is expected performance is slow.

2. The prebuilt wheel files are AOTed for Flex Series 170. AOT is a technique that generates executable binaries in the wheel file during compilation time. On Non-Flex Series 170 gpus, IPEX will generate executable binaries in runtime first and then executes them. This also contributes to a "freezing" when the script is executed.

3. oneDNN JIT technique, that I described in your previous thread, also takes time generating executable binaries in runtime.
As Sanchit mentioned in the reply above, you can try compiling ipex from source with AOT configured for your graphics card. What needs to mention is that the official support to IPEX GPU currently is only on Flex Series 170.

@jingxu10 Thank you so much for the info! I’ll try building from source and testing it out within the next week or so as I’m very busy with school right now 😄

tedliosu on Nov 1, 2022