anomalib: [Bug]: Patchcore exported ONNX file is not usable

Describe the bug

Hi all! Has anyone tried to run inference with the PatchCore ONNX model exported from anomalib, for example with onnxruntime? The model is apparently buggy, as it asks for an insane amount of memory (I haven’t been able to run it on an 80GB machine on CPU, for example). The error I keep getting is:

onnxruntime/onnxruntime/core/framework/bfc_arena.cc:342 void* onnxruntime::BFCArena::AllocateRawInternal(size_t, bool) Failed to allocate memory for requested buffer of size 288601669632.

The problematic layer is apparently this Sub node. Does anyone have any clue how to fix it, or is there any workaround?

Dataset

Folder

Model

PatchCore

Steps to reproduce the behavior

1. Install Anomalib.
2. Train a PatchCore model.
3. Try to run inference with the exported ONNX model.
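
A minimal sketch of step 3 with onnxruntime (the model path and the 256x256 input size are placeholders, not the actual values from my setup):

# Minimal sketch of step 3; the model path and input size are placeholders.
import numpy as np
import onnxruntime as ort

session = ort.InferenceSession("path/to/model.onnx", providers=["CPUExecutionProvider"])
input_name = session.get_inputs()[0].name

dummy = np.random.rand(1, 3, 256, 256).astype(np.float32)  # NCHW dummy image
# This is the call that fails with the BFCArena allocation error reported in this issue.
outputs = session.run(None, {input_name: dummy})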

OS information

OS information:

  • OS: [e.g. Ubuntu 20.04]
  • Python version: [e.g. 3.8.10]
  • Anomalib version: [e.g. 0.4.0]
  • PyTorch version: [e.g. 1.9.0]
  • CUDA/cuDNN version: [e.g. 11.4]
  • GPU models and configuration: [1x A100]

Expected behavior

The ONNX model exported from PatchCore should be usable for inference; currently it is not working.

Screenshots

No response

Pip/GitHub

pip

What version/branch did you use?

0.4.0

Configuration YAML

model:
  name: patchcore
  backbone: wide_resnet50_2
  pre_trained: true
  layers:
    - layer2
    - layer3
  coreset_sampling_ratio: 0.1
  num_neighbors: 9
  normalization_method: min_max # options: [null, min_max, cdf]

metrics:
  image:
    - F1Score
    - AUROC
  pixel:
    - F1Score
    - AUROC
  threshold:
    method: adaptive #options: [adaptive, manual]
    manual_image: null
    manual_pixel: null

visualization:
  show_images: False # show images on the screen
  save_images: True # save images to the file system
  log_images: True # log images to the available loggers (if any)
  image_save_path: null # path to which images will be saved
  mode: full # options: ["full", "simple"]

project:
  seed: 42
  path: ./results

logging:
  logger: [] # options: [comet, tensorboard, wandb, csv] or combinations.
  log_graph: false # Logs the model graph to respective logger.

optimization:
  export_mode: onnx #options: onnx, openvino

# PL Trainer Args. Don't add extra parameter here.
trainer:
  enable_checkpointing: true
  default_root_dir: null
  gradient_clip_val: 0
  gradient_clip_algorithm: norm
  num_nodes: 1
  devices: 1
  enable_progress_bar: true
  overfit_batches: 0.0
  track_grad_norm: -1
  check_val_every_n_epoch: 1 # Don't validate before extracting features.
  fast_dev_run: false
  accumulate_grad_batches: 1
  max_epochs: 1
  min_epochs: null
  max_steps: -1
  min_steps: null
  max_time: null
  limit_train_batches: 1.0
  limit_val_batches: 1.0
  limit_test_batches: 1.0
  limit_predict_batches: 1.0
  val_check_interval: 1.0 # Don't validate before extracting features.
  log_every_n_steps: 50
  accelerator: auto # <"cpu", "gpu", "tpu", "ipu", "hpu", "auto">
  strategy: null
  sync_batchnorm: false
  precision: 32
  enable_model_summary: true
  num_sanity_val_steps: 0
  profiler: null
  benchmark: false
  deterministic: false
  reload_dataloaders_every_n_epochs: 0
  auto_lr_find: false
  replace_sampler_ddp: true
  detect_anomaly: false
  auto_scale_batch_size: false
  plugins: null
  move_metrics_to_cpu: false
  multiple_trainloader_mode: max_size_cycle

Logs

onnxruntime/onnxruntime/core/framework/bfc_arena.cc:342 void* onnxruntime::BFCArena::AllocateRawInternal(size_t, bool) Failed to allocate memory for requested buffer of size 288601669632.

Code of Conduct

  • I agree to follow this project’s Code of Conduct

About this issue

  • State: closed
  • Created a year ago
  • Comments: 17 (6 by maintainers)

Most upvoted comments

@hgaiser thanks for this. This seems to be the problem, and from what I see here, I think cdist might even work, but the unsqueeze on dimension 1 would need to be changed and a squeeze added on the memory bank side: basically what you did with squeezing, but using cdist. I’m not sure if this would really work, as I’ll need to test it and I’m not that familiar with the PatchCore code and cdist, but I’ll test it and report back. It’d also be great to hear your findings with your function.
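
For reference, a small self-contained shape experiment along those lines, with random tensors standing in for the patch features and the selected memory bank entries (the sizes are assumptions, and the batch dimension is 1 as it would be for a single image):

import torch

features = torch.randn(1, 1536)       # stands in for max_patches_features
neighbors = torch.randn(1, 9, 1536)   # stands in for memory_bank[support_samples]

# As currently written: unsqueeze the features and use a batched cdist -> shape (1, 1, 9)
print(torch.cdist(features.unsqueeze(1), neighbors, p=2.0).shape)

# The squeezing idea with cdist: squeeze the memory bank side instead -> shape (1, 9)
print(torch.cdist(features, neighbors.squeeze(0), p=2.0).shape)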

I thought the same and tried that, but it still tried to allocate the wrong amount of memory. My model with the patch applied seems to work the same as the PyTorch version, but I need to fix something in the training itself. The patch seems to resolve my issue, so my investigation stops here for now, since I have limited time to work on this. I hope someone can pick it up and make a proper fix for it.

I’m running into the same issue, trying to debug what is happening. I noticed that my memory bank is shaped [48000, 1536] (images are shaped [400, 400]) and that the amount of memory it tries to allocate is 737280000000 bytes. It probably isn’t a coincidence that 48000 * 1536 = 73728000. I would expect it to allocate 48000 * 1536 * 4 bytes though, as the memory bank has dtype float32, not 48000 * 1536 * 10000 bytes. I’m trying to find out where this issue comes from … I’ll let you know if I find anything, but wanted to share my findings in the meantime.
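
For what it’s worth, restating those numbers as plain arithmetic (float32 memory bank assumed):

# Plain arithmetic for the numbers above (float32 memory bank assumed).
memory_bank_elems = 48_000 * 1536            # 73,728,000 elements
print(memory_bank_elems * 4)                 # 294,912,000 bytes (~295 MB): the expected memory bank size
print(737_280_000_000 // memory_bank_elems)  # 10000: the unexplained extra factor in the failed allocation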

I tried exporting again last week and ran into the same issue. I believe I fixed it by only using the custom cdist function in compute_anomaly_score and not in nearest_neighbor.

However, if the above solution also works, it’s probably a safer bet 😃.
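
A hedged sketch of what that partial change might look like, relative to the diff posted further down in this thread (only the compute_anomaly_score hunk is applied, while nearest_neighbors keeps torch.cdist):

-        distances = torch.cdist(max_patches_features.unsqueeze(1), self.memory_bank[support_samples], p=2.0)
+        distances = my_cdist(max_patches_features, self.memory_bank[support_samples].squeeze())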

Hello @hgaiser, thank you very much for your answer. @jasonvanzelm has just provided a new solution idea, which you could also try. I am testing it now and hope the modification will work on the first try.

Anomalib provides numerous methods, and PatchCore seems to perform well among all the models that only need to be trained once. But when I checked the related material together with the original paper, I found that PaDiM using Wide ResNet-50 also performs well. Strangely, the model cannot be trained correctly after I set the backbone of PaDiM to Wide ResNet-50; you can refer to #1045 for more details.

I wonder if you can provide some ideas on this problem. Thank you very much!

Good to hear that it works. We’ll see what to do from here on and try to implement this fix and test it. Thanks for all the input 😃

I’ve narrowed it down quite a bit to this line:

https://github.com/openvinotoolkit/anomalib/blob/main/src/anomalib/models/patchcore/torch_model.py#L191

It seems that in the ONNX representation (and apparently also OpenVINO?), the input for cdist is shaped [1, 48000, 1536] (called %onnx::Sub_770 in ONNX), whereas it is shaped [48000, 1536] according to pytorch. The other input (%/Reshape_output_0) is shaped [2500, 1536], but seems to get unsqueezed to [2500, 1, 1536] (presumably to match the other input). The calculated output shape of cdist is then [2500, 48000, 1536], which is way too large.
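
A quick back-of-the-envelope check of those shapes (float32 assumed) shows where the allocation size comes from:

# The intermediate of the decomposed cdist ends up shaped (2500, 48000, 1536).
elements = 2500 * 48_000 * 1536   # 184,320,000,000 elements
print(elements * 4)               # 737,280,000,000 bytes (~737 GB), matching the allocation reported earlier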

I believe the Sub_770 tensor should be squeezed so that it doesn’t have three dimensions … but I’m not sure where this happens. I’m not entirely sure of the details, but at the moment I have this diff (thanks to https://github.com/openvinotoolkit/anomalib/issues/440#issuecomment-1191184221):

diff --git a/src/anomalib/models/patchcore/torch_model.py b/src/anomalib/models/patchcore/torch_model.py
index 7f4f11f5..00185d43 100644
--- a/src/anomalib/models/patchcore/torch_model.py
+++ b/src/anomalib/models/patchcore/torch_model.py
@@ -18,6 +18,14 @@ from anomalib.models.patchcore.anomaly_map import AnomalyMapGenerator
 from anomalib.pre_processing import Tiler


+def my_cdist(x1, x2):
+    x1_norm = x1.pow(2).sum(dim=-1, keepdim=True)
+    x2_norm = x2.pow(2).sum(dim=-1, keepdim=True)
+    res = torch.addmm(x2_norm.transpose(-2, -1), x1, x2.transpose(-2, -1), alpha=-2).add_(x1_norm)
+    res = res.clamp_min_(1e-30).sqrt_()
+    return res
+
+
 class PatchcoreModel(DynamicBufferModule, nn.Module):
     """Patchcore Module."""

@@ -153,7 +161,7 @@ class PatchcoreModel(DynamicBufferModule, nn.Module):
             Tensor: Patch scores.
             Tensor: Locations of the nearest neighbor(s).
         """
-        distances = torch.cdist(embedding, self.memory_bank, p=2.0)  # euclidean norm
+        distances = my_cdist(embedding, self.memory_bank)  # euclidean norm
         if n_neighbors == 1:
             # when n_neighbors is 1, speed up computation by using min instead of topk
             patch_scores, locations = distances.min(1)
@@ -188,7 +196,7 @@ class PatchcoreModel(DynamicBufferModule, nn.Module):
         # indices of N_b(m^*) in the paper
         _, support_samples = self.nearest_neighbors(nn_sample, n_neighbors=self.num_neighbors)
         # 4. Find the distance of the patch features to each of the support samples
-        distances = torch.cdist(max_patches_features.unsqueeze(1), self.memory_bank[support_samples], p=2.0)
+        distances = my_cdist(max_patches_features, self.memory_bank[support_samples].squeeze())
         # 5. Apply softmax to find the weights
         weights = (1 - F.softmax(distances.squeeze(1), 1))[..., 0]
         # 6. Apply the weight factor to the score

I haven’t yet checked if the output of the ONNX model is correct, but at the very least it runs. I will check the output tomorrow.
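
For a quick numerical sanity check of the custom distance function on its own (separate from the ONNX export), it can be compared against torch.cdist on small random 2D tensors, e.g.:

import torch

def my_cdist(x1, x2):
    # Same helper as in the diff above: sqrt(||a||^2 + ||b||^2 - 2 * a.b)
    x1_norm = x1.pow(2).sum(dim=-1, keepdim=True)
    x2_norm = x2.pow(2).sum(dim=-1, keepdim=True)
    res = torch.addmm(x2_norm.transpose(-2, -1), x1, x2.transpose(-2, -1), alpha=-2).add_(x1_norm)
    return res.clamp_min_(1e-30).sqrt_()

x1 = torch.randn(25, 64)
x2 = torch.randn(480, 64)
# Expected: True (up to float32 rounding)
print(torch.allclose(my_cdist(x1, x2), torch.cdist(x1, x2, p=2.0), atol=1e-3))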

I’ve encountered the same issue.

My current workaround is either to use the .ckpt file with the PyTorch Lightning interpreter, or to extract the memory_bank as a numpy array, define a custom PatchCoreModel wrapper, and manually load and store the memory_bank as a tensor (a rough sketch of the latter is below).
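
Sketch of the extraction part of that workaround (the checkpoint path and the "model.memory_bank" key in the Lightning state_dict are assumptions and may differ between anomalib versions):

# Rough sketch of extracting the memory bank from a Lightning checkpoint;
# the path and the state_dict key are assumptions.
import numpy as np
import torch

ckpt = torch.load("path/to/model.ckpt", map_location="cpu")
memory_bank = ckpt["state_dict"]["model.memory_bank"]
np.save("memory_bank.npy", memory_bank.numpy())

# A custom PatchcoreModel wrapper can then restore it, e.g.:
# model.memory_bank = torch.from_numpy(np.load("memory_bank.npy"))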

I was only able to run the ONNX file on a server with 64 GB RAM when using an input size of 224 and a sampling rate of 0.01, so that I get ~6.7k entries (shape of 6700x1536) for the memory_bank.

Using different Docker images or interpreters (OpenVINO or ONNX Runtime) made no difference. I tried these with a 3090: https://github.com/microsoft/onnxruntime/blob/main/dockerfiles/Dockerfile.cuda and nvidia/cuda:11.6.2-cudnn8-devel-ubuntu20.04.

Hello, I’m not sure what the problem is, but I can reproduce this. It definitely isn’t normal that the model requests 288 GB of memory, but I’m not entirely sure whether that node is the only problem.