anomalib: Can't run on Mac M1 - Cannot convert a MPS Tensor to float64
Describe the bug
I get a "Cannot convert a MPS Tensor to float64" error when running the train.py script on a Mac M1.
It seems the Mac GPU backend (MPS) can't handle 64-bit tensors. I am unsure where to cast the tensor, or how to do it properly, but from what I can tell the data is moved to the device in `lightning_fabric/utilities/apply_func.py`. I tried changing the transfer to `data_output = data.type(torch.float32).to(device, **kwargs)` (around line 95), but this does not work. Looking forward to any help 😃
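For illustration, the idea I was trying looks roughly like this (a minimal sketch with a hypothetical helper name, not the actual lightning_fabric code): downcast only float64 tensors before the transfer, since MPS supports float32 but not float64.

```python
import torch

def batch_to_mps_safe(data: torch.Tensor, device: torch.device) -> torch.Tensor:
    """Hypothetical helper: move a tensor to a device, downcasting float64
    first when the target is MPS, which has no float64 support."""
    if device.type == "mps" and data.dtype == torch.float64:
        data = data.to(torch.float32)
    return data.to(device)
```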
Regards JI
Dataset
MVTec
Model
PADiM
Steps to reproduce the behavior
On a Mac M1 / Apple Silicon:
- Install as defined in the how-to
- Load the dataset and put it in the correct folder
- Run tools/train.py (the exact command is shown below)
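The exact command, as captured in the log below:

```bash
python tools/train.py --config src/anomalib/models/padim/custom_config.yaml
```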
OS information
- OS: macOS Ventura 13.5
- Python version: 3.10.13
- Anomalib version: 1.0dev
- PyTorch version: 2.1.1
- GPU models and configuration: MPS
Expected behavior
A working training run to get started with this cool lib.
Screenshots
No response
Pip/GitHub
GitHub
What version/branch did you use?
1.0dev
Configuration YAML
dataset:
name: mvtec
format: mvtec
path: ./datasets/MVTec
category: bottle
task: segmentation
train_batch_size: 32
eval_batch_size: 32
num_workers: 8
image_size: 256 # dimensions to which images are resized (mandatory)
center_crop: null # dimensions to which images are center-cropped after resizing (optional)
normalization: imagenet # data distribution to which the images will be normalized: [none, imagenet]
transform_config:
train: null
eval: null
test_split_mode: from_dir # options: [from_dir, synthetic]
test_split_ratio: 0.2 # fraction of train images held out for testing (usage depends on test_split_mode)
val_split_mode: same_as_test # options: [same_as_test, from_test, synthetic]
val_split_ratio: 0.5 # fraction of train/test images held out for validation (usage depends on val_split_mode)
tiling:
apply: false
tile_size: null
stride: null
remove_border_count: 0
use_random_tiling: False
random_tile_count: 16
model:
name: padim
backbone: resnet18
pre_trained: true
layers:
- layer1
- layer2
- layer3
normalization_method: min_max # options: [none, min_max, cdf]
metrics:
image:
- F1Score
- AUROC
pixel:
- F1Score
- AUROC
threshold:
method: adaptive #options: [adaptive, manual]
manual_image: null
manual_pixel: null
visualization:
show_images: True # show images on the screen
save_images: True # save images to the file system
log_images: True # log images to the available loggers (if any)
image_save_path: null # path to which images will be saved
mode: full # options: ["full", "simple"]
project:
seed: 42
path: ./results
logging:
logger: [] # options: [comet, tensorboard, wandb, csv] or combinations.
log_graph: false # Logs the model graph to respective logger.
optimization:
export_mode: null # options: torch, onnx, openvino
# PL Trainer Args. Don't add extra parameter here.
trainer:
enable_checkpointing: true
default_root_dir: null
gradient_clip_val: 0
gradient_clip_algorithm: norm
num_nodes: 1
devices: 1
enable_progress_bar: true
overfit_batches: 0.0
track_grad_norm: -1
check_val_every_n_epoch: 1 # Don't validate before extracting features.
fast_dev_run: false
accumulate_grad_batches: 1
max_epochs: 1
min_epochs: null
max_steps: -1
min_steps: null
max_time: null
limit_train_batches: 1.0
limit_val_batches: 1.0
limit_test_batches: 1.0
limit_predict_batches: 1.0
val_check_interval: 1.0 # Don't validate before extracting features.
log_every_n_steps: 50
accelerator: auto # <"cpu", "gpu", "tpu", "ipu", "hpu", "auto">
strategy: null
sync_batchnorm: false
precision: 32
enable_model_summary: true
num_sanity_val_steps: 0
profiler: null
benchmark: false
deterministic: false
reload_dataloaders_every_n_epochs: 0
auto_lr_find: false
replace_sampler_ddp: true
detect_anomaly: false
auto_scale_batch_size: false
plugins: null
move_metrics_to_cpu: false
multiple_trainloader_mode: max_size_cycle
Logs
python tools/train.py --config src/anomalib/models/padim/custom_config.yaml
/Users/justiniszatt/Desktop/Programming/python/anomalib_env/anomalib/src/anomalib/config/config.py:280: UserWarning: config.project.unique_dir is set to False. This does not ensure that your results will be written in an empty directory and you may overwrite files.
warn(
Global seed set to 42
2023-12-08 17:43:18,916 - anomalib.data - INFO - Loading the datamodule
2023-12-08 17:43:18,917 - anomalib.data.utils.transform - INFO - No config file has been provided. Using default transforms.
2023-12-08 17:43:18,917 - anomalib.data.utils.transform - INFO - No config file has been provided. Using default transforms.
2023-12-08 17:43:18,917 - anomalib.models - INFO - Loading the model.
2023-12-08 17:43:18,917 - anomalib.models.components.base.anomaly_module - INFO - Initializing PadimLightning model.
/Users/justiniszatt/Desktop/Programming/python/anomalib_env/lib/python3.10/site-packages/torchmetrics/utilities/prints.py:36: UserWarning: Metric `PrecisionRecallCurve` will save all targets and predictions in buffer. For large datasets this may lead to large memory footprint.
warnings.warn(*args, **kwargs)
2023-12-08 17:43:18,920 - anomalib.models.components.feature_extractors.timm - WARNING - FeatureExtractor is deprecated. Use TimmFeatureExtractor instead. Both FeatureExtractor and TimmFeatureExtractor will be removed in a future release.
2023-12-08 17:43:19,209 - timm.models.helpers - INFO - Loading pretrained weights from url (https://download.pytorch.org/models/resnet18-5c106cde.pth)
2023-12-08 17:43:19,294 - anomalib.utils.loggers - INFO - Loading the experiment logger(s)
2023-12-08 17:43:19,294 - anomalib.utils.callbacks - INFO - Loading the callbacks
/Users/justiniszatt/Desktop/Programming/python/anomalib_env/anomalib/src/anomalib/utils/callbacks/__init__.py:153: UserWarning: Export option: None not found. Defaulting to no model export
warnings.warn(f"Export option: {config.optimization.export_mode} not found. Defaulting to no model export")
2023-12-08 17:43:19,324 - pytorch_lightning.utilities.rank_zero - INFO - GPU available: True (mps), used: True
2023-12-08 17:43:19,324 - pytorch_lightning.utilities.rank_zero - INFO - TPU available: False, using: 0 TPU cores
2023-12-08 17:43:19,324 - pytorch_lightning.utilities.rank_zero - INFO - IPU available: False, using: 0 IPUs
2023-12-08 17:43:19,324 - pytorch_lightning.utilities.rank_zero - INFO - HPU available: False, using: 0 HPUs
2023-12-08 17:43:19,324 - pytorch_lightning.utilities.rank_zero - INFO - `Trainer(limit_train_batches=1.0)` was configured so 100% of the batches per epoch will be used..
2023-12-08 17:43:19,324 - pytorch_lightning.utilities.rank_zero - INFO - `Trainer(limit_val_batches=1.0)` was configured so 100% of the batches will be used..
2023-12-08 17:43:19,324 - pytorch_lightning.utilities.rank_zero - INFO - `Trainer(limit_test_batches=1.0)` was configured so 100% of the batches will be used..
2023-12-08 17:43:19,324 - pytorch_lightning.utilities.rank_zero - INFO - `Trainer(limit_predict_batches=1.0)` was configured so 100% of the batches will be used..
2023-12-08 17:43:19,324 - pytorch_lightning.utilities.rank_zero - INFO - `Trainer(val_check_interval=1.0)` was configured so validation will run at the end of the training epoch..
2023-12-08 17:43:19,324 - anomalib - INFO - Training the model.
2023-12-08 17:43:19,327 - anomalib.data.mvtec - INFO - Found the dataset.
/Users/justiniszatt/Desktop/Programming/python/anomalib_env/lib/python3.10/site-packages/torchmetrics/utilities/prints.py:36: UserWarning: Metric `ROC` will save all targets and predictions in buffer. For large datasets this may lead to large memory footprint.
warnings.warn(*args, **kwargs)
/Users/justiniszatt/Desktop/Programming/python/anomalib_env/lib/python3.10/site-packages/pytorch_lightning/callbacks/model_checkpoint.py:613: UserWarning: Checkpoint directory results/padim/mvtec/bottle/run/weights/lightning exists and is not empty.
rank_zero_warn(f"Checkpoint directory {dirpath} exists and is not empty.")
/Users/justiniszatt/Desktop/Programming/python/anomalib_env/lib/python3.10/site-packages/pytorch_lightning/core/optimizer.py:183: UserWarning: `LightningModule.configure_optimizers` returned `None`, this fit will run with no optimizer
rank_zero_warn(
2023-12-08 17:43:19,428 - pytorch_lightning.callbacks.model_summary - INFO -
| Name | Type | Params
-------------------------------------------------------------------
0 | image_threshold | AnomalyScoreThreshold | 0
1 | pixel_threshold | AnomalyScoreThreshold | 0
2 | model | PadimModel | 2.8 M
3 | image_metrics | AnomalibMetricCollection | 0
4 | pixel_metrics | AnomalibMetricCollection | 0
5 | normalization_metrics | MinMax | 0
-------------------------------------------------------------------
2.8 M Trainable params
0 Non-trainable params
2.8 M Total params
11.131 Total estimated model params size (MB)
Epoch 0: 0%| | 0/10 [00:00<?, ?it/s]Traceback (most recent call last):
File "/Users/justiniszatt/Desktop/Programming/python/anomalib_env/anomalib/tools/train.py", line 79, in <module>
train(args)
File "/Users/justiniszatt/Desktop/Programming/python/anomalib_env/anomalib/tools/train.py", line 64, in train
trainer.fit(model=model, datamodule=datamodule)
File "/Users/justiniszatt/Desktop/Programming/python/anomalib_env/lib/python3.10/site-packages/pytorch_lightning/trainer/trainer.py", line 608, in fit
call._call_and_handle_interrupt(
File "/Users/justiniszatt/Desktop/Programming/python/anomalib_env/lib/python3.10/site-packages/pytorch_lightning/trainer/call.py", line 38, in _call_and_handle_interrupt
return trainer_fn(*args, **kwargs)
File "/Users/justiniszatt/Desktop/Programming/python/anomalib_env/lib/python3.10/site-packages/pytorch_lightning/trainer/trainer.py", line 650, in _fit_impl
self._run(model, ckpt_path=self.ckpt_path)
File "/Users/justiniszatt/Desktop/Programming/python/anomalib_env/lib/python3.10/site-packages/pytorch_lightning/trainer/trainer.py", line 1112, in _run
results = self._run_stage()
File "/Users/justiniszatt/Desktop/Programming/python/anomalib_env/lib/python3.10/site-packages/pytorch_lightning/trainer/trainer.py", line 1191, in _run_stage
self._run_train()
File "/Users/justiniszatt/Desktop/Programming/python/anomalib_env/lib/python3.10/site-packages/pytorch_lightning/trainer/trainer.py", line 1214, in _run_train
self.fit_loop.run()
File "/Users/justiniszatt/Desktop/Programming/python/anomalib_env/lib/python3.10/site-packages/pytorch_lightning/loops/loop.py", line 199, in run
self.advance(*args, **kwargs)
File "/Users/justiniszatt/Desktop/Programming/python/anomalib_env/lib/python3.10/site-packages/pytorch_lightning/loops/fit_loop.py", line 267, in advance
self._outputs = self.epoch_loop.run(self._data_fetcher)
File "/Users/justiniszatt/Desktop/Programming/python/anomalib_env/lib/python3.10/site-packages/pytorch_lightning/loops/loop.py", line 199, in run
self.advance(*args, **kwargs)
File "/Users/justiniszatt/Desktop/Programming/python/anomalib_env/lib/python3.10/site-packages/pytorch_lightning/loops/epoch/training_epoch_loop.py", line 187, in advance
batch = next(data_fetcher)
File "/Users/justiniszatt/Desktop/Programming/python/anomalib_env/lib/python3.10/site-packages/pytorch_lightning/utilities/fetching.py", line 184, in __next__
return self.fetching_function()
File "/Users/justiniszatt/Desktop/Programming/python/anomalib_env/lib/python3.10/site-packages/pytorch_lightning/utilities/fetching.py", line 275, in fetching_function
return self.move_to_device(batch)
File "/Users/justiniszatt/Desktop/Programming/python/anomalib_env/lib/python3.10/site-packages/pytorch_lightning/utilities/fetching.py", line 294, in move_to_device
batch = self.batch_to_device(batch)
File "/Users/justiniszatt/Desktop/Programming/python/anomalib_env/lib/python3.10/site-packages/pytorch_lightning/loops/fit_loop.py", line 261, in batch_to_device
batch = self.trainer._call_strategy_hook("batch_to_device", batch, dataloader_idx=0)
File "/Users/justiniszatt/Desktop/Programming/python/anomalib_env/lib/python3.10/site-packages/pytorch_lightning/trainer/trainer.py", line 1494, in _call_strategy_hook
output = fn(*args, **kwargs)
File "/Users/justiniszatt/Desktop/Programming/python/anomalib_env/lib/python3.10/site-packages/pytorch_lightning/strategies/strategy.py", line 273, in batch_to_device
return model._apply_batch_transfer_handler(batch, device=device, dataloader_idx=dataloader_idx)
File "/Users/justiniszatt/Desktop/Programming/python/anomalib_env/lib/python3.10/site-packages/pytorch_lightning/core/module.py", line 342, in _apply_batch_transfer_handler
batch = self._call_batch_hook("transfer_batch_to_device", batch, device, dataloader_idx)
File "/Users/justiniszatt/Desktop/Programming/python/anomalib_env/lib/python3.10/site-packages/pytorch_lightning/core/module.py", line 330, in _call_batch_hook
return trainer_method(hook_name, *args)
File "/Users/justiniszatt/Desktop/Programming/python/anomalib_env/lib/python3.10/site-packages/pytorch_lightning/trainer/trainer.py", line 1356, in _call_lightning_module_hook
output = fn(*args, **kwargs)
File "/Users/justiniszatt/Desktop/Programming/python/anomalib_env/lib/python3.10/site-packages/pytorch_lightning/core/hooks.py", line 632, in transfer_batch_to_device
return move_data_to_device(batch, device)
File "/Users/justiniszatt/Desktop/Programming/python/anomalib_env/lib/python3.10/site-packages/lightning_fabric/utilities/apply_func.py", line 102, in move_data_to_device
return apply_to_collection(batch, dtype=_TransferableDataType, function=batch_to)
File "/Users/justiniszatt/Desktop/Programming/python/anomalib_env/lib/python3.10/site-packages/lightning_utilities/core/apply_func.py", line 72, in apply_to_collection
return _apply_to_collection_slow(
File "/Users/justiniszatt/Desktop/Programming/python/anomalib_env/lib/python3.10/site-packages/lightning_utilities/core/apply_func.py", line 104, in _apply_to_collection_slow
v = _apply_to_collection_slow(
File "/Users/justiniszatt/Desktop/Programming/python/anomalib_env/lib/python3.10/site-packages/lightning_utilities/core/apply_func.py", line 96, in _apply_to_collection_slow
return function(data, *args, **kwargs)
File "/Users/justiniszatt/Desktop/Programming/python/anomalib_env/lib/python3.10/site-packages/lightning_fabric/utilities/apply_func.py", line 95, in batch_to
data_output = data.to(device, **kwargs)
TypeError: Cannot convert a MPS Tensor to float64 dtype as the MPS framework doesn't support float64. Please use float32 instead.
Epoch 0: 0%| | 0/10 [00:17<?, ?it/s]
Code of Conduct
- I agree to follow this project’s Code of Conduct
You could try changing the `accelerator` inside the trainer section of the config from `auto` to `cpu`; maybe that solves the issue.

If you want to solve the visualisation issue, you should add `matplotlib.use('Agg')` at the top of the visualizer.py file. There is some issue with macOS, which is why you need to specify this matplotlib backend. It worked for me; I hope it will work for you too.
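For the config change, the relevant bit of the trainer section would be (a sketch matching the YAML above):

```yaml
trainer:
  accelerator: cpu  # was "auto", which selects MPS on Apple Silicon
```

And the visualisation workaround, placed at the very top of visualizer.py before pyplot is imported:

```python
import matplotlib

matplotlib.use("Agg")  # non-interactive backend; sidesteps the macOS GUI backend issue
```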
@jahad9819jjj, yeah you are right, the problem was not fully resolved. I've also spotted this and created #1644. Once it is merged, it should hopefully be ok 😃
I had the same visualization error when I tried to log image results. I tried running the same code in a Linux environment, and everything works just fine.
If you change the config file as follows, you should be able to log some image results.
I am using macOS 12.3.1 with miniconda env
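The exact snippet from that comment isn't preserved here; assuming it targets the macOS display crash, a plausible reconstruction is to turn off on-screen display while keeping file and logger output:

```yaml
visualization:
  show_images: False  # assumption: don't open a GUI window on macOS
  save_images: True   # keep writing result images to disk
  log_images: True    # keep logging images to the configured loggers
```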
Hello blaz-r,
First of all, thanks for the tip! That was exactly the hint I needed 😉
While it made the model run for a bit, I got a follow-up crash with a reshape problem:

```
return visualization.generate()
  File "/Users/justiniszatt/Desktop/Programming/python/anomalib_env/anomalib/src/anomalib/post_processing/visualizer.py", line 287, in generate
    img = img.reshape(self.figure.canvas.get_width_height()[::-1] + (3,))
ValueError: cannot reshape array of size 15000000 into shape (500,2500,3)
```

I already played around with the visualisation part of the config but had no luck so far, so any tips here would also be appreciated.

Regarding MPS, it still might be interesting to investigate further, as the processing times on CPU are suboptimal. I tried to convert some of the tensors to float32 but had no luck at all.
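In case it helps anyone digging into where the float64 comes from, a quick diagnostic sketch (assuming the standard anomalib MVTec datamodule API in this 1.0dev version) is to print the dtype of every tensor in a training batch:

```python
import torch
from anomalib.data import MVTec  # assumed import path for this anomalib version

datamodule = MVTec(root="./datasets/MVTec", category="bottle")
datamodule.setup()

batch = next(iter(datamodule.train_dataloader()))
for key, value in batch.items():
    if isinstance(value, torch.Tensor):
        # any torch.float64 entry here is what breaks the transfer to MPS
        print(key, value.dtype)
```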