autogluon: [BUG] RuntimeError: "nll_loss_forward_reduce_cuda_kernel_2d_index" not implemented for 'Int'
- I have checked that this bug exists on the latest stable version of AutoGluon
- and/or I have checked that this bug exists on the latest mainline of AutoGluon via source installation
Describe the bug: The error is raised when calling the .fit() method. Traceback:
╭─────────────────────────────── Traceback (most recent call last) ────────────────────────────────╮
│ in <module> │
│ │
│ 2 │ label="label", │
│ 3 # path="/kaggle/working/AutogluonModels/ag-20221214_131455" │
│ 4 ) │
│ ❱ 5 predictor.fit( │
│ 6 │ train_data=train_data, │
│ 7 │ time_limit=60*60*12, # seconds, │
│ 8 ) # you can trust the default config, e.g., we use a `swin_base_patch4_window7_224` mode │
│ │
│ C:\Users\M Kharisma │
│ Azhari\AppData\Roaming\Python\Python39\site-packages\autogluon\multimodal\predictor.py:839 in │
│ fit │
│ │
│ 836 │ │ │ ) │
│ 837 │ │ │ return predictor │
│ 838 │ │ │
│ ❱ 839 │ │ self._fit(**_fit_args) │
│ 840 │ │ training_end = time.time() │
│ 841 │ │ self._total_train_time = training_end - training_start │
│ 842 │ │ logger.info(f"Models and intermediate outputs are saved to {self._save_path} ") │
│ │
│ C:\Users\M Kharisma │
│ Azhari\AppData\Roaming\Python\Python39\site-packages\autogluon\multimodal\predictor.py:1386 in │
│ _fit │
│ │
│ 1383 │ │ │ │ ".* in the `DataLoader` init to improve performance.*", │
│ 1384 │ │ │ ) │
│ 1385 │ │ │ warnings.filterwarnings("ignore", "Checkpoint directory .* exists and is not │
│ ❱ 1386 │ │ │ trainer.fit( │
│ 1387 │ │ │ │ task, │
│ 1388 │ │ │ │ datamodule=train_dm, │
│ 1389 │ │ │ │ ckpt_path=ckpt_path if resume else None, # this is to resume training t │
│ │
│ C:\Users\M Kharisma │
│ Azhari\AppData\Roaming\Python\Python39\site-packages\pytorch_lightning\trainer\trainer.py:696 in │
│ fit │
│ │
│ 693 │ │ │ datamodule: An instance of :class:`~pytorch_lightning.core.datamodule.Lightn │
│ 694 │ │ """ │
│ 695 │ │ self.strategy.model = model │
│ ❱ 696 │ │ self._call_and_handle_interrupt( │
│ 697 │ │ │ self._fit_impl, model, train_dataloaders, val_dataloaders, datamodule, ckpt_ │
│ 698 │ │ ) │
│ 699 │
│ │
│ C:\Users\M Kharisma │
│ Azhari\AppData\Roaming\Python\Python39\site-packages\pytorch_lightning\trainer\trainer.py:650 in │
│ _call_and_handle_interrupt │
│ │
│ 647 │ │ │ if self.strategy.launcher is not None: │
│ 648 │ │ │ │ return self.strategy.launcher.launch(trainer_fn, *args, trainer=self, ** │
│ 649 │ │ │ else: │
│ ❱ 650 │ │ │ │ return trainer_fn(*args, **kwargs) │
│ 651 │ │ # TODO(awaelchli): Unify both exceptions below, where `KeyboardError` doesn't re │
│ 652 │ │ except KeyboardInterrupt as exception: │
│ 653 │ │ │ rank_zero_warn("Detected KeyboardInterrupt, attempting graceful shutdown..." │
│ │
│ C:\Users\M Kharisma │
│ Azhari\AppData\Roaming\Python\Python39\site-packages\pytorch_lightning\trainer\trainer.py:735 in │
│ _fit_impl │
│ │
│ 732 │ │ self._ckpt_path = self.__set_ckpt_path( │
│ 733 │ │ │ ckpt_path, model_provided=True, model_connected=self.lightning_module is not │
│ 734 │ │ ) │
│ ❱ 735 │ │ results = self._run(model, ckpt_path=self.ckpt_path) │
│ 736 │ │ │
│ 737 │ │ assert self.state.stopped │
│ 738 │ │ self.training = False │
│ │
│ C:\Users\M Kharisma │
│ Azhari\AppData\Roaming\Python\Python39\site-packages\pytorch_lightning\trainer\trainer.py:1166 │
│ in _run │
│ │
│ 1163 │ │ │
│ 1164 │ │ self._checkpoint_connector.resume_end() │
│ 1165 │ │ │
│ ❱ 1166 │ │ results = self._run_stage() │
│ 1167 │ │ │
│ 1168 │ │ log.detail(f"{self.__class__.__name__}: trainer tearing down") │
│ 1169 │ │ self._teardown() │
│ │
│ C:\Users\M Kharisma │
│ Azhari\AppData\Roaming\Python\Python39\site-packages\pytorch_lightning\trainer\trainer.py:1252 │
│ in _run_stage │
│ │
│ 1249 │ │ │ return self._run_evaluate() │
│ 1250 │ │ if self.predicting: │
│ 1251 │ │ │ return self._run_predict() │
│ ❱ 1252 │ │ return self._run_train() │
│ 1253 │ │
│ 1254 │ def _pre_training_routine(self): │
│ 1255 │ │ # wait for all to join if on distributed │
│ │
│ C:\Users\M Kharisma │
│ Azhari\AppData\Roaming\Python\Python39\site-packages\pytorch_lightning\trainer\trainer.py:1274 │
│ in _run_train │
│ │
│ 1271 │ │ self._pre_training_routine() │
│ 1272 │ │ │
│ 1273 │ │ with isolate_rng(): │
│ ❱ 1274 │ │ │ self._run_sanity_check() │
│ 1275 │ │ │
│ 1276 │ │ # enable train mode │
│ 1277 │ │ self.model.train() │
│ │
│ C:\Users\M Kharisma │
│ Azhari\AppData\Roaming\Python\Python39\site-packages\pytorch_lightning\trainer\trainer.py:1343 │
│ in _run_sanity_check │
│ │
│ 1340 │ │ │ │
│ 1341 │ │ │ # run eval step │
│ 1342 │ │ │ with torch.no_grad(): │
│ ❱ 1343 │ │ │ │ val_loop.run() │
│ 1344 │ │ │ │
│ 1345 │ │ │ self._call_callback_hooks("on_sanity_check_end") │
│ 1346 │
│ │
│ C:\Users\M Kharisma │
│ Azhari\AppData\Roaming\Python\Python39\site-packages\pytorch_lightning\loops\loop.py:200 in run │
│ │
│ 197 │ │ while not self.done: │
│ 198 │ │ │ try: │
│ 199 │ │ │ │ self.on_advance_start(*args, **kwargs) │
│ ❱ 200 │ │ │ │ self.advance(*args, **kwargs) │
│ 201 │ │ │ │ self.on_advance_end() │
│ 202 │ │ │ │ self._restarting = False │
│ 203 │ │ │ except StopIteration: │
│ │
│ C:\Users\M Kharisma │
│ Azhari\AppData\Roaming\Python\Python39\site-packages\pytorch_lightning\loops\dataloader\evaluati │
│ on_loop.py:155 in advance │
│ │
│ 152 │ │ kwargs = OrderedDict() │
│ 153 │ │ if self.num_dataloaders > 1: │
│ 154 │ │ │ kwargs["dataloader_idx"] = dataloader_idx │
│ ❱ 155 │ │ dl_outputs = self.epoch_loop.run(self._data_fetcher, dl_max_batches, kwargs) │
│ 156 │ │ │
│ 157 │ │ # store batch level output per dataloader │
│ 158 │ │ self._outputs.append(dl_outputs) │
│ │
│ C:\Users\M Kharisma │
│ Azhari\AppData\Roaming\Python\Python39\site-packages\pytorch_lightning\loops\loop.py:200 in run │
│ │
│ 197 │ │ while not self.done: │
│ 198 │ │ │ try: │
│ 199 │ │ │ │ self.on_advance_start(*args, **kwargs) │
│ ❱ 200 │ │ │ │ self.advance(*args, **kwargs) │
│ 201 │ │ │ │ self.on_advance_end() │
│ 202 │ │ │ │ self._restarting = False │
│ 203 │ │ │ except StopIteration: │
│ │
│ C:\Users\M Kharisma │
│ Azhari\AppData\Roaming\Python\Python39\site-packages\pytorch_lightning\loops\epoch\evaluation_ep │
│ och_loop.py:143 in advance │
│ │
│ 140 │ │ self.batch_progress.increment_started() │
│ 141 │ │ │
│ 142 │ │ # lightning module methods │
│ ❱ 143 │ │ output = self._evaluation_step(**kwargs) │
│ 144 │ │ output = self._evaluation_step_end(output) │
│ 145 │ │ │
│ 146 │ │ self.batch_progress.increment_processed() │
│ │
│ C:\Users\M Kharisma │
│ Azhari\AppData\Roaming\Python\Python39\site-packages\pytorch_lightning\loops\epoch\evaluation_ep │
│ och_loop.py:240 in _evaluation_step │
│ │
│ 237 │ │ │ the outputs of the step │
│ 238 │ │ """ │
│ 239 │ │ hook_name = "test_step" if self.trainer.testing else "validation_step" │
│ ❱ 240 │ │ output = self.trainer._call_strategy_hook(hook_name, *kwargs.values()) │
│ 241 │ │ │
│ 242 │ │ return output │
│ 243 │
│ │
│ C:\Users\M Kharisma │
│ Azhari\AppData\Roaming\Python\Python39\site-packages\pytorch_lightning\trainer\trainer.py:1704 │
│ in _call_strategy_hook │
│ │
│ 1701 │ │ │ return │
│ 1702 │ │ │
│ 1703 │ │ with self.profiler.profile(f"[Strategy]{self.strategy.__class__.__name__}.{hook_ │
│ ❱ 1704 │ │ │ output = fn(*args, **kwargs) │
│ 1705 │ │ │
│ 1706 │ │ # restore current_fx when nested context │
│ 1707 │ │ pl_module._current_fx_name = prev_fx_name │
│ │
│ C:\Users\M Kharisma │
│ Azhari\AppData\Roaming\Python\Python39\site-packages\pytorch_lightning\strategies\strategy.py:37 │
│ 0 in validation_step │
│ │
│ 367 │ │ """ │
│ 368 │ │ with self.precision_plugin.val_step_context(): │
│ 369 │ │ │ assert isinstance(self.model, ValidationStep) │
│ ❱ 370 │ │ │ return self.model.validation_step(*args, **kwargs) │
│ 371 │ │
│ 372 │ def test_step(self, *args: Any, **kwargs: Any) -> Optional[STEP_OUTPUT]: │
│ 373 │ │ """The actual test step. │
│ │
│ C:\Users\M Kharisma │
│ Azhari\AppData\Roaming\Python\Python39\site-packages\autogluon\multimodal\optimization\lit_modul │
│ e.py:252 in validation_step │
│ │
│ 249 │ │ batch_idx │
│ 250 │ │ │ Index of mini-batch. │
│ 251 │ │ """ │
│ ❱ 252 │ │ output, loss = self._shared_step(batch) │
│ 253 │ │ if self.model_postprocess_fn: │
│ 254 │ │ │ output = self.model_postprocess_fn(output) │
│ 255 │ │ # By default, on_step=False and on_epoch=True │
│ │
│ C:\Users\M Kharisma │
│ Azhari\AppData\Roaming\Python\Python39\site-packages\autogluon\multimodal\optimization\lit_modul │
│ e.py:210 in _shared_step │
│ │
│ 207 │ │ │ self.mixup_fn.mixup_enabled = self.training & (self.current_epoch < self.hpa │
│ 208 │ │ │ batch, label = multimodel_mixup(batch=batch, model=self.model, mixup_fn=self │
│ 209 │ │ output = run_model(self.model, batch) │
│ ❱ 210 │ │ loss = self._compute_loss(output=output, label=label) │
│ 211 │ │ return output, loss │
│ 212 │ │
│ 213 │ def training_step(self, batch, batch_idx): │
│ │
│ C:\Users\M Kharisma │
│ Azhari\AppData\Roaming\Python\Python39\site-packages\autogluon\multimodal\optimization\lit_modul │
│ e.py:178 in _compute_loss │
│ │
│ 175 │ │ │ │ loss += self._compute_template_loss(per_output, label) * weight │
│ 176 │ │ │ else: │
│ 177 │ │ │ │ loss += ( │
│ ❱ 178 │ │ │ │ │ self.loss_func( │
│ 179 │ │ │ │ │ │ input=per_output[LOGITS].squeeze(dim=1), │
│ 180 │ │ │ │ │ │ target=label, │
│ 181 │ │ │ │ │ ) │
│ │
│ c:\ProgramData\Anaconda3\lib\site-packages\torch\nn\modules\module.py:1130 in _call_impl │
│ │
│ 1127 │ │ # this function, and just call forward. │
│ 1128 │ │ if not (self._backward_hooks or self._forward_hooks or self._forward_pre_hooks o │
│ 1129 │ │ │ │ or _global_forward_hooks or _global_forward_pre_hooks): │
│ ❱ 1130 │ │ │ return forward_call(*input, **kwargs) │
│ 1131 │ │ # Do not call functions when jit is used │
│ 1132 │ │ full_backward_hooks, non_full_backward_hooks = [], [] │
│ 1133 │ │ if self._backward_hooks or _global_backward_hooks: │
│ │
│ c:\ProgramData\Anaconda3\lib\site-packages\torch\nn\modules\loss.py:1164 in forward │
│ │
│ 1161 │ │ self.label_smoothing = label_smoothing │
│ 1162 │ │
│ 1163 │ def forward(self, input: Tensor, target: Tensor) -> Tensor: │
│ ❱ 1164 │ │ return F.cross_entropy(input, target, weight=self.weight, │
│ 1165 │ │ │ │ │ │ │ ignore_index=self.ignore_index, reduction=self.reduction, │
│ 1166 │ │ │ │ │ │ │ label_smoothing=self.label_smoothing) │
│ 1167 │
│ │
│ c:\ProgramData\Anaconda3\lib\site-packages\torch\nn\functional.py:3014 in cross_entropy │
│ │
│ 3011 │ │ ) │
│ 3012 │ if size_average is not None or reduce is not None: │
│ 3013 │ │ reduction = _Reduction.legacy_get_string(size_average, reduce) │
│ ❱ 3014 │ return torch._C._nn.cross_entropy_loss(input, target, weight, _Reduction.get_enum(re │
│ 3015 │
│ 3016 │
│ 3017 def binary_cross_entropy( │
╰──────────────────────────────────────────────────────────────────────────────────────────────────╯
RuntimeError: "nll_loss_forward_reduce_cuda_kernel_2d_index" not implemented for 'Int'
Expected behavior: Able to fit the model normally.

from autogluon.multimodal import MultiModalPredictor

# train_data consists of image and label columns.
predictor = MultiModalPredictor(
    label="label",
)
predictor.fit(
    train_data=train_data,
    time_limit=60 * 60 * 12,  # seconds
)
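A workaround for this kind of failure, assuming train_data is a pandas DataFrame whose label column was created as 32-bit integers (the NumPy default integer on Windows, which matches the reporter's platform), is to cast the labels to int64 before calling fit(). This is a hedged suggestion for this setup, not an official AutoGluon fix:

# Hypothetical workaround: ensure the label column is int64 so the loss
# function ultimately receives a torch.long target instead of int32.
train_data["label"] = train_data["label"].astype("int64")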
Installed Versions
INSTALLED VERSIONS
------------------
date : 2023-01-29
time : 15:01:26.049318
python : 3.9.13.final.0
OS : Windows
OS-release : 10
Version : 10.0.22621
machine : AMD64
processor : Intel64 Family 6 Model 183 Stepping 1, GenuineIntel
num_cores : 32
cpu_ram_mb : 32508
cuda version : None
num_gpus : 1
gpu_ram_mb : [21314]
avail_disk_size_mb : None
accelerate : 0.13.2
albumentations : 1.1.0
autogluon.common : 0.6.2
autogluon.core : 0.6.2
autogluon.features : 0.6.2
autogluon.multimodal : 0.6.2
autogluon.tabular : 0.6.2
autogluon.text : 0.6.2
autogluon.timeseries : 0.6.2
autogluon.vision : 0.6.2
boto3 : 1.24.28
catboost : 1.1.1
dask : 2021.11.2
defusedxml : 0.7.1
distributed : 2021.11.2
evaluate : 0.3.0
fairscale : 0.4.6
fastai : 2.7.10
gluoncv : 0.11.0
gluonts : 0.11.8
hyperopt : 0.2.7
joblib : 1.1.0
jsonschema : 4.8.0
lightgbm : 3.3.5
matplotlib : 3.5.2
networkx : 2.8.4
nlpaug : 1.1.10
nltk : 3.7
nptyping : 1.4.4
numpy : 1.21.5
omegaconf : 2.1.2
openmim : None
pandas : 1.4.4
PIL : 9.4.0
psutil : 5.9.0
pytorch-metric-learning: None
pytorch_lightning : 1.7.7
ray : 2.0.1
requests : 2.28.1
scipy : 1.8.1
sentencepiece : 0.1.97
seqeval : None
setuptools : 63.4.1
skimage : 0.19.2
sklearn : 1.0.2
smart_open : 5.2.1
statsmodels : 0.13.2
text-unidecode : None
timm : 0.6.12
torch : 1.12.1+cu113
torchmetrics : 0.8.2
torchtext : 0.13.1
torchvision : 0.13.1+cu113
tqdm : 4.64.1
transformers : 4.23.1
xgboost : 1.7.3
About this issue
- Original URL
- State: closed
- Created a year ago
- Comments: 17 (1 by maintainers)
Yes, it does. Thank you too.