autogluon: [BUG] RuntimeError: "nll_loss_forward_reduce_cuda_kernel_2d_index" not implemented for 'Int'

I have checked that this bug exists on the latest stable version of AutoGluon
and/or I have checked that this bug exists on the latest mainline of AutoGluon via source installation

Describe the bug: Calling the .fit() method fails immediately with the RuntimeError below. Traceback:

╭─────────────────────────────── Traceback (most recent call last) ────────────────────────────────╮
│ in <module>                                                                                      │
│                                                                                                  │
│   2 │   label="label",                                                                           │
│   3 #     path="/kaggle/working/AutogluonModels/ag-20221214_131455"                              │
│   4 )                                                                                            │
│ ❱ 5 predictor.fit(                                                                               │
│   6 │   train_data=train_data,                                                                   │
│   7 │   time_limit=60*60*12, # seconds,                                                          │
│   8 ) # you can trust the default config, e.g., we use a `swin_base_patch4_window7_224` mode     │
│                                                                                                  │
│ C:\Users\M Kharisma                                                                              │
│ Azhari\AppData\Roaming\Python\Python39\site-packages\autogluon\multimodal\predictor.py:839 in    │
│ fit                                                                                              │
│                                                                                                  │
│    836 │   │   │   )                                                                             │
│    837 │   │   │   return predictor                                                              │
│    838 │   │                                                                                     │
│ ❱  839 │   │   self._fit(**_fit_args)                                                            │
│    840 │   │   training_end = time.time()                                                        │
│    841 │   │   self._total_train_time = training_end - training_start                            │
│    842 │   │   logger.info(f"Models and intermediate outputs are saved to {self._save_path} ")   │
│                                                                                                  │
│ C:\Users\M Kharisma                                                                              │
│ Azhari\AppData\Roaming\Python\Python39\site-packages\autogluon\multimodal\predictor.py:1386 in   │
│ _fit                                                                                             │
│                                                                                                  │
│   1383 │   │   │   │   ".* in the `DataLoader` init to improve performance.*",                   │
│   1384 │   │   │   )                                                                             │
│   1385 │   │   │   warnings.filterwarnings("ignore", "Checkpoint directory .* exists and is not  │
│ ❱ 1386 │   │   │   trainer.fit(                                                                  │
│   1387 │   │   │   │   task,                                                                     │
│   1388 │   │   │   │   datamodule=train_dm,                                                      │
│   1389 │   │   │   │   ckpt_path=ckpt_path if resume else None,  # this is to resume training t  │
│                                                                                                  │
│ C:\Users\M Kharisma                                                                              │
│ Azhari\AppData\Roaming\Python\Python39\site-packages\pytorch_lightning\trainer\trainer.py:696 in │
│ fit                                                                                              │
│                                                                                                  │
│    693 │   │   │   datamodule: An instance of :class:`~pytorch_lightning.core.datamodule.Lightn  │
│    694 │   │   """                                                                               │
│    695 │   │   self.strategy.model = model                                                       │
│ ❱  696 │   │   self._call_and_handle_interrupt(                                                  │
│    697 │   │   │   self._fit_impl, model, train_dataloaders, val_dataloaders, datamodule, ckpt_  │
│    698 │   │   )                                                                                 │
│    699                                                                                           │
│                                                                                                  │
│ C:\Users\M Kharisma                                                                              │
│ Azhari\AppData\Roaming\Python\Python39\site-packages\pytorch_lightning\trainer\trainer.py:650 in │
│ _call_and_handle_interrupt                                                                       │
│                                                                                                  │
│    647 │   │   │   if self.strategy.launcher is not None:                                        │
│    648 │   │   │   │   return self.strategy.launcher.launch(trainer_fn, *args, trainer=self, **  │
│    649 │   │   │   else:                                                                         │
│ ❱  650 │   │   │   │   return trainer_fn(*args, **kwargs)                                        │
│    651 │   │   # TODO(awaelchli): Unify both exceptions below, where `KeyboardError` doesn't re  │
│    652 │   │   except KeyboardInterrupt as exception:                                            │
│    653 │   │   │   rank_zero_warn("Detected KeyboardInterrupt, attempting graceful shutdown..."  │
│                                                                                                  │
│ C:\Users\M Kharisma                                                                              │
│ Azhari\AppData\Roaming\Python\Python39\site-packages\pytorch_lightning\trainer\trainer.py:735 in │
│ _fit_impl                                                                                        │
│                                                                                                  │
│    732 │   │   self._ckpt_path = self.__set_ckpt_path(                                           │
│    733 │   │   │   ckpt_path, model_provided=True, model_connected=self.lightning_module is not  │
│    734 │   │   )                                                                                 │
│ ❱  735 │   │   results = self._run(model, ckpt_path=self.ckpt_path)                              │
│    736 │   │                                                                                     │
│    737 │   │   assert self.state.stopped                                                         │
│    738 │   │   self.training = False                                                             │
│                                                                                                  │
│ C:\Users\M Kharisma                                                                              │
│ Azhari\AppData\Roaming\Python\Python39\site-packages\pytorch_lightning\trainer\trainer.py:1166   │
│ in _run                                                                                          │
│                                                                                                  │
│   1163 │   │                                                                                     │
│   1164 │   │   self._checkpoint_connector.resume_end()                                           │
│   1165 │   │                                                                                     │
│ ❱ 1166 │   │   results = self._run_stage()                                                       │
│   1167 │   │                                                                                     │
│   1168 │   │   log.detail(f"{self.__class__.__name__}: trainer tearing down")                    │
│   1169 │   │   self._teardown()                                                                  │
│                                                                                                  │
│ C:\Users\M Kharisma                                                                              │
│ Azhari\AppData\Roaming\Python\Python39\site-packages\pytorch_lightning\trainer\trainer.py:1252   │
│ in _run_stage                                                                                    │
│                                                                                                  │
│   1249 │   │   │   return self._run_evaluate()                                                   │
│   1250 │   │   if self.predicting:                                                               │
│   1251 │   │   │   return self._run_predict()                                                    │
│ ❱ 1252 │   │   return self._run_train()                                                          │
│   1253 │                                                                                         │
│   1254 │   def _pre_training_routine(self):                                                      │
│   1255 │   │   # wait for all to join if on distributed                                          │
│                                                                                                  │
│ C:\Users\M Kharisma                                                                              │
│ Azhari\AppData\Roaming\Python\Python39\site-packages\pytorch_lightning\trainer\trainer.py:1274   │
│ in _run_train                                                                                    │
│                                                                                                  │
│   1271 │   │   self._pre_training_routine()                                                      │
│   1272 │   │                                                                                     │
│   1273 │   │   with isolate_rng():                                                               │
│ ❱ 1274 │   │   │   self._run_sanity_check()                                                      │
│   1275 │   │                                                                                     │
│   1276 │   │   # enable train mode                                                               │
│   1277 │   │   self.model.train()                                                                │
│                                                                                                  │
│ C:\Users\M Kharisma                                                                              │
│ Azhari\AppData\Roaming\Python\Python39\site-packages\pytorch_lightning\trainer\trainer.py:1343   │
│ in _run_sanity_check                                                                             │
│                                                                                                  │
│   1340 │   │   │                                                                                 │
│   1341 │   │   │   # run eval step                                                               │
│   1342 │   │   │   with torch.no_grad():                                                         │
│ ❱ 1343 │   │   │   │   val_loop.run()                                                            │
│   1344 │   │   │                                                                                 │
│   1345 │   │   │   self._call_callback_hooks("on_sanity_check_end")                              │
│   1346                                                                                           │
│                                                                                                  │
│ C:\Users\M Kharisma                                                                              │
│ Azhari\AppData\Roaming\Python\Python39\site-packages\pytorch_lightning\loops\loop.py:200 in run  │
│                                                                                                  │
│   197 │   │   while not self.done:                                                               │
│   198 │   │   │   try:                                                                           │
│   199 │   │   │   │   self.on_advance_start(*args, **kwargs)                                     │
│ ❱ 200 │   │   │   │   self.advance(*args, **kwargs)                                              │
│   201 │   │   │   │   self.on_advance_end()                                                      │
│   202 │   │   │   │   self._restarting = False                                                   │
│   203 │   │   │   except StopIteration:                                                          │
│                                                                                                  │
│ C:\Users\M Kharisma                                                                              │
│ Azhari\AppData\Roaming\Python\Python39\site-packages\pytorch_lightning\loops\dataloader\evaluati │
│ on_loop.py:155 in advance                                                                        │
│                                                                                                  │
│   152 │   │   kwargs = OrderedDict()                                                             │
│   153 │   │   if self.num_dataloaders > 1:                                                       │
│   154 │   │   │   kwargs["dataloader_idx"] = dataloader_idx                                      │
│ ❱ 155 │   │   dl_outputs = self.epoch_loop.run(self._data_fetcher, dl_max_batches, kwargs)       │
│   156 │   │                                                                                      │
│   157 │   │   # store batch level output per dataloader                                          │
│   158 │   │   self._outputs.append(dl_outputs)                                                   │
│                                                                                                  │
│ C:\Users\M Kharisma                                                                              │
│ Azhari\AppData\Roaming\Python\Python39\site-packages\pytorch_lightning\loops\loop.py:200 in run  │
│                                                                                                  │
│   197 │   │   while not self.done:                                                               │
│   198 │   │   │   try:                                                                           │
│   199 │   │   │   │   self.on_advance_start(*args, **kwargs)                                     │
│ ❱ 200 │   │   │   │   self.advance(*args, **kwargs)                                              │
│   201 │   │   │   │   self.on_advance_end()                                                      │
│   202 │   │   │   │   self._restarting = False                                                   │
│   203 │   │   │   except StopIteration:                                                          │
│                                                                                                  │
│ C:\Users\M Kharisma                                                                              │
│ Azhari\AppData\Roaming\Python\Python39\site-packages\pytorch_lightning\loops\epoch\evaluation_ep │
│ och_loop.py:143 in advance                                                                       │
│                                                                                                  │
│   140 │   │   self.batch_progress.increment_started()                                            │
│   141 │   │                                                                                      │
│   142 │   │   # lightning module methods                                                         │
│ ❱ 143 │   │   output = self._evaluation_step(**kwargs)                                           │
│   144 │   │   output = self._evaluation_step_end(output)                                         │
│   145 │   │                                                                                      │
│   146 │   │   self.batch_progress.increment_processed()                                          │
│                                                                                                  │
│ C:\Users\M Kharisma                                                                              │
│ Azhari\AppData\Roaming\Python\Python39\site-packages\pytorch_lightning\loops\epoch\evaluation_ep │
│ och_loop.py:240 in _evaluation_step                                                              │
│                                                                                                  │
│   237 │   │   │   the outputs of the step                                                        │
│   238 │   │   """                                                                                │
│   239 │   │   hook_name = "test_step" if self.trainer.testing else "validation_step"             │
│ ❱ 240 │   │   output = self.trainer._call_strategy_hook(hook_name, *kwargs.values())             │
│   241 │   │                                                                                      │
│   242 │   │   return output                                                                      │
│   243                                                                                            │
│                                                                                                  │
│ C:\Users\M Kharisma                                                                              │
│ Azhari\AppData\Roaming\Python\Python39\site-packages\pytorch_lightning\trainer\trainer.py:1704   │
│ in _call_strategy_hook                                                                           │
│                                                                                                  │
│   1701 │   │   │   return                                                                        │
│   1702 │   │                                                                                     │
│   1703 │   │   with self.profiler.profile(f"[Strategy]{self.strategy.__class__.__name__}.{hook_  │
│ ❱ 1704 │   │   │   output = fn(*args, **kwargs)                                                  │
│   1705 │   │                                                                                     │
│   1706 │   │   # restore current_fx when nested context                                          │
│   1707 │   │   pl_module._current_fx_name = prev_fx_name                                         │
│                                                                                                  │
│ C:\Users\M Kharisma                                                                              │
│ Azhari\AppData\Roaming\Python\Python39\site-packages\pytorch_lightning\strategies\strategy.py:37 │
│ 0 in validation_step                                                                             │
│                                                                                                  │
│   367 │   │   """                                                                                │
│   368 │   │   with self.precision_plugin.val_step_context():                                     │
│   369 │   │   │   assert isinstance(self.model, ValidationStep)                                  │
│ ❱ 370 │   │   │   return self.model.validation_step(*args, **kwargs)                             │
│   371 │                                                                                          │
│   372 │   def test_step(self, *args: Any, **kwargs: Any) -> Optional[STEP_OUTPUT]:               │
│   373 │   │   """The actual test step.                                                           │
│                                                                                                  │
│ C:\Users\M Kharisma                                                                              │
│ Azhari\AppData\Roaming\Python\Python39\site-packages\autogluon\multimodal\optimization\lit_modul │
│ e.py:252 in validation_step                                                                      │
│                                                                                                  │
│   249 │   │   batch_idx                                                                          │
│   250 │   │   │   Index of mini-batch.                                                           │
│   251 │   │   """                                                                                │
│ ❱ 252 │   │   output, loss = self._shared_step(batch)                                            │
│   253 │   │   if self.model_postprocess_fn:                                                      │
│   254 │   │   │   output = self.model_postprocess_fn(output)                                     │
│   255 │   │   # By default, on_step=False and on_epoch=True                                      │
│                                                                                                  │
│ C:\Users\M Kharisma                                                                              │
│ Azhari\AppData\Roaming\Python\Python39\site-packages\autogluon\multimodal\optimization\lit_modul │
│ e.py:210 in _shared_step                                                                         │
│                                                                                                  │
│   207 │   │   │   self.mixup_fn.mixup_enabled = self.training & (self.current_epoch < self.hpa   │
│   208 │   │   │   batch, label = multimodel_mixup(batch=batch, model=self.model, mixup_fn=self   │
│   209 │   │   output = run_model(self.model, batch)                                              │
│ ❱ 210 │   │   loss = self._compute_loss(output=output, label=label)                              │
│   211 │   │   return output, loss                                                                │
│   212 │                                                                                          │
│   213 │   def training_step(self, batch, batch_idx):                                             │
│                                                                                                  │
│ C:\Users\M Kharisma                                                                              │
│ Azhari\AppData\Roaming\Python\Python39\site-packages\autogluon\multimodal\optimization\lit_modul │
│ e.py:178 in _compute_loss                                                                        │
│                                                                                                  │
│   175 │   │   │   │   loss += self._compute_template_loss(per_output, label) * weight            │
│   176 │   │   │   else:                                                                          │
│   177 │   │   │   │   loss += (                                                                  │
│ ❱ 178 │   │   │   │   │   self.loss_func(                                                        │
│   179 │   │   │   │   │   │   input=per_output[LOGITS].squeeze(dim=1),                           │
│   180 │   │   │   │   │   │   target=label,                                                      │
│   181 │   │   │   │   │   )                                                                      │
│                                                                                                  │
│ c:\ProgramData\Anaconda3\lib\site-packages\torch\nn\modules\module.py:1130 in _call_impl         │
│                                                                                                  │
│   1127 │   │   # this function, and just call forward.                                           │
│   1128 │   │   if not (self._backward_hooks or self._forward_hooks or self._forward_pre_hooks o  │
│   1129 │   │   │   │   or _global_forward_hooks or _global_forward_pre_hooks):                   │
│ ❱ 1130 │   │   │   return forward_call(*input, **kwargs)                                         │
│   1131 │   │   # Do not call functions when jit is used                                          │
│   1132 │   │   full_backward_hooks, non_full_backward_hooks = [], []                             │
│   1133 │   │   if self._backward_hooks or _global_backward_hooks:                                │
│                                                                                                  │
│ c:\ProgramData\Anaconda3\lib\site-packages\torch\nn\modules\loss.py:1164 in forward              │
│                                                                                                  │
│   1161 │   │   self.label_smoothing = label_smoothing                                            │
│   1162 │                                                                                         │
│   1163 │   def forward(self, input: Tensor, target: Tensor) -> Tensor:                           │
│ ❱ 1164 │   │   return F.cross_entropy(input, target, weight=self.weight,                         │
│   1165 │   │   │   │   │   │   │      ignore_index=self.ignore_index, reduction=self.reduction,  │
│   1166 │   │   │   │   │   │   │      label_smoothing=self.label_smoothing)                      │
│   1167                                                                                           │
│                                                                                                  │
│ c:\ProgramData\Anaconda3\lib\site-packages\torch\nn\functional.py:3014 in cross_entropy          │
│                                                                                                  │
│   3011 │   │   )                                                                                 │
│   3012 │   if size_average is not None or reduce is not None:                                    │
│   3013 │   │   reduction = _Reduction.legacy_get_string(size_average, reduce)                    │
│ ❱ 3014 │   return torch._C._nn.cross_entropy_loss(input, target, weight, _Reduction.get_enum(re  │
│   3015                                                                                           │
│   3016                                                                                           │
│   3017 def binary_cross_entropy(                                                                 │
╰──────────────────────────────────────────────────────────────────────────────────────────────────╯
RuntimeError: "nll_loss_forward_reduce_cuda_kernel_2d_index" not implemented for 'Int'
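
The last frames of the traceback show `F.cross_entropy` being called with a target of dtype `Int` (int32), while the loss kernel expects class indices as int64 (`Long`). The snippet below is a minimal sketch of that dtype requirement in plain PyTorch, not AutoGluon's own code. The tensor values and the helper name `to_class_indices` are made up for illustration, and the thread itself only confirms that restarting the runtime resolved the problem.

```python
import torch
import torch.nn.functional as F


def to_class_indices(labels: torch.Tensor) -> torch.Tensor:
    """Hypothetical helper: cast integer labels to int64, the dtype cross_entropy expects."""
    return labels.to(torch.int64)


torch.manual_seed(0)
logits = torch.randn(4, 3)  # 4 samples, 3 classes (illustrative values)

# Labels stored as int32, the dtype reported as 'Int' in the error message.
labels_int32 = torch.tensor([0, 2, 1, 1], dtype=torch.int32)

# Cast to int64 before computing the loss.
labels_int64 = to_class_indices(labels_int32)
loss = F.cross_entropy(logits, labels_int64)

print(labels_int32.dtype, "->", labels_int64.dtype)
print("loss:", loss.item())
```

If the label tensor reaching the loss function is int32, a cast like this is the usual remedy in user-written training loops. Whether and where it applies inside AutoGluon is an assumption.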

Expected behavior Able to fit the model normally.

from autogluon.multimodal import MultiModalPredictor

# train_data consists of image and label columns.
predictor = MultiModalPredictor(
    label="label"
)
predictor.fit(
    train_data=train_data,
    time_limit=60*60*12,  # seconds
)

Installed Versions Which version of AutoGluon are you using?
If you are using 0.4.0 and newer, please run the following code snippet:


INSTALLED VERSIONS
------------------
date                   : 2023-01-29
time                   : 15:01:26.049318
python                 : 3.9.13.final.0
OS                     : Windows
OS-release             : 10
Version                : 10.0.22621
machine                : AMD64
processor              : Intel64 Family 6 Model 183 Stepping 1, GenuineIntel
num_cores              : 32
cpu_ram_mb             : 32508
cuda version           : None
num_gpus               : 1
gpu_ram_mb             : [21314]
avail_disk_size_mb     : None

accelerate             : 0.13.2
albumentations         : 1.1.0
autogluon.common       : 0.6.2
autogluon.core         : 0.6.2
autogluon.features     : 0.6.2
autogluon.multimodal   : 0.6.2
autogluon.tabular      : 0.6.2
autogluon.text         : 0.6.2
autogluon.timeseries   : 0.6.2
autogluon.vision       : 0.6.2
boto3                  : 1.24.28
catboost               : 1.1.1
dask                   : 2021.11.2
defusedxml             : 0.7.1
distributed            : 2021.11.2
evaluate               : 0.3.0
fairscale              : 0.4.6
fastai                 : 2.7.10
gluoncv                : 0.11.0
gluonts                : 0.11.8
hyperopt               : 0.2.7
joblib                 : 1.1.0
jsonschema             : 4.8.0
lightgbm               : 3.3.5
matplotlib             : 3.5.2
networkx               : 2.8.4
nlpaug                 : 1.1.10
nltk                   : 3.7
nptyping               : 1.4.4
numpy                  : 1.21.5
omegaconf              : 2.1.2
openmim                : None
pandas                 : 1.4.4
PIL                    : 9.4.0
psutil                 : 5.9.0
pytorch-metric-learning: None
pytorch_lightning      : 1.7.7
ray                    : 2.0.1
requests               : 2.28.1
scipy                  : 1.8.1
sentencepiece          : 0.1.97
seqeval                : None
setuptools             : 63.4.1
skimage                : 0.19.2
sklearn                : 1.0.2
smart_open             : 5.2.1
statsmodels            : 0.13.2
text-unidecode         : None
timm                   : 0.6.12
torch                  : 1.12.1+cu113
torchmetrics           : 0.8.2
torchtext              : 0.13.1
torchvision            : 0.13.1+cu113
tqdm                   : 4.64.1
transformers           : 4.23.1
xgboost                : 1.7.3

About this issue

State: closed
Created a year ago
Comments: 17 (1 by maintainers)

Most upvoted comments

@muazhari Just want to check here as well: does restarting the Kaggle runtime solve your issue in #2572? Thank you.

Yes, it does. Thank you too.

muazhari on Feb 8, 2023