tabnet: TypeError when saving a model with `numpy.bool_` types

numpy.bool_ values are not serialized to JSON correctly.

What is the current behavior? The ComplexEncoder class (here) does not handle numpy.bool_, which is not JSON serializable. This raises a TypeError when saving certain models.

If the current behavior is a bug, please provide the steps to reproduce.

model = TabNetClassifier(...)
model.fit(...)  # training data and model parameters contain values of type numpy.bool_
model.save_model('path/to/model')
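The failure does not need TabNet to show up. As a minimal sketch (the `params` dict below is made up for illustration and is not TabNet's actual saved state), the standard `json` module rejects a NumPy boolean scalar on its own:

```python
import json

import numpy as np

# Hypothetical stand-in for saved model parameters; not TabNet's real structure.
params = {"flag": np.bool_(True)}

try:
    json.dumps(params)
    raised = False
except TypeError as exc:
    # The default encoder does not know NumPy scalar types.
    raised = True
    print(f"json.dumps failed: {exc}")

# A plain Python bool serializes without trouble.
as_json = json.dumps({"flag": bool(params["flag"])})
print(as_json)
```

The exact wording of the error message may differ between NumPy versions, so the sketch only checks that a TypeError is raised.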

Expected behavior: numpy.bool_ should be cast to Python’s bool before being serialized to JSON. Here is my suggested fix. Please let me know if this is acceptable for a PR:

class ComplexEncoder(json.JSONEncoder):
    def default(self, obj):
        if isinstance(obj, np.int64):
            return int(obj)
        if isinstance(obj, np.bool_):
            return bool(obj)
        # Let the base class default method raise the TypeError
        return json.JSONEncoder.default(self, obj)
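A quick way to check the suggested encoder outside of TabNet, as a sketch: the class is repeated here so the snippet runs on its own, and the `payload` dict is an illustrative assumption rather than anything TabNet actually saves.

```python
import json

import numpy as np


class ComplexEncoder(json.JSONEncoder):
    def default(self, obj):
        if isinstance(obj, np.int64):
            return int(obj)
        if isinstance(obj, np.bool_):
            return bool(obj)
        # Anything else still raises TypeError via the base class.
        return super().default(obj)


# Illustrative payload mixing the two NumPy scalar types handled above.
payload = {"count": np.int64(3), "flag": np.bool_(False)}
encoded = json.dumps(payload, cls=ComplexEncoder)
print(encoded)
```

Note that this version still rejects other NumPy scalar types such as np.int8 or np.float32, since only the two listed types are converted.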

Other relevant information:

  • poetry version: “poetry-core>=1.0.0”
  • python version: “^3.9”
  • Operating System: “Linux Kernel 5.18.14-arch1-1”
  • Additional tools: CUDA Version: 11.7, Driver Version: 515.57

Additional context

Here’s a stacktrace:

  File ".venv/lib/python3.10/site-packages/pytorch_tabnet/abstract_model.py", line 375, in save_model
    json.dump(saved_params, f, cls=ComplexEncoder)
  File "/usr/lib/python3.10/json/__init__.py", line 179, in dump
    for chunk in iterable:
  File "/usr/lib/python3.10/json/encoder.py", line 431, in _iterencode
    yield from _iterencode_dict(o, _current_indent_level)
  File "/usr/lib/python3.10/json/encoder.py", line 405, in _iterencode_dict
    yield from chunks
  File "/usr/lib/python3.10/json/encoder.py", line 405, in _iterencode_dict
    yield from chunks
  File "/usr/lib/python3.10/json/encoder.py", line 405, in _iterencode_dict
    yield from chunks
  File "/usr/lib/python3.10/json/encoder.py", line 438, in _iterencode
    o = _default(o)
  File ".venv/lib/python3.10/site-packages/pytorch_tabnet/utils.py", line 339, in default
    return json.JSONEncoder.default(self, obj)
  File "/usr/lib/python3.10/json/encoder.py", line 179, in default
    raise TypeError(f'Object of type {o.__class__.__name__} '
TypeError: Object of type bool_ is not JSON serializable

I ran into this when trying tabnet in a kaggle competition. If you need to, you can look here in my code where the error happens.

About this issue

  • State: closed
  • Created 2 years ago
  • Reactions: 2
  • Comments: 18

Most upvoted comments

@Optimox Hi. I don’t know if that happens in the AMEX competition, but I guess so, since the json encoding is not working for dtypes other than np.int64.

Sorry for not being clear enough in my description of the problem. I’ve therefore attached a minimal working example that triggers the bug.

As I said, the problem is that y_train, i.e. the target variable, is of type bool (or np.int8 in my case), and you’re only handling np.int64 in ComplexEncoder: https://github.com/dreamquark-ai/tabnet/blob/5ac55834b32693abc4b22028a74475ee0440c2a5/pytorch_tabnet/utils.py#L338

https://github.com/dreamquark-ai/tabnet/blob/5ac55834b32693abc4b22028a74475ee0440c2a5/pytorch_tabnet/utils.py#L336-L341

  import os
  import wget
  import pandas as pd
  import numpy as np
  from pathlib import Path
  from pytorch_tabnet.tab_model import TabNetClassifier
  url = "https://archive.ics.uci.edu/ml/machine-learning-databases/adult/adult.data"
  dataset_name = 'census-income'
  out = Path(os.getcwd()+'/data/'+dataset_name+'.csv')
  out.parent.mkdir(parents=True, exist_ok=True)
  if out.exists():
      print("File already exists.")
  else:
      print("Downloading file...")
      wget.download(url, out.as_posix())
  features = ['39', ' 77516', ' 13']
  train = pd.read_csv(out)
  train = train[features + [' <=50K']]
  # Values in adult.data keep a leading space, so compare against ' <=50K'.
  train['target'] = train[' <=50K'] == ' <=50K'
  train = train.drop(columns=[' <=50K'])
  if "Set" not in train.columns:
      train["Set"] = np.random.choice(["train", "valid", "test"], p=[.8, .1, .1], size=(train.shape[0],))
  
  train_indices = train[train.Set=="train"].index
  valid_indices = train[train.Set=="valid"].index
  test_indices = train[train.Set=="test"].index
  
  X_train = train[features].values[train_indices]
  y_train = train['target'].values[train_indices]
  
  X_valid = train[features].values[valid_indices]
  y_valid = train['target'].values[valid_indices]
  
  X_test = train[features].values[test_indices]
  y_test = train['target'].values[test_indices]
  
  clf = TabNetClassifier()
  clf.fit(X_train=X_train, y_train=y_train, max_epochs=2)
  
  saving_path_name = "./tabnet_model_test_1"
  saved_filepath = clf.save_model(saving_path_name)

I don’t have a timeline to share. I think making sure that the target column has type int instead of np.int during training should solve the problem. To be honest, I have never run into this myself.

Thanks @Optimox, the above comment solved my issue: I converted back the label types that I had been shrinking to save space.

Thanks, I’ll fix this soon.

I have the same problem with an int8:

Object of type int8 is not JSON serializable

It’s also raised from the ComplexEncoder. It seems to come from {'preds_mapper': {'0': 0, '1': 1}}, where the values 0 and 1 have type np.int8 (apparently because my target variable is an int8, just as the OP seems to use a bool for their target).

So as a workaround for the time being, one could cast the target variable to np.int64, which seems to be the only np.intX type that ComplexEncoder can encode right now.
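The workaround can be sketched as follows. This assumes, as the comment above suggests, that the mapper values inherit the dtype of the target array; `Int64OnlyEncoder` and `build_preds_mapper` are hypothetical stand-ins written for this example, not TabNet code.

```python
import json

import numpy as np


class Int64OnlyEncoder(json.JSONEncoder):
    """Stand-in mimicking an encoder that only knows np.int64."""

    def default(self, obj):
        if isinstance(obj, np.int64):
            return int(obj)
        return super().default(obj)


def build_preds_mapper(y):
    """Hypothetical helper: map class index -> label, keeping NumPy scalar types."""
    return {str(i): label for i, label in enumerate(np.unique(y))}


y_small = np.array([0, 1, 1, 0], dtype=np.int8)

# With an int8 target, the mapper values are np.int8 and encoding fails.
try:
    json.dumps(build_preds_mapper(y_small), cls=Int64OnlyEncoder)
    small_ok = True
except TypeError:
    small_ok = False

# Workaround: cast the target to np.int64 before it reaches the model.
y_cast = y_small.astype(np.int64)
cast_json = json.dumps(build_preds_mapper(y_cast), cls=Int64OnlyEncoder)
print(small_ok, cast_json)
```

The same `astype(np.int64)` cast applies to a boolean target, at the cost of storing the labels in a wider dtype.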

https://github.com/dreamquark-ai/tabnet/blob/5ac55834b32693abc4b22028a74475ee0440c2a5/pytorch_tabnet/utils.py#L336-L341

How about replacing lines 338-339 with

         if isinstance(obj, (np.generic, np.ndarray)): 
             return obj.tolist()

It seems that only TabNetClassifier has this problem. The types of the values in {'preds_mapper': {'0': 0, '1': 1}} come from the data the user passes to TabNetClassifier.fit. Using NumPy’s tolist() method would solve all similar problems, not only for np.bool_ but also for np.int32 and other NumPy generic types. Alternatively, it may be better to convert the train_labels values to JSON-compatible types before assigning them to preds_mapper.

https://github.com/dreamquark-ai/tabnet/blob/cab643b156fdecfded51d70d29072fc43f397bbb/pytorch_tabnet/tab_model.py#L45-L64
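Both ideas from the comment above can be sketched side by side. `NumpyEncoder` and `to_native` are hypothetical names chosen for this example, and the `state` dict is made up; none of them is part of pytorch_tabnet.

```python
import json

import numpy as np


class NumpyEncoder(json.JSONEncoder):
    """Sketch: fall back to tolist() for any NumPy scalar or array."""

    def default(self, obj):
        if isinstance(obj, (np.generic, np.ndarray)):
            return obj.tolist()
        return super().default(obj)


def to_native(value):
    """Convert a NumPy scalar to the matching Python type; pass others through."""
    return value.item() if isinstance(value, np.generic) else value


# Illustrative state mixing several NumPy types.
state = {
    "flag": np.bool_(True),
    "small": np.int8(1),
    "ratio": np.float32(0.5),
    "shape": np.array([2, 3]),
}
via_encoder = json.dumps(state, cls=NumpyEncoder)

# Alternative: convert label values up front so the default encoder suffices.
labels = np.unique(np.array([True, False]))
preds_mapper = {str(i): to_native(label) for i, label in enumerate(labels)}
via_conversion = json.dumps(preds_mapper)

print(via_encoder)
print(via_conversion)
```

The encoder approach catches every NumPy type at save time, while converting at the source keeps the saved dictionary plain so that no custom encoder is needed for it.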

I had a quick look, but I don’t know where this can be coming from… I don’t see how training data could change the weights either. Have you made any changes to the model/architecture at all?

Anyway, I agree with the fix, but would be good to know why it’s happening. I will have a deeper look later.