miceforest: Using mean_match_candidates different from zero with categorical variables generates an error

Hi,

Another issue I found when imputing categorical variables (category dtype): setting mean_match_candidates != 0 in the kernel definition raises an error during .mice().

Error message:

/opt/conda/lib/python3.7/site-packages/miceforest/utils.py:377: RuntimeWarning: divide by zero encountered in true_divide
  odds_ratio = probability / (1 - probability)
/opt/conda/lib/python3.7/site-packages/miceforest/utils.py:378: RuntimeWarning: divide by zero encountered in log
  log_odds = np.log(odds_ratio)
---------------------------------------------------------------------------
KeyError                                  Traceback (most recent call last)
<ipython-input-106-6f8a47b02d4b> in <module>
      9                                    mean_match_candidates=2)
     10 # Run the MICE algorithm for X iterations - 12:09 - 13:22
---> 11 a.mice(iterations=2, verbose=True, n_jobs=-1)

/opt/conda/lib/python3.7/site-packages/miceforest/ImputationKernel.py in mice(self, iterations, verbose, variable_parameters, compile_candidates, **kwlgb)
   1193                                 random_state=self._random_state,
   1194                                 hashed_seeds=None,
-> 1195                                 candidate_preds=candidate_preds,
   1196                             )
   1197                         )

/opt/conda/lib/python3.7/site-packages/miceforest/mean_match_schemes.py in mean_match_function_kdtree_cat(mmc, model, bachelor_features, candidate_values, random_state, hashed_seeds, candidate_preds)
    361                 candidate_values,
    362                 random_state,
--> 363                 hashed_seeds,
    364             )
    365 

/opt/conda/lib/python3.7/site-packages/miceforest/mean_match_schemes.py in _mean_match_multiclass_accurate(mmc, bachelor_preds, candidate_preds, candidate_values, random_state, hashed_seeds)
    119 
    120     index_choice = knn_indices[np.arange(knn_indices.shape[0]), ind]
--> 121     imp_values = candidate_values[index_choice]
    122 
    123     return imp_values

/opt/conda/lib/python3.7/site-packages/pandas/core/series.py in __getitem__(self, key)
    964             return self._get_values(key)
    965 
--> 966         return self._get_with(key)
    967 
    968     def _get_with(self, key):

/opt/conda/lib/python3.7/site-packages/pandas/core/series.py in _get_with(self, key)
    999             #  (i.e. self.iloc) or label-based (i.e. self.loc)
   1000             if not self.index._should_fallback_to_positional():
-> 1001                 return self.loc[key]
   1002             else:
   1003                 return self.iloc[key]

/opt/conda/lib/python3.7/site-packages/pandas/core/indexing.py in __getitem__(self, key)
    929 
    930             maybe_callable = com.apply_if_callable(key, self.obj)
--> 931             return self._getitem_axis(maybe_callable, axis=axis)
    932 
    933     def _is_scalar_access(self, key: tuple):

/opt/conda/lib/python3.7/site-packages/pandas/core/indexing.py in _getitem_axis(self, key, axis)
   1151                     raise ValueError("Cannot index with multidimensional key")
   1152 
-> 1153                 return self._getitem_iterable(key, axis=axis)
   1154 
   1155             # nested tuple slicing

/opt/conda/lib/python3.7/site-packages/pandas/core/indexing.py in _getitem_iterable(self, key, axis)
   1091 
   1092         # A collection of keys
-> 1093         keyarr, indexer = self._get_listlike_indexer(key, axis)
   1094         return self.obj._reindex_with_indexers(
   1095             {axis: [keyarr, indexer]}, copy=True, allow_dups=True

/opt/conda/lib/python3.7/site-packages/pandas/core/indexing.py in _get_listlike_indexer(self, key, axis)
   1312             keyarr, indexer, new_indexer = ax._reindex_non_unique(keyarr)
   1313 
-> 1314         self._validate_read_indexer(keyarr, indexer, axis)
   1315 
   1316         if needs_i8_conversion(ax.dtype) or isinstance(

/opt/conda/lib/python3.7/site-packages/pandas/core/indexing.py in _validate_read_indexer(self, key, indexer, axis)
   1375 
   1376             not_found = list(ensure_index(key)[missing_mask.nonzero()[0]].unique())
-> 1377             raise KeyError(f"{not_found} not in index")
   1378 
   1379 

KeyError: '[129846] not in index'
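The two RuntimeWarnings at the top of the output come from the log-odds computation quoted there: a predicted probability of exactly 1 makes the denominator zero, and a probability of exactly 0 makes the argument of the log zero. The sketch below reproduces that arithmetic with NumPy only. The clipping helper and its `eps` tolerance are assumptions added for illustration, not something miceforest is known to do.

```python
import numpy as np


def log_odds(probability, eps=1e-12):
    """Log-odds with probabilities clipped away from 0 and 1.

    `eps` is a hypothetical tolerance chosen for this sketch.
    """
    p = np.clip(np.asarray(probability, dtype=float), eps, 1.0 - eps)
    return np.log(p / (1.0 - p))


probs = np.array([0.0, 0.25, 1.0])

# Same arithmetic as the two lines quoted in the warnings, with the
# divide-by-zero warnings silenced so the result can be inspected.
with np.errstate(divide="ignore"):
    raw = np.log(probs / (1 - probs))

clipped = log_odds(probs)

print(raw)      # first entry is -inf, last entry is +inf
print(clipped)  # all entries are finite
```

The warnings on their own do not stop the run; the KeyError further down is what aborts .mice().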

What I found is that the default value of the mean_matching_scheme parameter causes the issue, because it applies mean matching to categorical features, and that using miceforest.mean_match_schemes.mean_match_scheme_fast_cat works around it.
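The last frames of the traceback show `candidate_values[index_choice]` being routed through `self.loc[key]`, which means the integer keys were looked up as index labels rather than as positions. The sketch below reproduces that behaviour with plain pandas. The Series and the index array are hypothetical stand-ins, and whether a non-default index is what triggered the failure in this particular dataset is an assumption.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for candidate_values: a categorical Series whose
# integer index has gaps, as happens after rows are filtered out.
candidate_values = pd.Series(
    ["a", "b", "c", "a"], index=[0, 2, 5, 9], dtype="category"
)

# Hypothetical stand-in for index_choice: 0-based positions such as a
# nearest-neighbour search would return.
index_choice = np.array([1, 3])

try:
    # With an integer index, integer keys are treated as labels.
    # Labels 1 and 3 do not exist, so this raises KeyError.
    candidate_values[index_choice]
    raised = False
except KeyError:
    raised = True

# Positional selection does not depend on the index labels.
by_position = candidate_values.iloc[index_choice]

print(raised)                # True
print(by_position.tolist())  # ['b', 'a']
```

If this is what happens, resetting the index of the input DataFrame before building the kernel may be another thing to try, although that has not been verified here.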

About this issue

State: closed
Created 2 years ago
Comments: 26 (14 by maintainers)

Most upvoted comments

@KaikeWesleyReis If you wouldn’t mind, can you pull the MeanMatchScheme branch of this repo and build the package locally, and test for your use case?

To do this, the following commands should work in the terminal (Windows):

git clone https://github.com/AnotherSamWilson/miceforest.git
cd miceforest
git checkout MeanMatchScheme

python setup.py sdist
pip install dist/miceforest-5.6.0.tar.gz

If you are on Linux, the commands should be similar.

EDIT - I should also note that this version changes how mean matching is controlled. There is now a MeanMatchScheme class, which you probably won’t need to mess with. The README for this branch has updated examples of how to work with the new structure. Otherwise, things are pretty much the same as they were.

AnotherSamWilson on Jul 28, 2022