Ax: Crash while optimizing: RuntimeError: cholesky_cpu: U(1,1) is zero, singular U.

I got this recently while trying to tune the hyperparameters of an MLP.

Relevant versions:

python==3.7.1
ax-platform==0.1.2
botorch==0.1.0
gpytorch==0.3.2
scipy==1.1.0
torch==1.1.0

I’m using ax.optimize() as the entrypoint. The crash happened 45 trials into the experiment. Here’s the stack trace.

ax.service.managed_loop: Running optimization trial 45...
---------------------------------------------------------------------------
RuntimeError                              Traceback (most recent call last)
<ipython-input-37-5c288f8ea2dd> in <module>
     76         ],
     77         evaluation_function=do_train,
---> 78         minimize=True,
     79     )

~/anaconda3/lib/python3.7/site-packages/ax/service/managed_loop.py in optimize(parameters, evaluation_function, experiment_name, objective_name, minimize, parameter_constraints, outcome_constraints, total_trials, arms_per_trial, wait_time)
    204         wait_time=wait_time,
    205     )
--> 206     loop.full_run()
    207     parameterization, values = loop.get_best_point()
    208     return parameterization, values, loop.experiment, loop.get_current_model()

~/anaconda3/lib/python3.7/site-packages/ax/service/managed_loop.py in full_run(self)
    148         logger.info(f"Started full optimization with {num_steps} steps.")
    149         for _ in range(num_steps):
--> 150             self.run_trial()
    151         return self
    152 

~/anaconda3/lib/python3.7/site-packages/ax/service/managed_loop.py in run_trial(self)
    128             trial = self.experiment.new_trial(
    129                 generator_run=self.generation_strategy.gen(
--> 130                     experiment=self.experiment, new_data=dat
    131                 )
    132             )

~/anaconda3/lib/python3.7/site-packages/ax/modelbridge/generation_strategy.py in gen(self, experiment, new_data, n, **kwargs)
    161         elif new_data is not None:
    162             # We're sticking with the current model, but update with new data
--> 163             self._model.update(experiment=experiment, data=new_data)
    164 
    165         gen_run = not_none(self._model).gen(n=n, **(self._curr.model_gen_kwargs or {}))

~/anaconda3/lib/python3.7/site-packages/ax/modelbridge/base.py in update(self, data, experiment)
    385             obs_feats = t.transform_observation_features(obs_feats)
    386             obs_data = t.transform_observation_data(obs_data, obs_feats)
--> 387         self._update(observation_features=obs_feats, observation_data=obs_data)
    388         self.fit_time += time.time() - t_update_start
    389         self.fit_time_since_gen += time.time() - t_update_start

~/anaconda3/lib/python3.7/site-packages/ax/modelbridge/array.py in _update(self, observation_features, observation_data)
    110         # Update in-design status for these new points.
    111         self.training_in_design[-len(observation_features) :] = in_design
--> 112         self._model_update(Xs=Xs_array, Ys=Ys_array, Yvars=Yvars_array)
    113 
    114     def _model_update(

~/anaconda3/lib/python3.7/site-packages/ax/modelbridge/torch.py in _model_update(self, Xs, Ys, Yvars)
    113         Ys: List[Tensor] = self._array_list_to_tensors(Ys)
    114         Yvars: List[Tensor] = self._array_list_to_tensors(Yvars)
--> 115         self.model.update(Xs=Xs, Ys=Ys, Yvars=Yvars)
    116 
    117     def _model_predict(self, X: np.ndarray) -> Tuple[np.ndarray, np.ndarray]:

~/anaconda3/lib/python3.7/site-packages/ax/models/torch/botorch.py in update(self, Xs, Ys, Yvars)
    372             Yvars=self.Yvars,
    373             task_features=self.task_features,
--> 374             state_dict=state_dict,
    375         )

~/anaconda3/lib/python3.7/site-packages/ax/models/torch/botorch_defaults.py in get_and_fit_model(Xs, Ys, Yvars, task_features, state_dict, **kwargs)
     84             # pyre-ignore: [16]
     85             mll = ExactMarginalLogLikelihood(model.likelihood, model)
---> 86         mll = fit_gpytorch_model(mll, bounds=bounds)
     87     else:
     88         model.load_state_dict(state_dict)

~/anaconda3/lib/python3.7/site-packages/botorch/fit.py in fit_gpytorch_model(mll, optimizer, **kwargs)
     33     """
     34     mll.train()
---> 35     mll, _ = optimizer(mll, track_iterations=False, **kwargs)
     36     mll.eval()
     37     return mll

~/anaconda3/lib/python3.7/site-packages/botorch/optim/fit.py in fit_gpytorch_scipy(mll, bounds, method, options, track_iterations)
    186         jac=True,
    187         options=options,
--> 188         callback=cb,
    189     )
    190     iterations = []

~/anaconda3/lib/python3.7/site-packages/scipy/optimize/_minimize.py in minimize(fun, x0, args, method, jac, hess, hessp, bounds, constraints, tol, callback, options)
    601     elif meth == 'l-bfgs-b':
    602         return _minimize_lbfgsb(fun, x0, args, jac, bounds,
--> 603                                 callback=callback, **options)
    604     elif meth == 'tnc':
    605         return _minimize_tnc(fun, x0, args, jac, bounds, callback=callback,

~/anaconda3/lib/python3.7/site-packages/scipy/optimize/lbfgsb.py in _minimize_lbfgsb(fun, x0, args, jac, bounds, disp, maxcor, ftol, gtol, eps, maxfun, maxiter, iprint, callback, maxls, **unknown_options)
    333             # until the completion of the current minimization iteration.
    334             # Overwrite f and g:
--> 335             f, g = func_and_grad(x)
    336         elif task_str.startswith(b'NEW_X'):
    337             # new iteration

~/anaconda3/lib/python3.7/site-packages/scipy/optimize/lbfgsb.py in func_and_grad(x)
    283     else:
    284         def func_and_grad(x):
--> 285             f = fun(x, *args)
    286             g = jac(x, *args)
    287             return f, g

~/anaconda3/lib/python3.7/site-packages/scipy/optimize/optimize.py in function_wrapper(*wrapper_args)
    291     def function_wrapper(*wrapper_args):
    292         ncalls[0] += 1
--> 293         return function(*(wrapper_args + args))
    294 
    295     return ncalls, function_wrapper

~/anaconda3/lib/python3.7/site-packages/scipy/optimize/optimize.py in __call__(self, x, *args)
     61     def __call__(self, x, *args):
     62         self.x = numpy.asarray(x).copy()
---> 63         fg = self.fun(x, *args)
     64         self.jac = fg[1]
     65         return fg[0]

~/anaconda3/lib/python3.7/site-packages/botorch/optim/fit.py in _scipy_objective_and_grad(x, mll, property_dict)
    221     output = mll.model(*train_inputs)
    222     args = [output, train_targets] + _get_extra_mll_args(mll)
--> 223     loss = -mll(*args).sum()
    224     loss.backward()
    225     param_dict = OrderedDict(mll.named_parameters())

~/anaconda3/lib/python3.7/site-packages/gpytorch/module.py in __call__(self, *inputs, **kwargs)
     20 
     21     def __call__(self, *inputs, **kwargs):
---> 22         outputs = self.forward(*inputs, **kwargs)
     23         if isinstance(outputs, list):
     24             return [_validate_module_outputs(output) for output in outputs]

~/anaconda3/lib/python3.7/site-packages/gpytorch/mlls/exact_marginal_log_likelihood.py in forward(self, output, target, *params)
     26         # Get the log prob of the marginal distribution
     27         output = self.likelihood(output, *params)
---> 28         res = output.log_prob(target)
     29 
     30         # Add terms for SGPR / when inducing points are learned

~/anaconda3/lib/python3.7/site-packages/gpytorch/distributions/multivariate_normal.py in log_prob(self, value)
    127 
    128         # Get log determininat and first part of quadratic form
--> 129         inv_quad, logdet = covar.inv_quad_logdet(inv_quad_rhs=diff.unsqueeze(-1), logdet=True)
    130 
    131         res = -0.5 * sum([inv_quad, logdet, diff.size(-1) * math.log(2 * math.pi)])

~/anaconda3/lib/python3.7/site-packages/gpytorch/lazy/lazy_tensor.py in inv_quad_logdet(self, inv_quad_rhs, logdet, reduce_inv_quad)
    990             from .chol_lazy_tensor import CholLazyTensor
    991 
--> 992             cholesky = CholLazyTensor(self.cholesky())
    993             return cholesky.inv_quad_logdet(inv_quad_rhs=inv_quad_rhs, logdet=logdet, reduce_inv_quad=reduce_inv_quad)
    994 

~/anaconda3/lib/python3.7/site-packages/gpytorch/lazy/lazy_tensor.py in cholesky(self, upper)
    716             (LazyTensor) Cholesky factor (lower triangular)
    717         """
--> 718         res = self._cholesky()
    719         if upper:
    720             res = res.transpose(-1, -2)

~/anaconda3/lib/python3.7/site-packages/gpytorch/utils/memoize.py in g(self, *args, **kwargs)
     32         cache_name = name if name is not None else method
     33         if not is_in_cache(self, cache_name):
---> 34             add_to_cache(self, cache_name, method(self, *args, **kwargs))
     35         return get_from_cache(self, cache_name)
     36 

~/anaconda3/lib/python3.7/site-packages/gpytorch/lazy/lazy_tensor.py in _cholesky(self)
    401             evaluated_mat.register_hook(_ensure_symmetric_grad)
    402 
--> 403         cholesky = psd_safe_cholesky(evaluated_mat.double()).to(self.dtype)
    404         return NonLazyTensor(cholesky)
    405 

~/anaconda3/lib/python3.7/site-packages/gpytorch/utils/cholesky.py in psd_safe_cholesky(A, upper, out, jitter)
     45                 continue
     46 
---> 47         raise e
     48 
     49 

~/anaconda3/lib/python3.7/site-packages/gpytorch/utils/cholesky.py in psd_safe_cholesky(A, upper, out, jitter)
     19     """
     20     try:
---> 21         L = torch.cholesky(A, upper=upper, out=out)
     22         # TODO: Remove once fixed in pytorch (#16780)
     23         if A.dim() > 2 and A.is_cuda:

RuntimeError: cholesky_cpu: U(1,1) is zero, singular U.

About this issue

  • State: closed
  • Created 5 years ago
  • Comments: 16 (12 by maintainers)

Most upvoted comments

@Balandat it hasn’t happened again since I lowered the clamping value to 1. I’m also no longer using that black-box function, because there was a bug in the data pre-processing stage. That bug was causing the good objective values to all be very close together, which is consistent with the theory that the scaling of the objectives was the problem. The timing of the error is odd, though (iteration 31 vs. 45). Since the jittering is stochastic, maybe it’s just bad luck. We should configure the jitter warnings to show up in the logs: if they start appearing right after iteration 31 and only exceed the 3 tries at iteration 45, that’s a strong signal.
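A minimal sketch of how such jitter warnings could be surfaced in the logs. It assumes the jitter message is emitted through Python's warnings module as a RuntimeWarning. The cholesky and cholesky_with_jitter helpers below are hypothetical pure-Python stand-ins written for illustration, not gpytorch's implementation; the three tries come from the comment above, while the tenfold jitter schedule and the ValueError are assumptions (the real failure in the trace is a RuntimeError raised by torch.cholesky).

```python
import logging
import math
import warnings


def cholesky(matrix):
    """Return the lower Cholesky factor of a symmetric matrix (list of rows)."""
    n = len(matrix)
    lower = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1):
            s = matrix[i][j] - sum(lower[i][k] * lower[j][k] for k in range(j))
            if i == j:
                if s <= 0.0:
                    # A non-positive pivot means the matrix is not positive definite.
                    raise ValueError(f"pivot {i + 1} is not positive")
                lower[i][j] = math.sqrt(s)
            else:
                lower[i][j] = s / lower[j][j]
    return lower


def cholesky_with_jitter(matrix, jitter=1e-8, max_tries=3):
    """Retry a failed factorization with growing diagonal jitter.

    Returns (factor, added_jitter). Emits a RuntimeWarning whenever jitter
    was needed, and re-raises the original error if every retry fails.
    """
    try:
        return cholesky(matrix), 0.0
    except ValueError as original_error:
        for attempt in range(max_tries):
            added = jitter * 10 ** attempt  # assumed schedule: 1x, 10x, 100x
            shifted = [
                [v + (added if r == c else 0.0) for c, v in enumerate(row)]
                for r, row in enumerate(matrix)
            ]
            try:
                factor = cholesky(shifted)
            except ValueError:
                continue
            warnings.warn(
                f"matrix not p.d., added jitter of {added} to the diagonal",
                RuntimeWarning,
            )
            return factor, added
        raise original_error


def route_warnings_to_logging(level=logging.WARNING):
    """Send warnings to the 'py.warnings' logger and never deduplicate them."""
    logging.basicConfig(
        level=level, format="%(asctime)s %(name)s %(levelname)s %(message)s"
    )
    logging.captureWarnings(True)
    # Show every occurrence, so the trial at which warnings start is visible.
    warnings.simplefilter("always", RuntimeWarning)


if __name__ == "__main__":
    route_warnings_to_logging()
    # A singular matrix: the first attempt fails, the first jitter level succeeds.
    cholesky_with_jitter([[1.0, 1.0], [1.0, 1.0]])
```

With timestamps in the log format, the first jitter warning can be lined up against the "Running optimization trial N..." messages to see whether the warnings begin right after the clamped trial.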

@leopd, I took a closer look at this and wasn’t able to reproduce the fitting issues. That said, I can’t rule out that I messed up in re-creating the experiment state from the logs, so I’ll ask some other folks to double-check this.

Is this issue reproducible on your end?

Here is the notebook that I used; let me know if this looks sane to you: debug_leos_fitting_issue.ipynb.txt

Aside: Above I stated that “Optimized hyperparameters for a model evaluated on data with vastly different objective values could end up causing numerical issues in the solves that result in NaNs.”

This argument doesn’t convince me anymore, since the clamping happens much earlier (iteration 31) than the failure in the model fitting. If this were the cause, one would expect this to happen once the model gets refit on the outlier value.

Aha, the plot thickens. Ax by default standardizes the outcome values internally (zero mean, unit variance) so that the default hyperparameter priors work for different scales. Clamping the objective value to 10 (which is about two orders of magnitude larger than the largest other observation) could well lead to this issue during model fitting, in particular since by default we warm-start the model parameters from those of the model fitted in the previous iteration. Optimized hyperparameters for a model evaluated on data with vastly different objective values could end up causing numerical issues in the solves that result in NaNs.
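To make the scaling effect concrete, here is a small sketch of what zero-mean, unit-variance standardization does when one observation is clamped to 10. The objective values are hypothetical, chosen only so that the largest ordinary observation is about two orders of magnitude below the clamp; the use of the population standard deviation is also an assumption, not a statement about Ax's internals.

```python
import statistics


def standardize(values):
    """Shift to zero mean and scale to unit (population) standard deviation."""
    mean = statistics.mean(values)
    std = statistics.pstdev(values)
    return [(v - mean) / std for v in values]


def spread(values):
    """Distance between the largest and smallest value."""
    return max(values) - min(values)


# Hypothetical objective values from well-behaved trials.
typical = [0.08, 0.10, 0.12, 0.09, 0.11]
# Hypothetical clamped value assigned to one diverged trial.
clamped = 10.0

# Spread of the ordinary observations after standardization,
# first on their own and then with the clamped value in the data.
spread_without = spread(standardize(typical))
spread_with = spread(standardize(typical + [clamped])[:-1])

print(f"spread without clamped value: {spread_without:.3f}")
print(f"spread with clamped value:    {spread_with:.3f}")
```

In this sketch the ordinary observations span a few standard units on their own, but collapse into a band about a hundredth of a unit wide once the clamped value is included. Hyperparameters fitted before the clamped trial were fitted to very different standardized data than the data seen afterwards.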

Let me look into this in more detail tomorrow using the parameter and objective values from the log. In the meantime, as a sanity check you could try choosing a much smaller clamp value for now (maybe something around 0.5-1 or so).
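A sketch of what a smaller clamp could look like on the evaluation side, assuming a minimization objective as in the original call (minimize=True). The names clamp_objective, train_fn, and do_train_clamped are hypothetical; the signature of the original do_train is not shown in this thread, so the wrapper only assumes that it takes a parameterization and returns a single float.

```python
import math


def clamp_objective(value, upper=1.0):
    """Cap a loss at `upper`; map NaN or infinite values to `upper` as well."""
    if not math.isfinite(value):
        return upper
    return min(value, upper)


def do_train_clamped(parameterization, train_fn, upper=1.0):
    """Run a training function and clamp the objective it returns.

    `train_fn` stands in for the real evaluation function and is assumed
    to return one float for the given parameterization.
    """
    return clamp_objective(train_fn(parameterization), upper)


if __name__ == "__main__":
    # Hypothetical stand-in that diverges for large learning rates.
    def fake_train(params):
        return float("nan") if params["lr"] > 0.5 else params["lr"] * 0.2

    print(do_train_clamped({"lr": 0.1}, fake_train))
    print(do_train_clamped({"lr": 0.9}, fake_train))
```

Keeping the clamp close to the range of the ordinary objective values limits how much a single diverged trial can distort the standardized outcomes.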

Ultimately, failures such as the model diverging should be handled in a special fashion (e.g. marking the trial as “failed” and possibly automatically backing off from the bounds), but that will require some additional work. In the diverging trial, do you notice anything particular w.r.t. the parameters (e.g. values close to the boundary of the design space)?