CellBender: v2 always hits NaN loss, crashes

Hi there, I was super pumped to try out version 2, so I pulled that branch. Unfortunately, when I run cellbender remove-background --input ./spliced/ --output s_cellbended_ambient_200_1000_1000e_V2/s_cellbended.h5 --cuda --expected-cells 1598 --total-droplets-included 11598 --epochs 1000 --z-dim 200 --z-layers 1000 --learning-rate .001 --model ambient it always crashes after fewer than 100 epochs with a NaN training loss. Any idea why? Thought it might be helpful to report, and I'm happy to help however I can to get v2 running sooner!

    cellbender:remove-background: [epoch 034] average training loss: 1529.0032

    /opt/conda/conda-bld/pytorch_1579022060824/work/aten/src/ATen/native/cuda/Distributions.cu:290: lambda [](int, float &, float &, float &, float &, const float &, const float &, const float &, const float &)->auto::operator()(int, float &, float &, float &, float &, const float &, const float &, const float &, const float &)->auto: block: [0,0,0], thread: [96,0,0] Assertion 0 <= p4 && p4 <= 1 failed.

    /utils/newminiconda3/envs/cellbenderV2/lib/python3.7/site-packages/pyro/infer/traceenum_elbo.py:419: UserWarning: Encountered NaN: loss
    log_p(c=2000 | full) = -23.890047073364258
    log_p(c=2000 | empty) = -34.5873908996582
    cell log_sum.mean() is 8.426836013793945
    ~cell log_sum.mean() is 5.024422645568848
    cell log_nnz.mean() is 7.62549352645874
    ~cell log_nnz.mean() is 4.873218536376953
    cell cosine_overlap.mean() is 0.8181769251823425
    ~cell cosine_overlap.mean() is 0.4179271459579468
    x.mean() is 2.4015369490371086e-05
    x.std() is 0.0004832973063457757

    RuntimeError: CUDA error: device-side assert triggered

    Trace Shapes:
     Param Sites:
      encoder_z$$$linears.0.weight      1000 41640
      encoder_z$$$linears.0.bias        1000
      encoder_z$$$loc_out.weight         200  1000
      encoder_z$$$loc_out.bias           200
      encoder_z$$$sig_out.weight         200  1000
      encoder_z$$$sig_out.bias           200
      encoder_other$$$linears.0.weight    50 41643
      encoder_other$$$linears.0.bias      50
      encoder_other$$$linears.1.weight    10    50
      encoder_other$$$linears.1.bias      10
      encoder_other$$$output.weight        4    10
      encoder_other$$$output.bias          4
      d_cell_scale
      alpha0_scale
      d_empty_loc
      d_empty_scale
      chi_ambient                       41640
     Sample Sites:
      data dist           | value 500 |
      d_empty dist    500 | value 500 |
      p_passback dist 500 | value 500 |
      y dist          500 | value 500 |
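If it helps: CUDA device-side asserts are raised asynchronously, so the traceback above may not point at the operation that actually failed. Rerunning with synchronous kernel launches (a sketch below, same flags as my command above) should produce a more informative stack trace if needed:

    # synchronous CUDA launches so the failing op shows up in the Python traceback
    CUDA_LAUNCH_BLOCKING=1 cellbender remove-background --input ./spliced/ --output s_cellbended_ambient_200_1000_1000e_V2/s_cellbended.h5 --cuda --expected-cells 1598 --total-droplets-included 11598 --epochs 1000 --z-dim 200 --z-layers 1000 --learning-rate .001 --model ambient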

About this issue

  • State: closed
  • Created 4 years ago
  • Comments: 18 (7 by maintainers)

Most upvoted comments

To echo @sjfleming: I had a similar issue for a few samples, but dropping the learning rate and decreasing --z-dim to ~50 solved it. Sharing in case this is useful to anyone else (a sketch of the adjusted command is below).
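For concreteness, an adjusted invocation along these lines might look like the sketch below, reusing the paths and counts from the original report; the 0.0001 learning rate is only an illustrative value for "dropping the learning rate", not a number tested in this thread:

    # same run, but with a smaller latent dimension and a lower learning rate (illustrative values)
    cellbender remove-background \
        --input ./spliced/ \
        --output s_cellbended_ambient_200_1000_1000e_V2/s_cellbended.h5 \
        --cuda \
        --expected-cells 1598 \
        --total-droplets-included 11598 \
        --epochs 1000 \
        --z-dim 50 \
        --z-layers 1000 \
        --learning-rate 0.0001 \
        --model ambient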

Actually, the new 0.2 release solved all my issues (great job @sjfleming!). Where it used to crash on 30-50% of my samples, I can now run all 200 of them without the NaN crashes described above… So I'm wondering what's going on with @laijen000's samples.

Great, if you have one example that produces a NaN on 2.1, that would be very useful!

How to transfer it is a good question… I'm struggling to think of a good option. What happens if you try to email it to sfleming@broadinstitute.org? If you have a Google email account, it might put it in a Google Drive somewhere I can access temporarily. If that doesn't work, maybe I can make a Google bucket and give you write permission (if you can send me a Google email address I can use to grant bucket permissions). Also open to any other ideas. 😃

Found an Easter Egg? lol, couldn't catch it with an Ultra Ball.

    mu problem values: tensor([nan, nan, nan,  ..., nan, nan, nan], device='cuda:0', grad_fn=<IndexBackward>)
    alpha problem values: tensor([nan, nan, nan,  ..., nan, nan, nan], device='cuda:0', grad_fn=<IndexBackward>)
    A wild NaN appeared!  In param {mu, alpha}

Does this mean it has failed altogether, or is it simply a NaN during the training process? It appears to write outputs, so I guess I'll take a look at them.

@sjfleming LIT! Firing it up now! I’ll let you know where/if any issues arise.

Thanks for reporting! I am actively working on this branch, and hopefully I will resolve this NaN issue soon. The code will be in various stages of workability over the next few weeks, after which we're hoping to have a more official release of v2.

As for the issue with reading the output into scanpy: this is a known issue, and I need to merge a change from another branch that fixes it. Newer versions of scanpy require a 'gene_id' field in addition to 'gene_name', but I was not including 'gene_id' in the cellbender remove-background output. It certainly should be included going forward, and I will include that fix in v2 as well.
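In the meantime, a quick way to check whether a given output file already contains a gene ID field (assuming the HDF5 command-line tools are installed; this only lists datasets and does not modify the file) is:

    # list all datasets in the CellBender output and look for gene-related fields
    h5ls -r s_cellbended_ambient_200_1000_1000e_V2/s_cellbended.h5 | grep -i gene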