nocturne: RuntimeError: DataLoader worker is killed by signal: Floating point exception.

Operating system

Ubuntu 18.04

Bug description

When running the imitation learning baseline, I sometimes get a floating point exception. Unfortunately, it is not deterministic and I cannot reliably reproduce it. Has anyone experienced this bug before?

Steps to reproduce

python examples/imitation_learning/train.py

Relevant log output

Error executing job with overrides: ['device=cuda:1']
Traceback (most recent call last):
  File "scripts/train.py", line 204, in main
    dist = model.dist(states)
  File "/home/bernard.lange/imitation-learning-agents-research/./src/algos/imitation_learning/model.py", line 83, in dist
    return MultivariateNormal(
  File "/home/bernard.lange/miniconda3/envs/nocturne/lib/python3.8/site-packages/torch/distributions/multivariate_normal.py", line 146, in __init__
    super(MultivariateNormal, self).__init__(batch_shape, event_shape, validate_args=validate_args)
  File "/home/bernard.lange/miniconda3/envs/nocturne/lib/python3.8/site-packages/torch/distributions/distribution.py", line 53, in __init__
    valid = constraint.check(value)
  File "/home/bernard.lange/miniconda3/envs/nocturne/lib/python3.8/site-packages/torch/distributions/constraints.py", line 509, in check
    sym_check = super().check(value)
  File "/home/bernard.lange/miniconda3/envs/nocturne/lib/python3.8/site-packages/torch/distributions/constraints.py", line 490, in check
    return torch.isclose(value, value.mT, atol=1e-6).all(-2).all(-1)
  File "/home/bernard.lange/miniconda3/envs/nocturne/lib/python3.8/site-packages/torch/utils/data/_utils/signal_handling.py", line 66, in handler
    _error_if_any_worker_fails()
RuntimeError: DataLoader worker (pid 7036) is killed by signal: Floating point exception.

ERROR: Unexpected floating-point exception encountered in worker.

About this issue

  • State: open
  • Created 2 years ago
  • Reactions: 1
  • Comments: 30 (2 by maintainers)

Most upvoted comments

I don’t think we can write a try/except block for floating point exceptions or assertion errors. I tried it, and the worker was still killed and the script still stopped.
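For anyone wondering why the try/except has no effect: a floating point exception arrives as a signal (SIGFPE) whose default action terminates the process, so no Python exception is ever raised. Below is a minimal sketch of that behaviour on Linux. The child process sends SIGFPE to itself as a stand-in for the real crash, which is an assumption made for illustration; in the actual bug the signal presumably originates in native code.

```python
import signal
import subprocess
import sys

# Child program: sends SIGFPE to itself inside a try/except block. The signal's
# default action terminates the process, so neither print statement runs.
CHILD = """
import os, signal
try:
    os.kill(os.getpid(), signal.SIGFPE)
except Exception:
    print("caught")
print("survived")
"""


def run_child():
    """Run CHILD in a separate interpreter and return (returncode, stdout)."""
    proc = subprocess.run([sys.executable, "-c", CHILD],
                          capture_output=True, text=True)
    return proc.returncode, proc.stdout


returncode, stdout = run_child()
# On POSIX, a return code of -N means the child was killed by signal N.
print(returncode == -signal.SIGFPE, repr(stdout))
```

The parent survives and can inspect the return code, which is why isolating each load in its own process is a workable way to detect these crashes.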

Instead, I iterated through the dataset with the above configs and built a dictionary of failing files (using a bash script that loops until it has gone through the whole dataset). For now, I just skip those files during training.
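The scan described above was done with a bash loop; a rough Python equivalent might look like the sketch below. It loads each file in its own subprocess so that a crash only kills that subprocess. The names `find_failing_files` and `DEMO_LOADER` are hypothetical, and the demo loader merely simulates a crash for file names containing "bad"; a real loader would have to construct the Nocturne scenario for the given path.

```python
import subprocess
import sys


def find_failing_files(paths, loader_code, timeout=600):
    """Return {path: returncode} for every path whose loader process failed.

    loader_code is Python source executed as `python -c loader_code <path>`.
    It is a placeholder for whatever code loads a single scenario file.
    """
    failing = {}
    for path in paths:
        try:
            proc = subprocess.run([sys.executable, "-c", loader_code, path],
                                  capture_output=True, timeout=timeout)
            code = proc.returncode
        except subprocess.TimeoutExpired:
            code = None  # the loader hung; record the file as failing too
        if code != 0:
            failing[path] = code
    return failing


# Stand-in loader (hypothetical): sends itself SIGFPE when the file name
# contains "bad", and exits normally otherwise.
DEMO_LOADER = """
import os, signal, sys
if "bad" in sys.argv[1]:
    os.kill(os.getpid(), signal.SIGFPE)
"""

failing = find_failing_files(["a.json", "bad_b.json", "c.json"], DEMO_LOADER)
print(failing)
```

The resulting dictionary can be saved and used to filter the file list before it is handed to the DataLoader.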

We’re following up with Waymo here https://github.com/waymo-research/waymo-open-dataset/issues/542 and will hopefully find some resolution (though the floating point error is probably from a different source).

Hi @BenQLange, just an update on our progress. It seems there is one vehicle/object with a negative length in tfrecord-00008-of-01000_364.json, which is at least the cause of the assertion failure. We are now investigating why such values exist and will try to find a way to handle such cases.

We found an object with the shape "width": 4.4137163162231445, "length": -1.295910358428955 in tfrecord-00008-of-01000_364.json.
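To look for similar entries in other files, something like the following sketch could help. It walks a parsed JSON document and reports every dictionary whose "width" or "length" is not positive. The walk is recursive so that it does not depend on the exact file layout; the "objects" key in the toy input is an assumed layout, and `find_bad_dimensions` is a hypothetical helper, not part of Nocturne.

```python
import json


def find_bad_dimensions(node, path="$"):
    """Recursively collect (json_path, width, length) for non-positive sizes."""
    bad = []
    if isinstance(node, dict):
        width, length = node.get("width"), node.get("length")
        if isinstance(width, (int, float)) and isinstance(length, (int, float)):
            if width <= 0 or length <= 0:
                bad.append((path, width, length))
        for key, value in node.items():
            bad.extend(find_bad_dimensions(value, f"{path}.{key}"))
    elif isinstance(node, list):
        for index, value in enumerate(node):
            bad.extend(find_bad_dimensions(value, f"{path}[{index}]"))
    return bad


# Toy scenario; the "objects" key is an assumption about the file layout.
# The second entry uses the width and length values reported above.
scenario = json.loads("""
{"objects": [
  {"width": 2.0, "length": 4.5},
  {"width": 4.4137163162231445, "length": -1.295910358428955}
]}
""")
bad = find_bad_dimensions(scenario)
print(bad)
```

For a real file, the parsed input would come from `json.load` on the opened scenario file instead of the inline string.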

Oh! Okay, let me throw on the debug flag and try again. Thanks for the suggestion.

Hmm, we are still looking into it. I just got a new laptop with enough space for the whole dataset so hopefully I can reconstruct your errors and help.

The modified dataset resolves the assertion errors, but I am still experiencing floating point exceptions from time to time 😦

Thanks for finding those! We are still looking into it, but in the meantime, would including a try/except block in your code temporarily resolve this issue so that you aren’t blocked? We should have a resolution soon.

Hi @BenQLange. Sorry for the delay; I had some other deadlines. I will take a detailed look later today and hopefully resolve it ASAP.

Hey @BenQLange, just to give you an update: we’re slightly backlogged, but Xiaomeng will take a look at this on Tuesday. Figured it was better to have a time than persistent uncertainty.

I think that’s probably it; great job and thank you!! @xiaomengy (our C++ wizard) do you see how this error could occur? We could definitely use your insight here

I have enabled the debug option in setup.py. Now I am getting the following errors:

python: /home/bernard.lange/nocturne/nocturne/cpp/include/geometry/line_segment.h:41: nocturne::geometry::Vector2D nocturne::geometry::LineSegment::Point(float) const: Assertion `t >= 0.0f && t <= 1.0f' failed.

Thread 1 "python" received signal SIGABRT, Aborted.
__GI_raise (sig=sig@entry=6) at ../sysdeps/unix/sysv/linux/raise.c:51
51      ../sysdeps/unix/sysv/linux/raise.c: No such file or directory.

python: /home/bernard.lange/nocturne/nocturne/cpp/include/geometry/polygon.h:67: nocturne::geometry::ConvexPolygon::ConvexPolygon(const std::initializer_list<nocturne::geometry::Vector2D>&): Assertion `VerifyVerticesOrder()' failed.

Thread 1 "python" received signal SIGABRT, Aborted.
__GI_raise (sig=sig@entry=6) at ../sysdeps/unix/sysv/linux/raise.c:51
51      ../sysdeps/unix/sysv/linux/raise.c: No such file or directory.

I am not a C++ wizard. Is it possible that those assertion errors lead to the floating point exception?

Oh that’s super useful that you can reproduce it without the training! So it’s in the worker or possibly in Nocturne itself… I’ll try to reproduce it using the smaller dataset but otherwise it’ll be a few days until my new laptop arrives and I can do some analysis on the full dataset, sorry!