sockeye: Sockeye freezes at new validation start [v1.18.54]
For the third time in a few days, and in two independent training runs, I observed that Sockeye freezes right after starting a new validation: it does not crash and does not log any warning, but it stops making progress (0% CPU/GPU usage). Here are the last lines of my log file before the issue occurs:
[2018-09-24:21:45:33:INFO:sockeye.training:__call__] Epoch[3] Batch [270000] Speed: 650.11 samples/sec 22445.47 tokens/sec 2.06 updates/sec perplexity=3.546109
[2018-09-24:21:45:34:INFO:root:save_params_to_file] Saved params to "/run/work/generic_fr2en/model_baseline/params.00007"
[2018-09-24:21:45:34:INFO:sockeye.training:fit] Checkpoint [7] Updates=270000 Epoch=3 Samples=81602144 Time-cost=4711.141 Updates/sec=2.123
[2018-09-24:21:45:34:INFO:sockeye.training:fit] Checkpoint [7] Train-perplexity=3.546109
[2018-09-24:21:45:36:INFO:sockeye.training:fit] Checkpoint [7] Validation-perplexity=3.752938
[2018-09-24:21:45:36:INFO:sockeye.utils:log_gpu_memory_usage] GPU 0: 10093/11178 MB (90.29%) GPU 1: 9791/11178 MB (87.59%) GPU 2: 9795/11178 MB (87.63%) GPU 3: 9789/11178 MB (87.57%)
[2018-09-24:21:45:36:INFO:sockeye.training:collect_results] Decoder-6 finished: {'rouge2-val': 0.4331754429258854, 'rouge1-val': 0.6335038896620699, 'decode-walltime-val': 3375.992604494095, 'rougel-val': 0.5947101830587342, 'avg-sec-per-sent-val': 1.794786073627908, 'chrf-val': 0.6585073715647153, 'bleu-val': 0.43439024563194745}
[2018-09-24:21:45:36:INFO:sockeye.training:start_decoder] Starting process: Decoder-7
So at this point, it has written params.00007. When I kill the Sockeye process and restart it to continue training, it resumes after validation 6 (update 260000), later overwrites params.00007, starts Decoder-7, and continues training successfully.
I noticed that the freeze occurs at the same point as in #462, but I have no idea whether it is related to that case. After the issue occurred, I checked all parameters in the last params file with numpy.isnan() and no NaNs were reported.
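For anyone who wants to repeat the NaN check, here is a minimal sketch of how it could look. The helper name `find_nan_params` and the example dictionary are hypothetical. The sketch also assumes the parameters are already available as a dict of NumPy arrays; loading them (for instance with MXNet's `mx.nd.load` followed by `.asnumpy()` on each value) is an assumption about the setup and is not shown.

```python
import numpy as np


def find_nan_params(params):
    """Return the names of all arrays in `params` that contain at least one NaN.

    `params` is assumed to map parameter names to NumPy-compatible arrays.
    """
    return [
        name
        for name, value in params.items()
        if np.isnan(np.asarray(value)).any()
    ]


# Hypothetical stand-in for a loaded checkpoint; a real params file
# would hold the model's weight matrices instead.
params = {
    "encoder_weight": np.array([[0.1, -0.2], [0.3, 0.4]]),
    "decoder_bias": np.array([0.0, np.nan]),
}

# Lists every parameter that contains a NaN; an empty list means none do.
print(find_nan_params(params))
```

An empty result for the real params file corresponds to the "no NaNs were reported" observation above.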
About this issue
- State: closed
- Created 6 years ago
- Comments: 30 (30 by maintainers)
I merged the change. Let us know if you have any issues. I will close the issue for now.
That is great to hear. In the internal evaluation we ran, we also no longer observed this issue with the forkserver branch. We should now move ahead and integrate this change into the master branch 😃
That’s unfortunate! If you could try the forkserver branch again, to see whether this fixes your issue, that would be highly appreciated. I’m currently still looking into this issue and trying to confirm that the forkserver method successfully fixes it. Given the difficulty of reproducing the issue, it is also difficult to confirm the fix. So any additional datapoints would be very helpful 😃
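For readers unfamiliar with the term, "forkserver" is one of the start methods in Python's standard `multiprocessing` module: worker processes are forked from a small, clean server process rather than from the (possibly multithreaded) parent. The sketch below only illustrates that start method. It is not the code from the forkserver branch, and `decode_checkpoint` and `start_decoder` are hypothetical names standing in for whatever the decoder process actually does.

```python
import multiprocessing as mp


def decode_checkpoint(checkpoint, results):
    """Hypothetical stand-in for the work a checkpoint decoder would do."""
    results.put((checkpoint, "done"))


def start_decoder(checkpoint):
    """Run one worker via the forkserver start method and return its result."""
    # A context object selects the start method without changing the
    # process-wide default.
    ctx = mp.get_context("forkserver")
    results = ctx.Queue()
    proc = ctx.Process(
        target=decode_checkpoint,
        args=(checkpoint, results),
        name="Decoder-%d" % checkpoint,
    )
    proc.start()
    outcome = results.get(timeout=60)  # fetch the result before joining
    proc.join(timeout=60)
    return outcome


if __name__ == "__main__":
    # The guard is required: with forkserver, the child imports this module
    # again, and unguarded code would try to start processes recursively.
    print(start_decoder(7))
```

The forkserver start method is only available on Unix platforms that support passing file descriptors over pipes. The thread does not state why the original start method froze, so this shows only the mechanism being tested, not a diagnosis.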
I’m running frequent validations and the decoder has been successfully started for the 22nd time. I’ll let it run all night and tell you how it went tomorrow.