ray: Log Monitor TypeError
No reproducible script (never seen this before), but this happened in the middle of my training:
2020-12-02 07:26:37,751 WARNING worker.py:1011 -- The log monitor on node ip-172-31-18-179 failed with the following error:
Traceback (most recent call last):
  File "/home/ubuntu/anaconda3/lib/python3.7/site-packages/ray/log_monitor.py", line 354, in <module>
    log_monitor.run()
  File "/home/ubuntu/anaconda3/lib/python3.7/site-packages/ray/log_monitor.py", line 275, in run
    self.open_closed_files()
  File "/home/ubuntu/anaconda3/lib/python3.7/site-packages/ray/log_monitor.py", line 164, in open_closed_files
    self.close_all_files()
  File "/home/ubuntu/anaconda3/lib/python3.7/site-packages/ray/log_monitor.py", line 102, in close_all_files
    os.kill(file_info.worker_pid, 0)
TypeError: an integer is required (got type str)
cc @rkooo567
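For context, the TypeError in the traceback is what os.kill raises when its first argument is a string instead of an integer PID. The following is a minimal sketch of the signal-0 liveness check with a type guard added. It assumes a POSIX system, and pid_is_alive is a hypothetical helper written for illustration, not a function from Ray.

```python
import os


def pid_is_alive(pid):
    """Return True/False for an integer PID, or None if pid is not an int.

    Hypothetical helper for illustration only; not part of Ray.
    Assumes POSIX semantics for os.kill.
    """
    if not isinstance(pid, int) or isinstance(pid, bool):
        # For example, a label string such as "autoscaler" instead of a PID.
        return None
    try:
        # Signal 0 sends nothing; it only checks that the PID can be signalled.
        os.kill(pid, 0)
    except ProcessLookupError:
        return False
    except PermissionError:
        # The process exists but is owned by another user.
        return True
    return True


print(pid_is_alive(os.getpid()))   # the current process
print(pid_is_alive("autoscaler"))  # a non-integer value is skipped, not passed to os.kill
```

Returning None for non-integer values keeps "not a PID" distinct from "process not running", so the caller can decide how to treat such entries.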
About this issue
- State: closed
- Created 4 years ago
- Comments: 32 (27 by maintainers)
@rkooo567 Great, so the only thing required is to add "autoscaler" here? I can submit a pull request, I would like to practice!
Does anyone have a repro? If this was happening in December, then “autoscaler” was probably not the string breaking this.
Alternatively we could just do
Awesome! I’ll reclose this issue then 😃
Can you try with this change and tell us what the output is? https://github.com/ray-project/ray/pull/14271
I think there might be another problem. I went ahead and changed the same line as in the pull request manually in log_monitor.py on our cluster, and I still get this error:
Maybe there’s some other str? Or maybe there’s another worker_pid instance somewhere outside of worker_pid = int(job_match.group(3)), like in the first fix? I can only think of adding a print statement and reporting what I can find.
Thanks for the immediate action @FarzanT!!!
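On the question of where worker_pid might still end up as a str, here is a rough sketch of the parsing step mentioned above, in which the PID is taken from a regex match on the log file name and converted with int(...). The pattern, the parse_worker_pid helper, and the example file names are assumptions made for illustration; the actual code in log_monitor.py may differ.

```python
import re

# Assumed pattern for illustration only; the real one in log_monitor.py may differ.
JOB_LOG_PATTERN = re.compile(r".*worker-([0-9a-f]+)-(\d+)-(\d+)")


def parse_worker_pid(file_path):
    """Return the PID as an int when the file name matches, else None.

    Hypothetical helper; the point is that the value is never left as a str.
    """
    job_match = JOB_LOG_PATTERN.match(file_path)
    if job_match is None:
        return None
    return int(job_match.group(3))


# Hypothetical file names, used only to exercise the helper.
for path in ["/tmp/logs/worker-0a1b2c-1-4242.out", "/tmp/logs/monitor.log"]:
    pid = parse_worker_pid(path)
    print(path, repr(pid), type(pid).__name__)
```

Printing repr(pid) and its type, as in the loop, is one way to do the print-statement check: any entry that shows up as str rather than int or NoneType would be a candidate for the failing os.kill call.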
Oh, good discovery. I think we recently added autoscaler. cc @wuisawesome Can you confirm?
Perhaps, I’ll disable multithreading and will update you! Thanks
Maybe this? https://github.com/pytorch/pytorch/issues/1551