ray: Log Monitor TypeError

No reproducible script (never seen this before), but this happened in the middle of my training:

2020-12-02 07:26:37,751	WARNING worker.py:1011 -- The log monitor on node ip-172-31-18-179 failed with the following error:
Traceback (most recent call last):
  File "/home/ubuntu/anaconda3/lib/python3.7/site-packages/ray/log_monitor.py", line 354, in <module>
    log_monitor.run()
  File "/home/ubuntu/anaconda3/lib/python3.7/site-packages/ray/log_monitor.py", line 275, in run
    self.open_closed_files()
  File "/home/ubuntu/anaconda3/lib/python3.7/site-packages/ray/log_monitor.py", line 164, in open_closed_files
    self.close_all_files()
  File "/home/ubuntu/anaconda3/lib/python3.7/site-packages/ray/log_monitor.py", line 102, in close_all_files
    os.kill(file_info.worker_pid, 0)
TypeError: an integer is required (got type str)

cc @rkooo567

About this issue

  • State: closed
  • Created 4 years ago
  • Comments: 32 (27 by maintainers)

Most upvoted comments

@rkooo567 Great, so the only thing required is to add "autoscaler" here?

if (file_info.worker_pid != "raylet"
        and file_info.worker_pid != "gcs_server" and file_info.worker_pid != "autoscaler"):
    os.kill(file_info.worker_pid, 0)

I can submit a pull request; I would like the practice!
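For reference, the check proposed above can be sketched as a standalone function. This is only an illustration, not Ray's actual code: the names `NON_PID_NAMES` and `worker_is_alive` are made up here, and the set contains only the three strings mentioned in this thread.

```python
import os

# Hypothetical constant: the only non-numeric worker_pid values named in this thread.
NON_PID_NAMES = {"raylet", "gcs_server", "autoscaler"}


def worker_is_alive(worker_pid):
    """Return True if worker_pid is a named component or a live process."""
    if worker_pid in NON_PID_NAMES:
        # Named component rather than a process id: there is nothing to signal.
        return True
    try:
        # Signal 0 delivers nothing; it only checks that the process exists.
        os.kill(worker_pid, 0)
    except ProcessLookupError:
        return False
    except PermissionError:
        # The process exists but is owned by another user.
        return True
    return True
```

A set membership test keeps the condition on one line and makes it easier to add another name later. Note that any other string would still reach `os.kill` and raise the same TypeError.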

Does anyone have a repro? If this was happening in December, then “autoscaler” was probably not the string breaking this.

Alternatively we could just do

try:
    os.kill(int(file_info.worker_pid), 0)
except ValueError:
    ...  # worker_pid is not numeric (e.g. "raylet"); handling left open
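A slightly fuller sketch of this alternative, assuming the goal is simply to skip anything that is not a numeric pid. The function name `pid_is_alive` and the choice to return None for non-numeric values are assumptions made for illustration, not something taken from Ray.

```python
import os


def pid_is_alive(worker_pid):
    """Return True/False for a numeric pid, or None if worker_pid is not numeric."""
    try:
        pid = int(worker_pid)  # accepts 1234 as well as "1234"
    except (TypeError, ValueError):
        # For example "raylet" or "autoscaler": not a process id at all.
        return None
    try:
        # Signal 0 delivers nothing; it only checks that the process exists.
        os.kill(pid, 0)
    except ProcessLookupError:
        return False
    except PermissionError:
        # The process exists but is owned by another user.
        return True
    return True
```

Unlike a list of known names, this also covers the case where the pid arrives as a numeric string, which would produce the same "an integer is required (got type str)" error.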

Awesome! I’ll reclose this issue then 😃

Can you try this change and tell us what the output is? https://github.com/ray-project/ray/pull/14271

I think there might be another problem. I manually changed the same line as in the pull request in log_monitor.py on our cluster, and I still get this error:

2021-02-22 20:58:27,272 WARNING worker.py:1107 -- The log monitor on node mist007.scinet.local failed with the following error:
Traceback (most recent call last):
  File "/home/l/lstein/ftaj/.conda/envs/drp1/lib/python3.7/site-packages/ray/log_monitor.py", line 359, in <module>
    try:
  File "/home/l/lstein/ftaj/.conda/envs/drp1/lib/python3.7/site-packages/ray/log_monitor.py", line 280, in run
    self.update_log_filenames()
  File "/home/l/lstein/ftaj/.conda/envs/drp1/lib/python3.7/site-packages/ray/log_monitor.py", line 167, in open_closed_files
    # If we can't open any more files. Close all of the files.
  File "/home/l/lstein/ftaj/.conda/envs/drp1/lib/python3.7/site-packages/ray/log_monitor.py", line 102, in close_all_files
    and file_info.worker_pid != "autoscaler"):
TypeError: an integer is required (got type str)

Maybe there’s some other string? Or maybe there’s another worker_pid assignment somewhere that isn’t wrapped in int(...), unlike worker_pid = int(job_match.group(3)) in the first fix? The only thing I can think of is adding a print statement and reporting what I find.
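One way to make that print statement informative is to show both the repr and the type of the value, since a pid stored as the string "1234" looks identical to the integer 1234 in a plain print. A minimal sketch follows; the helper name `describe_pid` is hypothetical.

```python
def describe_pid(worker_pid):
    """Format worker_pid so that a str such as '1234' is distinguishable from the int 1234."""
    return f"worker_pid={worker_pid!r} type={type(worker_pid).__name__}"


# Both of these would look the same with a plain print(worker_pid).
print(describe_pid(1234))
print(describe_pid("1234"))
```

Printing this just before the os.kill call would show whether the offending value is a component name or a numeric string.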

Thanks for the immediate action @FarzanT!!!

Oh good discovery. I think we recently added autoscaler. cc @wuisawesome Can you confirm?

Perhaps. I’ll disable multithreading and update you. Thanks!