ray: [autoscaler] wrongly shuts down all nodes due to one bad node.
The autoscaler tried to take down one idle node (a false positive: the node was running at 100% CPU) but ended up killing every node because of an internal KeyError. It seems to get confused about the node mapping. This is a serious issue, as all my progress was lost. I was using 16 placement groups (one machine each).
2021-02-17 14:55:34,817 INFO monitor.py:207 – :event_summary:Removing 1 nodes of type cpu_48_spot (idle).
2021-02-17 14:55:34,817 INFO monitor.py:207 – :event_summary:Adding 1 nodes of type cpu_48_spot.
2021-02-17 14:55:40,430 INFO load_metrics.py:102 – LoadMetrics: Removed mapping: 172.31.23.116 - 1613573430.7000167
2021-02-17 14:55:40,430 INFO load_metrics.py:109 – LoadMetrics: Removed 1 stale ip mappings: {‘172.31.23.116’} not in {‘172.31.16.240’, ‘172.31.27.173’, ‘172.31.26.163’, ‘172.31.20.177’, ‘172.31.25.79’, ‘172.31.28.159’, ‘172.31.21.227’, ‘172.31.24.131’, ‘172.31.31.164’, ‘172.31.22.24’, ‘172.31.26.41’, ‘172.31.19.126’, ‘172.31.22.66’, ‘172.31.26.13’, ‘172.31.30.105’, ‘172.31.25.157’, ‘172.31.27.26’}
2021-02-17 14:55:40,744 ERROR autoscaler.py:266 – StandardAutoscaler: i-02b77234ffad2072c: Terminating failed to setup/initialize node.
2021-02-17 14:55:40,744 ERROR autoscaler.py:139 – StandardAutoscaler: Error during autoscaling.
Traceback (most recent call last):
File “/home/centos/.local/lib/python3.7/site-packages/ray/autoscaler/_private/autoscaler.py”, line 137, in update
self._update()
File “/home/centos/.local/lib/python3.7/site-packages/ray/autoscaler/_private/autoscaler.py”, line 270, in _update
self._get_node_type(node_id) + " (launch failed).",
File “/home/centos/.local/lib/python3.7/site-packages/ray/autoscaler/_private/autoscaler.py”, line 598, in _get_node_type
node_tags = self.provider.node_tags(node_id)
File “/home/centos/.local/lib/python3.7/site-packages/ray/autoscaler/_private/aws/node_provider.py”, line 170, in node_tags
d1 = self.tag_cache[node_id]
KeyError: ‘i-02b77234ffad2072c’
2021-02-17 14:55:46,909 ERROR autoscaler.py:266 – StandardAutoscaler: i-02b77234ffad2072c: Terminating failed to setup/initialize node.
2021-02-17 14:55:46,909 ERROR autoscaler.py:139 – StandardAutoscaler: Error during autoscaling.
Traceback (most recent call last):
File “/home/centos/.local/lib/python3.7/site-packages/ray/autoscaler/_private/autoscaler.py”, line 137, in update
self._update()
File “/home/centos/.local/lib/python3.7/site-packages/ray/autoscaler/_private/autoscaler.py”, line 270, in _update
self._get_node_type(node_id) + " (launch failed).",
File “/home/centos/.local/lib/python3.7/site-packages/ray/autoscaler/_private/autoscaler.py”, line 598, in _get_node_type
node_tags = self.provider.node_tags(node_id)
File “/home/centos/.local/lib/python3.7/site-packages/ray/autoscaler/_private/aws/node_provider.py”, line 170, in node_tags
d1 = self.tag_cache[node_id]
KeyError: ‘i-02b77234ffad2072c’
2021-02-17 14:55:47,082 INFO monitor.py:207 – :event_summary:Resized to 724 CPUs.
2021-02-17 14:55:52,997 ERROR autoscaler.py:266 – StandardAutoscaler: i-02b77234ffad2072c: Terminating failed to setup/initialize node.
2021-02-17 14:55:52,998 ERROR autoscaler.py:139 – StandardAutoscaler: Error during autoscaling.
Traceback (most recent call last):
File “/home/centos/.local/lib/python3.7/site-packages/ray/autoscaler/_private/autoscaler.py”, line 137, in update
self._update()
File “/home/centos/.local/lib/python3.7/site-packages/ray/autoscaler/_private/autoscaler.py”, line 270, in _update
self._get_node_type(node_id) + " (launch failed).",
File “/home/centos/.local/lib/python3.7/site-packages/ray/autoscaler/_private/autoscaler.py”, line 598, in _get_node_type
node_tags = self.provider.node_tags(node_id)
File “/home/centos/.local/lib/python3.7/site-packages/ray/autoscaler/_private/aws/node_provider.py”, line 170, in node_tags
d1 = self.tag_cache[node_id]
KeyError: ‘i-02b77234ffad2072c’
2021-02-17 14:55:58,965 ERROR autoscaler.py:266 – StandardAutoscaler: i-02b77234ffad2072c: Terminating failed to setup/initialize node.
2021-02-17 14:55:58,965 ERROR autoscaler.py:139 – StandardAutoscaler: Error during autoscaling.
Traceback (most recent call last):
File “/home/centos/.local/lib/python3.7/site-packages/ray/autoscaler/_private/autoscaler.py”, line 137, in update
self._update()
File “/home/centos/.local/lib/python3.7/site-packages/ray/autoscaler/_private/autoscaler.py”, line 270, in _update
self._get_node_type(node_id) + " (launch failed).",
File “/home/centos/.local/lib/python3.7/site-packages/ray/autoscaler/_private/autoscaler.py”, line 598, in _get_node_type
node_tags = self.provider.node_tags(node_id)
File “/home/centos/.local/lib/python3.7/site-packages/ray/autoscaler/_private/aws/node_provider.py”, line 170, in node_tags
d1 = self.tag_cache[node_id]
KeyError: ‘i-02b77234ffad2072c’
2021-02-17 14:56:05,002 ERROR autoscaler.py:266 – StandardAutoscaler: i-02b77234ffad2072c: Terminating failed to setup/initialize node.
2021-02-17 14:56:05,003 ERROR autoscaler.py:139 – StandardAutoscaler: Error during autoscaling.
Traceback (most recent call last):
File “/home/centos/.local/lib/python3.7/site-packages/ray/autoscaler/_private/autoscaler.py”, line 137, in update
self._update()
File “/home/centos/.local/lib/python3.7/site-packages/ray/autoscaler/_private/autoscaler.py”, line 270, in _update
self._get_node_type(node_id) + " (launch failed).",
File “/home/centos/.local/lib/python3.7/site-packages/ray/autoscaler/_private/autoscaler.py”, line 598, in _get_node_type
node_tags = self.provider.node_tags(node_id)
File “/home/centos/.local/lib/python3.7/site-packages/ray/autoscaler/_private/aws/node_provider.py”, line 170, in node_tags
d1 = self.tag_cache[node_id]
KeyError: ‘i-02b77234ffad2072c’
2021-02-17 14:56:10,999 ERROR autoscaler.py:266 – StandardAutoscaler: i-02b77234ffad2072c: Terminating failed to setup/initialize node.
2021-02-17 14:56:11,000 ERROR autoscaler.py:139 – StandardAutoscaler: Error during autoscaling.
Traceback (most recent call last):
File “/home/centos/.local/lib/python3.7/site-packages/ray/autoscaler/_private/autoscaler.py”, line 137, in update
self._update()
File “/home/centos/.local/lib/python3.7/site-packages/ray/autoscaler/_private/autoscaler.py”, line 270, in _update
self._get_node_type(node_id) + " (launch failed).",
File “/home/centos/.local/lib/python3.7/site-packages/ray/autoscaler/_private/autoscaler.py”, line 598, in _get_node_type
node_tags = self.provider.node_tags(node_id)
File “/home/centos/.local/lib/python3.7/site-packages/ray/autoscaler/_private/aws/node_provider.py”, line 170, in node_tags
d1 = self.tag_cache[node_id]
KeyError: ‘i-02b77234ffad2072c’
2021-02-17 14:56:11,001 CRITICAL autoscaler.py:152 – StandardAutoscaler: Too many errors, abort.
2021-02-17 14:56:11,001 ERROR monitor.py:271 – Error in monitor loop
Traceback (most recent call last):
File “/home/centos/.local/lib/python3.7/site-packages/ray/monitor.py”, line 269, in run
self._run()
File “/home/centos/.local/lib/python3.7/site-packages/ray/monitor.py”, line 202, in _run
self.autoscaler.update()
File “/home/centos/.local/lib/python3.7/site-packages/ray/autoscaler/_private/autoscaler.py”, line 154, in update
raise e
File “/home/centos/.local/lib/python3.7/site-packages/ray/autoscaler/_private/autoscaler.py”, line 137, in update
self._update()
File “/home/centos/.local/lib/python3.7/site-packages/ray/autoscaler/_private/autoscaler.py”, line 270, in _update
self._get_node_type(node_id) + " (launch failed).",
File “/home/centos/.local/lib/python3.7/site-packages/ray/autoscaler/_private/autoscaler.py”, line 598, in _get_node_type
node_tags = self.provider.node_tags(node_id)
File “/home/centos/.local/lib/python3.7/site-packages/ray/autoscaler/_private/aws/node_provider.py”, line 170, in node_tags
d1 = self.tag_cache[node_id]
KeyError: ‘i-02b77234ffad2072c’
2021-02-17 14:56:11,002 ERROR autoscaler.py:724 – StandardAutoscaler: kill_workers triggered
2021-02-17 14:56:11,453 ERROR autoscaler.py:729 – StandardAutoscaler: terminated 16 node(s)
2021-02-17 14:56:11,453 INFO monitor.py:250 – Monitor: Exception caught. Taking down workers…
2021-02-17 14:56:11,680 INFO monitor.py:262 – Monitor: Workers taken down.
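The traceback above fails at `self.tag_cache[node_id]`: a plain dictionary lookup for an instance ID that is no longer in the cache. The same exception repeats on every update until the autoscaler aborts. Below is a minimal sketch of a defensive lookup that would tolerate a missing ID. It is not Ray's code or its actual fix. The module-level `tag_cache`, the helper names `get_node_type` and `describe_failed_launches`, the placeholder ID `i-aaaa`, and the tag key `ray-user-node-type` are all assumptions made for illustration.

```python
from typing import Dict, List, Optional

# Hypothetical stand-in for a node provider's tag cache: node ID -> tags.
# The tag key "ray-user-node-type" is assumed here for illustration.
tag_cache: Dict[str, Dict[str, str]] = {
    "i-aaaa": {"ray-user-node-type": "cpu_48_spot"},
}


def get_node_type(node_id: str) -> Optional[str]:
    """Return the node type, or None if the node is not in the cache."""
    tags = tag_cache.get(node_id)  # .get() avoids KeyError for unknown IDs
    if tags is None:
        return None
    return tags.get("ray-user-node-type")


def describe_failed_launches(node_ids: List[str]) -> List[str]:
    """Build one summary message per failed node without raising."""
    messages = []
    for node_id in node_ids:
        node_type = get_node_type(node_id) or "unknown type"
        messages.append(f"Removing 1 nodes of type {node_type} (launch failed).")
    return messages


if __name__ == "__main__":
    # The second ID is the one from the log; it is absent from the cache.
    for line in describe_failed_launches(["i-aaaa", "i-02b77234ffad2072c"]):
        print(line)
```

With a lookup like this, a single node missing from the cache would produce a degraded log message instead of an exception that counts toward the "Too many errors, abort" threshold.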
About this issue
- State: closed
- Created 3 years ago
- Comments: 19 (11 by maintainers)
@jennicetao thanks for the report. What does your workload look like? Do you have unused placement groups in the cluster?
As a short-term fix to unblock yourself, can you set
idle_timeout_minutes: 999999 in your cluster config for now?
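For reference, the suggested workaround might look like the fragment below in the cluster YAML. This is a sketch that assumes `idle_timeout_minutes` is a top-level key of the config, with all other keys omitted. The value is the one quoted in the reply above.

```yaml
# Fragment of a cluster config; all other keys omitted.
# A very large timeout effectively stops idle nodes from being removed.
idle_timeout_minutes: 999999
```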