ray: [release/1.4.1] scalability envelope test_distributed broken since 1.3
What is the problem?
The log output is filled with raylet crash traces like these:
(raylet, ip=172.31.16.191) [2021-06-22 10:17:20,592 E 20503 20503] logging.cc:441: *** Aborted at 1624357040 (unix time) try "date -d @1624357040" if you are using GNU date ***
(raylet, ip=172.31.16.191) [2021-06-22 10:17:20,592 E 20503 20503] logging.cc:441: PC: @ 0x0 (unknown)
(raylet, ip=172.31.16.191) [2021-06-22 10:17:20,593 E 20503 20503] logging.cc:441: *** SIGABRT (@0x3e800005017) received by PID 20503 (TID 0x7f37eafd8840) from PID 20503; stack trace: ***
(raylet, ip=172.31.16.191) [2021-06-22 10:17:20,593 E 20503 20503] logging.cc:441: @ 0x560add24428f google::(anonymous namespace)::FailureSignalHandler()
(raylet, ip=172.31.16.191) [2021-06-22 10:17:20,593 E 20503 20503] logging.cc:441: @ 0x7f37eabd18a0 (unknown)
(raylet, ip=172.31.16.191) [2021-06-22 10:17:20,593 E 20503 20503] logging.cc:441: @ 0x7f37e9cc5f47 gsignal
(raylet, ip=172.31.16.191) [2021-06-22 10:17:20,593 E 20503 20503] logging.cc:441: @ 0x7f37e9cc78b1 abort
(raylet, ip=172.31.16.191) [2021-06-22 10:17:20,594 E 20503 20503] logging.cc:441: @ 0x560add2303fe ray::SpdLogMessage::Flush()
(raylet, ip=172.31.16.191) [2021-06-22 10:17:20,594 E 20503 20503] logging.cc:441: @ 0x560add2304cd ray::RayLog::~RayLog()
(raylet, ip=172.31.16.191) [2021-06-22 10:17:20,595 E 20503 20503] logging.cc:441: @ 0x560add1e512f ray::rpc::GrpcServer::Run()
(raylet, ip=172.31.16.191) [2021-06-22 10:17:20,595 E 20503 20503] logging.cc:441: @ 0x560adcee387c ray::ObjectManager::StartRpcService()
(raylet, ip=172.31.16.191) [2021-06-22 10:17:20,595 E 20503 20503] logging.cc:441: @ 0x560adcef9ef8 ray::ObjectManager::ObjectManager()
(raylet, ip=172.31.16.191) [2021-06-22 10:17:20,596 E 20503 20503] logging.cc:441: @ 0x560adce825d7 ray::raylet::NodeManager::NodeManager()
(raylet, ip=172.31.16.191) [2021-06-22 10:17:20,596 E 20503 20503] logging.cc:441: @ 0x560adce20725 ray::raylet::Raylet::Raylet()
(raylet, ip=172.31.16.191) [2021-06-22 10:17:20,596 E 20503 20503] logging.cc:441: @ 0x560adcde8ae0 _ZZ4mainENKUlN3ray6StatusERKN5boost8optionalISsEEE_clES0_S5_
(raylet, ip=172.31.16.191) [2021-06-22 10:17:20,596 E 20503 20503] logging.cc:441: @ 0x560adcde997b _ZNSt17_Function_handlerIFvN3ray6StatusERKN5boost8optionalISsEEEZ4mainEUlS1_S6_E_E9_M_invokeERKSt9_Any_dataOS1_S6_
(raylet, ip=172.31.16.191) [2021-06-22 10:17:20,596 E 20503 20503] logging.cc:441: @ 0x560adcfbe11d _ZNSt17_Function_handlerIFvRKN3ray6StatusERKNS0_3rpc22GetInternalConfigReplyEEZNS0_3gcs28ServiceBasedNodeInfoAccessor22AsyncGetInternalConfigERKSt8functionIFvS1_RKN5boost8optionalISsEEEEEUlS3_S7_E_E9_M_invokeERKSt9_Any_dataS3_S7_
(raylet, ip=172.31.16.191) [2021-06-22 10:17:20,597 E 20503 20503] logging.cc:441: @ 0x560adcf6dea1 _ZNSt17_Function_handlerIFvRKN3ray6StatusERKNS0_3rpc22GetInternalConfigReplyEEZNS4_12GcsRpcClient17GetInternalConfigERKNS4_24GetInternalConfigRequestERKSt8functionIS8_EEUlS3_S7_E_E9_M_invokeERKSt9_Any_dataS3_S7_
(raylet, ip=172.31.16.191) [2021-06-22 10:17:20,597 E 20503 20503] logging.cc:441: @ 0x560adcf74372 ray::rpc::ClientCallImpl<>::OnReplyReceived()
(raylet, ip=172.31.16.191) [2021-06-22 10:17:20,598 E 20503 20503] logging.cc:441: @ 0x560adce52b9b _ZNSt17_Function_handlerIFvvEZN3ray3rpc17ClientCallManager29PollEventsFromCompletionQueueEiEUlvE_E9_M_invokeERKSt9_Any_data
(raylet, ip=172.31.16.191) [2021-06-22 10:17:20,598 E 20503 20503] logging.cc:441: @ 0x560add1e8478 boost::asio::detail::completion_handler<>::do_complete()
(raylet, ip=172.31.16.191) [2021-06-22 10:17:20,599 E 20503 20503] logging.cc:441: @ 0x560add5b97d1 boost::asio::detail::scheduler::do_run_one()
(raylet, ip=172.31.16.191) [2021-06-22 10:17:20,600 E 20503 20503] logging.cc:441: @ 0x560add5b9901 boost::asio::detail::scheduler::run()
(raylet, ip=172.31.16.191) [2021-06-22 10:17:20,601 E 20503 20503] logging.cc:441: @ 0x560add5bbb00 boost::asio::io_context::run()
(raylet, ip=172.31.16.191) [2021-06-22 10:17:20,601 E 20503 20503] logging.cc:441: @ 0x560adcdb21a0 main
(raylet, ip=172.31.16.191) [2021-06-22 10:17:20,601 E 20503 20503] logging.cc:441: @ 0x7f37e9ca8b97 __libc_start_main
(raylet, ip=172.31.16.191) [2021-06-22 10:17:20,602 E 20503 20503] logging.cc:441: @ 0x560adcdd19c5 (unknown)
(raylet, ip=172.31.16.191) E0622 10:12:52.244828773 19159 server_chttp2.cc:40] {"created":"@1624356772.244770874","description":"No address added out of total 1 resolved","file":"external/com_github_grpc_grpc/src/core/ext/transport/chttp2/server/chttp2_server.cc","file_line":394,"referenced_errors":[{"created":"@1624356772.244769414","description":"Failed to add any wildcard listeners","file":"external/com_github_grpc_grpc/src/core/lib/iomgr/tcp_server_posix.cc","file_line":340,"referenced_errors":[{"created":"@1624356772.244756446","description":"Unable to configure socket","fd":27,"file":"external/com_github_grpc_grpc/src/core/lib/iomgr/tcp_server_utils_posix_common.cc","file_line":207,"referenced_errors":[{"created":"@1624356772.244748951","description":"Address already in use","errno":98,"file":"external/com_github_grpc_grpc/src/core/lib/iomgr/tcp_server_utils_posix_common.cc","file_line":181,"os_error":"Address already in use","syscall":"bind"}]},{"created":"@1624356772.244769010","description":"Unable to configure socket","fd":27,"file":"external/com_github_grpc_grpc/src/core/lib/iomgr/tcp_server_utils_posix_common.cc","file_line":207,"referenced_errors":[{"created":"@1624356772.244766816","description":"Address already in use","errno":98,"file":"external/com_github_grpc_grpc/src/core/lib/iomgr/tcp_server_utils_posix_common.cc","file_line":181,"os_error":"Address already in use","syscall":"bind"}]}]}]}
(raylet, ip=172.31.16.191) [2021-06-22 10:12:52,244 C 19159 19159] grpc_server.cc:61: Check failed: server_ Failed to start the grpc server. The specified port is 8076. This means that Ray's core components will not be able to function correctly. If the server startup error message is `Address already in use`, it indicates the server fails to start because the port is already used by other processes (such as --node-manager-port, --object-manager-port, --gcs-server-port, and ports between --min-worker-port, --max-worker-port). Try running lsof -i :8076 to check if there are other processes listening to the port.
(raylet, ip=172.31.16.191) [2021-06-22 10:12:52,244 E 19159 19159] logging.cc:441: *** Aborted at 1624356772 (unix time) try "date -d @1624356772" if you are using GNU date ***
(raylet, ip=172.31.16.191) [2021-06-22 10:12:52,245 E 19159 19159] logging.cc:441: PC: @ 0x0 (unknown)
(raylet, ip=172.31.16.191) [2021-06-22 10:12:52,245 E 19159 19159] logging.cc:441: *** SIGABRT (@0x3e800004ad7) received by PID 19159 (TID 0x7f4317da3840) from PID 19159; stack trace: ***
(raylet, ip=172.31.16.191) [2021-06-22 10:12:52,245 E 19159 19159] logging.cc:441: @ 0x559ce75d528f google::(anonymous namespace)::FailureSignalHandler()
(raylet, ip=172.31.16.191) [2021-06-22 10:12:52,245 E 19159 19159] logging.cc:441: @ 0x7f431799c8a0 (unknown)
(raylet, ip=172.31.16.191) [2021-06-22 10:12:52,245 E 19159 19159] logging.cc:441: @ 0x7f4316a90f47 gsignal
(raylet, ip=172.31.16.191) [2021-06-22 10:12:52,245 E 19159 19159] logging.cc:441: @ 0x7f4316a928b1 abort
(raylet, ip=172.31.16.191) [2021-06-22 10:12:52,246 E 19159 19159] logging.cc:441: @ 0x559ce75c13fe ray::SpdLogMessage::Flush()
(raylet, ip=172.31.16.191) [2021-06-22 10:12:52,246 E 19159 19159] logging.cc:441: @ 0x559ce75c14cd ray::RayLog::~RayLog()
(raylet, ip=172.31.16.191) [2021-06-22 10:12:52,247 E 19159 19159] logging.cc:441: @ 0x559ce757612f ray::rpc::GrpcServer::Run()
(raylet, ip=172.31.16.191) [2021-06-22 10:12:52,247 E 19159 19159] logging.cc:441: @ 0x559ce727487c ray::ObjectManager::StartRpcService()
(raylet, ip=172.31.16.191) [2021-06-22 10:12:52,248 E 19159 19159] logging.cc:441: @ 0x559ce728aef8 ray::ObjectManager::ObjectManager()
(raylet, ip=172.31.16.191) [2021-06-22 10:12:52,248 E 19159 19159] logging.cc:441: @ 0x559ce72135d7 ray::raylet::NodeManager::NodeManager()
(raylet, ip=172.31.16.191) [2021-06-22 10:12:52,248 E 19159 19159] logging.cc:441: @ 0x559ce71b1725 ray::raylet::Raylet::Raylet()
(raylet, ip=172.31.16.191) [2021-06-22 10:12:52,248 E 19159 19159] logging.cc:441: @ 0x559ce7179ae0 _ZZ4mainENKUlN3ray6StatusERKN5boost8optionalISsEEE_clES0_S5_
(raylet, ip=172.31.16.191) [2021-06-22 10:12:52,248 E 19159 19159] logging.cc:441: @ 0x559ce717a97b _ZNSt17_Function_handlerIFvN3ray6StatusERKN5boost8optionalISsEEEZ4mainEUlS1_S6_E_E9_M_invokeERKSt9_Any_dataOS1_S6_
(raylet, ip=172.31.16.191) [2021-06-22 10:12:52,248 E 19159 19159] logging.cc:441: @ 0x559ce734f11d _ZNSt17_Function_handlerIFvRKN3ray6StatusERKNS0_3rpc22GetInternalConfigReplyEEZNS0_3gcs28ServiceBasedNodeInfoAccessor22AsyncGetInternalConfigERKSt8functionIFvS1_RKN5boost8optionalISsEEEEEUlS3_S7_E_E9_M_invokeERKSt9_Any_dataS3_S7_
(raylet, ip=172.31.16.191) [2021-06-22 10:12:52,249 E 19159 19159] logging.cc:441: @ 0x559ce72feea1 _ZNSt17_Function_handlerIFvRKN3ray6StatusERKNS0_3rpc22GetInternalConfigReplyEEZNS4_12GcsRpcClient17GetInternalConfigERKNS4_24GetInternalConfigRequestERKSt8functionIS8_EEUlS3_S7_E_E9_M_invokeERKSt9_Any_dataS3_S7_
(raylet, ip=172.31.16.191) [2021-06-22 10:12:52,249 E 19159 19159] logging.cc:441: @ 0x559ce7305372 ray::rpc::ClientCallImpl<>::OnReplyReceived()
(raylet, ip=172.31.16.191) [2021-06-22 10:12:52,250 E 19159 19159] logging.cc:441: @ 0x559ce71e3b9b _ZNSt17_Function_handlerIFvvEZN3ray3rpc17ClientCallManager29PollEventsFromCompletionQueueEiEUlvE_E9_M_invokeERKSt9_Any_data
(raylet, ip=172.31.16.191) [2021-06-22 10:12:52,251 E 19159 19159] logging.cc:441: @ 0x559ce7579478 boost::asio::detail::completion_handler<>::do_complete()
(raylet, ip=172.31.16.191) [2021-06-22 10:12:52,252 E 19159 19159] logging.cc:441: @ 0x559ce794a7d1 boost::asio::detail::scheduler::do_run_one()
(raylet, ip=172.31.16.191) [2021-06-22 10:12:52,253 E 19159 19159] logging.cc:441: @ 0x559ce794a901 boost::asio::detail::scheduler::run()
(raylet, ip=172.31.16.191) [2021-06-22 10:12:52,253 E 19159 19159] logging.cc:441: @ 0x559ce794cb00 boost::asio::io_context::run()
(raylet, ip=172.31.16.191) [2021-06-22 10:12:52,253 E 19159 19159] logging.cc:441: @ 0x559ce71431a0 main
(raylet, ip=172.31.16.191) [2021-06-22 10:12:52,254 E 19159 19159] logging.cc:441: @ 0x7f4316a73b97 __libc_start_main
(raylet, ip=172.31.16.191) [2021-06-22 10:12:52,255 E 19159 19159] logging.cc:441: @ 0x559ce71629c5 (unknown)
(autoscaler +9m35s) Resized to 4096 CPUs.
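The check-failure message above suggests running `lsof -i :8076` on the affected node. For anyone triaging the same crash, here is a minimal Python sketch of a comparable check, assuming it is run on the node that reported the failure. `port_is_free` is a hypothetical helper written for this illustration, not part of Ray, and it only reports whether a plain TCP bind succeeds at that moment.

```python
import socket


def port_is_free(port: int, host: str = "0.0.0.0") -> bool:
    """Return True if a plain TCP socket can bind to (host, port) right now.

    Hypothetical helper for illustration only; not a Ray API.
    """
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as sock:
        try:
            sock.bind((host, port))
        except OSError:
            # Covers EADDRINUSE (errno 98 in the log above) and other bind errors.
            return False
    return True


if __name__ == "__main__":
    port = 8076  # the port named in the raylet check-failure message
    print(f"port {port} free: {port_is_free(port)}")
```

This check is conservative: it does not set `SO_REUSEADDR`, so it can report a port as busy in cases where a server that sets that option would still bind successfully. It also cannot say which process holds the port, which is what `lsof` is for.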
About this issue
- State: closed
- Created 3 years ago
- Comments: 31 (27 by maintainers)
Can we re-run this in 1.4?
Can we try rerunning this? If it only failed on one node, it might just be unlucky random number generation.