secretflow: Get data timeout, key=root:110:ALLGATHER
Issue Type
Others
Source
binary
Secretflow Version
latest
OS Platform and Distribution
ubuntu 18.04
Python version
3.8.13
Bazel version
No response
GCC/Compiler version
No response
What happened and what you expected to happen.
2022-07-28 16:16:13,219 ERROR worker.py:94 -- Unhandled error (suppress with 'RAY_IGNORE_UNHANDLED_ERRORS=1'): ray::SPURuntime.run() (pid=13081, ip=10.100.82.74, repr=<secretflow.device.device.spu.SPURuntime object at 0x7f1fd47b1220>)
File "/home/ops/anaconda3/envs/secretflow/lib/python3.8/site-packages/secretflow/device/device/spu.py", line 224, in run
self.runtime.run(executable)
File "/home/ops/anaconda3/envs/secretflow/lib/python3.8/site-packages/spu/binding/api.py", line 43, in run
return self._vm.Run(executable.SerializeToString())
RuntimeError: what:
[external/yasl/yasl/link/transport/channel.cc:86] Get data timeout, key=root:110:ALLGATHER
stacktrace:
#0 yasl::link::Context::RecvInternal()+0x7f202eb100b2
#1 yasl::link::AllGatherImpl<>()+0x7f202e9c8785
#2 yasl::link::AllGather()+0x7f202e9c8cb4
#3 spu::mpc::Communicator::allReduce()+0x7f202e2c7a37
#4 spu::mpc::semi2k::B2A_Randbit::proc()::{lambda()#1}::operator()()::{lambda()#3}::operator()()+0x7f202e2bd9f2
#5 spu::mpc::semi2k::B2A_Randbit::proc()+0x7f202e2c0a89
#6 spu::mpc::UnaryKernel::evaluate()+0x7f202e19efdb
#7 spu::mpc::Object::call<>()+0x7f202e2c60b8
#8 spu::mpc::(anonymous namespace)::_Lazy2A()+0x7f202e2dfb19
#9 spu::mpc::ABProtAddSP::proc()+0x7f202e2e019b
#10 spu::mpc::BinaryKernel::evaluate()+0x7f202e19f2f2
#11 spu::mpc::Object::call<>()+0x7f202e2c6866
#12 spu::mpc::add_sp()+0x7f202e2c6994
#13 spu::hal::_add_sp()+0x7f202e171b63
#14 spu::hal::_add()+0x7f202e167486
#15 spu::hal::_popcount()+0x7f202e168b8c
Reproduction code to reproduce the issue.
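I hit the error above while running three-party logistic regression. It seems to be related to the amount of training data. If the code is left unchanged, is the only way to address this to upgrade the machine configuration or add compute nodes?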
About this issue
- Original URL
- State: closed
- Created 2 years ago
- Comments: 16 (8 by maintainers)
Commits related to this issue
- Fix build (#70) * Fix build * Update yacl again — committed to ElleryQu/secretflow by anakinxc a year ago
Hi @mingo0117,
First, you need to enable the corresponding logging through the SPU config:
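One possible shape for such a config, as a sketch only. It assumes the relevant trace switches are the `enable_action_trace` and `enable_pphlo_trace` fields of SPU's runtime config, which should be checked against the installed SPU version. The party names, addresses, node `id` values, and the helper `build_cluster_def` are hypothetical. Protocol and field are written as plain names so the sketch runs without SPU installed; SecretFlow examples normally pass `spu.spu_pb2.SEMI2K` and `spu.spu_pb2.FM128` instead.

```python
from typing import Dict, List


def build_cluster_def(parties: List[str], base_port: int = 12345) -> Dict:
    """Build an SPU cluster definition with tracing enabled (hypothetical helper)."""
    # One node entry per party; addresses here are local placeholders.
    nodes = [
        {
            "party": name,
            "id": f"local:{rank}",
            "address": f"127.0.0.1:{base_port + rank}",
        }
        for rank, name in enumerate(parties)
    ]
    runtime_config = {
        # The stack trace in this issue shows the semi2k protocol in use.
        "protocol": "SEMI2K",  # assumption: replace with spu.spu_pb2.SEMI2K
        "field": "FM128",      # assumption: replace with spu.spu_pb2.FM128
        # Assumed trace switches; verify the field names for your SPU version.
        "enable_action_trace": True,
        "enable_pphlo_trace": True,
    }
    return {"nodes": nodes, "runtime_config": runtime_config}


cluster_def = build_cluster_def(["alice", "bob", "carol"])
# With SecretFlow installed, this dict would be passed on, e.g. sf.SPU(cluster_def).
```

The trace output is produced inside the SPU worker processes, so it only becomes visible on the driver once worker logs are forwarded.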
Then, you need to turn on log_to_driver when calling secretflow init, similar to
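A minimal sketch of that call, assuming `sf.init` accepts the party list as `parties` and forwards `log_to_driver` to Ray (where it is a documented `ray.init` option that streams worker output to the driver). The party names and both helper functions are hypothetical, and any other arguments your deployment needs (cluster address, CPU count) are left out.

```python
from typing import Any, Dict, List


def make_init_kwargs(parties: List[str]) -> Dict[str, Any]:
    """Collect the keyword arguments for sf.init (hypothetical helper)."""
    return {
        "parties": list(parties),
        # Forward worker stdout/stderr, including SPU traces, to the driver.
        "log_to_driver": True,
    }


def init_with_worker_logs(parties: List[str]) -> None:
    """Initialise SecretFlow with worker logs forwarded (hypothetical helper)."""
    # Imported lazily so make_init_kwargs stays usable without SecretFlow.
    import secretflow as sf

    sf.init(**make_init_kwargs(parties))


init_kwargs = make_init_kwargs(["alice", "bob", "carol"])
```

Calling `init_with_worker_logs(["alice", "bob", "carol"])` before creating the SPU device should then surface the worker-side log lines in the driver's console.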