rsmpi: Immediate send/recvs hang when the packets being sent are large
Say I have 4 MPI processes labelled: P0, P1, P2, P3. Each process potentially has packets to send to other processes, but may not.
For example, P0 needs to send packets to P1 and P2, or
P0 -> [P1, P2]
Similarly,
P1 -> [P3]
P2 -> []
P3 -> [P1]
So P1 has to receive potential packets from both P0 and P3, and P3 has to receive packets from P1, and P2 from P0.
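The pattern above can be tabulated with a few lines of plain Rust. This is only a sketch that runs on one process, with no MPI calls: the function names receive_sources and receive_counts are hypothetical, and the loop over all ranks stands in for what a sum reduction across ranks would compute.

```rust
/// For each rank, list the ranks that send to it, given every rank's
/// destination list (dests[r] holds the ranks that rank r sends to).
fn receive_sources(dests: &[Vec<usize>]) -> Vec<Vec<usize>> {
    let mut sources = vec![Vec::new(); dests.len()];
    for (src, ds) in dests.iter().enumerate() {
        for &dst in ds {
            sources[dst].push(src);
        }
    }
    sources
}

/// Number of messages each rank should expect. Each rank contributes one
/// increment per destination; summing those contributions over all ranks
/// is the single-process equivalent of a sum reduction.
fn receive_counts(dests: &[Vec<usize>]) -> Vec<usize> {
    let mut counts = vec![0; dests.len()];
    for ds in dests {
        for &dst in ds {
            counts[dst] += 1;
        }
    }
    counts
}
```

For the four-process example, receive_sources gives [], [0, 3], [0], [1] for P0 to P3, and receive_counts gives [0, 2, 1, 1].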
After doing an all-reduce to calculate the receive counts at each process, I’m sending the packets as follows:
for (i, packet) in packets.iter().enumerate() {
    let partner_process = world.process_at_rank(packet_destinations[i]);
    mpi::request::scope(|scope| {
        let _sreq = WaitGuard::from(partner_process.immediate_send(scope, &packet[..]));
    });
}
for (i, &recv_rank) in received_packet_sources.iter().enumerate() {
    let partner_process = world.process_at_rank(recv_rank);
    mpi::request::scope(|scope| {
        let _rreq = WaitGuard::from(partner_process.immediate_receive_into(scope, &mut buffers[i][..]));
    });
}
Here the send destinations and receive sources will in general differ from process to process. This code works when the packets are small (fewer than 100 elements), but it hangs for larger packets.
The equivalent C code would be fine: there would be no wait guard, but you’d add a wait-all (MPI_Waitall) after the two for loops. How do I replicate this in Rust? I would appreciate any pointers.
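A likely reason for the hang, offered as an assumption rather than something confirmed in this thread: each WaitGuard completes its request when its scope closes, so every send is waited on before any receive is posted. MPI implementations commonly buffer small messages but make large sends wait for a matching receive, so the P1/P3 cycle can block on both sides. The sketch below models this with the standard library only (no rsmpi calls). Threads stand in for ranks, zero-capacity channels stand in for large-message sends, and the hypothetical exchange function posts all sends and receives inside one scope before waiting on any of them, which is the wait-all structure described for C.

```rust
use std::sync::mpsc::{sync_channel, Receiver, SyncSender};
use std::thread;

/// Model of the exchange: dests[r] lists the ranks that rank r sends to.
/// Returns, per rank, the sorted (source, payload length) pairs received.
fn exchange(dests: &[Vec<usize>], packet_len: usize) -> Vec<Vec<(usize, usize)>> {
    let n = dests.len();
    let mut senders: Vec<Vec<SyncSender<Vec<u64>>>> = (0..n).map(|_| Vec::new()).collect();
    let mut receivers: Vec<Vec<(usize, Receiver<Vec<u64>>)>> =
        (0..n).map(|_| Vec::new()).collect();
    for (src, ds) in dests.iter().enumerate() {
        for &dst in ds {
            // Zero-capacity channel: a send blocks until a receive takes the value.
            let (tx, rx) = sync_channel(0);
            senders[src].push(tx);
            receivers[dst].push((src, rx));
        }
    }
    // One thread per rank.
    let ranks: Vec<thread::JoinHandle<Vec<(usize, usize)>>> = senders
        .into_iter()
        .zip(receivers)
        .enumerate()
        .map(|(rank, (txs, rxs))| {
            thread::spawn(move || {
                thread::scope(|s| {
                    // Post every send without waiting on it.
                    for tx in txs {
                        s.spawn(move || tx.send(vec![rank as u64; packet_len]).unwrap());
                    }
                    // Post every receive without waiting on it.
                    let pending: Vec<_> = rxs
                        .into_iter()
                        .map(|(src, rx)| s.spawn(move || (src, rx.recv().unwrap().len())))
                        .collect();
                    // Wait-all: only now block until the receives complete.
                    // The scope joins the outstanding sends when it closes.
                    let mut got: Vec<(usize, usize)> =
                        pending.into_iter().map(|h| h.join().unwrap()).collect();
                    got.sort();
                    got
                })
            })
        })
        .collect();
    ranks.into_iter().map(|h| h.join().unwrap()).collect()
}
```

Calling exchange with the four-process pattern above and a packet length of 100000 returns without blocking. In rsmpi the analogous structure would presumably be a single mpi::request::scope that keeps every request alive until both loops have run, but that variant is not shown or tested here.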
About this issue
- Original URL
- State: open
- Created a year ago
- Comments: 16 (7 by maintainers)
Just replace immediate_ready_send with immediate_send.
Yeah, that’s what I figured so I added this comment. (I’ll merge this with some more housekeeping bits, which will probably include making ready-mode unsafe.) https://github.com/rsmpi/rsmpi/commit/1e5eeabc7f2b7f55976d3bc8df0867a229d03425#diff-08ccf39fba6d59419fdacc6fa9fb0003071407cc2792c0a8811369aea0fc57d1R34-R36
See also #182 – I’m inclined to make ready send unsafe in the next release.
In your example, the receive remote_op needs to actually have the size. In this case, that looks like vec![0; 12] instead of Vec::new(). Also, you can’t use *ready_send unless you can guarantee that the receiver has already posted a matching receive. That’s a race condition here, so the present code is noncompliant. I think it’s okay once you fix those two things.
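The point about the receive buffer size can be illustrated without MPI. In this sketch, receive_into is a hypothetical stand-in for a receive-into call, not an rsmpi function, and the i32 element type is an assumption. It only shows that a slice taken from Vec::new() (or from Vec::with_capacity(12)) exposes zero elements, while vec![0; 12] exposes twelve.

```rust
/// Lengths of three candidate receive buffers: empty, reserved-only, and sized.
fn buffer_lengths() -> (usize, usize, usize) {
    let empty: Vec<i32> = Vec::new();
    // Capacity reserves memory but leaves the length at zero.
    let reserved: Vec<i32> = Vec::with_capacity(12);
    let sized: Vec<i32> = vec![0; 12];
    (empty.len(), reserved.len(), sized.len())
}

/// Hypothetical stand-in for a receive-into call: copies `incoming` into `buf`
/// and fails if the buffer exposes fewer elements than the message holds.
fn receive_into(buf: &mut [i32], incoming: &[i32]) -> Result<usize, String> {
    if incoming.len() > buf.len() {
        return Err(format!(
            "message of {} elements does not fit a buffer of {}",
            incoming.len(),
            buf.len()
        ));
    }
    buf[..incoming.len()].copy_from_slice(incoming);
    Ok(incoming.len())
}
```

With a 12-element message, receive_into fails for a buffer created with Vec::new() and succeeds for one created with vec![0; 12].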