rsmpi: Immediate send/recvs hang when the packets being sent are large

Say I have 4 MPI processes labelled P0, P1, P2 and P3. Each process may or may not have packets to send to other processes.

For example, P0 needs to send packets to P1 and P2, or

    P0->[P1, P2]

Similarly,

    P1->[P3]
    P2->[]
    P3->[P1]

So P1 has to receive packets from both P0 and P3, P3 has to receive packets from P1, and P2 has to receive packets from P0.
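To sanity-check the pattern above, the receive counts can be worked out serially. This is only a sketch: `receive_counts` is a hypothetical helper, not part of rsmpi, and it mimics what an element-wise sum all-reduce would produce if every rank contributed a vector with a 1 at each of its destinations.

```rust
/// Hypothetical helper (not an rsmpi API): `destinations[r]` lists the ranks
/// that rank `r` sends to. The result holds, for each rank, how many packets
/// it should expect to receive.
fn receive_counts(destinations: &[Vec<usize>]) -> Vec<u32> {
    let mut counts = vec![0u32; destinations.len()];
    for dests in destinations {
        // One rank's contribution: add 1 at every destination it sends to.
        for &dest in dests {
            counts[dest] += 1;
        }
    }
    counts
}
```

For the example pattern, calling `receive_counts(&[vec![1, 2], vec![3], vec![], vec![1]])` yields one count per rank, P0 through P3.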

After doing an all-reduce to calculate the receive counts at each process, I’m sending the packets as follows:

    for (i, packet) in packets.iter().enumerate() {
        let partner_process = world.process_at_rank(packet_destinations[i]);
        mpi::request::scope(|scope| {
            let _sreq = WaitGuard::from(partner_process.immediate_send(scope, &packet[..]));
        });
    }

    for (i, &recv_rank) in received_packet_sources.iter().enumerate() {
        let partner_process = world.process_at_rank(recv_rank);
        mpi::request::scope(|scope| {
            let _rreq = WaitGuard::from(partner_process.immediate_receive_into(scope, &mut buffers[i][..]));
        }); 
    }

In general, each process sends to and receives from a different set of processes. This code works when the packets are quite small (<100 elements), but it hangs for larger packet sizes.

The equivalent C code would be fine: there would be no wait guard, and you'd add a wait-all after the two for loops. How do I replicate this in Rust? I would appreciate any pointers.

About this issue

State: open
Created a year ago
Comments: 16 (7 by maintainers)

Most upvoted comments

Just replace immediate_ready_send with immediate_send.

jedbrown on Apr 15, 2024

Yeah, that’s what I figured so I added this comment. (I’ll merge this with some more housekeeping bits, which will probably include making ready-mode unsafe.) https://github.com/rsmpi/rsmpi/commit/1e5eeabc7f2b7f55976d3bc8df0867a229d03425#diff-08ccf39fba6d59419fdacc6fa9fb0003071407cc2792c0a8811369aea0fc57d1R34-R36

jedbrown on Apr 15, 2024

See also #182 – I’m inclined to make ready send unsafe in the next release.

jedbrown on Apr 15, 2024

In your example, the receive remote_op needs to actually have the size. In this case, that looks like vec![0; 12] instead of Vec::new(). Also, you can’t use *ready_send unless you can guarantee that the receiver has already posted a matching receive. That’s a race condition here, so the present code is noncompliant. I think it’s okay once you fix those two things.

jedbrown on Apr 14, 2024