mercury: Failure with OFI/PSM2

When running the test server/client with PSM2 and OFI, the server fails once the client has started communicating. I start the test server with:

mpirun -n 1 ./hg_test_server --comm ofi --protocol psm2

and the test client with:

mpirun -n 2 ./hg_test_perf --comm ofi --protocol psm2

I am using Intel MPI, and have done the following:

export PSM2_MULTI_EP=1

(without the above variable set, the server will not start). Both the client and server are running on the same node for this test.
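
For anyone trying to reproduce this, the steps above can be collected into one script. This is only a sketch: it assumes hg_test_server and hg_test_perf are in the current directory, and the `run` helper, the `RUN` switch, and the 5-second delay are made up for illustration and are not part of mercury. By default it only prints the commands; set RUN=1 to actually launch them.

```shell
#!/bin/sh
# Reproduction sketch for the commands quoted in this report.
# Assumption: hg_test_server and hg_test_perf are in the current directory.
# By default the commands are only printed; set RUN=1 to execute them.

# Without this variable the server does not start (see above).
export PSM2_MULTI_EP=1

# Hypothetical helper: execute the command if RUN=1, otherwise print it.
run() {
    if [ "${RUN:-0}" = "1" ]; then
        "$@"
    else
        echo "$*"
    fi
}

server_cmd() {
    run mpirun -n 1 ./hg_test_server --comm ofi --protocol psm2
}

client_cmd() {
    run mpirun -n 2 ./hg_test_perf --comm ofi --protocol psm2
}

if [ "${RUN:-0}" = "1" ]; then
    server_cmd &            # server in the background, same node as the client
    server_pid=$!
    sleep 5                 # assumed delay so the server can print its info string
    client_cmd
    wait "$server_pid"      # collect the server's exit status
else
    server_cmd
    client_cmd
fi
```

Running it without RUN set is a quick way to check the exact command lines before launching anything under MPI.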

When I run the client, it connects to the server and sends some data, but the server then fails with errors like this:

# Using info string: ofi+psm2://localhost:22222
# Waiting for client...
# NA -- Error -- /home/nx02/nx02/modules/central-package-location/mercury/1.0.1/source/mercury-1.0.1/src/na/na_ofi.c:2623
 # na_ofi_op_id_valid(): invalid magic number for na_ofi_op_id.
# NA -- Error -- /home/nx02/nx02/modules/central-package-location/mercury/1.0.1/source/mercury-1.0.1/src/na/na_ofi.c:2793
 # na_ofi_cq_process_event(): Bad na_ofi_op_id, ignoring event.
# NA -- Error -- /home/nx02/nx02/modules/central-package-location/mercury/1.0.1/source/mercury-1.0.1/src/na/na_ofi.c:4783
 # na_ofi_progress(): Could not process event
# HG -- Error -- /home/nx02/nx02/modules/central-package-location/mercury/1.0.1/source/mercury-1.0.1/src/mercury_core.c:3019
 # hg_core_progress_na_cb(): Could not make progress on NA
# HG Util -- Error -- /home/nx02/nx02/modules/central-package-location/mercury/1.0.1/source/mercury-1.0.1/src/util/mercury_poll.c:469
 # hg_poll_wait(): poll cb failed
# HG -- Error -- /home/nx02/nx02/modules/central-package-location/mercury/1.0.1/source/mercury-1.0.1/src/mercury_core.c:3260
 # hg_core_progress_poll(): hg_poll_wait() failed
# HG -- Error -- /home/nx02/nx02/modules/central-package-location/mercury/1.0.1/source/mercury-1.0.1/src/mercury_core.c:4826
 # HG_Core_progress(): Could not make progress
# NA -- Error -- /home/nx02/nx02/modules/central-package-location/mercury/1.0.1/source/mercury-1.0.1/src/na/na_ofi.c:4824
 # na_ofi_cancel(): fi_cancel unexpected recv failed, rc: -11(Resource temporarily unavailable).

Any ideas what’s going wrong?

About this issue

  • State: closed
  • Created 5 years ago
  • Reactions: 3
  • Comments: 18 (9 by maintainers)

Most upvoted comments

Yeah, the upstream spack package doesn’t work right for newer versions of psm2. I’ve got a PR open, but until it lands I’m using my fork/branch for these tests:

https://github.com/spack/spack/pull/11658