mercury: Failure with OFI/PSM2
When running the test server and client with OFI and PSM2, the server fails once the client starts communicating. I’m using:
mpirun -n 1 ./hg_test_server --comm ofi --protocol psm2
to start the test server, and:
mpirun -n 2 ./hg_test_perf --comm ofi --protocol psm2
to start the test client.
I am using Intel MPI, and have done the following:
export PSM2_MULTI_EP=1
(Without this variable set, the server will not start.) Both the client and server are running on the same node for this test.
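For anyone trying to reproduce this, the setup above can be collected into one script. This is only a sketch: the `mpirun` commands and `PSM2_MULTI_EP=1` are taken from this report, while `DRY_RUN`, `run`, `start_server` and `start_client` are hypothetical helper names introduced here. By default it only prints the commands.

```shell
#!/bin/sh
# Reproduction sketch for the setup described above (server and client on one node).
# DRY_RUN, run, start_server and start_client are hypothetical helpers,
# not part of Mercury, libfabric or Intel MPI.
DRY_RUN="${DRY_RUN:-1}"   # 1 = print the command only, 0 = actually run it

run() {
    if [ "$DRY_RUN" = "1" ]; then
        echo "$*"
    else
        "$@"
    fi
}

# Set as in the report; without it the server did not start.
export PSM2_MULTI_EP=1

start_server() {
    run mpirun -n 1 ./hg_test_server --comm ofi --protocol psm2
}

start_client() {
    run mpirun -n 2 ./hg_test_perf --comm ofi --protocol psm2
}

# Pick a role from the first argument; with no argument nothing is started.
case "${1:-}" in
    server) start_server ;;
    client) start_client ;;
esac
```

Saved under a name such as `repro.sh` (also hypothetical), this would be run as `DRY_RUN=0 sh repro.sh server` in one terminal and `DRY_RUN=0 sh repro.sh client` in another, since the server stays in the foreground.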
If I run the client it connects to the server and sends some data but the server then fails with errors like this:
# Using info string: ofi+psm2://localhost:22222
# Waiting for client...
# NA -- Error -- /home/nx02/nx02/modules/central-package-location/mercury/1.0.1/source/mercury-1.0.1/src/na/na_ofi.c:2623
# na_ofi_op_id_valid(): invalid magic number for na_ofi_op_id.
# NA -- Error -- /home/nx02/nx02/modules/central-package-location/mercury/1.0.1/source/mercury-1.0.1/src/na/na_ofi.c:2793
# na_ofi_cq_process_event(): Bad na_ofi_op_id, ignoring event.
# NA -- Error -- /home/nx02/nx02/modules/central-package-location/mercury/1.0.1/source/mercury-1.0.1/src/na/na_ofi.c:4783
# na_ofi_progress(): Could not process event
# HG -- Error -- /home/nx02/nx02/modules/central-package-location/mercury/1.0.1/source/mercury-1.0.1/src/mercury_core.c:3019
# hg_core_progress_na_cb(): Could not make progress on NA
# HG Util -- Error -- /home/nx02/nx02/modules/central-package-location/mercury/1.0.1/source/mercury-1.0.1/src/util/mercury_poll.c:469
# hg_poll_wait(): poll cb failed
# HG -- Error -- /home/nx02/nx02/modules/central-package-location/mercury/1.0.1/source/mercury-1.0.1/src/mercury_core.c:3260
# hg_core_progress_poll(): hg_poll_wait() failed
# HG -- Error -- /home/nx02/nx02/modules/central-package-location/mercury/1.0.1/source/mercury-1.0.1/src/mercury_core.c:4826
# HG_Core_progress(): Could not make progress
# NA -- Error -- /home/nx02/nx02/modules/central-package-location/mercury/1.0.1/source/mercury-1.0.1/src/na/na_ofi.c:4824
# na_ofi_cancel(): fi_cancel unexpected recv failed, rc: -11(Resource temporarily unavailable).
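The log alternates between a source-location line and a message line, which makes the call chain hard to scan. A small sketch that pairs each location with its message, assuming that two-line layout holds for every entry; `condense_log` is a hypothetical helper name, and the sample input is the first two entries copied from the log above.

```shell
#!/bin/sh
# condense_log (hypothetical helper): reads a Mercury error log on stdin and
# prints one line per entry in the form "file:line  function(): message".
# Assumes each "-- Error --" line is followed directly by its message line.
condense_log() {
    awk '
        / -- Error -- / {
            # Last field is the full source path; keep only "file:line".
            n = split($NF, parts, "/")
            loc = parts[n]
            next
        }
        loc != "" {
            sub(/^# /, "")        # drop the leading "# " from the message line
            print loc "  " $0
            loc = ""
        }
    '
}

# Sample input: the first two entries from the server output.
condense_log <<'EOF'
# NA -- Error -- /home/nx02/nx02/modules/central-package-location/mercury/1.0.1/source/mercury-1.0.1/src/na/na_ofi.c:2623
# na_ofi_op_id_valid(): invalid magic number for na_ofi_op_id.
# NA -- Error -- /home/nx02/nx02/modules/central-package-location/mercury/1.0.1/source/mercury-1.0.1/src/na/na_ofi.c:2793
# na_ofi_cq_process_event(): Bad na_ofi_op_id, ignoring event.
EOF
```

Lines that do not follow an error-location line, such as the `# Using info string` banner, are skipped.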
Any ideas what’s going wrong?
About this issue
- State: closed
- Created 5 years ago
- Reactions: 3
- Comments: 18 (9 by maintainers)
Yeah, the upstream spack package doesn’t work right for newer versions of psm2. I’ve got a PR open, but until it lands I’m using my fork/branch for these tests:
https://github.com/spack/spack/pull/11658