mpi4py: error while getting results (in tandem with SLURM)

ERROR:

Traceback (most recent call last):
  File "/opt/software/anaconda/3/lib/python3.6/runpy.py", line 193, in _run_module_as_main
    "__main__", mod_spec)
  File "/opt/software/anaconda/3/lib/python3.6/runpy.py", line 85, in _run_code
    exec(code, run_globals)
  File "/home/vasko/.local/lib/python3.6/site-packages/mpi4py/futures/__main__.py", line 72, in <module>
    main()
  File "/home/vasko/.local/lib/python3.6/site-packages/mpi4py/futures/__main__.py", line 60, in main
    run_command_line()
  File "/home/vasko/.local/lib/python3.6/site-packages/mpi4py/run.py", line 47, in run_command_line
    run_path(sys.argv[0], run_name='__main__')
  File "/opt/software/anaconda/3/lib/python3.6/runpy.py", line 263, in run_path
    pkg_name=pkg_name, script_name=fname)
  File "/opt/software/anaconda/3/lib/python3.6/runpy.py", line 96, in _run_module_code
    mod_name, mod_spec, pkg_name, script_name)
  File "/opt/software/anaconda/3/lib/python3.6/runpy.py", line 85, in _run_code
    exec(code, run_globals)
  File "cali_send_2.py", line 137, in <module>
    globals()[sys.argv[1]](sys.argv[2], sys.argv[3])
  File "cali_send_2.py", line 94, in solve_on_cali
    sols = list(executor.map(solve_matrix, repeat(inputs), range(len(wls)), wls))
  File "/home/vasko/.local/lib/python3.6/site-packages/mpi4py/futures/pool.py", line 207, in result_iterator
    yield futures.pop().result()
  File "/opt/software/anaconda/3/lib/python3.6/concurrent/futures/_base.py", line 432, in result
    return self.__get_result()
  File "/opt/software/anaconda/3/lib/python3.6/concurrent/futures/_base.py", line 384, in __get_result
    raise self._exception
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xc1 in position 5: invalid start byte

ENV: CentOS release 6.5 (Final), Python 3.6 (Anaconda), mpiexec (OpenRTE) 1.8.2, mpi4py 3.0.3

Piece of Code:

import os
from itertools import repeat
from zipfile import ZipFile

from mpi4py.futures import MPIPoolExecutor

inputs = [der_mats, ref_ind_yee_grid, n_xy_sq, param_sweep_on, i_m, inv_eps, sol_params]
with MPIPoolExecutor(max_workers=int(nodes)) as executor:
    sols = list(executor.map(solve_matrix, repeat(inputs), range(len(wls)), wls))
    executor.shutdown(wait=True)  # redundant inside `with`, which already waits on exit
    zipobj = ZipFile(zp_fl_nm, 'w')

    for sol in sols:
        w, v, solnum, vq = sol
        print(w[0], solnum)  # shows whether any solutions are duplicated
        w.tofile(f"w_sol_{solnum}.npy")
        v.tofile(f"v_sol_{solnum}.npy")
        vq.tofile(f"vq_sol_{solnum}.npy")
        zipobj.write(f"w_sol_{solnum}.npy")
        zipobj.write(f"v_sol_{solnum}.npy")
        zipobj.write(f"vq_sol_{solnum}.npy")
        os.remove(f"w_sol_{solnum}.npy")
        os.remove(f"v_sol_{solnum}.npy")
        os.remove(f"vq_sol_{solnum}.npy")

    zipobj.close()
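A side note, unrelated to the crash: ndarray.tofile writes raw bytes with no dtype or shape header, so the files above are not real .npy files despite the extension. np.save/np.load round-trip arrays safely. A minimal sketch with a hypothetical array standing in for one worker result:

```python
import os
from zipfile import ZipFile

import numpy as np

# Hypothetical stand-ins for one (w, solnum) pair from a worker result.
w = np.array([1.0 + 2.0j, 3.0 + 4.0j])
solnum = 0

fname = f"w_sol_{solnum}.npy"
np.save(fname, w)          # writes a real .npy file (dtype + shape header)
restored = np.load(fname)  # round-trips without knowing dtype or shape

with ZipFile("sols.zip", "w") as zipobj:  # context manager closes the archive
    zipobj.write(fname)

os.remove(fname)
os.remove("sols.zip")
```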

I call the method by submitting a command like this: f'srun --mpi=pmi2 -n ${{SLURM_NTASKS}} python -m mpi4py.futures cali_send_2.py solve_on_cali \"\"{name}\"\" {num_nodes}'

The error does not always appear. With wls = np.arange(0.4e-6, 1.8e-6, 0.01e-6) it crashes with this error, and with a step of 0.1e-6 it instead returns duplicates of some solutions. With wls = np.arange(0.55e-6, 1.55e-6, 0.01e-6) and either step, 0.1e-6 or 0.001e-6, it does NOT crash and returns good results without duplicates.

Could someone please explain the origin of this error? My suspicion points at floating-point numbers like 1.699999999999999999999e-6.
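The floating-point suspicion is easy to probe in isolation (whether it explains the UnicodeDecodeError is a separate question): np.arange with a float step accumulates rounding error, so the values near the end of the range are not exact round numbers and the element count can vary, while np.linspace fixes the count and hits both endpoints exactly. A minimal sketch:

```python
import numpy as np

# Classic binary-float rounding: 0.1 + 0.2 is not exactly 0.3.
print(0.1 + 0.2 == 0.3)  # False

# arange with a float step accumulates rounding error, so values near
# the end of the range drift away from the expected round numbers.
wls = np.arange(0.4e-6, 1.8e-6, 0.01e-6)
print(len(wls), wls[-1])

# linspace fixes the element count and hits both endpoints exactly.
wls2 = np.linspace(0.4e-6, 1.8e-6, 141)
print(len(wls2), wls2[-1] == 1.8e-6)  # 141 True
```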

About this issue

  • Original URL
  • State: closed
  • Created 3 years ago
  • Comments: 15 (8 by maintainers)

Most upvoted comments

@byquip You are using Python from a miniconda environment, but mpi4py is installed in $HOME/.local. That's suspicious; conda users should just pip install inside the environment. Or perhaps the problem is what @leofang pointed out: the environment is not active on all the compute nodes.
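One quick way to test this is to have every rank report which interpreter and mpi4py it actually picks up; a diagnostic sketch (check_env.py is a hypothetical filename, run it under srun as shown in the comment):

```python
import importlib.util
import socket
import sys

# Save as check_env.py and run: srun -n $SLURM_NTASKS python check_env.py
# Each rank prints its host, interpreter, and where mpi4py was found;
# differing paths across nodes indicate an environment mismatch.
spec = importlib.util.find_spec("mpi4py")
origin = spec.origin if spec is not None else "mpi4py not found"
print(socket.gethostname(), sys.executable, origin)
```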

This kind of question is better suited for mpi4py's mailing list on Google Groups. I understand that opening an issue on GitHub is very convenient for users, but this increases the load on core developers, and the community watching the mailing list is usually larger. Chances of getting a good tip and advice are higher on the mailing list.

Any chance you forgot to activate the same Python environment across all compute nodes?