netpyne: random nrn_timeout error for large simulations (128 cores on GCP)

I’m getting quite frequent, seemingly random nrn_timeout errors in my simulations. This is for an M1 model with 10k multi-compartment, multi-channel cells, 7k VecStims, 30M synapses, and 5-second simulations, running on 128 cores (4 nodes x 32 cores; 120 GB RAM per node) on Google Cloud CentOS 7 virtual machines, with NEURON 7.7.1-37-gd9605cb (master, d9605cb), NetPyNE v0.9.4 and Python 3.6.8.

Here’s an example error:

nrn_timeout t=4380.62
--------------------------------------------------------------------------
MPI_ABORT was invoked on rank 0 in communicator MPI_COMM_WORLD
with errorcode 0.
 
NOTE: invoking MPI_ABORT causes Open MPI to kill all MPI processes.
You may or may not see output from other processes, depending on
exactly when Open MPI kills them.

A couple more details which might be relevant (a minimal sketch of these settings follows the list):

  • fixed time step (dt = 0.025 ms)
  • cvode.cache_efficient(True)
  • cvode.atol(1e-6)
  • cvode.use_fast_imem(True)
  • recording all membrane currents via i_membrane_ at a 0.1 ms interval to calculate LFP (similar implementation to the Allen Institute’s BMTK or LFPy)
  • recording somatic voltage from 2 cells at a 0.1 ms interval
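
For concreteness, this is roughly how those settings map onto NEURON’s Python API — a minimal sketch with a toy soma section standing in for the real cells, not the actual model code:

from neuron import h
h.load_file('stdrun.hoc')

h.dt = 0.025                    # fixed time step (ms)

cvode = h.CVode()
cvode.cache_efficient(True)     # contiguous state storage for better cache use
cvode.atol(1e-6)                # absolute tolerance (used when CVode is active)
cvode.use_fast_imem(True)       # needed so i_membrane_ is available per segment

# toy section standing in for a real cell compartment
soma = h.Section(name='soma')
soma.insert('hh')

# record i_membrane_ (nA) at a 0.1 ms interval for the LFP calculation
imem = h.Vector()
imem.record(soma(0.5)._ref_i_membrane_, 0.1)

# record somatic voltage at the same 0.1 ms interval
v_soma = h.Vector()
v_soma.record(soma(0.5)._ref_v, 0.1)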

Below is the output log of a successful simulation, which includes info on the distribution of cells and connections, as well as the load balance at the end:


numprocs=128

Start time:  2019-12-22 22:55:08.962242

Reading command line arguments using syntax: python file.py [simConfig=filepath] [netParams=filepath]
Loading file ../data/v56_batch2b/v56_batch2b_0_0_cfg.json ... 
Loading simConfig...
Importing netParams from ../data/v56_batch2b/v56_batch2b_netParams.py

Creating network of 22 cell populations on 128 hosts...
  Number of cells on node 9: 134 
  Number of cells on node 2: 134 
  
[...]
  
  Number of connections on node 5: 68670 
  Number of synaptic contacts on node 5: 278168 
  
[...]

Running simulation for 5000.0 ms...

  Done; run time = 24927.95 s; real-time ratio: 0.00.

Gathering data...
  Done; gather time = 5.70 s.

Analyzing...
  Cells: 17073
  Connections: [removed during gathering to save memory]
  Spikes: 373727 (4.38 Hz)
   IT2 : 5.818 Hz
   SOM2 : 11.850 Hz
   PV2 : 2.940 Hz
   IT4 : 0.754 Hz
   IT5A : 38.843 Hz
   SOM5A : 1.442 Hz
   PV5A : 32.018 Hz
   IT5B : 1.072 Hz
   PT5B : 5.916 Hz
   SOM5B : 0.729 Hz
   PV5B : 26.749 Hz
   IT6 : 1.877 Hz
   CT6 : 0.178 Hz
   SOM6 : 10.056 Hz
   PV6 : 0.567 Hz
   TPO : 2.498 Hz
   TVL : 0.000 Hz
   S1 : 2.563 Hz
   S2 : 2.470 Hz
   cM1 : 1.254 Hz
   M2 : 1.322 Hz
   OC : 2.495 Hz
  Simulated time: 5.0 s; 128 workers
  Run time: 24927.95 s
Saving output as ../data/v56_batch2b/v56_batch2b_0_0.json  ... 
Finished saving!
  Done; saving time = 4.43 s.
Plotting LFP ...
Plotting raster...
Plotting recorded cell traces ... trace
  Done; plotting time = 150.12 s

Total time = 25606.27 s

End time:  2019-12-23 06:01:55.235592
max_comp_time: 24692.92587293384
min_comp_time: 20009.316901733127
avg_comp_time: 21740.200874486483
load_balance: 0.8804222304946103

spike exchange time (run_time-comp_time):  235.026302206541
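
As a side note, load_balance above is just avg_comp_time / max_comp_time (21740.2 / 24692.9 ≈ 0.880). A hypothetical sketch of how such per-rank timings can be gathered with NEURON’s ParallelContext (not necessarily NetPyNE’s exact implementation):

from neuron import h
pc = h.ParallelContext()

comp = pc.step_time()                          # this rank's computation time (s)
max_comp = pc.allreduce(comp, 2)               # 2 = max across ranks
min_comp = pc.allreduce(comp, 3)               # 3 = min across ranks
avg_comp = pc.allreduce(comp, 1) / pc.nhost()  # 1 = sum across ranks

if pc.id() == 0:
    print('max_comp_time:', max_comp)
    print('min_comp_time:', min_comp)
    print('avg_comp_time:', avg_comp)
    print('load_balance:', avg_comp / max_comp)  # 1.0 = perfectly even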

About this issue

  • State: closed
  • Created 5 years ago
  • Comments: 33

Most upvoted comments

Fixed the 96-core issue by running:

export OMP_NUM_THREADS=1
export USE_SIMPLE_THREADED_LEVEL3=1

So I now have 10 sims each running on a single 96-core node … let’s see if I get a timeout here or not…
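
For what it’s worth, the same thing can be done from Python, as long as the variables are set before NEURON (and whatever threaded BLAS it links) gets imported — a hypothetical sketch:

import os

# set before importing neuron so OpenMP/BLAS pick these up at load time
os.environ['OMP_NUM_THREADS'] = '1'
os.environ['USE_SIMPLE_THREADED_LEVEL3'] = '1'

from neuron import h   # import only after the environment is set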

That should be good for up to 200k ranks 😃

git fetch
git pull collective-timeout-debug

Build and run as normal and see if it prints anything from the printf:

printf("%d calls %d but 0 calls %d\n", nrnmpi_myid, i, buf);