netpyne: random nrn_timeout error for large simulations (128 cores on GCP)
I’m getting fairly frequent, seemingly random nrn_timeout errors in my simulations. This is for an M1 model with 10k multi-compartment, multi-channel cells, 7k VecStims, 30M synapses, and 5-second simulations, running on 128 cores (4 nodes x 32 cores; 120 GB RAM per node) on Google Cloud CentOS 7 virtual machines, with NEURON 7.7.1-37-gd9605cb (master), NetPyNE v0.9.4, and Python 3.6.8.
Here’s an example error:
nrn_timeout t=4380.62
--------------------------------------------------------------------------
MPI_ABORT was invoked on rank 0 in communicator MPI_COMM_WORLD
with errorcode 0.
NOTE: invoking MPI_ABORT causes Open MPI to kill all MPI processes.
You may or may not see output from other processes, depending on
exactly when Open MPI kills them.
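For context on what the error means: as I understand it, NEURON's nrn_timeout fires when a rank appears stuck in a collective operation (such as spike exchange) for longer than a wall-clock limit, and then aborts the whole job via MPI_ABORT. Here is a toy pure-Python sketch of that watchdog idea (names and structure are hypothetical, not NEURON's actual implementation):

```python
import threading
import time

class ProgressWatchdog:
    """Toy collective-timeout check: flag a timeout if no progress is
    reported within `timeout` seconds of wall-clock time. NEURON's
    nrn_timeout is the analogous check around MPI collectives; on
    trigger NEURON prints "nrn_timeout t=..." and calls MPI_ABORT."""

    def __init__(self, timeout):
        self.timeout = timeout
        self.last = time.monotonic()
        self.timed_out = False
        self._stop = threading.Event()
        self._thread = threading.Thread(target=self._watch, daemon=True)
        self._thread.start()

    def progress(self):
        # Called whenever the simulation makes progress (e.g. a
        # spike-exchange round completes); resets the clock.
        self.last = time.monotonic()

    def _watch(self):
        while not self._stop.wait(0.01):
            if time.monotonic() - self.last > self.timeout:
                self.timed_out = True  # NEURON would abort here
                return

    def stop(self):
        self._stop.set()
        self._thread.join()
```

A run that stalls longer than the timeout gets flagged; a run that keeps calling `progress()` does not.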
A few more details that might be relevant:
- fixed time step (dt = 0.025)
- cvode.cache_efficient(True)
- cvode.atol(1e-6)
- cvode.use_fast_imem(True)
- recording all membrane currents via i_membrane_ to calculate the LFP at a 0.1 ms interval (implementation similar to the Allen Institute's BMTK or LFPy)
- recording somatic voltage from 2 cells at a 0.1 ms interval
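For reference, the LFP calculation in the last point sums each segment's transmembrane current weighted by its distance to the electrode (point-source approximation, as in LFPy/BMTK). A minimal pure-Python sketch, with hypothetical function and argument names, and units as commented:

```python
import math

def lfp_point_source(i_membrane, seg_pos, electrode, sigma=0.3):
    """Extracellular potential at `electrode` from per-segment
    transmembrane currents, using the point-source approximation
    phi = sum_n I_n / (4 * pi * sigma * r_n).

    i_membrane : per-segment currents in nA (e.g. sampled i_membrane_)
    seg_pos    : per-segment (x, y, z) positions in um
    electrode  : electrode (x, y, z) position in um
    sigma      : extracellular conductivity in S/m (0.3 is a common value)
    """
    phi = 0.0
    for i, pos in zip(i_membrane, seg_pos):
        r = math.dist(electrode, pos)  # um; assumes electrode off-membrane
        phi += i / (4.0 * math.pi * sigma * r)
    return phi  # nA / (S/m * um) works out to mV
```

In the actual runs this is evaluated once per 0.1 ms recording step, summing over all segments on the local rank.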
Below is the output log of a successful simulation, which includes info on the distribution of cells and conns, as well as load balance at the end:
numprocs=128
Start time: 2019-12-22 22:55:08.962242
Reading command line arguments using syntax: python file.py [simConfig=filepath] [netParams=filepath]
Loading file ../data/v56_batch2b/v56_batch2b_0_0_cfg.json ...
Loading simConfig...
Importing netParams from ../data/v56_batch2b/v56_batch2b_netParams.py
Creating network of 22 cell populations on 128 hosts...
Number of cells on node 9: 134
Number of cells on node 2: 134
[...]
Number of connections on node 5: 68670
Number of synaptic contacts on node 5: 278168
[...]
Running simulation for 5000.0 ms...
Done; run time = 24927.95 s; real-time ratio: 0.00.
Gathering data...
Done; gather time = 5.70 s.
Analyzing...
Cells: 17073
Connections: [removed during gathering to save memory]
Spikes: 373727 (4.38 Hz)
IT2 : 5.818 Hz
SOM2 : 11.850 Hz
PV2 : 2.940 Hz
IT4 : 0.754 Hz
IT5A : 38.843 Hz
SOM5A : 1.442 Hz
PV5A : 32.018 Hz
IT5B : 1.072 Hz
PT5B : 5.916 Hz
SOM5B : 0.729 Hz
PV5B : 26.749 Hz
IT6 : 1.877 Hz
CT6 : 0.178 Hz
SOM6 : 10.056 Hz
PV6 : 0.567 Hz
TPO : 2.498 Hz
TVL : 0.000 Hz
S1 : 2.563 Hz
S2 : 2.470 Hz
cM1 : 1.254 Hz
M2 : 1.322 Hz
OC : 2.495 Hz
Simulated time: 5.0 s; 128 workers
Run time: 24927.95 s
Saving output as ../data/v56_batch2b/v56_batch2b_0_0.json ...
Finished saving!
Done; saving time = 4.43 s.
Plotting LFP ...
Plotting raster...
Plotting recorded cell traces ... trace
Done; plotting time = 150.12 s
Total time = 25606.27 s
End time: 2019-12-23 06:01:55.235592
max_comp_time: 24692.92587293384
min_comp_time: 20009.316901733127
avg_comp_time: 21740.200874486483
load_balance: 0.8804222304946103
spike exchange time (run_time-comp_time): 235.026302206541
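The last few lines of the log follow directly from the per-rank computation times; a small sketch (hypothetical helper, generic rank timings) of how load_balance and the spike-exchange estimate are derived:

```python
def load_balance_stats(comp_times, run_time):
    """Summarize per-rank computation times (seconds) for one run.

    load_balance = avg/max comp time (1.0 = perfectly balanced);
    spike_exchange_time = run_time - max comp time, i.e. wall-clock
    time the slowest rank spent outside computation.
    """
    mx = max(comp_times)
    avg = sum(comp_times) / len(comp_times)
    return {
        "max_comp_time": mx,
        "min_comp_time": min(comp_times),
        "avg_comp_time": avg,
        "load_balance": avg / mx,
        "spike_exchange_time": run_time - mx,
    }
```

Plugging in the logged values (avg 21740.2 s, max 24692.9 s, run time 24927.95 s) reproduces the 0.880 load balance and ~235 s exchange time above.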
About this issue
- State: closed
- Created 5 years ago
- Comments: 33
Fixed the 96-core issue by running:
So I now have 10 sims each running on a single 96-core node … let’s see if I get the timeout here or not…
That should be good for up to 200k ranks 😃
git fetch
git pull collective-timeout-debug
Build and run as normal and see if it prints anything from the printf.