perses: Replica propagation times systematically change in cyclic patterns that may slow simulations

In plotting the total replica propagation time per iteration from one of @hannahbrucemacdonald’s experiments (/data/chodera/brucemah/relative_paper/dec/jnk1/lig20to8/complex_0.stderr), it appears that there is a pattern:

[figure: total replica propagation time per iteration]

Connecting successive iterations with lines reveals a clearer pattern:

[figure: the same data with successive iterations connected by lines]

Zooming in shows the pattern more clearly:

[figure: zoomed view of the cyclic pattern]

This is currently just something intriguing, but it is something we should investigate in the new year, since it may lead to speed improvements if we can understand what is going on here.

About this issue

  • State: closed
  • Created 5 years ago
  • Comments: 133 (99 by maintainers)

Most upvoted comments

Here’s a clue: If I replace the mcmc_moves—which are LangevinDynamicsSplittingMoves—with LangevinDynamicsMoves, the periodic behavior goes away.

I suggest we use LangevinDynamicsMove for now, and I can debug the MCMC LangevinDynamicsSplittingMove machinery separately. It may have to do with the LangevinSplittingIntegrator, which @maxentile and I need to refine anyway.
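For concreteness, here is a minimal sketch of the swap, assuming the openmmtools.mcmc API (where the splitting move is spelled LangevinSplittingDynamicsMove); the timestep, collision rate, and n_steps values are illustrative placeholders rather than the settings used in these runs.

# Minimal sketch of the proposed swap; parameter values are placeholders.
from openmm import unit
from openmmtools import mcmc

# Move that shows the periodic timing behavior in these experiments
splitting_move = mcmc.LangevinSplittingDynamicsMove(
    timestep=4.0 * unit.femtoseconds,
    collision_rate=1.0 / unit.picoseconds,
    n_steps=250,
)

# Plain Langevin move that does not show the periodicity
plain_move = mcmc.LangevinDynamicsMove(
    timestep=4.0 * unit.femtoseconds,
    collision_rate=1.0 / unit.picoseconds,
    n_steps=250,
)

In the openmmtools multistate samplers, whichever move is chosen is passed in through the mcmc_moves argument, so the swap should be a small change in the setup code.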

I think I understand part of what’s going on. I don’t know exactly what’s leading to the specific sawtooth shape we’re seeing. I think it relates to the details of the sampling algorithm, which I’m not familiar with. I’ll explain what I’ve found, and then maybe you’ll be able to fill in the missing pieces.

First, some background on how OpenMM computes nonbonded interactions. It uses a coarse-grained neighbor list: instead of finding the neighbors of individual atoms, it finds the neighbors of blocks of 32 contiguous atoms. This is only efficient if those blocks are compact, meaning atoms whose indices are close also need to be close in space. That’s fine for proteins, since the atoms are in sorted order by residue. But water is a problem, because each molecule can diffuse independently of the others.

To address that, it sorts the water molecules so that spatially nearby molecules are also close together in index order. That makes the neighbor list efficient. But the ordering becomes less efficient over time as the waters move, so the sorting is periodically repeated.
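Here is a small self-contained illustration (plain numpy, not OpenMM’s actual code) of why the ordering matters: the average bounding-box diagonal of each 32-atom index block shrinks considerably once spatially nearby atoms are grouped together in index order, and smaller blocks mean fewer block pairs in the neighbor list.

import numpy as np

def mean_block_diagonal(positions, block_size=32):
    """Average bounding-box diagonal of each block of block_size consecutive atoms."""
    n_blocks = len(positions) // block_size
    diagonals = []
    for i in range(n_blocks):
        block = positions[i * block_size:(i + 1) * block_size]
        extent = block.max(axis=0) - block.min(axis=0)
        diagonals.append(np.linalg.norm(extent))
    return np.mean(diagonals)

# Toy "water box": 3200 atoms placed randomly in a 5 nm box
rng = np.random.default_rng(0)
waters = rng.uniform(0.0, 5.0, size=(3200, 3))
print("randomly ordered blocks:", mean_block_diagonal(waters))

# A crude grid-based sort stands in for the real spatial sort:
# group atoms by 0.5 nm cells so nearby atoms get nearby indices.
cells = np.floor(waters / 0.5).astype(int)
order = np.lexsort((cells[:, 2], cells[:, 1], cells[:, 0]))
print("spatially sorted blocks:", mean_block_diagonal(waters[order]))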

In this script, a single Context is used for 24 replicas. It loops over them, setting the coordinates for each replica in turn and running just a small number of time steps with it. That means it is frequently simulating one replica with a water ordering that was created for a different replica, and hence is inefficient. Within a few hundred steps it will reorder them and become efficient again. But then it promptly switches to yet another replica.

That’s what seems to be going on. When the iteration time goes up, it’s because the neighbor list has become less efficient and it has to compute more interactions. And if I make it reorder atoms more often, the variation gets smaller (though of course it also has to spend more time sorting atoms).
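A sketch of the access pattern being described (schematic, not the actual perses/openmmtools propagation code):

def propagate_all_replicas(context, integrator, replica_positions, n_steps=250):
    """Reuse one Context for every replica, as the script does.

    replica_positions: list of per-replica coordinate arrays (n_atoms, 3).
    """
    new_positions = []
    for positions in replica_positions:
        # These coordinates come from a different replica than the one the
        # water ordering was last sorted for, so the first steps run with
        # oversized neighbor-list blocks until atoms are reordered again.
        context.setPositions(positions)
        integrator.step(n_steps)
        state = context.getState(getPositions=True)
        new_positions.append(state.getPositions(asNumpy=True))
    return new_positions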

Triggering the reordering of atoms does work around the issue. I tried what @jchodera suggested in this comment, and the results are as follows. Sorry for the crowded plot, but I think it’s readable enough.

[figure: iteration times for the original and patched (“fix”) runs]

The curves labeled “fix” are the ones with the change; the performance cost of the workaround is small, and results are also better with the UseBlockingSync: false option (maybe we should make this the default?).
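For reference, a sketch of how the blocking-sync option can be set on the CUDA platform (UseBlockingSync is a standard CUDA platform property; the rest of the setup is omitted):

from openmm import Platform

# Ask the CUDA platform to spin-wait instead of blocking on synchronization
platform = Platform.getPlatformByName("CUDA")
platform.setPropertyDefaultValue("UseBlockingSync", "false")
# ...then create the Context / hand this platform to the sampler as usual...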

The localbuild is openmm at the commit @zhang-ivy pointed to in a previous comment, which corresponds to her environment and to the openmm-7.7.0-dev1 package on conda-forge.

I implemented the logic described above in https://github.com/choderalab/perses/issues/613#issuecomment-1199622533. It mostly works, but it has unintended consequences. Every time the barostat runs, it triggers a call to setPositions(), which means reordering gets done much more often than before. I could work around that, but I think it would be good to consider other approaches as well. For example, I’m thinking about a method that would monitor the size of the neighbor list and trigger a reordering whenever it sees that the neighbor list has grown too much. The goal is to find something robust that will automatically do the right thing in nearly all cases.
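To make the idea concrete, here is a plain-Python sketch of that kind of trigger logic (the neighbor-list size is not exposed at this level in the public API; this only illustrates the heuristic, not an actual OpenMM change):

class ReorderHeuristic:
    """Request a reorder when the neighbor list has grown too much since the last sort."""

    def __init__(self, growth_threshold=1.25):
        self.baseline = None                  # neighbor-list size right after a sort
        self.growth_threshold = growth_threshold

    def should_reorder(self, current_size):
        if self.baseline is None:
            self.baseline = current_size      # first measurement after a sort
            return False
        if current_size > self.growth_threshold * self.baseline:
            self.baseline = None              # re-baseline after the next sort
            return True
        return False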

Ok, thanks. I think the behavior we want is something like this (a rough sketch of the intended bookkeeping follows the list):

  1. If you call setPositions(), and then immediately follow with a call to step(), we should immediately perform reordering. If you take one step, you’re probably going to take lots of steps.
  2. If you call setPositions() and then getState(), we should not do reordering. You may only intend a single evaluation with those coordinates.
  3. But in case 2, we still need to remember that you set the positions with setPositions(). If you follow getState() with step(), at that point we should immediately do reordering.
  4. If the positions get updated through any means other than a call to setPositions(), we should maintain the current behavior. Lots of classes update positions in other ways through internal APIs, but usually only for small changes to the positions.
  5. Preferably this should be done in a way that maintains source compatibility with existing plugins.
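Here is the rough sketch promised above: plain Python standing in for the C++ internals, only to show the bookkeeping behind rules 1-4 (rule 5, source compatibility, is a constraint on how it gets wired in rather than on the logic itself).

class PositionReorderTracker:
    """Track whether a user-level setPositions() call should trigger a reorder."""

    def __init__(self):
        self.user_set_positions = False       # set only by the public setPositions()

    def on_set_positions(self):               # rules 1 and 3: remember the call
        self.user_set_positions = True

    def on_get_state(self):                   # rule 2: a lone evaluation never reorders
        pass

    def on_step(self, reorder_atoms):         # rules 1 and 3: first step() afterwards reorders
        if self.user_set_positions:
            reorder_atoms()
            self.user_set_positions = False

    def on_internal_position_update(self):    # rule 4: internal updates keep current behavior
        pass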

I think the important distinction is whether all the threads end up using the same OpenMM Context object, or whether each one ends up using a different Context. Does that match what you see? You can verify whether they’re the same or different by printing out hash(context) for each one.
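A tiny sketch of that check (the surrounding threading setup is whatever your script already has):

import threading

def report_context(context):
    # Identical hashes printed from different threads mean they share one Context.
    print(f"thread={threading.get_ident()} context_hash={hash(context)}")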

You can time everything happening inside the integrator step in the same way. Just inside the start of this loop add the line

double startTime = getCurrentTime();

And then here just before step = nextStep (not after!) add the lines

double endTime = getCurrentTime();
printf("step %d type %d time %g\n", step, stepType[step], endTime-startTime);

That will give you a time for executing every step of the integration algorithm. stepType will be one of the enumerated constants at https://github.com/openmm/openmm/blob/f477b106fc946c4cb3b4ae7d61da43889e568c9e/openmmapi/include/openmm/CustomIntegrator.h#L344-L381. You can also time the final section of the method where it calls recordChangedParameters() and reorderAtoms() to be thorough.

@ijpulidos: Is there a way to measure how much time is spent in the reorderAtoms() step, and how many times it is called, without recompiling? Are you able to profile using the nightly OpenMM builds, which have debug symbols built in?
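One recompile-free (if much coarser) option is to time each propagation from the Python side; this won’t isolate reorderAtoms() itself, but it does show the per-replica cost pattern. A sketch, assuming direct access to the Context and integrator:

import time

def timed_propagation(context, integrator, replica_positions, n_steps=250):
    timings = []
    for positions in replica_positions:
        context.setPositions(positions)
        start = time.perf_counter()
        integrator.step(n_steps)
        # getState() forces the GPU work to finish before stopping the clock
        context.getState(getEnergy=True)
        timings.append(time.perf_counter() - start)
    return timings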

Here’s what I’ve found so far:

  • This detailed description of inspecting and controlling your GPU settings on linux is useful, but seems limited in what can be done with GTX cards compared to Tesla cards (e.g. V100)
  • Tesla cards also support fancy datacenter GPU manager software, but GTX cards appear not to
  • There are some driver settings for GPU boosting that may be relevant, but I’m not sure if they work on the GTX cards
  • This post shows some ideas on how to increase the memory clock speed
  • This thread discusses the issues of being locked into P2 in more detail
  • This thread specifically talks about being locked into P2 in compute mode on GTX 1080 Tis
  • This thread presents an intriguing solution that uses a driver setting that appears to be totally undocumented (modprobe nvidia NVreg_RegistryDwords="RMForcePstate=0") to force into P0:
rmmod nvidia; modprobe nvidia NVreg_RegistryDwords="RMForcePstate=0"; nvidia-smi -pm 1; sleep 30; nvidia-smi 

Edit: It appears that this is described as an option for the Windows-only nvidiaInspector, but the above presents a way to do this on Linux.

  • This post talks more about power mizer settings and how to disable it.

Cool, thanks - I’ll look into the following (a quick query sketch follows the list):

  • power usage
  • node name
  • driver version
  • GPU type
  • Power State
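A quick sketch for gathering those items on each node (hostname via the standard library, GPU details via standard nvidia-smi --query-gpu fields):

import socket
import subprocess

print("node:", socket.gethostname())
fields = "name,driver_version,power.draw,pstate,clocks.sm,clocks.mem"
result = subprocess.run(
    ["nvidia-smi", f"--query-gpu={fields}", "--format=csv"],
    capture_output=True, text=True, check=True,
)
print(result.stdout)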

Hopefully there is some more output in /data/chodera/brucemah/relative_paper/amber_starting/off100/thrombin/test/SLOWER1to10 for you to see in the AM!