(Parte 1 de 4)

GROMACS 4: Algorithms for Highly Efficient, Load-Balanced, and Scalable Molecular Simulation

Berk Hess*

Max-Planck Institute for Polymer Research, Ackermannweg 10, D-55128 Mainz, Germany

Carsten Kutzner

Department of Theoretical and Computational Biophysics, Max-Planck-Institute of Biophysical Chemistry, Am Fassberg 1, D-37077 Gottingen, Germany

David van der Spoel

Department of Cell and Molecular Biology, Uppsala UniVersity, Husargatan 3, Box 596, SE-75124 Uppsala, Sweden

Erik Lindahl

Stockholm Center for Biomembrane Research, Stockholm UniVersity, SE-10691 Stockholm, Sweden

Received November 7, 2007

Abstract: Molecular simulation is an extremely useful, but computationally very expensive tool for studies of chemical and biomolecular systems. Here, we present a new implementation of ourmolecularsimulationtoolkitGROMACSwhichnowbothachievesextremelyhighperformance on singleprocessorsfrom algorithmicoptimizationsand hand-codedroutinesand simultaneously scalesvery well on parallelmachines.The code encompassesa minimal-communicationdomain decompositionalgorithm,full dynamic load balancing,a state-of-the-artparallel constraintsolver, and efficient virtual site algorithms that allow removal of hydrogen atom degrees of freedom to enable integration time steps up to 5 fs for atomistic simulations also in parallel. To improve the scaling properties of the common particle mesh Ewald electrostatics algorithms, we have in addition used a Multiple-Program, Multiple-Data approach, with separate node domains responsible for direct and reciprocal space interactions. Not only does this combination of algorithmsenable extremelylong simulationsof large systems but also it providesthat simulation performance on quite modest numbers of standard cluster nodes.

I. Introduction

Over the last few decades, molecular dynamics simulation has become a common tool in theoretical studies both of simple liquids and large biomolecular systems such as proteins or DNA in realistic solvent environments. The computational complexity of this type of calculations has historically been extremely high, and much research has thereforefocused on algorithmsto achieve single simulations that are as long or large as possible. Some of the key early work was the introduction of holonomic bond length constraints1 and rigid-body water models2,3 to enable longer integration time steps. However, one of the most important general developments in the field was the introduction of parallelmolecularsimulationimplementationsduring the late

10.1021/ct700301q C: $40.75 © 2008 American Chemical Society Published on Web 02/02/2008

1980s and early 1990s.4-7 The NAMD program by the Schulten group8 was the first to enable scaling of large molecular simulations to hundreds of processors, Duan and Kollman were able to complete the first microsecond simulation of a protein by creating a special parallel version of Amber, and more recently Fitch et al. have taken scaling to the extreme with their BlueMatter code which can use all tens of thousands of nodes on the special BlueGene hardware.9

On the other hand, an equally strong trend in the field has been the change of focus to statistical properties like free energy of solvation or binding of small molecules and, e.g., protein folding rates. For this class of problems (limited by sampling) the main bottleneck is single-CPU performance, since it is typically always possible to achieve perfect scaling on any cluster by starting hundreds of independent simulations with slightlydifferentinitialconditions.This has always been a central theme in GROMACS development and perhapsbestshowcasedby its adoptionin the Folding@Home project, where it is running on hundreds of thousands of independent clients.10 GROMACS achieves exceptional single-CPUperformancebecauseof the manuallytuned SSE, SSE2, and ALTIVEC force kernels, but there are also many algorithmicoptimizations,for instance single-sumvirials and strength-reduced algorithms to allow single-precision floating-point arithmetic in all places where it still conserves energy (which doubles memory and cache bandwidth).1,12 In the benchmarksectionwe show that GROMACSin single precision matches the energy conservation of a double precision package.

Unfortunately it is far from trivial to combine raw single-

CPU performance and scaling, and in many cases there are inherent tradeoffs. It is for instance straightforward to constrain all bond lengths on a single CPU, but in parallel it is usuallyonly appliedto bonds involvinghydrogensto avoid (iterative) communication, which in turn puts a lower limit on the possible time step.

In this paper, we present a completely reworked parallelization algorithm that has been implemented in GROMACS. However, rather than optimizing relative scaling over N CPUs we have focused on (i) achieving the highest possible absolute performance and (i) doing so on as few processors as possible since supercomputer resources are typically scarce. A key challenge has therefore been to make sure all algorithmsused to improvesingle-CPUperformancethrough longer time steps such as holonomic bond constraints, replacing hydrogens with virtual interaction sites,13 and arbitrary triclinic unit cells also work efficiently in parallel.

GROMACS was in fact set up to run in parallelon 10Mbit ethernet from the start in 19927 but used a particle/force decompositionthat did not scale well. The single-instructionmultiple-datakernelswe introducedin 2000 made the relative scalingeven worse (althoughabsoluteperformanceimproved significantly), since the fraction of remaining time spent on communication increased. A related problem was load imbalance; with particle decomposition one can frequently avoid imbalance by distributing different types of molecules uniformly over the processors. Domain decomposition, on the other hand, requires automatic load balancing to avoid deterioration of performance. This load imbalance typically occurs in three cases: The most obvious reason is an uneven distribution of particles in space, such as a system with a liquid-vapor coexistence.A second reason is imbalance due to different interaction densities. In biomolecular systems the atom densityis usuallynearlyuniform,but when a unitedatom forcefield is used hydrocarbon segments (e.g., in lipid chains) have a three times lower particle density and these particles have only Lennard-Jones interactions. This makes the computation of interactions of a slab of lipids an order of magnitude faster than a slab of water molecules. Interaction density imbalance is also an issue with all-atom force fields in GROMACS, since the program provides optimized water-water loops for standard SPC/TIP3P/TIP4P waters with Lennard-Jones interactions only on the oxygens.2,3 (In principle it is straightforward to introduce similar optimizationfor the CHARMM-stylemodifiedTIP watermodelswith Lennard-Jones interactions on the hydrogens too, but since there is no clear advantage from the extra interactions we have not yet done so.) A third reason for load imbalance is statistical fluctuation of the number of particles in a domain decomposition cell. This primarily plays a role when cells only contain a few hundred atoms.

Anothermajor issue for simulationof large moleculessuch as proteins was the fact that atoms connected by constraints could not be split over processors (holonomic constraints) a problem shared with all other biomolecular simulation packages (the alternative being shorter time-steps, possible coupled with multiple-time-step integration). This issue is more acute with domain decomposition, since even small molecules in general do not reside in a single domain.

Finally, the last challenge was the nonimpressive scaling of the Particle Mesh Ewald (PME) electrostatics14 as implemented in the previous GROMACS version. Since PME involves two 3D fast Fourier transforms (FFTs), it requires global all-to-all communication where the number of messages scale as the square of the number of nodes involved. There have been several attempts at parallelizing PME using iterative solvers instead of using FFTs. A different algorithm that reduces communication is fast multipole expansion.15 However, presently none of these methods combine the efficiency of PME using FFTs with good scaling up to many processors.

We have addressedthese four issuesby devisingan eighthshell domain decomposition method coupled to a full dynamic load-balancing algorithm with a minimum amount of communication, a parallel version of the constraint algorithmLINCS that enables holonomicconstraintswithout iterative communication, and splitting off the PME calculation to dedicated PME processors. These four key advances will be described in the next three sections, followed by a description of other new features and a set of benchmarks to illustrate both absolute performance and relative scaling.

I. Domain Decomposition

Recently,the D. E. Shaw grouphas performedseveralstudies into general zonal methods16 for parallelization of particlebased interactions. In zonal (or neutral territory) methods, forces between particles i and j are not necessarily calculated

436 J. Chem. Theory Comput., Vol. 4, No. 3, 2008 Hess et al.

on a processor where either of particles i or j resides. Somewhat paradoxically, such methods can be significantly more efficient than traditional domain decomposition methods since they reducethe total amountof data communicated. Two methods achieve the least communication when the domain size is not extremely small compared to the cutoff radius; these two methods were termed eighth shell17 and midpointmethods18 by Shaw and co-workers.In the half shell method, interactions between particle i and j are calculated in the cell where i or j resides.The minimum communication required for such a method is half of the volume of a boundaryof a thicknessequal to the cutoff radius. The eighth shell methodimproveson this by also calculatinginteractions between particles that reside in different communicated zones. The communicatedvolume of the eighth shell method is thus a subset of that of the half shell method, and it also requires less communication steps which helps reduce latency.

The basic eighth shell method was already described in 1991 by Liem et al.,19 who implementedcommunicationwith only nearest neighbors. In GROMACS 4 we have extended this method for communication with multiple cells and staggered grids for dynamic load balancing.The Shaw group has since chosen to use the midpoint method in their Desmondcode sinceit can take advantageof hardwarewhere each processor has two network connections that simultaneously send and receive. After quite stimulatingdiscussions with the Shaw group we chose not to switch to the midpoint method, primarily not only because we avoid the calculation of the midpoint, which has to be determined binary identically on multiple processors, but also because not all hardware that GROMACS will run on has two network connections.With only one network connection,a single pair of send and receive calls clearly causes less latency than two such pairs of calls.

Before going into the description of the algorithm, the concept of charge groups needs to be explained; these were originally introduced to avoid electrostatic artifacts. By groupingseveral partiallycharged atoms of a chemicalgroup into a neutral charge group, charge-charge interactions entering and leaving the cutoff are effectively replaced by short-range dipole-dipole interactions. The location of a charge group in GROMACS is given by the (non-massweighted) average of the coordinates of the atoms. With the advent of the PME electrostatics method this is no longer an issue. But charge groups can also speed up the neighbor search by an order of magnitude; given a pair of water molecules for instance, we only need to determine one distance instead of nine (or sixteen for a four-site water model). This is particularly important in GROMACS since the neighbor searching is much slower than the force loops, for whichwe typicallyuse tunedassemblycode.Sincecharge groups are used as the basic unit for neighbor searching, they also need to be the basic unit for the domain decomposition. In GROMACS4, the domainsare rebuilteverytimeneighbor searching is performed, typically every 10 steps.

The division of the interactions among processors is illustrated in Figure 1. Consider the processor or cell that has the charge groups in zone 0 as home charge groups, i.e., it performs the integration of the equations of motion for the particles in these charge groups. In the eighth shell method each cell should determine the interactions between pairs of charge groups of which, for each dimension, the minimum cell index of the two charge groups corresponds to the index of that cell. This can be accomplished by the following procedure. Cell 0 receives the coordinates of the particles in the dashed zones 1 to 7, by communication only in one direction for each dimension. When all cells dimensions are larger than the cutoff, each zone corresponds to part of a single, neighboring cell. But in general many cells can contribute to one zone. Each processor calculates the interactions between charge groups of zone 0 with zones 0 to 7, of zone 1 with zones 3 to 6, of zone 2 with zone 5, and of zone 3 with zones 5 and 6. If this procedure is applied for all processors,all pair interactionswithin the cutoff radius are calculated.

Interactions involving three or more atoms cannot be distributedaccordingto the scheme describedabove. Bonded interactions are distributed over the processors by finding the smallest x, y, and z coordinate of the charge groups involved and assigning the interaction to the processor with the home cell where these smallest coordinates residesnote that this does not require any extra communication between the processors. This procedure works as long as the largest distance between charge groups involved in bonded interactions is not larger than the smallest cell dimension. To check if this is the case, we count the number of assigned bonded interactions during domain decomposition and compare it to the total number of bonded interactions in the system. When there are only two cells in a certain dimension and the corresponding box length is smaller than four times the cutoff distance, a cutoff criterion is required for any pair of particles involved to avoid that bonded interactions are assigned to multiple cells. Unlike the midpoint method, this procedurelimitsthe distancesinvolvedin bondedinteractions to the smallest cell dimension. For atomistic simulations this is not an issue, since distances in bonded interactions are usually smaller than 0.5 nm, leading to a limit of 10 to 20 atoms per cell, which is beyond the scaling of GROMACS 4. For coarse-grained simulations bonded distances can be larger, but because of the lower interaction density this also does not limit the scaling.

For full dynamic load balancing the boundaries between cells need to be adjusted during the simulation. For 1D

Figure 1. A nonstaggered domain decomposition grid of 3 2 2 cells. Coordinates in zones 1 to 7 are communicated to the corner cell that has its home particles in zone 0. rc is the cutoff radius.

GROMACS 4 J. Chem. Theory Comput., Vol. 4, No. 3, 2008 437

domain decomposition this is trivial, but for a 3D decomposition the cell boundaries in the last two dimensions need to be staggered along the first dimensions to allow for complete load balancing (see the next section for details). Figure 2 shows the communicated zones for 2D domain decomposition in the most general case, namely a triclinic unit cell with dynamic load balancing. Zones 1, 2, and 3 indicate the parts of neighboring cells that are within the nonbonded cutoff radius rc of the home cell of zone 0. Without dynamic load balancing this is all that would need to be communicated to the processor of zone 0. With dynamic load balancing the staggering can lead to an extra volume 3′ that needs to be communicated, due to the nonbonded interactions between cells 1 and 3 which should be calculated on the processor of cell 0. For bonded interactions, zones 1 and 2 might also require expansion. To ensure that all bonded interaction between charge groups can be assigned to a processor, it is sufficient to ensure that the charge groups within a sphere with a radius rb, the cutoff for bonded interactions, are present on at least one processor for every possiblecenter of the sphere.In Figure 2 this means we also need to communicate volume 2′. When no bonded interactionsare presentbetweenchargegroups,such volumes are not communicated. For 3D domain decomposition the picture becomes quite a bit more complicated, but the procedure is analogous apart from more extensive bookkeeping. All three cases have been fully implemented for general triclinic cells. GROMACS 4 does not (yet) take full advantage of the reduction in the communication due to rounding of the zones. Currently zones are only rounded in the ‘forward’ directions, for example part 3′ in Figure 2 is replaced by the smallest parallelogram enclosing it.

The communication of the coordinates and charge group indices can be performed efficiently by ‘pulsing’ the information in one direction simultaneously for all cells one or more times. This needs to be repeated for each dimension.

The numberof pulsesnp in a dimensionis given by the cutoff length in that direction divided by the minimum cell size.

In most cases np will be one or two. Consider a 3D domain decomposition where we decompose in the order x, y, z; meaning that the x boundaries are aligned, the y boundaries are staggered along the x direction, and the z boundaries are staggered along the x and y directions. Each processor first sends the zone that its neighboring cell in -z needs to this cell. This process is done np(z) times. Now each processor can send the zone its neighboring cell in -y needs, plus the part of the zone it received from +z, that is also required by the neighbor in -y. The last step consists of np(x) pulses in -x where (parts of) 4 zones are sent over. In this way np(x) + np(y) + np(z) communication steps are required to by the neighboring processors. The communication of the forces happens according to the same procedure but in reversed order and direction.

Another example of a minor complication in the communicationis virtual interactionsites constructedfrom atoms in other charge groups. This is used in some polymer (anisotropic united atom) force fields, but GROMACS can also employ virtual sites to entirely remove hydrogen vibrations and construct the hydrogens in their equilibrium positions from neighboring heavy atoms each time step.13 Since the constructing atoms are not necessarily interacting on the same node, we have to track the virtual site coordinate dependenciesseparately to make sure they are both available for construction and that forces are properly communicated back. The communication for virtual sites is also performed with pulses but now in both directions. Here only one pulse per dimension is required, since the distances involved in the constructionof virtual sites are at most two bond lengths.

I. Dynamic Load Balancing

Calculating the forces is by far the most time-consuming part in MD simulations.In GROMACS,the force calculation is preceded by the coordinate communication and followed by the force communication. We can therefore balance the load by determining the time spent in the force routines on each processor and then adjusting the volume of every cell in the appropriatedirection.The timingsare determinedusing inline assembly hardware cycle counters and supported for virtually all modern processor architectures. For a 3D decompositionwith order x, y, z the load balancingalgorithm works as follows: First the timings are accumulated in the z direction to the processor of cell z ) 0, independently for each x and y row. The processor of z ) 0 sums these timings and sends the sum to the processor of y ) 0. This processor sums the timings again and sends the sum to the processor of x ) 0. This processor can now shift the x boundaries and send these to the y ) 0 processors. They can then determine the y boundaries, send the x and y boundaries to the z ) 0 processors, which can then determine z boundaries, and send all boundaries to the processors along their z row. With this procedure only the necessary information is sent to the processors that need it and global communicationis avoided.

As mentioned in the Introduction, load imbalance can come from several sources. One needs to move boundaries in a conservative fashion in order to avoid oscillations and instabilities, which could for instance occur due to statistical fluctuations in the number of particles in small cells. Empirically, we have found that scaling the relative lengths of the cells in each dimension with 0.5 times the load

Figure 2. The zones to communicate to the processor of zone 0, see the text for details. rc and rb are the nonbonded and bonded cutoff radii, respectively, and d is an example of a distance between following, staggered boundaries of cells.

438 J. Chem. Theory Comput., Vol. 4, No. 3, 2008 Hess et al.

imbalance,and a maximumscalingof 5%, producedefficient and stable load balancing. For large numbers of cells or inhomogeneous systems two more checks are required: A first restrictionis that boundaries should not move more than halfway an adjacentcell (where instead of halfway one could also choose a different value). This prevents cells from moving so far that a charge group would move two cells in a single step. It also prevents load balancing issues when there are narrow zones of high density in the system. A second problem is that due to the staggering, cell boundaries along neighboring rows could shift to such an extent that additional cells would enter the cutoff radius. For load balanced simulations the user can set the minimum allowed cell size, and by default the nonbonded cutoff radius is used. The distance between following, staggered cell boundaries (as indicated by d in Figure 2) should not be smaller than this minimum allowed cell size. To ensure this, we limit the new position of each boundary to the old limit plus half the old margin. In this way we make sure that one boundary can move up and independently an adjacent staggered boundarycan move down,withoutextracommunication.The neighboringboundaries are communicatedafter load balancing, since they are needed to determine the zones for communication. When pressure scaling is applied, the limits are increased by 2% to allow the system to adjust at the next domain decomposition before hitting the cutoff restrictions imposed by the staggering.

In practical tests, load imbalances of a factor of 2 on several hundreds of processors were reduced to 2% after a few load balancingsteps or a couple of secondsof simulation time.

IV. Parallel Holonomic Constraints

There are two strong reasons for using constraints in simulations: First, a physical reason that constraints can be considered a more faithful representation of chemical bonds in their quantum mechanical ground state than a classical harmonic potential. Second, a practical reason because rapid bond vibrations limit the time step. Removing these vibrations by constraining the bonds thus allows us to increase the time step and significantly improve absolute simulation performance.A frequentlyused rule-of-thumbis 1 fs without constraints, 1.4 fs with bonds to hydrogens constrained, and 2 fs when all bonds are constrained. Unfortunately, the common SHAKE1 constraint algorithm is iterative and therefore not very suitable for parallelizationsin fact, there has previously not been any efficient algorithm that could handle constraints connected over different processors due to domain decomposition. Most biomolecular packages thereforeuse constraintsonly for bonds involvinghydrogens.

By default, GROMACS uses a noniterative constraints algorithm called LINear Constraint SolVer (LINCS), which proved much easier to fully parallelize as hinted already in the original paper.20 The details of the parallel LINCS algorithm P-LINCS are described elsewhere,21 so we will only give a brief overview here. In the algorithm, the range of influence of coupled constraints is set by the order of the expansion for the matrix inversion. It is only necessary to communicate a subset of the old and new unconstrained coordinates between neighboring cells before applying the constraints. The atoms connected by up to “one plus the expansion order” bonds away need to be communicated. We can then constrain the local bonds plus the extra bonds. The communicated atoms will not have the final correctly constraint positions (since they interact with additional neighbors), but the local atoms will. The beauty of the algorithm is that normal molecular simulation only requires a first, linear correction and a single iterative step. In both these steps updated positions are communicated and adjustment forces calculatedlocally.The constraintcommunication can be accomplished with a single forward and backward pulse of the decomposition grid in each dimension, similar to the domain decomposition communication. The results of P-LINCS in GROMACS are binary identical to those of the single processor version.

In principle a similar method could be used to parallelize other constraint algorithms. However, apart from multiple communicationsteps for iterativemethods such as SHAKE,1 another problem is that one does not know a priori which atoms need to be communicated, because the number of iterations is not fixed. To our best knowledge, this is the first efficient implementation of an holonomic constraint algorithm for domain decomposition.21

(Parte 1 de 4)