**Divide and Conquer Hartree-Fock Calculations on proteins**

Divide and Conquer Hartree-Fock Calculations on proteins

(Parte **1** de 3)

Divide and Conquer Hartree-Fock Calculations on Proteins

Xiao He and Kenneth M. Merz, Jr.*

Department of Chemistry and the Quantum Theory Project, 2328 New Physics Building, P.O. Box 118435, UniVersity of Florida, GainesVille, Florida 32611-8435

Received December 9, 2009

Abstract: The ability to perform ab initio electronic structure calculations that scale linearly with the system size is one of the central aims in theoretical chemistry. In this study, the implementation of the divide and conquer (DC) algorithm, an algorithm with the potential to aid the achievement of true linear scaling within Hartree-Fock (HF) theory, is revisited. Standard HF calculations solve the Roothaan-Hall equations for the whole system; in the DC-HF approach, the diagonalization of the Fock matrix is carried out on smaller subsystems. The DC algorithm for HF calculations was validated on polyglycines, polyalanines, and 1 real threedimensional proteins of up to 608 atoms in this work. We also found that a fragment-based initial guess using the molecular fractionation with conjugated caps (MFCC) method significantly reduces the number of SCF cycles and even is capable of achieving convergence for some globular proteins where the simple superposition of atomic densities (SAD) initial guess fails.

Introduction

Ab initio quantum mechanicalmethods have been developed over the past several decades and successfully applied to the study of the chemical properties for small- to moderate-sized molecules. The routine application of these full quantum mechanical calculations on macromolecules (molecules containing greater than 500 atoms) continues to pose a great challenge for theoretical chemists. The major limitation of ab initio methods is the scaling problem, since the computational cost of ab initio methods increases considerably as the size of the molecule increases. For instance, Hartree- Fock (HF)1 and density functional theory (DFT)2 scale as O(N4), post-Hartree-Fock MP23 scales as O(N5), and the coupled cluster(C)4-9 method that includes single and double excitations (CCSD) scales as O(N6). In modern HF calculations, the computational cost for the 2-electron integrals can be reduced from O(N4)t oO (N2) using a simple screening technique.10 Hence, the dominant step for large molecules becomes the matrix diagonalization, which scales as O(N3). In this study, our goal was to reduce the computationalcost of the diagonalizationstep in HF calculations to linear with system size.

The state-of-the-art linear-scaling algorithms, which make the computational cost scale linearly O(N) with the system size, have attracted the focus of many theorists during the past decade.1-21 Much effort has been devoted to the development of linear-scaling methods in order to compute the total energy of large molecular systems at the Hartree- Fock(HF)or densityfunctionaltheory(DFT)level.12,15,18,2-26 One of the challenges is to speed up the calculation of longrange Coulomb interactions when assembling the Fock matrix elements. Fast multipole-based approaches have successfullyreducedthe scalingin systemsize to linear14,16-18,25 and made HF and DFT calculations affordable for larger systemswhen small- to moderate-sizedbasis sets are utilized. The more recently developed Fourier transform Coulomb method of Fusti and Pulay27,28 reduced the steep O(N4) scaling in basis set size to quadratic and makes the calculations much more affordable with larger basis sets.29 There is also a class of fragment-based methods for quantum calculation of protein systems including the divide and conquer (DC) method of Yang,2 Yang and Lee,23 Dixon and Merz,30 Gogonea et al.,31 Shaw and St-Amant,32 and Nakai and co-workers,3-36 the adjustable density matrix assembler (ADMA) approach method of Exner and Mezey,26,37-39 the fragmentmolecularorbital(FMO)method of Kitaura and co-workers,13,40,41 and the molecular frac-

* Correspondence author phone: 352-392-6973; fax: 352-392- 8722, e-mail: merz@qtp.ufl.edu.

J. Chem. Theory Comput. X, x, 0 A

10.1021/ct9006635 X American Chemical Society

tionation with conjugate caps (MFCC) approach developed by Zhang and co-workers.42,43 Most applications of these methods to protein systems have been largely limited to semiempirical, HF, and DFT calculations. Among these approaches, FMO has been applied to higher level ab initio calculations such as second-order Møller-Plesset perturbation theory(MP2)4 and coupledclustertheory(C).45 Nakai and co-workers recently proposed DC-MP233,36,46 and DCCCSD47 approaches; however, only systems of linear chains or near-linear chains have been tested so far for the divide and conquer algorithm for ab initio calculations.

In the DC algorithm, the total system is divided into small fragments. Atoms within adjustable buffer regions surrounding each fragmentare includedin the calculationsto preserve the chemical environment of the divided subsystem. A set of local Roothaan-Hall equations is then solved for each subsystem, and an approximate full density matrix of the entire molecular system is built up from subsystem contributions. By solving the HF self-consistent field (SCF) equation iteratively, the final converged full density matrix is used to obtain the total energy of the entire system. In this manner, linear scaling of the Fock matrix diagonalization step is achievedas a result of the fact that a set of smaller subsystem Fock matrices is diagonalized in the DC-HF approach rather than the global Fock matrix diagonalization for traditional HF calculations. Furthermore, divide and conquer calculations may be efficiently parallelized because the individual subsystem calculations are solved separately. In the DC-HF approach, the memory usage will increase linearly as the size of the system increases, which is also an important feature of this approach.

The aim of our current research is to further develop and validate the divide and conquer (DC)2,23,30,32,46-48 methodology to aid in the application of ab initio methods to biomacromolecules. In this study, our goal is to validate the divide and conquer algorithm for Hartree-Fock calculations on globularproteins.Moreover,we proposea fragment-based

Figure 1. Graphical representationof the subsetting scheme used in divide and conquer calculations.

Figure 2. MFCC scheme in which the peptide bond is cut

(a) and the fragments are capped with Ccap and its conjugate Ccap* (b). (c) Atomic structure of the concap. The concap is defined as the fused molecular species of Ccap*-Ccap.

Figure 3. Subsetting schemes for divide and conquer calculations on the extended polyglycine (Gly)n(upper) and polyalanine in an R-helical structure (R-(Ala)n, bottom).

Figure4. Averagecomputationaltimeto diagonalizethe Fock matrix in each SCF cycle using traditional HF and DC-HF for a series of extended polyglycines at the HF/6-31G* level.

B J. Chem. Theory Comput., Vol. x, No. x, X He and Merz

initial guess using molecular fractionation with conjugated caps (MFCC) method to reduce the number of SCF cycles, and different division schemes are compared.

Computational Approaches

Divide and Conquer Approach on the Hartree-Fock

Calculations. In protein systems, the divide and conquer approach is based on the chemical locality; this assumes that local regions of a protein are only weakly influenced by the atomsthat are far away from the regionof interest.The whole system is divided into fragments called core regions (CoreR). A buffer region (BufferR) is assigned for each core region to account for the environmental effects. The combination of every core region and its buffer region constitutes each individual subsystem (R) as illustrated in Figure 1. Local MOs of each subsystem are solved by the Roothaan-Hall equation where FR and SR are local Fock matrix and local overlap matrix, respectively.

After the local MO coefficient matrices CR are obtained, the total density matrix of the whole system is given by where DµνR is the partition matrix and pµνR is the local density matrix defined by where niR is a smooth approximation to the Heaviside step function εF is determined through the normalization of the total number of electrons of the whole system

After the density matrix is converged, the total HF energy is given as where HµνR is the local one-electron core Hamiltonian matrix similar to the definition of local Fock matrix in eq 2.

For HF calculations on large systems, the construction of the Coulomb matrix and exchange matrix along with the diagonalization of the Fock matrix constitute the three most time-consuming steps. The Hamiltonian matrix diagonalization intrinsically scales as O(N3). In the divide and conquer scheme the diagonalization calculation is performed on each submatrix, which will naturally make the SCF diagonalization step scale linearly with the number of subsystems. However,it is importantto realizethat the divideand conquer algorithm does not help to reduce the scale of computation of the Coulomb matrix and exchange matrix. The continuous fast multipolemethod (CFMM)14,16-18,25,49-51 and the linear exchange K approach (LinK)52,53 provide ways in which the Coulomb matrix and exchange matrix can be made linear scaling, respectively.

Figure 5. Accuracy of the total energy calculated by the DCHF approach on extended polyglycine systems compared to full system calculations.

Figure 6. Similar to Figure 4 but for the polyalanine systems in an R-helical structure R-(Ala)n.

FRCR ) SRCRER (1)

Nsub

Nsub DµνR pµνR (3)

DµνR ) {1 φµ ∈ CoreR and φν ∈ CoreR φµ ∈ BufferR and φν ∈ CoreR

LMOs niRCµiRCνi R* (5)

Divide and Conquer Hartree-Fock Calculations J. Chem. Theory Comput., Vol. x, No. x, X C

MFCC InitialGuess.Here we introducea fragment-based initial guess for ab initio calculations using the molecular fractionation with conjugate caps (MFCC) algorithm as described elsewhere.42,54,5 In the spirit of the MFCC approach, the full density matrix of the protein can be assembled by a linear combination of fragment density matrices

where Pµνf (i) is the density matrix element of the ith protein fragment and Pµνc(j) is the density matrix element of the jth conjugate cap. Nf and Nc are the total number of the protein fragments and conjugate caps, respectively. The MFCC partition scheme to cut a protein into amino acid fragments and conjugate caps is shown in Figure 2. First, a series of single-point HF calculations on all the fragments and conjugate caps are performed; then the full density matrix of the protein obtained using the converged fragment density matrices based on eq 9 is taken as the initial guess for the subsequent divide and conquer HF calculations. All ab initio calculations were implemented in an in-house-developed quantum chemistry package QUICK.56

Results and Discussion

Accuracy and Timing Comparisons. In this section we assess the DC-HF approach performance by performing calculations on two types of simple systems: extended polyglycine (gly)n and an R-helix of polyalanine (R -(ala)n, see Figure 3). All calculationsdiscussed here use the 6-31G* basis set. A buffer radius of Rb ) 5.0 Å was adopted for all DC-HF calculations. Within this definition we include all the residues that contain any atoms within 5 Å from the core region as part of the buffer region. A comparison of the CPU time required to solve the SCF equations on the extended polyglycine (gly)n (n ) 6-40) using the standard HF and DC-HF approaches is shown in Figure 4. As expected, the computational scale for the DC-HF diagonalization calculation is O(N), in contrast to O(N2.9) for the traditional HF SCF diagonalization on the full Fock matrix of the entire system. Moreover, since the polyglycine is extended, the basis set cross-over point is between 485 and 749. Figure 5 shows the deviation of DC-HF energy compared to the full system calculation on extended polyglycine systems. The error becomes larger as the size of the system increases; however, all of the deviations are within 0.04 kcal mol-1 .

This good accuracy suggests that we can employ the divide and conquer scheme to study large, 3-dimensional systems.

The computational cost and accuracy of DC-HF for R-(ala)n (n ) 10-40) systems are illustrated in Figures 6 and 7, respectively. Because of the helix structure, each subsystem contains a larger number of residues than in the extended system using the same buffer size. As illustrated in Figure 6, the cross-over point is around 1789, which is over 2 times larger than for the polyglycine example. Each DC-HF diagonalization SCF cycle in this example scales as O(N1.1), in contrast to O(N2.7) for the traditional HF diagonalization cost. Furthermore, the total energy errors for the R-helical polyalanines are slightly larger than those for the extended polyglycine systems (see Figure 7), but they are still in a good agreement with the full system calculations since the largest error is less than 0.7 kcal mol-1 for R-(ala)40. In the current DC-HF approach,the scale for the computa- tion of the Coulomb matrix is still O(N2) after prescreening

Figure 7. Similar to Figure 5 but for the polyalanine systems in an R-helix structure R-(Ala)n.

Table 1. Number of SCF Cycles Needed To Reach Convergence for the SAD and MFCC Initial Guess at the HF/6-31G* Level

DC non-DCa system

SAD initialguess

MFCC initialguess SAD initialguess

MFCC initial guess a In the SCF procedure of the non-DC case every 10 previous

Fock matrices were stored in the DIIS calculations, while for the DC case every 2 previous Fock matrices were stored in the DIIS calculations until the root-mean-squared change of the density matrix elements reaches 10-4 au, after which the DIIS technique was turned off.

Table 2. Converged Total Energies (au) (at the HF/6-31G* level) Using Two Different Subsetting Schemes: Residue Based with a Buffer of 5 Å and Atom Based with a Buffer of7Å a system residue-centric core region atom-centric core region deviation (kcal mol-1) a MUD: mean unsigned deviation.

D J. Chem. Theory Comput., Vol. x, No. x, X He and Merz

(Parte **1** de 3)