**UFRJ**

# Quantum ESPRESSO

(Part **4** of 7)

4.7. XSPECTRA

The XSPECTRA code calculates K-edge X-ray absorption spectra (XAS). It computes the XAS cross-section including both dipolar and quadrupolar matrix elements. The code acts as a post-processing tool that uses the self-consistent charge density produced by PWscf. The all-electron wavefunction is reconstructed using the PAW method and its implementation in the GIPAW code. The presence of a core-hole in the final state of the X-ray absorption process is simulated by using a pseudopotential for the absorbing atom with a hole in the 1s state. The calculation of the charge density is performed on a supercell with one absorbing atom. From the self-consistent charge density, the X-ray absorption spectra are obtained using the Lanczos method and a continued fraction expansion [65, 152]. The advantage of this approach is that once the charge density is known it is not necessary to calculate empty bands to describe very high energy features of the spectrum. Correlation effects can be simulated in a mean-field way using the Hubbard U correction [86], which was included in the XSPECTRA code in Ref. [153]. Currently the code is limited to collinear magnetism; its extension to noncollinear magnetism is under development.
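The Lanczos and continued-fraction idea can be illustrated with a minimal sketch. This is not the XSPECTRA implementation: the dense matrix `H`, the starting vector `v0` (standing in for the transition operator applied to the core state), and all function names are assumptions made for illustration. The sketch builds the Lanczos coefficients of a real symmetric matrix and evaluates the resolvent matrix element as a continued fraction, so the spectrum follows without diagonalizing `H` or computing its empty states.

```python
import math


def matvec(H, v):
    """Multiply a dense real symmetric matrix (list of rows) by a vector."""
    return [sum(h * x for h, x in zip(row, v)) for row in H]


def lanczos(H, v0, nsteps):
    """Return (a, b, weight): diagonal and off-diagonal Lanczos coefficients
    and the squared norm of the starting vector v0."""
    weight = sum(x * x for x in v0)
    q = [x / math.sqrt(weight) for x in v0]
    q_prev = [0.0] * len(v0)
    a, b, beta = [], [], 0.0
    for _ in range(nsteps):
        w = matvec(H, q)
        alpha = sum(wi * qi for wi, qi in zip(w, q))
        w = [wi - alpha * qi - beta * pi for wi, qi, pi in zip(w, q, q_prev)]
        a.append(alpha)
        beta = math.sqrt(sum(x * x for x in w))
        if beta < 1e-12:  # Krylov space exhausted, stop early
            break
        b.append(beta)
        q_prev, q = q, [x / beta for x in w]
    # Keep only the couplings that link two retained diagonal elements.
    return a, b[: len(a) - 1], weight


def resolvent(z, a, b, weight):
    """Evaluate <v0|(z - H)^-1|v0> as a continued fraction, bottom-up."""
    g = 0j
    for i in reversed(range(len(a))):
        coupling = b[i] ** 2 * g if i < len(b) else 0.0
        g = 1.0 / (z - a[i] - coupling)
    return weight * g


def spectrum(energy, broadening, a, b, weight):
    """Lorentzian-broadened spectral function, -Im G(E + i*gamma) / pi."""
    return -resolvent(complex(energy, broadening), a, b, weight).imag / math.pi
```

For a toy two-level system, `a, b, w = lanczos([[0.0, 1.0], [1.0, 0.0]], [1.0, 0.0], 2)` followed by `spectrum(e, 0.1, a, b, w)` gives peaks near `e = ±1`. In a realistic calculation the number of Lanczos steps, not the number of empty bands, controls the convergence of the spectrum.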

4.8. Wannier90

Wannier90 [26, 154] is a code that calculates maximally-localized Wannier functions in insulators or metals, according to the algorithms described in Refs. [61, 62], together with a number of properties that can be conveniently expressed in a Wannier basis. The code is developed and maintained independently by a Wannier development group [154] and can be taken as a representative example of the philosophy described earlier, where a project maintains its own individual distribution but provides full interoperability with the core components of QUANTUM ESPRESSO, in this case PWscf or CP. These codes are in fact used as “quantum engines” to produce the data on which Wannier90 operates. The need to provide transparent protocols for interoperability has in turn facilitated the interfacing of Wannier90 with other quantum engines [21, 14], fostering a collaborative engagement with the broader electronic-structure community that is also in the spirit of QUANTUM ESPRESSO.

Wannier90 requires as input the scalar products between wavefunctions at neighboring k-points, where the k-points form uniform meshes in the Brillouin zone. Often, it is also convenient to provide scalar products between wavefunctions and trial, localized real-space orbitals; these are used to guide the localization procedure towards a desired, physical minimum. As such, the code is not tied to a representation of the wavefunctions in any particular basis: for PWscf and CP a post-processing utility is in charge of calculating these scalar products using the plane-wave basis set of QUANTUM ESPRESSO and either NC-PPs or US-PPs. Whenever Γ sampling is used, the simplified algorithm of Ref. [155] is adopted.
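As a rough illustration of the first kind of input, the sketch below computes the overlap matrix between the periodic parts of the wavefunctions at a k-point and at a neighboring k-point from their plane-wave coefficients. It is a simplified, hypothetical example and not the actual post-processing utility: it assumes norm-conserving pseudopotentials (US-PPs require additional augmentation terms), assumes both sets of coefficients are stored on the same ordered list of G-vectors, and uses the invented name `overlap_matrix`.

```python
def overlap_matrix(coeffs_k, coeffs_kb):
    """Return M[m][n] = sum_G conj(c_m(G)) * c_n(G).

    coeffs_k[m]  : plane-wave coefficients of band m at k
    coeffs_kb[n] : plane-wave coefficients of band n at the neighbor k + b
    Both lists are assumed to refer to the same ordered set of G-vectors.
    """
    return [
        [sum(cm.conjugate() * cn for cm, cn in zip(band_m, band_n))
         for band_n in coeffs_kb]
        for band_m in coeffs_k
    ]
```

The result is a square matrix with one row and one column per band; one such matrix is needed for every pair of neighboring k-points on the mesh.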

Besides calculating maximally localized Wannier functions, the code is able to construct the Hamiltonian matrix in this localized basis, providing a chemically accurate, and transferable, tight-binding representation of the electronic structure of the system. This, in turn, can be used to construct Green’s functions and self-energies for ballistic transport calculations [156, 157], to determine the electronic structure and DOS of very large scale structures [157], to interpolate accurately the electronic band structure (i.e. the Hamiltonian) across the Brillouin zone [157, 158], or to interpolate any other operator [158]. These latter capabilities are especially useful for the calculation of integrals that depend sensitively on a submanifold of states; common examples come from properties that depend sensitively on the Fermi surface, such as electronic conductivity, electron-phonon couplings, Knight shifts, or the anomalous Hall effect. A related by-product of Wannier90 is the capability of downfolding a selected, physically significant manifold of bands into a minimal but accurate basis, to be used for model Hamiltonians that can be treated with complex many-body approaches.
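Band-structure interpolation from a localized basis amounts to a Fourier sum of the real-space Hamiltonian blocks, H(k) = Σ_R exp(ik·R) H(R). The following minimal sketch shows this for a one-dimensional lattice with unit lattice constant. The function name `interpolate_hk`, the dictionary layout of `h_r`, and the toy hopping values are assumptions for illustration and do not reflect the Wannier90 data format.

```python
import cmath


def interpolate_hk(h_r, k):
    """Return H(k) = sum_R exp(i*k*R) * H(R) for a 1D lattice.

    h_r maps an integer lattice vector R to a square matrix (list of rows)
    holding the Hamiltonian between the home cell and the cell at R.
    """
    size = len(next(iter(h_r.values())))
    hk = [[0j] * size for _ in range(size)]
    for R, block in h_r.items():
        phase = cmath.exp(1j * k * R)
        for i in range(size):
            for j in range(size):
                hk[i][j] += phase * block[i][j]
    return hk


# Toy single-orbital chain: on-site energy 0.5, nearest-neighbor hopping -1.0.
chain = {0: [[0.5]], 1: [[-1.0]], -1: [[-1.0]]}
```

For this chain, `interpolate_hk(chain, k)[0][0]` reproduces the analytic band 0.5 - 2 cos(k) at any k, including points that were never part of an original mesh. Because H(R) decays quickly with |R| in a well-localized basis, few blocks are needed.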

4.9. PostProc

The PostProc module contains a number of codes for post-processing and analysis of data files produced by PWscf and CP. The following operations can be performed:

• Interfacing to graphical and molecular graphics applications. Charge and spin density, potentials, ELF [68] and STM images [67] are extracted or calculated and written to files that can be directly read by most common plotting programs, like xcrysden [159] and VMD [160].

• Interfaces to other codes that use DFT results from QUANTUM ESPRESSO for further calculations, such as: pw2wannier90, an interface to the wannier90 library and code [26, 154] (also included in the QUANTUM ESPRESSO distribution); pw2casino.f90, an interface to the casino quantum Monte Carlo code [161]; wannier_ham.f90, a tool to build a tight-binding representation of the KS Hamiltonian to be used by the dmft code [162] (available at the qe-forge site); pw_export.f90, an interface to the GW code SaX [163]; pw2gw.f90, an interface to code DP [164] for dielectric property calculations, and to code EXC [165] for excited-state properties.

• Calculation of various quantities that are useful for the analysis of the results. In addition to the already mentioned ELF and STM, one can calculate projections over atomic states (e.g. Löwdin charges [69]), DOS and Projected DOS (PDOS), planar and spherical averages, and the complex macroscopic dielectric function in the random-phase approximation (RPA).

Figure 2. Snapshot of the PWgui application. Left: PWgui’s main window; right: preview of specified input data in text mode.

4.10. PWgui

PWgui is the graphical user interface (GUI) for the PWscf, PHonon, and atomic packages as well as for some of the main codes in PostProc (e.g. pp.x and projwfc.x). PWgui is an input file builder whose main goal is to lower the learning barrier for the newcomer, who has to struggle with the input syntax. Its event-driven mechanism automatically adjusts the display of required input fields (i.e. enables certain sets of widgets and disables others) to the specific cases selected (see Fig. 2, left panel). It enables a preview of the format of the (required) input file records for a given type of calculation (see Fig. 2, right panel). The input files created by PWgui are guaranteed to be syntactically correct (although they can still be physically meaningless). It is possible to upload previously generated input files for syntax checking and/or to modify them. It is also possible to run calculations from within PWgui. In addition, PWgui can use the external xcrysden program [159] for the visualization of molecular and/or crystal structures from the specified input data and for the visualization of properties (e.g. charge densities or STM images).


Table 1. Summary of parallelization levels in QUANTUM ESPRESSO.

| group | distributed quantities | communications | performance |
| --- | --- | --- | --- |
| image | NEB images | very low | linear CPU scaling, fair to good load balancing; does not distribute RAM |
| pool | k-points | low | almost linear CPU scaling, fair to good load balancing; does not distribute RAM |
| plane-wave | plane waves, G-vector coefficients, R-space FFT arrays | high | good CPU scaling, good load balancing, distributes most RAM |
| task | FFT on electron states | high | improves load balancing |
| linear algebra | subspace Hamiltonians and constraints matrices | very high | improves scaling, distributes more RAM |

As the QUANTUM ESPRESSO codes evolve, the input file syntax expands as well. This implies that PWgui has to be continuously adapted. To deal effectively with this issue, PWgui uses the GUIB concept [166]. GUIB builds on the consideration that the input files for numerical simulation codes have a rather simple structure, and it exploits this simplicity by defining a special meta-language with two purposes: the first is to define the input-file syntax, and the second is to simultaneously automate the construction of the GUI on the basis of such a definition.

A similar strategy has recently been adopted for the description of the QUANTUM ESPRESSO input file formats. A single definition/description of a given input file serves i) as documentation per se, ii) as PWgui help documentation, and iii) as a utility to synchronize PWgui with up-to-date input file formats.
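The single-definition idea can be sketched in a few lines. The example below is purely illustrative and does not use GUIB syntax or any actual PWgui definition file: the `SCHEMA` dictionary and the functions `validate`, `help_text` and `widgets` are hypothetical names, and only two input variables are described. It shows how one description of the input can drive syntax checking, help text, and the choice of GUI widgets at the same time.

```python
# Hypothetical single source of truth for two input variables.
SCHEMA = {
    "calculation": {"type": str, "choices": ("scf", "relax"),
                    "doc": "Type of calculation to run."},
    "ecutwfc": {"type": float,
                "doc": "Kinetic energy cutoff for wavefunctions (Ry)."},
}


def validate(values, schema=SCHEMA):
    """Return a list of syntax errors found in a dict of input values."""
    errors = []
    for name, value in values.items():
        spec = schema.get(name)
        if spec is None:
            errors.append(f"unknown variable: {name}")
        elif not isinstance(value, spec["type"]):
            errors.append(f"{name}: expected {spec['type'].__name__}")
        elif "choices" in spec and value not in spec["choices"]:
            errors.append(f"{name}: must be one of {spec['choices']}")
    return errors


def help_text(schema=SCHEMA):
    """Generate user documentation from the same definition."""
    return "\n".join(f"{name} ({spec['type'].__name__}): {spec['doc']}"
                     for name, spec in schema.items())


def widgets(schema=SCHEMA):
    """Choose a widget kind per variable: a dropdown when choices are listed."""
    return [(name, "dropdown" if "choices" in spec else "entry")
            for name, spec in schema.items()]
```

With this layout, adding a new input variable means adding one schema entry; the checker, the help text and the widget list all follow from it.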

Keeping pace with the evolution of high-end supercomputers is one of the guiding principles in the design of QUANTUM ESPRESSO, with a significant effort being dedicated to porting it to the latest available architectures. This effort is motivated not only by the need to stay at the forefront of architectural innovation for large to very-large scale materials science simulations, but also by the speed at which hardware features specifically designed for supercomputers find their way into commodity computers.

The architecture of today’s supercomputers is characterized by multiple levels and layers of inter-processor communication: the bottom layer is the one affecting the instruction set of a single core (simultaneous multithreading, hyperthreading); then one has parallel processing at processor level (many CPU cores inside a single processor sharing caches) and at node level (many processors sharing the same memory inside the node); at the top level, many nodes are finally interconnected with a high-performance network. The main components of the QUANTUM ESPRESSO distribution are designed to exploit this highly structured hardware hierarchy. High performance on massively parallel architectures is achieved by distributing both data and computations in a hierarchical way across available processors, ending up with multiple parallelization levels [167] that can be tuned to the specific application and to the specific architecture. This remarkable characteristic makes it possible for the main codes of the distribution to run in parallel on most or all parallel machines with very good performance in all cases.

In more detail, the various parallelization levels are organized into a hierarchy of processor groups, identified by different MPI communicators. In this hierarchy, groups implementing coarser-grained parallel tasks are split into groups implementing finer-grained parallel tasks.

The first level is image parallelization, implemented by dividing processors into nimage groups, each taking care of one or more images (i.e. a point in the configuration space, used by the NEB method). The second level is pool parallelization, implemented by further dividing each group of processors into npool pools of processors, each taking care of one or more k-points. The third level is plane-wave parallelization, implemented by distributing real- and reciprocal-space grids across the nPW processors of each pool. The final level is task group parallelization [168], in which processors are divided into ntask task groups of nFFT = nPW/ntask processors, each one taking care of different groups of electron states to be Fourier-transformed, while each FFT is parallelized inside a task group. A further parallelization level, linear-algebra, coexists side by side with plane-wave parallelization, i.e. they take care of different sets of operations, with different data distribution. Linear-algebra parallelization is implemented both with custom algorithms and using ScaLAPACK [169]; the latter yields far better performance on massively parallel machines. Table 1 contains a summary of the five levels currently implemented. With the recent addition of the last two levels, most parallelization bottlenecks have been removed, while both computations and data structures are fully distributed.
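The nesting of these groups can be made concrete with a small sketch that maps a global processor rank to its position in the hierarchy. It assumes a contiguous block layout and exact divisibility at every level; both are simplifying assumptions, and the function name `decompose` and the returned keys are hypothetical rather than taken from the QUANTUM ESPRESSO sources.

```python
def decompose(rank, nproc, nimage, npool, ntask):
    """Map a global rank to (image, pool, task group) indices.

    Assumes contiguous blocks of ranks at each level and that the processor
    counts divide evenly: nproc by nimage, then by npool, then by ntask.
    """
    if nproc % nimage or (nproc // nimage) % npool:
        raise ValueError("nproc must be divisible by nimage * npool")
    per_image = nproc // nimage      # processors assigned to each image
    n_pw = per_image // npool        # processors per pool (plane-wave level)
    if n_pw % ntask:
        raise ValueError("pool size must be divisible by ntask")
    n_fft = n_pw // ntask            # processors per task group
    image, rank_in_image = divmod(rank, per_image)
    pool, rank_in_pool = divmod(rank_in_image, n_pw)
    task_group, rank_in_task = divmod(rank_in_pool, n_fft)
    return {
        "image": image,
        "pool": pool,
        "rank_in_pool": rank_in_pool,
        "task_group": task_group,
        "rank_in_task_group": rank_in_task,
        "n_pw": n_pw,
        "n_fft": n_fft,
    }
```

For example, with 64 processors, 2 images, 4 pools per image and 2 task groups per pool, each pool holds 8 processors and each task group 4, so every FFT is spread over 4 processors while different task groups transform different groups of electron states.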

This being said, the size and nature of the specific application set quite natural limits to the maximum number of processors up to which the performance of the various codes is expected to scale. For instance, the number of k-points in the calculation sets a natural limit to the size of each pool, or the number of electronic bands sets a limit for the parallelization of the linear algebra operations. Moreover, some numerical algorithms scale better than others. For example, the use of norm-conserving pseudopotentials allows for better scaling than ultrasoft pseudopotentials for the same system, because a larger plane-wave basis set and larger real- and reciprocal-space grids are required in the former case. On the other hand, using ultrasoft pseudopotentials is generally faster because the smaller basis set is more efficient, even though the overall parallel performance may not be as good.

Simulations on systems containing several hundreds of atoms are by now quite standard (see Fig. 3 for an example). Scalability does not yet extend to tens of thousands of processors as in especially-crafted codes like QBox [170], but excellent scalability on up to 4800 processors has been demonstrated (see Fig. 4) even for cases where coarse-grained parallelization does not help, using only MPI parallelization. We remark that the results for CNT (2) in Fig. 4 were obtained with an earlier version of the CP code that didn’t use ScaLAPACK; the current version performs better in terms of scalability.

[Figure 3: two panels, "Scalability for < 1000 processors". Left: CPU time (s) versus number of CPUs. Right: speedup versus number of CPUs, with the ideal scaling slope. Curves: BG/P and Altix, each with 1, 4 and 8 task groups.]

Figure 3. Scalability for medium-size calculations (CP code). CPU time (s) per electronic time step (left panel) and speedup with respect to 32 processors (right panel) as a function of the number of processors and for different numbers ntask of task groups, on an IBM BlueGene/P (BG/P) and on an SGI Altix. The system is a fragment of an Aβ-peptide in water containing 838 atoms and 2311 electrons in a 2.1 × 2.9 × 19.9 Å³ cell, ultrasoft pseudopotentials, Γ point, 25 Ry and 250 Ry cutoff for the orbitals and the charge density respectively.
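Speedup relative to a reference processor count, as plotted in the scalability figures, can be computed from measured timings as in the following sketch. The timings in the example are hypothetical placeholders, not data from the figures, and the function names are chosen for illustration only.

```python
def relative_speedup(timings, n_ref):
    """Speedup of each run relative to the run on n_ref processors.

    timings maps a processor count to the measured time for the same workload.
    """
    t_ref = timings[n_ref]
    return {n: t_ref / t for n, t in sorted(timings.items())}


def parallel_efficiency(timings, n_ref):
    """Speedup divided by the ideal speedup n / n_ref (1.0 means ideal)."""
    speedups = relative_speedup(timings, n_ref)
    return {n: s / (n / n_ref) for n, s in speedups.items()}


# Hypothetical timings in seconds, for illustration only.
example_timings = {32: 400.0, 64: 210.0, 128: 115.0}
```

An efficiency that stays close to 1.0 as the processor count grows corresponds to a curve that follows the ideal scaling slope; a falling efficiency marks the point where adding processors stops paying off for that system size.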

[Figure 4: two panels, "Scalability for > 1000 processors". Left: wall time (s) versus number of CPUs. Right: speedup versus number of CPUs, with the ideal scaling slope. Curves: PSIWAT, CNT (1), CNT (2).]

Figure 4. Scalability for large-scale calculations: Wall time (left panel) and speedup (right panel) as a function of the number of processors. PSIWAT: PWscf code, npool = 4, [...] CNT (1): PWscf code, ntask = 4, on a Cray XT 4. The system is a porphyrin-functionalized nanotube, Γ point, 1532 atoms, 5232 electrons. CNT (2): CP code on a Cray XT3, same system as for CNT (1). Times for PSIWAT and CNT (1) are for 10 and 2 self-consistency iterations, respectively; times for CNT (2) are for 10 electronic steps plus 1 Car-Parrinello step, divided by 2 so that they fall in the same range as for CNT (1).

The efforts of the QUANTUM ESPRESSO developers’ team are not limited to the performance on massively parallel architectures. Special attention is also paid to optimizing the performance for simulations of intermediate size (on systems comprising from several tens to a few hundreds of inequivalent atoms), to be performed on medium-size clusters, readily
