Parallel Algorithms in Computational Chemistry

````````````````````````````````````````````````````````````````````````````````````````````````````
The material here have been
gathered and written by
Dr. Roland Lindh
Dept. of Chemical Physics
University of Lund, Sweden.
 
Audience: students at NGSSC (National Graduate School of Scientific Computing) in Sweden. These student are not necessarily familiar with computational chemistry or chemistry in general for that matter.
Objective: to introduce the students to some of the course grain parallelization used in computational chemistry. This is NOT and I repeat NOT an attempt to teach or present the subject of quantum chemistry (a matter which hardly can even be properly introduced by two lectures).

The purpose of the two following lectures is to give a brief introduction to parallel implementations in ab initio theory.  The key operations listed below are the most important procedures in ab initio and semi-empirical theory and have been the subject to parallel implementations for the last 15 years and will be the focus of these lectures.

 
  • Fock matrix construction
  • Integral transformation
  • Sigma vector construction
  •  

    The nature of ab initio calculations with respect to the possibility of course grain parallelization will be addressed. The first lecture will give some brief theoretical background to theory in ab initio. This will form the frame work in which we will base our discussion on how to parallelize codes in computational chemistry. The second lecture will be of a more detailed study of strategies for parallelization of the Fock matrix generation and integral transformation.

    The format of the notes is a bit different from conventional overheads in the sense that the must carry more information. The lecturer will together with the overheads give the whole picture. For notes on the web the information has to be more complete making the pages unacceptable in general as overheads. The lectures are planned to be converted to overheads in the near future.
     

     
    Constructive criticisms and comments are welcome to Roland Lindh, all other communication is kindly refered to /dev/null.
     
     

    Table of content

    Lecture1

    Lecture 2 Related program info
    Lecture 1: Introduction
    The object of the following lecture is to give a brief introduction to ab initio techniques. The lecture will lead to a better understanding of the computational challanges in ab initio theory. In particular we want the students to get a technical backround for the future discussions of parallelization of computational chemistry.

    The lecture will introduce the concepts of

     Gaussian Basis Functions
    In the modelling of classical dynamics the fundamental building block in the computer implementation are the finite elements, which are represented as values on a grid. The approach with adaptive grids allow this appoach to model in principal any functional behaviour. In ab initio theory, however, the key building blocks to describe the wave functions and the molecular orbitals are analytical functions. Early in the development of ab initio theory these functions were the so-called Slater functions,
    where r are the electronic coordinates, alpha is the exponetial coefficient, A is the center of the Slater and n is the angular index vector.
     
     
    However, over the last 40 years the use of Slaters have been replaced by the use of Gaussians,

    the reason for this transition lies in the computaional advantages of Gaussians vs. Slaters in that the former allows for analytic expressions for the matrix elements which are used in computational chemistry. One of the fundamental properties of Gaussians is manifested in the Gaussian product theorem (GPT) which states that the product of two Gaussians can be expressed as a new Gaussian.
     

    where

     
     
     
     

    or expressed graphically

     
    In particular we should observe that the exponential prefactor of the new Gaussian in a very dramatical way reduce the significance of the product as the two parent functions separate. This is of essential nature to identify and to use to limit the calculation of matrix elements which will have values significant to the investigation (i.e. we do not want to use an elaborate procedure to just find out that the element anyway was just zero when we know that already before starting the calculation).
    Slater v.s. Gaussian
    The Slater functions are exact wave functions for hydrogen-like atoms in the non-relativistic case. Hence, they would serve as a good basis to expand the one-particle basis. However, due to the computational complications of using these functions directly they were expressed in early implementations as linear combinations of Gaussians. Nowadays the one-particle basis sets are expressed directly in Gaussians without any reference to the Slater functions.
     
     
    Contracted Basis Sets

    To reduce the number of parameters to optimize in the ab initio procedures one usually contract the primitive Gaussians to contracted Gaussian functions as
     

     

    the contracted basis is normally refered to as the atomic orbital (AO) basis. This basis will later be used to construct the molecular orbitals  (MOs) as they are defined in the computational models. The contractions schemes can be divided into two groups, namely the segmented contraction scheme

     

    and the general contraction scheme.
     
     

    The former has computational advantages but lacks the high quality of the latter. Standard basis sets today for large systems are of segmented type whereas for smaller system one usually use the general contraction scheme.

     
    The individual basis sets are divided into shells which are defined to include all basis functions of the same total angular momentum (s, p, d, f, etc.) at a given center. Below follows some standard basis sets, their contraction  and number of functions in each shell for the carbon atom
     
     
     
    Basis Prim./Cont. # of functions Type
    STO-3G 6s3p/2s1 2/3 Segmented
    6-31G 10s4p/3s2p 3/6 Segmented
    cc-pVDZ 9s4p/3s2p 3/6 Segmented
    cc-pVTZ 10s5p2d1f/4s3p2d1f 4/9/10/7 Segmented
    cc-pVQZ 12s6p3d2f1g/5s4p3d2f1g 5/12/15/14/9 Segmented
    ANO-L 14s9p4d3f/6s5p3d2f 6/15/15/14 General
     
     
    Basis set are today constructed for most elements in the periodic table. Basis set at various levels of approximation is available on the web (e.g. Pacific North National Laboratory, Washington has a splendid web page where most available basis sets can be downloaded in various formats, http://www.emsl.pnl.gov:2080/forms/basisform.html ).
     
    The Hamiltonian
    A molecular system in the time independent non-relativistic frame is described by the Schrödinger equation
     
     

    where H is the Hamiltonian, Psi is the wave function, and E is the total energy of the system. This equation have the same properties as a typical eigenfunction problem and if the one-particle space is expanded in a finite basis we will observe that operations normally encountered in linear algebra, matrix-matrix, matrix-vector mathematics are of significant importance in computational chemistry. Possibly with the difference that the matrix elements in ab initio theory are so many and so expensive to evaluate that some unconventional use of standard methods actually are the fastest in the computer implementations of ab initio theory.   Whereas the wave function vary for different ab initio models the Hamiltonian in the Born-Oppenheimer approximation is always defined as

     

    where the different terms are, starting from the left, 1) the electronic kinetic energy,  2) the nuclear-electron attraction,  3) the electron-electron repulsion, and 4) the nuclear-nuclear repulsion term. To solve the Schrödinger equation the  matrix elements over these operators need to be computed.  The two first terms are collected in the so-called one-electron term, the third is the so-called the 2-electron term, and the forth is called the nuclear potential energy term. The evaluation of these matrix elements will be, with respect to CPU time, dominated by the 2-electron term. In parallelizing ab initio code special care has to be taken to parallelize this evaluation.
     

    Electron Repulsion Integrals
    The matrix elements over the electron-electron repulsion term is a four index item and is expressed as
     

    These terms are also denoted as the two-electron integrals or electron repulsion integrals (ERIs).  The computational expense to evaluate the integrals scale proportional to N4 (where N is the total size of the one-particle basis set). To reduce the computational cost of the i ntegrals it is important to use the permutational symmetry
     

        (pq|rs)=(qp|rs)=(pq|sr)=(qp|sr)=(rs|pq)=(sr|pq)=(rs|qp)=(sr|qp)
     

     to only compute the unique shell quadruplets. The unique index of a integral batch is formed according to
     

     
    In case of point group symmetry efficient integral program  also only compute the symmetry unique integrals. Much of the efforts in the past have been to develop optimal algorithms to evaluate these integrals. The modern computer codes of today evaluate the integrals in batches of shell quadruplets for efficiency considerations. Going back to the cc-pVTZ basis set of carbon, for example, we have that the following integral classes/batches include the simultaneous evaluation of a large number of  contracted integrals
     
     
    Class # of integrals
    (ss|ss) 256
    (ss|sp) 576
    (ss|pp) 1.296
    (sp|pp) 2.916
    (pp|pp) 6.561
    ... ...
    (dd|dd) 10.000
    ... ...
      The calculation of these integral batches (an integral batch is all the integrals generated from a unique shell quadruplet combination) take considerable time to evaluate. However, the number of such batches is large in a molecular system which is not trivially small. For example, the benzene molecule in the cc-pVTZ basis would include 408.156 integral batches, including symmetry considerations of the molecule would reduce the calculation to about 50.000 unique integral batches. It is clear from these facts that the parallelization of a computer code for the evaluation of two-electron integrals should be parallelized at a course grain level and that the main considerations are
    SCF Theory

    In the Self Consistent Field theory  (Hartree-Fock theory) the wave function is approximated by a Slater-determinant. This determinant has the feature of modeling the interchange of particles which is correct for a Fermi particle system, i.e. a change of sign of the wave function under particle permutation. The optimization of the energy is achieved by minimizing the total energy with respect to the coefficients of the molecular orbitals as they are expressed in the atomic orbital basis (contracted Gaussians). The energy of the SCF wave function is expressed as

     

    where (in closed shell restricted HF)

     

    are the so-called 1st order densities and Epq is the generator of the unitary group. Using orthogonal MOs reduces the 1st order density in closed shell RHF case to

     
     
     
    To optimize the SCF wave functions one computes the Fock matrix
     

    as a function of the current 1st order density and the MO integrals. The Fock matrix is the equivalent to the gradient of the energy with respect to orbital rotations. In particular, in the MO basis the elements between occupied and virtual orbitals should be zero (Brillouin-condition) at convergence.
    The Fock matrix is in the SCF approach diagonalized to give a new MO description. This procedure is repeted until the diagonalization procedure does not yield any significant rotations of the MOs (self consistency). The optimization of the MOs could also of course be formulated as an quasi-Newton step - eliminating  the diagonalization step. This is of importance since the diagonalization procedure is not optimal for parallel implementations.

    The Fock matrix can also be evaluated directly in the AO basis
     

    where the first order density in the AO basis is expressed as

    This eliminate the need for any four-index transformation during the SCF iterations. The Fock matrix in the AO basis is also diagonalized or used in the quasi-Newton step. The only possible difference between the AO and MO scheme is how the rotations of the orbitals are expressed, i.e. are they relative to the old MOs or are they expressed directly in the AOs? The computer implementation will have the following structure

    1. Compute integral and store on disk
    2. Get starting MOs
    3. Generate 1st order densities from MOs
    4. Constructe Fock matrix from 1st order densities
    5. Generate update of  MOs (diagonalization or quasi-Newton step).
    6. If not converged back to step 3
    7. Compute final wave function and energy
    It is important to note that the SCF method is an iterative method. At convergence the MOs are eigenfunctions of the Fock matrix and the associated eigenvalues are denoted the orbital energies. Another important feature of the SCF method is that it can be expressed in the AO basis directly. This means that the method does not include any transformation steps.  The major bottleneck in large SCF calculations is that the storage of the two-electron integrals scale proportional to N4. This would, for example, require for a molecular system of twenty carbon atoms the storage of 16,254,135,150 two-electron integrals (~ 40 GBytes). A biochemical compound range 10-10,000 atoms. This storage problem has been resolved at the expense of additional CPU time. In direct SCF the integral evaluation step is changed and the integrals are evaluated as they are needed in the Fock matrix generation and then discarded. Hence, the integrals are reevaluated each iteration.  Issues important to optimal performance are
     
    MP2 Theory
    The SCF model include electron-electron correlation only in an average field approximation. This is a serious limitation of the method. However, under the assumption that the effect of the explicit electron-electron correlation is small perturbation theory  can offer some assistance in improving the model. In perturbation theory the Hamiltonian is subdivided as
     
    where H0 is an operator for which we have the exact eigenfunctions and eigenvalues, lambda is the perturbation parameter and V is the perturbation. In Möller-Plesset perturbation theory H0 is the Fock operator, the eigenfunctions are the MOs of the SCF model and the eigenvalues are the associated orbital energies.
     
    Following the perturbation order by order of lambda we have that the first order energy is equivalent to the SCF energy, the first order wave function include electron correlation and the second order energy include explicit contributions due to the electron-electron correlation. The MP2 energy is expressed as (i,j denotes occupied orbitals and a,b denotes virtual/empty orbitals)
     

    and is a function of the two-electron integrals (in MO basis) and the orbital energies. In difference to the SCF method there is only an efficient MO based formulation of the methods, however, the method is non-iterative. The time limiting step of the MP2 method is the transformation of the two-electron integrals from AO to MO basis. The storage limiting step of the MP2 implementation is the storage of the transformed integrals which scale as N2occ*N2vir.
      As for conventional SCF MP2 have a serious storage bottleneck. However, unlike direct SCF, which could be AO driven, a direct MP2 implementation could only be formulated by storing subsets of the MO integrals and use the following equation
     

     
     

    with a minimal storage requirement of Nocc*N2vir. The expense of doing this  sectioning (of one of the occupied orbital indices) is increased CPU time. This  since all AO two-electron integrals contribute to all MO integrals in the transformation step. Hence, for each batch of MO integrals the full list of AO integrals is evaluated. In this respect the character of the implementation will be like in an iterative method. It is of course essential that the implementation carefully analyze the available computer resources such that a minimal number of subsections are created and that the number of times the AO integrals are reevaluated are kept to a minimum.
    The subsectioning could of course be on the virtual index - reducing the storage request further. However, there should always be a balance between how small the MO batch is and how many times the AO integrals are evaluated.

    Integral Transformation

    Most ab initio methods of today including explicit electron correlation can only be expressed in a MO basis. Hence, the transformation of the integrals  (AO to MO) is one of the most frequently applied steps in an quantum chemistry code. The straight forward expression for the transformation is

     

    however, this is not suitable for computer implementation since the operation count scale as N4AO*N4MO. Rather a four step procedure like

    would scale as N4AO*NMO. For convinience we denote the integral batch on the RHS in the first equation as a SO-block. As the SO-block subsequently are quarter transformed as we progress down the four equations we have (corresponding to the integral batch on the LHS) the Q1-, Q2-, Q3-, and Q4-block. Like in direct SCF and MP2 there is a possibility to arrange the 4-four index transformation so it can proceed without actually storing all the AO and partially transformed two-electron integrals. The implementation would look like

    Loop over I (batches of transformed integrals)
        Loop over m
           Loop over n
                Loop over k
                    Loop over l
                        Compute (mn|kl)
                    End
                    Transform (mn|kl) to (mn|ks)
                End
                Transform (mn|ks) to (mn|rs)
            End
            Transform (mn|rs) to (mq|rs)
        End
        Transform (mq|rs) to (pIq|rs)
        Process (pIq|rs)
    End

    CI Theory
    Another way of improving the wave function model as compared to SCF is the Configuration Interaction method. In this method the wave function is modeled as an linear combination of different electron configurations generated from distributing a set of electron in all possible ways in a common set of orbitals and maintaining spin and geometrical symmetry. Each configuration will in this sense look like a SCF wave function,
     
     

    The Schrödinger equation will now transform into the CI equation,

     

     
     

    where Hij are the elements of the CI matrix, <i|H|j>. Standard methods cannot be used for this eigenvalue problem. However, we are not interested in all eigenvalues. This for two reasons.

    This has lead to the development of an iterative method (the Davidson method) to find the lowest or the few lowest roots of the eigenfunction.  The method essentially projects the problem down to a much smaller space and solves the problem with conventional methods for this small eigenvalue problem.
      The principal and CPU time demanding step in the Davidson procedure is the formation of the so-called sigma-vector from trial vectors.
     

    Due to the large expansion sizes normally used one can not store the Hamiltonian matrix in memory but rather the elements of the matrix are formed on the fly as they are needed. In this implementation it is also important to utilize the inherent sparsity of the Hamiltonian matrix since configurations which are different by more than two electron interchanges have zero matrix elements. The parallelization of this step is to some extent trivial. Normally the Hamiltonian elements are constructed from data which is replicated on all nodes. The tasks which are distributed are to form the product between a subblock of the Hessian matrix and the associated two subsections of the trial vector yielding contributions to two subsections of the sigma-vector. By subsectioning the Hessian in an appropriate number of sub-blocks load balancing can be almost perfect.

    Summary

    This lecture has been a brief introduction to some of the methods used in computational chemistry. In particular the following matters where discussed
     

    in this discussion it was noticed that the evaluation of the two-electron integrals were
    Lecture 2: Introduction
    The second lecture will focus on some details in the implementation of parallel and direct Fock matrix construction and integral transformation. The reason for studying these approaches is not directly for the purpose of being able to perform parallel SCF and MP2, respectively. Rather the approaches together with parallel sigma-vector formations are the key building blocks of most ab  initio methods. Hence, they form the most important elements of any parallel ab initio implementation.
      It should be clear here on the onset of this lecture that the stratergies in doing parallel implementations is not an exact science. As the hardware developers refine their hardware and/or introduce new concepts and approaches to parallel  computers the most efficient implementations will change. It is up to the future computational scientists, like those of the NGSSC program, to understand the computational problem and the limitations of the hardware. In this respect new science can be performed. Hence, it is most likely that significant parts of this second lecture will be obsolete in a few years.
     
    Parallel Direct Fock Matrix Generation
    The key operation in a direct implementation of the Fock matrix generation is the computation of the two-electron integrals. Normally the integrals are computed in shell quadruplets for reasons of computational efficiency. That is the loop structure looks like
     

     Loop over iShell
        Loop over jShell
           Loop over kShell
               Loop over lShell
                     Compute all integrals of the Shell Quadruplet
                End Loop
            End Loop
         End Loop
     End Loop
     

    in the case of a normal integral generation (this loop structure is restricted to only generate the permutational unique shell quadruplets).  In the direct implementation the integrals are contracted directly with the 1st order density rather than written to disk. There are in general two Coulomb contributions and four exchange contributions for each integral (ij|kl)
     

    Density Matrix Fock Matrix  Type
    Dij Fkl Coulomb
    Dkl Fij Coulomb
    Dik Fjl exchange
    Dil Fjk  exchange
    Djk Fil exchange
    Djl Fik exchange
      In the parallel implementation of SCF one would have to consider two items, namely
     
     Scaling considerations
    The first matter is trivially solved in a replicated data implementation. Here the density matrices are communicated once every iteration to all nodes (scale as N2).  Each node holds a local copy of the Fock matrix. On completion the total Fock matrix is generated as a sum of all the local Fock matrices (scales similar to to the communication of the density). These communication costs should be compared with the N4 computational scaling of evaluating the two-electron integrals and contracting them with the density matrix. The replicated data implementation, however, has to be abandoned when the one-particle basis set is too large. Then a distributed data implementation would have to be considered. The administration of this is more complicated and the implementation is much more sensitive to the communication latency due the increased number of communication requests during the SCF iteration. However, the replicated data implementation can be maintained for rather large basis sets and utilization of the sparsity of density and Fock matrix allow replicated data algorithms to be used for basis sets in excess of 1000 basis functions.
     
    Load balancing
    The second matter in the parallel direct SCF implementation is the load balancing. Here one use the fact that the unique shell quadruplet index or ranges of that index can be used to define task for the individual nodes. Naturally the parallel implementation computes the most demanding shell quadruplets initially. The shorter tasks are used towards the end to activate any nodes which would otherwise be idle. This load balancing is in some sense trivial and parallel efficiency better than 90 percent is standard. A typical loop structure of a direct SCF code would be

    1. Get initial MOs
    2. Generate 1st order density
    3. Boadcast density
    4. Create task list
    5. Generate a global list to reserve task
    6. Generate local priority list
    7. Loop over tasks in local priority list
           Try to reserve task on the global list
                   If (.Not.Reserved) Then
                                Loop over task range
                         Do iSq = iSq_Start, iSq_End
                                Evaluate (ij|kl) and compute contributions to
                                        the Fock matrix
                         End Do
                   End If
       End Loop
    8. Global Summation of the Fock matrix
    9. Compute the MO update
    10. If not converged back to step 2
    11. Compute final energy and wave function.

    For a comparision look at the loop structure of the conventional SCF implementation. Before we continue on the direct MP2 (direct integral transformation) a few comments on the function of the various lists in the direct SCF.
     

    The Task List
    The task list is an two-dimensional array. The entries for a task specifies the starting and ending index of the range of shell quadrupletes associated with the task. To achive effective load balancing there should be
      The reason for the first point is obvious. The second criterion will ensure that towards the end of the calculation we will have short tasks so that all nodes could be kept busy until all shell quadruplets have been evaluated. To give an example of the consider the case of 3 shells and 3 nodes. The total number of unique shell quadruplets (also called batches) are 21 and a possible task list would be as follow
     
    Task Index Index of starting batch Index of last batch Number of batches
    1 1 4 4
    2 5 8 4
    3 9 12 4
    4 13 14 2
    5 15 16 2
    6 17 18 2
    7 19 19 1
    8 20 20 1
    9 21 21 1
     
     

     

    The Local Priority List
    The priority list is a list which indicate the order in which a specific node will try to execute the task. Let us continue with the example above as a demonstration case. The corresponding local priority lists would be
     
     
    Priority list on node 1 Priority list on node 2 Priority list on node 3
    1 2 3
    2 3 1
    3 1 2
    4 5 6
    5 6 4
    6 4 5
    7 8 9
    8 9 7
    9 7 8
     
     

    In case tasks with indentical size took the same time to compute we would have that node 1 computed tasks 1, 4, and 7, node 2 computed task 2, 5, and 8, and node 3 computed tasks 3, 6, and 9. However, this is not normally the case. Let us here see how the priority lists work  in such a case. Let us assume that the actual computational expense of the 9 tasks were 4.5, 4.0, 3.7, 2.4, 2.0, 1.9, 1.1, 1.0, and 0.9 rather than the assumed 4, 4, 4, 2, 2, 2, 1, 1, and 1. The distribution of the tasks would be node 1 tasks 1 and 4, node 2 tasks 2, 5, and 8, and node 3 tasks 3, 6, 9, and 7. The total wall clock would be 7.6 units to be compared to the 7.17 if perfect load balancing where achived (94 % parallel efficiency). Although the load balancing here is a bit crude it will work better and better as the number of tasks grow relative to the number of nodes.
     

    The Global Reservation List
    The global reservation list is a list which is accessible to all nodes. The list is number of task long and is on initiation set to zero. A reservation on the list is done with an incremental read, i.e. an operation which within a operation cycle reads and increments an element in an array. This eliminates that two processes can read the same value. A specific node which want to execute a task does the incremental read. If the read value is zero it has reserved it and can go ahead and process the task.

    Distributed Memory, Parallel, Integral Direct MP2
    This section of the lecture will describe the implementation of a distributed memory, parallel, integral direct MP2 algorithm. It will in a convincing way demonstrate that parallel computing is not only a matter of distributing a number of task on a parallel computer so that almost 100 % parallel efficiency is achieved. If this was the only thing we could do on a parallel computer we would always be in the situation that with some more patience, waiting a bit longer by you sequential computer, we would use the computer resources better. In effect parallel computing would be a waste of resources which we are prepared to pay since we so eagerly wants the results now.

    The key observation in going beyond linear scaling is to find two (or more) resources on a parallel computer which reduce the overall CPU time. The first resource is obvious - the CPUs. The second could, as in the case we describe here, be the memory. However, it could be any other resource which scale with the number of nodes. As it will be demonstrated below, the direct MP2 will benefit from both increased memory and computational power as we increase the number of nodes utilized by the implementation. As a matter of fact, there is a regime in which the speedup of this particular method scales quadratically to the number of nodes used in the calculation.

    For the discussion here we need to initially master the following items


     The implementation which we will discuss here is a course grain parallelization in which  the communication between different nodes are postponed until after the third-quarter transformation.  This will limit communication between nodes, however, the full permutational symmetry can not be used and the AO-integral list has to be evaluated twice as compared to a conventional implementation. After the third-quarter transformation step the first synchronization i established. The last quarter transformation is again local followed by a global antisymmetrization, a local summation of contributions to the MP2 energy and finally a global reduction of all partial contributions from the nodes to the total MP2 energy. The implementation of this algorithm using the fortran 77 and Global Array toolkit was done a few years ago on an IBM SP2 machine.

    (Reference: "An integral direct, distributed-data, parallel MP2 algorithm. ", M. Schütz and R. Lindh, Theo. Chim. Acta, 95, 13 (1997))
     



     
    Processing the SO to Q3 integral blocks
    The steps we will describe here are all conducted on a single node in a parallel computer and we will mainly discuss technical aspects on how local memory requirements can be reduced. This part of the direct integral transformation does not require any communication between the nodes. Using p,q,r,and s to denote the full range of the SO indices and pi   to denote the SO basis functions of shell iShell etc., we have a loop structure like
     

    For a fixed (iShell,jShell) and a given range of iB

        Loop kShell
            Loop lShell
                   Compute the SO block (piqj|rksl) and store
            End Loop
            Transform SO to Q1 and store
            (piqj|rks) --> (piqj|rkiB)
        End Loop
        Transform Q1 to Q2 and store
        (piqj|riB) --> (piqj|aiB)
        Accumulate contribution from Q2 to Q3 
        (piqj|aiB) -->(pij|aiB)

    The memory requirements at the different levels are (with the notation: N full range of the one-particle basis set,
    Nocc # of occupied orbitals, Nvir # of virtual orbitals and Ns # of SO basis functions in a shell) now
     
     

    Batch level Memory pool Minimal memory request
    SO Local Ns3 * N
    Q1 Local Ns2 * N
    Q2 Local Ns2 * Nvir
    Q3 Global Ns * Nocc * Nvir
     

    It is clear from this that the largest memory request is from the first quarter transformation step. This could for a larger calculations be a bottleneck, e.g. assuming a cc-pVQZ basis and 1000  basis functions we have a memory requirement of 26 Mbytes. However, this bottleneck can be avoided by distributing the contributions of the SO integrals to the Q1 block directly rather than to store a list where one index run over the full basis functions range.

    The modified loop structure on the individual nodes would then look like,
     
     

    For a fixed (iShell,jShell) and a given range of iB

        Loop kShell
            Loop lShell
                   Compute the SO block (piqj|rksl) and
                      Accumulate contribution to (piqj|rkiB)
            End Loop
        End Loop
        Transform Q1 to Q2 and store
        (piqj|riB) --> (piqj|aiB)
         Accumulate contribution from Q2 to Q3 
        (piqj|aiB) -->(pij|aiB)
     

    and the memory requirements would be
     

    Batch level Minimal memory request
    SO Ns4 
    Q1 Ns2 * N
    Q2 Ns2 * Nvir
    Q3 Ns * Nocc * Nvir
     
    Memory Partioning
    The memory partitioning in the parallel direct MP2 implementation is a balance between global and local memory requirement. Some pragmatic view on this matter is also health. It can be worthwhile to use a memory allocation which favor storage of MO integrals on the expence of a poorer performance in the AO integral formation and the subsequent transformation steps if this reduce the number of times the whole AO integral list has to be reevaluated. As indicated in the previous section the SO to Q3 processing have local memory requirement for the intermediate integrals whereas there is a global memory requirement to store subsections of the fully transformed integrals, (iBa|jb) and the three-quarter transformed integral (iBa|jp). The last quarter transformation reuse the local memory which is not needed more at that point. That is, all memory requirements can be facilitated  be dividing the memory into two pools. The memory is administrated as follows during the calculation

    Stage 1: (SO to Q3 transformation) one pool is used for the SO to the Q2-blocks. This section is administrated on a local level. The second spool, handled on a global level via the Global Array toolkit (GA) , is used for storage of the Q3-blocks.  The contributions to the Q3-blocks are distributed via the one-sided accumalate of GA. Hence, even if a contribution to a Q3-block is on another node there is no need to interrupt that node.

    Stage 2: GA is used to transpose the (iBa|jp) blocks so that the fastest index is the index over the full basis set range. This step moves the  Q3-blocks from the second to the first memory pool which is now governed by the GA toolkit.

    Stage 3: after the transpose the last quarter transformation is again only a local matter which can be completed with out communication. During this step the Q4-blocks are generated in the second memory pool.

    Stage4: the whole procedure is then terminated with an antisymmetrization step (which we will not discuss further here) and that each node computes the contributions to the MP2 energy followed by a global reduction.
     
     

    Load Balancing
    The load balancing in the parallel direct MP2 is a bit different from that of the SCF. The task list is defined by splitting up the ERI generation and the transformation up to the Q3 as individual tasks. A single task is then defined as the unique combination of two shells. Although the number of unique shell-pairs is not a very large number it has an adeqate size to allow for load balancing. The local prioritity list is deviced such that tasks which generate Q3-blocks residing on the node have higher priority. This will limit the communication up to the end of the Q3 step.  A global reservation list is maintained in a similar manner as for the direct SCF. 
     
     
    Performance
    As we disussed before this implementation has the capacity of scaling better than linear. This, however, happens only while the computation do not have enough memory to store all MO integrals. At the point when this happen further increase of nodes would only be quasi-linear (as in the case of direct SCF). Below is displayed the observed scaling of a calculation on the phenanthrene molecule (94 correlated electrons) computed with 412 contracted functions relative to the time on 2 nodes.

    It can be clearly observed how the implementation incrementally take advantage of the increased aggregate memory as the number of nodes increase.

    Summary
     
    A parallel direct SCF and a distributed memory parallel direct MP2 implementation have been described. Features which the two implementations had in common were
      Whereas the SCF trivially can be parallelized without any node to node communication the MP2 implementation has to be carefully analyzed in order to avoid communication and synchronization bottlenecks in the implementation. In the MP2 implementation it was demonstrated that some algorithms should show a scaling in performance improvement on a parallel computer which is better than linear. By identifying these algorithms - not only in chemistry but in any other field of science and subsequently implement them - new science can be made.

    The implementation discussed here benefited substantially from the use of GA toolkit. 


    Survey of Quantum Chemistry codes

    Below is a list of links to QC codes. Related links Software  utilities for development of parallel computational chemstry. Links to the Swedish national and regional computer centers