The purpose of the two following lectures is to give a brief introduction to parallel implementations in ab initio theory. The key operations listed below are the most important procedures in ab initio and semi-empirical theory and have been the subject to parallel implementations for the last 15 years and will be the focus of these lectures.
The nature of ab initio calculations with respect to the possibility of course grain parallelization will be addressed. The first lecture will give some brief theoretical background to theory in ab initio. This will form the frame work in which we will base our discussion on how to parallelize codes in computational chemistry. The second lecture will be of a more detailed study of strategies for parallelization of the Fock matrix generation and integral transformation.
The format of the notes is a bit
different from conventional overheads in the sense that the must carry
more information. The lecturer will together with the overheads give the
whole picture. For notes on the web the information has to be more complete
making the pages unacceptable in general as overheads. The lectures are
planned to be converted to overheads in the near future.
Table of content
The lecture will introduce the concepts of


the reason for this transition lies in the computaional
advantages of Gaussians vs. Slaters in that the former allows for analytic
expressions for the matrix elements which are used in computational chemistry.
One of the fundamental properties
of Gaussians is manifested in the Gaussian
product theorem (GPT) which states that the product of two Gaussians can
be expressed as a new Gaussian.

where

or expressed graphically


To reduce the number of parameters
to optimize in the ab initio procedures one usually contract the primitive
Gaussians to contracted Gaussian functions as
the contracted basis is normally refered to as the atomic orbital (AO) basis. This basis will later be used to construct the molecular orbitals (MOs) as they are defined in the computational models. The contractions schemes can be divided into two groups, namely the segmented contraction scheme


The former has computational advantages but lacks the high quality of the latter. Standard basis sets today for large systems are of segmented type whereas for smaller system one usually use the general contraction scheme.
| Basis | Prim./Cont. | # of functions | Type |
| STO-3G | 6s3p/2s1 | 2/3 | Segmented |
| 6-31G | 10s4p/3s2p | 3/6 | Segmented |
| cc-pVDZ | 9s4p/3s2p | 3/6 | Segmented |
| cc-pVTZ | 10s5p2d1f/4s3p2d1f | 4/9/10/7 | Segmented |
| cc-pVQZ | 12s6p3d2f1g/5s4p3d2f1g | 5/12/15/14/9 | Segmented |
| ANO-L | 14s9p4d3f/6s5p3d2f | 6/15/15/14 | General |
where H is the Hamiltonian, Psi is the wave function, and E is the total energy of the system. This equation have the same properties as a typical eigenfunction problem and if the one-particle space is expanded in a finite basis we will observe that operations normally encountered in linear algebra, matrix-matrix, matrix-vector mathematics are of significant importance in computational chemistry. Possibly with the difference that the matrix elements in ab initio theory are so many and so expensive to evaluate that some unconventional use of standard methods actually are the fastest in the computer implementations of ab initio theory. Whereas the wave function vary for different ab initio models the Hamiltonian in the Born-Oppenheimer approximation is always defined as

where the different
terms are, starting from the left, 1) the electronic kinetic energy,
2) the nuclear-electron attraction,
3) the electron-electron repulsion, and 4) the nuclear-nuclear
repulsion term. To solve the Schrödinger equation the matrix
elements over these operators need to be computed. The two first
terms are collected in the so-called one-electron term, the third is the
so-called the 2-electron term, and the forth is called the nuclear potential
energy term. The evaluation of these matrix elements will be, with respect
to CPU time, dominated by the 2-electron term. In parallelizing ab initio
code special care has to be taken to parallelize this evaluation.

These terms are also denoted as
the two-electron integrals or electron repulsion integrals (ERIs).
The computational expense to evaluate the integrals scale proportional
to N4 (where N is the total size of the one-particle basis set).
To reduce the computational cost of the i ntegrals it is important to use
the permutational symmetry
to only compute the unique shell quadruplets.
The unique index of a integral batch is formed according to

| Class | # of integrals |
| (ss|ss) | 256 |
| (ss|sp) | 576 |
| (ss|pp) | 1.296 |
| (sp|pp) | 2.916 |
| (pp|pp) | 6.561 |
| ... | ... |
| (dd|dd) | 10.000 |
| ... | ... |
In the elf onsistent ield theory (Hartree-Fock theory) the wave function is approximated by a Slater-determinant. This determinant has the feature of modeling the interchange of particles which is correct for a Fermi particle system, i.e. a change of sign of the wave function under particle permutation. The optimization of the energy is achieved by minimizing the total energy with respect to the coefficients of the molecular orbitals as they are expressed in the atomic orbital basis (contracted Gaussians). The energy of the SCF wave function is expressed as

where (in closed shell restricted HF)
are the so-called 1st order densities and Epq is the generator of the unitary group. Using orthogonal MOs reduces the 1st order density in closed shell RHF case to

as a function of the current 1st order density and
the MO integrals. The Fock matrix is the equivalent to the gradient of
the energy with respect to orbital rotations. In particular, in the MO
basis the elements between occupied and virtual orbitals should be zero
(Brillouin-condition) at convergence.
The Fock matrix is in the SCF approach diagonalized
to give a new MO description. This procedure is repeted until the diagonalization
procedure does not yield any significant rotations of the MOs (self consistency).
The optimization of the MOs could also of course be formulated as an quasi-Newton
step - eliminating the diagonalization step. This is of importance
since the diagonalization procedure is not optimal for parallel implementations.
The Fock matrix can also be evaluated
directly in the AO basis

This eliminate the need for any four-index transformation during the SCF iterations. The Fock matrix in the AO basis is also diagonalized or used in the quasi-Newton step. The only possible difference between the AO and MO scheme is how the rotations of the orbitals are expressed, i.e. are they relative to the old MOs or are they expressed directly in the AOs? The computer implementation will have the following structure

and is a function of the two-electron integrals (in
MO basis) and the orbital energies. In difference to the SCF method there
is only an efficient MO based formulation of the methods, however, the
method is non-iterative. The time limiting step of the MP2 method
is the transformation of the two-electron integrals from AO to MO basis.
The storage limiting step of the MP2 implementation is the storage of the
transformed integrals which scale as N2occ*N2vir.
As for conventional SCF
MP2 have a serious storage bottleneck. However, unlike direct SCF, which
could be AO driven, a direct MP2 implementation
could only be formulated by storing subsets of the MO integrals and use
the following equation

with a minimal storage requirement of Nocc*N2vir.
The expense of doing this sectioning (of one of the occupied
orbital indices) is increased CPU time. This since all AO two-electron
integrals contribute to all MO integrals in the transformation step. Hence,
for each batch of MO integrals the full list of AO integrals is evaluated.
In this respect the character of the implementation will be like in an
iterative method. It is of course essential that the implementation carefully
analyze the available computer resources such that a minimal number of
subsections are created and that the number of times the AO integrals are
reevaluated are kept to a minimum.
The subsectioning could of course
be on the virtual index - reducing the storage request further. However,
there should always be a balance between how small the MO batch is and
how many times the AO integrals are evaluated.
Most ab initio methods of today including explicit electron correlation can only be expressed in a MO basis. Hence, the transformation of the integrals (AO to MO) is one of the most frequently applied steps in an quantum chemistry code. The straight forward expression for the transformation is

however, this is not suitable for computer implementation since the operation count scale as N4AO*N4MO. Rather a four step procedure like

would scale as N4AO*NMO. For convinience we denote the integral batch on the RHS in the first equation as a SO-block. As the SO-block subsequently are quarter transformed as we progress down the four equations we have (corresponding to the integral batch on the LHS) the Q1-, Q2-, Q3-, and Q4-block. Like in direct SCF and MP2 there is a possibility to arrange the 4-four index transformation so it can proceed without actually storing all the AO and partially transformed two-electron integrals. The implementation would look like
Loop over I (batches of transformed integrals)
Loop over m
Loop over
n
Loop over k
Loop over l
Compute (mn|kl)
End
Transform (mn|kl) to (mn|ks)
End
Transform (mn|ks) to (mn|rs)
End
Transform
(mn|rs) to (mq|rs)
End
Transform (mq|rs) to (pIq|rs)
Process (pIq|rs)
End

The Schrödinger equation will now transform into the CI equation,

where Hij are the elements of the CI matrix, <i|H|j>. Standard methods cannot be used for this eigenvalue problem. However, we are not interested in all eigenvalues. This for two reasons.
Due to the large expansion sizes normally used one can not store the Hamiltonian matrix in memory but rather the elements of the matrix are formed on the fly as they are needed. In this implementation it is also important to utilize the inherent sparsity of the Hamiltonian matrix since configurations which are different by more than two electron interchanges have zero matrix elements. The parallelization of this step is to some extent trivial. Normally the Hamiltonian elements are constructed from data which is replicated on all nodes. The tasks which are distributed are to form the product between a subblock of the Hessian matrix and the associated two subsections of the trial vector yielding contributions to two subsections of the sigma-vector. By subsectioning the Hessian in an appropriate number of sub-blocks load balancing can be almost perfect.
This lecture has been a brief
introduction to some of the methods used in computational chemistry. In
particular the following matters where discussed
Loop over iShell
Loop over jShell
Loop over
kShell
Loop over lShell
Compute all integrals of the Shell Quadruplet
End Loop
End
Loop
End Loop
End Loop
in the case of a normal integral generation (this loop
structure is restricted to only generate the permutational unique shell
quadruplets). In the direct implementation the integrals are contracted
directly with the 1st order density rather than written to disk. There
are in general two Coulomb contributions and four exchange contributions
for each integral (ij|kl)
| Density Matrix | Fock Matrix | Type |
| Dij | Fkl | Coulomb |
| Dkl | Fij | Coulomb |
| Dik | Fjl | exchange |
| Dil | Fjk | exchange |
| Djk | Fil | exchange |
| Djl | Fik | exchange |
1. Get initial MOs
2. Generate 1st order density
3. Boadcast density
4. Create task list
5. Generate a global list to reserve task
6. Generate local priority list
7. Loop over tasks in local priority list
Try to reserve
task on the global list
If (.Not.Reserved) Then
Loop over task range
Do iSq = iSq_Start, iSq_End
Evaluate (ij|kl) and compute contributions to
the Fock matrix
End Do
End If
End Loop
8. Global Summation of the Fock matrix
9. Compute the MO update
10. If not converged back to step 2
11. Compute final energy and wave function.
For a comparision look at the
loop structure of the conventional
SCF implementation. Before we continue
on the direct MP2 (direct integral transformation) a few comments on the
function of the various lists in the direct SCF.
| Task Index | Index of starting batch | Index of last batch | Number of batches |
| 1 | 1 | 4 | 4 |
| 2 | 5 | 8 | 4 |
| 3 | 9 | 12 | 4 |
| 4 | 13 | 14 | 2 |
| 5 | 15 | 16 | 2 |
| 6 | 17 | 18 | 2 |
| 7 | 19 | 19 | 1 |
| 8 | 20 | 20 | 1 |
| 9 | 21 | 21 | 1 |
| Priority list on node 1 | Priority list on node 2 | Priority list on node 3 |
| 1 | 2 | 3 |
| 2 | 3 | 1 |
| 3 | 1 | 2 |
| 4 | 5 | 6 |
| 5 | 6 | 4 |
| 6 | 4 | 5 |
| 7 | 8 | 9 |
| 8 | 9 | 7 |
| 9 | 7 | 8 |
In case tasks with indentical size took the same time
to compute we would have that node 1 computed tasks 1, 4, and 7, node 2
computed task 2, 5, and 8, and node 3 computed tasks 3, 6, and 9. However,
this is not normally the case. Let us here see how the priority lists work
in such a case. Let us assume that the actual computational expense
of the 9 tasks were 4.5, 4.0, 3.7, 2.4, 2.0, 1.9, 1.1, 1.0, and 0.9 rather
than the assumed 4, 4, 4, 2, 2, 2, 1, 1, and 1. The distribution of the
tasks would be node 1 tasks 1 and 4, node 2 tasks 2, 5, and 8, and node
3 tasks 3, 6, 9, and 7. The total wall clock would be 7.6 units to be compared
to the 7.17 if perfect load balancing where achived (94 % parallel efficiency).
Although the load balancing here is a bit crude it will work better and
better as the number of tasks grow relative to the number of nodes.
The key observation in going beyond linear scaling is to find two (or more) resources on a parallel computer which reduce the overall CPU time. The first resource is obvious - the CPUs. The second could, as in the case we describe here, be the memory. However, it could be any other resource which scale with the number of nodes. As it will be demonstrated below, the direct MP2 will benefit from both increased memory and computational power as we increase the number of nodes utilized by the implementation. As a matter of fact, there is a regime in which the speedup of this particular method scales quadratically to the number of nodes used in the calculation.
For the discussion here we need to initially master the following items
The implementation which we will discuss here is a course grain parallelization in which the communication between different nodes are postponed until after the third-quarter transformation. This will limit communication between nodes, however, the full permutational symmetry can not be used and the AO-integral list has to be evaluated twice as compared to a conventional implementation. After the third-quarter transformation step the first synchronization i established. The last quarter transformation is again local followed by a global antisymmetrization, a local summation of contributions to the MP2 energy and finally a global reduction of all partial contributions from the nodes to the total MP2 energy. The implementation of this algorithm using the fortran 77 and Global Array toolkit was done a few years ago on an IBM SP2 machine.
(Reference: "An integral direct, distributed-data, parallel MP2 algorithm.
", M. Schütz and R. Lindh, Theo. Chim. Acta, 95, 13
(1997))
For a fixed (iShell,jShell) and a given range of iB
Loop kShell
Loop
lShell
Compute the SO block (piqj|rksl)
and store
End
Loop
Transform
SO to Q1 and store
(piqj|rks)
--> (piqj|rkiB)
End Loop
Transform Q1 to Q2 and store
(piqj|riB)
--> (piqj|aiB)
Accumulate contribution from
Q2 to Q3
(piqj|aiB)
-->(pij|aiB)
The memory requirements at the different levels are
(with the notation: N full range of the one-particle basis set,
Nocc # of occupied
orbitals, Nvir # of virtual orbitals and
Ns # of SO basis functions
in a shell) now
| Batch level | Memory pool | Minimal memory request |
| SO | Local | Ns3 * N |
| Q1 | Local | Ns2 * N |
| Q2 | Local | Ns2 * Nvir |
| Q3 | Global | Ns * Nocc * Nvir |
It is clear from this that the largest memory request is from the first quarter transformation step. This could for a larger calculations be a bottleneck, e.g. assuming a cc-pVQZ basis and 1000 basis functions we have a memory requirement of 26 Mbytes. However, this bottleneck can be avoided by distributing the contributions of the SO integrals to the Q1 block directly rather than to store a list where one index run over the full basis functions range.
The modified loop structure on the individual nodes
would then look like,
For a fixed (iShell,jShell) and a given range of iB
Loop kShell
Loop
lShell
Compute the SO block (piqj|rksl)
and
Accumulate contribution to (piqj|rkiB)
End
Loop
End Loop
Transform Q1 to Q2 and store
(piqj|riB)
--> (piqj|aiB)
Accumulate contribution
from Q2 to Q3
(piqj|aiB)
-->(pij|aiB)
and the memory requirements would be
| Batch level | Minimal memory request |
| SO | Ns4 |
| Q1 | Ns2 * N |
| Q2 | Ns2 * Nvir |
| Q3 | Ns * Nocc * Nvir |
Stage 1: (SO to Q3 transformation) one pool is used for the SO to the Q2-blocks. This section is administrated on a local level. The second spool, handled on a global level via the Global Array toolkit (GA) , is used for storage of the Q3-blocks. The contributions to the Q3-blocks are distributed via the one-sided accumalate of GA. Hence, even if a contribution to a Q3-block is on another node there is no need to interrupt that node.

Stage 2: GA is used to transpose the (iBa|jp) blocks so that the fastest index is the index over the full basis set range. This step moves the Q3-blocks from the second to the first memory pool which is now governed by the GA toolkit.

Stage 3: after the transpose the last quarter transformation is again only a local matter which can be completed with out communication. During this step the Q4-blocks are generated in the second memory pool.

Stage4: the whole procedure is then terminated with
an antisymmetrization step (which we will not discuss further here) and
that each node computes the contributions to the MP2 energy followed by
a global reduction.
It can be clearly observed how the implementation incrementally take advantage of the increased aggregate memory as the number of nodes increase.
The implementation discussed here benefited substantially from the use of GA toolkit.