Parallel RI-MP2
Matt Goldey

Møller–Plesset perturbation theory
- A common correction to energies obtained from the Hartree–Fock self-consistent field (SCF) method
- Merit: inclusion of electron correlation
- Computational cost is significant for large systems, with polynomial scaling that depends on the approximations used
- Images from Wikipedia: http://en.wikipedia.org/wiki/M%C3%B8ller%E2%80%93Plesset_perturbation_theory

RI approximation
- Correlation contributions are represented as the integrals (ia|jb), formed via integral transformations from the SCF calculation
- In the RI approximation, instead of forming (ia|jb) directly, (ia|X) is formed for an auxiliary basis X (also denoted P, Q)
- (P|Q)^(-1/2) is used to form B_ia = (ia|P)(P|Q)^(-1/2)
- The most expensive step is the multiplication (ia|jb) = B_ia * B_jb over all a, b, and auxiliary functions

Workload
- Need to span the virtual space for each occupied pair (running through all of the occupied space)
- Simplest implementation:
    Loop i
        Read B_ia ∀ a, Q
        Loop j
            Read B_jb ∀ b, Q
            Form (ia|jb) = B_ia * (B_jb)^T
            Calculate energy contributions
- Nocc * Nocc matrix multiplications performed
- Nocc + Nocc*Nocc B matrices read from the hard drive -- ~Nocc^2 * Nvirt * X values in hard-drive reads

Algorithm Optimization
- The serial algorithm above is not complete as a description -- only unique pairs i, j need to be spanned, i.e. the upper (or lower) triangular portion of the occupied space:
    Loop i
        Read B_ia ∀ a, Q
        Loop j ≥ i
            Read B_jb ∀ b, Q
            Form (ia|jb) = B_ia * (B_jb)^T
            Calculate weighted energy contributions
- Nocc*(Nocc+1)/2 matrix multiplications performed -- computation cut in half
- Nocc + Nocc*(Nocc+1)/2 B matrices read from the hard drive -- ~Nocc*(Nocc+1)/2 * Nvirt * X values in hard-drive reads
- Even this is not optimized, since reading B_jb is unnecessary when j == i
- Further, given that these matrix multiplications are relatively small, we want to distribute the work among processors.
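The triangular-pair loop above can be sketched as follows. This is a minimal illustration with toy dimensions and random stand-in data, not the production implementation: the real B matrices come from the RI transformation and are read from the hard drive, the "reads" here are just array indexing, and every name, size, and orbital energy is an assumption. The weighting shows why off-diagonal pairs must be counted twice.

```python
import numpy as np

# Toy dimensions and random stand-in data (illustrative assumptions).
n_occ, n_virt, n_aux = 4, 10, 40
rng = np.random.default_rng(0)
B = rng.standard_normal((n_occ, n_virt, n_aux))  # B[i] ~ B_ia^Q, shape (Nvirt, X)
eps_occ = -(rng.random(n_occ) + 1.0)   # occupied orbital energies (negative)
eps_virt = rng.random(n_virt) + 0.5    # virtual orbital energies (positive)

e_mp2 = 0.0
for i in range(n_occ):
    B_i = B[i]                    # "read" B_ia for all a, Q
    for j in range(i, n_occ):     # j >= i: unique pairs only
        B_j = B[j]                # "read" B_jb for all b, Q
        iajb = B_i @ B_j.T        # (ia|jb), shape (Nvirt, Nvirt)
        # Denominator e_i + e_j - e_a - e_b for this occupied pair.
        denom = (eps_occ[i] + eps_occ[j]
                 - eps_virt[:, None] - eps_virt[None, :])
        contrib = np.sum(iajb * (2.0 * iajb - iajb.T) / denom)
        # Weighted contribution: off-diagonal pairs appear twice in
        # the full double loop over i and j.
        e_mp2 += contrib if i == j else 2.0 * contrib
```

Summing over j ≥ i with the factor of 2 reproduces the full i, j double loop exactly, because (ia|jb) for the swapped pair is just the transpose of the block already in memory.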
Memory Considerations
- B_ia and B_jb are (Nvirt * X) in size and are read from the hard drive as needed
- Serially, 2*(Nvirt*X) + Nvirt*Nvirt values are stored in memory
- Memory is already delimited by an earlier step to 3*X^2; given Nvirt ~ N and X ~ 4N, the memory requirement for the largest possible systems is 3*4*4*N^2 = 48*N^2
- The memory occupied by this serial implementation on an nproc-processor shared-memory system is nproc*(2*(Nvirt*X) + Nvirt*Nvirt) ~ nproc*(9*N^2)
- For 8 processors this amounts to 72*N^2 in memory at a time, limiting system size unnecessarily -- a different algorithm is needed for parallel computation on a work-distributed system

Memory-guided Work Distribution
- Need to minimize memory usage as well as the number of hard-drive reads, which are expected to become dominant in the large-system limit
- The work-distribution scheme has to satisfy both criteria
- Solutions:
    - Hard-drive read batching -- using the space available
    - Pivoting -- using what is already in memory to direct the next batch
- Batched algorithm:
    Loop over batches (size determined by available memory)
        Read B_ia ∀ i in the batch (the first batchsize-1 elements of the batch)
        Loop over occupied indices ∉ batch (beyond the batch)
            Form all possible (ia|jb), including i == j, i.e. (ia|ib)
- This addresses the serial bottleneck of hard-drive reads, reducing the number of matrices read from ~Nocc^2 (initial implementation)
- New hard-drive I/O costs:
    Nreads = Nocc + (Nocc - (batchsize-1) - 1) + (Nocc - 2*(batchsize-1) - 1) + …
    Nreads = Nocc*(Nbatches+1) - Nbatches - (batchsize-1)*Nbatches*(Nbatches+1)/2
    Nreads ~ Nocc*Nbatches - (batchsize-1)*Nbatches^2/2
    Nreads ~ Nocc^2/(batchsize-1) - Nocc^2/(2*(batchsize-1)) ~ Nocc^2/(2*(batchsize-1))
- This solves the hard-drive read problem in serial, but does it work in parallel?

[Figure: triangular grids of occupied-pair indices, showing the triangle of remaining pairs shrinking as each batch is completed]

Batch size limitations
- Typical system sizes constrain the batch size to a maximum of ~9-11 B matrices in memory at once.
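The read-batching scheme above can be checked with a small read-counting simulation. This is an illustrative reconstruction under assumptions, not the production code: it holds batchsize-1 "pivot" B matrices in memory, streams each later occupied index through the remaining slot, and counts hard-drive reads instead of doing linear algebra. The function name and bookkeeping are hypothetical.

```python
# Hypothetical simulation of the batching scheme: count reads and
# record which unique occupied pairs (i <= j) each batch covers.
def batched_pair_reads(n_occ, batch_size):
    reads = 0
    pairs = set()                       # unique (i, j) pairs formed
    pivot_start = 0
    while pivot_start < n_occ:
        stop = min(pivot_start + batch_size - 1, n_occ)
        pivots = list(range(pivot_start, stop))
        reads += len(pivots)            # read the pivot B matrices
        for x, i in enumerate(pivots):  # pairs inside the batch, incl. i == j
            for j in pivots[x:]:
                pairs.add((i, j))
        for j in range(stop, n_occ):    # stream the rest through one slot
            reads += 1
            for i in pivots:
                pairs.add((i, j))
        pivot_start = stop
    return reads, pairs

# For Nocc = 20 and batchsize = 6: 50 reads, versus
# Nocc + Nocc*(Nocc+1)/2 = 230 for the triangular loop.
reads, pairs = batched_pair_reads(20, 6)
```

The count matches the ~Nocc^2/(2*(batchsize-1)) scaling up to lower-order terms, and the simulation confirms that every unique pair is still covered exactly once.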
- Using 9 B matrices and ignoring edge cases, 8 computations must be performed at each step, each of equivalent cost
- However, this method requires synchronization after each matrix multiplication -- a potential bottleneck for anisotropic systems

Distributed Memory, the Future
- Further modeling is needed to extend this method to distributed-memory systems, since batches will not have equivalent workloads -- the occupied space that must be spanned is diminished by each batch
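The per-step synchronization point can be sketched with a thread pool. Everything here is a toy assumption (thread count, shapes, random data): 8 workers each multiply the streamed B_j against one in-memory pivot matrix, and `pool.map` only returns once the slowest worker finishes, which is exactly the per-step barrier noted above.

```python
from concurrent.futures import ThreadPoolExecutor

import numpy as np

# Toy shapes and data standing in for one pivot batch of 8 B matrices
# plus a few streamed B_j matrices (illustrative assumptions).
n_virt, n_aux = 10, 40
rng = np.random.default_rng(1)
pivots = [rng.standard_normal((n_virt, n_aux)) for _ in range(8)]
streamed = [rng.standard_normal((n_virt, n_aux)) for _ in range(3)]

results = []
with ThreadPoolExecutor(max_workers=8) as pool:
    for B_j in streamed:
        # All 8 (ia|jb) blocks for this j; list(pool.map(...)) blocks
        # until every worker is done -- the synchronization point.
        results.append(list(pool.map(lambda B_i, B_j=B_j: B_i @ B_j.T,
                                     pivots)))
```

If the 8 multiplications have unequal cost, every thread waits for the slowest one at each step, which is why anisotropic workloads make this barrier a bottleneck.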