A Family of High-Performance Matrix Multiplication Algorithms∗

John A. Gunnels†    Greg M. Henry‡    Robert A. van de Geijn§

A Technical Paper Submitted to the International Conference on Computer Science 2001

∗ This work was partially performed at the Jet Propulsion Laboratory, California Institute of Technology, under a contract with the National Aeronautics and Space Administration. The work was funded by the Remote Exploration and Experimentation Project (a part of the NASA High Performance Computing and Communications Program funded by the NASA Office of Space Science).
† Department of Computer Sciences, The University of Texas, Austin, TX 78712, [email protected]
‡ Intel Corp., Bldg EY2-05, 5350 NE Elam Young Pkwy, Hillsboro, OR 97124-6461, [email protected]
§ Department of Computer Sciences, The University of Texas, Austin, TX 78712, [email protected]

Abstract

During the last half-decade, a number of research efforts have centered around developing software for generating automatically tuned matrix multiplication kernels. These include the PHiPAC project and the ATLAS project. The software products of both projects employ brute force to search a parameter space for blockings that accommodate multiple levels of the memory hierarchy. We take a different approach. Using a simple model of hierarchical memories, we employ mathematics to determine a locally-optimal strategy for blocking matrices. The theoretical results show that, depending on the shape of the matrices involved, different strategies are locally-optimal. Rather than determining a blocking strategy at library generation time, the theoretical results show that ideally one should pursue a heuristic that allows the blocking strategy to be determined dynamically at run time as a function of the shapes of the operands. When the resulting family of algorithms is combined with a highly optimized inner-kernel for a small matrix multiplication, the approach yields performance that is superior to that of methods that automatically tune such kernels. Preliminary results, for the Intel Pentium III processor, support the theoretical insights.

1 Introduction

Research in the development of linear algebra libraries has recently shifted to the automatic generation and optimization of matrix multiplication kernels. The idea is that many linear algebra operations can be implemented in terms of matrix multiplication [3, 11, 7], and that it is therefore this operation that should be highly optimized on different platforms. Since the coding effort is considerable, especially when multiple layers of cache are involved, the general consensus is that this process should be automated.

In this paper, we develop a theoretical framework that (1) suggests a formula for the block sizes that should be used at each level of the memory hierarchy, and (2) restricts the possible loop orderings to a specific family of algorithms for matrix multiplication. Together, these results show how to build highly optimized matrix multiplication implementations that utilize the caches in a locally-optimal fashion. The results could equally well be used to limit the search space that must be examined by packages that automatically tune such kernels.
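To make the run-time selection idea concrete, the small C fragment below (not from the paper) chooses among the three shaped variants of C ← AB + C defined in Section 2.1 below, based on which dimension is small. The function name, the enum labels, and the single cutoff value are hypothetical placeholders; the paper instead derives block sizes and loop orderings from its model of the memory hierarchy.

/* Illustrative only: pick one of the three shaped variants of C <- A*B + C
 * (see Section 2.1) from the operand dimensions m, n, k at run time.
 * The cutoff 'small' is a hypothetical placeholder, not a value from the paper. */
#include <stdio.h>

typedef enum { MATRIX_PANEL, PANEL_MATRIX, PANEL_PANEL } gemm_variant;

gemm_variant choose_variant(int m, int n, int k, int small)
{
    if (n <= small) return MATRIX_PANEL;   /* B and C are narrow      */
    if (m <= small) return PANEL_MATRIX;   /* A and C have few rows   */
    if (k <= small) return PANEL_PANEL;    /* small inner dimension   */
    return PANEL_PANEL;                    /* all dimensions large: any variant applies after further blocking */
}

int main(void)
{
    /* A tall, narrow update (n small) selects the matrix-panel variant. */
    printf("%d\n", choose_variant(4000, 64, 4000, 128));   /* prints 0 (MATRIX_PANEL) */
    return 0;
}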
The current pursuit of highly optimized matrix kernels achieved by coding in a high-level programming language started with the FORTRAN implementation of the Basic Linear Algebra Subprograms (BLAS) [5] for the IBM Power2 [1]. Subsequently, the PHiPAC project [4] at UC-Berkeley demonstrated that high-performance matrix multiplication kernels can be written in C and that code generators can be used to automatically generate many different blockings, allowing automatic tuning. Next, the Automatically Tuned Linear Algebra Software (ATLAS) project [12] at the University of Tennessee extended the ideas developed as part of the PHiPAC project by reducing the kernel that is called once matrices have been massaged into the L1 cache to one specific case, C = A^T B + βC for small matrices A, B, and C, and by reducing the space searched for optimal blockings. Furthermore, it marketed the methodology, allowing it to gain widespread acceptance and igniting the current craze in the linear algebra community towards automatically tuned libraries. Finally, there has been considerable recent interest in recursive algorithms and recursive data structures. The idea here is that by recursively partitioning the operands, blocks that fit in the different levels of the caches will automatically be encountered [9]. By storing matrices recursively, blocks that are encountered during the execution of the recursive algorithms will lie in contiguous memory [2, 8, 10]. Other work closely related to this topic is discussed in other papers presented as part of this session of the conference.

2 Notation and Terminology

2.1 Special cases of matrix multiplication

The general form of a matrix multiply is C ← α AB + βC, where C is m × n, A is m × k, and B is k × n. We will use the following terminology when referring to a matrix multiply in which two dimensions are large and one is small:

                             Condition     Shape
    Matrix-panel multiply    n is small    C and B consist of only a few columns
    Panel-matrix multiply    m is small    C and A consist of only a few rows
    Panel-panel multiply     k is small    A consists of only a few columns and B of only a few rows

The following observation will become key to understanding concepts encountered in the rest of the paper. Partition

    X = \left( X_1 \; \cdots \; X_{N_X} \right) = \left( \begin{array}{c} \hat{X}_1 \\ \vdots \\ \hat{X}_{M_X} \end{array} \right) \quad \text{for } X \in \{A, B, C\},

where C_j is m × n_j, Ĉ_i is m_i × n, A_p is m × k_p, Â_i is m_i × k, B_j is k × n_j, and B̂_p is k_p × n. Then C ← AB + C can be achieved by

    multiple matrix-panel multiplies:   C_j ← A B_j + C_j    for j = 1, . . . , N_C,
    multiple panel-matrix multiplies:   Ĉ_i ← Â_i B + Ĉ_i    for i = 1, . . . , M_C,
    multiple panel-panel multiplies:    C ← Σ_{p=1}^{N_A} A_p B̂_p + C.
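The three partitioned forms above translate directly into blocked loops. The following C sketch is illustrative rather than the paper's implementation: it assumes column-major storage with explicit leading dimensions, uses an unoptimized triple loop in place of the highly optimized inner kernel, and leaves the block sizes nb, mb, and kb as parameters (the paper derives such block sizes from the cost model of Section 2.2).

/* Illustrative sketch (not the paper's code) of the three partitioned forms of
 * C <- A*B + C.  All matrices are column-major; lda, ldb, ldc are leading dimensions. */

/* Reference multiply standing in for the optimized inner kernel:
 * C (m x n) += A (m x k) * B (k x n). */
void gemm_ref(int m, int n, int k,
              const double *A, int lda, const double *B, int ldb,
              double *C, int ldc)
{
    for (int j = 0; j < n; j++)
        for (int p = 0; p < k; p++)
            for (int i = 0; i < m; i++)
                C[i + j*ldc] += A[i + p*lda] * B[p + j*ldb];
}

/* Multiple matrix-panel multiplies: C_j <- A B_j + C_j over column blocks of width nb. */
void gemm_by_matrix_panel(int m, int n, int k, int nb,
                          const double *A, int lda, const double *B, int ldb,
                          double *C, int ldc)
{
    for (int j = 0; j < n; j += nb) {
        int jb = (n - j < nb) ? n - j : nb;
        gemm_ref(m, jb, k, A, lda, &B[j*ldb], ldb, &C[j*ldc], ldc);
    }
}

/* Multiple panel-matrix multiplies: Ĉ_i <- Â_i B + Ĉ_i over row blocks of height mb. */
void gemm_by_panel_matrix(int m, int n, int k, int mb,
                          const double *A, int lda, const double *B, int ldb,
                          double *C, int ldc)
{
    for (int i = 0; i < m; i += mb) {
        int ib = (m - i < mb) ? m - i : mb;
        gemm_ref(ib, n, k, &A[i], lda, B, ldb, &C[i], ldc);
    }
}

/* Multiple panel-panel multiplies: C <- sum_p A_p B̂_p + C, a sequence of rank-kb updates. */
void gemm_by_panel_panel(int m, int n, int k, int kb,
                         const double *A, int lda, const double *B, int ldb,
                         double *C, int ldc)
{
    for (int p = 0; p < k; p += kb) {
        int pb = (k - p < kb) ? k - p : kb;
        gemm_ref(m, n, pb, &A[p*lda], lda, &B[p], ldb, C, ldc);
    }
}

For example, gemm_by_matrix_panel(m, n, k, nb, A, m, B, k, C, m) computes C ← AB + C one block of nb columns at a time; in the paper's setting, the block sizes and the choice among the three loops at each level of the memory hierarchy are exactly what the theory determines.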
2.2 A cost model for hierarchical memories

The memory hierarchy of a modern microprocessor is often viewed as a pyramid: at the top of the pyramid are the processor registers, with extremely fast access; at the bottom are disks and even slower media. As one goes down the pyramid, the cost of memory decreases, while the amount of memory increases along with the time required to access that memory.

We will model the above-mentioned hierarchy naively as follows: (1) The memory hierarchy consists of H levels, indexed 0, . . . , H − 1. Level 0 corresponds to the registers. We will often denote the i-th level by L_i. Notice that on a typical current architecture, L_1 and L_2 correspond to the level 1 and level 2 caches.