A Family of High-Performance Matrix Multiplication Algorithms∗

John A. Gunnels†    Greg M. Henry‡    Robert A. van de Geijn§

A Technical Paper Submitted to the International Conference on Computer Science 2001

∗ This work was partially performed at the Jet Propulsion Laboratory, California Institute of Technology, under a contract with the National Aeronautics and Space Administration. The work was funded by the Remote Exploration and Experimentation Project (a part of the NASA High Performance Computing and Communications Program funded by the NASA Office of Space Science).
† Department of Computer Sciences, The University of Texas, Austin, TX 78712, [email protected]
‡ Intel Corp., Bldg EY2-05, 5350 NE Elam Young Pkwy, Hillsboro, OR 97124-6461, [email protected]
§ Department of Computer Sciences, The University of Texas, Austin, TX 78712, [email protected]

Abstract

During the last half-decade, a number of research efforts have centered around developing software for generating automatically tuned matrix multiplication kernels. These include the PHiPAC project and the ATLAS project. The software products of both projects employ brute force to search a parameter space for blockings that accommodate multiple levels of the memory hierarchy. We take a different approach. Using a simple model of hierarchical memories, we employ mathematics to determine a locally-optimal strategy for blocking matrices. The theoretical results show that, depending on the shape of the matrices involved, different strategies are locally-optimal. Rather than determining a blocking strategy at library generation time, the theoretical results show that ideally one should pursue a heuristic that allows the blocking strategy to be determined dynamically at run time as a function of the shapes of the operands. When the resulting family of algorithms is combined with a highly optimized inner-kernel for a small matrix multiplication, the approach yields performance that is superior to that of methods that automatically tune such kernels. Preliminary results, for the Intel Pentium III processor, support the theoretical insights.

1 Introduction

Research in the development of linear algebra libraries has recently shifted to the automatic generation and optimization of matrix multiplication kernels. The idea is that many linear algebra operations can be implemented in terms of matrix multiplication [3, 11, 7], and that it is therefore this operation that should be highly optimized on different platforms. Since the coding effort is considerable, especially when multiple layers of cache are involved, the general consensus is that this process should be automated.

In this paper, we develop a theoretical framework that (1) suggests a formula for the block sizes that should be used at each level of the memory hierarchy, and (2) restricts the possible loop orderings to a specific family of algorithms for matrix multiplication. Together, these results show how to build highly optimized matrix multiplication implementations that utilize the caches in a locally-optimal fashion. The results could equally well be used to limit the search space that must be examined by packages that automatically tune such kernels.
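To make the run-time selection idea concrete, the small C fragment below (not from the paper) chooses among the three shaped variants of C ← AB + C defined in Section 2.1 below, based on which dimension is small. The function name, the enum labels, and the single cutoff value are hypothetical placeholders; the paper instead derives block sizes and loop orderings from its model of the memory hierarchy.

/* Illustrative only: pick one of the three shaped variants of C <- A*B + C
 * (see Section 2.1) from the operand dimensions m, n, k at run time.
 * The cutoff 'small' is a hypothetical placeholder, not a value from the paper. */
#include <stdio.h>

typedef enum { MATRIX_PANEL, PANEL_MATRIX, PANEL_PANEL } gemm_variant;

gemm_variant choose_variant(int m, int n, int k, int small)
{
    if (n <= small) return MATRIX_PANEL;   /* B and C are narrow      */
    if (m <= small) return PANEL_MATRIX;   /* A and C have few rows   */
    if (k <= small) return PANEL_PANEL;    /* small inner dimension   */
    return PANEL_PANEL;                    /* all dimensions large: any variant applies after further blocking */
}

int main(void)
{
    /* A tall, narrow update (n small) selects the matrix-panel variant. */
    printf("%d\n", choose_variant(4000, 64, 4000, 128));   /* prints 0 (MATRIX_PANEL) */
    return 0;
}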
The current pursuit of highly optimized matrix kernels achieved by coding in a high-level programming language started with the FORTRAN implementation of the Basic Linear Algebra Subprograms (BLAS) [5] for the IBM Power2 [1]. Subsequently, the PHiPAC project [4] at UC-Berkeley demonstrated that high-performance matrix multiplication kernels can be written in C and that code generators can be used to automatically generate many different blockings, allowing automatic tuning. Next, the Automatically Tuned Linear Algebra Software (ATLAS) project [12] at the University of Tennessee extended the ideas developed as part of the PHiPAC project by reducing the kernel that is called once matrices have been massaged into the L1 cache to one specific case, C = A^T B + βC for small matrices A, B, and C, and by reducing the space searched for optimal blockings. Furthermore, it marketed the methodology, allowing it to gain widespread acceptance and igniting the current craze in the linear algebra community towards automatically tuned libraries. Finally, there has been considerable recent interest in recursive algorithms and recursive data structures. The idea here is that by recursively partitioning the operands, blocks that fit in the different levels of the caches will automatically be encountered [9]. By storing matrices recursively, blocks that are encountered during the execution of the recursive algorithms will lie in contiguous memory [2, 8, 10]. Other work closely related to this topic is discussed in other papers presented as part of this session of the conference.

2 Notation and Terminology

2.1 Special cases of matrix multiplication

The general form of a matrix multiply is C ← α AB + βC, where C is m × n, A is m × k, and B is k × n. We will use the following terminology when referring to a matrix multiply in which two dimensions are large and one is small:

                             Condition     Shape
    Matrix-panel multiply    n is small    C and B consist of only a few columns
    Panel-matrix multiply    m is small    C and A consist of only a few rows
    Panel-panel multiply     k is small    A consists of only a few columns and B of only a few rows

The following observation will become key to understanding concepts encountered in the rest of the paper. Partition

    X = \left( X_1 \; \cdots \; X_{N_X} \right) = \left( \begin{array}{c} \hat{X}_1 \\ \vdots \\ \hat{X}_{M_X} \end{array} \right) \quad \text{for } X \in \{A, B, C\},

where C_j is m × n_j, Ĉ_i is m_i × n, A_p is m × k_p, Â_i is m_i × k, B_j is k × n_j, and B̂_p is k_p × n. Then C ← AB + C can be achieved by

    multiple matrix-panel multiplies:   C_j ← A B_j + C_j    for j = 1, . . . , N_C,
    multiple panel-matrix multiplies:   Ĉ_i ← Â_i B + Ĉ_i    for i = 1, . . . , M_C,
    multiple panel-panel multiplies:    C ← Σ_{p=1}^{N_A} A_p B̂_p + C.
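The three partitioned forms above translate directly into blocked loops. The following C sketch is illustrative rather than the paper's implementation: it assumes column-major storage with explicit leading dimensions, uses an unoptimized triple loop in place of the highly optimized inner kernel, and leaves the block sizes nb, mb, and kb as parameters (the paper derives such block sizes from the cost model of Section 2.2).

/* Illustrative sketch (not the paper's code) of the three partitioned forms of
 * C <- A*B + C.  All matrices are column-major; lda, ldb, ldc are leading dimensions. */

/* Reference multiply standing in for the optimized inner kernel:
 * C (m x n) += A (m x k) * B (k x n). */
void gemm_ref(int m, int n, int k,
              const double *A, int lda, const double *B, int ldb,
              double *C, int ldc)
{
    for (int j = 0; j < n; j++)
        for (int p = 0; p < k; p++)
            for (int i = 0; i < m; i++)
                C[i + j*ldc] += A[i + p*lda] * B[p + j*ldb];
}

/* Multiple matrix-panel multiplies: C_j <- A B_j + C_j over column blocks of width nb. */
void gemm_by_matrix_panel(int m, int n, int k, int nb,
                          const double *A, int lda, const double *B, int ldb,
                          double *C, int ldc)
{
    for (int j = 0; j < n; j += nb) {
        int jb = (n - j < nb) ? n - j : nb;
        gemm_ref(m, jb, k, A, lda, &B[j*ldb], ldb, &C[j*ldc], ldc);
    }
}

/* Multiple panel-matrix multiplies: Ĉ_i <- Â_i B + Ĉ_i over row blocks of height mb. */
void gemm_by_panel_matrix(int m, int n, int k, int mb,
                          const double *A, int lda, const double *B, int ldb,
                          double *C, int ldc)
{
    for (int i = 0; i < m; i += mb) {
        int ib = (m - i < mb) ? m - i : mb;
        gemm_ref(ib, n, k, &A[i], lda, B, ldb, &C[i], ldc);
    }
}

/* Multiple panel-panel multiplies: C <- sum_p A_p B̂_p + C, a sequence of rank-kb updates. */
void gemm_by_panel_panel(int m, int n, int k, int kb,
                         const double *A, int lda, const double *B, int ldb,
                         double *C, int ldc)
{
    for (int p = 0; p < k; p += kb) {
        int pb = (k - p < kb) ? k - p : kb;
        gemm_ref(m, n, pb, &A[p*lda], lda, &B[p], ldb, C, ldc);
    }
}

For example, gemm_by_matrix_panel(m, n, k, nb, A, m, B, k, C, m) computes C ← AB + C one block of nb columns at a time; in the paper's setting, the block sizes and the choice among the three loops at each level of the memory hierarchy are exactly what the theory determines.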
2.2 A cost model for hierarchical memories

The memory hierarchy of a modern microprocessor is often viewed as a pyramid: at the top of the pyramid are the processor registers, with extremely fast access; at the bottom are disks and even slower media. As one goes down the pyramid, the cost of memory decreases, while the amount of memory increases along with the time required to access that memory.

We will model the above-mentioned hierarchy naively as follows: (1) The memory hierarchy consists of H levels, indexed 0, . . . , H − 1. Level 0 corresponds to the registers. We will often denote the i-th level by L_i. Notice that on a typical current architecture, L_1 and L_2 correspond to the level 1 and level 2 caches.