New version page

Auto-tuning Performance on Multicore Computers

This preview shows page 1-2-3-4-5-6-7-8-9-10-11-12-13-14-15-16-112-113-114-115-116-117-118-119-120-121-122-123-124-125-126-127-226-227-228-229-230-231-232-233-234-235-236-237-238-239-240-241 out of 241 pages.

View Full Document
View Full Document

End of preview. Want to read all 241 pages?

Upload your study docs or become a GradeBuddy member to access this document.

View Full Document
Unformatted text preview:

Auto-tuning Performance on Multicore ComputersSamuel Webb WilliamsElectrical Engineering and Computer SciencesUniversity of California at BerkeleyTechnical Report No. UCB/EECS-2008-164http://www.eecs.berkeley.edu/Pubs/TechRpts/2008/EECS-2008-164.htmlDecember 17, 2008Copyright 2008, by the author(s).All rights reserved. Permission to make digital or hard copies of all or part of this work forpersonal or classroom use is granted without fee provided that copies arenot made or distributed for profit or commercial advantage and that copiesbear this notice and the full citation on the first page. To copy otherwise, torepublish, to post on servers or to redistribute to lists, requires prior specificpermission.Auto-tuning Performance on Multicore ComputersbySamuel Webb WilliamsB.S. (Southern Methodist University) 1999B.S. (Southern Methodist University) 1999M.S. (University of California, Berkeley) 2003A dissertation submitted in partial satisfaction of therequirements for the degree ofDoctor of PhilosophyinComputer Sciencein theGRADUATE DIVISIONof theUNIVERSITY OF CALIFORNIA, BERKELEYCommittee in charge:Professor David A. Patterson, ChairProfessor Katherine YelickProfessor Sara McMainsFall 2008Auto-tuning Performance on Multicore ComputersCopyright 2008bySamuel Webb Williams1AbstractAuto-tuning Performance on Multicore ComputersbySamuel Webb WilliamsDoctor of Philosophy in Computer ScienceUniversity of California, BerkeleyProfessor David A. Patterson, ChairFor the last decade, the exponential potential of Moore’s Law has bee n squan-dered in the effort to increase single thread performance, which is now limited by thememory, instruction, and power walls. In response, the computing industry has boldlyplaced its hopes on the multicore gambit. That is, abandon instruction-level parallelismand frequency-scaling in favor of the exponential scaling of the number of compute coresper microprocessor. The massive thread-level parallelism results in tremendous potentialperformance, but demands efficient parallel programming — a task existing software toolsare ill-equipped for. We desire performance portability — the ability to write a programonce and not only have it deliver good performance on the development computer, but onall multicore computers today and tomorrow.This thesis accepts for fact that multicore is the basis for all future computers.Furthermore, we regiment our study by organizing it around the computational patternsand motifs as set forth in the Berkeley View. Although domain experts may be extremelyknowledgeable on the mathematics and algorithms of their fields, they often lack the de-tailed computer architecture knowledge required to achieve high performance. Forthcomingheterogeneous architectures will exacerbate the problem for everyone. Thus, we extendthe auto-tuning approach to program optimization and performance portability to themenagerie of multicore computers. In an automated fashion, an auto-tuner will explorethe optimization space for a particular computational kernel of a motif on a particular com-puter. In doing so, it will determine the best combination of algorithm, implementation,and data structure for the combination of architecture and input data.We implement and evaluate auto-tuners for two im portant kernels: Lattice Boltz-mann Magnetohydrodynamics (LBMHD) and sparse matrix-vector multiplication (SpMV).They are representative of two of the computational motifs: structured grids and sparselinear algebra. To demonstrate the performance portability that our auto-tuners deliver, weselected an extremely wide range of architectures as an experimental test bed. These includeconventional dual- and quad-core superscalar x86 processors both with and without inte-grated memory controllers. We also include the rather unconventional chip multithreaded(CMT) Sun Niagara2 (Victoria Falls) and the heterogeneous, local s tore-base d IBM CellBroadband Engine. In some experiments we sacrifice the performance portability of a com-mon C representation, by creating ISA-specific auto-tuned versions of these kernels to gainarchitectural insight. To quantify our success, we created the Roofline model to perform abound and bottleneck analysis for each kernel-architecture combination.Despite the common wisdom that LBMHD and SpMV are memory bandwidth-bound, and thus nothing can be done to improve performance, we show that auto-tuning2consistently delivers speedups in excess of 3× across all multicore computers except thememory-bound Intel Clovertown, where the benefit was as little as 1.5×. The Cell pro-cessor, with its explicitly m anaged memory hierarchy, showed far more dramatic speedupsof between 20× and 130×. The auto-tuners includes both architecture-independent opti-mizations based solely on source code transformations and high-level kernel knowledge, aswell as architecture-specific optimizations like the explicit use of single instruction, multipledata (SIMD) extensions or the use Cell’s DMA-based memory operations. We observe thatthe these ISA-spe cific optimizations are becoming increasingly important as architecturesevolve.Professor David A. PattersonDissertation Committee ChairiTo those who always believed in me,even when I didn’t.iiContentsList of Figures viiList of Tables xiList of symbols xiii1 Introduction 12 Motivation and Background 52.1 Why Optimize for Performance? . . . . . . . . . . . . . . . . . . . . . . . . 52.2 Trends in Computing . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 62.2.1 Moore’s Law . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 62.2.2 Frequency and Power . . . . . . . . . . . . . . . . . . . . . . . . . . 72.2.3 Single Thread Performance . . . . . . . . . . . . . . . . . . . . . . . 72.2.4 The Multicore Gambit . . . . . . . . . . . . . . . . . . . . . . . . . . 72.2.5 DRAM Bandwidth . . . . . . . . . . . . . . . . . . . . . . . . . . . . 92.2.6 DRAM Latency . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 92.2.7 Cache Coherency . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 92.2.8 Productivity, Programmers, and Performance . . . . . . . . . . . . . 102.3 Dwarfs, Patterns, and Motifs . . . . . . . . . . . . . . . . . . . . . . . . . . 102.3.1 The Berkeley View . . . . . . . . . . . . . . . . . . . . . . . . . . . . 112.3.2 The Case for Patterns . . . . . . . . . . . . . . . . . . . . . . . . . . 112.3.3 The Case for Motifs . . . . . . . . . . . . . . . . . . . . . . . . . . .


Loading Unlocking...
Login

Join to view Auto-tuning Performance on Multicore Computers and access 3M+ class-specific study document.

or
We will never post anything without your permission.
Don't have an account?
Sign Up

Join to view Auto-tuning Performance on Multicore Computers and access 3M+ class-specific study document.

or

By creating an account you agree to our Privacy Policy and Terms Of Use

Already a member?