UT CS 378 - Is Search Really Necessary to Generate High-Performance BLAS - D2412086

Home> Schools> University of Texas at Austin> Computer Science (CS) > CS 378> Is Search Really Necessary to Generate High-Performance BLAS

DOC PREVIEW

UT CS 378 - Is Search Really Necessary to Generate High-Performance BLAS

School name University of Texas at Austin

Course Cs 378- Declarative Programming

Pages 29

This preview shows page 1-2-3-27-28-29 out of 29 pages.

Save

View full document

Premium Document

Do you want full access? Go Premium and unlock all 29 pages.

Access to all documents

Download any document

Ad free experience

Subscribe for instant access Get instant access

View full document

Premium Document

Do you want full access? Go Premium and unlock all 29 pages.

Access to all documents

Download any document

Ad free experience

Subscribe for instant access Get instant access

View full document

Premium Document

Do you want full access? Go Premium and unlock all 29 pages.

Access to all documents

Download any document

Ad free experience

Subscribe for instant access Get instant access

View full document

Premium Document

Do you want full access? Go Premium and unlock all 29 pages.

Access to all documents

Download any document

Ad free experience

Subscribe for instant access Get instant access

View full document

Premium Document

Do you want full access? Go Premium and unlock all 29 pages.

Access to all documents

Download any document

Ad free experience

Subscribe for instant access Get instant access

View full document

Premium Document

Do you want full access? Go Premium and unlock all 29 pages.

Access to all documents

Download any document

Ad free experience

Subscribe for instant access Get instant access

Premium Document

Do you want full access? Go Premium and unlock all 29 pages.

Access to all documents

Download any document

Ad free experience

Subscribe for instant access Get instant access

Unformatted text preview:

1Is Search Really Necessary to GenerateHigh-Performance BLAS?Kamen Yotov, Xiaoming Li, Gang Ren, Maria Garzaran,David Padua, Keshav Pingali, Paul StodghillAbstract—A key step in program optimization is the estimationof optimal values for parameters such as tile sizes and loopunrolling factors. Traditional compilers use simple analyticalmodels to compute these values. In contrast, library generatorslike ATLAS use global search over the space of parameter valuesby generating programs with many different combinations ofparameter values, and running them on the actual hardwareto determine which values give the best performance. It iswidely believed that traditional model-driven optimization can-not compete with search-based empirical optimization becausetractable analytical models cannot capture all the complexitiesof modern high-performance architectures, but few quantitativecomparisons have been done to date.To make such a comparison, we replaced the global searchengine in ATLAS with a model-driven optimization engine,and measured the relative performance of the code producedby the two systems on a variety of architectures. Since bothsystems use the same code generator, any differences in theperformance of the code produced by the two systems cancome only from differences in optimization parameter values.Our experiments show that model-driven optimization can besurprisingly effective, and can generate code with performancecomparable to that of code generated by ATLAS using globalsearch.Index Terms—program optimization, empirical optimization,model-driven optimization, compilers, library generators, BLAS,high-performance computingI. INTRODUCTIONThe sciences do not try to explain, they hardlyeven try to interpret, they mainly make models. Bya model is meant a mathematical construct which,with the addition of certain verbal interpretations,describes observed phenomena. The justification ofsuch a mathematical construct is solely and preciselythat it is expected to work.John Von NeumannIt is a fact universally recognized that current restructuringcompilers do not generate code that can compete with hand-tuned code in efficiency, even for a simple kernel like matrixmultiplication. This inadequacy of current compilers doesnot stem from a lack of technology for transforming high-level programs into programs that run efficiently on modernhigh-performance architectures; over the years, the compilercommunity has invented innumerable techniques such as linearloop transformations [5,11,14, 29,42], loop tiling [27, 28,This work was supported by NSF grants ACI-9870687, EIA-9972853, ACI-0085969, ACI-0090217, ACI-0103723, and ACI-012140. K. Yotov, K. Pingaliand P. Stodghill are with Cornell University; X. Li, G. Ren, M. Garzaran andD. Padua are with University of Illinois at Urbana-Champaign43] and loop unrolling [4, 32] for enhancing locality andparallelism. Other work has focused on algorithms for esti-mating optimal values for parameters associated with thesetransformations, such as tile sizes [7, 13,36] and loop unrollfactors [4]. Nevertheless, performance-conscious programmersmust still optimize their programs manually [15,19].The simplest manual approach to tuning a program for agiven platform is to write different versions of that program,evaluate the performance of these versions on the target plat-form, and select the one that gives the best performance. Thesedifferent versions usually implement the same algorithm, butdiffer in the values they use for parameters such as tilesizes and loop unroll factors. The architectural insights anddomain knowledge of the programmer are used to limit thenumber of versions that are evaluated. In effect, the analyticaltechniques used in current compilers to derive optimal valuesfor such parameters are replaced by an empirical search overa suitably restricted space of parameter values (by empiricalsearch, we mean a three step process: (1) generating a versionof the program corresponding to each combination of theparameters under consideration, (2) executing each version onthe target machine and measuring its performance, and (3)selecting the version that performs best). This approach hasbeen advocated most forcefully by Fred Gustavson and his co-workers at IBM, who have used it for many years to generatethe highly optimized ESSL and PESSL libraries for IBM ma-chines [34]. Recently, a number of projects such as FFTW [17,18], PhiPAC [2,6], ATLAS [1, 41], and SPIRAL [26,33] haveautomated the generation of the different program versionswhose performance must be evaluated. Experience shows thatthese library generators produce much better code than nativecompilers do on modern high-performance architectures.Our work was motivated by a desire to understand theperformance gap between the BLAS codes produced by AT-LAS and by restructuring compilers, with the ultimate goalof improving the state of the art of current compilers. Onereason why compilers might be at a disadvantage is that theyare general-purpose and must be able to optimize any program,whereas a library generator like ATLAS can focus on a partic-ular problem domain. However, this is somewhat implausiblebecause dense numerical linear algebra, the particular problemdomain of ATLAS, is precisely the area that has been studiedmost intensely by the compiler community, and there is anextensive collection of well-understood transformations foroptimizing dense linear algebra programs. Another reason forthe inadequacy of current compilers might be that new trans-formations, unknown to the compiler community, are required2ATLAS Search Engine(mmsearch)Execute&MeasureATLAS MM Code Generator(mmcase)mini-MMMSourceLSMFLOPSMU, NU, KUNBFF, IF, NFFMAModel Parameter Estimator(mmmodel)ATLAS MM Code Generator(mmcase)LSMU, NU, KUNBFF, IF, NFFMAL1SizeNRLSFMADetectHardwareParametersmini-MMMSourceL1SizeNRL*, |ALUFP|FMADetectHardwareParametersL1 I-CacheFig. 1. Architecture of ATLAS and of Model-driven ATLASto produce code of the same quality as the code produced byATLAS. Finally, it is possible that the analytical models usedby compilers to estimate optimal values for transformationparameters are overly simplistic, given the complex hardwareof modern computers, so they are not able to produce goodvalues for program optimization parameters.No definitive studies exist to settle these matters. Ourresearch is the first quantitative study of these issues.Figure 1 shows our experimental set-up, which makes useof the original

View Full Document