UT CS 378 - An Experimental Comparison


An Experimental Comparison of Cache-oblivious and Cache-conscious Programs*

Kamen Yotov, Tom Roeder, Keshav Pingali
Cornell University
{kyotov,tmroeder,pingali}@cs.cornell.edu

John Gunnels, Fred Gustavson
IBM T. J. Watson Research Center
{gunnels,fg2}@us.ibm.com

ABSTRACT

Cache-oblivious algorithms have been advanced as a way of circumventing some of the difficulties of optimizing applications to take advantage of the memory hierarchy of modern microprocessors. These algorithms are based on the divide-and-conquer paradigm: each division step creates sub-problems of smaller size, and when the working set of a sub-problem fits in some level of the memory hierarchy, the computations in that sub-problem can be executed without suffering capacity misses at that level. In this way, divide-and-conquer algorithms adapt automatically to all levels of the memory hierarchy; in fact, for problems like matrix multiplication, matrix transpose, and FFT, these recursive algorithms are optimal to within constant factors for some theoretical models of the memory hierarchy.

An important question is the following: how well do carefully tuned cache-oblivious programs perform compared to carefully tuned cache-conscious programs for the same problem? Is there a price for obliviousness, and if so, how much performance do we lose? Somewhat surprisingly, there are few studies in the literature that have addressed this question.

This paper reports the results of such a study in the domain of dense linear algebra. Our main finding is that in this domain, even highly optimized cache-oblivious programs perform significantly worse than corresponding cache-conscious programs.
We provide insights into why this is so, and suggest research directions for making cache-oblivious algorithms more competitive.

Categories and Subject Descriptors

D.3.4 [Programming Languages]: compilers, code generation, optimization; G.4 [Mathematical Software]

*This work is supported in part by NSF grants 0615240, 0541193, 0509307, 0509324 and 0406380, as well as grants from the IBM and Intel Corporations.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. To copy otherwise, to republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee.
SPAA'07, June 9-11, 2007, San Diego, California, USA.
Copyright 2007 ACM 978-1-59593-667-7/07/0006 ...$5.00.

General Terms

Algorithms, Measurement, Performance, Experimentation

Keywords

Memory hierarchy, Memory latency, Memory bandwidth, Cache-oblivious algorithms, Cache-conscious algorithms, Numerical software

1. INTRODUCTION

The contributions of this paper are the following.

* We present detailed experiments on a number of high-performance platforms that show that even highly optimized recursive cache-oblivious programs may perform significantly worse than highly optimized cache-conscious programs for the same problem.

* We argue that part of the performance problem arises because the schedule of operations in recursive codes may be sub-optimal for exploiting processor pipelines. We show that the schedule of operations in iterative codes can make better use of processor pipelines.

* We argue that the rest of the performance problem arises from memory latency. Using analytical models, we point out that cache blocking serves two purposes: it can reduce the effective latency of memory requests and it can reduce the bandwidth required from memory.
We argue quantitatively that I/O optimal algorithms [20, 25] may ameliorate the bandwidth problem, but their performance may still suffer from memory latency. In highly tuned iterative cache-conscious codes, the effective latency of memory requests is reduced by pre-fetching. We believe that this is needed in cache-oblivious programs as well.

1.1 Memory Hierarchy Problem

The performance of many programs on modern computers is limited by the performance of the memory system in two ways. First, the latency of memory accesses can be many hundreds of cycles, so the processor may be stalled most of the time, waiting for loads to complete. Second, the bandwidth from memory is usually far less than the rate at which the processor can consume data.

Both problems can be addressed by using caching: if most memory requests are satisfied by some cache level, the effective memory latency as well as the bandwidth required from memory are reduced. As is well known, the effectiveness of caching for a problem depends both on the algorithm used to solve the problem, and on the program used to express that algorithm (simply put, an algorithm defines only the dataflow of the computation, while a program for a given algorithm also specifies the schedule of operations and may perform storage allocation consistent with that schedule).

One useful quantity in thinking about these issues is algorithmic data reuse, which is an abstract measure of the number of accesses made to a typical memory location by the algorithm. For example, the standard algorithm for multiplying matrices of size n x n performs O(n^3) operations on O(n^2) data, so it has excellent algorithmic data reuse since each data element is accessed O(n) times; in contrast, matrix transpose performs O(n^2) operations on O(n^2) data, so it has poor algorithmic data reuse.
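The reuse figures above can be checked directly by tallying how many times each input element is touched by the two algorithms. The following sketch (illustrative Python, not from the paper) counts reads of each element of the input matrix A:

```python
from collections import Counter

def matmul_reads(n):
    """Count reads of each element of A in the naive n x n matrix multiply
    C[i][j] += A[i][k] * B[k][j]."""
    reads = Counter()
    for i in range(n):
        for j in range(n):
            for k in range(n):
                reads[(i, k)] += 1  # A[i][k] is read once per (i, j, k) iteration
    return reads

def transpose_reads(n):
    """Count reads of each element of A in the n x n transpose B[j][i] = A[i][j]."""
    reads = Counter()
    for i in range(n):
        for j in range(n):
            reads[(i, j)] += 1  # each element is read exactly once
    return reads

n = 8
# Matrix multiply: every element of A is read n times, i.e. O(n) reuse.
assert all(count == n for count in matmul_reads(n).values())
# Transpose: every element of A is read once, i.e. no reuse.
assert all(count == 1 for count in transpose_reads(n).values())
```

This makes the contrast concrete: multiplication offers O(n) reuse per location for the cache to exploit, while transpose offers none, leaving only spatial locality.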
When an algorithm has substantial algorithmic data reuse, the challenge is to write the program so that the memory accesses made by that program exhibit both spatial and temporal locality. In contrast, programs that encode algorithms with poor algorithmic data reuse must be written so that spatial locality is exploited.

1.2 Programming styles

Two programming styles are common in the domain of dense linear algebra: iterative and recursive.

In the iterative programming style, computations are implemented as nested loops. It is well known that naïve programs written in this style exhibit poor temporal locality and do not exploit caches effectively. Temporal locality can be improved by tiling the loops either manually or with restructuring compilers [22, 28]. The resulting program can be viewed as a computation over block matrices; tile sizes must be chosen so that the working set of each block computation fits in cache [12, 24]. If there are multiple cache levels, it may be necessary to tile for each one. Tiling for
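The tiling transformation described above can be sketched as follows (illustrative Python; a production kernel would be written in C or Fortran, and the tile size `b` shown here is a stand-in for a value derived from the actual cache capacity):

```python
def matmul_tiled(A, B, n, b):
    """Blocked n x n matrix multiply over b x b tiles.

    Each innermost block computation touches a working set of roughly
    3 * b * b elements (one tile each of A, B, and C); b is chosen so
    that this working set fits in cache. The dataflow is identical to
    the naive triple loop -- only the schedule of operations changes.
    """
    C = [[0.0] * n for _ in range(n)]
    for ii in range(0, n, b):              # tiles of rows of C
        for jj in range(0, n, b):          # tiles of columns of C
            for kk in range(0, n, b):      # tiles of the reduction dimension
                for i in range(ii, min(ii + b, n)):
                    for j in range(jj, min(jj + b, n)):
                        s = C[i][j]
                        for k in range(kk, min(kk + b, n)):
                            s += A[i][k] * B[k][j]
                        C[i][j] = s
    return C
```

Because tiling reorders but does not alter the computation, the result agrees with the untiled triple loop for any tile size; with multiple cache levels, this blocking would be applied once per level, as the text notes.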

