UT CS 378 - An Experimental Comparison


An Experimental Comparison of Cache-oblivious and Cache-conscious Programs*

Kamen Yotov, Tom Roeder, Keshav Pingali
Cornell University
{kyotov,tmroeder,pingali}@cs.cornell.edu

John Gunnels, Fred Gustavson
IBM T. J. Watson Research Center
{gunnels,fg2}@us.ibm.com

ABSTRACT

Cache-oblivious algorithms have been advanced as a way of circumventing some of the difficulties of optimizing applications to take advantage of the memory hierarchy of modern microprocessors. These algorithms are based on the divide-and-conquer paradigm: each division step creates sub-problems of smaller size, and when the working set of a sub-problem fits in some level of the memory hierarchy, the computations in that sub-problem can be executed without suffering capacity misses at that level. In this way, divide-and-conquer algorithms adapt automatically to all levels of the memory hierarchy; in fact, for problems like matrix multiplication, matrix transpose, and FFT, these recursive algorithms are optimal to within constant factors for some theoretical models of the memory hierarchy.

An important question is the following: how well do carefully tuned cache-oblivious programs perform compared to carefully tuned cache-conscious programs for the same problem? Is there a price for obliviousness, and if so, how much performance do we lose? Somewhat surprisingly, there are few studies in the literature that have addressed this question.

This paper reports the results of such a study in the domain of dense linear algebra. Our main finding is that in this domain, even highly optimized cache-oblivious programs perform significantly worse than corresponding cache-conscious programs.
We provide insights into why this is so, and suggest research directions for making cache-oblivious algorithms more competitive.

Categories and Subject Descriptors

D.3.4 [Programming Languages]: compilers, code generation, optimization; G.4 [Mathematical Software]

*This work is supported in part by NSF grants 0615240, 0541193, 0509307, 0509324 and 0406380, as well as grants from the IBM and Intel Corporations.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. To copy otherwise, to republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee.
SPAA'07, June 9-11, 2007, San Diego, California, USA.
Copyright 2007 ACM 978-1-59593-667-7/07/0006 ...$5.00.

General Terms

Algorithms, Measurement, Performance, Experimentation

Keywords

Memory hierarchy, Memory latency, Memory bandwidth, Cache-oblivious algorithms, Cache-conscious algorithms, Numerical software

1. INTRODUCTION

The contributions of this paper are the following.

* We present detailed experiments on a number of high-performance platforms that show that even highly optimized recursive cache-oblivious programs may perform significantly worse than highly optimized cache-conscious programs for the same problem.

* We argue that part of the performance problem arises because the schedule of operations in recursive codes may be sub-optimal for exploiting processor pipelines. We show that the schedule of operations in iterative codes can make better use of processor pipelines.

* We argue that the rest of the performance problem arises from memory latency. Using analytical models, we point out that cache blocking serves two purposes: it can reduce the effective latency of memory requests and it can reduce the bandwidth required from memory.
We argue quantitatively that I/O optimal algorithms [20, 25] may ameliorate the bandwidth problem, but their performance may still suffer from memory latency. In highly tuned iterative cache-conscious codes, the effective latency of memory requests is reduced by pre-fetching. We believe that this is needed in cache-oblivious programs as well.

1.1 Memory Hierarchy Problem

The performance of many programs on modern computers is limited by the performance of the memory system in two ways. First, the latency of memory accesses can be many hundreds of cycles, so the processor may be stalled most of the time, waiting for loads to complete. Second, the bandwidth from memory is usually far less than the rate at which the processor can consume data.

Both problems can be addressed by using caching: if most memory requests are satisfied by some cache level, the effective memory latency as well as the bandwidth required from memory are reduced. As is well known, the effectiveness of caching for a problem depends both on the algorithm used to solve the problem, and on the program used to express that algorithm (simply put, an algorithm defines only the dataflow of the computation, while a program for a given algorithm also specifies the schedule of operations and may perform storage allocation consistent with that schedule).

One useful quantity in thinking about these issues is algorithmic data reuse, which is an abstract measure of the number of accesses made to a typical memory location by the algorithm. For example, the standard algorithm for multiplying matrices of size n x n performs O(n^3) operations on O(n^2) data, so it has excellent algorithmic data reuse since each data element is accessed O(n) times; in contrast, matrix transpose performs O(n^2) operations on O(n^2) data, so it has poor algorithmic data reuse.
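The reuse figures above can be checked directly by tallying how many times each input element is touched by the two algorithms. The following sketch (illustrative Python, not from the paper) counts reads of each element of the input matrix A:

```python
from collections import Counter

def matmul_reads(n):
    """Count reads of each element of A in the naive n x n matrix multiply
    C[i][j] += A[i][k] * B[k][j]."""
    reads = Counter()
    for i in range(n):
        for j in range(n):
            for k in range(n):
                reads[(i, k)] += 1  # A[i][k] is read once per (i, j, k) iteration
    return reads

def transpose_reads(n):
    """Count reads of each element of A in the n x n transpose B[j][i] = A[i][j]."""
    reads = Counter()
    for i in range(n):
        for j in range(n):
            reads[(i, j)] += 1  # each element is read exactly once
    return reads

n = 8
# Matrix multiply: every element of A is read n times, i.e. O(n) reuse.
assert all(count == n for count in matmul_reads(n).values())
# Transpose: every element of A is read once, i.e. no reuse.
assert all(count == 1 for count in transpose_reads(n).values())
```

This makes the contrast concrete: multiplication offers O(n) reuse per location for the cache to exploit, while transpose offers none, leaving only spatial locality.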
When an algorithm has substantial algorithmic data reuse, the challenge is to write the program so that the memory accesses made by that program exhibit both spatial and temporal locality. In contrast, programs that encode algorithms with poor algorithmic data reuse must be written so that spatial locality is exploited.

1.2 Programming styles

Two programming styles are common in the domain of dense linear algebra: iterative and recursive.

In the iterative programming style, computations are implemented as nested loops. It is well known that naïve programs written in this style exhibit poor temporal locality and do not exploit caches effectively. Temporal locality can be improved by tiling the loops either manually or with restructuring compilers [22, 28]. The resulting program can be viewed as a computation over block matrices; tile sizes must be chosen so that the working set of each block computation fits in cache [12, 24]. If there are multiple cache levels, it may be necessary to tile for each one. Tiling for
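The tiling transformation described above can be sketched as follows (illustrative Python; a production kernel would be written in C or Fortran, and the tile size `b` shown here is a stand-in for a value derived from the actual cache capacity):

```python
def matmul_tiled(A, B, n, b):
    """Blocked n x n matrix multiply over b x b tiles.

    Each innermost block computation touches a working set of roughly
    3 * b * b elements (one tile each of A, B, and C); b is chosen so
    that this working set fits in cache. The dataflow is identical to
    the naive triple loop -- only the schedule of operations changes.
    """
    C = [[0.0] * n for _ in range(n)]
    for ii in range(0, n, b):              # tiles of rows of C
        for jj in range(0, n, b):          # tiles of columns of C
            for kk in range(0, n, b):      # tiles of the reduction dimension
                for i in range(ii, min(ii + b, n)):
                    for j in range(jj, min(jj + b, n)):
                        s = C[i][j]
                        for k in range(kk, min(kk + b, n)):
                            s += A[i][k] * B[k][j]
                        C[i][j] = s
    return C
```

Because tiling reorders but does not alter the computation, the result agrees with the untiled triple loop for any tile size; with multiple cache levels, this blocking would be applied once per level, as the text notes.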

