GT CS 8803 - The Influence of Caches on the Performance of Heaps - D303534

Home> Schools> Georgia Tech> Computer Science (CS) > CS 8803> The Influence of Caches on the Performance of Heaps

DOC PREVIEW

GT CS 8803 - The Influence of Caches on the Performance of Heaps

School name Georgia Tech

Course Cs 8803- Special Topics

Pages 26

This preview shows page 1-2-3-24-25-26 out of 26 pages.

Save

View full document

Premium Document

Do you want full access? Go Premium and unlock all 26 pages.

Access to all documents

Download any document

Ad free experience

Subscribe for instant access Get instant access

View full document

Premium Document

Do you want full access? Go Premium and unlock all 26 pages.

Access to all documents

Download any document

Ad free experience

Subscribe for instant access Get instant access

View full document

Premium Document

Do you want full access? Go Premium and unlock all 26 pages.

Access to all documents

Download any document

Ad free experience

Subscribe for instant access Get instant access

View full document

Premium Document

Do you want full access? Go Premium and unlock all 26 pages.

Access to all documents

Download any document

Ad free experience

Subscribe for instant access Get instant access

View full document

Premium Document

Do you want full access? Go Premium and unlock all 26 pages.

Access to all documents

Download any document

Ad free experience

Subscribe for instant access Get instant access

View full document

Premium Document

Do you want full access? Go Premium and unlock all 26 pages.

Access to all documents

Download any document

Ad free experience

Subscribe for instant access Get instant access

Premium Document

Do you want full access? Go Premium and unlock all 26 pages.

Access to all documents

Download any document

Ad free experience

Subscribe for instant access Get instant access

Unformatted text preview:

The Influence of Caches on the Performance of Heaps TR # UW-CSE-96-02-03Anthony LaMarcaXerox PARCRichard E. LadnerUniversity of WashingtonSeattle, WA 98195January 6, 1997AbstractAs memory access times grow larger relative to processor cycle times, the cache performance of al-gorithms has an increasingly large impact on overall performance. Unfortunately, most commonly usedalgorithms were not designed with cache performance in mind. This paper investigates the cache perfor-mance of implicit heaps. We present optimizations which significantly reduce the cache misses that heapsincurandimprovetheir overallperformance. We presentan analytical model called collectiveanalysisthatallows cache performance to be predicted as a function of both cache configuration and algorithm config-uration. As part of our investigation, we perform an approximate analysis of the cache performance ofboth traditional heaps and our improved heaps in our model. In addition empirical data is given for fivearchitectures to show the impact our optimizationshave on overall performance. We also revisit a priorityqueue study originally performed by Jones [25]. Due to the increases in cache miss penalties, the rela-tive performance results we obtain on today’s machines differ greatly from the machines of only ten yearsago. We compare the performance of implicit heaps, skew heaps and splay trees and discuss the differencebetween our results and Jones’s.1 IntroductionThe timeto service a cache miss to memory has grown from 6 cycles for the Vax 11/780to 120 for the AlphaServer 8400[10, 18]. Cache miss penalties have grown to the point where good overall performance cannot be achieved withoutgood cache performance. Unfortunately, many fundamental algorithms were developed without considering caching.In this paper, we investigate and optimize the cache performance of implicit heaps. As part of this study, we introducecollective analysis, a framework we have developed that allows an algorithm’s cache performance to be predicted fordirect mapped caches. Our hope is that algorithm designerswill use models like collective analysis in conjunctionwithtraditional analysis techniques in order to better understand how their algorithms perform on real machines.1.1 MethodologyThis paper is a study of the performance of algorithms. We examine the way programs are analyzed and optimizedfor performance. In this section we present the methodology we use in our performance study and contrast it with This paper appears in The ACM Journal of ExperimentalAlgorithmics. The work of LaMarca was performed at the Universityof Washington and was partly supportedby an AT&T fellowship.Author’s addresses: A. LaMarca, Xerox PARC, 3333 Coyote Hill Road, Palo Alto CA 94304; email: [email protected];R. E. Ladner, Department of Computer Science and Engineering, University of Washington, Seattle, WA 98195; email:[email protected] that have been used in the past.The true test of performance is execution time and numerous researchers investigate algorithm performance bycomparing execution times. The drawback of simply implementing an algorithm and measuring its execution time,however, is that very little intuition is provided as to why the performance was either good or bad. As we will showin this paper, it is possible for one algorithm to exhibit good cache locality with high instructioncounts while a similaralgorithm exhibits just the opposite. While execution time is the final measure of performance, it is too coarse-graineda metric to use in the intermediate stages of analysis and optimization of algorithms.The majority of researchers in the algorithm community compare algorithm performance using analyses in a unit-cost model. The RAM model [4, 12] is used most commonly, and in this abstract architecture all basic operations,includingreads and writes tomemory, have unitcost. Unit-costmodels have theadvantage that they are simple, easy touse, and produce results that are easily compared. The big drawback is that unit-cost models do not reflect the memoryhierarchies present in modern computers. In the past, they may have been fair indicators of performance, but that is nolonger true.It is also common for the analyses of algorithms in a specific area to only count particular expensive operations.Analyses of searching algorithms, for example, typically only count the number of comparisons performed. The mo-tivation behind only counting expensive operations is a sound one. It simplifies the analysis yet retains accuracy sincethe bulk of the costs are captured. The problem with this approach is that shifts in technology can render the expen-sive operations inexpensive and vice versa. Such is the case with comparisons performed in searching algorithms. Onmodern machines comparing two registers is no more expensive than copying or adding them. In Section 4.1, we showhow this type of analysis can lead to incorrect conclusions regarding performance.In this paper, we employ an incremental approach to evaluating algorithm performance. Rather than discard exist-ing analyses and perform them again in a new comprehensive model, we leverage as much as we can off existing anal-yses and fill in where they are weak. An evolutionary approach should involve less work than a complete re-evaluationand it offers an easy transitionfor those interested in a more accurate analysis.One weakness of existing unit-costanalysis is that it fails to measure the cost of cache misses that algorithms incur.For this reason, we divide total execution time into two parts. The first part is the time the algorithm would spendexecuting in a system where cache misses do not occur. We refer to this as the instruction cost of the algorithm. Thesecond part is the time spent servicing the cache misses that do occur and we call this the memory overhead.We measure instruction cost with both analyses in a unit-cost model and with dynamic instruction counts fromexecutions. In this paper we employ both existing unit-cost analyses and some simple analyses of our own. Details ofour technique for instrumenting programs to produce dynamic instruction counts are described in Section 4.4. Neitherof these techniques will model instruction cost perfectly. Execution delays such as branch delay stalls and TranslationLookasideBuffer(TLB) misses arenotmeasured witheitherof thesemethods. Nordotheycapture variancein thecycletimes of different types of instructions. Despite these

View Full Document