Synonymous Address Compaction for Energy Reduction in Data TLB

Chinnakrishnan S. [email protected] S. [email protected] Prvulovic†[email protected] of Electrical and Computer Engineering
College of Computing†
Georgia Institute of Technology, Atlanta, GA 30332

ABSTRACT
Modern processors can issue and execute multiple instructions per cycle, often performing multiple memory operations simultaneously. To reduce stalls due to resource conflicts, most processors employ multi-ported L1 caches and TLBs to enable concurrent memory accesses. In this paper, we observe that data TLB lookups within a cycle and across consecutive cycles are often synonymous: they go to the same page. To exploit this finding, we propose two new mechanisms, intra-cycle compaction and inter-cycle compaction of address translation requests, in order to save energy in the data TLB. Our results show that average energy savings of 27% using intra-cycle, 42% using inter-cycle in a conventional d-TLB, and 56% using inter-cycle compaction in semantic-aware d-TLBs can be achieved. When these 2 compaction techniques are combined together and applied to both the i-TLB and semantic-aware d-TLBs, an average energy savings of 76% (up to 87%) is obtained.

Categories and Subject Descriptors
B.3.2 [Memory Structures]: Design Styles - Associative memories, Cache memories, Virtual memory.

General Terms
Design, Experimentation, Performance.

Keywords
Low-power TLB, Spatial and temporal locality, Multi-porting.

1. INTRODUCTION
Multi-issue superscalar processors have become the de facto standard not only for high performance computing but also in embedded computing platforms.
These sophisticated processors issue and execute multiple instructions per cycle and rely on accurate branch predictors, multiple address generation units (AGU), and multi-ported translation lookaside buffers (TLBs) and caches to keep the processor supplied with instructions and data. With virtual memory support, address translation must also be done for instruction and data fetches. Due to the different needs of virtual memory management and cache coherency maintenance, most caches are either physically indexed and physically tagged (PIPT), or virtually indexed and physically tagged (VIPT). In both cases, an address translation using the TLB is needed for each access. Multiple instructions issued in each cycle require multi-ported instruction TLBs and data TLBs to avoid stalls due to resource conflicts. Additionally, TLBs are typically organized as a fully-associative cache to eliminate memory intensive page walks due to conflict misses. As a result, TLBs are often implemented as content-addressable memory (CAM), where all CAM cells are probed and compared to find a match each time a TLB access is initiated.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. To copy otherwise, to republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee.
ISLPED'05, August 8-10, 2005, San Diego, California, USA
Copyright 2005 ACM 1-59593-137-6/05/0008 ...$5.00.

Measured data [5, 6] from commercial processors such as Intel's StrongARM and Hitachi's SH-3 indicates that as much as 17% of on-chip power is consumed in the TLBs, with an escalating trend.

In this paper, we analyze the access pattern of memory operations performed within a cycle and in successive cycles, and exploit the characteristics of the addresses for energy reduction opportunities. In particular, we find that concurrent and consecutive memory operations demonstrate very high locality and are often synonymous, that is, they access the same memory page. As a result, a single TLB lookup often suffices to find the correct translation for multiple accesses in the same cycle, thus eliminating redundant lookups that could draw additional power. Similarly, the most recently accessed data TLB entry can be latched and reused in subsequent TLB lookups. We propose two new hardware-based mechanisms that exploit this behavior to reduce the number of TLB lookups. These mechanisms are complexity-effective and power-efficient, with minimal impact on the hardware budget. In addition to reducing power, these mechanisms can also be used, when chip area is a concern, to reduce the number of TLB ports.

The rest of this paper is organized as follows: Section 2 motivates our work by characterizing data memory references, Section 3 presents intra-cycle and inter-cycle address compaction mechanisms, Section 4 presents our simulation results, Section 5 discusses related work, and Section 6 presents our conclusions.

2. MOTIVATION

[Figure 1: Dynamic memory references as a fraction of all dynamic instructions. Benchmarks shown: MiBench (blowfish, bitcount, cjpeg, dijkstra, djpeg, fft, patricia, rijndael, MiBench Avg) and SPEC 2000 (art, bzip2, gcc, gzip, mcf, parser, perlbmk, SPEC Avg). Labeled values: 39.2%, 41.4%.]

Using MiBench [3] and SPEC CPU2000 benchmarks, Figure 1 shows that more than 40% of dynamic instructions executed in a program are memory references;¹ in other words, there is a memory instruction for almost every other instruction issued.
Therefore, in a superscalar processor, multiple memory operations will be performed concurrently in each cycle, using multiple AGUs, larger memory order buffers for address disambiguation, and multi-ported TLBs and caches. To study the behavior of memory references in a given cycle (intra-cycle) and in consecutive cycles (inter-cycle), we examine the distribution of dynamic memory accesses.²

2.1 Intra-cycle behavior of memory references

[Figure 2: Breakdown of d-TLB accesses. Y-axis: % of data TLB accesses. Series: 1, 2, 3, and 4 d-TLB accesses per clock. Benchmarks shown: blowfish, bitcount, cjpeg, dijkstra, djpeg, fft, patricia, rijndael, art, bzip, gcc, gzip, parser, perlbmk, Avg.]

Figure 2 shows the breakdown of data TLB accesses according to the number of concurrent references per cycle. On average, for 58% of accesses the processor issues more than one data TLB lookup in a cycle, which requires a multi-ported TLB to avoid stalls. An interesting property of

¹ We ran all the benchmark programs to completion except for art which stopped at 500 billion instructions. For SPEC2000, reference inputs were used. We
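
The synonymous-access observation from the introduction can be illustrated with a small software model. The sketch below is not the authors' hardware design; it only counts how many TLB probes a trace would need when same-page requests are merged within a cycle (intra-cycle) and when the page of the last access in the previous cycle is held in a latch and reused (inter-cycle). The 4 KB page size, the function name `count_lookups`, the latch policy, and the sample trace are all assumptions made for illustration.

```python
from typing import List, Optional

PAGE_SHIFT = 12  # assumption: 4 KB pages; the excerpt does not state a page size


def page_of(addr: int) -> int:
    """Return the virtual page number of a virtual address."""
    return addr >> PAGE_SHIFT


def count_lookups(cycles: List[List[int]], intra: bool, inter: bool) -> int:
    """Count TLB probes for a trace given as one list of addresses per cycle.

    intra: merge requests to the same page within a cycle into one probe.
    inter: skip the probe when the page matches the latched page, i.e. the
           page of the last access in the most recent non-empty cycle
           (hypothetical latch policy chosen for this sketch).
    """
    lookups = 0
    latched: Optional[int] = None
    for addrs in cycles:
        pages = [page_of(a) for a in addrs]
        if intra:
            # Keep one request per distinct page, preserving order.
            pages = list(dict.fromkeys(pages))
        for p in pages:
            if inter and p == latched:
                continue  # translation reused from the latch, no TLB probe
            lookups += 1
        if pages:
            latched = pages[-1]  # latch is updated at the end of the cycle
    return lookups


# Hypothetical trace: each inner list holds the addresses issued in one cycle.
trace = [
    [0x1000, 0x1008],          # two accesses to page 1
    [0x1010, 0x2000],          # pages 1 and 2
    [0x2004],                  # page 2
    [],                        # no memory operation this cycle
    [0x2008, 0x3000, 0x3008],  # pages 2, 3, 3
]

if __name__ == "__main__":
    for intra in (False, True):
        for inter in (False, True):
            n = count_lookups(trace, intra=intra, inter=inter)
            print(f"intra={intra} inter={inter}: {n} lookups")
```

On this made-up trace the model needs 8 probes with no compaction, 6 with intra-cycle merging only, 5 with the latch only, and 3 with both. These counts describe the toy trace alone and say nothing about the energy savings reported in the paper.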

