UMD CMSC 411 - Lecture 9 Instruction Level Parallelism
CMSC 411 Computer Systems Architecture
Lecture 9: Instruction Level Parallelism (cont.)
Alan Sussman ([email protected])

• HW #2, on pipelining, due today
• Finish reading Chapter 2 of H&P
• First exam scheduled for next Thursday, March 5
  – on Units 1-3
  – Wanli will be giving it

CMSC 411 - 9 (from Patterson)

Outline
• ILP
• Compiler techniques to increase ILP
• Loop unrolling
• Static branch prediction
• Dynamic branch prediction
• Overcoming data hazards with dynamic scheduling
• Tomasulo algorithm
• Conclusion

Correlated Branch Prediction
• Idea: record the m most recently executed branches as taken or not taken, and use that pattern to select the proper n-bit branch history table
• In general, an (m,n) predictor records the last m branches to select between 2^m history tables, each with n-bit counters
  – Thus, the old 2-bit BHT is a (0,2) predictor
  – Global branch history: an m-bit shift register keeping the taken/not-taken status of the last m branches
  – Each entry in the table has 2^m n-bit predictors
• Example of correlated branches:
    if (aa == 2) aa = 0;
    if (bb == 2) bb = 0;
    if (aa != bb) { ...

Correlating Branches
• (2,2) predictor with 4 predictions per entry
  – The behavior of the two most recent branches selects between four predictions of the next branch, updating just that prediction
• Example: 4 branch-address bits + 2 global history bits give a 6-bit index into 2^6 = 64 predictors, each having two bits, for 128 total bits

Calculations
• 4096-entry 2-bit BHT
  – 4K x 2 = 8K bits
  – 4K = 2^12, so 12 address bits
• How to use the same number of bits with a (2,2) predictor?
  – 8K bits with 2-bit BHTs means 4K BHTs
  – The (2,2) implies each entry has four BHTs
  – So 1K entries, i.e. a (2,2) predictor with 1024 entries
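The storage arithmetic in the Calculations slide can be sanity-checked with a short helper; the function name `predictor_bits` is mine, not something from the slides:

```python
def predictor_bits(m: int, n: int, entries: int) -> int:
    """Total storage for an (m, n) correlating predictor: each of
    `entries` rows holds 2**m separate n-bit counters."""
    return entries * (2 ** m) * n

# The classic 4096-entry 2-bit BHT is a (0,2) predictor: 4K x 2 = 8K bits.
bht_bits = predictor_bits(0, 2, 4096)

# A (2,2) predictor fits the same 8K-bit budget with only 1K entries,
# since each entry carries four 2-bit counters.
corr_bits = predictor_bits(2, 2, 1024)
```

Both calls come out to 8192 bits, matching the slide's conclusion that a 1024-entry (2,2) predictor uses the same storage as the 4096-entry BHT.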
Tournament Predictors
• Multilevel branch predictor
• Use an n-bit saturating counter to choose between predictors
• Usually a choice between global and local predictors
• (figure: selector state machine; an arc such as "predictor 1 correct, predictor 2 incorrect" moves the counter toward predictor 1)

Tournament Predictors (cont.)
A tournament predictor using, say, 4K 2-bit counters indexed by local branch address chooses between:
• Global predictor
  – 4K entries indexed by the history of the last 12 branches (2^12 = 4K)
  – Each entry is a standard 2-bit predictor
• Local predictor
  – Local history table: 1024 10-bit entries recording the last 10 branches, indexed by branch address
  – The pattern of the last 10 occurrences of that particular branch is used to index a table of 1K entries with 3-bit saturating counters

Comparing Predictors (H&P Fig. 2.8)
• The advantage of the tournament predictor is its ability to select the right predictor for a particular branch
  – Particularly crucial for the integer benchmarks
  – A typical tournament predictor will select the global predictor almost 40% of the time for the SPEC integer benchmarks, and less than 15% of the time for the SPEC FP benchmarks
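To make the selector mechanism concrete, here is a minimal sketch of a 2-bit saturating counter and a tournament chooser; the class names and the initial counter state are my assumptions, not details from the slides:

```python
class SatCounter:
    """n-bit saturating counter; predicts taken when in the upper half."""
    def __init__(self, bits=2):
        self.max = (1 << bits) - 1
        self.value = self.max // 2 + 1  # start in the weakly-taken state

    def predict(self):
        return self.value > self.max // 2

    def update(self, taken):
        if taken:
            self.value = min(self.value + 1, self.max)
        else:
            self.value = max(self.value - 1, 0)


class Tournament:
    """Chooser counter moves toward whichever component predictor was
    right when the two disagree (the 'P1 correct / P2 incorrect' arcs)."""
    def __init__(self, p1, p2):
        self.p1, self.p2, self.choice = p1, p2, SatCounter(2)

    def predict(self):
        return self.p1.predict() if self.choice.predict() else self.p2.predict()

    def update(self, taken):
        c1 = self.p1.predict() == taken
        c2 = self.p2.predict() == taken
        if c1 != c2:
            self.choice.update(c1)  # train selector only on disagreement
        self.p1.update(taken)
        self.p2.update(taken)
```

On a loop branch taken 9 times and then not taken, a 2-bit SatCounter mispredicts only the loop exit, which is the "2 bits for loop accuracy" point made in the summary slide.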
Pentium 4 Misprediction Rate (per 1000 instructions, not per branch)
• H&P Figure 2.28: branch mispredictions per 1000 instructions for SPECint2000 (164.gzip, 175.vpr, 176.gcc, 181.mcf, 186.crafty) and SPECfp2000 (168.wupwise, 171.swim, 172.mgrid, 173.applu, 177.mesa)
• ≈6% misprediction rate per branch for SPECint (19% of INT instructions are branches)
• ≈2% misprediction rate per branch for SPECfp (5% of FP instructions are branches)

Branch Target Buffers (BTB)
• Branch target calculation is costly and stalls instruction fetch
• The BTB stores PCs the same way caches do
• The PC of a branch is sent to the BTB
• When a match is found, the corresponding predicted PC is returned
• If the branch was predicted taken, instruction fetch continues at the returned predicted PC

Branch Target Buffers (figure: H&P Figure 2.22)

Dynamic Branch Prediction Summary
• Prediction is becoming an important part of execution
• Branch History Table: 2 bits for loop accuracy
• Correlation: recently executed branches are correlated with the next branch
  – Either different branches (GA)
  – Or different executions of the same branch (PA)
• Tournament predictors take the insight to the next level by using multiple predictors, usually one based on global information and one based on local information, and combining them with a selector
  – In 2006, tournament predictors using ≈30K bits are in processors like the Power5 and Pentium 4
• Branch Target Buffer: includes branch address & prediction
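The BTB lookup on the fetch path can be sketched as a small map keyed by branch PC; `fetch_next` and the sample addresses are hypothetical, not from the slides:

```python
# Minimal BTB sketch: maps branch PC -> (predicted target, predicted taken).
btb = {}

def fetch_next(pc, default_next):
    """Return the PC to fetch after `pc`. On a BTB hit with a taken
    prediction, redirect fetch to the stored target immediately;
    otherwise fall through to the sequential PC."""
    entry = btb.get(pc)
    if entry is not None:
        target, taken = entry
        if taken:
            return target
    return default_next

# After a taken branch at 0x400 to 0x480 resolves, record it so the
# next fetch of 0x400 avoids the target-calculation stall.
btb[0x400] = (0x480, True)
```

A PC not present in the BTB simply falls through to the sequential address, which is why only branches predicted taken need entries.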
Outline
• ILP
• Compiler techniques to increase ILP
• Loop unrolling
• Static branch prediction
• Dynamic branch prediction
• Overcoming data hazards with dynamic scheduling
• (Start) Tomasulo algorithm
• Conclusion

Advantages of Dynamic Scheduling
• Dynamic scheduling: the hardware rearranges instruction execution to reduce stalls while maintaining data flow and exception behavior
• Handles cases where dependences are unknown at compile time
  – Allows the processor to tolerate unpredictable delays, such as cache misses, by executing other code while waiting for the miss to resolve
• Allows code compiled for one pipeline to run efficiently on a different pipeline
• Simplifies the compiler
• Hardware speculation, a technique with significant performance advantages, builds on dynamic scheduling (later)

HW Schemes: Instruction Parallelism
• Key idea: allow instructions behind a stall to proceed
    DIVD F0,F2,F4
    ADDD F10,F0,F8
    SUBD F12,F8,F14
• Enables out-of-order execution and allows out-of-order completion (e.g., SUBD)
  – In a dynamically scheduled pipeline, all instructions still pass through the issue stage in order (in-order issue)
• We will distinguish when an instruction begins execution and when it completes execution; between those two times, the instruction is in execution
• Note: dynamic execution introduces WAR and WAW hazards ...
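The out-of-order opportunity in the three-instruction example comes from SUBD having no RAW dependence on the long-latency DIVD; a toy check, with an instruction encoding of my own, not H&P's:

```python
# Each instruction: (name, destination register, source registers).
prog = [
    ("DIVD", "F0",  ("F2", "F4")),
    ("ADDD", "F10", ("F0", "F8")),
    ("SUBD", "F12", ("F8", "F14")),
]

def raw_deps(prog):
    """For each instruction, list the earlier instructions it has a
    read-after-write (RAW) dependence on."""
    deps = {}
    for i, (name, dest, srcs) in enumerate(prog):
        deps[name] = [p_name for p_name, p_dest, _ in prog[:i]
                      if p_dest in srcs]
    return deps

# ADDD reads F0, written by DIVD, so it must wait for the divide;
# SUBD reads only F8/F14 and can proceed past the stalled ADDD.
```

This is exactly what a dynamically scheduled pipeline detects in hardware: ADDD stalls on F0, while SUBD issues in order but executes and completes out of order.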

