UMD CMSC 411 - Lecture 9 Instruction Level Parallelism
CMSC 411 Computer Systems Architecture
Lecture 9: Instruction Level Parallelism (cont.)
Alan Sussman ([email protected])

• HW #2, on pipelining, due today
• Finish reading Chapter 2 of H&P
• First exam scheduled for next Thursday, March 5
  – on Units 1-3
  – Wanli will be giving it

CMSC 411 - 9 (from Patterson)

Outline
• ILP
• Compiler techniques to increase ILP
• Loop unrolling
• Static branch prediction
• Dynamic branch prediction
• Overcoming data hazards with dynamic scheduling
• Tomasulo algorithm
• Conclusion

Correlated Branch Prediction
• Idea: record the m most recently executed branches as taken or not taken, and use that pattern to select the proper n-bit branch history table
• In general, an (m,n) predictor records the last m branches to select between 2^m history tables, each with n-bit counters
  – Thus, the old 2-bit BHT is a (0,2) predictor
  – Global branch history: an m-bit shift register keeping the taken/not-taken status of the last m branches
  – Each entry in the table has 2^m n-bit predictors
• Example of correlated branches:
    if (aa == 2) aa = 0;
    if (bb == 2) bb = 0;
    if (aa != bb) { ...

Correlating Branches
• (2,2) predictor with 4 predictions per entry
  – The behavior of the two most recent branches selects between four predictions of the next branch, updating just that prediction
• Example: 4 branch-address bits + 2 global history bits give a 6-bit index into 2^6 = 64 predictors, each having two bits, for 128 total bits

Calculations
• 4096-entry 2-bit BHT
  – 4K x 2 = 8K bits
  – 4K = 2^12, so 12 address bits
• How to use the same number of bits with a (2,2) predictor?
  – 8K bits with 2-bit BHTs means 4K BHTs
  – The (2,2) implies each entry has four BHTs
  – So 1K entries, i.e. a (2,2) predictor with 1024 entries
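The storage arithmetic in the Calculations slide can be sanity-checked with a short helper; the function name `predictor_bits` is mine, not something from the slides:

```python
def predictor_bits(m: int, n: int, entries: int) -> int:
    """Total storage for an (m, n) correlating predictor: each of
    `entries` rows holds 2**m separate n-bit counters."""
    return entries * (2 ** m) * n

# The classic 4096-entry 2-bit BHT is a (0,2) predictor: 4K x 2 = 8K bits.
bht_bits = predictor_bits(0, 2, 4096)

# A (2,2) predictor fits the same 8K-bit budget with only 1K entries,
# since each entry carries four 2-bit counters.
corr_bits = predictor_bits(2, 2, 1024)
```

Both calls come out to 8192 bits, matching the slide's conclusion that a 1024-entry (2,2) predictor uses the same storage as the 4096-entry BHT.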
Tournament Predictors
• Multilevel branch predictor
• Use an n-bit saturating counter to choose between predictors
• Usually a choice between global and local predictors
• (figure: selector state machine; an arc such as "predictor 1 correct, predictor 2 incorrect" moves the counter toward predictor 1)

Tournament Predictors (cont.)
A tournament predictor using, say, 4K 2-bit counters indexed by local branch address chooses between:
• Global predictor
  – 4K entries indexed by the history of the last 12 branches (2^12 = 4K)
  – Each entry is a standard 2-bit predictor
• Local predictor
  – Local history table: 1024 10-bit entries recording the last 10 branches, indexed by branch address
  – The pattern of the last 10 occurrences of that particular branch is used to index a table of 1K entries with 3-bit saturating counters

Comparing Predictors (H&P Fig. 2.8)
• The advantage of the tournament predictor is its ability to select the right predictor for a particular branch
  – Particularly crucial for the integer benchmarks
  – A typical tournament predictor will select the global predictor almost 40% of the time for the SPEC integer benchmarks, and less than 15% of the time for the SPEC FP benchmarks
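To make the selector mechanism concrete, here is a minimal sketch of a 2-bit saturating counter and a tournament chooser; the class names and the initial counter state are my assumptions, not details from the slides:

```python
class SatCounter:
    """n-bit saturating counter; predicts taken when in the upper half."""
    def __init__(self, bits=2):
        self.max = (1 << bits) - 1
        self.value = self.max // 2 + 1  # start in the weakly-taken state

    def predict(self):
        return self.value > self.max // 2

    def update(self, taken):
        if taken:
            self.value = min(self.value + 1, self.max)
        else:
            self.value = max(self.value - 1, 0)


class Tournament:
    """Chooser counter moves toward whichever component predictor was
    right when the two disagree (the 'P1 correct / P2 incorrect' arcs)."""
    def __init__(self, p1, p2):
        self.p1, self.p2, self.choice = p1, p2, SatCounter(2)

    def predict(self):
        return self.p1.predict() if self.choice.predict() else self.p2.predict()

    def update(self, taken):
        c1 = self.p1.predict() == taken
        c2 = self.p2.predict() == taken
        if c1 != c2:
            self.choice.update(c1)  # train selector only on disagreement
        self.p1.update(taken)
        self.p2.update(taken)
```

On a loop branch taken 9 times and then not taken, a 2-bit SatCounter mispredicts only the loop exit, which is the "2 bits for loop accuracy" point made in the summary slide.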
Pentium 4 Misprediction Rate (per 1000 instructions, not per branch)
• H&P Figure 2.28: branch mispredictions per 1000 instructions for SPECint2000 (164.gzip, 175.vpr, 176.gcc, 181.mcf, 186.crafty) and SPECfp2000 (168.wupwise, 171.swim, 172.mgrid, 173.applu, 177.mesa)
• ≈6% misprediction rate per branch for SPECint (19% of INT instructions are branches)
• ≈2% misprediction rate per branch for SPECfp (5% of FP instructions are branches)

Branch Target Buffers (BTB)
• Branch target calculation is costly and stalls instruction fetch
• The BTB stores PCs the same way caches do
• The PC of a branch is sent to the BTB
• When a match is found, the corresponding predicted PC is returned
• If the branch was predicted taken, instruction fetch continues at the returned predicted PC

Branch Target Buffers (figure: H&P Figure 2.22)

Dynamic Branch Prediction Summary
• Prediction is becoming an important part of execution
• Branch History Table: 2 bits for loop accuracy
• Correlation: recently executed branches are correlated with the next branch
  – Either different branches (GA)
  – Or different executions of the same branch (PA)
• Tournament predictors take the insight to the next level by using multiple predictors, usually one based on global information and one based on local information, and combining them with a selector
  – In 2006, tournament predictors using ≈30K bits are in processors like the Power5 and Pentium 4
• Branch Target Buffer: includes branch address & prediction
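The BTB lookup on the fetch path can be sketched as a small map keyed by branch PC; `fetch_next` and the sample addresses are hypothetical, not from the slides:

```python
# Minimal BTB sketch: maps branch PC -> (predicted target, predicted taken).
btb = {}

def fetch_next(pc, default_next):
    """Return the PC to fetch after `pc`. On a BTB hit with a taken
    prediction, redirect fetch to the stored target immediately;
    otherwise fall through to the sequential PC."""
    entry = btb.get(pc)
    if entry is not None:
        target, taken = entry
        if taken:
            return target
    return default_next

# After a taken branch at 0x400 to 0x480 resolves, record it so the
# next fetch of 0x400 avoids the target-calculation stall.
btb[0x400] = (0x480, True)
```

A PC not present in the BTB simply falls through to the sequential address, which is why only branches predicted taken need entries.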
Outline
• ILP
• Compiler techniques to increase ILP
• Loop unrolling
• Static branch prediction
• Dynamic branch prediction
• Overcoming data hazards with dynamic scheduling
• (Start) Tomasulo algorithm
• Conclusion

Advantages of Dynamic Scheduling
• Dynamic scheduling: the hardware rearranges instruction execution to reduce stalls while maintaining data flow and exception behavior
• Handles cases where dependences are unknown at compile time
  – Allows the processor to tolerate unpredictable delays, such as cache misses, by executing other code while waiting for the miss to resolve
• Allows code compiled for one pipeline to run efficiently on a different pipeline
• Simplifies the compiler
• Hardware speculation, a technique with significant performance advantages, builds on dynamic scheduling (later)

HW Schemes: Instruction Parallelism
• Key idea: allow instructions behind a stall to proceed
    DIVD F0,F2,F4
    ADDD F10,F0,F8
    SUBD F12,F8,F14
• Enables out-of-order execution and allows out-of-order completion (e.g., SUBD)
  – In a dynamically scheduled pipeline, all instructions still pass through the issue stage in order (in-order issue)
• We will distinguish when an instruction begins execution and when it completes execution; between those two times, the instruction is in execution
• Note: dynamic execution introduces WAR and WAW hazards ...
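The out-of-order opportunity in the three-instruction example comes from SUBD having no RAW dependence on the long-latency DIVD; a toy check, with an instruction encoding of my own, not H&P's:

```python
# Each instruction: (name, destination register, source registers).
prog = [
    ("DIVD", "F0",  ("F2", "F4")),
    ("ADDD", "F10", ("F0", "F8")),
    ("SUBD", "F12", ("F8", "F14")),
]

def raw_deps(prog):
    """For each instruction, list the earlier instructions it has a
    read-after-write (RAW) dependence on."""
    deps = {}
    for i, (name, dest, srcs) in enumerate(prog):
        deps[name] = [p_name for p_name, p_dest, _ in prog[:i]
                      if p_dest in srcs]
    return deps

# ADDD reads F0, written by DIVD, so it must wait for the divide;
# SUBD reads only F8/F14 and can proceed past the stalled ADDD.
```

This is exactly what a dynamically scheduled pipeline detects in hardware: ADDD stalls on F0, while SUBD issues in order but executes and completes out of order.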

