CMSC 411 Computer Systems Architecture
Lecture 9
Instruction Level Parallelism (cont.)
Alan Sussman
als@cs.umd.edu
CMSC 411 - 9 (from Patterson)

Administrivia
- HW #2 on pipelining due today
- Finish reading Chapter 2 of H&P
- First exam scheduled for next Thursday, March 5
  - on Units 1-3
  - Wanli will be giving it

Outline
- ILP
- Compiler techniques to increase ILP
- Loop Unrolling
- Static Branch Prediction
- Dynamic Branch Prediction
- Overcoming Data Hazards with Dynamic Scheduling
- Tomasulo Algorithm
- Conclusion

Correlated Branch Prediction
- Idea: record the m most recently executed branches as taken or not taken, and use that pattern to select the proper n-bit branch history table
- In general, an (m,n) predictor means: record the last m branches to select between 2^m history tables, each with n-bit counters
  - Thus, the old 2-bit BHT is a (0,2) predictor
- Global Branch History: m-bit shift register keeping the T/NT status of the last m branches
- Each entry in the table has 2^m n-bit predictors
- Example:
    if (aa == 2) aa = 0;
    if (bb == 2) bb = 0;
    if (aa != bb) { ... }

Correlating Branches
- (2,2) predictor
  - Behavior of recent branches selects between four predictions of the next branch, updating just that prediction
- [Figure: branch address (4 bits) and 2-bit global branch history index a table of 2-bit per-branch predictors, producing the prediction]

Calculations
- 4096-entry 2-bit BHT
  - 4k x 2 = 8k bits
  - 4k = 2^12, so 12 address bits
- How to use the same bits with a (2,2) predictor?
  - 8k bits with 2-bit BHTs means 4k BHTs
  - the (2,2) implies an entry has four BHTs, so 1k entries
  - i.e., a (2,2) predictor with 1024 entries
- Or: 4 address bits + 2 history bits give us a 6-bit index into 2^6 = 64 predictors, each having two bits: 128 total bits

Tournament Predictors
- Multilevel branch predictor
- Use an n-bit saturating counter to choose between predictors
- Usually a choice between global and local predictors
- [Figure: selector counter state diagram, with transitions labeled by which of predictor 1 and predictor 2 was correct or incorrect]

Tournament Predictors
- Tournament predictor using, say, 4K 2-bit counters indexed by local branch address. Chooses between:
- Global predictor
  - 4K entries indexed by history of last 12 branches (2^12 = 4K)
  - Each entry is a standard 2-bit predictor
- Local predictor
  - Local history table: 1024 10-bit entries recording the last 10 branches, indexed by branch address
  - The pattern of the last 10 occurrences of that particular branch is used to index a table of 1K entries with 3-bit saturating counters

Comparing Predictors (H&P Fig. 2.8)
- Advantage of the tournament predictor is the ability to select the right predictor for a particular branch
  - Particularly crucial for integer benchmarks
  - A typical tournament predictor will select the global predictor almost 40% of the time for the SPEC integer benchmarks and less than 15% of the time for the SPEC FP benchmarks

Pentium 4 Misprediction Rate (per 1000 instructions, not per branch)
- 6% misprediction rate per branch SPECint (19% of INT instructions are branches)
- 2% misprediction rate per branch SPECfp (5% of FP instructions are branches)
- [Figure: H&P Figure 2.28, branch mispredictions per 1000 instructions for SPECint2000 (164.gzip, 175.vpr, 176.gcc, 181.mcf, 186.crafty) and SPECfp2000 (168.wupwise, 171.swim, 172.mgrid, 173.applu, 177.mesa)]

Branch Target Buffers (BTB)
- Branch target calculation is costly and stalls the instruction fetch
- The BTB stores PCs the same way as caches
- The PC of a branch is sent to the BTB
- When a match is found, the corresponding Predicted PC is returned
- If the branch was predicted taken, instruction fetch continues at the returned predicted PC

Branch Target Buffers
- [Figure: H&P Figure 2.22]

Dynamic Branch Prediction Summary
- Prediction is becoming an important part of execution
- Branch History Table: 2 bits for loop accuracy
- Correlation: recently executed branches are correlated with the next branch
  - Either different branches (GA)
  - Or different executions of the same branches (PA)
- Tournament predictors take this insight to the next level, by using multiple predictors
  - usually one based on global information and one based on local information, and combining them with a selector
  - In 2006, tournament predictors using 30K bits are in processors like the Power5 and Pentium 4
- Branch Target Buffer: include branch address & prediction

Outline
- ILP
- Compiler techniques to increase ILP
- Loop Unrolling
- Static Branch Prediction
- Dynamic Branch Prediction
- Overcoming Data Hazards with Dynamic Scheduling
- (Start) Tomasulo Algorithm
- Conclusion

Advantages of Dynamic Scheduling
- Dynamic scheduling: hardware rearranges the instruction execution to reduce stalls while maintaining data flow and exception behavior
- Handles cases when dependences are unknown at compile time
  - it allows the processor to tolerate unpredictable delays, such as cache misses, by executing other code while waiting for the miss to resolve
- Allows code that was compiled for one pipeline to run efficiently on a different pipeline
- Simplifies the compiler
- Hardware speculation, a technique with significant performance advantages, builds on dynamic scheduling (later)

HW Schemes: Instruction Parallelism
- Key idea: allow instructions behind a stall to proceed
    DIVD F0,F2,F4
    ADDD F10,F0,F8
    SUBD F12,F8,F14
- Enables out-of-order execution and allows out-of-order completion (e.g., SUBD)
  - In a dynamically scheduled pipeline, all instructions still pass through the issue stage in order (in-order issue)
- Will distinguish when an instruction begins execution and when it completes execution; between the 2 times, the instruction is in execution
- Note: dynamic execution creates WAR and WAW hazards and makes exceptions harder

Dynamic Scheduling Step 1
- Simple pipeline had 1 stage to check both structural and data hazards: Instruction Decode (ID), also called Instruction Issue
- Split the ID pipe stage of the simple 5-stage pipeline into 2 stages

A Dynamic Algorithm: Tomasulo's
- For IBM 360/91 (before caches!)
  - Long memory latency
- Goal: high performance without special compilers
- Small number of floating point registers (4 in 360) prevented interesting compiler scheduling of operations
  - This led Tomasulo to try to figure out how to get more effective
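
The (m,n) correlating predictor from the Correlated Branch Prediction and Calculations slides can be sketched in a few lines of Python. This is only an illustrative model, not code from the lecture or from H&P: the class name `CorrelatingPredictor`, the helper `count_mispredictions`, the modulo indexing by branch address, and the "weakly not taken" initial counter value are all assumptions made for the example.

```python
class CorrelatingPredictor:
    """Sketch of an (m, n) predictor: m bits of global history select one of
    2**m n-bit saturating counters inside the entry picked by the branch PC."""

    def __init__(self, m, n, entries):
        self.m, self.n, self.entries = m, n, entries
        self.history = 0                      # m-bit global shift register, 1 = taken
        self.max_count = (1 << n) - 1
        init = (1 << (n - 1)) - 1             # assumption: start weakly not taken
        self.table = [[init] * (1 << m) for _ in range(entries)]

    def storage_bits(self):
        # entries x 2^m counters per entry x n bits per counter
        return self.entries * (1 << self.m) * self.n

    def predict(self, pc):
        counter = self.table[pc % self.entries][self.history]
        return counter >= (1 << (self.n - 1))  # upper half of range means "taken"

    def update(self, pc, taken):
        row = self.table[pc % self.entries]
        c = row[self.history]
        row[self.history] = min(c + 1, self.max_count) if taken else max(c - 1, 0)
        if self.m:
            mask = (1 << self.m) - 1
            self.history = ((self.history << 1) | int(taken)) & mask


def count_mispredictions(predictor, pc, outcomes):
    """Run one branch (hypothetical address pc) through the predictor."""
    wrong = 0
    for taken in outcomes:
        if predictor.predict(pc) != taken:
            wrong += 1
        predictor.update(pc, taken)
    return wrong


# Storage figures from the Calculations slide
print(CorrelatingPredictor(0, 2, 4096).storage_bits())  # 8192 (4k x 2 bits)
print(CorrelatingPredictor(2, 2, 1024).storage_bits())  # 8192 (same budget)
print(CorrelatingPredictor(2, 2, 16).storage_bits())    # 128  (64 predictors x 2 bits)

# A branch that strictly alternates taken / not taken
alternating = [True, False] * 50
print(count_mispredictions(CorrelatingPredictor(0, 2, 16), 0x40, alternating))  # 100
print(count_mispredictions(CorrelatingPredictor(2, 2, 16), 0x40, alternating))  # 2
```

With this starting state the plain 2-bit BHT, i.e. the (0,2) case, mispredicts every outcome of the alternating branch, while the (2,2) predictor is wrong only twice during warm-up, because the two history bits identify which half of the pattern comes next.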
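
The selector idea from the Tournament Predictors slides can be sketched the same way. This is a deliberately simplified model under stated assumptions: the local component here is a plain table of 2-bit counters indexed by branch address, not the two-level local history scheme described on the slide, and the names `TournamentPredictor`, `bump`, and `run_branch`, the table sizes, and the initial counter values are all hypothetical.

```python
def bump(counter, up, max_value=3):
    """Saturating increment (up=True) or decrement of a small counter."""
    return min(counter + 1, max_value) if up else max(counter - 1, 0)


class TournamentPredictor:
    """Sketch: a 2-bit selector per branch chooses between a global predictor
    (indexed by global history) and a simplified local predictor (indexed by PC)."""

    def __init__(self, history_bits=12, local_entries=1024, selector_entries=4096):
        self.history_bits = history_bits
        self.history = 0                                  # global T/NT shift register
        self.global_table = [1] * (1 << history_bits)     # 2-bit counters
        self.local_table = [1] * local_entries            # 2-bit counters (simplified)
        self.selector = [1] * selector_entries            # >= 2 means "use global"

    def _components(self, pc):
        g = self.global_table[self.history] >= 2
        l = self.local_table[pc % len(self.local_table)] >= 2
        return g, l

    def predict(self, pc):
        g, l = self._components(pc)
        use_global = self.selector[pc % len(self.selector)] >= 2
        return g if use_global else l

    def update(self, pc, taken):
        g, l = self._components(pc)
        s = pc % len(self.selector)
        if g != l:
            # Move the selector toward whichever component was correct
            self.selector[s] = bump(self.selector[s], up=(g == taken))
        self.global_table[self.history] = bump(self.global_table[self.history], taken)
        li = pc % len(self.local_table)
        self.local_table[li] = bump(self.local_table[li], taken)
        mask = (1 << self.history_bits) - 1
        self.history = ((self.history << 1) | int(taken)) & mask


def run_branch(predictor, pc, outcomes):
    """Count mispredictions for one branch at a hypothetical address pc."""
    wrong = 0
    for taken in outcomes:
        if predictor.predict(pc) != taken:
            wrong += 1
        predictor.update(pc, taken)
    return wrong


tp = TournamentPredictor(history_bits=2, local_entries=16, selector_entries=16)
wrong = run_branch(tp, 0x40, [True, False] * 50)
print(wrong, tp.selector[0x40 % 16])
```

The selector only moves when the two components disagree, which matches the slide's description of a saturating counter that chooses between predictors. For an alternating branch the simplified local counters keep guessing wrong, so the selector drifts toward the global component after a few outcomes.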