Unformatted text preview:

Systems I Pipelining V Topics Branch prediction State machine design Branch Prediction Until now we have assumed a predict taken strategy for conditional branches Compute new branch target and begin fetching from there If prediction is incorrect flush pipeline and begin refetching However there are other strategies Predict not taken Combination quasi static Predict taken if branch backward like a loop Predict not taken if branch forward 2 Fetch Logic Revisited During Fetch Cycle 1 Select PC 2 Read bytes from instruction memory 3 Examine icode to determine instruction length 4 Increment PC Timing Steps 2 4 require significant amount of time 3 Standard Fetch Timing need regids need valC Select PC Mem Read Increment 1 clock cycle Must Perform Everything in Sequence Can t compute incremented PC until know how much to increment it by 4 A Fast PC Increment Circuit incrPC High order 29 bits 0 Slow MUX carry Low order 3 bits 1 29 bit incrementer 3 bit adder 0 High order 29 bits Fast need regids need ValC Low order 3 bits PC 5 Modified Fetch Timing need regids need valC 3 bit add Select PC Mem Read MUX Incrementer Standard cycle 1 clock cycle 29 Bit Incrementer Acts as soon as PC selected Output not needed until final MUX Works in parallel with memory read 6 More Realistic Fetch Logic Other PC Controls Byte 0 Fetch Fetch Control Control Instr Instr Length Length Bytes 1 5 Current Current Instruction Instruction Instruction Instruction Cache Cache Current Block Next Block Fetch Box Integrated into instruction cache Fetches entire cache block 16 or 32 bytes Selects current instruction from current block Works ahead to fetch next block As reaches end of current block At branch target 7 Modern CPU Design Instruction Instruction Control Control Fetch Control Retirement Unit Address Instruction Instructions Cache Instruction Register File Decode Operations Register Updates Prediction OK Integer Branch General Integer FP Add Operation Results FP Mult Div Load Addr Store Functional Units Addr Data Data Data Cache Execution Execution 8 Instruction Control Instruction Instruction Control Control Retirement Unit Register File Fetch Control Address Instruction Instructions Cache Instruction Decode Operations Grabs Instruction Bytes From Memory Based on Current PC Predicted Targets for Predicted Branches Hardware dynamically guesses whether branches taken not taken and possibly branch target Translates Instructions Into Operations Primitive steps required to perform instruction Typical instruction requires 1 3 operations Converts Register References Into Tags Abstract identifier linking destination of one operation with sources of later operations 9 Execution Unit Register Updates Prediction OK Integer Branch General Integer Operations FP Add Operation Results FP Mult Div Load Addr Store Functional Units Addr Data Data Data Cache Execution Execution Multiple functional units Each can operate in independently Operations performed as soon as operands available Not necessarily in program order Within limits of functional units Control logic Ensures behavior equivalent to sequential program execution 10 CPU Capabilities of Intel iCore7 Multiple Instructions Can Execute in Parallel 1 load 1 store 1 FP multiplication or division 1 FP addition 1 integer operation Some Instructions Take 1 Cycle but Can be Pipelined Instruction Latency Cycles Issue Load Store 3 1 Integer Multiply 3 1 Integer Divide 11 21 5 13 Double Single FP Multiply 4 1 Double Single FP Add 3 1 Double Single FP Divide 10 15 6 11 11 iCore Operation Translates instructions dynamically into Uops 118 bits wide Holds operation two sources and destination Executes Uops with Out of Order engine Uop executed when Operands available Functional unit available Execution controlled by Reservation Stations Keeps track of data dependencies between uops Allocates resources 12 High Perforamnce Branch Prediction Critical to Performance Typically 11 15 cycle penalty for misprediction Branch Target Buffer 512 entries 4 bits of history Adaptive algorithm Can recognize repeated patterns e g alternating taken not taken Handling BTB misses Detect in cycle 6 Predict taken for negative offset not taken for positive Loops vs conditionals 13 Branching Structures Predict not taken works well for top of the loop Loop cmpl eax branching structures But such loops have jumps at the bottom of the loop to return to the top of the loop and incur the jump stall overhead Out edx je Out 1nd loop instr last loop instr jmp Loop fall out instr Predict not taken doesn t work well for bottom of the loop branching structures Loop 1st loop instr 2nd loop instr last loop instr cmpl eax edx jne Loop fall out instr 14 Branch Prediction Algorithms Static Branch Prediction Prediction taken not taken either assumed or encoded into program Dynamic Branch Prediction Uses forms of machine learning in hardware to predict branches Track branch behavior Past history of individual branches Learn branch biases Learn patterns and correlations between different branches Can be very accurate 95 plus as compared to less than 90 for static 15 Simple Dynamic Predictor Predict branch based on past history of branch Branch history table Indexed by PC or fraction of it Each entry stores last direction that indexed branch went 1 bit to encode taken not taken Table is a cache of recent branches Buffer size of 4096 entries are common track 4K different branches PC IM BHT IR Prediction update 16 Problem with a single bit Imagine a heavily taken branch It will be taken taken taken but eventually not taken Its predictor state will be N not taken When we reexecute this branch most likely we are in a new loop iteration The branch is likely taken So mispredict on the first branch in the loop Then predict correctly for the entire loop But the exit condition will also be a mispredict Can we get down to 1 mispredict per loop execution 17 Enhanced Dynamic Predictor Replace simple table of 1 bit histories with table of 2 bit state bits State transition logic can be shared across all entries in table Read entry out Apply combinational logic Write updated state bits back into table PC IM BHT IR Prediction update 18 Multi bit predictors A predict same as last strategy gets two mispredicts on each loop for j 0 j 30 j Predict NTTT TTT Actual TTTT TTN Can do much better by adding inertia to the predictor e g two bit saturating counter Predict TTTT TTT Use two bits to encode Strongly taken T2 Weakly taken T1 Weakly not taken


View Full Document

UT CS 429H - Pipelining 5

Loading Unlocking...
Login

Join to view Pipelining 5 and access 3M+ class-specific study document.

or
We will never post anything without your permission.
Don't have an account?
Sign Up

Join to view Pipelining 5 and access 3M+ class-specific study document.

or

By creating an account you agree to our Privacy Policy and Terms Of Use

Already a member?