DOC PREVIEW
UCF CDA 5106 - Modern Processor Design: Fundamentals of Superscalar Processors

This preview shows page 1-2-24-25 out of 25 pages.

Save
View full document
View full document
Premium Document
Do you want full access? Go Premium and unlock all 25 pages.
Access to all documents
Download any document
Ad free experience
View full document
Premium Document
Do you want full access? Go Premium and unlock all 25 pages.
Access to all documents
Download any document
Ad free experience
View full document
Premium Document
Do you want full access? Go Premium and unlock all 25 pages.
Access to all documents
Download any document
Ad free experience
View full document
Premium Document
Do you want full access? Go Premium and unlock all 25 pages.
Access to all documents
Download any document
Ad free experience
Premium Document
Do you want full access? Go Premium and unlock all 25 pages.
Access to all documents
Download any document
Ad free experience

Unformatted text preview:

Modern Processor Design: Fundamentals of Superscalar Processors35 Chapter 5Problem 1 through Problem 6The code below steps through the elements of two arrays (A[] and B[]) concurrently, and for each element, it puts the larger of the two values into the corresponding element of a third array (C[]). The three arrays are of length N.The instruction set used in this problem is as follows:add rd, rs, rt rd I rs + rtaddi rd, rs, imm rd I rs + immlw rd, offset(base) rd I MEM[offset+base] (offset = imm, base = reg)sw rs, offset(base) MEM[offset+base] I rs (offset = imm, base = reg)bge rs, rt, address if (rs >= rt) PC I addressblt rs, rt, address if (rs < rt) PC I addressb address PC I address NOTE: r0 is hardwired to 0Benchmark Code:Static Inst# Label Assembly_Instructionmain:1 addi r2, r0, A2 addi r3, r0, B3 addi r4, r0, C4 addi r5, r0, N5 add r10, r0, r06 bge r10, r5, endloop:7 lw r20, 0(r2)8 lw r21, 0(r3)9 bge r20, r21, T110 sw r21, 0(r4)11 b T2T1:12 sw r20, 0(r4)T2:13 addi r10, r10, 114 addi r2, r2, 415 addi r3, r3, 416 addi r4, r4, 417 blt r10, r5, loopend:1.Identify the basic blocks of this benchmark code by listing the static instructions belonging to each basic block in the following table. Number the basic blocks based on the lexical ordering of the code. Note: There may be more boxes than there are basic blocks.Modern Processor Design: Fundamentals of Superscalar Processors 36 2. Draw the control flow graph for this benchmark.3. Now generate the instruction execution trace (i.e., the sequence of basic blocks executed). Use the following arrays as input to the program, and trace the code execution by recording the num-ber of each basic block that is executed.N = 5;A[] = {8, 3, 2, 5, 9};B[] = {4, 9, 8, 5, 1};TRACE: 1, 2, 4, 5, 2, 3, 5, 2, 3, 5, 2, 4, 5, 2 ,4 5BB#123456789Instr. #s1-6 7-9 10-11 12 13-17BB 1BB 2BB 3 BB 4BB 5Modern Processor Design: Fundamentals of Superscalar Processors37 4. Fill in the following two tables based on the data you generated above. Note for Table 2: Count unconditional branches as taken branches.5. Given the branch profile information you collected above, rearrange the basic blocks and reverse the sense of the branches in the program snippet to minimize the number of taken branches and to pack the code so that the frequently-executed paths are placed together. Show the new program.Solution not provided.6. Given the new program you wrote in Problem 5, recompute the branch statistics in the last two rows of Table 5.Solution not provided.Table 4: Instruction MixStatic DynamicInstr. Class Number % Number %ALU9 532547Load/Store4 241528Branch4 241325Table 5: Basic Block / Branch DataStatic DynamicAverage BB size (# inst.)3.4 3.3Number of Taken Branches9Number of Not-Taken Branches4Modern Processor Design: Fundamentals of Superscalar Processors 38 Problem 7 through Problem 13Consider the following code segment within a loop body for problems 5:if (x is even) then I(branch b1)increment a I(b1 taken)if (x is a multiple of 10) then I(branch b2)increment b I(b2 taken)Assume that the following list of 9 values of x is to be processed by 9 iterations of this loop.8, 9, 10, 11, 12, 20, 29, 30, 31Note: assume that predictor entries are updated by each dynamic branch before the next dynamic branch accesses the predictor (i.e., there is no update delay).7. Assume that an one-bit (history bit) state machine (see above) is used as the prediction algorithm for predicting the execution of the two branches in this loop. Indicate the predicted and actual branch directions of the b1 and b2 branch instructions for each iteration of this loop. Assume ini-tial state of 0, i.e., NT, for the predictor.8 9 10 11 12 20 29 30 31b1 predicted: __N____T____N____T____N____T____T____N____T__b1 actual: __T____N____T____N____T____T____N____T____N__b2 predicted: __N____N____N____T____N____N____T____N____T__b2 actual: __N____N____T____N____N____T____N____T____N__8. What are the prediction accuracies for b1 and b2?b1: 1/9 = 11%b2: 3/9 = 33%9. What is the overall prediction accuracy?Overall: 4/18 = 22%0  NT 1  TBranch history table000111b1b2Modern Processor Design: Fundamentals of Superscalar Processors39 10. Assume a two-level branch prediction scheme is used. In addition to the one-bit predictor, a one-bit global register (g) is used. Register g stores the direction of the last branch executed (which may not be the same branch as the branch currently being predicted) and is used to index into two separate one-bit branch history tables (BHTs) as shown below.Depending on the value of g, one of the two BHTs is selected and used to do the normal one-bit prediction. Again, fill in the predicted and actual branch directions of b1 and b2 for nine itera-tions of the loop. Assume the initial value of g = 0, i.e., NT. For each prediction, depending on the current value of g, only one of the two BHTs is accessed and updated. Hence, some of the entries below should be empty.Note: assume that predictor entries are updated by each dynamic branch before the next dynamic branch accesses the predictor (i.e. there is no update delay).8 9 10 11 12 20 29 30 31For g=0b1 predicted: __N____T____N______ __T____T______ __T______b1 actual: __T____N____T____N____T____T____N____T____N__b2 predicted: ____ __N______ __N______ ____ __N______ __N__b2 actual: __N____N____T____N____N____T____N____T____N__For g=1b1 predicted: ____ ____ ____ __N______ ____ __N______ __N__b1 actual: __T____N____T____N____T____T____N____T____N__b2 predicted: __N______ __N______ __T____N______ __T______b2 actual: __N____N____T____N____N____T____N____T____N__11. What are the prediction accuracies for b1 and b2?b1: 6/9 =67%b2: 6/9 = 67%12. What is the overall prediction accuracy?Overall: 67%BHTb1b2BHTb1b2gg  0g  1Modern Processor Design: Fundamentals of Superscalar Processors 40 13. What is the prediction accuracy of b2 when g = 0? Explain why.100%. Whenever b1 is not taken (i.e. g=0), the number being checked is odd (not even). It fol-lows that the number is also not evenly divisible by ten. Hence, in these cases, b2 is always not taken and the predictor is able to predict b2 with high accuracy in this global context.14. Below is the control flow graph of a simple program. The CFG is annotated with three different execution trace paths. For each execution trace circle which branch predictor (bimodal, local, or Gselect) will best predict the branching behavior of


View Full Document

UCF CDA 5106 - Modern Processor Design: Fundamentals of Superscalar Processors

Download Modern Processor Design: Fundamentals of Superscalar Processors
Our administrator received your request to download this document. We will send you the file to your email shortly.
Loading Unlocking...
Login

Join to view Modern Processor Design: Fundamentals of Superscalar Processors and access 3M+ class-specific study document.

or
We will never post anything without your permission.
Don't have an account?
Sign Up

Join to view Modern Processor Design: Fundamentals of Superscalar Processors 2 2 and access 3M+ class-specific study document.

or

By creating an account you agree to our Privacy Policy and Terms Of Use

Already a member?