Slide 1Memory LayoutControl Flow: More BranchesAbsolute JumpControl Flow: Jump and LinkAbsolute JumpSlide 7Design GoalsPerformanceHow Fast?Adder PerformanceOptimization: SummaryProcessor Clock CycleProcessor Clock CycleMulti-Cycle InstructionsCPIExampleAmdahl’s LawSlide 19The KidsThe BicycleThe MaterialsThe InstructionsDesign 1: Sequential ScheduleSequential PerformanceDesign 2: Pipelined DesignPipelined PerformanceUnequal Pipeline StagesPoorly-balanced Pipeline StagesRe-Balanced Pipeline StagesSplitting Pipeline StagesPipeline HazardsLessonsA ProcessorA ProcessorBasic PipelinePipelined Implementation1CS3410Guest LectureA Simple CPU: remaining branch instructionsCPU PerformancePipelined CPUTudor Marian2Memory LayoutExamples (big/little endian):# r5 contains 5 (0x00000005)sb r5, 2(r0)lb r6, 2(r0)sw r5, 8(r0)lb r7, 8(r0)lb r8, 11(r0)0x000000000x000000010x000000020x000000030x000000040x000000050x000000060x000000070x000000080x000000090x0000000a0x0000000b...0xffffffff3Control Flow: More Branchesop rs subopoffset6 bits 5 bits 5 bits 16 bits00000100101000010000000000000010signed offsetsalmost I-Typeop subopmnemonic description0x1 0x0 BLTZ rs, offset if R[rs] < 0 then PC = PC+4+ (offset<<2)0x1 0x1 BGEZ rs, offset if R[rs] ≥ 0 then PC = PC+4+ (offset<<2)0x6 0x0 BLEZ rs, offset if R[rs] ≤ 0 then PC = PC+4+ (offset<<2)0x7 0x0 BGTZ rs, offset if R[rs] > 0 then PC = PC+4+ (offset<<2)Conditional Jumps (cont.)4Absolute Jumptgt+4||Data Memaddrext5 5 5Reg.FilePCProg.MemALUinstcontrolimmoffset+Could have used ALU for branch cmp=?cmpop subopmnemonic description0x1 0x0 BLTZ rs, offset if R[rs] < 0 then PC = PC+4+ (offset<<2)0x1 0x1 BGEZ rs, offset if R[rs] ≥ 0 then PC = PC+4+ (offset<<2)0x6 0x0 BLEZ rs, offset if R[rs] ≤ 0 then PC = PC+4+ (offset<<2)0x7 0x0 BGTZ rs, offset if R[rs] > 0 then PC = PC+4+ (offset<<2)5Control Flow: Jump and Linkop mnemonic description0x3 JAL target r31 = PC+8 (+8 due to branch delay slot)PC = (PC+4)31..28 || (target << 2)op immediate6 bits26 bits00001100000001001000011000000010J-TypeFunction/procedure callsop mnemonic description0x2 J target PC = (PC+4)31..28 || (target << 2)6Absolute Jumptgt+4||Data Memaddrext5 5 5Reg.FilePCProg.MemALUinstcontrolimmoffset+=?cmpCould have used ALU for link add+4op mnemonic description0x3 JAL target r31 = PC+8 (+8 due to branch delay slot)PC = (PC+4)31..28 || (target << 2)7PerformanceSee: P&H 1.48Design GoalsWhat to look for in a computer system?•Correctness: negotiable?•Cost–purchase cost = f(silicon size = gate count, economics)–operating cost = f(energy, cooling) –operating cost >= purchase cost•Efficiency–power = f(transistor usage, voltage, wire size, clock rate, …)–heat = f(power)•Intel Core i7 Bloomfield: 130 Watts•AMD Turion: 35 Watts•Intel Core 2 Solo: 5.5 Watts•Cortex-A9 Dual Core @800MHz: 0.4 Watts•Performance•Other: availability, size, greenness, features, …9PerformanceHow to measure performance?GHz (billions of cycles per second)MIPS (millions of instructions per second) MFLOPS (millions of floating point operations per second)benchmarks (SPEC, TPC, …)Metricslatency: how long to finish my programthroughput: how much work finished per unit time10How Fast?ALUPCProg.MemcontrolnewpcReg.File~ 3 gatesAll signals are stable•80 gates => clock period of at least 160 ns, max frequency ~6MHzBetter:•21 gates => clock period of at least 42 ns, max frequency ~24MHz Assumptions:•alu: 32 bit ripple carry + some muxes•next PC: 30 bit ripple carry•control: minimized for delay (~3 gates)•transistors: 2 ns per gate•prog,. memory: 16 ns (as much as 8 gates)•register file: 2 ns access•ignore wires, register setup timeBetter:•alu: 32 bit carry lookahead + some muxes (~ 9 gates)•next PC: 30 bit carry lookahead (~ 6 gates)Better Still:•next PC: cheapest adder faster than 21 gate delays11Adder Performance32 Bit Adder Design Space TimeRipple Carry ≈ 300 gates ≈ 64 gate delays2-Way Carry-Skip ≈ 360 gates ≈ 35 gate delays3-Way Carry-Skip ≈ 500 gates ≈ 22 gate delays4-Way Carry-Skip ≈ 600 gates ≈ 18 gate delays2-Way Look-Ahead ≈ 550 gates ≈ 16 gate delaysSplit Look-Ahead ≈ 800 gates ≈ 10 gate delaysFull Look-Ahead ≈ 1200 gates ≈ 5 gate delays12Optimization: SummaryCritical Path•Longest path from a register output to a register input•Determines minimum cycle, maximum clock frequencyStrategy 1 (we just employed)•Optimize for delay on the critical path•Optimize for size / power / simplicity elsewhere –next PC13 Processor Clock CyclealuPCimmmemorymemorydindoutaddrtargetoffsetcmpcontrol=?extendnew pcregisterfileop mnemonic description0x20 LB rd, offset(rs) R[rd] = sign_ext(Mem[offset+R[rs]])0x23 LW rd, offset(rs) R[rd] = Mem[offset+R[rs]]0x28 SB rd, offset(rs) Mem[offset+R[rs]] = R[rd]0x2b SW rd, offset(rs) Mem[offset+R[rs]] = R[rd]14 Processor Clock CyclealuPCimmmemorymemorydindoutaddrtargetoffsetcmpcontrol=?extendnew pcregisterfileop func mnemonic description0x0 0x08 JR rs PC = R[rs]op mnemonic description0x2 J target PC = (PC+4)31..28 || (target << 2)15Multi-Cycle InstructionsStrategy 2•Multiple cycles to complete a single instructionE.g: Assume:•load/store: 100 ns•arithmetic: 50 ns•branches: 33 nsMulti-Cycle CPU30 MHz (33 ns cycle) with–3 cycles per load/store–2 cycles per arithmetic–1 cycle per branchFaster than Single-Cycle CPU?10 MHz (100 ns cycle) with–1 cycle per instruction16CPIInstruction mix for some program P, assume:•25% load/store ( 3 cycles / instruction)•60% arithmetic ( 2 cycles / instruction)•15% branches ( 1 cycle / instruction)Multi-Cycle performance for program P:3 * .25 + 2 * .60 + 1 * .15 = 2.1average cycles per instruction (CPI) = 2.1Multi-Cycle @ 30 MHzSingle-Cycle @ 10 MHzSingle-Cycle @ 15 MHz800 MHz PIII “faster” than 1 GHz P417ExampleGoal: Make Multi-Cycle @ 30 MHz CPU (15MIPS) run 2x faster by making arithmetic instructions fasterInstruction mix (for P):•25% load/store, CPI = 3 •60% arithmetic, CPI = 2•15% branches, CPI = 118Amdahl’s LawAmdahl’s LawExecution time after improvement =Or:Speedup is limited by popularity of improved featureCorollary: Make the common case fastCaveat:Law of diminishing returnsexecution time affected by improvementamount of improvement+ execution time unaffected19PipeliningSee: P&H Chapter 4.520The KidsAliceBobThey don’t always get along…21The Bicycle22The MaterialsSawDrillGluePaint23The InstructionsN pieces, each built following same
View Full Document