CORNELL CS 3410 - A Simple CPU: remaining branch instructions CPU Performance - D708775

Home> Schools> Cornell University> Computer Science (CS) > CS 3410> A Simple CPU: remaining branch instructions CPU Performance

DOC PREVIEW

CORNELL CS 3410 - A Simple CPU: remaining branch instructions CPU Performance

School name Cornell University

Course Cs 3410- Computer System Organization and Programming

Pages 37

This preview shows page 1-2-17-18-19-36-37 out of 37 pages.

Save

View full document

Premium Document

Do you want full access? Go Premium and unlock all 37 pages.

Access to all documents

Download any document

Ad free experience

Subscribe for instant access Get instant access

View full document

Premium Document

Do you want full access? Go Premium and unlock all 37 pages.

Access to all documents

Download any document

Ad free experience

Subscribe for instant access Get instant access

View full document

Premium Document

Do you want full access? Go Premium and unlock all 37 pages.

Access to all documents

Download any document

Ad free experience

Subscribe for instant access Get instant access

View full document

Premium Document

Do you want full access? Go Premium and unlock all 37 pages.

Access to all documents

Download any document

Ad free experience

Subscribe for instant access Get instant access

View full document

Premium Document

Do you want full access? Go Premium and unlock all 37 pages.

Access to all documents

Download any document

Ad free experience

Subscribe for instant access Get instant access

View full document

Premium Document

Do you want full access? Go Premium and unlock all 37 pages.

Access to all documents

Download any document

Ad free experience

Subscribe for instant access Get instant access

View full document

Premium Document

Do you want full access? Go Premium and unlock all 37 pages.

Access to all documents

Download any document

Ad free experience

Subscribe for instant access Get instant access

Premium Document

Do you want full access? Go Premium and unlock all 37 pages.

Access to all documents

Download any document

Ad free experience

Subscribe for instant access Get instant access

Unformatted text preview:

Slide 1Memory LayoutControl Flow: More BranchesAbsolute JumpControl Flow: Jump and LinkAbsolute JumpSlide 7Design GoalsPerformanceHow Fast?Adder PerformanceOptimization: SummaryProcessor Clock CycleProcessor Clock CycleMulti-Cycle InstructionsCPIExampleAmdahl’s LawSlide 19The KidsThe BicycleThe MaterialsThe InstructionsDesign 1: Sequential ScheduleSequential PerformanceDesign 2: Pipelined DesignPipelined PerformanceUnequal Pipeline StagesPoorly-balanced Pipeline StagesRe-Balanced Pipeline StagesSplitting Pipeline StagesPipeline HazardsLessonsA ProcessorA ProcessorBasic PipelinePipelined Implementation1CS3410Guest LectureA Simple CPU: remaining branch instructionsCPU PerformancePipelined CPUTudor Marian2Memory LayoutExamples (big/little endian):# r5 contains 5 (0x00000005)sb r5, 2(r0)lb r6, 2(r0)sw r5, 8(r0)lb r7, 8(r0)lb r8, 11(r0)0x000000000x000000010x000000020x000000030x000000040x000000050x000000060x000000070x000000080x000000090x0000000a0x0000000b...0xffffffff3Control Flow: More Branchesop rs subopoffset6 bits 5 bits 5 bits 16 bits00000100101000010000000000000010signed offsetsalmost I-Typeop subopmnemonic description0x1 0x0 BLTZ rs, offset if R[rs] < 0 then PC = PC+4+ (offset<<2)0x1 0x1 BGEZ rs, offset if R[rs] ≥ 0 then PC = PC+4+ (offset<<2)0x6 0x0 BLEZ rs, offset if R[rs] ≤ 0 then PC = PC+4+ (offset<<2)0x7 0x0 BGTZ rs, offset if R[rs] > 0 then PC = PC+4+ (offset<<2)Conditional Jumps (cont.)4Absolute Jumptgt+4||Data Memaddrext5 5 5Reg.FilePCProg.MemALUinstcontrolimmoffset+Could have used ALU for branch cmp=?cmpop subopmnemonic description0x1 0x0 BLTZ rs, offset if R[rs] < 0 then PC = PC+4+ (offset<<2)0x1 0x1 BGEZ rs, offset if R[rs] ≥ 0 then PC = PC+4+ (offset<<2)0x6 0x0 BLEZ rs, offset if R[rs] ≤ 0 then PC = PC+4+ (offset<<2)0x7 0x0 BGTZ rs, offset if R[rs] > 0 then PC = PC+4+ (offset<<2)5Control Flow: Jump and Linkop mnemonic description0x3 JAL target r31 = PC+8 (+8 due to branch delay slot)PC = (PC+4)31..28 || (target << 2)op immediate6 bits26 bits00001100000001001000011000000010J-TypeFunction/procedure callsop mnemonic description0x2 J target PC = (PC+4)31..28 || (target << 2)6Absolute Jumptgt+4||Data Memaddrext5 5 5Reg.FilePCProg.MemALUinstcontrolimmoffset+=?cmpCould have used ALU for link add+4op mnemonic description0x3 JAL target r31 = PC+8 (+8 due to branch delay slot)PC = (PC+4)31..28 || (target << 2)7PerformanceSee: P&H 1.48Design GoalsWhat to look for in a computer system?•Correctness: negotiable?•Cost–purchase cost = f(silicon size = gate count, economics)–operating cost = f(energy, cooling) –operating cost >= purchase cost•Efficiency–power = f(transistor usage, voltage, wire size, clock rate, …)–heat = f(power)•Intel Core i7 Bloomfield: 130 Watts•AMD Turion: 35 Watts•Intel Core 2 Solo: 5.5 Watts•Cortex-A9 Dual Core @800MHz: 0.4 Watts•Performance•Other: availability, size, greenness, features, …9PerformanceHow to measure performance?GHz (billions of cycles per second)MIPS (millions of instructions per second) MFLOPS (millions of floating point operations per second)benchmarks (SPEC, TPC, …)Metricslatency: how long to finish my programthroughput: how much work finished per unit time10How Fast?ALUPCProg.MemcontrolnewpcReg.File~ 3 gatesAll signals are stable•80 gates => clock period of at least 160 ns, max frequency ~6MHzBetter:•21 gates => clock period of at least 42 ns, max frequency ~24MHz Assumptions:•alu: 32 bit ripple carry + some muxes•next PC: 30 bit ripple carry•control: minimized for delay (~3 gates)•transistors: 2 ns per gate•prog,. memory: 16 ns (as much as 8 gates)•register file: 2 ns access•ignore wires, register setup timeBetter:•alu: 32 bit carry lookahead + some muxes (~ 9 gates)•next PC: 30 bit carry lookahead (~ 6 gates)Better Still:•next PC: cheapest adder faster than 21 gate delays11Adder Performance32 Bit Adder Design Space TimeRipple Carry ≈ 300 gates ≈ 64 gate delays2-Way Carry-Skip ≈ 360 gates ≈ 35 gate delays3-Way Carry-Skip ≈ 500 gates ≈ 22 gate delays4-Way Carry-Skip ≈ 600 gates ≈ 18 gate delays2-Way Look-Ahead ≈ 550 gates ≈ 16 gate delaysSplit Look-Ahead ≈ 800 gates ≈ 10 gate delaysFull Look-Ahead ≈ 1200 gates ≈ 5 gate delays12Optimization: SummaryCritical Path•Longest path from a register output to a register input•Determines minimum cycle, maximum clock frequencyStrategy 1 (we just employed)•Optimize for delay on the critical path•Optimize for size / power / simplicity elsewhere –next PC13 Processor Clock CyclealuPCimmmemorymemorydindoutaddrtargetoffsetcmpcontrol=?extendnew pcregisterfileop mnemonic description0x20 LB rd, offset(rs) R[rd] = sign_ext(Mem[offset+R[rs]])0x23 LW rd, offset(rs) R[rd] = Mem[offset+R[rs]]0x28 SB rd, offset(rs) Mem[offset+R[rs]] = R[rd]0x2b SW rd, offset(rs) Mem[offset+R[rs]] = R[rd]14 Processor Clock CyclealuPCimmmemorymemorydindoutaddrtargetoffsetcmpcontrol=?extendnew pcregisterfileop func mnemonic description0x0 0x08 JR rs PC = R[rs]op mnemonic description0x2 J target PC = (PC+4)31..28 || (target << 2)15Multi-Cycle InstructionsStrategy 2•Multiple cycles to complete a single instructionE.g: Assume:•load/store: 100 ns•arithmetic: 50 ns•branches: 33 nsMulti-Cycle CPU30 MHz (33 ns cycle) with–3 cycles per load/store–2 cycles per arithmetic–1 cycle per branchFaster than Single-Cycle CPU?10 MHz (100 ns cycle) with–1 cycle per instruction16CPIInstruction mix for some program P, assume:•25% load/store ( 3 cycles / instruction)•60% arithmetic ( 2 cycles / instruction)•15% branches ( 1 cycle / instruction)Multi-Cycle performance for program P:3 * .25 + 2 * .60 + 1 * .15 = 2.1average cycles per instruction (CPI) = 2.1Multi-Cycle @ 30 MHzSingle-Cycle @ 10 MHzSingle-Cycle @ 15 MHz800 MHz PIII “faster” than 1 GHz P417ExampleGoal: Make Multi-Cycle @ 30 MHz CPU (15MIPS) run 2x faster by making arithmetic instructions fasterInstruction mix (for P):•25% load/store, CPI = 3 •60% arithmetic, CPI = 2•15% branches, CPI = 118Amdahl’s LawAmdahl’s LawExecution time after improvement =Or:Speedup is limited by popularity of improved featureCorollary: Make the common case fastCaveat:Law of diminishing returnsexecution time affected by improvementamount of improvement+ execution time unaffected19PipeliningSee: P&H Chapter 4.520The KidsAliceBobThey don’t always get along…21The Bicycle22The MaterialsSawDrillGluePaint23The InstructionsN pieces, each built following same

View Full Document