Spring 2011 EECS150 - Lec09-cpu

EECS150 - Digital Design
Lecture 9 - CPU Microarchitecture
Feb 15, 2011
John Wawrzynek

Watson: Jeopardy-playing Computer
Watson is made up of a cluster of ninety IBM Power 750 servers (plus additional I/O, network and cluster controller nodes in 10 racks) for a total of 2880 POWER7 processor cores and 16 Terabytes of RAM. Each Power 750 server uses a 3.5 GHz POWER7 eight core processor, with four threads per core. Even so, it still takes ~15 seconds per question.
Each core can do 8 double-precision FLOPs per cycle, so the total is 2880 * 3.5 GHz * 8 > 80 TFLOPS.

Processor Microarchitecture Introduction
Microarchitecture: how to implement an architecture in hardware.
Good examples of how to put principles of digital design into practice.
Introduction to final project.

MIPS Processor Architecture
• For now we consider a subset of MIPS instructions:
– R-type instructions: and, or, add, sub, slt
– Memory instructions: lw, sw
– Branch instructions: beq
• Later we'll add addi and j

MIPS Microarchitecture Organization
Datapath + Controller + External Memory

How to Design a Processor: step-by-step
1. Analyze instruction set architecture (ISA) ⇒ datapath requirements
– meaning of each instruction is given by the data transfers (register transfers)
– datapath must include storage element for ISA registers
– datapath must support each data transfer
2. Select set of datapath components and establish clocking methodology
3. Assemble datapath meeting requirements
4. Analyze implementation of each instruction to determine the setting of control points that effect the data transfer.
5. Assemble the control logic.

Review: The MIPS Instruction Formats
R-type:  op [31:26], 6 bits | rs [25:21], 5 bits | rt [20:16], 5 bits | rd [15:11], 5 bits | shamt [10:6], 5 bits | funct [5:0], 6 bits
I-type:  op [31:26], 6 bits | rs [25:21], 5 bits | rt [20:16], 5 bits | address/immediate [15:0], 16 bits
J-type:  op [31:26], 6 bits | target address [25:0], 26 bits
The different fields are:
op: operation ("opcode") of the instruction
rs, rt, rd: the source and destination register specifiers
shamt: shift amount
funct: selects the variant of the operation in the "op" field
address / immediate: address offset or immediate value
target address: target address of jump instruction

Subset for Lecture
add, sub, or, slt (R-type format: op, rs, rt, rd, shamt, funct)
• addu rd,rs,rt
• subu rd,rs,rt
lw, sw (I-type format: op, rs, rt, immediate)
• lw rt,rs,imm16
• sw rt,rs,imm16
beq (I-type format: op, rs, rt, immediate)
• beq rs,rt,imm16

Register Transfer Descriptions
All start with instruction fetch:
{op, rs, rt, rd, shamt, funct} ← IMEM[ PC ]  OR
{op, rs, rt, Imm16} ← IMEM[ PC ]  THEN
inst   Register Transfers
add    R[rd] ← R[rs] + R[rt]; PC ← PC + 4
sub    R[rd] ← R[rs] – R[rt]; PC ← PC + 4
or     R[rd] ← R[rs] | R[rt]; PC ← PC + 4
slt    R[rd] ← (R[rs] < R[rt]) ? 1 : 0; PC ← PC + 4
lw     R[rt] ← DMEM[ R[rs] + sign_ext(Imm16) ]; PC ← PC + 4
sw     DMEM[ R[rs] + sign_ext(Imm16) ] ← R[rt]; PC ← PC + 4
beq    if ( R[rs] == R[rt] ) then PC ← PC + 4 + {sign_ext(Imm16), 00} else PC ← PC + 4

Microarchitecture
Multiple implementations for a single architecture:
– Single-cycle
• Each instruction executes in a single clock cycle.
– Multicycle
• Each instruction is broken up into a series of shorter steps with one step per clock cycle.
– Pipelined (variant on "multicycle")
• Each instruction is broken up into a series of steps with one step per clock cycle
• Multiple instructions execute at once.

CPU clocking (1/2)
• Single Cycle CPU: All stages of an instruction are completed within one long clock cycle.
– The clock cycle is made sufficiently long to allow each instruction to complete all stages without interruption and within one cycle.
Stages: 1. Instruction Fetch, 2. Decode/Register Read, 3. Execute, 4. Memory, 5. Reg. Write

CPU clocking (2/2)
• Multiple-cycle CPU: Only one stage of an instruction per clock cycle.
– The clock is made as long as the slowest stage.
Several significant advantages over single cycle execution: unused stages in a particular instruction can be skipped OR instructions can be pipelined (overlapped).
Stages: 1. Instruction Fetch, 2. Decode/Register Read, 3. Execute, 4. Memory, 5. Reg. Write

MIPS State Elements
• Determine everything about the execution status of a processor:
– PC register
– 32 registers
– Memory
Note: for these state elements, clock is used for write but not for read (asynchronous read, synchronous write).

Single-Cycle Datapath: lw
R[rt] ← DMEM[ R[rs] + sign_ext(Imm16) ]
• First consider executing lw
• STEP 1 (fetch): Fetch instruction
• STEP 2 (register read): Read source operands from register file
• STEP 3 (immediate): Sign-extend the immediate
• STEP 4 (address): Compute the memory address
• STEP 5 (memory read): Read data from memory and write it back to register file
• STEP 6 (PC increment): Determine the address of the next instruction, PC ← PC + 4

Single-Cycle Datapath: sw
DMEM[ R[rs] + sign_ext(Imm16) ] ← R[rt]
• Write data in rt to memory

Single-Cycle Datapath: R-type instructions
R[rd] ← R[rs] op R[rt]
• Read from rs and rt
• Write ALUResult to register file
• Write to rd (instead of rt)

Single-Cycle Datapath: beq
• Determine whether values in rs and rt are equal
• Calculate branch target address: BTA = (sign-extended immediate << 2) + (PC+4)
if ( R[rs] == R[rt] ) then PC
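The register transfers for lw, sw, the R-type instructions, and beq can be mirrored in software as a rough check of the datapath steps. The following is a minimal sketch in Python, not the course's hardware implementation: the names `step`, `sign_ext16`, and `to_signed32`, the pre-decoded tuple format for instructions, and the use of a dictionary for data memory are all assumptions made for illustration.

```python
MASK32 = 0xFFFFFFFF  # registers and the PC are modeled as 32-bit values


def sign_ext16(imm16):
    """Sign-extend a 16-bit immediate to a (possibly negative) Python int."""
    imm16 &= 0xFFFF
    return imm16 - 0x10000 if imm16 & 0x8000 else imm16


def to_signed32(value):
    """Interpret a 32-bit pattern as a two's-complement signed integer."""
    value &= MASK32
    return value - (1 << 32) if value & 0x80000000 else value


def step(pc, regs, dmem, instr):
    """Apply one instruction's register transfers and return the next PC.

    instr is a hypothetical pre-decoded tuple:
      R-type:      (name, rs, rt, rd)
      lw, sw, beq: (name, rs, rt, imm16)
    regs is a list of 32 ints; dmem maps a byte address to a 32-bit word.
    """
    name = instr[0]
    next_pc = (pc + 4) & MASK32  # default: PC <- PC + 4

    if name in ("add", "sub", "and", "or", "slt"):
        _, rs, rt, rd = instr
        a, b = regs[rs], regs[rt]
        if name == "add":
            result = a + b
        elif name == "sub":
            result = a - b
        elif name == "and":
            result = a & b
        elif name == "or":
            result = a | b
        else:  # slt compares the operands as signed values
            result = 1 if to_signed32(a) < to_signed32(b) else 0
        regs[rd] = result & MASK32  # R-type writes rd, not rt
    elif name in ("lw", "sw", "beq"):
        _, rs, rt, imm16 = instr
        offset = sign_ext16(imm16)
        if name == "lw":
            addr = (regs[rs] + offset) & MASK32
            regs[rt] = dmem.get(addr, 0)  # unwritten memory reads as 0 here
        elif name == "sw":
            addr = (regs[rs] + offset) & MASK32
            dmem[addr] = regs[rt]
        elif regs[rs] == regs[rt]:
            # BTA = (sign-extended immediate << 2) + (PC + 4)
            next_pc = (pc + 4 + (offset << 2)) & MASK32
    else:
        raise ValueError(f"instruction not in the lecture subset: {name}")
    return next_pc


if __name__ == "__main__":
    regs = [0] * 32
    regs[1] = 100
    dmem = {96: 42}
    pc = step(0, regs, dmem, ("lw", 1, 2, 0xFFFC))  # offset of -4 from R[1]
    print(pc, regs[2])
```

Every instruction shares the default PC + 4 update, and only a taken beq overrides it, which matches the way the PC-increment step is described once for lw and reused by the other instructions.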
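The instruction formats reviewed earlier in the lecture place every field at a fixed bit position, so the instruction-fetch transfer {op, rs, rt, rd, shamt, funct} ← IMEM[ PC ] amounts to shifting and masking a 32-bit word. The sketch below assumes a hypothetical helper named `decode` that returns all fields at once and leaves it to the caller to pick the ones that apply to the instruction's format.

```python
def decode(word):
    """Split a 32-bit MIPS instruction word into its fixed-position fields.

    All three formats share op in bits 31:26; the remaining keys overlap,
    so only the ones matching the instruction's format are meaningful.
    """
    word &= 0xFFFFFFFF
    return {
        "op": (word >> 26) & 0x3F,     # bits 31:26, 6 bits
        "rs": (word >> 21) & 0x1F,     # bits 25:21, 5 bits
        "rt": (word >> 16) & 0x1F,     # bits 20:16, 5 bits
        "rd": (word >> 11) & 0x1F,     # bits 15:11, 5 bits (R-type)
        "shamt": (word >> 6) & 0x1F,   # bits 10:6,  5 bits (R-type)
        "funct": word & 0x3F,          # bits 5:0,   6 bits (R-type)
        "imm16": word & 0xFFFF,        # bits 15:0, 16 bits (I-type)
        "target": word & 0x3FFFFFF,    # bits 25:0, 26 bits (J-type)
    }


if __name__ == "__main__":
    # Word assembled from I-type fields: op=0x23, rs=1, rt=2, imm16=0xFFFC
    word = (0x23 << 26) | (1 << 21) | (2 << 16) | 0xFFFC
    fields = decode(word)
    print(hex(word), fields["op"], fields["rs"], fields["rt"], hex(fields["imm16"]))
```

Returning every field regardless of format loosely mirrors a hardware decoder, where all the bit slices exist as wires and the control logic decides which ones are used.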