Spring 2010
Prof. Hyesoon Kim

• MIPS (Microprocessor without Interlocked Pipeline Stages)
• MIPS Computer Systems Inc.
• MIPS architecture usages
  • 1990’s – R2000, R3000, R4000, Motorola 68000 family
  • Playstation, Playstation 2, Sony PSP handheld, Nintendo 64 console
  • Android
http://en.wikipedia.org/wiki/MIPS_architecture

• MIPS R4000 CPU core
• Floating point and vector floating point co-processors
• 3D-CG extended instruction sets
• Graphics
  – 3D curved surface and other 3D functionality
  – Hardware clipping, compressed texture handling
• R4300 (embedded version)
  – Nintendo-64
http://www.digitaltrends.com/gaming/sony-announces-playstation-portable-specs/

• Started from 32-bit
• Later 64-bit
• 16-bit compression version (similar to ARM Thumb)
• SIMD additions: 64-bit floating point
http://www.spiritus-temporis.com/mips-architecture/

• Conditionally move one CPU general register to another
• Limited form of predicated execution
  – Difference between fully predicated execution and conditional move?

• 32-bit fixed-format instructions (3 formats)
• 31 32-bit GPRs (R0 contains zero) and 32 FP registers (and HI, LO)
• Registers partitioned by software convention
• 3-address, reg-reg arithmetic instructions
• Single addressing mode for load/store: base + displacement
• Simple branch conditions
  • Compare one register against zero, or two registers for =, ≠
  • No condition codes for integer operations

The Mips R4000 Processor, Mirapuri, S.; Woodacre, M.; Vasseghi, N.; IEEE Micro, Volume: 12, Issue: 2, Publication Year: 1992, Page(s): 10-22

[Figure: R4000 pipeline. P-cache: primary cache; S-cache: secondary cache.]

• Q: Why is the tag check stage at the end of the load access?
• A: The cache is virtually indexed, physically tagged (VIPT)

[Figure: VIPT lookup. The virtual address indexes the cache tags and cache data while the TLB produces the physical address; the physical tag is compared (=) with the cache tag to decide Hit?]

• R2000 load has a delay slot
    LW ra ---
    Addi ra rb rc
    Addi ra rb rc
• Good idea? Bad idea?
• R4000 does not have load delay slots.
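The load delay slot discussed above can be illustrated with a small sketch. This is a toy model, not the R2000's actual implementation: the instruction tuple format, the register names `ra`..`rd`, and the function `run` are all hypothetical, introduced only for illustration. It assumes a single delay slot in which the instruction right after a load still reads the pre-load register value.

```python
def run(program, regs, memory, load_delay_slot):
    """Execute (op, dest, a, b) tuples and return the final register values.

    With load_delay_slot=True, a load's destination register is updated only
    after the *following* instruction has read its operands (toy model of an
    R2000-style load delay slot). With False, the loaded value is visible to
    the very next instruction, as if the hardware interlocked.
    """
    regs = dict(regs)
    pending = None  # (dest, value) of a load whose result is not yet visible
    for op, dest, a, b in program:
        # Operands are read first, so a pending load is still invisible here.
        if op == "lw":
            result = memory[regs[a] + b]  # b is an immediate displacement
        elif op == "add":
            result = regs[a] + regs[b]
        else:
            raise ValueError(f"unknown op: {op}")
        # The previous instruction's load now lands in the register file.
        if pending is not None:
            regs[pending[0]] = pending[1]
            pending = None
        if op == "lw" and load_delay_slot:
            pending = (dest, result)
        else:
            regs[dest] = result
    if pending is not None:  # a load at the very end still completes
        regs[pending[0]] = pending[1]
    return regs


# Hypothetical program: load ra, then use ra twice.
program = [
    ("lw", "ra", "rb", 0),      # ra <- memory[rb + 0]
    ("add", "rc", "ra", "ra"),  # in the delay slot: may see the old ra
    ("add", "rd", "ra", "ra"),  # always sees the loaded ra
]
initial = {"ra": 1, "rb": 100, "rc": 0, "rd": 0}
memory = {100: 7}

delayed = run(program, initial, memory, load_delay_slot=True)
interlocked = run(program, initial, memory, load_delay_slot=False)
print(delayed["rc"], interlocked["rc"])
```

In the delayed run, `rc` is computed from the old `ra`, so the same program gives different answers depending on the pipeline. That dependence on implementation detail is one way to approach the "Good idea? Bad idea?" question.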
• In the R2000 load delay slot, the instruction right after the load sees the old ra value (from before the load).

• 2-cycle delay loads
  • Data is not available until the end of DS
  • Only the DF/DS/TC/WB stages make progress for load instructions (the IS/RF/EX pipeline stages stall)

• 2-level cache hierarchy
  • Different line sizes
    – Pros? Cons?
  • Inclusive cache
• Primary cache: 8KB in the initial design, up to 32KB
  – Direct-mapped, VIPT
  – 16B or 32B software-programmable line size
• Secondary cache
  – 128-bit, up to 4MB

[Figure: five-stage pipeline (FE, ID, EX, MEM, WB) timing for a taken branch at 0x800 with target 0x900. Always two cycles of pipeline bubble.]

    0x800          br target
    0x804          add r4, r2, r3
    0x808          sub r1, r2, r3
    0x900  target: mul r2, r3, r4

Change the rule! Always execute the next two instructions after a branch.

[Figure: the same pipeline with the new rule. add and sub are fetched and executed after br, followed by mul and div from the target. No pipeline bubble!]

• N-cycle delay slot
• The compiler fills the delay slot with useful instructions
• Different options:
  – Fill the slot from before the branch instruction
    • Restriction: the branch must not depend on the result of the filled instruction
  – Fill the slot from the target of the branch instruction
    • Restriction: it should be OK to execute the instruction even if the branch is not taken
  – Fill the slot from the fall-through of the branch
    • Restriction: it should be OK to execute the instruction even if the branch is taken

Canceling or nullifying instructions are still needed:
• Branch
  – Execute the instructions in the delay slot
• Branch likely
  – Do not execute the instructions in the delay slot if the branch is not taken
• Do not use branch likely!
  – It won’t be supported in the future

• Used in many DSP architectures and older RISCs: MIPS, PA-RISC, SPARC.
• Delayed branches are architecturally visible
  – Advantage:
    • Better performance
  – Disadvantage:
    • What if the implementation changes?
    • Deeper pipeline -> more branch delays?
    • Interrupts/exceptions?
      – Where to go back?
    • Combining with a branch predictor?

• Later designs are based on R10K
• Out-of-order superscalar processor
• ROB, 32 in-flight instructions
• 4-instruction
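Returning to the delayed-branch example earlier in these notes (branch at 0x800, target 0x900), the effect of the "execute the next two instructions" rule can be sketched with a toy fetch-slot model. Everything here is an assumption made for illustration: the `trace` function, the dictionary encoding of the program, and the simplification that every branch is taken and that no branch sits in a delay slot. The two-slot delay is the figure used in the slides.

```python
BRANCH_DELAY = 2  # fetch slots between a branch and its target (from the slides)


def trace(program, start, delayed_branch, max_slots=20):
    """Return what occupies each fetch slot, one entry per cycle.

    program maps address -> (mnemonic, branch_target_or_None); every branch
    is assumed taken. Without delayed branches, the BRANCH_DELAY slots after
    a branch are squashed and appear as "bubble". With delayed branches, the
    next BRANCH_DELAY sequential instructions are executed instead.
    """
    slots = []
    pc = start
    while pc in program and len(slots) < max_slots:
        name, target = program[pc]
        slots.append(name)
        if target is None:
            pc += 4  # ordinary instruction: fall through
            continue
        for i in range(1, BRANCH_DELAY + 1):
            if delayed_branch:
                slots.append(program[pc + 4 * i][0])  # delay-slot instruction
            else:
                slots.append("bubble")  # wrong-path fetch is discarded
        pc = target
    return slots


# The program from the example; div at 0x904 follows the figure.
program = {
    0x800: ("br", 0x900),
    0x804: ("add", None),
    0x808: ("sub", None),
    0x900: ("mul", None),
    0x904: ("div", None),
}

squashed = trace(program, 0x800, delayed_branch=False)
delay_slots = trace(program, 0x800, delayed_branch=True)
print(squashed)
print(delay_slots)
```

Both traces occupy the same number of fetch slots, but in the delayed-branch case every slot holds a useful instruction, provided the compiler could find instructions that are safe to place there.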