CS252/Patterson — Lec 20.1 — 4/13/01

Contents:
- CS252 Graduate Computer Architecture Lecture 20: Static Pipelining #2 and Goodbye to Computer Architecture
- Review #1: Hardware versus Software Speculation Mechanisms
- Review #2: Hardware versus Software Speculation Mechanisms cont'd
- Review #3: Software Scheduling
- VLIW in Embedded Designs
- Example VLIW for multimedia: Philips Trimedia CPU
- Trimedia Operations
- Trimedia Functional Units, Latency, Instruction Slots
- Philips Trimedia CPU
- Example
- Slide 11
- Trimedia Version
- Clock cycles to execute 2D iDCT
- Administrivia
- Transmeta Crusoe MPU
- Crusoe processor: Basics
- Crusoe processor: Operations
- 80x86 Compatibility
- Exception Behavior during Speculation
- Crusoe Performance?
- Real Time, so comparison is Energy
- Crusoe Applications?
- VLIW Readings
- Review of Course
- Chapter 1: Performance and Cost
- Slide 26
- Goodbye to Performance and Cost
- Chapter 5: Memory Hierarchy
- Cache Optimization Summary
- Goodbye to Memory Hierarchy
- Chapter 6: Storage I/O
- Summary: I/O Benchmarks
- Goodbye to Storage I/O
- Slide 34
- Chapter 7: Networks
- Slide 36
- Review: Networking
- Goodbye to Networks
- Chapter 8: Multiprocessors
- Goodbye to Multiprocessors
- Chapter 2: Instruction Set Architecture
- Goodbye to Instruction Set Architecture
- Goodbye to Dynamic Execution
- Goodbye to Static, Embedded
- Goodbye to Computer Architecture
- Slide 46

CS252 Graduate Computer Architecture
Lecture 20: Static Pipelining #2 and Goodbye to Computer Architecture
April 13, 2001
Prof. David A.
Patterson
Computer Science 252
Spring 2001

Review #1: Hardware versus Software Speculation Mechanisms
• To speculate extensively, must be able to disambiguate memory references
  – Much easier in HW than in SW for code with pointers
• HW-based speculation works better when control flow is unpredictable, and when HW-based branch prediction is superior to SW-based branch prediction done at compile time
  – Mispredictions mean wasted speculation
• HW-based speculation maintains a precise exception model even for speculated instructions
• HW-based speculation does not require compensation or bookkeeping code

Review #2: Hardware versus Software Speculation Mechanisms cont'd
• Compiler-based approaches may benefit from the ability to see further ahead in the code sequence, resulting in better code scheduling
• HW-based speculation with dynamic scheduling does not require different code sequences to achieve good performance for different implementations of an architecture
  – may be the most important advantage in the long run?

Review #3: Software Scheduling
• Instruction Level Parallelism (ILP) found either by compiler or hardware
• Loop-level parallelism is easiest to see
  – SW dependencies/compiler sophistication determine if compiler can unroll loops
  – Memory dependencies hardest to determine => memory disambiguation
  – Very sophisticated transformations available
• Trace Scheduling to parallelize If statements
• Superscalar and VLIW: CPI < 1 (IPC > 1)
  – Dynamic issue vs. static issue
  – More instructions issued at the same time => larger hazard penalty
  – Limitation is often the number of instructions that you can successfully fetch and decode per cycle

VLIW in Embedded Designs
• VLIW: greater parallelism under programmer, compiler control vs.
hardware in superscalar
• Used in DSPs and multimedia processors as well as IA-64
• What about code size?
• Effectiveness, quality of compilers for these applications?

Example VLIW for multimedia: Philips Trimedia CPU
• Every instruction contains 5 operations
• Predicated with a single register value; if 0 => all 5 operations are canceled
• 128 64-bit registers, which contain either integer or floating point data
• Partitioned ALU (SIMD) instructions to compute on multiple instances of narrow data
• Offers both saturating arithmetic (DSPs) and 2's complement arithmetic (desktop)
• Delayed branch with 3 branch slots

Trimedia Operations (category — examples — no. of ops — comment):
• Load/store ops: ld8, ld16, ld32, ld64, limm, st8, st16, st32, st64 — 39 — SIMD, signed, unsigned; register indirect, indexed, scaled addressing
• Byte shuffles: shift right 1-, 2-, 3-bytes, select byte, merge, pack — 67 — SIMD type convert
• Bit shifts: asl, asr, lsl, lsr, rol — 48 — round, fields, SIMD
• Multiplies: mul, sum of products, sum-of-SIMD-elements — 54 — round, saturate, 2's comp, SIMD
• Integer arithmetic: add, sub, min, max, abs, average, bitand, bitor, bitxor, bitinv, bitandinv, eql, neq, gtr, geq, les, leq, sign extend, zero extend, sum of absolute differences — 104 — saturate, 2's comp, unsigned, immediate, SIMD
• Floating point: add, sub, neg, mul, div, sqrt, eql, neq, gtr, geq, les, leq, IEEE flags — 59 — scalar and SIMD
• Lookup table: SIMD gather load using registers as addresses — 6 — SIMD
• Special ops: alloc, prefetch block, invalidate block, copy block back, read tag, read cache status, read counter — 23 — MMU, cache, special regs
• Branch: jmpt, jmpf — 10 — (un)interruptible, trap
• Total: 410

• Large number of ops because the designers used retargetable compilers, multiple machine descriptions, and die size estimators to explore the design space and find the best cost-performance design
  – Verification time, manufacturing test, design time?

Trimedia Functional Units, Latency, Instruction Slots
F.U.
          Latency   Slots (of 5)   Typical operations performed by functional unit
ALU          0          5          Integer add/subtract/compare, logicals
DMem         2          2          Loads and stores
DMemSpec     2          1          Cache invalidate, prefetch, allocate
Shifter      0          2          Shifts and rotates
DSPALU       1          2          Simple DSP arithmetic ops
DSPMul       2          2          DSP ops with multiplication
Branch       3          3          Branches and jumps
FALU         2          2          FP add, subtract
IFMul        2          2          Integer and FP multiply
FComp        0          1          FP compare
FTough      16          1          FP divide, square root

• 23 functional units of 11 types
• Which of the 5 slots can issue each operation determines the number of functional units of that type (5+2+1+2+2+2+3+2+2+1+1 = 23)

Philips Trimedia CPU
• Compiler responsible for including no-ops
  – both within an instruction (when an operation field cannot be used) and between dependent instructions
  – processor does not detect hazards, which if present will lead to incorrect execution
• Code size? Compresses the code (~ Quiz #1)
  – decompresses after it is fetched from the instruction cache

Example
• Using MIPS notation, look at the code for

void sum (int a[], int b[], int c[], int n)
{
  int i;
  for (i = 0; i < n; i++)
    c[i] = a[i] + b[i];
}

Example
• MIPS code for the loop

Loop: LD    R11,R0(R4)   # R11 = a[i]
      LD    R12,R0(R5)   # R12 = b[i]
      DADDU