COMP 206: Computer Architecture and Implementation
Outline
Quantitative Principles of Computer Design
Example 1 (see HP3 pp. 42-45 for more examples)
Example 1 (Soln. using Amdahl’s Law)
Example 2
Example 2 (Solution)
Example 3
Example 3 (Solution)
Example 4
Example 4 (Solution)
Performance of (Blocking) Caches
Example
Means
Weighted Means
Relations among Means
Summarizing Computer Performance
Arithmetic Mean for Times
Harmonic Mean for Rates
Avoid the Geometric Mean
Programs to Evaluate Performance
SPEC: Std Perf Evaluation Corp
SPEC95 Details
Trends in Integer Performance
Trends in Floating Point Performance
SPEC95 Ratings of Processors
SPEC95 vs SPEC CPU2000
SPEC CPU2000 Example
Performance Evaluation
Cost of Integrated Circuits
Explanations
Real World Examples
Moore’s Law
Moore’s Law in Action at Intel
Moore’s Law At Risk?
Where Do The Transistors Go?
Chip Photographs
Embedded Processors
Power-Performance Tradeoff (Embedded)

1. COMP 206: Computer Architecture and Implementation
Montek Singh
Wed., Sep 8, 2003
Lecture 3

2. Outline
Examples (contd. from previous lecture)
Benchmarks
Cost
Moore’s Law

3. Quantitative Principles of Computer Design
\[ P = \frac{1}{T} \]
Time (T): execution time, response time, latency.
Units of T: time per unit of work (time / result, time / program, time / instruction, time / bit).
Performance (P): rate of producing results, throughput, bandwidth.
Units of P: work per unit of time (results / time, programs / time, instructions / time, bits / time).

4. Example 1 (see HP3 pp. 42-45 for more examples)
Which change is more effective on a certain machine: speeding up 10-fold the floating point square root operation only, which takes up 20% of execution time, or speeding up 2-fold all floating point operations, which take up 50% of total execution time?
(Assume that the cost of accomplishing either change is the same, and the two changes are mutually exclusive.)

F_sqrt = fraction of FP sqrt results
R_sqrt = rate of producing FP sqrt results
F_non-sqrt = fraction of non-sqrt results
R_non-sqrt = rate of producing non-sqrt results
F_fp = fraction of FP results
R_fp = rate of producing FP results
F_non-fp = fraction of non-FP results
R_non-fp = rate of producing non-FP results
R_before = average rate of producing results before enhancement
R_after = average rate of producing results after enhancement

\[ \frac{F_{\text{non-sqrt}}}{R_{\text{non-sqrt}}} + \frac{F_{\text{sqrt}}}{R_{\text{sqrt}}} = \frac{F_{\text{non-fp}}}{R_{\text{non-fp}}} + \frac{F_{\text{fp}}}{R_{\text{fp}}} \]

Both sides equal 1/R_before, the average time per result before the enhancement.

5. Example 1 (Soln. using Amdahl’s Law)

Improve FP sqrt only. Let x = F_sqrt / R_sqrt be the time spent on FP sqrt (20% of the total), so the remaining 80% is 4x:
\[ \frac{1}{R_{\text{before}}} = \frac{F_{\text{non-sqrt}}}{R_{\text{non-sqrt}}} + \frac{F_{\text{sqrt}}}{R_{\text{sqrt}}} = 4x + x = 5x \]
\[ \frac{1}{R_{\text{after}}} = \frac{F_{\text{non-sqrt}}}{R_{\text{non-sqrt}}} + \frac{F_{\text{sqrt}}}{10\,R_{\text{sqrt}}} = 4x + 0.1x = 4.1x \]
\[ \frac{R_{\text{after}}}{R_{\text{before}}} = \frac{5x}{4.1x} = 1.22 \]

Improve all FP ops. Let y = F_fp / R_fp be the time spent on FP operations (50% of the total), so the remaining 50% is also y:
\[ \frac{1}{R_{\text{before}}} = \frac{F_{\text{non-fp}}}{R_{\text{non-fp}}} + \frac{F_{\text{fp}}}{R_{\text{fp}}} = y + y = 2y \]
\[ \frac{1}{R_{\text{after}}} = \frac{F_{\text{non-fp}}}{R_{\text{non-fp}}} + \frac{F_{\text{fp}}}{2\,R_{\text{fp}}} = y + 0.5y = 1.5y \]
\[ \frac{R_{\text{after}}}{R_{\text{before}}} = \frac{2y}{1.5y} = 1.33 \]

[Figure: bar chart with bars labeled Sqrt (b), Sqrt (a), FP (b), FP (a); vertical axis ticks from 0 to 0.9]

6. Example 2

                Machine A           Machine B
Operation       Frequency   CPI     Frequency         CPI
Compare         0.2         1
Branch          0.2         2
Cmp&Branch                          0.2/0.8 = 0.25    2
Others          0.6         1       0.6/0.8 = 0.75    1

                    Machine A   Machine B
Clock rate          1.25        1
Instruction count   1           0.8

Which CPU performs better? Why?

7. Example 2 (Solution)

\[ CPI_A = 0.6 \times 1 + 0.2 \times 2 + 0.2 \times 1 = 1.2 \]
\[ CPI_B = \frac{0.6}{0.8} \times 1 + \frac{0.2}{0.8} \times 2 = 0.75 + 0.5 = 1.25 \]
\[ \frac{Perf_A}{Perf_B} = \frac{Clockrate_A / (CPI_A \times IC_A)}{Clockrate_B / (CPI_B \times IC_B)} = \frac{1.25 \times Clockrate / (1.2 \times IC)}{Clockrate / (1.25 \times 0.8 \times IC)} = \frac{1.25}{1.2} = 1.04 \]

CPU A therefore performs about 4% better: its clock-rate advantage outweighs B's lower instruction count.
If the clock rate of A were only 1.1 times the clock rate of B, then CPU B would have about 9% higher performance (1.2 / 1.1 ≈ 1.09).

8. Example 3
A LOAD/STORE machine has the characteristics shown below.
We also observe that 25% of the ALU operations directly use a loaded value that is not used again. Thus we hope to improve things by adding new ALU instructions that have one source operand in memory. The CPI of the new instructions is 2. The only unpleasant consequence of this change is that the CPI of branch instructions will increase from 2 to 3. Overall, will CPU performance increase?

Instruction type   Frequency   CPI
ALU ops            0.43        1
Loads              0.21        2
Stores             0.12        2
Branches           0.24        2

9. Example 3 (Solution)

Before change (same instruction mix as in the table above):
\[ CPI = 0.43 \times 1 + (0.21 + 0.12 + 0.24) \times 2 = 1.57 \]
\[ \text{CPU time} = IC \times CPI \times \text{Clock cycle time} = 1.57 \times IC \times T \]

After change:

Instruction type   Frequency         CPI
ALU ops            (0.43-x)/(1-x)    1
Loads              (0.21-x)/(1-x)    2
Stores             0.12/(1-x)        2
Branches           0.24/(1-x)        3
Reg-mem ops        x/(1-x)           2

\[ x = \frac{0.43}{4} = 0.1075 \]
\[ CPI = \frac{(0.43 - x) \times 1 + (0.21 - x + 0.12) \times 2 + x \times 2 + 0.24 \times 3}{1 - x} = \frac{1.7025}{0.8925} = 1.908 \]
\[ \text{CPU time} = (1 - x) \times IC \times CPI \times \text{Clock cycle time} = 1.703 \times IC \times T \]

Since CPU time increases, change will not improve performance.

10. Example 4
A load-store machine has the characteristics shown below. An optimizing compiler for the machine discards 50% of the ALU operations, although it cannot reduce loads, stores, or branches. Assuming a 500 MHz (2 ns) clock, what is the MIPS rating for optimized code versus unoptimized code? Does the ranking of MIPS agree with the ranking of execution time?
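The Amdahl's Law reasoning used for Example 1 can be checked numerically. The following is a minimal Python sketch, not part of the lecture: the function name `amdahl_speedup` and its parameter names are assumptions introduced here for illustration. It treats the enhanced portion as a fraction of the original execution time, as the example states (20% for FP sqrt, 50% for all FP operations).

```python
def amdahl_speedup(fraction_enhanced: float, speedup_enhanced: float) -> float:
    """Overall speedup when `fraction_enhanced` of the original execution
    time is accelerated by a factor of `speedup_enhanced`.

    Hypothetical helper for illustration; names are not from the slides.
    """
    if not 0.0 <= fraction_enhanced <= 1.0:
        raise ValueError("fraction_enhanced must lie in [0, 1]")
    if speedup_enhanced <= 0.0:
        raise ValueError("speedup_enhanced must be positive")
    # New execution time, relative to an original time of 1.0:
    # the untouched part plus the enhanced part shrunk by its speedup.
    new_time = (1.0 - fraction_enhanced) + fraction_enhanced / speedup_enhanced
    return 1.0 / new_time


# Example 1: FP sqrt is 20% of time and gets 10x faster,
# versus all FP ops being 50% of time and getting 2x faster.
sqrt_only = amdahl_speedup(0.20, 10.0)
all_fp = amdahl_speedup(0.50, 2.0)

print(f"Improve FP sqrt only: {sqrt_only:.2f}")
print(f"Improve all FP ops:   {all_fp:.2f}")
```

Under these assumptions the sketch reproduces the ratios 1.22 and 1.33 from the solution, so the smaller speedup applied to the larger fraction of time is the more effective change.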
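Examples 2 and 3 both apply the relation CPU time = IC x CPI x clock cycle time. The sketch below, written in Python as an illustration only, recomputes their numbers. The helper names `cpi` and `cpu_time` and all variable names are assumptions introduced here, and times are in relative units (instruction count and clock rate are normalized as in the examples).

```python
import math


def cpi(mix):
    """Average CPI for an instruction mix given as (frequency, cycles) pairs.

    Hypothetical helper; frequencies are expected to sum to 1.
    """
    total = sum(freq for freq, _ in mix)
    if not math.isclose(total, 1.0, abs_tol=1e-9):
        raise ValueError("instruction frequencies must sum to 1")
    return sum(freq * cycles for freq, cycles in mix)


def cpu_time(instruction_count, mix, clock_cycle_time):
    """CPU time = instruction count x CPI x clock cycle time."""
    return instruction_count * cpi(mix) * clock_cycle_time


# Example 2: relative instruction counts and clock rates from the table.
mix_a = [(0.2, 1), (0.2, 2), (0.6, 1)]   # compare, branch, others
mix_b = [(0.25, 2), (0.75, 1)]           # compare-and-branch, others
time_a = cpu_time(1.0, mix_a, 1.0 / 1.25)  # clock rate 1.25
time_b = cpu_time(0.8, mix_b, 1.0 / 1.0)   # clock rate 1
perf_a_over_b = time_b / time_a            # performance is 1 / time

# Example 3: x is the fraction of the original instructions (loads)
# absorbed into the new register-memory ALU instructions.
x = 0.25 * 0.43
scale = 1.0 - x  # new instruction count relative to the old one
mix_before = [(0.43, 1), (0.21, 2), (0.12, 2), (0.24, 2)]
mix_after = [
    ((0.43 - x) / scale, 1),  # ALU ops
    ((0.21 - x) / scale, 2),  # loads
    (0.12 / scale, 2),        # stores
    (0.24 / scale, 3),        # branches, CPI raised from 2 to 3
    (x / scale, 2),           # new reg-mem ops
]
time_before = cpu_time(1.0, mix_before, 1.0)
time_after = cpu_time(scale, mix_after, 1.0)

print(f"Example 2: Perf_A / Perf_B = {perf_a_over_b:.2f}")
print(f"Example 3: CPU time before = {time_before:.4f}, after = {time_after:.4f}")
```

Keeping the instruction mix as a list of (frequency, cycles) pairs makes it easy to rerun the comparison with a different clock rate, for instance 1.1 instead of 1.25 for Machine A.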