UNC-Chapel Hill COMP 206 - LECTURE NOTES - D1966768



School name: University of North Carolina at Chapel Hill

Course: COMP 206 - Computer Architecture and Implementation

Pages 39



Slide titles:
  COMP 206: Computer Architecture and Implementation
  Outline
  Quantitative Principles of Computer Design
  Example 1 (see HP3 pp. 42-45 for more examples)
  Example 1 (Soln. using Amdahl’s Law)
  Example 2
  Example 2 (Solution)
  Example 3
  Example 3 (Solution)
  Example 4
  Example 4 (Solution)
  Performance of (Blocking) Caches
  Example
  Means
  Weighted Means
  Relations among Means
  Summarizing Computer Performance
  Arithmetic Mean for Times
  Harmonic Mean for Rates
  Avoid the Geometric Mean
  Programs to Evaluate Performance
  SPEC: Std Perf Evaluation Corp
  SPEC95 Details
  Trends in Integer Performance
  Trends in Floating Point Performance
  SPEC95 Ratings of Processors
  SPEC95 vs SPEC CPU2000
  SPEC CPU2000 Example
  Performance Evaluation
  Cost of Integrated Circuits
  Explanations
  Real World Examples
  Moore’s Law
  Moore’s Law in Action at Intel
  Moore’s Law At Risk?
  Where Do The Transistors Go?
  Chip Photographs
  Embedded Processors
  Power-Performance Tradeoff (Embedded)

COMP 206: Computer Architecture and Implementation
Montek Singh
Wed., Sep 8, 2003
Lecture 3

Outline
  Examples (contd. from previous lecture)
  Benchmarks
  Cost
  Moore’s Law

Quantitative Principles of Computer Design

  $P = \frac{1}{T}$

  Time ($T$): execution time, response time, latency.
    Units: time / work (time / result, time / program, time / instruction, time / bit).
  Performance ($P$): rate of producing results, throughput, bandwidth.
    Units: work / time (results / time, programs / time, instructions / time, bits / time).

Example 1 (see HP3 pp. 42-45 for more examples)

Which change is more effective on a certain machine: speeding up 10-fold the floating point square root operation only, which takes up 20% of execution time, or speeding up 2-fold all floating point operations, which take up 50% of total execution time? (Assume that the cost of accomplishing either change is the same, and the two changes are mutually exclusive.)

  $F_{\text{sqrt}}$ = fraction of FP sqrt results
  $R_{\text{sqrt}}$ = rate of producing FP sqrt results
  $F_{\text{non-sqrt}}$ = fraction of non-sqrt results
  $R_{\text{non-sqrt}}$ = rate of producing non-sqrt results
  $F_{\text{fp}}$ = fraction of FP results
  $R_{\text{fp}}$ = rate of producing FP results
  $F_{\text{non-fp}}$ = fraction of non-FP results
  $R_{\text{non-fp}}$ = rate of producing non-FP results
  $R_{\text{before}}$ = average rate of producing results before enhancement
  $R_{\text{after}}$ = average rate of producing results after enhancement

Example 1 (Soln. using Amdahl’s Law)

Improve FP sqrt only. Let $x = F_{\text{sqrt}} / R_{\text{sqrt}}$. Since sqrt takes 20% of execution time, $F_{\text{non-sqrt}} / R_{\text{non-sqrt}} = 4x$.

  $\frac{1}{R_{\text{before}}} = \frac{F_{\text{non-sqrt}}}{R_{\text{non-sqrt}}} + \frac{F_{\text{sqrt}}}{R_{\text{sqrt}}} = 4x + x = 5x$

  $\frac{1}{R_{\text{after}}} = \frac{F_{\text{non-sqrt}}}{R_{\text{non-sqrt}}} + \frac{F_{\text{sqrt}}}{10\,R_{\text{sqrt}}} = 4x + 0.1x = 4.1x$

  $\frac{R_{\text{after}}}{R_{\text{before}}} = \frac{5x}{4.1x} \approx 1.22$

Improve all FP ops. Let $y = F_{\text{fp}} / R_{\text{fp}}$. Since FP operations take 50% of execution time, $F_{\text{non-fp}} / R_{\text{non-fp}} = y$.

  $\frac{1}{R_{\text{before}}} = \frac{F_{\text{non-fp}}}{R_{\text{non-fp}}} + \frac{F_{\text{fp}}}{R_{\text{fp}}} = y + y = 2y$

  $\frac{1}{R_{\text{after}}} = \frac{F_{\text{non-fp}}}{R_{\text{non-fp}}} + \frac{F_{\text{fp}}}{2\,R_{\text{fp}}} = y + 0.5y = 1.5y$

  $\frac{R_{\text{after}}}{R_{\text{before}}} = \frac{2y}{1.5y} \approx 1.33$

Speeding up all FP operations is therefore the more effective change (1.33 vs. 1.22).

[Bar chart: Sqrt (b), Sqrt (a), FP (b), FP (a); vertical axis from 0 to 0.9]

Example 2

                 Machine A              Machine B
  Operation      Frequency   CPI        Frequency        CPI
  Compare        0.2         1
  Branch         0.2         2
  Cmp&Branch                            0.2/0.8 = 0.25   2
  Others         0.6         1          0.6/0.8 = 0.75   1

                       Machine A   Machine B
  Clock rate           1.25        1
  Instruction count    1           0.8

Which CPU performs better? Why?

Example 2 (Solution)

  $\text{CPI}_A = 0.2 \times 1 + 0.2 \times 2 + 0.6 \times 1 = 1.2$

  $\text{CPI}_B = \frac{0.2}{0.8} \times 2 + \frac{0.6}{0.8} \times 1 = 0.5 + 0.75 = 1.25$

  $\frac{\text{Perf}_A}{\text{Perf}_B} = \frac{\text{Clockrate}_A / (\text{CPI}_A \times \text{IC}_A)}{\text{Clockrate}_B / (\text{CPI}_B \times \text{IC}_B)} = \frac{1.25 \times \text{Clockrate} / (1.2 \times \text{IC})}{\text{Clockrate} / (1.25 \times 0.8 \times \text{IC})} = \frac{1.25}{1.2} \approx 1.04$

CPU A is about 4% faster. If the clock rate of A were only 1.1x that of B, then CPU B would have about 9% higher performance.

Example 3

A LOAD/STORE machine has the characteristics shown below. We also observe that 25% of the ALU operations directly use a loaded value that is not used again. Thus we hope to improve things by adding new ALU instructions that have one source operand in memory. The CPI of the new instructions is 2. The only unpleasant consequence of this change is that the CPI of branch instructions will increase from 2 to 3. Overall, will CPU performance increase?

  Instruction type   Frequency   CPI
  ALU ops            0.43        1
  Loads              0.21        2
  Stores             0.12        2
  Branches           0.24        2

Example 3 (Solution)

Before change:

  Instruction type   Frequency   CPI
  ALU ops            0.43        1
  Loads              0.21        2
  Stores             0.12        2
  Branches           0.24        2

  $\text{CPI} = 0.43 \times 1 + (0.21 + 0.12 + 0.24) \times 2 = 1.57$

  $\text{CPU time} = \text{IC} \times \text{CPI} \times \text{Clock cycle time} = 1.57 \times \text{IC} \times T$

After change:

  Instruction type   Frequency        CPI
  ALU ops            (0.43-x)/(1-x)   1
  Loads              (0.21-x)/(1-x)   2
  Stores             0.12/(1-x)       2
  Branches           0.24/(1-x)       3
  Reg-mem ops        x/(1-x)          2

  $x = \frac{0.43}{4} = 0.1075$

  $\text{CPI} = \frac{(0.43 - x) \times 1 + (0.21 - x + 0.12) \times 2 + x \times 2 + 0.24 \times 3}{1 - x} = \frac{1.7025}{0.8925} \approx 1.908$

  $\text{CPU time} = (1 - x) \times \text{IC} \times \text{CPI} \times \text{Clock cycle time} = 1.703 \times \text{IC} \times T$

Since CPU time increases, the change will not improve performance.

Example 4

A load-store machine has the characteristics shown below. An optimizing compiler for the machine discards 50% of the ALU operations, although it cannot reduce loads, stores, or branches. Assuming a 500 MHz (2 ns) clock, what is the MIPS rating for optimized code versus unoptimized code? Does the ranking of MIPS agree with the ranking of execution time?
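The two alternatives in Example 1 can be checked numerically. The sketch below uses the usual time-fraction form of Amdahl's Law rather than the rate-based notation of the slides; the function name `amdahl_speedup` is an illustrative choice, not something from the lecture.

```python
def amdahl_speedup(fraction_enhanced: float, speedup_enhanced: float) -> float:
    """Overall speedup when `fraction_enhanced` of the original execution
    time is made `speedup_enhanced` times faster (Amdahl's Law)."""
    return 1.0 / ((1.0 - fraction_enhanced) + fraction_enhanced / speedup_enhanced)


# Option 1: FP square root only (20% of execution time, 10x faster).
sqrt_only = amdahl_speedup(0.20, 10.0)

# Option 2: all FP operations (50% of execution time, 2x faster).
all_fp = amdahl_speedup(0.50, 2.0)

print(f"sqrt only: {sqrt_only:.2f}")
print(f"all FP:    {all_fp:.2f}")
```

Under these inputs the second option yields the larger overall speedup, which agrees with the ratios worked out in the Example 1 solution.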
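The comparison in Example 2 can be sketched the same way. The helper names `weighted_cpi` and `relative_perf` are hypothetical; the instruction mixes, clock rates, and instruction counts are the relative values from the Example 2 tables. The last ratio assumes, as in the closing remark of the solution, that A's clock is only 1.1x faster than B's.

```python
def weighted_cpi(mix):
    """Average CPI for a list of (frequency, cycles per instruction) pairs."""
    return sum(freq * cycles for freq, cycles in mix)


def relative_perf(clock_rate, instr_count, cpi):
    """Performance is proportional to clock rate / (instruction count * CPI)."""
    return clock_rate / (instr_count * cpi)


# Machine A: separate compare and branch instructions.
cpi_a = weighted_cpi([(0.2, 1), (0.2, 2), (0.6, 1)])

# Machine B: fused compare-and-branch, so 20% fewer instructions overall.
cpi_b = weighted_cpi([(0.2 / 0.8, 2), (0.6 / 0.8, 1)])

# A's clock is 1.25x faster; B executes 0.8x as many instructions.
a_over_b = relative_perf(1.25, 1.0, cpi_a) / relative_perf(1.0, 0.8, cpi_b)

# Variation: A's clock is only 1.1x faster.
b_over_a_slow_clock = relative_perf(1.0, 0.8, cpi_b) / relative_perf(1.1, 1.0, cpi_a)

print(f"CPI A = {cpi_a:.2f}, CPI B = {cpi_b:.2f}")
print(f"Perf A / Perf B = {a_over_b:.2f}")
print(f"Perf B / Perf A with 1.1x clock = {b_over_a_slow_clock:.2f}")
```

Keeping clock rate, instruction count, and CPI as separate arguments makes it easy to see which of the three factors changes the outcome.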
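The Example 3 arithmetic is easy to get wrong by hand, so a short numerical check may help. This is a sketch under the slide's assumptions: 25% of ALU operations are paired with a load and replaced by a register-memory instruction with CPI 2, and branch CPI rises from 2 to 3. Times are expressed in units of IC x T (original instruction count times clock cycle time); the variable names are illustrative.

```python
# Instruction mix before the change: frequency of each instruction type.
freq = {"alu": 0.43, "load": 0.21, "store": 0.12, "branch": 0.24}

# Before: ALU ops take 1 cycle, everything else takes 2.
cpi_before = freq["alu"] * 1 + (freq["load"] + freq["store"] + freq["branch"]) * 2
time_before = cpi_before  # in units of IC * T

# After: x of the original instructions (an ALU op plus its load) are merged
# into register-memory ops, so both the ALU and load counts drop by x.
x = 0.25 * freq["alu"]
cycles_after = (
    (freq["alu"] - x) * 1      # remaining ALU ops
    + (freq["load"] - x) * 2   # remaining loads
    + freq["store"] * 2        # stores unchanged
    + x * 2                    # new register-memory ops
    + freq["branch"] * 3       # branches now cost 3 cycles
)
ic_after = 1.0 - x             # relative instruction count
cpi_after = cycles_after / ic_after
time_after = ic_after * cpi_after  # in units of IC * T

print(f"before: CPI = {cpi_before:.3f}, time = {time_before:.3f} IC*T")
print(f"after:  CPI = {cpi_after:.3f}, time = {time_after:.3f} IC*T")
```

Note that the instruction count falls while the total cycle count rises, which is why CPI alone would overstate the slowdown and CPU time is the quantity to compare.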
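The question in Example 4 turns on the definition MIPS = clock rate / (CPI x 10^6). The sketch below shows how MIPS and execution time can be compared for two instruction mixes. The characteristics table for Example 4 is not part of this text, so the mix used here is an assumption: it reuses the Example 3 table purely as a stand-in. The helper `summarize` and the dictionary names are likewise hypothetical.

```python
CLOCK_HZ = 500e6  # 500 MHz clock, as stated in Example 4


def summarize(mix, clock_hz):
    """mix maps instruction type -> (relative instruction count, CPI).
    Returns (MIPS rating, execution time in seconds per original instruction)."""
    instr = sum(count for count, _ in mix.values())
    cycles = sum(count * cpi for count, cpi in mix.values())
    cpi = cycles / instr
    mips = clock_hz / (cpi * 1e6)
    return mips, cycles / clock_hz


# ASSUMPTION: stand-in mix borrowed from the Example 3 table, since the
# Example 4 table is not available here.
unoptimized = {"alu": (0.43, 1), "load": (0.21, 2), "store": (0.12, 2), "branch": (0.24, 2)}

# The optimizing compiler discards 50% of the ALU operations only.
optimized = dict(unoptimized)
optimized["alu"] = (0.43 * 0.5, 1)

mips_unopt, time_unopt = summarize(unoptimized, CLOCK_HZ)
mips_opt, time_opt = summarize(optimized, CLOCK_HZ)

print(f"unoptimized: {mips_unopt:.1f} MIPS, {time_unopt * 1e9:.2f} ns per original instruction")
print(f"optimized:   {mips_opt:.1f} MIPS, {time_opt * 1e9:.2f} ns per original instruction")
```

With this stand-in mix, removing cheap ALU instructions raises the average CPI, so the optimized code gets a lower MIPS rating even though it finishes sooner; the actual figures depend on the table the lecture uses.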
