CS152 Computer Architecture and Engineering Lecture 25 Low Power Design, Advanced Intel Processors
Recap: I/O Summary
Slides Borrowed from Bob Broderson
Back to original goal: Processor Usage Model
Typical Usage
Another approach: Reduce Frequency
Alternative: Dynamic Voltage Scaling
What about bus transitions?
Reasoning
Huffman-based Compression
Context-based encoder
Just the Shift-register: “window-based”
Administrivia
7 Talk Commandments for a Bad Talk
Following all the commandments
Alternatives to a Bad Talk
Include in your final presentation
Review: Road to Faster Processors
Dynamic Scheduling in Pentium Pro, II, III
Dynamic Scheduling in P6 (Pentium Pro, II, III)
P6 Pipeline
P6 Block Diagram
Dynamic Scheduling in P6
Pentium III Die Photo
P6 Performance: uops/x86 instr (200 MHz, 8KI$/8KD$/256KL2$, 66 MHz bus)
P6 Performance: Speculation rate (% instructions issued that do not commit)
P6 Performance: uops commit/clock
P6 Dynamic Benefit? Sum of parts CPI vs. Actual CPI
Pentium 4 features
Pentium 4 features (Continued)
Registers
SIMD: Single Instruction Multiple Data
Pentium 4 Cache
Pentium 4 basic block diagram
Pentium 4 Trace Cache 1/4
Trace Cache Example
Full Block diagram (Intel)
Out-of-Order Execution -- Pipeline
Comparison of two architectures
Register Renaming: Pentium III vs NetBurst
Staggered ALU Add
Pentium 4 Speeds & Feeds
Pentium 4 Basic Features
Performance Comparison
SPEC 2000 Performance 3/2001 Source: Microprocessor Report,
Conclusion: Power
Conclusion: Intel

CS152
Computer Architecture and Engineering
Lecture 25
Low Power Design, Advanced Intel Processors
May 3, 2004
John Kubiatowicz (http.cs.berkeley.edu/~kubitron)
lecture slides: http://www-inst.eecs.berkeley.edu/~cs152/

5/03/04 ©UCB Spring 2004 CS152 / Kubiatowicz Lec25.2

Recap: I/O Summary
° I/O performance limited by weakest link in chain between OS and device
° Queueing theory is important
  • 100% utilization means very large latency
  • Remember, for M/M/1 queue (exponential source of requests/service)
    - queue size goes as u/(1-u)
    - latency goes as Tser × u/(1-u)
  • For M/G/1 queue (more general server, exponential sources)
    - latency goes as m1(z) × u/(1-u) = Tser × {1/2 × (1+C)} × u/(1-u)
° Three Components of Disk Access Time:
  • Seek Time: advertised to be 8 to 12 ms.
    May be lower in real life.
  • Rotational Latency: 4.1 ms at 7200 RPM and 8.3 ms at 3600 RPM
  • Transfer Time: 2 to 50 MB per second
° I/O device notifying the operating system:
  • Polling: it can waste a lot of processor time
  • I/O interrupt: similar to exception except it is asynchronous
° Delegating I/O responsibility from the CPU: DMA, or even IOP

Slides Borrowed from Bob Broderson
Low Power Design

3/4 × 1/4 = 3/16

Back to original goal: Processor Usage Model
[Figure: Desired Throughput vs. time. Labels: Ceiling: Set by top speed of the processor; Compute-intensive and low-latency processes; Background and high-latency processes; Single-user system, not always computing]
System Optimizations:
  • Maximize Peak Throughput
  • Minimize Average Energy/operation (maximize computation per battery life)
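
As a numerical illustration of the queueing relations in the I/O recap, the following Python sketch evaluates the M/M/1 and M/G/1 expressions as written on that slide. The function names are hypothetical, and C is assumed here to be the squared coefficient of variation of the service time, which the recap does not define. Under that assumption, C = 1 makes the M/G/1 expression reduce to the M/M/1 one.

```python
def mm1_queue_wait(t_ser, u):
    """Recap formula for an M/M/1 queue: Tser * u / (1 - u).

    t_ser: mean service time (any time unit)
    u:     server utilization, 0 <= u < 1
    """
    if not 0 <= u < 1:
        raise ValueError("utilization must satisfy 0 <= u < 1")
    return t_ser * u / (1 - u)


def mg1_queue_wait(t_ser, u, c):
    """Recap formula for an M/G/1 queue: Tser * (1/2) * (1 + C) * u / (1 - u).

    c is ASSUMED to be the squared coefficient of variation of service time;
    the slide itself does not define C.
    """
    if not 0 <= u < 1:
        raise ValueError("utilization must satisfy 0 <= u < 1")
    return t_ser * 0.5 * (1 + c) * u / (1 - u)


if __name__ == "__main__":
    t_ser = 10.0  # hypothetical mean service time in ms
    for u in (0.5, 0.9, 0.99):
        print(u, mm1_queue_wait(t_ser, u), mg1_queue_wait(t_ser, u, c=0.0))
```

At u = 0.5 the M/M/1 expression equals Tser, and at u = 0.9 it is nine times Tser, which is the sense in which utilization near 100% means very large latency.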

Typical Usage
[Figure: Delivered Throughput vs. time, with the Peak level and Excess throughput marked]
Wake up, compute ASAP, go to idle/sleep mode
Always high throughput
Always high energy/operation

Another approach: Reduce Frequency
[Figure: Delivered Throughput vs. time, with Peak and Reduced fCLK levels; PowerBook Control Panel with a Slow/Fast setting, frequency set by user]
Energy/operation remains unchanged... while throughput scales down with fCLK
Problems:
  • Circuits designed to be fast are now “wasted”.
  • Demand for peak throughput not met.

Alternative: Dynamic Voltage Scaling
Dynamically scale energy/operation with throughput
Extend battery life by up to 10x with the same hardware!
[Figure: Delivered Throughput vs. time, with the Peak level marked; label: Reduce throughput & fCLK, Reduce energy/operation]
Key: Process scheduler determines operating point.

What about bus transitions?
° Can we reduce the total number of transitions on buses with sophisticated bus drivers?
° Can we encode information in a way that takes less power?
  • Do this on chip?!
  • Trying to reduce total number of transitions
[Figure: Input, Encoder, Encoded Version, Decode, Output]

Reasoning
° Increasing importance of wires relative to transistors
  • Spend transistors to drive wires more efficiently?
  • Try to reduce transitions over wires
° Orthogonal to other power-saving techniques
  • E.g. voltage reduction, low-swing drive
  • Clock gating
  • Parallelism (like vectors!)

Huffman-based Compression
° Variable bit length – problem!
° Possible solution: macro clock
° Fewer bits != fewer transitions…
[Figure: Input, Encoder, Decode, Output]

Context-based encoder
° Context-based encoder
  • Detecting of repeated values going
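
The Huffman slide's remark that a shorter encoding does not necessarily toggle the wires less can be checked with a small Python sketch. It only counts transitions; it is not the encoder from the lecture, and the function names and sample bit patterns are hypothetical choices made for illustration.

```python
def serial_transitions(bits):
    """Count 0->1 and 1->0 changes in a string of '0'/'1' sent over one wire."""
    return sum(a != b for a, b in zip(bits, bits[1:]))


def bus_transitions(words):
    """Count wire toggles between successive integer words on a parallel bus.

    Each toggled wire shows up as a 1 bit in the XOR of neighbouring words.
    """
    return sum(bin(a ^ b).count("1") for a, b in zip(words, words[1:]))


if __name__ == "__main__":
    short_code = "0101"    # 4 bits, hypothetical compressed stream
    long_code = "000111"   # 6 bits, hypothetical uncompressed stream
    print(len(short_code), serial_transitions(short_code))
    print(len(long_code), serial_transitions(long_code))
    print(bus_transitions([0b0000, 0b1111, 0b0000]))
```

In this toy case the 4-bit stream makes three transitions while the 6-bit stream makes only one, so the shorter stream is the more expensive one in terms of switching.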