CS152 Computer Architecture and Engineering
Lecture 25
Low Power Design
Advanced Intel Processors
May 3, 2004
John Kubiatowicz (http.cs.berkeley.edu/~kubitron)
lecture slides: http://www-inst.eecs.berkeley.edu/~cs152/

Recap: I/O Summary
  I/O performance is limited by the weakest link in the chain between the OS and the device.
  Queueing theory is important:
    100% utilization means very large latency.
    Remember, for an M/M/1 queue (exponential source of requests/service):
      queue size goes as $u/(1-u)$
      latency goes as $T_{ser} \times u/(1-u)$
    For an M/G/1 queue (more general server, exponential sources):
      latency goes as $m_1(z) \times u/(1-u) = T_{ser} \times \{1/2 \times (1+C)\} \times u/(1-u)$
  Three components of disk access time:
    Seek time: advertised to be 8 to 12 ms; may be lower in real life.
    Rotational latency: 4.1 ms at 7200 RPM and 8.3 ms at 3600 RPM.
    Transfer time: 2 to 50 MB per second.
  I/O device notifying the operating system:
    Polling: it can waste a lot of processor time.
    I/O interrupt: similar to an exception, except that it is asynchronous.
  Delegating I/O responsibility from the CPU: DMA, or even IOP.

5/03/04  UCB Spring 2004  CS152 / Kubiatowicz

Slides Borrowed from Bob Brodersen
Low Power Design
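The recap formulas for queueing delay and rotational latency can be checked numerically. The following is a minimal sketch, not part of the original slides: the function names are hypothetical, `c` is assumed to be the squared coefficient of variation of the service time (so `c = 1` reduces M/G/1 to M/M/1), and rotational latency is assumed to be half a revolution on average.

```python
def _check_utilization(u: float) -> None:
    """Reject utilizations outside [0, 1); the formulas diverge as u -> 1."""
    if not 0.0 <= u < 1.0:
        raise ValueError("utilization u must satisfy 0 <= u < 1")


def mm1_queue_size(u: float) -> float:
    """M/M/1 queue size from the recap: u / (1 - u)."""
    _check_utilization(u)
    return u / (1.0 - u)


def mm1_latency(t_ser: float, u: float) -> float:
    """M/M/1 queueing latency from the recap: Tser * u / (1 - u)."""
    return t_ser * mm1_queue_size(u)


def mg1_latency(t_ser: float, u: float, c: float) -> float:
    """M/G/1 queueing latency: Tser * 1/2 * (1 + C) * u / (1 - u).

    Assumption: c is the squared coefficient of variation of service time.
    """
    _check_utilization(u)
    return t_ser * 0.5 * (1.0 + c) * u / (1.0 - u)


def avg_rotational_latency_ms(rpm: float) -> float:
    """Average rotational latency, assumed to be half of one revolution."""
    ms_per_revolution = 60_000.0 / rpm
    return 0.5 * ms_per_revolution


if __name__ == "__main__":
    # Latency grows sharply as utilization approaches 100%.
    for u in (0.5, 0.9, 0.99):
        print(f"u={u}: M/M/1 latency = {mm1_latency(10.0, u):.1f} ms")
    print(f"7200 RPM: {avg_rotational_latency_ms(7200):.2f} ms")
    print(f"3600 RPM: {avg_rotational_latency_ms(3600):.2f} ms")
```

With a service time of 10 ms, the delay at 50% utilization equals one service time, while at 99% it is roughly a hundred times larger, which is the point of the "100% utilization means very large latency" bullet.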
Back to original goal: Processor Usage Model
  [Figure: desired throughput vs. time]
    Compute-intensive and low-latency processes: ceiling set by top speed of the processor.
    Single-user system: not always computing.
    Background and high-latency processes.
  System optimizations:
    Maximize peak throughput.
    Minimize average energy/operation (maximize computation per battery life).

Typical Usage
  [Figure: delivered throughput vs. time, showing peak and excess throughput]
  Wake up, compute ASAP, go to idle/sleep mode.
  Always high throughput.
  Always high energy/operation.

Another approach: Reduce Frequency
  [Figure: delivered throughput vs. time, with the peak lowered as fCLK is reduced]
  Frequency set by user (PowerBook Control Panel: Slow / Fast).
  Energy/operation remains unchanged while throughput scales down with fCLK.
  Problems:
    Circuits designed to be fast are now wasted.
    Demand for peak throughput not met.

Alternative: Dynamic Voltage Scaling
  [Figure: delivered throughput vs. time; reducing throughput and fCLK reduces energy/operation]
  Dynamically scale energy/operation with throughput.
  Extend battery life by up to 10x with the same hardware.
  Key: process scheduler determines the operating point.

What about bus transitions?
  [Figure: Input -> Encoder -> encoded version -> Decoder -> Output]
  Can we reduce the total number of transitions on buses by sophisticated bus drivers?
  Can we encode information in a way that takes less power?
  Do this on chip?
  Trying to reduce the total number of transitions.

Reasoning
  Increasing importance of wires relative to transistors:
    Spend transistors to drive wires more efficiently?
    Try to reduce transitions over wires.
  Orthogonal to other power-saving techniques,
    i.e. voltage reduction, low-swing drive, clock gating, parallelism (like vectors).

Huffman-based Compression
  [Figure: Input -> Encoder -> Decoder -> Output]
  Variable bit-length problem. Possible solution: macro clock.
  Fewer bits, fewer transitions.

Context-based encoder
  Detects repeated values going across the bus.
  Shift register finds short-term frequent values.
  Frequency table holds long-term values.

Just the shift register (window-based)
  Focus on the shift register: 8 or 16 entries.
  Careful design can break even.
  Register-bus results, 16 entries, break-even at:
    7.0 mm for 0.13 µm
    2.7 mm for 0.07 µm

Administrivia
  Pending schedule:
    Wednesday 5/5: Midterm II, 5:30 - 8:30, 306 Soda Hall. No class that day (I will be having office hours).
      1 page of handwritten notes, both sides.
      Fair topics: Pipelining, Memory Systems, I/O, Disks, Queueing Theory, Power.
      Pizza at LaVal's afterwards.
    Monday 5/10: wrap-up, evaluations, etc.
    Thursday 5/13: Oral reports. Times TBA.
      Signup sheet will be on my office door next week.
      Project reports must be submitted via web by 5pm on 5/10.
    Monday 5/17: Final project reports due.
  Oral report: PowerPoint, 20-minute presentation, 5 minutes for questions.

7 Talk Commandments for a Bad Talk
  I. Thou shalt not illustrate.
  II. Thou shalt not covet brevity.
  III. Thou shalt not print large.
  IV. Thou shalt not use color.
  V. Thou shalt not skip slides in a long talk.
  VI. Thou shalt cover thy naked slides.
  VII. Thou shalt not practice.

Following all the commandments
We describe the philosophy and design of the control flow machine, and present the results of detailed simulations of the performance of a single processing element. Each factor is compared with the measured performance of an advanced von Neumann computer running equivalent code. It is shown that the control flow processor compares favorably [...]lism in the program. We present a denotational semantics for a logic program to construct a control flow for the logic program. The control flow is defined
as an algebraic manipulator of idempotent substitutions, and it virtually reflects the resolution deductions. We also present a bottom-up compilation of medium-grain clusters from a fine-grain control flow graph. We compare the basic block and the dependence sets algorithms that partition control flow graphs into clusters. Our compiling strategy is to exploit coarse-grain parallelism at function application level, and the function-application-level parallelism is implemented by fork-join mechanism. The