UT CS 429H - Pipelining I - D511035

School: University of Texas at Austin

Course: CS 429H - Computer Organization and Architecture: Honors

Pages: 30

Systems I: Pipelining I

Topics
  - Pipelining principles
  - Pipeline overheads
  - Pipeline registers and stages

Overview
  - What's wrong with the sequential (SEQ) Y86?
      - It's slow!
      - Each piece of hardware is used only a small fraction of the time
      - We would like to find a way to get more performance with only a little more hardware
  - General principles of pipelining
      - Goal
      - Difficulties
  - Creating a pipelined Y86 processor
      - Rearranging SEQ
      - Inserting pipeline registers
      - Problems with data and control hazards

Real-World Pipelines: Car Washes
  [Figure: three car-wash arrangements labeled Sequential, Parallel, and Pipelined]
  - Idea
      - Divide the process into independent stages
      - Move objects through the stages in sequence
      - At any given time, multiple objects are being processed

Laundry Example
  - Ann, Brian, Cathy, and Dave (A, B, C, D) each have one load of clothes to wash, dry, and fold
  - Washer takes 30 minutes
  - Dryer takes 30 minutes
  - "Folder" takes 30 minutes
  - "Stasher" takes 30 minutes to put clothes into drawers
  (Slide courtesy of D. Patterson)

Sequential Laundry
  [Figure: timeline from 6 PM to 2 AM; tasks A, B, C, D run one after another, each as four 30-minute steps]
  - Sequential laundry takes 8 hours for 4 loads
  - If they learned pipelining, how long would laundry take?
  (Slide courtesy of D. Patterson)

Pipelined Laundry: Start ASAP
  [Figure: timeline starting at 6 PM; tasks A, B, C, D overlap, each starting 30 minutes after the previous one]
  - Pipelined laundry takes 3.5 hours for 4 loads
  (Slide courtesy of D. Patterson)

Pipelining Lessons
  [Figure: pipelined laundry timeline for tasks A, B, C, D]
  - Pipelining doesn't help the latency of a single task; it helps the throughput of the entire workload
  - Multiple tasks operate simultaneously using different resources
  - Potential speedup = number of pipe stages
  - Pipeline rate is limited by the slowest pipeline stage
  - Unbalanced lengths of pipe stages reduce speedup
  - Time to "fill" the pipeline and time to "drain" it reduce speedup
  - Stall for dependences
  (Slide courtesy of D. Patterson)

Latency and Throughput
  - Latency: time to complete an operation
  - Throughput: work completed per unit time
  - Consider plumbing
      - Low latency: turn on the faucet and water comes out
      - High bandwidth: lots of water (e.g., to fill a pool)
  - What is "high-speed Internet"?
      - Low latency: needed for interactive gaming
      - High bandwidth: needed for downloading large files
      - Marketing departments like to conflate latency and bandwidth

Relationship between Latency and Throughput
  - Latency and bandwidth are only loosely coupled
      - Henry Ford: assembly lines increase bandwidth without reducing latency
  - My factory takes 1 day to make a Model T Ford
      - But I can start building a new car every 10 minutes
      - At 24 hrs/day, I can make 24 × 6 = 144 cars per day
      - A special order for 1 green car still takes 1 day
      - Throughput is increased, but latency is not
  - Latency reduction is difficult
  - Often, one can buy bandwidth
      - E.g., more memory chips, more disks, more computers
      - Big server farms (e.g., Google) are high bandwidth

Computational Example
  [Figure: combinational logic (300 ps) feeding a register (20 ps), driven by a clock. Delay = 320 ps, Throughput = 3.12 GOPS]
  - System
      - Computation requires a total of 300 picoseconds
      - Additional 20 picoseconds to save the result in the register
      - Must have a clock cycle of at least 320 ps

3-Way Pipelined Version
  [Figure: comb. logic A (100 ps), register (20 ps), comb. logic B (100 ps), register (20 ps), comb. logic C (100 ps), register (20 ps), driven by a clock. Delay = 360 ps, Throughput = 8.33 GOPS]
  - System
      - Divide the combinational logic into 3 blocks of 100 ps each
      - Can begin a new operation as soon as the previous one passes through stage A
          - Begin a new operation every 120 ps
      - Overall latency increases
          - 360 ps from start to finish

Pipeline Diagrams
  - Unpipelined

        OP1  |-------|
        OP2          |-------|
        OP3                  |-------|
                                        Time ->

      - Cannot start a new operation until the previous one completes
  - 3-way pipelined

        OP1   A  B  C
        OP2      A  B  C
        OP3         A  B  C
                              Time ->

      - Up to 3 operations in process simultaneously

Operating a Pipeline
  [Figure: snapshots of the 3-way pipeline (comb. logic A, B, C of 100 ps each, separated by 20 ps registers) at times 239, 241, 300, and 359 while OP1, OP2, and OP3 advance through stages A, B, and C; time axis marked 0, 120, 240, 360, 480, 640]

Limitations: Nonuniform Delays
  [Figure: comb. logic A (50 ps), register (20 ps), comb. logic B (150 ps), register (20 ps), comb. logic C (100 ps), register (20 ps), with OP1, OP2, OP3 moving through stages A, B, C. Delay = 510 ps, Throughput = 5.88 GOPS]
  - Throughput is limited by the slowest stage
  - Other stages sit idle for much of the time
  - Challenging to partition a system into balanced stages

Limitations: Register Overhead
  [Figure: six stages of comb. logic (50 ps each), each followed by a register (20 ps), driven by a clock. Delay = 420 ps, Throughput = 14.29 GOPS]
  - As we try to deepen the pipeline, the overhead of loading registers becomes more significant
  - Percentage of the clock cycle spent loading the register:
      - 1-stage pipeline: 6.25%
      - 3-stage pipeline: 16.67%
      - 6-stage pipeline: 28.57%
  - High speeds of modern processor designs are obtained through very deep pipelining

CPU Performance Equation
  - 3 components to execution time:

    $$\text{CPU time} = \frac{\text{Seconds}}{\text{Program}} = \frac{\text{Instructions}}{\text{Program}} \times \frac{\text{Cycles}}{\text{Instruction}} \times \frac{\text{Seconds}}{\text{Cycle}}$$

  - Factors affecting CPU execution time:
    [Table: columns Program, Compiler, Inst. Set, Organization, MicroArch, Technology; rows Inst. Count (3 X marks), CPI (4 X marks), Clock Rate (4 X marks); the column positions of the X marks are not recoverable from this preview]
  - Consider all three elements when optimizing
  - Workloads change!

Cycles Per Instruction (CPI)
  - Depends on the instruction:

    $$\mathrm{CPI}_i = \text{Execution time of instruction } i \times \text{Clock Rate}$$

  - Average cycles per instruction:

    $$\mathrm{CPI} = \sum_{i=1}^{n} \mathrm{CPI}_i \times F_i \quad \text{where } F_i = \frac{IC_i}{IC_{tot}}$$

  - Example:

        Op       Freq   Cycles   CPI(i)   % time
        ALU      50%    1        0.5      33%
        Load     20%    2        0.4      27%
        Store    10%    2        0.2      13%
        Branch   20%    2        0.4      27%
        CPI(total)               1.5

Comparing and Summarizing Performance
  - Fair way to summarize performance?
  - Capture in a single number?
  - Example: which of the following machines is best?

                     Computer A   Computer B   Computer C
        Program 1         1           10           20
        Program 2      1000          100           20
        Total Time     1001          110           40

Means
  - Arithmetic mean: $\frac{1}{n} \sum_{i=1}^{n} T_i$
      - Can be weighted: $\sum a_i T_i$
      - Represents total execution time
      - Should not be used for aggregating normalized numbers
  - Geometric mean: $\left( \prod_{i=1}^{n} T_i \right)^{1/n}$
      - Consistent, independent of reference
      - Best for combining results
      - Best for normalized results
      - $\ln(\text{Geo}) = \frac{1}{n} \sum_{i=1}^{n} \ln(T_i)$

What is the geometric mean of 2 and 8?
  - A. 5
  - B. 4

Is Speed the Last Word in Performance?
  - Depends on the application!
  - Cost
      - Not just the processor, but other components (i.e., memory)
  - Power consumption
      - Trade power for performance in many applications
  …
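
The delay and throughput figures on the Computational Example, 3-Way Pipelined Version, and Limitations slides above can be checked with a few lines of arithmetic. The following is a minimal sketch, assuming every stage is followed by a register with a 20 ps load time and that the clock period is set by the slowest stage; the function name `pipeline_timing` and the `configs` dictionary are illustrative and do not come from the slides.

```python
def pipeline_timing(stage_delays_ps, reg_delay_ps=20):
    """Timing of a pipeline whose stages are each followed by a register.

    Returns (clock_ps, latency_ps, throughput_gops, reg_overhead_pct).
    """
    # The clock must accommodate the slowest stage plus one register load.
    clock_ps = max(stage_delays_ps) + reg_delay_ps
    # Every operation spends one full clock period in each stage.
    latency_ps = clock_ps * len(stage_delays_ps)
    # One operation completes per cycle: 1 / (clock_ps * 1e-12 s) = 1000 / clock_ps GOPS.
    throughput_gops = 1000.0 / clock_ps
    # Share of each clock period spent loading the register.
    reg_overhead_pct = 100.0 * reg_delay_ps / clock_ps
    return clock_ps, latency_ps, throughput_gops, reg_overhead_pct


# Stage delays (in ps) taken from the slides.
configs = {
    "unpipelined": [300],
    "3-way balanced": [100, 100, 100],
    "3-way nonuniform": [50, 150, 100],
    "6-way balanced": [50] * 6,
}

for name, stages in configs.items():
    clock, latency, gops, overhead = pipeline_timing(stages)
    print(f"{name:18s} clock={clock} ps  latency={latency} ps  "
          f"throughput={gops:.2f} GOPS  register overhead={overhead:.2f}%")
```

Under these assumptions the nonuniform case makes the cost of imbalance visible: the 150 ps stage forces a 170 ps clock, so the 50 ps and 100 ps stages sit idle for much of each cycle.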
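
The average-CPI example and the CPU performance equation above can be worked through in code. This is a small sketch using the instruction mix from the example table; the helper names (`average_cpi`, `time_shares`, `cpu_time_seconds`) are illustrative, and the instruction count and clock rate in the last line are hypothetical values chosen only to exercise the equation.

```python
# Instruction mix from the example table: op -> (frequency F_i, cycles CPI_i).
MIX = {
    "ALU": (0.50, 1),
    "Load": (0.20, 2),
    "Store": (0.10, 2),
    "Branch": (0.20, 2),
}


def average_cpi(mix):
    """Weighted average: sum over i of CPI_i * F_i."""
    return sum(freq * cycles for freq, cycles in mix.values())


def time_shares(mix):
    """Fraction of execution time spent in each instruction class."""
    total = average_cpi(mix)
    return {op: freq * cycles / total for op, (freq, cycles) in mix.items()}


def cpu_time_seconds(instruction_count, cpi, clock_rate_hz):
    """CPU time = instructions * cycles/instruction * seconds/cycle."""
    return instruction_count * cpi / clock_rate_hz


cpi = average_cpi(MIX)
print(f"average CPI = {cpi:.2f}")
for op, share in time_shares(MIX).items():
    print(f"{op:6s} {share:.0%} of execution time")

# Hypothetical workload (not from the slides): 1e9 instructions at 2 GHz.
print(f"CPU time = {cpu_time_seconds(1e9, cpi, 2e9):.3f} s")
```

Note that ALU instructions are half of the mix by count but only about a third of the execution time, because the other classes take two cycles each.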
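
The claim on the Means slide that the geometric mean is consistent regardless of the reference machine, while the arithmetic mean should not be used on normalized numbers, can be illustrated with the three-computer example above. This is a sketch only; the function names and the `TIMES` dictionary are illustrative, and the geometric mean is computed through the logarithmic form given on the slide.

```python
import math


def arithmetic_mean(times):
    return sum(times) / len(times)


def geometric_mean(times):
    # ln(Geo) = (1/n) * sum(ln(T_i)), so Geo = exp of the mean log.
    return math.exp(sum(math.log(t) for t in times) / len(times))


def normalized(times, reference):
    """Execution times divided element-wise by a reference machine's times."""
    return [t / r for t, r in zip(times, reference)]


# Execution times for Program 1 and Program 2, from the comparison table.
TIMES = {"A": [1, 1000], "B": [10, 100], "C": [20, 20]}

# The quiz question: geometric vs. arithmetic mean of 2 and 8.
print(geometric_mean([2, 8]), arithmetic_mean([2, 8]))

for ref in ("A", "B"):
    print(f"normalized to computer {ref}:")
    for name, times in TIMES.items():
        norm = normalized(times, TIMES[ref])
        print(f"  {name}: arithmetic={arithmetic_mean(norm):.3f}  "
              f"geometric={geometric_mean(norm):.3f}")
```

With these numbers the arithmetic mean of normalized times makes B look slower than A when A is the reference and A look slower than B when B is the reference, whereas the geometric mean gives the same relative ordering under either reference.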
