Berkeley COMPSCI 150 - Lecture 23 - High-level Design and Optimization 3, Parallelism and Pipelining

EECS150 Digital Design, Lecture 23: High-level Design and Optimization 3, Parallelism and Pipelining. John Wawrzynek, April 15, 2003.

Parallelism

Parallelism is the act of doing more than one thing at a time. Optimization in hardware design often involves using parallelism to trade between cost and performance.

Example: student final-grade calculation.

  read mt1, mt2, mt3, project
  grade = 0.2 × mt1 + 0.2 × mt2 + 0.2 × mt3 + 0.4 × project
  write grade

A high-performance hardware implementation computes the four products (0.2 × mt1, 0.2 × mt2, 0.2 × mt3, 0.4 × proj) on four multipliers at once and sums them with an adder tree: as many operations as possible are done in parallel.

Is there a lower-cost hardware implementation? A different tree organization helps: because three terms share the coefficient 0.2, we can factor out that multiply,

  grade = 0.2 × (mt1 + mt2 + mt3) + 0.4 × proj

which needs two multipliers instead of four. How about sharing operators (multipliers and adders)?

Time Multiplexing

Time-multiplex a single ALU for all the adds and multiplies. This attempts to minimize cost at the expense of time, and it requires extra registers, muxes, and a controller. The ALU performs one operation per cycle:

  acc1 <= mt1 + mt2
  acc1 <= acc1 + mt3
  acc1 <= 0.2 × acc1
  acc2 <= 0.4 × proj
  grade <= acc1 + acc2

If we adopt this approach, we can consider the combinational hardware circuit diagram as an abstract computation graph. This technique covers the computation graph by performing the action of each node one at a time (it sort of emulates the graph). Using other primitives, other coverings are possible; for example, a multiply-add unit computing (A × B) + C covers a multiply node and an add node in a single step.
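To make the time-multiplexed datapath concrete, here is a minimal Verilog sketch of the grade calculator above: one shared ALU operation per clock cycle, sequenced by a small FSM. The module name, ports, and the Q8.8 fixed-point encoding (with scores assumed normalized to [0,1) so the sums never overflow 16 bits) are illustrative assumptions, not taken from the slides.

```verilog
// Hypothetical sketch: time-multiplexed ALU for
//   grade = 0.2*(mt1 + mt2 + mt3) + 0.4*proj
// Scores are assumed to be Q8.8 fixed point, normalized to [0,1).
module grade_tmux #(
  parameter W = 16                     // word width (Q8.8 assumed)
)(
  input                  clk,
  input                  rst,
  input                  start,
  input      [W-1:0]     mt1, mt2, mt3, proj,
  output reg [W-1:0]     grade,
  output reg             done
);
  // 0.2 and 0.4 rounded to Q8.8 constants (assumptions, not from the slides)
  localparam [W-1:0] C02 = 16'd51;     // 51/256  = 0.199...
  localparam [W-1:0] C04 = 16'd102;    // 102/256 = 0.398...

  reg [W-1:0] acc1, acc2;
  reg [2:0]   state;

  // One shared-ALU operation per cycle, following the slide's schedule:
  //   acc1 <= mt1 + mt2; acc1 <= acc1 + mt3; acc1 <= 0.2*acc1;
  //   acc2 <= 0.4*proj;  grade <= acc1 + acc2
  always @(posedge clk) begin
    if (rst) begin
      state <= 3'd0;
      done  <= 1'b0;
    end else begin
      done <= 1'b0;
      case (state)
        3'd0: if (start) begin acc1 <= mt1 + mt2;        state <= 3'd1; end
        3'd1: begin acc1 <= acc1 + mt3;                  state <= 3'd2; end
        3'd2: begin acc1 <= (C02 * acc1) >> 8;           state <= 3'd3; end
        3'd3: begin acc2 <= (C04 * proj) >> 8;           state <= 3'd4; end
        3'd4: begin grade <= acc1 + acc2; done <= 1'b1;  state <= 3'd0; end
      endcase
    end
  end
endmodule
```

The five states correspond one-for-one to the schedule above: the fully parallel implementation finishes in one long combinational pass using four multipliers and three adders, while this version takes five cycles with one shared unit plus registers, muxes, and control. That is exactly the cost/performance knob the slides describe.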
HW versus SW

This time-multiplexed ALU approach is very similar to what a conventional software version would accomplish, since CPUs time-multiplex their function units (ALUs, etc.):

  add  r2, r1, r3
  add  r2, r2, r4
  mult r2, r4, r5

This model matches our tendency to express computation sequentially, even though many computations naturally contain parallelism, and our programming languages strengthen that tendency. In hardware we have the ability to exploit problem parallelism; it gives us a knob on performance versus cost. It may be best to express computations as abstract computation graphs rather than as programs, since that should lead to a wider range of implementations. Note that modern processors spend much of their cost budget attempting to restore execution parallelism ("superscalar" execution).

Optimizing Iterative Computations

Hardware implementations of computations almost always involve looping. Why? Is this true with software? Are there programs without loops? Maybe in throw-away code, but we probably would not bother building such a thing into hardware, would we? (FPGAs may change this.) The fact is that our computations are closely tied to loops, and almost all our hardware includes some looping mechanism. What do we use looping for?

Types of loops:
1. Looping over input data (streaming): e.g., an MP3 player, a video compressor, a music synthesizer.
2. Looping over memory data: e.g., vector inner product, matrix multiply, "list processing." Types 1 and 2 are really very similar; 1 is often turned into 2 by buffering up input data and processing it offline, and even for online processing, buffers are used to smooth out temporary rate mismatches.
3. CPUs are one big loop (instruction fetch, execute; instruction fetch, execute; ...), but they change their "personality" with each iteration.
4. Others?

Loops offer an opportunity for parallelism: execute more than one iteration at once, through parallel iteration execution and/or pipelining.

Pipelining

With looping, usually we are less interested in the latency of one iteration and more interested in the loop execution rate, or throughput. These can differ because of parallel iteration execution and/or pipelining.

Pipelining review (from CS61C), by analogy to washing clothes:

  step 1: wash (20 minutes)
  step 2: dry (20 minutes)
  step 3: fold (20 minutes)

Done serially, 60 minutes × 4 loads = 4 hours. Overlapped, load 2 washes while load 1 dries, and so on: once the pipeline is full, a load finishes every 20 minutes, and the 4 loads take 2 hours.

In the limit, as we increase the number of loads, the average time per load approaches 20 minutes. The latency (time from start to end for one load) is still 60 minutes; the throughput is 3 loads/hour. Pipelined throughput ≈ (number of pipe stages) × (unpipelined throughput).

Pipelining: the general principle. Consider a combinational logic block CL between input and output registers. Assume the CL delay is T = 8 ns and the flip-flop overhead (setup plus clk-to-Q) is T_FF = 1 ns; then F = 1/(8 ns + 1 ns) ≈ 111 MHz. Cut the CL block into pieces (stages) and separate them with registers, giving CL1 and CL2 with T1 = T2 = 4 ns. The total latency grows to 4 ns + 1 ns + 4 ns + 1 ns = 10 ns, but the clock rate rises to F = 1/(4 ns + 1 ns) = 200 MHz: the block produces a new result every 5 ns instead of every 9 ns.

Limits on Pipelining

Without FF overhead, the throughput improvement is proportional to the number of stages. After many stages are added, the FF overhead (the setup and clk-to-Q times) begins to dominate.

[Figure: throughput (1/T) versus number of stages, 1 through 8. The ideal curve grows linearly with the number of stages; the real curve bends away from it and flattens once half the clock period is spent in FF overhead.]

Other limiters to effective pipelining:
- clock skew contributes to clock overhead
- unequal stages
- FFs come to dominate cost
- clock distribution power consumption
- feedback (dependencies between loop iterations)

Pipelining Example

F(x): y_i = a·x_i² + b·x_i + c, where x and y are assumed to be streams. The computation graph squares x_i, scales the square by a, scales x_i by b, and sums the two products with c. Divide the graph into 3 nearly equal stages and insert pipeline registers at the stage boundaries (the dashed lines in the slide). Can we pipeline the basic operators themselves?

Example: Pipelined Adder

[Figure: a 4-bit ripple-carry adder built from full-adder (FA) cells, with inputs a3..a0 and b3..b0, carry-in c0, and sums s3..s0; and a pipelined version in which registers cut the carry chain and skew the input and output bits in time, so each clock period covers only part of the carry propagation.]

Pipelining Loops with Feedback

Loop-carry dependency, example 1:

  y_i = y_{i-1} + x_i + a

In the unpipelined version, add1 computes y_{i-1} + x_i and add2 adds a to produce y_i, so the feedback path passes through both adders. Can we cut the feedback and overlap iterations? Try putting a register after add1: we still can't overlap the iterations because of the dependency, and the extra register doesn't help the situation; it actually hurts, since it adds a cycle to the feedback path. In general, we can't pipeline feedback loops.

However, we can overlap the non-feedback part of the iterations. Add is associative and commutative, so we can reorder the computation to shorten the delay of the feedback path:

  y_i = (y_{i-1} + x_i) + a = (a + x_i) + y_{i-1}

Now the sums a + x_i, a + x_{i+1}, a + x_{i+2}, ... are feed-forward and can be computed ahead of time (and pipelined), leaving only a single adder on the feedback path from y_{i-1} to y_i.
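Here is a minimal Verilog sketch of the reassociated loop, under the same illustrative assumptions as the earlier module (the module name, widths, and ports are mine, not the slides'). The feed-forward add a + x is registered off the critical path, so the feedback path contains exactly one adder.

```verilog
// Hypothetical sketch of the reassociated recurrence
//   y_i = (a + x_i) + y_{i-1}
// The feed-forward sum a + x is registered, so only one adder
// remains on the y feedback path.
module loop_pipe #(
  parameter W = 16
)(
  input                  clk,
  input                  rst,
  input      [W-1:0]     x,      // streaming input x_i, one per cycle
  input      [W-1:0]     a,      // constant (or slowly varying) offset
  output reg [W-1:0]     y       // streaming output y_i
);
  reg [W-1:0] ax;                // pipeline register holding a + x

  always @(posedge clk) begin
    if (rst) begin
      ax <= {W{1'b0}};
      y  <= {W{1'b0}};
    end else begin
      ax <= a + x;               // feed-forward add, off the feedback path
      y  <= y + ax;              // single-adder feedback path
    end
  end
endmodule
```

Note the one-cycle skew: the registered sum a + x enters the feedback adder on the cycle after x arrives, which is exactly the sense in which the feed-forward adds run ahead of the feedback add. The clock period is now set by one adder delay plus FF overhead, rather than two adder delays.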

