Berkeley COMPSCI 150 - Lecture 23 - High-level Design and Optimization 3, Parallelism and Pipelining

EECS150 Digital Design, Lecture 23: High-level Design and Optimization 3, Parallelism and Pipelining. John Wawrzynek, April 15, 2003.

Parallelism

Parallelism is the act of doing more than one thing at a time. Optimization in hardware design often involves using parallelism to trade between cost and performance.

Example: student final-grade calculation.

  read mt1, mt2, mt3, project
  grade = 0.2 × mt1 + 0.2 × mt2 + 0.2 × mt3 + 0.4 × project
  write grade

A high-performance hardware implementation computes the four products (0.2 × mt1, 0.2 × mt2, 0.2 × mt3, 0.4 × proj) on four multipliers at once and sums them with an adder tree: as many operations as possible are done in parallel.

Is there a lower-cost hardware implementation? A different tree organization helps: because three terms share the coefficient 0.2, we can factor out that multiply,

  grade = 0.2 × (mt1 + mt2 + mt3) + 0.4 × proj

which needs two multipliers instead of four. How about sharing operators (multipliers and adders)?

Time Multiplexing

Time-multiplex a single ALU for all the adds and multiplies. This attempts to minimize cost at the expense of time, and it requires extra registers, muxes, and a controller. The ALU performs one operation per cycle:

  acc1 <= mt1 + mt2
  acc1 <= acc1 + mt3
  acc1 <= 0.2 × acc1
  acc2 <= 0.4 × proj
  grade <= acc1 + acc2

If we adopt this approach, we can consider the combinational hardware circuit diagram as an abstract computation graph. This technique covers the computation graph by performing the action of each node one at a time (it sort of emulates the graph). Using other primitives, other coverings are possible; for example, a multiply-add unit computing (A × B) + C covers a multiply node and an add node in a single step.
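To make the time-multiplexed datapath concrete, here is a minimal Verilog sketch of the grade calculator above: one shared ALU operation per clock cycle, sequenced by a small FSM. The module name, ports, and the Q8.8 fixed-point encoding (with scores assumed normalized to [0,1) so the sums never overflow 16 bits) are illustrative assumptions, not taken from the slides.

```verilog
// Hypothetical sketch: time-multiplexed ALU for
//   grade = 0.2*(mt1 + mt2 + mt3) + 0.4*proj
// Scores are assumed to be Q8.8 fixed point, normalized to [0,1).
module grade_tmux #(
  parameter W = 16                     // word width (Q8.8 assumed)
)(
  input                  clk,
  input                  rst,
  input                  start,
  input      [W-1:0]     mt1, mt2, mt3, proj,
  output reg [W-1:0]     grade,
  output reg             done
);
  // 0.2 and 0.4 rounded to Q8.8 constants (assumptions, not from the slides)
  localparam [W-1:0] C02 = 16'd51;     // 51/256  = 0.199...
  localparam [W-1:0] C04 = 16'd102;    // 102/256 = 0.398...

  reg [W-1:0] acc1, acc2;
  reg [2:0]   state;

  // One shared-ALU operation per cycle, following the slide's schedule:
  //   acc1 <= mt1 + mt2; acc1 <= acc1 + mt3; acc1 <= 0.2*acc1;
  //   acc2 <= 0.4*proj;  grade <= acc1 + acc2
  always @(posedge clk) begin
    if (rst) begin
      state <= 3'd0;
      done  <= 1'b0;
    end else begin
      done <= 1'b0;
      case (state)
        3'd0: if (start) begin acc1 <= mt1 + mt2;        state <= 3'd1; end
        3'd1: begin acc1 <= acc1 + mt3;                  state <= 3'd2; end
        3'd2: begin acc1 <= (C02 * acc1) >> 8;           state <= 3'd3; end
        3'd3: begin acc2 <= (C04 * proj) >> 8;           state <= 3'd4; end
        3'd4: begin grade <= acc1 + acc2; done <= 1'b1;  state <= 3'd0; end
      endcase
    end
  end
endmodule
```

The five states correspond one-for-one to the schedule above: the fully parallel implementation finishes in one long combinational pass using four multipliers and three adders, while this version takes five cycles with one shared unit plus registers, muxes, and control. That is exactly the cost/performance knob the slides describe.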
HW versus SW

This time-multiplexed ALU approach is very similar to what a conventional software version would accomplish, since CPUs time-multiplex their function units (ALUs, etc.):

  add  r2, r1, r3
  add  r2, r2, r4
  mult r2, r4, r5

This model matches our tendency to express computation sequentially, even though many computations naturally contain parallelism, and our programming languages strengthen that tendency. In hardware we have the ability to exploit problem parallelism; it gives us a knob on performance versus cost. It may be best to express computations as abstract computation graphs rather than as programs, since that should lead to a wider range of implementations. Note that modern processors spend much of their cost budget attempting to restore execution parallelism ("superscalar" execution).

Optimizing Iterative Computations

Hardware implementations of computations almost always involve looping. Why? Is this true with software? Are there programs without loops? Maybe in throw-away code, but we probably would not bother building such a thing into hardware, would we? (FPGAs may change this.) The fact is that our computations are closely tied to loops, and almost all our hardware includes some looping mechanism. What do we use looping for?

Types of loops:
1. Looping over input data (streaming): e.g., an MP3 player, a video compressor, a music synthesizer.
2. Looping over memory data: e.g., vector inner product, matrix multiply, "list processing." Types 1 and 2 are really very similar; 1 is often turned into 2 by buffering up input data and processing it offline, and even for online processing, buffers are used to smooth out temporary rate mismatches.
3. CPUs are one big loop (instruction fetch, execute; instruction fetch, execute; ...), but they change their "personality" with each iteration.
4. Others?

Loops offer an opportunity for parallelism: execute more than one iteration at once, through parallel iteration execution and/or pipelining.

Pipelining

With looping, usually we are less interested in the latency of one iteration and more interested in the loop execution rate, or throughput. These can differ because of parallel iteration execution and/or pipelining.

Pipelining review (from CS61C), by analogy to washing clothes:

  step 1: wash (20 minutes)
  step 2: dry (20 minutes)
  step 3: fold (20 minutes)

Done serially, 60 minutes × 4 loads = 4 hours. Overlapped, load 2 washes while load 1 dries, and so on: once the pipeline is full, a load finishes every 20 minutes, and the 4 loads take 2 hours.

In the limit, as we increase the number of loads, the average time per load approaches 20 minutes. The latency (time from start to end for one load) is still 60 minutes; the throughput is 3 loads/hour. Pipelined throughput ≈ (number of pipe stages) × (unpipelined throughput).

Pipelining: the general principle. Consider a combinational logic block CL between input and output registers. Assume the CL delay is T = 8 ns and the flip-flop overhead (setup plus clk-to-Q) is T_FF = 1 ns; then F = 1/(8 ns + 1 ns) ≈ 111 MHz. Cut the CL block into pieces (stages) and separate them with registers, giving CL1 and CL2 with T1 = T2 = 4 ns. The total latency grows to 4 ns + 1 ns + 4 ns + 1 ns = 10 ns, but the clock rate rises to F = 1/(4 ns + 1 ns) = 200 MHz: the block produces a new result every 5 ns instead of every 9 ns.

Limits on Pipelining

Without FF overhead, the throughput improvement is proportional to the number of stages. After many stages are added, the FF overhead (the setup and clk-to-Q times) begins to dominate.

[Figure: throughput (1/T) versus number of stages, 1 through 8. The ideal curve grows linearly with the number of stages; the real curve bends away from it and flattens once half the clock period is spent in FF overhead.]

Other limiters to effective pipelining:
- clock skew contributes to clock overhead
- unequal stages
- FFs come to dominate cost
- clock distribution power consumption
- feedback (dependencies between loop iterations)

Pipelining Example

F(x): y_i = a·x_i² + b·x_i + c, where x and y are assumed to be streams. The computation graph squares x_i, scales the square by a, scales x_i by b, and sums the two products with c. Divide the graph into 3 nearly equal stages and insert pipeline registers at the stage boundaries (the dashed lines in the slide). Can we pipeline the basic operators themselves?

Example: Pipelined Adder

[Figure: a 4-bit ripple-carry adder built from full-adder (FA) cells, with inputs a3..a0 and b3..b0, carry-in c0, and sums s3..s0; and a pipelined version in which registers cut the carry chain and skew the input and output bits in time, so each clock period covers only part of the carry propagation.]

Pipelining Loops with Feedback

Loop-carry dependency, example 1:

  y_i = y_{i-1} + x_i + a

In the unpipelined version, add1 computes y_{i-1} + x_i and add2 adds a to produce y_i, so the feedback path passes through both adders. Can we cut the feedback and overlap iterations? Try putting a register after add1: we still can't overlap the iterations because of the dependency, and the extra register doesn't help the situation; it actually hurts, since it adds a cycle to the feedback path. In general, we can't pipeline feedback loops.

However, we can overlap the non-feedback part of the iterations. Add is associative and commutative, so we can reorder the computation to shorten the delay of the feedback path:

  y_i = (y_{i-1} + x_i) + a = (a + x_i) + y_{i-1}

Now the sums a + x_i, a + x_{i+1}, a + x_{i+2}, ... are feed-forward and can be computed ahead of time (and pipelined), leaving only a single adder on the feedback path from y_{i-1} to y_i.
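Here is a minimal Verilog sketch of the reassociated loop, under the same illustrative assumptions as the earlier module (the module name, widths, and ports are mine, not the slides'). The feed-forward add a + x is registered off the critical path, so the feedback path contains exactly one adder.

```verilog
// Hypothetical sketch of the reassociated recurrence
//   y_i = (a + x_i) + y_{i-1}
// The feed-forward sum a + x is registered, so only one adder
// remains on the y feedback path.
module loop_pipe #(
  parameter W = 16
)(
  input                  clk,
  input                  rst,
  input      [W-1:0]     x,      // streaming input x_i, one per cycle
  input      [W-1:0]     a,      // constant (or slowly varying) offset
  output reg [W-1:0]     y       // streaming output y_i
);
  reg [W-1:0] ax;                // pipeline register holding a + x

  always @(posedge clk) begin
    if (rst) begin
      ax <= {W{1'b0}};
      y  <= {W{1'b0}};
    end else begin
      ax <= a + x;               // feed-forward add, off the feedback path
      y  <= y + ax;              // single-adder feedback path
    end
  end
endmodule
```

Note the one-cycle skew: the registered sum a + x enters the feedback adder on the cycle after x arrives, which is exactly the sense in which the feed-forward adds run ahead of the feedback add. The clock period is now set by one adder delay plus FF overhead, rather than two adder delays.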

