Berkeley COMPSCI 150 - Lecture 24 - High-level Design and Optimization 2, Parallelism and Pipelin

Spring 2002 EECS150 - Lec24-HL3

EECS150 - Digital Design
Lecture 24 - High-level Design and Optimization 2, Parallelism and Pipelining
April 23, 2002 (revised April 30)
John Wawrzynek

Parallelism
• Example, student final grade calculation:
    read mt1, mt2, mt3, project;
    grade = 0.2 × mt1 + 0.2 × mt2 + 0.2 × mt3 + 0.4 × project;
    write grade;
• High-performance hardware implementation: as many operations as possible are done in parallel.
• Optimization in hardware design involves using parallelism to trade between cost and performance.
  [Figure: computation graph with four multipliers and three adders; inputs 0.2, mt1, 0.2, mt2, 0.2, mt3, 0.4, proj; output grade]

Parallelism
• Is there a lower-cost hardware implementation? A different tree organization?
• Can factor out the multiply by 0.2:
  [Figure: computation graph with two multipliers and three adders; mt1, mt2, and mt3 are summed and then multiplied by 0.2, proj is multiplied by 0.4; output grade]
• How about sharing operators (multipliers and adders)?

Parallelism
• Time-multiplex a single ALU for all adds and multiplies:
    acc1 = mt1 + mt2;
    acc1 = acc1 + mt3;
    acc1 = 0.2 × acc1;
    acc2 = 0.4 × proj;
    grade = acc1 + acc2;
  [Figure: controller driving a single ALU, with input registers for the midterm and project scores and accumulators acc1 and acc2]
• Attempts to minimize cost at the expense of time.
  – Need to add extra registers, muxes, and control.
• If we adopt this approach, we can then consider the combinational hardware circuit diagram as an abstract computation graph.
• This technique “covers” the computation graph by performing the action of each node one at a time. (Sort of emulates it.)
  [Figure: the four-multiplier, three-adder computation graph for grade]
• Using other primitives, other coverings are possible.
  [Figure: example primitives built from adders and multipliers, with inputs A, B, C and A, B, C, D]

HW versus SW
• This time-multiplexed ALU approach is very similar to what a conventional software version would accomplish.
• CPUs time-multiplex function units (ALUs, etc.):
    add  r2,r1,r3
    add  r2,r2,r4
    mult r2,r4,r5
    ...
• This model matches our tendency to express computation sequentially, even though many computations naturally contain parallelism.
• Our programming languages also strengthen this tendency.
• In hardware we have the ability to exploit problem parallelism. This gives us a “knob” on performance/cost.
• Maybe best to express computations as abstract computation graphs (rather than “programs”). This should lead to a wider range of implementations.
• Note: modern processors spend much of their cost budget attempting to restore execution parallelism: “super-scalar execution”.

Power Consumption
• The critical factor is the total energy for a particular computation.
• As we trade cost for performance, what happens to energy?
  – Fully parallel graph (four multipliers, three adders): 4 E_MULT + 3 E_ADD + E_WIRES
  – Factored graph (two multipliers, three adders): 2 E_MULT + 3 E_ADD + E_WIRES
  – Time-multiplexed ALU with controller: 2 E_MULT + 3 E_ADD + E_MUXES + E_CNTL + E_WIRES
• The lowest energy consumer is the solution that minimizes cost without time-multiplexing operations.

Optimizing Iterative Computations
• Hardware implementations of computations almost always involve looping. Why?
• Is this true with software?
• Are there programs without loops?
  – Maybe in “throwaway” code.
• We probably would not bother building such a thing into hardware, would we?
  – (FPGAs may change this.)
• Fact is, our computations are closely tied to loops. Almost all our HW includes some looping mechanism.
• What do we use looping for?

Optimizing Iterative Computations
Types of loops:
1) Looping over input data (streaming):
  – ex: MP3 player, video compressor, music synthesizer.
2) Looping over memory data:
  – ex: vector inner product, matrix multiply, list processing.
• These two are really very similar. 1) is often turned into 2) by buffering up input data and processing “offline”. Even for “online” processing, buffers are used to smooth out temporary rate mismatches.
3) CPUs are one big loop.
  – Instruction fetch ⇒ execute ⇒ instruction fetch ⇒ execute ⇒ …
  – but they change their personality with each iteration.
4) Others?
Loops offer more opportunity for parallelism by executing more than one iteration at once, through parallel iteration execution and/or pipelining.

Pipelining
• With looping, we are usually less interested in the latency of one iteration and more interested in the loop execution rate, or throughput.
• These can be different due to parallel iteration execution and/or pipelining.
• Pipelining review from CS61C. Analogy to washing clothes:
    step 1: wash (20 minutes)
    step 2: dry (20 minutes)
    step 3: fold (20 minutes)
  Sequential: 60 minutes × 4 loads ⇒ 4 hours
  Overlapped (each column is one 20-minute slot) ⇒ 2 hours:
    wash   load1  load2  load3  load4
    dry           load1  load2  load3  load4
    fold                 load1  load2  load3  load4

Pipelining
• In the limit, as we increase the number of loads, the average time per load approaches 20 minutes.
• The latency (time from start to end) for one load = 60 min.
• The throughput = 3 loads/hour.
• The pipelined throughput ≈ # of pipe stages × un-pipelined throughput.

Pipelining
• General principle:
  [Figure: combinational logic block CL between IN and OUT, with delay T]
  Assume T = 8 ns and T_FF (setup + clk→q) = 1 ns.
  F = 1/(9 ns) = 111 MHz
• Cut the block into pieces (stages) and separate with registers:
  [Figure: blocks CL1 and CL2 between IN and OUT, separated by a register, with delays T1 and T2 and total delay T']
  Assume T1 = T2 = 4 ns.
  T' = 4 ns + 1 ns + 4 ns + 1 ns = 10 ns
  F = 1/(4 ns + 1 ns) = 200 MHz
• The CL block produces a new result every 5 ns instead of every 9 ns.

Limits on Pipelining
• Without FF overhead, throughput improvement ∝ # of stages.
• After many stages are added, FF overhead begins to dominate:
  [Figure: throughput (1/T) versus # of stages (1 to 8), with ideal and real curves; annotation: half the clock period in FF overhead]
• Other limiters:
  – clock skew contributes to clock overhead
  – unequal stages
  – FFs dominate cost
  – clock distribution power consumption
  – feedback (dependencies between loop iterations)

Example
• F(x) = y_i + a·x_i^2 + b·x_i + c
• x and y are assumed to be “streams”.
• Divide into 3 (nearly) equal stages.
• Insert pipeline registers at dashed lines.
• Can we pipeline basic operators?
• Computation graph:
  [Figure: computation graph for F(x) with inputs x, y, a, b, c, built from multipliers and adders]

Pipelined
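
As a rough software sketch of the three-stage example and the clock-rate arithmetic above, the Python below simulates the pipeline registers clock by clock. Several things here are assumptions for illustration and do not come from the slides: the stage boundaries (stage 1 computes x·x and b·x, stage 2 computes a·x^2 and b·x + c, stage 3 adds y), the names f_direct, f_pipelined, and clock_mhz, and the sample coefficients used to try it out.

```python
from typing import List, Optional, Sequence, Tuple


def f_direct(x: int, y: int, a: int, b: int, c: int) -> int:
    """Unpipelined reference: F = y + a*x^2 + b*x + c."""
    return y + a * x * x + b * x + c


def f_pipelined(xs: Sequence[int], ys: Sequence[int],
                a: int, b: int, c: int) -> List[Tuple[int, int]]:
    """Simulate an assumed 3-stage pipeline; return (cycle, result) pairs."""
    inputs = list(zip(xs, ys))
    # Pipeline registers (hypothetical partition, see lead-in):
    reg1: Optional[Tuple[int, int, int]] = None  # (x*x, b*x, y)
    reg2: Optional[Tuple[int, int, int]] = None  # (a*x*x, b*x + c, y)
    results: List[Tuple[int, int]] = []
    cycle = 0
    next_input = 0

    while next_input < len(inputs) or reg1 is not None or reg2 is not None:
        cycle += 1

        # Stage 3: final sum, reads the value latched in reg2 last cycle.
        if reg2 is not None:
            ax2, bx_plus_c, y = reg2
            results.append((cycle, ax2 + bx_plus_c + y))

        # Stage 2: reads reg1 as latched last cycle.
        new_reg2 = None
        if reg1 is not None:
            x2, bx, y = reg1
            new_reg2 = (a * x2, bx + c, y)

        # Stage 1: accepts a new (x, y) pair from the input streams.
        new_reg1 = None
        if next_input < len(inputs):
            x, y = inputs[next_input]
            new_reg1 = (x * x, b * x, y)
            next_input += 1

        # Clock edge: all registers update together.
        reg1, reg2 = new_reg1, new_reg2

    return results


def clock_mhz(logic_ns: float, ff_ns: float) -> float:
    """Clock rate in MHz for a stage with the given logic delay and FF overhead."""
    return 1000.0 / (logic_ns + ff_ns)
```

With four input pairs, results appear on cycles 3, 4, 5, and 6: a latency of three cycles followed by one result per cycle. This is the same shape as the four laundry loads finishing in six 20-minute slots. clock_mhz(8, 1) and clock_mhz(4, 1) reproduce the roughly 111 MHz and 200 MHz figures from the general-principle slide.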

