Berkeley COMPSCI 150 - Lecture 23 - High-level Design and Optimization 3, Parallelism and Pipelining

EECS150 - Digital Design, Fall 2002
Lecture 23 - High-level Design and Optimization 3, Parallelism and Pipelining
Nov 12, 2002, John Wawrzynek

Page 2: Parallelism

• Example, student final grade calculation:

  read mt1, mt2, mt3, project;
  grade = 0.2 × mt1 + 0.2 × mt2 + 0.2 × mt3 + 0.4 × project;
  write grade;

• High-performance hardware implementation: as many operations as possible are done in parallel.
• Parallelism is the act of doing more than one thing at a time. Optimization in hardware design often involves using parallelism to trade between cost and performance.

[Figure: fully parallel datapath - four multipliers (0.2 × mt1, 0.2 × mt2, 0.2 × mt3, 0.4 × proj) feeding a tree of three adders that produces grade.]

Page 3: Parallelism

• Is there a lower-cost hardware implementation? A different tree organization?
• We can factor out the multiply by 0.2: grade = 0.2 × (mt1 + mt2 + mt3) + 0.4 × proj, which needs only two multipliers and three adders.

[Figure: factored datapath - two adders sum mt1, mt2, mt3 before a single 0.2 multiplier; a second multiplier forms 0.4 × proj; a final adder produces grade.]

• How about sharing operators (multipliers and adders)?

Page 4: Parallelism

• Time-multiplex a single ALU for all the adds and multiplies:

  acc1 = mt1 + mt2;
  acc1 = acc1 + mt3;
  acc1 = 0.2 × acc1;
  acc2 = 0.4 × proj;
  grade = acc1 + acc2;

• This attempts to minimize cost at the expense of time. We need to add extra registers, muxes, and control.

[Figure: a controller driving one shared ALU, with registers acc1 and acc2 and input muxes selecting among mt1, mt2, mt3, and proj.]

• If we adopt this approach, we can then view the combinational hardware circuit diagram as an abstract computation graph.
• This technique "covers" the computation graph by performing the action of each node one at a time (it sort of emulates the graph). Using other primitives, other coverings are possible.

Page 5: HW versus SW

• This time-multiplexed-ALU approach is very similar to what a conventional software version would accomplish: CPUs time-multiplex their function units (ALUs, etc.):

  add  r2,r1,r3
  add  r2,r2,r4
  mult r2,r4,r5
  ...

• This model matches our tendency to express computation sequentially, even though many computations naturally contain parallelism.
• Our programming languages also strengthen this tendency.
• In hardware we have the ability to exploit problem parallelism - this gives us a "knob" on performance/cost.
• It may be best to express computations as abstract computation graphs (rather than as "programs"); this should lead to a wider range of implementations.
• Note: modern processors spend much of their cost budget attempting to restore execution parallelism ("superscalar execution").
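The trade-off on pages 2-4 can be made concrete in Verilog. First, a minimal sketch of the fully parallel datapath from page 2. The module and signal names are illustrative (not from the lecture), and the weights 0.2 and 0.4 are approximated as Q8 fixed-point constants (51/256 ≈ 0.199, 102/256 ≈ 0.398), assuming 8-bit input scores:

```verilog
// Sketch of the fully parallel datapath: four multipliers feeding an
// adder tree, all operating at once. Names and widths are illustrative.
module grade_parallel (
    input  wire [7:0]  mt1, mt2, mt3, proj,  // 0..255 scores
    output wire [15:0] grade                 // Q8: integer part is grade[15:8]
);
    wire [15:0] p1 = mt1  * 8'd51;   // ~0.2 * mt1
    wire [15:0] p2 = mt2  * 8'd51;   // ~0.2 * mt2
    wire [15:0] p3 = mt3  * 8'd51;   // ~0.2 * mt3
    wire [15:0] p4 = proj * 8'd102;  // ~0.4 * proj
    assign grade = (p1 + p2) + (p3 + p4);  // balanced adder tree
endmodule
```

All four multiplies and three adds happen combinationally in a single pass: maximum performance, maximum area.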
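And a contrasting sketch of the time-multiplexed single-ALU version from page 4, again with illustrative names and the same Q8 weight approximation. A small FSM sequences the five steps from the slide through one shared arithmetic unit, one step per cycle:

```verilog
// Sketch of the shared-ALU version: one adder/multiplier reused over
// five cycles by a small FSM, mirroring the five steps on the slide.
module grade_serial (
    input  wire        clk, rst, start,
    input  wire [7:0]  mt1, mt2, mt3, proj,
    output reg  [15:0] grade,   // Q8: integer part is grade[15:8]
    output reg         done
);
    localparam [7:0] W02 = 8'd51, W04 = 8'd102;  // ~0.2, ~0.4 in Q8

    reg [2:0]  state;
    reg [15:0] acc1, acc2;

    always @(posedge clk) begin
        if (rst) begin
            state <= 0;
            done  <= 0;
        end else begin
            done <= 0;
            case (state)
                0: if (start) begin acc1 <= mt1 + mt2;  state <= 1; end
                1: begin acc1 <= acc1 + mt3;            state <= 2; end
                2: begin acc1 <= acc1 * W02;            state <= 3; end
                3: begin acc2 <= proj * W04;            state <= 4; end
                4: begin grade <= acc1 + acc2; done <= 1; state <= 0; end
            endcase
        end
    end
endmodule
```

Cost drops to roughly one arithmetic unit plus registers, muxes, and control, but a result now takes five cycles: exactly the cost-for-time trade the slide describes.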
Page 6: Optimizing Iterative Computations

• Hardware implementations of computations almost always involve looping. Why?
• Is this true of software? Are there programs without loops? Maybe in "throw-away" code.
• We probably would not bother building such a thing into hardware, would we? (FPGAs may change this.)
• The fact is, our computations are closely tied to loops, and almost all our hardware includes some looping mechanism.
• What do we use looping for?

Page 7: Optimizing Iterative Computations

Types of loops:
1) Looping over input data (streaming); e.g., an MP3 player, a video compressor, a music synthesizer.
2) Looping over memory data; e.g., vector inner product, matrix multiply, list processing.
   These two are really very similar; 1) is often turned into 2) by buffering up the input data and processing it "offline". Even for "online" processing, buffers are used to smooth out temporary rate mismatches.
3) CPUs are one big loop - instruction fetch ⇒ execute ⇒ instruction fetch ⇒ execute ⇒ … - but they change their personality with each iteration.
4) Others?

Loops offer an opportunity for parallelism: execute more than one iteration at once, through parallel iteration execution and/or pipelining.

Page 8: Pipelining

• With looping, we are usually less interested in the latency of one iteration and more in the loop execution rate, or throughput.
• These can differ because of parallel iteration execution and/or pipelining.
• Pipelining review from CS61C, by analogy to washing clothes:

  step 1: wash (20 minutes)
  step 2: dry  (20 minutes)
  step 3: fold (20 minutes)

  Sequential: 60 minutes × 4 loads ⇒ 4 hours.
  Overlapped (a new load starts every 20 minutes) ⇒ 2 hours.

[Figure: timeline with wash, dry, and fold rows; loads 1-4 staggered in 20-minute slots.]

Page 9: Pipelining

• In the limit, as we increase the number of loads, the average time per load approaches 20 minutes.
• The latency (time from start to end) for one load = 60 min.
• The throughput = 3 loads/hour.
• Pipelined throughput ≈ (# of pipe stages) × (un-pipelined throughput).

Page 10: Pipelining

• General principle: cut the combinational-logic (CL) block into pieces (stages) and separate them with registers.
• Un-pipelined: assume the CL delay is T = 8 ns and the flip-flop overhead (setup + clk→Q) is 1 ns, so F = 1/(9 ns) = 111 MHz.
• Pipelined into two stages with T1 = T2 = 4 ns: total latency T' = 4 ns + 1 ns + 4 ns + 1 ns = 10 ns, but F = 1/(4 ns + 1 ns) = 200 MHz.
• The CL block now produces a new result every 5 ns instead of every 9 ns.

Page 11: Limits on Pipelining

• Without FF overhead, throughput improvement is proportional to the number of stages.
• After many stages are added, FF overhead (the setup and clk→Q times) begins to dominate.

[Figure: throughput (1/T) versus # of stages; the real curve falls away from the ideal line once half the clock period is FF overhead.]

• Other limiters:
  - clock skew contributes to clock overhead
  - unequal stages
  - FFs dominate cost
  - clock distribution power consumption
  - feedback (dependencies between loop iterations)

Page 12: Example

• F(x): y_i = a·x_i² + b·x_i + c, where x and y are assumed to be "streams".
• Divide the computation graph into 3 (nearly) equal stages and insert pipeline registers at the dashed lines.
• Can we pipeline the basic operators themselves?

[Figure: computation graph - multipliers form x·x, a·x², and b·x; adders combine them with c to produce y; dashed lines mark the three pipeline cuts.]

Page 13: Example: Pipelined Adder

[Figure: a 4-bit ripple-carry adder built from full-adder (FA) cells, shown un-pipelined and then with pipeline registers cut between the lower and upper bit slices, so the upper half of one addition overlaps the lower half of the next.]

Page 14: Pipelining Loops with Feedback

• Example 1: y_i = y_{i-1} + x_i + a.
• Un-pipelined version: add1 forms x_i + a, and add2 adds y_{i-1} to produce y_i, which feeds back for the next iteration.
• Can we "cut" the feedback and overlap iterations? Try putting a register after add1.
• This is a "loop-carry dependency": we can't overlap the iterations, because each iteration needs the previous iteration's result. The extra register doesn't help the situation (it actually hurts).
• In general, we can't pipeline feedback loops.
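A minimal Verilog sketch of the register-cutting principle from page 10, assuming the original combinational block can be split into two roughly equal halves; stage1_f and stage2_f are hypothetical stand-ins for those halves:

```verilog
// Sketch: one combinational block cut into two stages separated by a
// pipeline register. stage1_f and stage2_f are placeholders for the
// two roughly equal-delay halves of the original logic.
module pipe2 #(parameter W = 16) (
    input  wire         clk,
    input  wire [W-1:0] in,
    output reg  [W-1:0] out
);
    function [W-1:0] stage1_f(input [W-1:0] x);
        stage1_f = x + 1;          // placeholder for the first half
    endfunction
    function [W-1:0] stage2_f(input [W-1:0] x);
        stage2_f = x ^ (x >> 1);   // placeholder for the second half
    endfunction

    reg [W-1:0] t;                 // the register inserted at the cut
    always @(posedge clk) begin
        t   <= stage1_f(in);       // stage 1
        out <= stage2_f(t);        // stage 2
    end
endmodule
```

Latency rises from one long cycle to two short ones, but a new input is accepted every cycle, matching the 111 MHz to 200 MHz arithmetic on the slide.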
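A sketch of the three-stage pipeline for the page 12 example y_i = a·x_i² + b·x_i + c, using plain integer arithmetic; widths, constants, and names are illustrative:

```verilog
// Sketch: 3-stage pipeline computing y = a*x^2 + b*x + c on a stream
// of x values. One x enters per cycle; its y emerges 3 cycles later.
module quad_pipe #(parameter W = 16,
                   parameter signed [W-1:0] A = 2, B = 3, C = 5) (
    input  wire                  clk,
    input  wire signed [W-1:0]   x,
    output reg  signed [2*W-1:0] y
);
    // Stage 1 registers: square x; carry x forward for the b*x term.
    reg signed [2*W-1:0] xx_s1;
    reg signed [W-1:0]   x_s1;
    // Stage 2 registers: the two constant multiplies.
    reg signed [2*W-1:0] axx_s2, bx_s2;

    always @(posedge clk) begin
        xx_s1  <= x * x;                 // stage 1
        x_s1   <= x;
        axx_s2 <= A * xx_s1;             // stage 2
        bx_s2  <= B * x_s1;
        y      <= axx_s2 + bx_s2 + C;    // stage 3: final adds
    end
endmodule
```

The dashed-line register cuts on the slide correspond to the _s1 and _s2 registers here; each stage holds roughly one multiply or one pair of adds.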
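Finally, a sketch of the page 14 feedback loop y_i = y_{i-1} + x_i + a, showing why it resists pipelining (names illustrative):

```verilog
// Sketch: the feedback loop y_i = y_{i-1} + x_i + a. Both adds must fit
// in one clock period, because iteration i+1 needs y_i on the very next
// cycle; the loop-carry dependency sets the minimum clock period.
module acc_loop #(parameter W = 16, parameter [W-1:0] A = 1) (
    input  wire         clk, rst,
    input  wire [W-1:0] x,
    output reg  [W-1:0] y
);
    always @(posedge clk)
        if (rst) y <= 0;
        else     y <= y + x + A;   // two chained adds, no register between
endmodule
```

Putting a pipeline register between the two adds would make each iteration see y_{i-2} instead of y_{i-1}, computing a different recurrence; that is the slide's point that the extra register actually hurts.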

