Berkeley COMPSCI 150 - Lecture 22 - High-level Design and Optimization 2

Spring 2003, EECS150 - Lec22-HL2

Page 1: EECS150 - Digital Design
Lecture 22 - High-level Design and Optimization 2
April 10, 2003
John Wawrzynek

Page 2: List Processor Example

• Design a circuit that forms the sum of all the 2's-complement integers stored in a linked-list structure starting at memory address 0.
• All integers and pointers are 8 bits wide. The linked list is stored in a memory block with an 8-bit address port and an 8-bit data port. The pointer from the last element in the list is 0.
• I/Os:
  – START resets to the head of the list and starts the addition process.
  – DONE signals completion.
  – R, a bus that holds the final result.

Page 3: Architecture #1

Direct implementation of the RTL description (controller plus datapath):

    if (START == 1) NEXT ← 0, SUM ← 0;
    repeat {
        SUM  ← SUM + Memory[NEXT + 1];
        NEXT ← Memory[NEXT];
    } until (NEXT == 0);
    R ← SUM, DONE ← 1;

(Datapath figure: NEXT and SUM registers, two adders, the memory, and a NEXT == 0 comparator, with control signals A_SEL, NEXT_SEL, LD_NEXT, NEXT_ZERO, SUM_SEL, and LD_SUM.)

T > 31 ns, F < 32 MHz

Page 4: Cycle Optimized Version

• Architecture #2:

    if (START == 1) NEXT ← 0, SUM ← 0, NUMA ← 1;
    repeat {
        SUM  ← SUM + Memory[NUMA];
        NUMA ← Memory[NEXT] + 1, NEXT ← Memory[NEXT];
    } until (NEXT == 0);
    R ← SUM, DONE ← 1;

• Incremental cost: addition of another register and mux.

(Datapath figure: as in Architecture #1, with an added NUMA register and mux and an LD_NUMA control signal.)

T > 23 ns, F < 43 MHz

Page 5: Cycle Optimized with Reduced Cost

• Incremental cost:
  – Addition of another mux and control; removal of an 8-bit adder.
• Performance:
  – The mux adds 1 ns to the cycle time: 24 ns, 41.67 MHz.
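Before comparing cycle times further, it helps to pin down exactly what the RTL loop computes. Here is a minimal behavioral sketch in Python (mine, not from the lecture); it assumes the node layout implied by the slides - Memory[p] holds the next pointer, Memory[p+1] holds the node's 8-bit value - and the example list is made up for illustration.

```python
def list_sum(memory):
    """Behavioral model of the Architecture #1 RTL.

    Assumed layout (per the slides): for a node at address p,
    memory[p] is the pointer to the next node and memory[p + 1]
    is the node's 8-bit two's-complement integer; a next pointer
    of 0 terminates the list.
    """
    nxt, total = 0, 0                              # if (START == 1) NEXT <- 0, SUM <- 0
    while True:
        total = (total + memory[nxt + 1]) & 0xFF   # SUM  <- SUM + Memory[NEXT + 1]
        nxt = memory[nxt]                          # NEXT <- Memory[NEXT]
        if nxt == 0:                               # until (NEXT == 0)
            return total                           # R <- SUM, DONE <- 1

# Illustrative three-node list holding the values 3, 7, 2:
memory = [0] * 256
memory[0], memory[1] = 4, 3    # node at 0: next = 4, value = 3
memory[4], memory[5] = 8, 7    # node at 4: next = 8, value = 7
memory[8], memory[9] = 0, 2    # node at 8: next = 0 (end), value = 2
print(list_sum(memory))        # 3 + 7 + 2 = 12
```

The `& 0xFF` keeps SUM at 8 bits, matching the 8-bit data bus: R is the raw two's-complement bus value, overflow and all.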
• Is the cost savings worth the performance degradation?

(Datapath figure: the NEXT, SUM, and NUMA registers now share a single adder through an ADD_SEL mux.)

Page 6: Resource Utilization Charts

• One way to visualize these (and other possible) optimizations is through the use of resource utilization charts.
• These are used in high-level design to help schedule operations on shared resources.
• Resources are listed on the y-axis; time (in cycles) runs along the x-axis.
• Example:

    cycle           1         2         3         4         5
    memory          fetch A1  fetch A2
    bus                       fetch A1  fetch A2
    register-file             read B1   read B2
    ALU                                 A1+B1     A2+B2

• Our list processor has two major resources: the memory and the adder.

Page 7: List Example Resource Scheduling

• Unoptimized solution:

    1. SUM  ← SUM + Memory[NEXT + 1];
    2. NEXT ← Memory[NEXT];

    state     1        2           1        2
    memory    fetch x  fetch next  fetch x  fetch next
    adder1    next+1               next+1
    adder2    sum                  sum

• Optimized solution:

    1. SUM  ← SUM + Memory[NUMA];
    2. NEXT ← Memory[NEXT], NUMA ← Memory[NEXT] + 1;

    state     1        2           1        2
    memory    fetch x  fetch next  fetch x  fetch next
    adder     sum      numa        sum      numa

• How about the other combination - adding an X register?

    1. X    ← Memory[NUMA], NUMA ← NEXT + 1;
    2. NEXT ← Memory[NEXT], SUM  ← SUM + X;

    state     1        2           1        2
    memory    fetch x  fetch next  fetch x  fetch next
    adder     numa     sum         numa     sum

• Does this work? If so, a very short clock period: each cycle could have an independent fetch and add, so T = max(Tmem, Tadd) instead of Tmem + Tadd.

Page 8: List Example Resource Scheduling

• First schedule one loop iteration:

    cycle     1      2      3      4
    memory    next1         x1
    adder            numa1         sum1

• How can we overlap iterations? next2 depends on next1.
  – "Slide" the second iteration into the first:

    cycle     1      2      3      4      5      6      7
    memory    next1         x1     next2         x2
    adder            numa1         sum1   numa2         sum2

  – or, further:

    cycle     1      2      3      4      5      6      7      8      9
    memory    next1  next2  x1     x2     next3  next4  x3     x4
    adder            numa1  numa2  sum1   sum2   numa3  numa4  sum3   sum4

• The repeating pattern is 4 cycles. Not exactly the pattern we were looking for - but does it work correctly?

Page 9: List Example Resource Scheduling

• In this case, first spread out, then pack.
    1. X    ← Memory[NUMA], NUMA ← NEXT + 1;
    2. NEXT ← Memory[NEXT], SUM  ← SUM + X;

• One iteration spread out, then packed:

    cycle     1      2      3      4
    memory    next1         x1
    adder            numa1         sum1

    cycle     1      2      3      4      5      6      7      8
    memory    next1  next2  x1     next3  x2     next4  x3
    adder            numa1  numa2  sum1   numa3  sum2   numa4  sum3

• Three different loop iterations are active at once.
• Short cycle time (no dependencies within a cycle).
• Full utilization.
• Initialization: X = 0, NUMA = 1, SUM = 0, NEXT = Memory[0].
• Extra control states (outside the loop):
  – one to initialize NEXT;
  – one to finish off, 2 cycles after NEXT == 0.

Page 10: Optimization, Architecture #4

• Datapath:

(Datapath figure: NEXT, SUM, NUMA, and X registers sharing one adder through ADD_SEL1/ADD_SEL2 muxes; control signals A_SEL, NEXT_SEL, LD_NEXT, NEXT_ZERO, SUM_SEL, LD_SUM, LD_NUMA, X_SEL, LD_X.)

• Incremental cost:
  – Addition of another register & mux, an adder mux, and control.
• Performance: find the max time over the four actions.

    1. X    ← Memory[NUMA],    0.5 + 1 + 10 + 1 + 1 + 0.5 = 14 ns
       NUMA ← NEXT + 1;        (the same for all four) ⇒ T > 14 ns, F < 71 MHz
    2. NEXT ← Memory[NEXT],
       SUM  ← SUM + X;

Page 11: Other Optimizations

• Node alignment restriction:
  – If the application of the list processor allows us to restrict the placement of nodes in memory so that they are aligned on even multiples of 2 bytes, the NUMA addition can be eliminated.
  – The controller supplies "0" for the low bit of the memory address for NEXT, and "1" for X.
• Furthermore, if we could use a memory with a 16-bit-wide output, then we could fetch an entire node in one cycle:

    {NEXT, X} ← Memory[NEXT], SUM ← SUM + X;

  ⇒ execution time is cut in half (half as many cycles).

Page 12: List Processor Conclusions

• Through careful optimization:
  – clock frequency increased from 32 MHz to 71 MHz,
  – with little (if any) cost increase.
• "Scheduling" was used to overlap iterations and to maximize the use of resources.
• Questions - consider the design process we went through:
  – Could a computer program go from the RTL description to circuits automatically?
  – Could a computer program derive the optimizations that we did?

Page 13: Modulo Scheduling

• Review of list processor scheduling: we "spread out" the schedule of one iteration to allow efficient packing.

    cycle     1      2      3      4
    memory    next1         x1
    adder            numa1         sum1

• The goal of modulo scheduling is to find the schedule for one characteristic section of the computation - the part the control loops over.
• The entire schedule can then be derived by repeating the characteristic section, or repeating it with missing pieces.
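As a closing sanity check (mine, not the lecture's), the fully packed two-state schedule of Architecture #4 can be modeled cycle-by-cycle in Python and compared against the direct Architecture #1 RTL. The node layout (pointer at address p, value at p + 1), function names, and example list are assumptions for illustration.

```python
def list_sum_direct(memory):
    # Architecture #1: one node per loop pass, no overlap.
    nxt, s = 0, 0
    while True:
        s = (s + memory[nxt + 1]) & 0xFF   # SUM  <- SUM + Memory[NEXT + 1]
        nxt = memory[nxt]                  # NEXT <- Memory[NEXT]
        if nxt == 0:
            return s

def list_sum_pipelined(memory):
    # Architecture #4: states 1 and 2 alternate, three iterations in flight.
    # Initialization per the slides: X = 0, NUMA = 1, SUM = 0, NEXT = Memory[0].
    nxt, numa, x, s = memory[0], 1, 0, 0
    while True:
        x, numa = memory[numa], (nxt + 1) & 0xFF   # state 1: X <- Mem[NUMA], NUMA <- NEXT + 1
        s = (s + x) & 0xFF                         # state 2: SUM <- SUM + X
        if nxt == 0:
            return s        # the two finish-off states have just drained the pipeline
        nxt = memory[nxt]                          # state 2: NEXT <- Memory[NEXT]

# Same illustrative list for both models: values 3, 7, 2.
memory = [0] * 256
memory[0], memory[1] = 4, 3
memory[4], memory[5] = 8, 7
memory[8], memory[9] = 0, 2
print(list_sum_direct(memory), list_sum_pipelined(memory))  # 12 12
```

Both models return the same 8-bit result, which is the point of the exercise: rescheduling changed the cycle count and clock period, not the sum being computed.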