Spring 2010, EECS150 - Digital Design
Lecture 26 - High-Level Design (Part 2)
April 22, 2010
John Wawrzynek

List Processor Example
• Design a circuit that forms the sum of all the 2's-complement integers stored in a linked-list structure starting at memory address 0.
• All integers and pointers are 8 bits. The linked list is stored in a memory block with an 8-bit address port and an 8-bit data port. The pointer from the last element in the list is 0.
• I/Os:
  – START resets to the head of the list and starts the addition process.
  – DONE signals completion.
  – R: bus that holds the final result.

List Example Resource Scheduling
• In this case, first spread out, then pack:
  1. X ← Memory[NUMA], NUMA ← NEXT + 1;
  2. NEXT ← Memory[NEXT], SUM ← SUM + X;
• Three different loop iterations are active at once.
• Short cycle time (no dependencies within a cycle).
• Full utilization (only 2 cycles per result).
• Initialization: x = 0, numa = 1, sum = 0, next = memory[0].
• Extra control states (out of the loop):
  – one to initialize next,
  – one to finish off, 2 cycles after next == 0.

  One iteration spread out:        Packed schedule (columns are cycles; bars mark 2-cycle sections):
    Memory:  next1  x1               Memory:  next1 | next2  x1   | next3  x2   | next4  x3
    adder:   numa1  sum1             adder:   numa1 | numa2  sum1 | numa3  sum2 | numa4  sum3

5. Optimization, Architecture #4
• Datapath: (figure omitted).
• Incremental cost:
  – Addition of another register & mux, adder mux, and control.
• Performance: find the max time of the four actions:
  1. X ← Memory[NUMA], NUMA ← NEXT + 1;   0.5 + 1 + 10 + 1 + 1 + 0.5 = 14 ns
  2. NEXT ← Memory[NEXT], SUM ← SUM + X;  same for all ⇒ T > 14 ns, F < 71 MHz
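The list-traversal loop above can be checked with a small behavioral model. The Python sketch below is an illustration under assumptions (node layout: pointer byte at address p, value byte at p + 1, consistent with NUMA = NEXT + 1); it models the computation, not the lecture's datapath or cycle timing.

```python
# Behavioral sketch of the list processor (hypothetical model, not the RTL).
# Assumed node layout: memory[p] = pointer to next node, memory[p+1] = value.
# memory[0] holds the head pointer; a next-pointer of 0 ends the list.

def to_signed8(b):
    """Interpret an 8-bit value as a 2's-complement integer."""
    return b - 256 if b >= 128 else b

def list_sum(memory):
    total = 0
    next_ptr = memory[0]                # NEXT <- Memory[0]
    while next_ptr != 0:
        numa = next_ptr + 1             # NUMA <- NEXT + 1
        total += to_signed8(memory[numa])  # X <- Memory[NUMA]; SUM <- SUM + X
        next_ptr = memory[next_ptr]     # NEXT <- Memory[NEXT]
    return total

# Example: two nodes at addresses 4 and 10 holding +5 and -3.
mem = [0] * 256
mem[0] = 4                  # head pointer
mem[4], mem[5] = 10, 5      # node 1: next = 10, value = +5
mem[10], mem[11] = 0, 253   # node 2: next = 0,  value = -3 (2's complement)
print(list_sum(mem))        # -> 2
```

The model folds each 2-cycle loop iteration into one Python loop body; the hardware's initial x = 0 / numa = 1 startup states have no visible effect on the sum and are omitted.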
Other Optimizations
• Node alignment restriction:
  – If the application of the list processor allows us to restrict the placement of nodes in memory so that they are aligned on even multiples of 2 bytes, then:
    • the NUMA addition can be eliminated;
    • the controller supplies "0" for the low bit of the memory address for NEXT, and "1" for X.
  – Furthermore, if we could use a memory with a 16-bit-wide output, then we could fetch an entire node in one cycle:
      {NEXT, X} ← Memory[NEXT], SUM ← SUM + X;
    ⇒ execution time cut in half (half as many cycles).

List Processor Conclusions
• Through careful optimization:
  – clock frequency increased from 32 MHz to 71 MHz,
  – with little cost increase.
• "Scheduling" was used to overlap operations and to maximize the use of resources.
• Questions — consider the design process we went through:
  – Could a computer program go from an RTL description to circuits automatically?
  – Could a computer program derive the optimizations that we did?

Modulo Scheduling
• Review of list processor scheduling: how did we know to "spread out" the schedule of one iteration to allow efficient packing?

  One iteration spread out:        Packed schedule (columns are cycles; bars mark 2-cycle sections):
    Memory:  next1  x1               Memory:  next1 | next2  x1   | next3  x2   | next4  x3
    adder:   numa1  sum1             adder:   numa1 | numa2  sum1 | numa3  sum2 | numa4  sum3

• The goal of modulo scheduling is to find the schedule for one characteristic section of the computation. This is the part the control loops over.
• The entire schedule can then be derived by repeating the characteristic section, or by repeating it with some pieces omitted.

Modulo Scheduling Procedure
1. Calculate the minimal length of the characteristic section: the maximum number of cycles that any one resource is used during one iteration of the computation (assuming a resource can only be used once per cycle).
2. Schedule one iteration of the computation on the characteristic section, wrapping around when necessary.
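Step 1 of this procedure reduces to a maximum over per-iteration resource-usage counts. The sketch below is an illustrative Python model; the resource names are assumptions, not part of the lecture.

```python
# Minimal characteristic-section length (modulo scheduling, step 1):
# the maximum number of cycles any single resource is busy during one
# iteration, assuming each resource can be used at most once per cycle.

def min_section_length(uses_per_iteration):
    """uses_per_iteration maps resource name -> cycles used per iteration."""
    return max(uses_per_iteration.values())

# List processor: the memory fetches NEXT and X (2 cycles), the adder
# computes NUMA and SUM (2 cycles) -> section length 2.
print(min_section_length({"memory": 2, "adder": 2}))       # -> 2

# A hypothetical design where memory is the bottleneck -> 3.
print(min_section_length({"memory": 3, "multiplier": 1}))  # -> 3
```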
   Each time the computation wraps around, decrease the iteration subscript by one.
3. If the iteration will not fit on the minimal-length section, increase the section length by one and try again.

Modulo Scheduling the List Processor
• One iteration consists of:
    next_i:  NEXT ← Memory[NEXT]
    numa_i:  NUMA ← NEXT + 1
    x_i:     X ← Memory[NUMA]
    sum_i:   SUM ← SUM + X
• Assuming a single adder and a single-ported memory, the minimal schedule section length = 2, because both the memory and the adder are used for 2 cycles during one iteration.
• Characteristic section (x and sum wrap around; their subscripts are decreased):
    cycle 1 — memory: next_i,   adder: numa_i
    cycle 2 — memory: x_{i-1},  adder: sum_{i-2}
• Finished schedule for 4 iterations:
    Memory:  next1 | next2  x1   | next3  x2   | next4  x3
    adder:   numa1 | numa2  sum1 | numa3  sum2 | numa4  sum3

Another Scheduling Example
• Compute graph (one iteration of a repeating calculation): E = (A + B) + (C + D).
• Assume A, B, C, D, E are stored in a dual-port memory.
• Assume a single adder.
• Minimal schedule section length = 3. (Both the memory and the adder are used for 3 cycles during one iteration.)
• Repeating schedule — one 3-cycle characteristic section, repeated (the "E =" add combines the previous iteration's partial sums):
    memory port 1:  load A | load C | store E
    memory port 2:  load B | load D |
    adder:          E =    | A + B  | C + D

Parallelism
• Example, student final grade calculation:
    read mt1, mt2, mt3, project;
    grade = 0.2 × mt1 + 0.2 × mt2 + 0.2 × mt3 + 0.4 × project;
    write grade;
• High-performance hardware implementation: as many operations as possible are done in parallel.
• Parallelism is the act of doing more than one thing at a time.
• Optimization in hardware design often involves using parallelism to trade between cost and performance.

Parallelism
• Is there a lower-cost hardware implementation?
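The grade formula above can be checked numerically. This Python sketch mirrors the fully parallel hardware structure (four multipliers at once, then an adder tree); the function and variable names are illustrative.

```python
# Numeric check of the grade formula:
#   grade = 0.2*mt1 + 0.2*mt2 + 0.2*mt3 + 0.4*project
# In the parallel hardware version, all four multiplies can happen
# simultaneously, followed by a tree of three adds.

def grade(mt1, mt2, mt3, project):
    products = [0.2 * mt1, 0.2 * mt2, 0.2 * mt3, 0.4 * project]  # 4 parallel multipliers
    # adder tree: two adds in parallel, then one final add
    return (products[0] + products[1]) + (products[2] + products[3])

print(grade(80, 90, 100, 70))   # 0.2*270 + 0.4*70 = 54 + 28, i.e. approx. 82
```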
• A different tree organization?
• Can factor out the multiplies by 0.2:
    grade = 0.2 × (mt1 + mt2 + mt3) + 0.4 × project
• How about sharing operators (multipliers and adders)?

Time-Multiplexing
• Time-multiplex a single ALU for all the adds and multiplies.
• This attempts to minimize cost at the expense of time.
  – Need to add an extra register, muxes, and control.
• If we adopt the above approach, we can then consider the combinational hardware circuit diagram.
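The time-multiplexing idea can be sketched behaviorally: one shared ALU executes a scheduled sequence of micro-operations, one per cycle. The micro-operation schedule and register names below are illustrative assumptions, not the lecture's control design; the schedule uses the factored form of the grade formula.

```python
# One shared ALU executes a scheduled sequence of operations, one per "cycle".
# The accumulator plays the role of the extra register the slide mentions;
# the schedule factors the grade as 0.2*(mt1 + mt2 + mt3) + 0.4*project.

def run_alu(schedule, regs):
    """Execute (dest, op, src1, src2) micro-operations on one shared ALU."""
    cycles = 0
    for dest, op, a, b in schedule:
        x, y = regs[a], regs[b]
        regs[dest] = x + y if op == "add" else x * y
        cycles += 1
    return cycles

regs = {"mt1": 80, "mt2": 90, "mt3": 100, "project": 70,
        "c02": 0.2, "c04": 0.4, "acc": 0, "t": 0}
schedule = [
    ("acc", "add", "mt1", "mt2"),      # acc = mt1 + mt2
    ("acc", "add", "acc", "mt3"),      # acc = mt1 + mt2 + mt3
    ("acc", "mul", "acc", "c02"),      # acc = 0.2 * (mt1 + mt2 + mt3)
    ("t",   "mul", "project", "c04"),  # t = 0.4 * project
    ("acc", "add", "acc", "t"),        # acc = final grade
]
cycles = run_alu(schedule, regs)
print(cycles, regs["acc"])             # 5 cycles; grade approx. 82
```

With one ALU the computation takes 5 cycles instead of the 3 levels of the fully parallel tree, illustrating the cost-for-time trade the slide describes.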