Berkeley COMPSCI 150 - Lecture 26 - High-Level Design

EECS150 - Digital Design
Lecture 26 - High-Level Design (Part 2)
April 22, 2010
John Wawrzynek

List Processor Example
• Design a circuit that forms the sum of all the two's complement integers stored in a linked-list structure starting at memory address 0.
• All integers and pointers are 8 bits wide. The linked list is stored in a memory block with an 8-bit address port and an 8-bit data port. The pointer from the last element in the list is 0.
• I/Os:
  – START resets to the head of the list and starts the addition process.
  – DONE signals completion.
  – R: bus that holds the final result.

List Example Resource Scheduling
• In this case, first spread out, then pack:
  1. X ← Memory[NUMA], NUMA ← NEXT + 1;
  2. NEXT ← Memory[NEXT], SUM ← SUM + X;
• Three different loop iterations are active at once.
• Short cycle time (no dependencies within a cycle).
• Full utilization (only 2 cycles per result).
• Initialization: X = 0, NUMA = 1, SUM = 0, NEXT = Memory[0].
• Extra control states (out of the loop):
  – one to initialize NEXT,
  – one to finish off, 2 cycles after NEXT == 0.
• One iteration spread out over four cycles:
    memory:  next1  --     x1     --
    adder:   --     numa1  --     sum1
• Packed schedule, iterations overlapped:
    memory:  next1  next2  x1     next3  x2     next4  x3    ...
    adder:   --     numa1  numa2  sum1   numa3  sum2   numa4  sum3 ...

5. Optimization, Architecture #4
• Datapath: [figure not reproduced in this text; it shows the added register and muxes, with an LD_NUMA load control]
• Incremental cost: addition of another register and mux, an adder mux, and control.
• Performance: find the maximum delay of the four actions
  1. X ← Memory[NUMA], NUMA ← NEXT + 1;
  2. NEXT ← Memory[NEXT], SUM ← SUM + X;
  Each path is 0.5 + 1 + 10 + 1 + 1 + 0.5 = 14 ns, the same for all ⇒ T > 14 ns, F < 71 MHz.

Other Optimizations
• Node alignment restriction: if the application of the list processor allows us to restrict the placement of nodes in memory so that they are aligned on even multiples of 2 bytes, then:
  – the NUMA addition can be eliminated;
  – the controller supplies "0" for the low bit of the memory address for NEXT, and "1" for X.
• Furthermore, if we could use a memory with a 16-bit-wide output, then we could fetch an entire node in one cycle:
    {NEXT, X} ← Memory[NEXT], SUM ← SUM + X;
  ⇒ execution time is cut in half (half as many cycles).

List Processor Conclusions
• Through careful optimization:
  – clock frequency increased from 32 MHz to 71 MHz,
  – with little cost increase.
• "Scheduling" was used to overlap iterations and to maximize the use of resources.
• Questions about the design process we went through:
  – Could a computer program go from an RTL description to circuits automatically?
  – Could a computer program derive the optimizations that we did?

Modulo Scheduling
• Review of list processor scheduling: the spread-out and packed schedules shown above.
• How did we know to "spread out" the schedule of one iteration to allow efficient packing?
• The goal of modulo scheduling is to find the schedule for one characteristic section of the computation; this is the part the control loops over.
• The entire schedule can then be derived by repeating the characteristic section, or by repeating it with some pieces omitted.

Modulo Scheduling Procedure
1. Calculate the minimal length of the characteristic section: the maximum number of cycles that any one resource is used during one iteration of the computation (assuming a resource can only be used once per cycle).
2. Schedule one iteration of the computation on the characteristic section, wrapping around when necessary. Each time the computation wraps around, decrease the iteration subscript by one.
3. If the iteration will not fit on the minimal-length section, increase the section length by one and try again.
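To make the procedure above concrete, here is a small Python sketch (my own illustration, not from the lecture) that follows the three steps for the list-processor operations: step 1 computes the minimal section length from per-resource usage counts, and steps 2-3 place each operation in dependency order, wrapping around the section and decreasing the iteration subscript whenever a placement spills past the end. The ops list format and the modulo_schedule() helper are assumptions made for the illustration.

    # Sketch of the modulo-scheduling procedure above (illustrative, not from the lecture).
    # Each operation names the resource it needs and the operations it depends on;
    # the four entries are the list-processor register transfers.
    from collections import Counter

    ops = [
        ("next", "memory", []),        # NEXT <- Memory[NEXT]
        ("numa", "adder",  ["next"]),  # NUMA <- NEXT + 1
        ("x",    "memory", ["numa"]),  # X    <- Memory[NUMA]
        ("sum",  "adder",  ["x"]),     # SUM  <- SUM + X
    ]

    # Step 1: minimal section length = the most cycles any single resource is
    # used during one iteration (a resource can only be used once per cycle).
    min_length = max(Counter(res for _, res, _ in ops).values())   # 2 here

    def modulo_schedule(ops, length):
        """Steps 2-3: place one iteration on a wrapped section, growing it if needed."""
        while True:
            slots = {}    # (cycle within section, resource) -> operation name
            placed = {}   # operation name -> (absolute cycle, subscript offset)
            feasible = True
            for name, res, deps in ops:
                earliest = max((placed[d][0] + 1 for d in deps), default=0)
                t = earliest
                # Slide forward until this resource's slot in the section is free.
                while t < earliest + length and (t % length, res) in slots:
                    t += 1
                if t == earliest + length:        # resource fully booked: does not fit
                    feasible = False
                    break
                slots[(t % length, res)] = name
                placed[name] = (t, -(t // length))   # each wrap-around drops the subscript
            if feasible:
                return length, slots, placed
            length += 1                           # Step 3: lengthen the section and retry

    length, slots, placed = modulo_schedule(ops, min_length)
    for (cycle, res), name in sorted(slots.items()):
        offset = placed[name][1]
        print(f"cycle {cycle}: {res:>6}  {name}_i{offset if offset else ''}")

On these four operations the sketch reports a 2-cycle section in which the memory performs next_i and x_i-1 while the adder performs sum_i-2 and numa_i, matching the schedule worked out on the next slide.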
Modulo Scheduling the List Processor
• Operations in one iteration:
    next_i:  NEXT ← Memory[NEXT]
    numa_i:  NUMA ← NEXT + 1
    x_i:     X ← Memory[NUMA]
    sum_i:   SUM ← SUM + X
• Assuming a single adder and a single-ported memory, the minimal schedule section length is 2, because both the memory and the adder are used for 2 cycles during one iteration.
• Filling the 2-cycle section: next_i takes the memory slot of cycle 1 and numa_i the adder slot of cycle 2; x_i must wrap around, so its subscript decreases and it takes the remaining memory slot as x_i-1; sum_i wraps around once more and takes the remaining adder slot as sum_i-2.
• Finished schedule for 4 iterations:
    memory:  next1  next2  x1     next3  x2     next4  x3    ...
    adder:   --     numa1  numa2  sum1   numa3  sum2   numa4  sum3

Another Scheduling Example
• Compute graph (one iteration of a repeating calculation): E = (A + B) + (C + D).
• Assume A, B, C, D, E are stored in a dual-port memory.
• Assume a single adder.
• Minimal schedule section length = 3 (both the memory and the adder are used for 3 cycles during one iteration).
• Repeating schedule (the "E =" addition and the "store E" in each section complete the previous iteration):
    memory port 1:  load A  load C  store E
    memory port 2:  load B  load D  --
    adder:          E =     A + B   C + D

Parallelism
• Example: student final grade calculation:
    read mt1, mt2, mt3, project;
    grade = 0.2 × mt1 + 0.2 × mt2 + 0.2 × mt3 + 0.4 × project;
    write grade;
• High-performance hardware implementation: as many operations as possible are done in parallel.
• Parallelism is the act of doing more than one thing at a time.
• Optimization in hardware design often involves using parallelism to trade between cost and performance.

Parallelism
• Is there a lower-cost hardware implementation? A different tree organization?
• We can factor out the multiply by 0.2:
    grade = 0.2 × (mt1 + mt2 + mt3) + 0.4 × project;
• How about sharing operators (multipliers and adders)?

Time-Multiplexing
• Time-multiplex a single ALU for all adds and multiplies.
• This attempts to minimize cost at the expense of time.
  – Need to add an extra register, muxes, and control.
• If we adopt the above approach, we can then consider the combinational hardware circuit diagram …
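As a rough sketch of that last idea (my own illustration, not from the slides), the factored grade computation can be driven through one shared ALU by a five-step micro-program. The alu() and grade_time_multiplexed() helpers, the R0/R1 holding registers, and the micro-program encoding are all assumed names standing in for the extra register, muxes, and control the slide calls for.

    # Illustrative model of time-multiplexing one ALU for the grade calculation.
    # R0/R1 and the micro-program format are assumed names, not from the lecture.

    def alu(op, a, b):
        """The single shared functional unit: one add or one multiply per cycle."""
        return a + b if op == "add" else a * b

    def grade_time_multiplexed(mt1, mt2, mt3, project):
        # One entry per clock cycle: (ALU operation, operand A, operand B, destination).
        # R0 and R1 are the extra holding registers; the operand names select the muxes.
        microprogram = [
            ("add", "mt1",     "mt2", "R0"),   # R0 = mt1 + mt2
            ("add", "R0",      "mt3", "R0"),   # R0 = mt1 + mt2 + mt3
            ("mul", "R0",      "0.2", "R0"),   # R0 = 0.2 * (mt1 + mt2 + mt3)
            ("mul", "project", "0.4", "R1"),   # R1 = 0.4 * project
            ("add", "R0",      "R1",  "R0"),   # R0 = grade
        ]
        regs = {"mt1": mt1, "mt2": mt2, "mt3": mt3, "project": project,
                "0.2": 0.2, "0.4": 0.4, "R0": 0.0, "R1": 0.0}
        for op, a, b, dest in microprogram:     # exactly one ALU operation per cycle
            regs[dest] = alu(op, regs[a], regs[b])
        return regs["R0"]

    print(grade_time_multiplexed(80, 90, 70, 85))   # 0.2*(80+90+70) + 0.4*85 = 82.0

Five cycles through a single ALU replace a fully parallel implementation of the factored expression (three additions and two multiplications done by separate operators), which is exactly the cost-versus-time trade-off being described.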

