Stanford EE 482C - Register Organization and Raw Hardware

EE482C: Advanced Computer Organization
Stream Processor Architecture
Stanford University
Lecture #7: Thursday, 25 April 2002

Register Organization and Raw Hardware

Lecturer: Prof. Bill Dally
Scribe: Suzanne Rivoire, Yangjin Oh
Reviewer: Mattan Erez

Logistics:
- Handout: Project Topics (prepare for brainstorming on Tues)
- Project is in teams of up to 4
- Assignment step 2 due on Friday at midnight

During this lecture, we discussed two papers. The first paper, which constituted the majority of the lecture, is a comparison of the area requirements of different register file organizations. The second paper is an introduction to the hardware organization of MIT's Raw microprocessor.

1 Register Organization for Media Processing

S. Rixner, W. Dally, B. Khailany, P. Mattson, U. Kapasi, and J. Owens. "Register Organization for Media Processing." HPCA 6, 2000.

Figure 1: Schematic of register cell

Register files are commonly thought of as storage elements, but they also serve the purpose of communication. In fact, for multiport register files above a threshold size, the area of the communication switch dominates the area of the register file as a whole.

This paper recognizes the dual purpose of the register file and looks at ways to rearrange and decouple the storage and communication functionalities. Starting with a centrally organized flat register file, it describes transformations from this central organization into a single instruction multiple data (SIMD) organization, a distributed organization, and a distributed SIMD organization. Then it describes further transformations from the flat register file space to a hierarchical organization and finally a stream organization.

While the area, delay, and power dissipation of a register file are all important metrics for measuring performance, in this lecture we decided to focus on area for simplicity.
The paper presents a variety of parameters for measuring area, described in detail on its last page. The area of a single register cell is proportional to $(p + w)(p + h)$, where $p$ is the number of word lines in one dimension and bit lines in the other, and $h$ and $w$ are the height and width of the cell in wire tracks. A register file with a large number of ports has an area that grows with $R p^2$, where $R$ is the total number of registers and $p$ is the total number of ports.

1.1 Central Register Organization

Figure 2: Central register organization (one register file serving N ALUs)

As a baseline, the paper first analyzes the central, flat register file architecture, shown in Figure 2. In this architecture, every ALU has two read ports and one write port to the register file.

The area of the central register file architecture is proportional to $N^3$, where $N$ is the number of ALUs served by the register file.

The number of ports in a central register file is given by the equation $p = (3 + p_e)N$, where $p_e$ is the number of external connections needed by the register file. Later in the paper, it is assumed that $p_e$ can be effectively replaced by $M$, the number of connections to memory. The value of $M$ is usually $1/16$, indicating that the register file needs three connections per ALU for the ALU itself and $1/16$ of a connection per ALU for memory. This further suggests that the ratio of local register file bandwidth to memory bandwidth is around 48:1 (that is, 3 divided by $1/16$). In Imagine, this ratio is actually much higher, around 250:1.

The number of registers in a central register file is given by $R = (r_a + T r_m)N$, where $r_m$ is the per-ALU number of memory references per cycle, $T$ is the memory latency, and $r_a$ is the number of registers per ALU, set equal to 10 here.

Figure 3: Relationship between $T$ and $r_m$

In this paper, $r_m$, an empirically determined constant, is set equal to 4. $T$, the memory latency, is set equal to 40.
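The port, register, and area relations above can be checked numerically. The following is a minimal sketch, assuming every proportionality constant is 1 and using the values quoted in these notes; the function names are hypothetical and do not come from the paper.

```python
def cell_area(p, w, h):
    """Area of one register cell in wire-track units: (p + w)(p + h)."""
    return (p + w) * (p + h)


def central_ports(n_alus, p_e):
    """Ports on a central register file: p = (3 + p_e) * N."""
    return (3 + p_e) * n_alus


def central_registers(n_alus, r_a, latency, r_m):
    """Registers in a central register file: R = (r_a + T * r_m) * N."""
    return (r_a + latency * r_m) * n_alus


def central_area(n_alus, p_e, r_a, latency, r_m):
    """Many-port approximation: area grows with R * p^2 (constant dropped)."""
    p = central_ports(n_alus, p_e)
    r = central_registers(n_alus, r_a, latency, r_m)
    return r * p * p


# Values quoted in the notes: p_e replaced by M = 1/16, r_a = 10, T = 40, r_m = 4.
M, R_A, T, R_M = 1 / 16, 10, 40, 4

# Doubling the number of ALUs should multiply the modeled area by 2^3.
ratio = central_area(16, M, R_A, T, R_M) / central_area(8, M, R_A, T, R_M)

# Register-file to memory bandwidth ratio implied by 3 ports versus M per ALU.
bandwidth_ratio = 3 / M

print(ratio, bandwidth_ratio)
```

Under these assumptions, going from 8 to 16 ALUs multiplies the modeled area by 8, which is the $N^3$ growth stated above, and the bandwidth ratio evaluates to 48.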
Figure 3 shows the meaning of $r_m$ and $T$. As the memory latency grows, so does the number of outstanding requests to memory and, therefore, the number of registers needed to hold these requests. Additionally, the strip size is proportional to $T$, since we need longer strips as the memory latency gets longer.

In summary, the area of the central register file is proportional to $N^3$, since $p^2 R = (3 + M)^2 N^2 \cdot (r_a + T r_m) N$.

1.2 SIMD Register Organization

The SIMD transformation splits the central register file into $C$ separate clusters. The area needed by the SIMD register file is proportional to $(N/C)^3$.

Figure 4: Comparison of structure and area for SIMD and central register files (clusters of N/C ALUs; plot of log(Area/ALU) versus log(ALU) for the central and SIMD curves, with offsets labeled C and C^2)

If we plot the area per ALU versus number of ALUs for both the central and SIMD architectures, as in Figure 4, we see that both plots are linear with a slope of 2 (since they are dominated by $N^2$). This means that, although a lower constant is associated with the SIMD curve, the area of both register files is still proportional to $N^3$. However, the SIMD curve is shifted to the right of the central curve by a distance of $C$, showing that a SIMD architecture yields essentially $C$ free ALUs over the central architecture.

1.3 Distributed Register Organization

The next transformation is to a distributed register file (DRF), which separates the storage and communication functions of the register file more than the architectures we just examined. In the DRF, each ALU input has its own dedicated register file; the distributed files are tied together by an $N \times 2N$ crossbar switch (two inputs and one output per ALU) that connects each ALU to all of the two-port register files. While the DRF transformation results in substantial area savings, its disadvantage is restricted communication: ALUs can no longer communicate with all other ALUs as simply.
Now some data values must be replicated in multiple register files, and register demand per ALU is increased.

Figure 5: DRF and DRF+SIMD register files (an N x 2N crossbar for the DRF; (N/C) x (2N/C) crossbars for DRF+SIMD; plot of log(Area/ALU) versus log(ALU) for the central, SIMD, DRF, and DRF+SIMD curves)

The area of a DRF is dominated by the switch and is thus proportional to $N^2$. Plotting it on the same axes as the other register files shows a line with slope equal to 1. Transforming the DRF to a SIMD DRF is analogous to transforming a central register file to a SIMD register file; it improves the area by a constant factor, shifting the curve to the right and yielding $C$ free ALUs.

We have explored four register file organizations: central, SIMD, DRF, and DRF+SIMD. Any of these can be
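The slopes and shifts described for the four organizations can be illustrated with a small sketch. It assumes unit proportionality constants, reads the SIMD area as the central model applied to N/C ALUs (so area per ALU goes as $(N/C)^2$), and assumes DRF+SIMD scales as $N/C$ per ALU by analogy; these forms and all function names are illustrative assumptions rather than the paper's exact model.

```python
import math


def area_per_alu(n_alus, organization, clusters=1):
    """Relative register-file area per ALU, with constant factors set to 1."""
    if organization == "central":
        return float(n_alus) ** 2        # total area ~ N^3
    if organization == "simd":
        return (n_alus / clusters) ** 2  # assumed: central model on N/C ALUs
    if organization == "drf":
        return float(n_alus)             # total area ~ N^2, switch-dominated
    if organization == "drf_simd":
        return n_alus / clusters         # assumed by analogy with SIMD
    raise ValueError(f"unknown organization: {organization}")


def loglog_slope(organization, n_lo, n_hi, clusters=1):
    """Slope of log(area per ALU) versus log(number of ALUs)."""
    a_lo = area_per_alu(n_lo, organization, clusters)
    a_hi = area_per_alu(n_hi, organization, clusters)
    return math.log2(a_hi / a_lo) / math.log2(n_hi / n_lo)


C = 4  # hypothetical cluster count
for org in ("central", "simd", "drf", "drf_simd"):
    print(org, loglog_slope(org, 8, 64, clusters=C))

# In this model a SIMD file with C*N ALUs costs the same per ALU as a
# central file with N ALUs, which is the rightward shift of the curve.
shift_matches = area_per_alu(C * 16, "simd", C) == area_per_alu(16, "central")
print(shift_matches)
```

In this model the central and SIMD curves have slope 2 and the two DRF curves have slope 1; clustering changes only the horizontal position of each curve, not its slope.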

