Multiscalar Gurindar S Sohi sohi cs wise edu Processors T N Vijaykumar vijay cs wise edu Scott E Breach breach cs wise edu Computer Sciences Department of Wisconsin Madison University Madison WI 53706 ILP processors and compilers typically convert the total ordering of instructions as they appear in the original ordering determined by dependence program into a partial on data and control Control dependence which appear as conditional branches present a major obstacle to highly Abstract Multiscalar processors use a new aggressive implementation paradigm for extracting large quantities of instruction level parallelism from ordinary high level language programs A single program is divided into a collection of tasks by a combination of software and hardware The tasks are distributed to a number of parallel processing units which reside within a processor complex Each of these units fetches and executes instructions belonging to its assigned task The appearance of a single logical register file is maintained with a copy in each parallel processing unit Register results are dynamically routed among the many parallel processing units with the help of compiler generated masks Memory accesses may occur speculatively without knowledge of preceding loads or stores Addresses are disambiguated dynamically many in parallel and processing waits only for true data dependence parallel resolved valid execution because these dependence must be before all subsequent instructions are known to be Focusing on control dependence one can represent a where basic static program as a conlrol flow graph CFG blocks are nodes and arcs represent flow of control from one basic block to another Dynamic program execution can be viewed as walking through tbe program CFG generating a dynamic sequence of basic blocks which have to be executed for a particular run of the program To achieve high performance an ILP processor must attempt to walk through the CFG with a high level of parallelism Branch prediction with speculative execution is one commonly used technique for raising the level of parallelism that can be achieved during the walk The primary constraint on any parallel walk however is that it must preserve the sequential semantics assumed in the program This paper presents the philosophy of the multi scalar paradigm the structure of multiscalar programs and the hardware architecture of a multiscalar processor The paper also discusses performance issues in the mttltiscalar model and compares the multiscalar paradigm with other paradigms Experimental results evaluating the performance of a sample of multiscalar organizations are also presented In the multiscalar model of execution the CFG is partitioned into portions called tasks A multiscalar processor walks through the CFG speculatively taking task sized steps without pausing to inspect any of the instructions within a task A task is assigned to one of a collection of processing units for execution by passing the initial program counter of the task to the processing unit Multiple tasks then execute in parallel on the processing units resulting in an aggregate execution rate of multiple instructions per cycle 1 Introduction The basic paradigm of sequencing through a program the fetch execute cycle using a program counter has been with us for about 50 years A consequence of this sequencing paradigm is that programs are written with the tacit assumption that instructions will be executed in the same order as they appear in the program To achieve high performance however modern processors attempt to execute multiple instructions simultaneously and in some cases in a different order than the original program sequence This reordering may be done in the compiler m the hardware at execution time or both Superscalar and VLIW processors belong to this class of architectures that exploit instruction At this level the concept sounds simple however the key to making it work is the proper resolution of inter task data dependence In particular data that is passed between instructions via registers and memory must be routed correctly by the hardware Furthermore it is in this area of inter task data communication that the multiscalar approach level differs i e parallelism ILP significantly from more traditional multiprocessing methods Permission to copy without fee all or part of this material is granted provided that the copies are not made or distributed for direct commercial advantage the ACM copyright notice and the title of the publication and its date appear and notice is given that copying is by permission of the Association of Computing Machinery To copy otherwise or to republish requires a fee and or specific permission ISCA 95 Santa Margherita Ligure Italy 63 1995 ACM 0 89791 698 0 95 0006 3 50 414 This paper describes the multiscalar approach to exploiting fine grain parallelism or instruction level parallelism or ILP Section 2 provides an overview of the multiscalar paradigm A breakdown of the distribution of the available processing unit cycles in multiscalar execution follows in Section 3 In Section 4 we compare multiscalar with other ILP paradigms A performance evaluation of potential configurations of a multiscalar processor is given in Section 5 In Section 6 we summarize ing remarks this work and offer conclud 2 An Overview of the Multiscalar 2 1 Philosophy Paradigm and Basics The objective of the non sequential walk of the CFG taken by a multiscalar processor is to establish a large and accurate dynamic window of instructions from which independent instructions can be extracted and scheduled for parallel execution An instruction window in ILP parlance is an assemblage of instructions under consideration for execution To perform this function a multi scalar processor walks through the CFG in large steps not instruction by instruction as is the case in a sequential processor nor basic G c1 Data block by basic block but rather task by task Bank A task is a portion of the CFG whose execution corresponds to a contiguous region of the dynamic instrtrction sequence e g part of a basic block a basic block mtrltiple basic blocks a single loop iteration an entire loop a function call etc A program is statically partitioned into tasks which are demarcated by annotations of the CFG more on this in Section 2 2 For each step of its walk a muhiscalar processor assigns a task to a processing unit for execuhon without concern for the actual contents of the task and
View Full Document
Unlocking...