Looking to Parallel Algorithms for ILP and Decentralization

Efraim Berkovich 1,2,3, Bruce L. Jacob 1, Joseph Nuzman 1,2,3, and Uzi Vishkin 1,2,3
Dept. of Electrical Engineering 1 and University of Maryland Institute for Advanced Computer Studies (UMIACS) 2
University of Maryland, College Park, MD

Abstract: We introduce explicit multi-threading (XMT), a decentralized architecture that exploits fine-grained SPMD-style programming; an SPMD program can translate directly to MIPS assembly language using three additional instruction primitives. The motivation for XMT is: (i) to define an inherently decentralizable architecture, taking into account that the performance of future integrated circuits will be dominated by wire costs; (ii) to increase available instruction-level parallelism (ILP) by leveraging expertise in the world of parallel algorithms; and (iii) to reduce hardware complexity by alleviating the need to detect ILP at run time: if parallel algorithms can give us an overabundance of work to do in the form of thread-level parallelism, one can extract instruction-level parallelism with greatly simplified dependence checking. We show that implementations of such an architecture tend towards decentralization and that, when global communication is necessary, overall performance is relatively insensitive to large on-chip delays. We compare the performance of the design to more traditional parallel architectures and to a high-performance superscalar implementation, but the intent is merely to illustrate the performance behavior of the organization and to stimulate debate on the viability of introducing SPMD to the single-chip processor domain; we cannot offer at this stage hard comparisons with well-researched models of execution. When programming for the SPMD model, the total number of operations that the processor has to perform is often slightly higher.
To counter this, we have observed that the length of the critical path through the dynamic execution graph is smaller than in the serial domain, and the amount of ILP is correspondingly larger. Fine-grained SPMD programming connects with a broad knowledge base in parallel algorithms and scales down to provide good performance relative to high-performance superscalar designs even with small input sizes and small numbers of functional units.

Keywords: fine-grained SPMD, parallel algorithms, spawn-join, prefix-sum, instruction-level parallelism, decentralized architecture.

3 Supported by NSF grant 94168902.

1 Introduction

For some time, researchers have known that processor interconnect delays would some day constitute a dominating fraction of overall circuit delay [3, 29]. In future technologies, only a fraction of the processor will be reachable in a single clock cycle. For instance, at 0.1 µm (projected by the SIA to occur within a decade [30]), only 16% of the die will be reachable in a clock cycle [26]. One conclusion is that future architectures must limit global communication; that is, most communication should be localized, and functions that require the propagation of cross-chip signals should be few and infrequently used. This has already surfaced in contemporary research and actual processors: for instance, Farkas et al. have investigated implementing a partitioned register file to circumvent clock-speed issues [8], and this type of partitioned register file has appeared in the recent Alpha 21264 [14]. In addition to the interconnect limit, architects are also faced with a perceived limit in available instruction-level parallelism; that is, our ability to extract ILP is growing less rapidly than our ability to integrate larger numbers of functional units on a single processor.
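The abstract's point that SPMD code may do slightly more total work while shortening the critical path can be seen in the textbook contrast between serial and tree summation. The sketch below is an illustration of that work/depth trade-off, not code from the paper: a pairwise reduction performs the same n-1 additions as a serial sum, but its critical path drops from n-1 dependent additions to about log2(n), so far more operations are independent and available as ILP.

```python
import math

def tree_sum(xs):
    """Pairwise (tree) reduction: n-1 additions in total, the same work
    as a serial sum, but only ceil(log2 n) additions deep."""
    while len(xs) > 1:
        pairs = [xs[i] + xs[i + 1] for i in range(0, len(xs) - 1, 2)]
        if len(xs) % 2:          # odd leftover element carries to next round
            pairs.append(xs[-1])
        xs = pairs
    return xs[0]

def serial_depth(n):
    """Critical path of a left-to-right sum: every add waits on the last."""
    return n - 1

def tree_depth(n):
    """Critical path of the pairwise reduction above."""
    return math.ceil(math.log2(n)) if n > 1 else 0
```

For 1024 inputs, both schemes perform 1023 additions, but the serial chain is 1023 additions deep while the tree is only 10, an overabundance of independent work of exactly the kind the paper proposes to exploit.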
This has led to several innovative paradigms in recent years that use functional units in novel ways [7, 9, 10, 23, 25, 27, 34, 36], as opposed to simply increasing the issue width of traditional superscalar designs. To address these problems, we present an architecture that, like the multiscalar paradigm [10], the M-Machine architecture [9], "Raw" processors [36], single-chip multiprocessors [27], and simultaneous multi-threading [34], maps well to future process technologies that are dominated by interconnect overheads and thus demand decentralization at the architecture level. Like the recent investigations of vector processing to more fully exploit on-chip bandwidth, to better map to future IC technologies, and to take full advantage of the large numbers of functional units available to today's microarchitects [7, 23, 25], this proposed architecture combines two things: (i) programming-model-specific support for an inherently parallel model of execution, and (ii) contemporary concepts in high-performance microarchitecture. In our case, the inherently parallel model of execution is single program, multiple data (SPMD), in which independent threads concurrently execute the same code on different data (e.g., [22]). We use spawn-join "independence of order semantics" (IOS), where each virtual thread initiated by a spawn progresses at its own speed and terminates at a join instruction. Thus, no thread ever needs to wait on another thread. A process executing on the architecture will spawn threads that execute asynchronously in parallel, and when all threads have joined, the process returns to serial mode until the next spawn. Our architecture, which is called explicit multi-threading (XMT), supports this type of programming model. It implements spawn-join semantics in hardware, reminiscent of the n-way spawn-join semantics implemented in software by the KSR system [20] and the 2-way fork-join mechanism implemented in hardware by the P-RISC [28].
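The spawn-join programming model described above can be mimicked in software. The sketch below is an informal emulation, not the paper's hardware design: every virtual thread runs the same code on its own datum (SPMD), no thread waits on a sibling (independence of order), and the serial parent resumes only after all threads have joined. The `PrefixSum` class is a hypothetical stand-in for XMT's prefix-sum primitive, here emulated with a lock-protected fetch-and-add, used in the classic way to let threads claim unique output slots during array compaction.

```python
import threading

def spawn_join(body, n):
    """Spawn n threads all running body(thread_id), then join them all.
    The caller returns to 'serial mode' only after the last join."""
    threads = [threading.Thread(target=body, args=(i,)) for i in range(n)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()

class PrefixSum:
    """Atomic fetch-and-add; a software stand-in for a prefix-sum primitive."""
    def __init__(self):
        self._value = 0
        self._lock = threading.Lock()
    def fetch_add(self, delta):
        with self._lock:
            old = self._value
            self._value += delta
            return old

def compact(src, keep):
    """Parallel array compaction: gather elements satisfying keep(x).
    Each thread runs the same code on its own index; threads that keep
    their element claim a unique destination slot via fetch-and-add."""
    dst = [None] * len(src)
    counter = PrefixSum()
    def body(i):
        if keep(src[i]):
            j = counter.fetch_add(1)   # claim a unique output slot
            dst[j] = src[i]
    spawn_join(body, len(src))
    return dst[:counter.fetch_add(0)]

evens = compact(list(range(10)), lambda x: x % 2 == 0)
# The order of evens depends on thread scheduling; the set does not.
```

Because of independence of order, the output slots are claimed in whatever order threads happen to run; the result is a permutation of the kept elements, which is exactly the nondeterminism the IOS model permits.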
We provide no hardware support for separate stack space, as in P-RISC: all state local to a thread is contained in a separate hardware context, a local register file dedicated to each thread. In this paper, we show that the XMT architecture inherently lends itself to decentralized implementations in which most on-chip communication is localized. We show that the architecture is fairly insensitive to the cost of on-chip communication even when the time it takes a signal to cross the chip is 32 clock cycles (which should correspond roughly to 0.05 µm technology [26, 30]). We describe a data cache organization and show how the programming paradigm inherently supports a relaxed consistency model, thereby allowing a simple cache coherence mechanism. We compare the performance

