Berkeley COMPSCI 258 - Comparative Performance Evaluation of Cache-Coherent NUMA and COMA Architectures - D1681694


School name University of California, Berkeley

Course Compsci 258- Parallel Processors



Comparative Performance Evaluation of Cache-Coherent NUMA and COMA Architectures

Per Stenström†, Truman Joe, and Anoop Gupta
Computer Systems Laboratory
Stanford University, CA 94305

Abstract

Two interesting variations of large-scale shared-memory machines that have recently emerged are cache-coherent non-uniform-memory-access machines (CC-NUMA) and cache-only memory architectures (COMA). They both have distributed main memory and use directory-based cache coherence. Unlike CC-NUMA, however, COMA machines automatically migrate and replicate data at the main-memory level in cache-line sized chunks. This paper compares the performance of these two classes of machines. We first present a qualitative model that shows that the relative performance is primarily determined by two factors: the relative magnitude of capacity misses versus coherence misses, and the granularity of data partitions in the application. We then present quantitative results using simulation studies for eight parallel applications (including all six applications from the SPLASH benchmark suite). We show that COMA's potential for performance improvement is limited to applications where data accesses by different processors are finely interleaved in memory space and, in addition, where capacity misses dominate over coherence misses. In other situations, for example where coherence misses dominate, COMA can actually perform worse than CC-NUMA due to increased miss latencies caused by its hierarchical directories. Finally, we propose a new architectural alternative, called COMA-F, that combines the advantages of both CC-NUMA and COMA.

1 Introduction

Large-scale multiprocessors with a single address space and coherent caches offer a flexible and powerful computing environment. The single address space and coherent caches together ease the problem of data partitioning and dynamic load balancing.
They also provide better support for parallelizing compilers, standard operating systems, and multiprogramming, thus enabling more flexible and effective use of the machine. Currently, many research groups are pursuing the design and construction of such multiprocessors [12, 1, 10].

† Per Stenström's address is Department of Computer Engineering, Lund University, P.O. Box 118, S-221 00 LUND, Sweden.

Permission to copy without fee all or part of this material is granted provided that the copies are not made or distributed for direct commercial advantage, the ACM copyright notice and the title of the publication and its date appear, and notice is given that copying is by permission of the Association for Computing Machinery. To copy otherwise, or to republish, requires a fee and/or specific permission.

As research has progressed in this area, two interesting variants have emerged, namely CC-NUMA (cache-coherent non-uniform memory access machines) and COMA (cache-only memory architectures). Examples of CC-NUMA machines are the Stanford DASH multiprocessor [12] and the MIT Alewife machine [1], while examples of COMA machines are the Swedish Institute of Computer Science's Data Diffusion Machine (DDM) [10] and Kendall Square Research's KSR1 machine [4].

Common to both CC-NUMA and COMA machines are the features of distributed main memory, scalable interconnection network, and directory-based cache coherence. Distributed main memory and scalable interconnection networks are essential in providing the required scalable memory bandwidth, while directory-based schemes provide cache coherence without requiring broadcast and consuming only a small fraction of the system bandwidth. In contrast to CC-NUMA machines, however, in COMA the per-node main memory is converted into an enormous secondary/tertiary cache (called attraction memory (AM) by the DDM group) by adding tags to cache-line sized chunks in main memory.
A consequence is that the location of a data item in the machine is totally decoupled from its physical address, and the data item is automatically migrated or replicated in main memory depending on the memory reference pattern.

The main advantage of the COMA machines is that they can reduce the average cache miss latency, since data are dynamically migrated and replicated at the main-memory level. However, there are also several disadvantages. First, allowing migration of data at the memory level requires a mechanism to locate the data on a miss. To avoid broadcasting such requests, current machines use a hierarchical directory structure, which increases the miss latency for global requests. Second, the coherence protocol is more complex because it needs to ensure that the last copy of a data item is not replaced in the attraction memory (main memory). Also, as compared to CC-NUMA, there is additional complexity in the design of the main-memory subsystem and in the interface to the disk subsystem.

Even though CC-NUMA and COMA machines are being built, so far no studies have been published that evaluate the performance benefits of one machine model over the other. Such a study is the focus of this paper. We note that the paper focuses on the relative performance of the two machines, and not on the hardware complexity. We do so because without a good understanding of the performance benefits, it is difficult to argue about what hardware complexity is justified.

© 1992 ACM 0-89791-509-7/92/0005/0080 $1.50

The organization of the rest of the paper is as follows. In the next section, we begin with detailed descriptions of CC-NUMA and COMA machines. Then in Section 3, we present a qualitative model that helps predict the relative performance of applications on CC-NUMA and COMA machines. Section 4 presents the architectural assumptions and our simulation environment.
It also presents the eight benchmark applications used in our study, which include all six applications from the SPLASH benchmark suite [14]. The performance results are presented in Section 5. We show that COMA's potential for performance improvement is limited to applications where data accesses by different processors are interleaved at a fine spatial granularity and, in addition, where capacity misses dominate over coherence misses. We also show that for applications which access data at a coarse granularity, CC-NUMA can perform nearly as well as a COMA by exploiting page-level placement or migration. Furthermore, when coherence misses dominate, CC-NUMA often performs better than COMA. This is due to the extra latency introduced by the hierarchical directory structure in COMA. In Section 6, we present a
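
The trade-off summarized in the introduction (capacity versus coherence misses, and fine versus coarse data partitions) can be illustrated with a toy average-miss-latency calculation. This is a minimal sketch and not the qualitative model of Section 3, which this excerpt does not include. The latency values, the function names, and the simplifying assumption that every coherence miss is served remotely are all hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Latencies:
    """Hypothetical miss latencies in processor cycles (not taken from the paper)."""
    local: int = 30         # miss served by the node's own memory
    remote_numa: int = 100  # CC-NUMA remote miss (assumed flat directory lookup)
    remote_coma: int = 130  # COMA remote miss (assumed slower: hierarchical directory)


def avg_miss_latency_numa(capacity_frac, local_home_frac, lat=Latencies()):
    """Average miss latency for CC-NUMA under the toy assumptions.

    capacity_frac:   fraction of misses that are capacity misses
    local_home_frac: fraction of capacity misses whose home memory is local
                     (high for coarse partitions with good page placement)
    """
    coherence_frac = 1.0 - capacity_frac
    capacity_cost = (local_home_frac * lat.local
                     + (1.0 - local_home_frac) * lat.remote_numa)
    # Assumption: coherence misses always go to a remote node.
    return capacity_frac * capacity_cost + coherence_frac * lat.remote_numa


def avg_miss_latency_coma(capacity_frac, am_hit_frac, lat=Latencies()):
    """Average miss latency for COMA under the toy assumptions.

    am_hit_frac: fraction of capacity misses that hit in the local
                 attraction memory thanks to migration and replication
    """
    coherence_frac = 1.0 - capacity_frac
    capacity_cost = (am_hit_frac * lat.local
                     + (1.0 - am_hit_frac) * lat.remote_coma)
    # Assumption: coherence misses always go remote, via the hierarchy.
    return capacity_frac * capacity_cost + coherence_frac * lat.remote_coma


if __name__ == "__main__":
    # Fine-grained partitions, capacity misses dominate.
    print(avg_miss_latency_numa(0.9, 0.1), avg_miss_latency_coma(0.9, 0.9))
    # Coherence misses dominate.
    print(avg_miss_latency_numa(0.1, 0.1), avg_miss_latency_coma(0.1, 0.9))
```

With these made-up parameters the sketch reproduces only the direction of the claims above: COMA comes out ahead when capacity misses dominate and few of them have a local home in CC-NUMA, while CC-NUMA comes out ahead when coherence misses dominate or when page placement makes most capacity misses local.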
