Berkeley COMPSCI 258 - Address Translation for Manycore Systems - D1223416

School name University of California, Berkeley

Course Compsci 258- Parallel Processors

Pages 10

Address Translation for Manycore Systems

Scott Beamer and Henry Cook
Department of Computer Science
University of California, Berkeley
{sbeamer, hcook}@cs.berkeley.edu

ABSTRACT

One of the many challenges of designing efficient manycore systems is to determine where and to what degree shared information is cached locally. In this study we specifically address efficient solutions for distributing virtual-to-physical address translations and keeping them coherent throughout a chip multiprocessor system with hundreds of cores. We evaluate multiple mechanisms in terms of their performance and overhead with the aid of software simulation. Since TLB information is invalidated rarely, we find that the mechanisms with a fast common case performed much better, and that TLB reload overhead (and not communication) was a significant factor in the performance of many benchmarks.

1. INTRODUCTION

Multiprocessor systems present a challenge for computer architects because they by definition are designed to concurrently update machine state. This leads to an implicit tension throughout the memory hierarchy between providing coherence for all replicated state versus having to synchronize on a single shared copy. One type of locally stored state critical to performance is the physical address translation and protection information for an address in virtual memory. In most modern systems this information is cached in a structure known as a Translation Lookaside Buffer. In a large multiprocessor system, if translation information for a shared memory location is not replicated, the translation information itself may become a memory hot spot [7].
However, if multiple copies of translation information exist in local TLBs, we must ensure that the copies are kept consistent with each other and with the shared page table structure.

Several solutions to this translation coherence problem have been proposed and implemented in the distributed multiprocessor domain, but prior work leaves doubt as to which ones scale most effectively with increasing core counts. We seek to find a scheme that is readily applicable to a chip manycore processor domain, specifically energy-efficient handheld device systems. Collocating all processing elements on the same chip reduces the communication overhead of coherence traffic and means that processors will share the same off-chip memory interface, both of which could have a significant impact on the TLB coherence. The dynamic hardware partitioning capabilities of our target architecture also provide a tool for limiting the scope at which coherence must be maintained. Finally, we are very concerned with mechanisms that support highly energy efficient systems, and this informs our decision as well.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. To copy otherwise, to republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Copyright 200X ACM X-XXXXX-XX-X/XX/XX ...$5.00.

Shared address translation information is always writable, even if the mapped page is read only, since it can be changed whenever page permissions are modified by hypervisor mechanisms. Furthermore, even on a system with no TLBs, during the time when a processor has read the address translation information but before it has accessed the referenced page the processor can be considered to hold an implicit copy of the page table entry.
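
To make the stale-copy problem concrete, the following is a minimal Python sketch of a toy model, not the mechanism evaluated in this paper. The names `PageTable`, `Core`, `shootdown`, and `demo` are hypothetical and chosen only for illustration, and the "shootdown" here is a plain loop with none of the interrupts, locks, or acknowledgements a real system would need. It shows that a cached translation keeps its old permissions after the shared page table entry changes, until that cached copy is invalidated.

```python
# Toy model only: every class and function name here is a hypothetical
# illustration, not taken from the paper or from any real OS or simulator.

class PageTable:
    """Shared page table: virtual page number -> (physical frame, writable)."""

    def __init__(self):
        self.entries = {}

    def map(self, vpn, frame, writable):
        self.entries[vpn] = (frame, writable)


class Core:
    """A core with a private TLB that caches page table entries."""

    def __init__(self, page_table):
        self.page_table = page_table
        self.tlb = {}

    def translate(self, vpn):
        # On a TLB miss, reload the entry from the shared page table.
        if vpn not in self.tlb:
            self.tlb[vpn] = self.page_table.entries[vpn]
        return self.tlb[vpn]

    def invalidate(self, vpn):
        # Drop the cached copy, if any, so the next access reloads it.
        self.tlb.pop(vpn, None)


def shootdown(cores, vpn):
    """Invalidate one translation in every core's TLB (greatly simplified)."""
    for core in cores:
        core.invalidate(vpn)


def demo():
    pt = PageTable()
    pt.map(0x10, frame=0x2A, writable=True)
    cores = [Core(pt) for _ in range(4)]
    for core in cores:
        core.translate(0x10)  # every core now caches the writable mapping

    # The page becomes read-only in the shared page table.
    pt.map(0x10, frame=0x2A, writable=False)
    stale = [core.translate(0x10) for core in cores]  # served from the TLBs

    shootdown(cores, 0x10)
    fresh = [core.translate(0x10) for core in cores]  # reloaded after misses
    return stale, fresh
```

In this sketch `demo()` returns two lists: the first still reports the page as writable on every core, and the second reports it as read-only once each TLB has been invalidated and reloaded.
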
On a multiprocessor we must use some combination of locks, interrupts or atomic operations to prevent observable states where local translation information does not match actual page table entries. This isolation can be achieved by limiting which pages' status and protections can be changed, and we must do this in a way that is aware of all PTE copies implicitly or explicitly resident in processors or TLBs. Synchronization often results in serialization, whereas optimizations that enhance parallelism can result in increased coherence network traffic.

1.1 ParLab Influence

The Universal Parallel Computing Research Laboratory (ParLab) is attempting to solve the parallel computing problem with an emphasis on what would be appropriate for a mobile consumer device. This direction causes the system to be single-socket and strongly oriented towards low power. Currently the architecture is a homogeneous grid of many tiles, each of which contains a simple processing core, a cache, and some hardware accelerators. To leverage this hardware and to make multiprogramming easier, the system is designed around the concept of hardware partitions. A partition will be a contiguous block of cores assigned to a single application, and inside it the L2 cache is shared. The OS will be deconstructed into a thin hypervisor that can run on any core. These features combined cause an application to look monolithic within a partition.

1.2 Summary

In this paper we present the scalability, performance, and energy efficiency of several TLB coherence schemes. Some of these attributes are evaluated purely in terms of TLB performance, while others require a view of the overhead associated with local data replication, and some will require a holistic view of the system that includes cache performance and dynamically interleaved memory accesses.
We evaluate the performance and energy efficiency of these designs with full-system, simulation-based runs of the PARSEC benchmarks [1] and synthetic system overheads.

We find that TLB entry count and system page size have the most significant impact on performance of the shootdown scheme. Having cores share access to a single TLB is highly effective in reducing TLB misses, and the reduction in delay caused by TLB reloads when using hierarchical TLBs is also significant. While TLB shootdown requires many orders of magnitude more messages than validation, the added synchronization it requires does not have a significant impact on performance.

2. RELATED WORK

There has been much previous work that explores the impact of TLB characteristics and designs on both uniprocessor and multiprocessor systems. These studies have focused on TLB reload mechanisms, memory hierarchy placement, and coherence mechanisms. A wide range of metrics have been used to gauge the impact of translation design decisions on overall system performance.

Jacob et al. [3] perform a
