Berkeley COMPSCI 258 - L2 to Off-Chip Memory Interconnects for CMPs


L2 to Off-Chip Memory Interconnects for CMPs

Daniel Killebrew
University of California, Berkeley
Computer Science
[email protected]

Lee
University of California, Berkeley
Computer Science
[email protected]

ABSTRACT

In recent years, chip multiprocessors (CMPs) have seen an increasing asymmetry between the number of cores and the number of memory access points on a single die, prompting a new study of network topologies that efficiently connect many nodes to few nodes. In this paper, we evaluate the latency and power efficiency of the tapered fat tree (TFT) topology on a cycle-accurate simulator intended to model the TILE64 multiprocessor. We replace the original mesh network with two TFT networks (one for memory requests, one for memory responses) and run four synthetic benchmarks modeled after those in the PARSEC suite. Because several connections in the TFT network require global wires, we also modeled the multi-cycle latencies using a wire-delay model. Our simulator keeps track of activity factors in each of the routers, which we combine with the Orion power models to determine the switching power of the routers. We determined that for applications with large amounts of sharing and little off-chip traffic, the TFT topology offers negligible advantage over the mesh. However, the benchmarks that exhibited large amounts of off-chip traffic completed a workload up to 33.6% faster on the TFT than the same workload on the mesh. The TFT topology also has better power efficiency in all cases except canneal, which dissipates a large amount of power in the inter-router interconnect due to the high volume of stores. Although this benchmark dissipated more average power on the TFT, the total energy consumed is less than on the mesh because it completed faster. At the cost of extra design complexity, our results suggest that the TFT topology offers better latency and energy efficiency overall than the mesh for memory-intensive applications running on multicore processors.

1. INTRODUCTION

Chip multiprocessors (CMPs), single silicon dies with many simple cores, have gained increasing attention as the computing industry has hit the combined power and ILP walls [1]. Manycores present new engineering challenges and opportunities due to their new architecture. The proximity of cores makes interprocessor communication much cheaper, but providing enough memory bandwidth for all these cores to function becomes a more serious problem.

As CMPs have evolved from single-core architectures, we can observe the evolution of the network used to connect cores to both off-chip and on-chip memory. All designs have a private L1 cache, but the L2 cache, which is shared when there are a few cores, becomes more private as cores are added. In chips with fewer cores, interprocessor communication occurs through a shared-memory model, while those with more cores add explicit interprocessor networks for message passing.

Besides offering a possible solution to the problems that have stymied the uniprocessors of the past, CMPs will make interesting new workloads feasible. Applications such as data mining and advanced video encoding will place an increased strain on the memory system. Some of these workloads demonstrate increasing per-core performance as more cores are added to a chip [2]. This translates into an exponential growth in memory bandwidth demand. Clearly this exponential growth has significant implications for future CMPs.

In this paper we examine several modern multicore chips, analyzing design trends in the off-chip memory interconnect (OCMI). We summarize the results of the PARSEC benchmark suite as they apply to OCMIs. We look at a modern 64-core chip, the TILE64, and use the results of the PARSEC suite to motivate a redesign of the TILE64's OCMI. We provide a comparison of our proposal and the TILE64's current solution using detailed power analysis and simulation results.

1.1 Manycore Bandwidth Challenge

The SPLASH-2 benchmarks [3] have been the standard parallel benchmarking suite for many years, but the suite is now somewhat dated. Some of the algorithms it uses are now inferior to newer methods for solving the same problems. Additionally, its benchmarks are aimed at HPC and graphical simulations. CMPs will be used by a much more diverse set of users with a correspondingly wider array of applications. Thus, we use the PARSEC benchmark suite [2] as the basis for our analysis.

These new applications will have a large appetite for memory bandwidth. Of the 12 applications in the PARSEC suite, all but three have a working set larger than 8 MB. This is much larger than any current on-chip cache, distributed or otherwise. Even if on-chip caches were to grow, these working sets grow along with the data set the algorithm operates on. Applications such as data mining become more useful as the data set grows, so we expect data set sizes to grow with time. As a result, we expect the working sets of at least eight of these applications to exceed future on-chip cache sizes, creating a distinct need for off-chip memory bandwidth.

For half of the applications, the off-chip memory bandwidth demanded by each core scales linearly with the number of cores. This means that as cores are added, aggregate bandwidth demand increases both because of the cores added and because the cores already on the chip increase their individual demands. This results in an exponentially increasing demand for off-chip bandwidth.

The PARSEC applications do not exhibit the same demand for interprocessor communication that they do for off-chip bandwidth. Four of the 12 exhibit an insignificant amount of sharing. For those with true sharing, the data sets are so large that shared accesses by different processors are spaced far apart in time. The effect is that by the time the second processor makes a shared access, the data has already been moved off-chip because of a capacity conflict. Two of the applications are parallelized in such a way that data is passed between different computational stages of the algorithm, allowing efficient data sharing. If cores were added, this on-chip traffic would scale proportionally, but the interprocessor network would presumably scale with it. None of the applications demonstrates increasing usage of shared (or private) cache space as the number of processors scales.

The result will be a divergence between the requirements of the interprocessor communication network and the OCMI. Interprocessor
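To make the scaling argument in Section 1.1 concrete, the following is a minimal back-of-the-envelope sketch. The baseline value b0 and the core counts are placeholders chosen only to show the shape of the growth, not measurements from PARSEC.

    # Back-of-the-envelope sketch of the scaling argument in Section 1.1:
    # if each core's off-chip bandwidth demand grows linearly with the
    # core count, the aggregate demand grows much faster than the core
    # count itself. b0 is an assumed per-core baseline, not a PARSEC figure.

    b0 = 0.5  # GB/s demanded by a single core running alone (assumed)

    for cores in (4, 16, 64, 256):
        per_core = b0 * cores         # each core's demand rises with core count
        aggregate = per_core * cores  # total off-chip bandwidth demand
        print(f"{cores:4d} cores: {per_core:7.1f} GB/s per core, "
              f"{aggregate:9.1f} GB/s aggregate")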
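Similarly, the power-versus-energy comparison made for canneal in the abstract (higher average power on the TFT, but lower total energy because the run finishes sooner) can be illustrated with the standard CMOS dynamic-power relation P = a * C * V^2 * f. This is only a sketch with assumed capacitance, voltage, frequency, activity factors, and runtimes; it is not the Orion model used in the paper.

    # Sketch of switching power from a router activity factor, plus the
    # power-vs-energy trade-off noted in the abstract. All constants are
    # illustrative placeholders, not Orion parameters or measured values.

    def switching_power(activity, cap_farads, vdd_volts, freq_hz):
        """Dynamic switching power: P = a * C * V^2 * f."""
        return activity * cap_farads * vdd_volts ** 2 * freq_hz

    CAP = 5e-12    # 5 pF of switched capacitance per router (assumed)
    VDD = 1.0      # supply voltage in volts (assumed)
    FREQ = 750e6   # 750 MHz router clock (assumed)

    # Suppose the TFT routers are busier (higher activity factor) but the
    # workload finishes sooner than on the mesh, as observed for canneal.
    p_mesh = switching_power(0.15, CAP, VDD, FREQ)   # watts
    p_tft = switching_power(0.20, CAP, VDD, FREQ)    # watts
    t_mesh, t_tft = 1.00, 0.70                       # normalized runtimes (assumed)

    # Energy = power * time: higher average power can still mean less energy.
    print(f"mesh: P = {p_mesh * 1e3:.2f} mW, E = {p_mesh * t_mesh * 1e3:.3f} (mW * norm. time)")
    print(f"TFT:  P = {p_tft * 1e3:.2f} mW, E = {p_tft * t_tft * 1e3:.3f} (mW * norm. time)")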

