To appear in the 25th IEEE International Parallel & Distributed Processing Symposium (IPDPS'11)

Exploiting Data Similarity to Reduce Memory Footprints

Susmit Biswas, Bronis R. de Supinski, Martin Schulz
Lawrence Livermore National Laboratory, Livermore, CA 94550, USA
Email: {biswas3, bronis, schulzm}@llnl.gov

Diana Franklin, Timothy Sherwood, Frederic T. Chong
Department of Computer Science, University of California, Santa Barbara, USA
Email: {franklin, sherwood, chong}@cs.ucsb.edu

Abstract—Memory size has long limited large-scale applications on high-performance computing (HPC) systems. Since compute nodes frequently have no swap space, physical memory often limits problem sizes. Increasing core counts per chip and power density constraints, which limit the number of DIMMs per node, have exacerbated this problem. Further, DRAM constitutes a significant portion of overall HPC system cost. Therefore, instead of adding more DRAM to the nodes, mechanisms that manage memory usage more efficiently, preferably transparently, could increase effective DRAM capacity and thus the benefit of multicore nodes for HPC systems.

MPI application processes often exhibit significant data similarity. These data regions occupy multiple physical locations across the individual rank processes within a multicore node and thus offer potential savings in memory capacity. These regions, which reside primarily in the heap, are dynamic, which makes them difficult to manage statically.

Our novel memory allocation library, SBLLmalloc, automatically identifies identical memory blocks and merges them into a single copy. Our implementation is transparent to the application and requires no kernel modifications. Overall, we demonstrate that SBLLmalloc reduces the memory footprint of a range of MPI applications by 32.03% on average and up to 60.87%. Further, SBLLmalloc supports IRS problem sizes over 21.36% larger than standard memory management techniques allow, thus significantly increasing effective system size. Similarly, SBLLmalloc requires 43.75% fewer nodes than standard memory management techniques to solve an AMG problem.

I. MOTIVATION

Memory dominates the cost of HPC systems. The density of DRAM components doubles every 3 years, while that of logic components doubles every 2 years. Thus, memory size per core in commodity systems is projected to drop dramatically, as Figure 1 illustrates. We expect the budget for an exascale system to be approximately $200M, with memory costs accounting for about half of that budget [21]. Figure 2 shows that monetary considerations will lead to significantly less main memory relative to compute capability in exascale systems, even if we can accelerate memory technology improvements [21]. Thus, we must reduce application memory requirements per core.

Virtual memory and swapping can increase effective memory capacity. However, HPC applications rarely use them due to their significant performance penalty and a trend toward diskless compute nodes, which increase reliability.

Fig. 1: Memory Capacity Wall
Fig. 2: Cost curve [21]

Prior proposals to reduce memory footprints based on Distributed Shared Memory (DSM) [14, 16] require users to identify common data regions and to share memory explicitly across nodes. These solutions require source-code modifications to identify common data regions, which make an application difficult to port and to maintain. In addition, the system can only benefit from similarities that the programmer can explicitly identify, making it especially difficult to exploit regions that are usually, but not always, similar. Our studies show that this dynamic similarity is common in MPI programs.

Kernel-level changes [7, 10, 13, 22, 23, 24] can reduce the application changes required to leverage data similarity. However, these solutions demand more effort from system administrators, which complicates their adoption in production systems. The high-performance computing (HPC) community needs an automated, user-level solution that exploits data similarity to reduce memory footprints.

We present SBLLmalloc, a user-level memory management system that transparently identifies identical data regions across the tasks of an MPI application and remaps such regions so that tasks on the same node use the same physical memory. SBLLmalloc traps memory allocation calls and transparently maintains a single copy of identical data in a content-aware fashion using existing system calls.
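To make this mechanism concrete, Listing 1 gives a minimal C sketch of content-aware merging for a single page: a private page is compared against a node-shared reference copy and, if identical, remapped read-only onto the shared frame via existing system calls (shm_open, mmap with MAP_FIXED). The sketch assumes Linux with POSIX shared memory; the shared-object name, pool size, and helper functions are illustrative choices of this listing, not SBLLmalloc's actual interface, and the copy-on-write fault handler that re-privatizes a merged page on write is omitted.

Listing 1: Illustrative sketch of user-level page merging (not SBLLmalloc's code).

#define _GNU_SOURCE
#include <fcntl.h>
#include <string.h>
#include <sys/mman.h>
#include <unistd.h>

#define PAGE_SIZE 4096

static int shm_fd = -1;  /* node-shared backing object */

/* Create (or open) one shared object per node; name and size are
   illustrative. */
static void merge_init(void)
{
    shm_fd = shm_open("/merge_pool_demo", O_CREAT | O_RDWR, 0600);
    if (shm_fd >= 0)
        (void)ftruncate(shm_fd, (off_t)1 << 30);
}

/* Map page 'idx' of the shared object, optionally at a fixed address. */
static void *map_shared_page(size_t idx, void *addr, int prot)
{
    int flags = MAP_SHARED | (addr ? MAP_FIXED : 0);
    return mmap(addr, PAGE_SIZE, prot, flags, shm_fd,
                (off_t)idx * PAGE_SIZE);
}

/* Compare the private page at 'addr' with the shared copy for 'idx';
   if identical, remap 'addr' read-only onto the shared frame so all
   ranks on the node use one physical page.  Assumes some rank already
   published this page's contents at index 'idx'.  Returns 1 if merged. */
static int try_merge_page(void *addr, size_t idx)
{
    void *ref = map_shared_page(idx, NULL, PROT_READ);
    if (ref == MAP_FAILED)
        return 0;
    int identical = (memcmp(addr, ref, PAGE_SIZE) == 0);
    munmap(ref, PAGE_SIZE);
    if (!identical)
        return 0;
    /* MAP_FIXED atomically replaces the private mapping in place.  A
       later store to the read-only page faults; a SIGSEGV handler
       (omitted) would copy the data back to a private page, i.e.,
       copy-on-write. */
    return map_shared_page(idx, addr, PROT_READ) != MAP_FAILED;
}

In a full system, allocation calls must also be interposed (e.g., by preloading the library) and served from page-aligned regions so that blocks can be compared and remapped at page granularity; Section III describes how SBLLmalloc realizes this.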
In this paper, we make the following contributions:
• Detailed application studies that show identical data blocks exist across MPI tasks in many applications;
• A user-level memory management library that reduces memory footprints with no OS or application modifications;
• Scaling and overhead results for the library across a range of input sizes for several large-scale applications;
• A demonstration that SBLLmalloc enables executions of problem sizes that are impossible with the default memory allocator due to out-of-memory errors.

Overall, our system transparently reduces the peak memory consumption of our test applications by up to 60% (32% on average). More importantly, SBLLmalloc supports a 21.36% larger problem size for IRS, an implicit radiation solver application, using the same hardware resources. Further, we can solve an AMG (Algebraic Multi-Grid solver) problem that requires 128 nodes under the default allocator using only 72 nodes with SBLLmalloc (i.e., 43.75% fewer nodes).

The paper is organized as follows. Section II motivates our problem by showing the high degree of data similarity in MPI applications. We describe the SBLLmalloc implementation of our techniques to leverage this similarity in Section III. We describe our experimental setup in Section IV and present extensive results with SBLLmalloc in Section V.

II. DATA SIMILARITY IN MPI APPLICATIONS

Memory capacity significantly limits problem sizes, as a simple example demonstrates. As the problem size grows, more nodes become necessary to meet the increased memory requirements. Figure 3(a) shows a case study of AMG from the ASC Sequoia benchmark suite [2] on a 128-node system with 8 GB of main memory per node. The default memory management line shows that the number of nodes required grows with the problem size.

Reducing memory requirements per process is desirable because of the cost of memory and the gap between memory size and computation power. Previous studies of cache architectures [5] found significant data similarity across concurrent, yet independent, application processes.
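The degree of such similarity can be estimated at user level. Listing 2 sketches one way to do so with MPI: each rank fingerprints fixed-size blocks of a buffer and compares, per block offset, against the other ranks on its node. The block size, the FNV-1a hash, and the function name are illustrative choices of this listing, not the measurement methodology of the paper.

Listing 2: Illustrative sketch of measuring cross-rank block similarity.

#include <mpi.h>
#include <stdint.h>
#include <stdlib.h>

#define BLOCK 4096

/* FNV-1a over one block: a cheap content fingerprint (hash collisions
   are possible but rare). */
static uint64_t fnv1a(const unsigned char *p, size_t n)
{
    uint64_t h = 1469598103934665603ULL;
    for (size_t i = 0; i < n; i++) {
        h ^= p[i];
        h *= 1099511628211ULL;
    }
    return h;
}

/* Fraction of 'buf' blocks whose content matches the block at the same
   offset in at least one other rank on the same node. */
double block_similarity(const void *buf, size_t bytes, MPI_Comm comm)
{
    MPI_Comm node;
    MPI_Comm_split_type(comm, MPI_COMM_TYPE_SHARED, 0, MPI_INFO_NULL,
                        &node);
    int rank, nprocs;
    MPI_Comm_rank(node, &rank);
    MPI_Comm_size(node, &nprocs);

    int nblk = (int)(bytes / BLOCK);
    uint64_t *mine = malloc((size_t)nblk * sizeof *mine);
    uint64_t *all  = malloc((size_t)nblk * nprocs * sizeof *all);
    for (int i = 0; i < nblk; i++)
        mine[i] = fnv1a((const unsigned char *)buf + (size_t)i * BLOCK,
                        BLOCK);
    /* Collect every rank's fingerprints on the node communicator. */
    MPI_Allgather(mine, nblk, MPI_UINT64_T, all, nblk, MPI_UINT64_T,
                  node);

    int shared = 0;
    for (int i = 0; i < nblk; i++)
        for (int r = 0; r < nprocs; r++)
            if (r != rank && all[(size_t)r * nblk + i] == mine[i]) {
                shared++;
                break;
            }

    free(mine);
    free(all);
    MPI_Comm_free(&node);
    return nblk ? (double)shared / nblk : 0.0;
}

Applying such a probe to a rank's large arrays after initialization and between time steps gives a rough lower bound on mergeable memory; comparing only same-offset blocks is conservative, since identical content at different offsets goes uncounted.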