LAWRENCE BERKELEY NATIONAL LABORATORY
FUTURE TECHNOLOGIES GROUP

Evolution of Processor Architecture, and the Implications for Performance Optimization

Samuel Williams (1,2)
Jonathan Carter (2), Richard Vuduc (3), Leonid Oliker (1,2), John Shalf (2), Katherine Yelick (1,2), James Demmel (1,2), David Patterson (1,2)

(1) University of California, Berkeley
(2) Lawrence Berkeley National Laboratory
(3) Georgia Institute of Technology
[email protected]

Outline

- Fundamentals
  - Interplay between the evolution of architecture and Little’s Law
  - Yesterday’s constraints: ILP/DLP
  - Today’s constraints: MLP
- Multicore Architectures
- Software Challenges and Solutions
  - Sequential programming model
  - Shared-memory world
  - Message-passing world
  - Optimizing across an evolving hardware base
  - Automating performance tuning (auto-tuning)
  - Example: SpMV
- Summary

Performance Optimization: Contending Forces

Performance optimization balances the contending forces of device efficiency and usage/traffic. We improve time to solution by improving throughput (efficiency) and by reducing traffic. In practice, we are willing to sacrifice one in order to improve the time to solution.

[Figure: two opposing forces - "Improve throughput (Gflop/s, GB/s, etc.)" by restructuring to satisfy Little’s Law, and "Reduce volume of data (flops, GB, etc.)" by implementation and algorithmic optimization]

Basic Throughput Quantities

At all
levels of the system (register files through networks), there are three fundamental (efficiency-oriented) quantities:

- Latency: every operation requires time to execute (e.g., instruction, memory, or network latency)
- Bandwidth: the number of (parallel) operations completed per cycle (e.g., #FPUs, DRAM, network, etc.)
- Concurrency: the total number of operations in flight

Little’s Law

Little’s Law relates these three quantities:

    Concurrency = Latency * Bandwidth

or, equivalently:

    Effective Throughput = Expressed Concurrency / Latency

This concurrency must be filled with parallel operations. You cannot exceed peak throughput with superfluous concurrency: each channel has a maximum throughput.

Basic Traffic Quantities

Traffic often includes:

- # floating-point operations (flops)
- # bytes moved (from registers, cache, DRAM, or the network)

Just as channels have throughput limits, kernels and algorithms can have lower bounds on traffic.

Architects, Mathematicians, Programmers

Architects invent paradigms to improve (peak) throughput and facilitate(?)
Little’s Law. Mathematicians invent new algorithms that improve performance by reducing (bottleneck) traffic. As programmers, we must restructure our algorithms and implementations to exploit these new features.

This often boils down to several key challenges:

- Management of data/task locality
- Management of data dependencies
- Management of communication
- Management of variable and dynamic parallelism

Evolution of Computer Architecture and Little’s Law

Yesterday’s Constraint: Instruction Latency & Parallelism

Single-issue, non-pipelined

Consider a single-issue, non-pipelined processor. In terms of Little’s Law:

- Bandwidth = issue width = 1
- Latency = 1
- Concurrency = 1

It is very easy to get good performance, even if all instructions are dependent.

[Figure: instruction stream showing the issue width, instructions in flight, completed instructions, and future instructions]