LAWRENCE BERKELEY NATIONAL LABORATORY
FUTURE TECHNOLOGIES GROUP

Evolution of Processor Architecture, and the Implications for Performance Optimization

Samuel Williams (1,2)
Jonathan Carter (2), Richard Vuduc (3), Leonid Oliker (1,2), John Shalf (2), Katherine Yelick (1,2), James Demmel (1,2), David Patterson (1,2)

(1) University of California, Berkeley
(2) Lawrence Berkeley National Laboratory
(3) Georgia Institute of Technology
[email protected]

Outline

- Fundamentals
  - Interplay between the evolution of architecture and Little’s Law
  - Yesterday’s constraints: ILP/DLP
  - Today’s constraints: MLP
- Multicore Architectures
- Software Challenges and Solutions
  - Sequential programming model
  - Shared-memory world
  - Message-passing world
  - Optimizing across an evolving hardware base
  - Automating performance tuning (auto-tuning)
  - Example: SpMV
- Summary

Performance Optimization: Contending Forces

Performance optimization balances the contending forces of device efficiency and usage/traffic. We improve time to solution by improving throughput (efficiency) and by reducing traffic. In practice, we are willing to sacrifice one in order to improve the time to solution.

[Figure: two opposing forces - "Improve throughput (Gflop/s, GB/s, etc.)" by restructuring to satisfy Little’s Law, and "Reduce volume of data (flops, GB, etc.)" by implementation and algorithmic optimization]

Basic Throughput Quantities

At all
levels of the system (register files through networks), there are three fundamental (efficiency-oriented) quantities:

- Latency: every operation requires time to execute (e.g., instruction, memory, or network latency)
- Bandwidth: the number of (parallel) operations completed per cycle (e.g., #FPUs, DRAM, network, etc.)
- Concurrency: the total number of operations in flight

Little’s Law

Little’s Law relates these three quantities:

    Concurrency = Latency * Bandwidth

or, equivalently:

    Effective Throughput = Expressed Concurrency / Latency

This concurrency must be filled with parallel operations. You cannot exceed peak throughput with superfluous concurrency: each channel has a maximum throughput.

Basic Traffic Quantities

Traffic often includes:

- # floating-point operations (flops)
- # bytes moved (from registers, cache, DRAM, or the network)

Just as channels have throughput limits, kernels and algorithms can have lower bounds on traffic.

Architects, Mathematicians, Programmers

Architects invent paradigms to improve (peak) throughput and facilitate(?)
Little’s Law. Mathematicians invent new algorithms that improve performance by reducing (bottleneck) traffic. As programmers, we must restructure our algorithms and implementations to exploit these new features.

This often boils down to several key challenges:

- Management of data/task locality
- Management of data dependencies
- Management of communication
- Management of variable and dynamic parallelism

Evolution of Computer Architecture and Little’s Law

Yesterday’s Constraint: Instruction Latency & Parallelism

Single-issue, non-pipelined

Consider a single-issue, non-pipelined processor. In terms of Little’s Law:

- Bandwidth = issue width = 1
- Latency = 1
- Concurrency = 1

It is very easy to get good performance, even if all instructions are dependent.

[Figure: instruction stream showing the issue width, instructions in flight, completed instructions, and future instructions]