CORNELL CS 612 - Software Systems for High-performance Architectures

Course Organization

• Lecturer: Paul Stodghill, [email protected], Rhodes 496
• TA: Rohit Fernandes, [email protected], Rhodes 490
• URL: http://www.cs.cornell.edu/Courses/cs612/2002SP/
• Prerequisites: Experience in writing moderate-sized (about 2000 lines) programs, and interest in software for high-performance computers. CS 412 is desirable but not essential.
• Lectures: two per week
• Course-work: Four or five assignments, which will involve programming on workstations, and a substantial final project.

Resources

• Books (recommended, not required):
  - "Advanced Compiler Design and Implementation", Steve Muchnick, Morgan Kaufmann Publishers.
  - "Introduction to Parallel Computing", Vipin Kumar et al., Benjamin/Cummings Publishers.
  - "Computer Architecture: A Quantitative Approach", Hennessy and Patterson, Morgan Kaufmann Publishers.
• Conferences:
  - ACM Symposium on Principles and Practice of Parallel Programming
  - ACM SIGPLAN Symposium on Programming Language Design and Implementation
  - International Conference on Supercomputing
  - Supercomputing

Objective

We will study software systems that permit application programs to exploit the power of modern high-performance computers:
- some emphasis on applications and architecture
- primary emphasis on restructuring compilers, parallel languages (HPF), and libraries (OpenMP, MPI).

[Figure: software systems sit between applications (computational science, database systems) and high-performance hardware: workstations (SGI Octane, DEC Alpha), shared-memory multiprocessors (SGI Origin 2000, CRAY T3E), and distributed-memory multiprocessors (IBM SP, AC3 Velocity).]

Conventional Software Environment

• Languages: FORTRAN, C/C++, Java
• Compiler: GNU (Dragon-book optimizations)
• O/S: UNIX, Win32

This software environment is not adequate for modern high-performance computers. To understand this, let us look at some high-performance computers.

The HP/CONVEX Exemplar: A Shared-Memory Multiprocessor

[Figure: each hypernode contains PA-RISC CPUs, each with its own cache and agent, CPU-private memory, hypernode private and global memory, and a network cache; hypernodes are connected by a hypernode interconnect into a network of hypernodes.]

Parallelism:
- Coarse-grain parallelism: processors operate in parallel
- Instruction-level parallelism: each processor is pipelined

Memory latencies:
- Processor cache: 10 ns
- CPU private memory: 500 ns
- Hypernode private memory: 500 ns
- Network cache: 500 ns
- Inter-hypernode shared memory: 2 microseconds

Within a hypernode: SMP (symmetric multiprocessor). Across hypernodes: NUMA (non-uniform memory access machine).

Locality of reference is extremely important.

Programming model: C/FORTRAN + OpenMP
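To make the OpenMP programming model concrete, here is a minimal sketch in C (illustrative only, not taken from the course materials; the array names and sizes are made up). The pragma asks the compiler to divide the loop iterations among threads, which all read and write the shared arrays through the machine's shared memory.

  /* Minimal OpenMP sketch (illustrative; not from the course materials).
     Compile with an OpenMP-capable compiler, e.g. cc -fopenmp vecupdate.c */
  #include <stdio.h>

  #define N 1000000

  double x[N], y[N];

  int main(void)
  {
      /* Sequential initialization. */
      for (int i = 0; i < N; i++) {
          x[i] = 1.0;
          y[i] = 2.0;
      }

      /* The iterations of this loop are independent, so they can be
         divided among threads; all threads share x and y. */
      #pragma omp parallel for
      for (int i = 0; i < N; i++)
          y[i] = y[i] + 3.0 * x[i];

      printf("y[0] = %f\n", y[0]);
      return 0;
  }

Without the pragma, the same loop runs sequentially. On a NUMA machine such as the Exemplar, where x and y end up in memory also matters, which is why locality of reference is flagged above as extremely important.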
Distributed-memory computers: each processor has a different address space (e.g., the IBM SP-2).

Programming model: C/FORTRAN + MPI

The AC3 Cluster: network of SMP nodes

[Figure: each node has Pentium III CPUs, each with a cache and an agent, on a fast bus, together with memory and an interface; 64 such nodes are connected by a Giganet interconnect.]

- Each node is a 4-way SMP with Pentium III processors
- 64 nodes are connected by a Giganet interconnect
- Within each node, we have a shared-memory multiprocessor
- Across nodes, we have a distributed-memory multiprocessor
=> Programming such a hybrid machine is even more complex!

Pipelined processors: must exploit instruction-level parallelism

[Figure: pipelined processor: instruction fetch and branch logic feeding separate integer and floating-point pipelines.]

Lessons for software

To obtain good performance on such high-performance computers, an application program must
• exploit coarse-grain parallelism
• exploit instruction-level parallelism
• exploit temporal and spatial locality of reference

Let us study how this is done, and understand why it is so hard to worry about both parallelism and locality simultaneously.

Exploiting coarse-grain parallelism

Matrix-vector multiplication (y = Ax):

  do j = 1..N
    do i = 1..N
      Y[i] = Y[i] + A[i,j]*X[j]

Each row of the matrix can be multiplied by x in parallel (i.e., the inner loop is a parallel loop). If addition is assumed to be commutative and associative, then the outer loop is a parallel loop as well.

Now consider these two loops:

  do i = 1..N
    x(2*i+1) = ... x(2*i) ...

  do i = 1..N
    x(i+1) = ... x(i) ...

One of these loops is parallel, the other is sequential!

Question: How do we tell which loops are parallel?

To exploit pipelines, instructions must be scheduled properly.

LOADs not overlapped:
  LOAD R1, M1
  R1 <- R1 + 1
  LOAD R2, M2
  R2 <- R2 + 1

LOADs overlapped:
  LOAD R1, M1
  LOAD R2, M2
  R1 <- R1 + 1
  R2 <- R2 + 1

- Software pipelining: instruction reordering across loop boundaries
- Hardware vs. software:
  - superscalar architectures: the processor performs reordering on the fly (Intel P6, AMD K5, PA-8000)
  - VLIW and in-order issue architectures: hardware issues instructions in order (CRAY, DEC Alpha 21164)

Exploiting locality (I)

[Figure: distributed-memory machine: processor-memory (P, M) pairs connected by an interconnection network.]

Matrix-vector multiplication (y = Ax):

  do i = 1..N
    do j = 1..N
      Y[i] = Y[i] + A[i,j]*X[j]

Computation distribution: which iterations does each processor do?
Data distribution: which data is mapped to each processor?

Computation and data distributions should be "aligned" to optimize locality: a processor should "own" the data it needs for its computation. Misaligned references require communication.

Question: What are good distributions for MVM?
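One common answer, sketched below, is a row-blocked distribution: each process owns a contiguous block of rows of A and the matching entries of y, and keeps a full copy of x, so every reference in the loop nest is local. This is an illustrative sketch, not the course's prescribed solution; it assumes N is divisible by the number of processes, and all names and sizes are made up.

  /* Row-blocked matrix-vector multiply with MPI (illustrative sketch).
     Each process owns N/p consecutive rows of A and of y and replicates x,
     so the compute loop touches only local data. */
  #include <mpi.h>
  #include <stdlib.h>

  #define N 1024

  int main(int argc, char *argv[])
  {
      int rank, nprocs;
      MPI_Init(&argc, &argv);
      MPI_Comm_rank(MPI_COMM_WORLD, &rank);
      MPI_Comm_size(MPI_COMM_WORLD, &nprocs);

      int rows = N / nprocs;                           /* rows owned locally */
      double *A = malloc(rows * N * sizeof(double));   /* local block of A   */
      double *x = malloc(N * sizeof(double));          /* replicated x       */
      double *y = malloc(rows * sizeof(double));       /* local block of y   */

      for (int i = 0; i < rows * N; i++) A[i] = rank + 1.0;
      for (int j = 0; j < N; j++)        x[j] = 1.0;

      /* Local computation: no communication inside the loop nest. */
      for (int i = 0; i < rows; i++) {
          y[i] = 0.0;
          for (int j = 0; j < N; j++)
              y[i] += A[i * N + j] * x[j];
      }

      /* Communicate only if every process needs the whole result vector. */
      double *yfull = malloc(N * sizeof(double));
      MPI_Allgather(y, rows, MPI_DOUBLE, yfull, rows, MPI_DOUBLE, MPI_COMM_WORLD);

      free(A); free(x); free(y); free(yfull);
      MPI_Finalize();
      return 0;
  }

Because each process updates exactly the rows of y it owns, the computation and data distributions are aligned in the sense above; the only communication is the final all-gather.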
Exploiting locality (II)

[Figure: processor with a first- and second-level cache and local memory.]

The same matrix-vector multiplication loop nest:

  do i = 1..N
    do j = 1..N
      Y[i] = Y[i] + A[i,j]*X[j]

Uniprocessor locality:
• A program must have spatial and temporal locality of reference to exploit caches.
• Straightforward coding of most algorithms results in programs with poor locality.
• Data shackling: automatic blocking of codes to improve locality (a hand-blocked sketch follows).
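To suggest what blocking does, here is a hand-blocked version of that loop nest in C (an illustrative sketch; data shackling derives such code automatically, and the block size B is a made-up tuning parameter). Tiling the j loop keeps a B-element block of x in cache while it is reused by every row of A.

  /* Hand-blocked matrix-vector multiply (illustrative sketch only).
     y is assumed to be zero on entry; B should be chosen so that a block
     of x fits comfortably in cache. */
  #define N 2048
  #define B 256

  double A[N][N], x[N], y[N];

  void blocked_mvm(void)
  {
      for (int jb = 0; jb < N; jb += B)          /* pick a block of x      */
          for (int i = 0; i < N; i++)            /* reuse it for every row */
              for (int j = jb; j < jb + B && j < N; j++)
                  y[i] += A[i][j] * x[j];
  }

The untiled i/j nest streams through all of x for every row of A; the blocked version reuses each cache-sized block of x across all N rows before moving on, which is exactly the temporal locality caches need.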


Worrying simultaneously about parallelism and locality is hard.

Radical solution: multithreaded processors
• Forget about locality.
• The processor maintains a pool of active threads.
• When the current thread makes a non-local memory reference, the processor switches to a different thread.
• If the cost of context-switching is small, this can be a win.
• Examples: Tera, the IBM Blue Gene machine.

Summary

To obtain good performance, an application program must
• exploit coarse-grain parallelism
• exploit temporal and spatial locality of reference
• exploit instruction-level parallelism

Systems software must support
• low-cost process management
• low-latency communication
• efficient synchronization

Mismatch with conventional software environments:
• Conventional languages do not permit expression of parallelism or locality.
• Optimizing compilers focus only on reducing the ...
