Limitations of Algorithm Analysis

Inherent communication in a parallel algorithm is not all:
• artifactual communication, caused by program implementation and architectural interactions, can even dominate
• thus, the amount of communication is not dealt with adequately

Cost of communication is determined not only by amount:
• also by how communication is structured
• and by the cost of communication in the system

Both are architecture-dependent, and addressed in the orchestration step. To understand the techniques, first look at system interactions.

What is a Multiprocessor?

A collection of communicating processors
• view taken so far
• goals: balance load, reduce inherent communication and extra work

A multi-cache, multi-memory system
• role of these components is essential regardless of programming model
• programming model and communication abstraction affect specific performance tradeoffs

Most of the remaining performance issues focus on the second aspect.

Memory-oriented View

Multiprocessor as an extended memory hierarchy, as seen by a given processor.

Levels in the extended hierarchy:
• registers, caches, local memory, remote memory (topology)
• glued together by the communication architecture
• levels communicate at a certain granularity of data transfer

Need to exploit spatial and temporal locality in the hierarchy:
• otherwise extra communication may also be caused
• especially important since communication is expensive

Uniprocessor

Performance depends heavily on the memory hierarchy. Time spent by a program:

    Time_prog(1) = Busy(1) + Data Access(1)

• divide by cycle count to get the CPI equation

Data access time can be reduced by:
• optimizing the machine: bigger caches, lower latency, ...
• optimizing the program: temporal and spatial locality

Extended Hierarchy

Idealized view: local cache hierarchy + single main memory. But reality is more complex:
• centralized memory: caches of other processors
• distributed memory: some local, some remote; plus network topology
• management of levels
  – caches managed by hardware
  – main memory depends on the programming model
    • SAS: data movement between local and remote is transparent
    • message passing: explicit
• levels closer to the processor have lower latency and higher bandwidth
• improve performance through architecture or program locality
• tradeoff with parallelism; need both good node performance and parallelism

Artifactual Communication in the Extended Hierarchy

Accesses not satisfied in the local portion cause communication.
• Inherent communication, implicit or explicit, causes transfers
  – determined by the program
• Artifactual communication
  – determined by program implementation and architectural interactions
  – poor allocation of data across distributed memories
  – unnecessary data in a transfer
  – unnecessary transfers due to system granularities
  – redundant communication of data
  – finite replication capacity (in cache or main memory)
• Inherent communication assumes unlimited capacity, small transfers, and perfect knowledge of what is needed.
• More on artifactual communication later; first consider communication induced by replication.

Communication and Replication

Communication induced by finite capacity is the most fundamental artifact.
• analogous to cache size and miss rate, or memory traffic, in uniprocessors
• the extended memory hierarchy view is useful for this relationship

View as a three-level hierarchy for simplicity:
• local cache, local memory, remote memory (ignore network topology)

Classify "misses" in the "cache" at any level as for uniprocessors:
– compulsory or cold misses (no size effect)
– capacity misses (size effect)
– conflict or collision misses (size effect)
– communication or coherence misses (no size effect)
• each may be helped or hurt by large transfer granularity (spatial locality)

Working Set Perspective

• hierarchy of working sets
• at the first-level cache (fully associative, one-word blocks), inherent to the algorithm
  – working set curve for the program
• traffic from any type of miss can be local or nonlocal (communication)
• at a given level of the hierarchy (to the next further one)

[Figure: data traffic versus replication capacity (cache size). As capacity grows past the first and second working sets, capacity-generated traffic (including conflicts) falls away, leaving cold-start (compulsory) traffic, inherent communication, and other capacity-independent communication.]

Orchestration for Performance

Reducing the amount of communication:
• inherent: change logical data sharing patterns in the algorithm
• artifactual: exploit spatial and temporal locality in the extended hierarchy
  – techniques often similar to those on uniprocessors

Structuring communication to reduce its cost. Let's examine techniques for both...

Reducing Artifactual Communication

Message passing model
• communication and replication are both explicit
• even artifactual communication is in explicit messages

Shared address space model
• more interesting from an architectural perspective
• occurs transparently due to interactions of program and system
  – sizes and granularities in the extended memory hierarchy

Use the shared address space model to illustrate the issues.

Exploiting Temporal Locality

• Structure the algorithm so working sets map well to the hierarchy
  – techniques that reduce inherent communication often do well here
  – schedule tasks for data reuse once assigned
• Multiple data structures in the same phase
  – e.g. database records: local versus remote
• Solver example: blocking
• More useful when there is O(n^(k+1)) computation on O(n^k) data
  – many linear algebra computations (factorization, matrix multiply)

[Figure: (a) unblocked access pattern in a sweep; (b) blocked access pattern with B = 4.]

Exploiting Spatial Locality

Besides capacity, granularities are important:
• granularity of allocation
• granularity of communication or data transfer
• granularity of coherence

Major spatially-related causes of artifactual communication:
• conflict misses
• data distribution/layout (allocation granularity)
• fragmentation (communication granularity)
• false sharing of data (coherence granularity)

All depend on how spatial access patterns interact with data structures.
• fix problems by modifying data structures, or layout/alignment
• examined later in the context of architectures; one simple example here: data distribution in the SAS solver

Spatial Locality Example

• repeated sweeps over a 2-d grid, each time adding 1 to the elements
• natural 2-d versus higher-dimensional array representation
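As a rough illustration of the blocking technique from the temporal-locality discussion, the sketch below contrasts an unblocked sweep with a B x B blocked sweep over an n x n grid. The function names and the grid-of-lists representation are hypothetical, chosen only for illustration; a real solver would operate on a contiguous array.

```python
# Sketch (assumed names, not from the course materials): blocking a
# repeated sweep so each tile's working set fits in a hierarchy level.

def sweep(grid):
    """Unblocked sweep: add 1 to every element, row by row."""
    n = len(grid)
    for i in range(n):
        for j in range(n):
            grid[i][j] += 1.0

def blocked_sweep(grid, B):
    """Blocked sweep: visit the grid in B x B tiles, so the data a
    tile touches is reused before it is evicted."""
    n = len(grid)
    for ii in range(0, n, B):
        for jj in range(0, n, B):
            for i in range(ii, min(ii + B, n)):
                for j in range(jj, min(jj + B, n)):
                    grid[i][j] += 1.0

# Both orders touch the same elements, so the results agree;
# only the traversal order -- and hence the locality -- differs.
g1 = [[0.0] * 8 for _ in range(8)]
g2 = [[0.0] * 8 for _ in range(8)]
sweep(g1)
blocked_sweep(g2, B=4)
assert g1 == g2
```

The payoff of blocking grows with reuse: for O(n^(k+1)) computation on O(n^k) data, each tile of data is touched many times, so keeping it resident avoids repeated capacity misses.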
[Figure: a grid partitioned among processors P0–P8. (a) Two-dimensional array: a page straddles partition boundaries, making it difficult to distribute memory well, and a cache block straddles a partition boundary. (b) Four-dimensional array: a page does not straddle a partition boundary and each cache block lies within a partition, giving contiguity in memory layout.]

Tradeoffs with Inherent Communication

Partitioning grid
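The four-dimensional representation from the spatial-locality example above can be sketched as follows. This is a minimal illustration, assuming p divides n evenly; the names (make_grid4d, get, put) are hypothetical. The point is that indexing the grid as grid[pi][pj][i][j] keeps each partition's b x b block together, so pages and cache blocks do not straddle partition boundaries.

```python
# Sketch (assumed names): store an n x n grid owned by a p x p grid of
# processes as grid[pi][pj][i][j], so each partition's b x b block of
# elements is kept together rather than interleaved row by row.

def make_grid4d(n, p, fill=0.0):
    b = n // p  # partition side; this sketch assumes p divides n
    return [[[[fill] * b for _ in range(b)]
             for _ in range(p)] for _ in range(p)]

def get(grid4d, i, j, b):
    """Read logical element (i, j) of the 2-d grid."""
    return grid4d[i // b][j // b][i % b][j % b]

def put(grid4d, i, j, b, v):
    """Write logical element (i, j) of the 2-d grid."""
    grid4d[i // b][j // b][i % b][j % b] = v

g = make_grid4d(n=8, p=2)   # four partitions, each a 4 x 4 block
put(g, 5, 2, b=4, v=3.0)    # logical (5, 2) lives in partition (1, 0)
assert get(g, 5, 2, b=4) == 3.0
assert g[1][0][1][2] == 3.0
```

In a real shared-address-space solver the same effect comes from a contiguous allocation per partition, so the OS can place each partition's pages in the memory local to the process that owns it.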


TAMU ECEN 676 - ch3_2
