Limitations of Algorithm Analysis

Inherent communication in a parallel algorithm is not all:
• artifactual communication, caused by program implementation and architectural interactions, can even dominate
• thus, the amount of communication is not dealt with adequately

Cost of communication is determined not only by amount:
• also by how communication is structured
• and by the cost of communication in the system

Both are architecture-dependent, and addressed in the orchestration step. To understand the techniques, first look at system interactions.

What is a Multiprocessor?

A collection of communicating processors
• view taken so far
• goals: balance load, reduce inherent communication and extra work

A multi-cache, multi-memory system
• role of these components is essential regardless of programming model
• programming model and communication abstraction affect specific performance tradeoffs

Most of the remaining performance issues focus on the second aspect.

Memory-oriented View

Multiprocessor as an extended memory hierarchy, as seen by a given processor.

Levels in the extended hierarchy:
• registers, caches, local memory, remote memory (topology)
• glued together by the communication architecture
• levels communicate at a certain granularity of data transfer

Need to exploit spatial and temporal locality in the hierarchy:
• otherwise extra communication may also be caused
• especially important since communication is expensive

Uniprocessor

Performance depends heavily on the memory hierarchy. Time spent by a program:

    Time_prog(1) = Busy(1) + Data Access(1)

• divide by cycle count to get the CPI equation

Data access time can be reduced by:
• optimizing the machine: bigger caches, lower latency, ...
• optimizing the program: temporal and spatial locality

Extended Hierarchy

Idealized view: local cache hierarchy + single main memory. But reality is more complex:
• centralized memory: caches of other processors
• distributed memory: some local, some remote; plus network topology
• management of levels
  – caches managed by hardware
  – main memory depends on the programming model
    • SAS: data movement between local and remote is transparent
    • message passing: explicit
• levels closer to the processor have lower latency and higher bandwidth
• improve performance through architecture or program locality
• tradeoff with parallelism; need both good node performance and parallelism

Artifactual Communication in the Extended Hierarchy

Accesses not satisfied in the local portion cause communication.
• Inherent communication, implicit or explicit, causes transfers
  – determined by the program
• Artifactual communication
  – determined by program implementation and architectural interactions
  – poor allocation of data across distributed memories
  – unnecessary data in a transfer
  – unnecessary transfers due to system granularities
  – redundant communication of data
  – finite replication capacity (in cache or main memory)
• Inherent communication assumes unlimited capacity, small transfers, and perfect knowledge of what is needed.
• More on artifactual communication later; first consider communication induced by replication.

Communication and Replication

Communication induced by finite capacity is the most fundamental artifact.
• analogous to cache size and miss rate, or memory traffic, in uniprocessors
• the extended memory hierarchy view is useful for this relationship

View as a three-level hierarchy for simplicity:
• local cache, local memory, remote memory (ignore network topology)

Classify "misses" in the "cache" at any level as for uniprocessors:
– compulsory or cold misses (no size effect)
– capacity misses (size effect)
– conflict or collision misses (size effect)
– communication or coherence misses (no size effect)
• each may be helped or hurt by large transfer granularity (spatial locality)

Working Set Perspective

• hierarchy of working sets
• at the first-level cache (fully associative, one-word blocks), inherent to the algorithm
  – working set curve for the program
• traffic from any type of miss can be local or nonlocal (communication)
• at a given level of the hierarchy (to the next further one)

[Figure: data traffic versus replication capacity (cache size). As capacity grows past the first and second working sets, capacity-generated traffic (including conflicts) falls away, leaving cold-start (compulsory) traffic, inherent communication, and other capacity-independent communication.]

Orchestration for Performance

Reducing the amount of communication:
• inherent: change logical data sharing patterns in the algorithm
• artifactual: exploit spatial and temporal locality in the extended hierarchy
  – techniques often similar to those on uniprocessors

Structuring communication to reduce its cost. Let's examine techniques for both...

Reducing Artifactual Communication

Message passing model
• communication and replication are both explicit
• even artifactual communication is in explicit messages

Shared address space model
• more interesting from an architectural perspective
• occurs transparently due to interactions of program and system
  – sizes and granularities in the extended memory hierarchy

Use the shared address space model to illustrate the issues.

Exploiting Temporal Locality

• Structure the algorithm so working sets map well to the hierarchy
  – techniques that reduce inherent communication often do well here
  – schedule tasks for data reuse once assigned
• Multiple data structures in the same phase
  – e.g. database records: local versus remote
• Solver example: blocking
• More useful when there is O(n^(k+1)) computation on O(n^k) data
  – many linear algebra computations (factorization, matrix multiply)

[Figure: (a) unblocked access pattern in a sweep; (b) blocked access pattern with B = 4.]

Exploiting Spatial Locality

Besides capacity, granularities are important:
• granularity of allocation
• granularity of communication or data transfer
• granularity of coherence

Major spatially-related causes of artifactual communication:
• conflict misses
• data distribution/layout (allocation granularity)
• fragmentation (communication granularity)
• false sharing of data (coherence granularity)

All depend on how spatial access patterns interact with data structures.
• fix problems by modifying data structures, or layout/alignment
• examined later in the context of architectures; one simple example here: data distribution in the SAS solver

Spatial Locality Example

• repeated sweeps over a 2-d grid, each time adding 1 to the elements
• natural 2-d versus higher-dimensional array representation
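As a rough illustration of the blocking technique from the temporal-locality discussion, the sketch below contrasts an unblocked sweep with a B x B blocked sweep over an n x n grid. The function names and the grid-of-lists representation are hypothetical, chosen only for illustration; a real solver would operate on a contiguous array.

```python
# Sketch (assumed names, not from the course materials): blocking a
# repeated sweep so each tile's working set fits in a hierarchy level.

def sweep(grid):
    """Unblocked sweep: add 1 to every element, row by row."""
    n = len(grid)
    for i in range(n):
        for j in range(n):
            grid[i][j] += 1.0

def blocked_sweep(grid, B):
    """Blocked sweep: visit the grid in B x B tiles, so the data a
    tile touches is reused before it is evicted."""
    n = len(grid)
    for ii in range(0, n, B):
        for jj in range(0, n, B):
            for i in range(ii, min(ii + B, n)):
                for j in range(jj, min(jj + B, n)):
                    grid[i][j] += 1.0

# Both orders touch the same elements, so the results agree;
# only the traversal order -- and hence the locality -- differs.
g1 = [[0.0] * 8 for _ in range(8)]
g2 = [[0.0] * 8 for _ in range(8)]
sweep(g1)
blocked_sweep(g2, B=4)
assert g1 == g2
```

The payoff of blocking grows with reuse: for O(n^(k+1)) computation on O(n^k) data, each tile of data is touched many times, so keeping it resident avoids repeated capacity misses.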
[Figure: a grid partitioned among processors P0–P8. (a) Two-dimensional array: a page straddles partition boundaries, making it difficult to distribute memory well, and a cache block straddles a partition boundary. (b) Four-dimensional array: a page does not straddle a partition boundary and each cache block lies within a partition, giving contiguity in memory layout.]

Tradeoffs with Inherent Communication

Partitioning grid
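The four-dimensional representation from the spatial-locality example above can be sketched as follows. This is a minimal illustration, assuming p divides n evenly; the names (make_grid4d, get, put) are hypothetical. The point is that indexing the grid as grid[pi][pj][i][j] keeps each partition's b x b block together, so pages and cache blocks do not straddle partition boundaries.

```python
# Sketch (assumed names): store an n x n grid owned by a p x p grid of
# processes as grid[pi][pj][i][j], so each partition's b x b block of
# elements is kept together rather than interleaved row by row.

def make_grid4d(n, p, fill=0.0):
    b = n // p  # partition side; this sketch assumes p divides n
    return [[[[fill] * b for _ in range(b)]
             for _ in range(p)] for _ in range(p)]

def get(grid4d, i, j, b):
    """Read logical element (i, j) of the 2-d grid."""
    return grid4d[i // b][j // b][i % b][j % b]

def put(grid4d, i, j, b, v):
    """Write logical element (i, j) of the 2-d grid."""
    grid4d[i // b][j // b][i % b][j % b] = v

g = make_grid4d(n=8, p=2)   # four partitions, each a 4 x 4 block
put(g, 5, 2, b=4, v=3.0)    # logical (5, 2) lives in partition (1, 0)
assert get(g, 5, 2, b=4) == 3.0
assert g[1][0][1][2] == 3.0
```

In a real shared-address-space solver the same effect comes from a contiguous allocation per partition, so the OS can place each partition's pages in the memory local to the process that owns it.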


TAMU ECEN 676 - ch3_2
