Stanford EE 482C - Stream Cache Architectures for Irregular Stream Applications - D2039700


School name Stanford University

Course EE 482C - Advanced Computer Organization Processor Architecture

Pages 6


EE482C Project Proposal: Stream Cache Architectures for Irregular Stream Applications

Timothy Knight, Arjun Singh, Jung Ho Ahn

May 5, 2002

Contents

1 Introduction
2 Architectures Evaluated
2.1 Non-coherent Memory System Stream Cache
2.2 Partially Coherent Memory System Stream Cache
2.3 Non-coherent Read-Only Miss-Fail Cluster Stream Cache
2.4 Non-coherent Read-Only Miss-Stall Cluster Stream Cache
2.5 Add-and-Store
3 Applications Using Irregular Streams

1 Introduction

This project examines various stream cache architectures for enabling the efficient implementation of irregular stream applications.

2 Architectures Evaluated

[Figure 1 panels: No Cache; Memory System Stream Cache; Per-Cluster Stream Cache. Labeled components: Memory System, Local DRAM, Stream Cache, Stream Register File, Cluster 0 through Cluster 15, Inter-cluster Switch, Network.]

Parameters Varied in Each Architecture Evaluated:
- cache organization
- cache access latency
- cache access throughput
- cache size
- number of nodes

Parameters Kept Constant Across All Experiments:
- number of clusters
- DRAM size
- SRF size
- SRF to MS bandwidth
- SRF to cluster bandwidth
- inter-cluster switch bandwidth
- network bandwidth

Add-and-Store:
Each architecture is evaluated with and without hardware add-and-store support for application 2.

5 Architectures to be Evaluated:
1. Baseline - no stream cache.
2. Memory system cache (non-coherent).
3. Memory system cache (partially coherent).
4. Cluster cache (non-coherent, read-only, miss-fail).
5. Cluster cache (non-coherent, read-only, miss-stall).

Figure 1: Architectures evaluated

Figure 1 illustrates the 5 cache architectures evaluated.
This section describes each of the architectural features employed.

2.1 Non-coherent Memory System Stream Cache

The memory system stream cache, when used, may be non-coherent. The following are salient points regarding this implementation of the stream cache.

• The scalar program has explicit control of the stream cache; prefetches, loads, invalidates, and flushes must all be initiated by stream instructions.
• The stream cache is able to cache both local and remote addresses.
• Stream cache elements can be read-only or write-back. Read-only elements, which are either ValidA or ValidB, can be gang-invalidated.
• The hardware doesn't provide any coherence support; stream cache entries which are not read-only are write-back, and if multiple nodes are writing to the same value, then they will simply have different copies of the same data in their stream caches.
• When a dirty stream cache entry is flushed, it is written back to memory. If multiple nodes are all updating the same memory location, then the final value after all the updates have completed is non-deterministic.
• The segment registers contain a bit which can override requests to cache data from that segment.
• An add-and-store operation always goes to memory, regardless of whether the destination address is cached.

2.2 Partially Coherent Memory System Stream Cache

The memory system stream cache may also be partially coherent, via the following memory locking mechanism.
Instead of issuing READ and WRITE requests to the memory system, each node will issue one of:

• READ-LOCK (RL)
• UNLOCK (U)
• READ-WITH-INTENT-TO-WRITE (RIW)
• WRITE-UNLOCK (WU)

Each memory address can be considered an instantiation of the state machine illustrated in Figure 2.

[Figure 2 diagram: states Valid, Locked, and Read-Only; transitions labeled RL, U, RIW, and WU; some transitions marked X.]

Figure 2: Memory element finite state machine

2.3 Non-coherent Read-Only Miss-Fail Cluster Stream Cache

Each cluster may have a private stream cache with the following features:

• There is no hardware support for coherency in these caches, either within a node or between nodes.
• The caches are read-only all the time; new entries can be loaded, but existing entries can't be updated.
• The caches must be explicitly loaded (from the SRF) and invalidated by the microcontroller.
• The assumption is made that the caches never miss; the cache read returns an undefined value in the event of a cache miss.

2.4 Non-coherent Read-Only Miss-Stall Cluster Stream Cache

This incarnation of the cluster cache differs from the miss-fail version in the handling of cache misses. In the event of a miss in the miss-stall cluster cache architecture, the following happens:

• The microcontroller stalls all the clusters.
• The microcontroller branches to a 'cache load' subroutine.
• Each cluster cache outputs signals describing whether or not the last access was a miss, and if so, what address was needed.
• The microcontroller checks all the caches which didn't miss in parallel to see if any have the address which was missed; if it is found, it is copied to the cache which needs it. The communication occurs over the inter-cluster switch.
To avoid a conflict with an inter-cluster communication that was being prepared when the miss occurred, some local registers on the communication port must be reserved for cache miss handling.
• If the address wasn't found in another cache, it is fetched from memory via the SRF, using an index stream of length one.
• The cache is updated with an LRU policy.

2.5 Add-and-Store

The hardware can support an add-and-store operation, in which a node can send a value to be added to a global memory location atomically. The implementation is as follows:

• The clusters output a series of (VirtAddr, PartialSum) tuples which are buffered in the SRF. Physically, these may be stored in the SRF as two different streams: one of addresses, and the other of the partial sums.
• The stream controller issues an instruction to read the buffer of partial sums from the SRF and send an add-and-store operation for each value to the appropriate address.
• Upon reception of an add-and-store request, the memory-network interface passes the partial sum to the memory system, which adds it to the specified address.

Note that the software is responsible for the ordering of the partial sum updates across nodes. That is, if multiple nodes are all computing partial sums to add to the same memory location, the software
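
As a rough illustration of the add-and-store flow in Section 2.5, the Python sketch below models the three steps in software: cluster output is split into an address stream and a partial-sum stream, a stream-controller step issues one request per element, and the memory adds each partial sum to the addressed location. All names (GlobalMemory, buffer_in_srf, issue_add_and_store, run_example) and the sample addresses and values are hypothetical and not taken from the proposal. The model is single-threaded, uses addresses directly without virtual-to-physical translation, and applies updates in program order, so it says nothing about how the real hardware orders or arbitrates requests.

```python
from collections import defaultdict


class GlobalMemory:
    """Toy stand-in for the memory system (hypothetical name)."""

    def __init__(self):
        self.cells = defaultdict(float)

    def add_and_store(self, addr, partial_sum):
        # The memory system adds the partial sum to the addressed location.
        # In this single-threaded model every call is trivially atomic.
        self.cells[addr] += partial_sum


def buffer_in_srf(tuples):
    """Split (VirtAddr, PartialSum) tuples into two SRF-style streams."""
    addr_stream = [addr for addr, _ in tuples]
    sum_stream = [psum for _, psum in tuples]
    return addr_stream, sum_stream


def issue_add_and_store(memory, addr_stream, sum_stream):
    """Stream-controller step: one add-and-store request per buffered element."""
    for addr, psum in zip(addr_stream, sum_stream):
        memory.add_and_store(addr, psum)


def run_example():
    memory = GlobalMemory()
    # Assumed sample data: two nodes, both contributing to address 0x10.
    node0 = [(0x10, 1.5), (0x20, 2.0)]
    node1 = [(0x10, 0.5), (0x30, 4.0)]
    for cluster_output in (node0, node1):
        addrs, sums = buffer_in_srf(cluster_output)
        issue_add_and_store(memory, addrs, sums)
    return dict(memory.cells)


if __name__ == "__main__":
    print(run_example())
```

In this sketch the two contributions to address 0x10 are accumulated rather than overwriting each other, which is the behavior that distinguishes add-and-store from a plain write-back of a cached copy.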
