Stanford EE 482C - Stream Cache Architectures for Irregular Stream Applications


EE482C Project Proposal: Stream Cache Architectures for Irregular Stream Applications

Timothy Knight, Arjun Singh, Jung Ho Ahn
May 5, 2002

Contents

1 Introduction
2 Architectures Evaluated
2.1 Non-coherent Memory System Stream Cache
2.2 Partially Coherent Memory System Stream Cache
2.3 Non-coherent Read-Only Miss-Fail Cluster Stream Cache
2.4 Non-coherent Read-Only Miss-Stall Cluster Stream Cache
2.5 Add-and-Store
3 Applications Using Irregular Streams

1 Introduction

This project examines various stream cache architectures for enabling the efficient implementation of irregular stream applications.

2 Architectures Evaluated

[Figure 1 diagram: three node organizations, labeled No Cache, Memory System Stream Cache, and Per-Cluster Stream Cache. Labeled blocks: Network, Memory System, Local DRAM, Stream Cache, Stream Register File, Cluster 0 through Cluster 15, Inter-cluster Switch.]

Parameters Varied in Each Architecture Evaluated:
- cache organization
- cache access latency
- cache access throughput
- cache size
- number of nodes

Parameters Kept Constant Across All Experiments:
- number of clusters
- DRAM size
- SRF size
- SRF to MS bandwidth
- SRF to cluster bandwidth
- inter-cluster switch bandwidth
- network bandwidth

Add-and-Store:
Each architecture is evaluated with and without hardware add-and-store support for application 2.

5 Architectures to be Evaluated:
1. Baseline - no stream cache.
2. Memory system cache (non-coherent).
3. Memory system cache (partially coherent).
4. Cluster cache (non-coherent, read-only, miss-fail).
5. Cluster cache (non-coherent, read-only, miss-stall).

Figure 1: Architectures evaluated

Figure 1 illustrates the 5 cache architectures evaluated.
This section describes each of the architectural features employed.

2.1 Non-coherent Memory System Stream Cache

The memory system stream cache, when used, may be non-coherent. The following are salient points regarding this implementation of the stream cache.

• The scalar program has explicit control of the stream cache; prefetches, loads, invalidates, and flushes must all be initiated by stream instructions.
• The stream cache is able to cache both local and remote addresses.
• Stream cache elements can be read-only or write-back. Read-only elements, which are either ValidA or ValidB, can be gang-invalidated.
• The hardware doesn’t provide any coherence support; stream cache entries which are not read-only are write-back, and if multiple nodes are writing to the same value, then they will simply have different copies of the same data in their stream caches.
• When a dirty stream cache entry is flushed, it is written back to memory. If multiple nodes are all updating the same memory location, then the final value after all the updates have completed is non-deterministic.
• The segment registers contain a bit which can override requests to cache data from that segment.
• An add-and-store operation always goes to memory, regardless of whether the destination address is cached.

2.2 Partially Coherent Memory System Stream Cache

The memory system stream cache may also be partially coherent, via the following memory locking mechanism.
Instead of issuing READ and WRITE requests to the memory system, each node will issue either:

• READ-LOCK (RL)
• UNLOCK (U)
• READ-WITH-INTENT-TO-WRITE (RIW)
• WRITE-UNLOCK (WU)

Each memory address can be considered an instantiation of the state machine illustrated in Figure 2.

[Figure 2 diagram: states Valid, Locked, and Read-Only; edge labels RL, U, RIW, WU, and X.]

Figure 2: Memory element finite state machine

2.3 Non-coherent Read-Only Miss-Fail Cluster Stream Cache

Each cluster may have a private stream cache with the following features:

• There is no hardware support for coherency in these caches, either within a node or between nodes.
• The caches are read-only all the time; new entries can be loaded, but existing entries can’t be updated.
• The caches must be explicitly loaded (from the SRF) and invalidated by the microcontroller.
• The assumption is made that the caches never miss; the cache read returns an undefined value in the event of a cache miss.

2.4 Non-coherent Read-Only Miss-Stall Cluster Stream Cache

This incarnation of the cluster cache differs from the miss-fail version in the handling of cache misses. In the event of a miss in the miss-stall cluster cache architecture, the following happens:

• The microcontroller stalls all the clusters.
• The microcontroller branches to a ‘cache load’ subroutine.
• Each cluster cache outputs signals describing whether or not the last access was a miss, and if so, what address was needed.
• The microcontroller checks all the caches which didn’t miss in parallel to see if any have the address which was missed; if it is found, it is copied to the cache which needs it. The communication occurs over the inter-cluster switch.
  To avoid a conflict with an inter-cluster communication which was being prepared when the miss occurred, some local registers on the communication port must be reserved for cache miss handling.
• If the address wasn’t found in another cache, it is fetched from memory, via the SRF, with an index stream of length one.
• The cache is updated with an LRU policy.

2.5 Add-and-Store

The hardware can support an add-and-store operation, in which a node can send a value to be added to a global memory location atomically. The implementation is as follows:

• The clusters output a series of (VirtAddr, PartialSum) tuples which are buffered in the SRF. Physically, these may be stored in the SRF as two different streams: one of addresses, and the other of the partial sums.
• The stream controller issues an instruction to read the buffer of partial sums from the SRF and send add-and-store operations for each value to the appropriate address.
• Upon reception of an add-and-store request, the memory-network interface passes the partial sum to the memory system, which adds it to the specified address.

Note that the software is responsible for the ordering of the partial sum updates across nodes. That is, if multiple nodes are all computing partial sums to add to the same memory location, the software
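
The add-and-store flow of Section 2.5 can be sketched in software as follows. This is only an illustrative model under stated assumptions, not the proposed hardware: the names GlobalMemory, SrfBuffer, and issue_add_and_store are hypothetical, virtual addresses are used directly as memory keys (no translation is modeled), and partial sums are integers.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class GlobalMemory:
    """Hypothetical stand-in for the memory system: address -> value."""
    cells: Dict[int, int] = field(default_factory=dict)

    def add_and_store(self, addr: int, partial_sum: int) -> None:
        # The addition is performed on the memory side; the requesting node
        # never reads the old value. Unwritten addresses are assumed to be 0.
        self.cells[addr] = self.cells.get(addr, 0) + partial_sum


@dataclass
class SrfBuffer:
    """Hypothetical SRF buffer holding the (VirtAddr, PartialSum) tuples
    as two parallel streams: one of addresses, one of partial sums."""
    addrs: List[int] = field(default_factory=list)
    partial_sums: List[int] = field(default_factory=list)

    def push(self, virt_addr: int, partial_sum: int) -> None:
        # Models a cluster emitting one tuple into the SRF.
        self.addrs.append(virt_addr)
        self.partial_sums.append(partial_sum)


def issue_add_and_store(srf: SrfBuffer, memory: GlobalMemory) -> int:
    """Models the stream-controller step: read the buffered streams and send
    one add-and-store request per element. Returns the number of requests."""
    count = 0
    for addr, psum in zip(srf.addrs, srf.partial_sums):
        memory.add_and_store(addr, psum)
        count += 1
    srf.addrs.clear()
    srf.partial_sums.clear()
    return count


if __name__ == "__main__":
    memory = GlobalMemory()
    node_a, node_b = SrfBuffer(), SrfBuffer()
    node_a.push(0x100, 3)
    node_a.push(0x104, 5)
    node_b.push(0x100, 4)   # a second node updating the same location
    issue_add_and_store(node_a, memory)
    issue_add_and_store(node_b, memory)
    print(memory.cells)
```

In this model two nodes contribute to address 0x100 without either one holding a cached copy, which mirrors the statement in Section 2.1 that an add-and-store always goes to memory. The sketch drains one node's buffer at a time; how updates from different nodes are ordered is left to software, as the text notes.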

