EE482C Project
Stream Cache Architectures for Irregular Stream Applications
Timothy Knight, Arjun Singh, Jung Ho Ahn

Motivation
- Irregular stream applications
  - Perform a data-dependent traversal of an arbitrary graph
  - Pointer chasing
- Index streams
  - A method for traversing irregular streams on a "traditional" stream architecture (i.e., Imagine)
  - Inefficient: wasted SRF space, wasted memory bandwidth
- Architectural enhancements
  - Stream cache
  - Indexed SRF
  - ... ?

Stream Cache
- Amplifies memory bandwidth in the presence of temporal locality of reference
- Avoids repeated memory accesses to the same address
- Allowing clusters to issue memory requests through a cluster cache avoids replicating data in the SRF

Applications Studied
- Application 1
  - All updates are committed after all computations are done
  - Pseudocode:

      // Computation phase
      for each vertex v:
          v.newdata = kernel(v.data, v.neigh.data)
      // Update phase
      for each vertex v:
          v.data = v.newdata

Applications Studied (cont.)
- Application 2
  - Updates are committed as soon as they are computed
  - Conceptually an advancing wave-front of computation in a DAG
  - Chosen to create coherence issues
  - Pseudocode:

      repeat until all vertices are updated:
          for each vertex v with valid predecessors:
              v.data = kernel(v.neigh.data)

Architectures Modeled
1. No cache (baseline): 16 clusters, 8-banked memory
2. Cache accessible from the SRF
3. Dedicated cache accessible from the clusters
4. Cache accessible from both the clusters and the SRF
All cache models:
- Have 8 banks, matching the memory
- Have 1 word per cache line
- Are not coherent

Architectures Modeled (cont.)
[Block diagrams of Architectures 1-3: each shows ALUs, LRFs, an SRF, local DRAMs, and the network; Architectures 2 and 3 add a cache.]
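The two update disciplines above can be sketched in plain Python. This is a hedged illustration only: the graph representation, the `kernel` placeholder, and all names are our own stand-ins, not the project's macrocode.

```python
# Sketch of the two update disciplines on a toy graph.
# Each vertex holds one data word; neighbors[v] lists v's neighbor indices.
# kernel() is a stand-in for the application's per-record kernel.

def kernel(values):
    return sum(values) + 1  # arbitrary placeholder computation

def app1_step(data, neighbors):
    """Application 1: compute everything, then commit all updates."""
    # Computation phase: every read sees only the old values.
    newdata = [kernel([data[v]] + [data[n] for n in neighbors[v]])
               for v in range(len(data))]
    # Update phase: commit all at once.
    return newdata

def app2_sweep(data, neighbors, order):
    """Application 2: commit each update as soon as it is computed,
    visiting vertices in a predecessor-first (DAG) order."""
    for v in order:
        data[v] = kernel([data[n] for n in neighbors[v]])
    return data
```

Because Application 2 writes results back immediately, later vertices in the sweep read the freshly updated values of their predecessors, which is exactly what creates coherence issues once those values may also sit in a cache.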
[Block diagram of Architecture 4: ALUs, LRFs, an SRF, a cache, local DRAMs, and the network, with the memory system and the clusters labeled.]

Hardware Costs
- Cache in the memory system
  - 8 SRAM banks, plus associated logic and buffers
  - Needs 1 word of tag storage per address cached
- Enabling cluster memory access
  - Reorder buffer: 16x 2/3*-ported register files
  - Request FIFO / AG
  - 8 new buses between memory and clusters
- Dedicated cluster cache (additional costs)
  - 2 full 8x8 crossbars

Simulation Environment
Implemented a cycle-by-cycle performance simulator:
- Not functionally cycle-accurate
- Modeled the throughputs, latencies, and resource constraints of the architectures
- Applications coded in "macrocode"
- Average simulation: 5,000,000 cycles in 15 minutes of real time
- Parameters:
  - Data sets: record size, graph size, connectivity, locality
  - Apps: kernel computation time, strip size, cache use
  - Model: throughputs, latencies, memory sizes, number of nodes, cache organization

Results
[Plot: speedup vs. K (app1) for Architectures 1-4; K from 0 to 12,000 cycles, speedup from 0 to 4]

Results (cont.)
[Plot: speedup vs. K (app2) for Architectures 1-4; K from 0 to 12,000 cycles, speedup from 0 to 2.5]

Results (cont.)
[Plot: speedup vs. ROB length (app2) for Architectures 1, 3, and 4; ROB length from 0 to 600, speedup from 0 to 1.8]

Results (cont.)
Other results observed:
- Speedup increases with increasing degree
- A relationship between speedup and the number of words of data per record
- Constant speedup with an increasing number of nodes for small data sets; large data sets with many nodes blow up our simulator
- Constant speedup with an increasing number of vertices

Conclusion
- A stream cache can improve performance on an irregular stream application
- Speedups of up to 3.5 were observed in our experiments
- A cache is most useful when the kernel performs little computation on each record
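To make the modeled cache parameters concrete, here is a minimal hit/miss sketch of a non-coherent, 8-bank, 1-word-per-line cache. This is our own toy model, not the project's simulator; the direct-mapped organization and the number of lines per bank are assumptions for illustration.

```python
class StreamCacheModel:
    """Toy hit/miss model of a stream cache with the slide's parameters:
    8 banks (matching the memory banks), 1 word per cache line,
    no coherence. Addresses are word addresses; bank = addr % banks.
    Each bank is assumed direct-mapped, one tag word per line."""

    def __init__(self, banks=8, lines_per_bank=256):
        self.banks = banks
        self.lines = lines_per_bank
        self.tags = [[None] * lines_per_bank for _ in range(banks)]
        self.hits = 0
        self.misses = 0

    def access(self, addr):
        bank = addr % self.banks
        index = (addr // self.banks) % self.lines
        if self.tags[bank][index] == addr:
            self.hits += 1          # served from the cache
            return True
        self.tags[bank][index] = addr  # miss: fetch the word from memory
        self.misses += 1
        return False
```

With temporal locality in the address stream, repeated accesses hit and the cache amplifies effective memory bandwidth; a stream of all-distinct addresses misses every time and gains nothing.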
- There are various pros and cons among the different cache architectures
- Which architecture performs best is determined by the choice of application and data set

Discussion
Limitations on the speedup from a stream cache:
- Amdahl's Law: only some of the memory requests exhibit temporal locality
- Bank conflicts
- Dependencies in the application
- Available locality
- "Preprocessing" overhead

Future Work
- Study cache performance on real applications in
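The Amdahl's Law limit above can be made concrete: if only a fraction f of memory requests exhibit temporal locality (and so can be served by the cache), and a cache hit is s times faster than a memory access, the overall memory-time speedup is bounded regardless of how fast the cache is. A small sketch with illustrative numbers of our own choosing:

```python
def amdahl_speedup(f, s):
    """Upper bound on memory-time speedup when a fraction f of
    requests hit the cache and each hit is s times faster."""
    return 1.0 / ((1.0 - f) + f / s)

# Even with an arbitrarily fast cache, locality caps the speedup:
# with f = 0.75, the bound is 1 / (1 - 0.75) = 4x.
```

This is why the observed speedups saturate: bank conflicts, application dependencies, and preprocessing overhead only push the achieved speedup further below this bound.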