Data Parallel Architectures
EE 392C, April 3, 2003
Nuwan Jayasena, Suzanne Rivoire

Papers
- Kozyrakis and Patterson, "Vector vs. superscalar and VLIW architectures for embedded media benchmarks"
- Khailany, Dally, et al., "Imagine: Media Processing with Streams"

Multimedia Applications
- Video, audio, graphics, signal processing
- Characteristics
  - High data-level parallelism (DLP)
  - Little global reuse
  - Lots of computation

A smart multimedia architecture
- Explicitly supports DLP rather than trying to infer it
- Memory hierarchy uses software-managed structures rather than a traditional cache
- Imagine

VIRAM: Vectors for multimedia
- Exploit DLP with vector lanes (pipes)
- Have a vector register file and memory on chip, and scrap the cache

VIRAM contributions
- Lots of vector registers
- Support for multiple data widths
- Vector register permutation instructions
- Virtual addressing

Imagine: Media Processing with Streams
Khailany, Dally, et al.
- Key contributions
  - Stream programming model
  - Stream processor (Imagine architecture)

Stream Programming Model
- Streams are sequences of homogeneous elements (records)
- Kernels apply the same computation to all elements of their input stream(s) and produce output stream(s)
  - Exposes data parallelism among stream elements
  - Potential for instruction-level parallelism in operations applied to one stream element
- A stream program orchestrates the passing of data streams among a set of kernels to implement a desired application

Stream Processor: Imagine Architecture
- Coprocessor
- Many ALUs exploit concurrency in stream applications
  - Multiple clusters in parallel exploit data parallelism (SIMD control)
  - Multiple ALUs within each cluster exploit ILP (VLIW control)
- But putting many ALUs on a chip isn't that hard
- Feeding data to the ALUs is a much harder problem

Streaming
- Bandwidth hierarchy tailored to capture locality
  - Local register files capture intermediate results between individual operations on one stream element
  - Stream register file captures intermediate streams between kernels

Imagine Summary
- Pros
  - Programming model exposes data transfers in bandwidth hierarchy
  - Extends storage hierarchy with 2 levels of software-controlled storage
  - Record-order transfers make DRAM accesses more efficient
- Cons
  - Programming model places more burden on programmer
  - Imagine is not a general-purpose processor
  - Paper doesn't provide adequate comparison points to other architectures

Advantages of Data Parallel Computing
- Cheap parallelism
  - Easy to express and exploit
- Predictable memory accesses
  - Decouple memory accesses from computation to cover memory latency
- Lower power for a given level of performance
  - Lower clock freq., lower supply voltages and/or shallower pipelines, lower instruction overhead, etc.
- Lower design effort
  - Replicate a simple pipeline or cluster
  - Lower circuit implementation effort

Key Issues for EE392C
- Data-parallel on-chip multi-processors
  - Greater exploitation of DLP by running multiple processors in SIMD
  - Exploit TLP in addition to DLP and ILP by each processor running its own thread
  - Control and communication (incl. synchronization) techniques
- Data-parallel architecture as a configuration of polymorphic processors
  - What resources need to be configured?
    - Memory hierarchy
    - Compute units
    - Instructions/control
  - Which resources yield biggest gains with minimal reconfiguration overhead?

Discussion
- What are the key similarities and differences? Pros and cons?
- In what situations might VIRAM outperform Imagine, and vice versa?
- Are data-parallel architectures really the solution for multimedia applications?
  - Could superscalar or VLIW processors achieve the same goals, maybe with multimedia or other ISA extensions?
- How do these architectures exploit ILP?
- What about apps with little or no DLP?

Backup Slides

Communication in a Stream Processor
- LRFs can only be accessed within local cluster
- SRF bank only accessible to local cluster
- Any cluster can access any memory bank
- Inter-cluster network provides some communication among clusters
- Expensive communication network restricted to the level with lowest bandwidth
- Conversely, communication requires going to the most expensive level of bandwidth hierarchy

Imagine Performance Results
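
Stream programming model: an illustrative sketch

The stream programming model described in the slides (streams of homogeneous records, kernels that apply the same computation to every element, and a stream program that passes streams between kernels) can be sketched in a few lines of Python. This is only a minimal illustration under stated assumptions: the `Pixel` record, the `kernel` decorator, the `to_luma` and `threshold` kernels, and the luma weights are all hypothetical names and values chosen for the example. It is not the syntax of the Imagine toolchain, and it models none of the hardware (clusters, LRFs, SRF).

```python
from typing import Callable, Iterable, Iterator, NamedTuple


class Pixel(NamedTuple):
    """One stream element (record). Every element of a stream has this same shape."""
    r: int
    g: int
    b: int


def kernel(fn: Callable) -> Callable[[Iterable], Iterator]:
    """Lift a per-record function into a kernel that runs over a whole stream.

    Each record is processed independently of the others, which is where the
    data parallelism among stream elements comes from.
    """
    def run(stream: Iterable) -> Iterator:
        for record in stream:
            yield fn(record)
    return run


@kernel
def to_luma(p: Pixel) -> int:
    # Several independent multiplies on one record: the kind of work that
    # offers ILP within a single stream element. Weights are illustrative.
    return (77 * p.r + 150 * p.g + 29 * p.b) >> 8


@kernel
def threshold(y: int) -> int:
    # Hypothetical second kernel: binarize the luma stream.
    return 1 if y >= 128 else 0


def stream_program(pixels: Iterable[Pixel]) -> Iterator[int]:
    """The stream program only wires kernels together."""
    luma = to_luma(pixels)   # intermediate stream passed between kernels
    return threshold(luma)


pixels = [Pixel(255, 255, 255), Pixel(0, 0, 0), Pixel(200, 100, 50)]
result = list(stream_program(pixels))
print(result)
```

In this sketch the intermediate `luma` stream plays the role the slides assign to the stream register file (data passed between kernels), while the temporaries inside `to_luma` correspond loosely to values that would stay in local register files. The mapping is an analogy for reading the slides, not a description of how Imagine schedules work.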