EE482C: Advanced Computer Organization
Lecture #15
Stream Processor Architecture
Stanford University
Tuesday, 28 May 2000

Project Updates
Lecture #15: Thursday, 23 May 2002
Lecturer: Project Groups
Scribe: Ayodele Thomas
Reviewer: Mattan Erez

1 Group 1: Aspect Ratio

Group Members: James Bonanno, Suzanne Rivoire, Rex Petersen

Three key areas are being explored to evaluate the impact of varying aspect ratios on the Imagine architecture: 1) Cost Model, 2) JPEG Compression Algorithm, 3) Definition of configurations.

1) Cost Model. The cost model is based on performance per unit area. Approximations are used to calculate the effect of different configurations on the chip area. The aspects being varied include the number of ALUs per cluster, the ALU size, the number of clusters, and the number of threads.

2) JPEG. The application being used to evaluate the architectures is a JPEG-like algorithm that uses the discrete cosine transform (DCT) and run length coding (RLC).

3) Architecture Configurations. The limiting factors are being explored to determine the best configurations to exploit DLP, ILP, and TLP.

One important question raised during the comment period was: "Is it really realistic for the area to be only 15% greater than the current Imagine with a single cluster containing 48 ALUs?" The effect of components such as the scratchpad was apparently ignored in the area calculations.

2 Group 2: Viterbi Algorithm

Group Members: John Davis, Andrew Lin, Njuguna Njoroge, Ayodele Thomas

A general description of the Viterbi algorithm was given. The most interesting aspect of implementing the algorithm on Imagine is dealing with dependencies in the feedback path.

Several approaches are being taken to extract parallelism.

1) Extract ILP. The problem here is the detrimental effect of intercluster communication.

2) Approach 2 extracts DLP and TLP by implementing multiple decoders.
DLP is extracted by breaking a single message into frames, which are encoded on separate clusters. To break the dependency chain, the frames are padded so that they begin and end in a predetermined state. This is key because, without the padding, the dependency chain would go back to the beginning of the input stream. TLP is extracted by encoding a separate message on each cluster.

The Viterbi algorithm is also being benchmarked on a DSP running assembly code so that a comparison can be made with the common implementation of the algorithm. The assembly code is functional and being profiled. The StreamC/KernelC code is in the debug stage. A Matlab version has been implemented to create data streams and verify results.

Several optimizations are being studied, including bit packing to reduce the storage footprint.

The work on coding the Viterbi algorithm will be used to give insight into how to manage feedback dependencies in general on the Imagine architecture, as found in algorithms such as IIR filters and Delta-Sigma.

3 Group 3: Compiler

Group Members: Jayanth Gummaraju, Ahbishek Das, Mattan Erez

The compiler group commented that their work is looking more like translation than compilation. They are converting the stream aspect of Brook into StreamC using the Brook metacompiler as a base.

One problem that must be addressed is that Brook has an infinite stream length, while StreamC must have a stream size declared. The syntax for calling kernels has been completed. Currently, files can be split automatically (headers, kernels, etc.).

4 Group 4: Stream Cache

Group Members: Timothy Knight, Arjun Singh, Jung Ho Ahn

The simulation environment for evaluating stream caches has been completed. The effect of having a cache vs. not having a cache is being explored. If an application is not memory limited, a cache should give a benefit.
The locality that can be exploited by the stream cache is due to small kernels that are called repeatedly.

Two versions are being implemented. Version one is a memory system cache, where the cache is located between the memory and the SRF. Version two is a cluster cache, with caches located between the SRF and the clusters.

5 Group 5: IP Address Lookup

Group Members: Henry Fu, Yeow Cheng Ong, Harn Hua Ng

The coding of the IP address lookup application with a single kernel has been completed. It has been tested successfully with 2 Imagine processors. However, there are problems making it work with more than two processors, and it currently works only in debug mode, not with Isim.

Two implementations are being considered, which differ in the way the lookup table and data traffic are distributed. In the first implementation, the lookup table is split in half and distributed to the two clusters. The same traffic streams through both clusters until the best match is found within each cluster, and then those two matches are compared for the best result. In the second implementation, the traffic is split in two and the two clusters share a single copy of the table. In this implementation, each cluster selects the best match for different traffic.

The current experimental lookup table is small enough that the entire table fits in the scratchpad. It was not clear how a realistic table size (> 30K), which would certainly overflow the scratchpad, would be handled.

6 Group 6: Unstructured Mesh

Group Members: Nuwan Jayasena, Yangjin Oh, Hsaio Heng Lee, Anand Ramalingam, Francois Labonte

This group is looking at how Imagine can be used to code an unstructured mesh. Conditional streams are used to help fetch the appropriate data.

One addition to the architecture that is being considered is a cache. A cache would be beneficial if the same values are fetched repeatedly. Another addition is a hardware buffer.
A random dataset is being simulated, and the bank and buffer sizes are varied to reduce the number of potential conflicts.

Two hashing functions are being used: one based on the division method and one based on the multiplication method. Three applications will be used to evaluate the architecture, including Motor, Tire, and Gear.

7 Group 7: Legacy Architectures

Group Members: Chaiyasit Manovit, John J Kim, Sanjit Zubin Biswas, Zi-Bin Yang

The focus of the Legacy Architecture group is using compiler optimizations to model array accesses with affine indices in the Imagine architecture. A goal is to minimize prefetching by using locality analysis and software pipelining.

An attempt is made to determine up front which accesses are likely to be cache misses.
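As a rough illustration of the kind of locality analysis described for Group 7, the sketch below takes an affine array index of the form a*i + b and reports the loop iterations at which the access moves into a different cache line. It is not the group's actual compiler pass: the function name `likely_miss_iterations`, the parameter `line_elems` (cache line size in array elements), and the simplifying assumption that only the most recently touched line is still resident are all hypothetical choices made for this example.

```python
def likely_miss_iterations(a, b, n, line_elems):
    """Return the iterations i in [0, n) at which index a*i + b enters a
    cache line different from the one touched at iteration i - 1.

    Hypothetical model: only the most recently touched line is assumed to be
    resident, so every change of line is counted as a likely miss.
    """
    misses = []
    prev_line = None
    for i in range(n):
        line = (a * i + b) // line_elems  # cache line holding element a*i + b
        if line != prev_line:
            misses.append(i)
            prev_line = line
    return misses


if __name__ == "__main__":
    # Unit stride, 4 elements per line: a likely miss every 4th iteration.
    print(likely_miss_iterations(a=1, b=0, n=12, line_elems=4))  # [0, 4, 8]
    # Stride 4 with the same line size: every access lands in a new line.
    print(likely_miss_iterations(a=4, b=0, n=4, line_elems=4))   # [0, 1, 2, 3]
```

Under these assumptions, a compiler could restrict prefetches to the reported iterations instead of issuing one per access, which is one way to read the goal of minimizing prefetching.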

