Accelerating Blocked Matrix-Matrix Multiplication using a Software-Managed Memory Hierarchy with DMA



Roland E. Wunderlich, Markus Püschel, James C. Hoe
Department of Electrical and Computer Engineering
Carnegie Mellon University, Pittsburgh, PA 15213-3890
{rolandw, pueschel, jhoe}@ece.cmu.edu

Abstract

The optimization of matrix-matrix multiplication (MMM) performance has been well studied on general-purpose desktop and server processors. Classic solutions exploit common microarchitectural features, including superscalar execution and the cache and TLB hierarchy, to achieve near-peak performance. Typical digital signal processors (DSPs) do not have these features and instead use in-order execution, configurable memory hierarchies, and programmable I/O interfaces. We investigate the methods needed to achieve high-performance MMM on the Texas Instruments C6713 floating-point DSP. This processor has two components that can be used to accelerate MMM: a software-managed memory hierarchy and a direct memory access (DMA) engine that can perform block copies from main memory into the memory hierarchy. Our MMM implementation overlaps computation with DMA block transfers. For matrices larger than the data caches, we observed a 46% performance increase over a blocked MMM implementation and a 190% increase over the Texas Instruments DSP library.

Introduction

The availability of a high-performance MMM implementation is of critical importance for a large range of numerical computation problems. MMM is both a common standalone function and a ubiquitous kernel of more complex computations. Texas Instruments (TI) provides an optimized single-precision floating-point MMM implementation for their C67x processors, the DSPF_sp_mat_mul function. This assembly-coded function is optimal for matrices that can fit within the L1 data cache. Its innermost loop attains 100% of the peak performance of the C6713, with minimal overhead for the outer loop control code. Unfortunately, TI's triple-loop MMM implementation has poor data
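
The idea of overlapping computation with block transfers, as described in the abstract, can be sketched in portable C. This is a minimal illustration under stated assumptions and not the authors' implementation: the names `mmm_blocked`, `fetch_tile`, `tile_mac`, and the block size `BLK` are hypothetical, and a synchronous `memcpy` stands in for the DMA engine, so the code shows only the double-buffering structure, not any real overlap or C6713-specific API.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define BLK 4  /* hypothetical tile edge; a real value would be tuned to on-chip memory */

/* Stand-in for a DMA block transfer: copy one BLK x BLK tile of a row-major
   n x n matrix into a contiguous buffer. On real hardware this would be an
   asynchronous DMA request rather than memcpy. */
static void fetch_tile(float *dst, const float *src, size_t n,
                       size_t row, size_t col)
{
    for (size_t r = 0; r < BLK; r++)
        memcpy(dst + r * BLK, src + (row + r) * n + col, BLK * sizeof(float));
}

/* c += a * b on contiguous BLK x BLK tiles. */
static void tile_mac(float *c, const float *a, const float *b)
{
    for (size_t i = 0; i < BLK; i++)
        for (size_t k = 0; k < BLK; k++)
            for (size_t j = 0; j < BLK; j++)
                c[i * BLK + j] += a[i * BLK + k] * b[k * BLK + j];
}

/* C = A * B for row-major n x n matrices; n must be a multiple of BLK. */
void mmm_blocked(float *C, const float *A, const float *B, size_t n)
{
    float a_buf[2][BLK * BLK], b_buf[2][BLK * BLK], c_tile[BLK * BLK];

    assert(n % BLK == 0);
    for (size_t i = 0; i < n; i += BLK) {
        for (size_t j = 0; j < n; j += BLK) {
            int cur = 0;
            memset(c_tile, 0, sizeof c_tile);
            /* Prime the first pair of tiles. */
            fetch_tile(a_buf[cur], A, n, i, 0);
            fetch_tile(b_buf[cur], B, n, 0, j);
            for (size_t k = 0; k < n; k += BLK) {
                /* Request the next tiles into the idle buffers first; with a
                   real DMA engine this transfer would proceed while
                   tile_mac runs on the current buffers. */
                if (k + BLK < n) {
                    fetch_tile(a_buf[1 - cur], A, n, i, k + BLK);
                    fetch_tile(b_buf[1 - cur], B, n, k + BLK, j);
                }
                tile_mac(c_tile, a_buf[cur], b_buf[cur]);
                cur = 1 - cur;  /* swap the roles of the two buffers */
            }
            /* Write the finished tile back to the result matrix. */
            for (size_t r = 0; r < BLK; r++)
                memcpy(C + (i + r) * n + j, c_tile + r * BLK,
                       BLK * sizeof(float));
        }
    }
}
```

The two buffers per operand are what make the overlap possible: while one pair is being consumed by the multiply-accumulate kernel, the other pair is free to receive the next transfer.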


