Accelerating Blocked Matrix-Matrix Multiplication using a Software-Managed Memory Hierarchy with DMA

Roland E. Wunderlich   Markus Püschel   James C. Hoe
Department of Electrical and Computer Engineering
Carnegie Mellon University, Pittsburgh, PA 15213-3890
{rolandw, pueschel, jhoe}@ece.cmu.edu

Abstract

The optimization of matrix-matrix multiplication (MMM) performance has been well studied on general-purpose desktop and server processors. Classic solutions exploit common microarchitectural features, including superscalar execution and the cache and TLB hierarchy, to achieve near-peak performance. Typical digital signal processors (DSPs) do not have these features; instead they use in-order execution, configurable memory hierarchies, and programmable I/O interfaces.

We investigate the methods needed to achieve high-performance MMM on the Texas Instruments C6713 floating-point DSP. This processor has two components that can be used to accelerate MMM: a software-managed memory hierarchy, and a direct memory access (DMA) engine that can perform block copies from main memory into the memory hierarchy. Our MMM implementation overlaps computation with DMA block transfers. For matrices larger than the data caches, we observed a 46% performance increase over a blocked MMM implementation, and a 190% increase over the Texas Instruments DSP library.

Introduction

The availability of a high-performance MMM implementation is of critical importance for a large range of numerical computation problems. MMM is both a common stand-alone function and a ubiquitous kernel of more complex computations.

Texas Instruments (TI) provides an optimized single-precision floating-point MMM implementation for their C67x processors, the DSPF_sp_mat_mul() function. This assembly-coded function is optimal for matrices that can fit within the L1 data cache. Its innermost loop attains 100% of the peak performance of the C6713 with minimal overhead for the outer loop control code. Unfortunately, TI's triple-loop MMM implementation has poor data locality, which leads to frequent cache misses for larger matrices.

MMM temporal data locality is improved by partitioning the computation so that it operates on cache-resident sub-matrices (called blocks). The computation performed on these cache-resident blocks is called the mini-MMM. This blocked MMM algorithm is used by all fast MMM implementations for general-purpose processors, including the Goto BLAS library [4], the Intel Math Kernel Library [1], and the ATLAS library generator [5].

While blocked MMM also improves performance on DSPs, further performance gains can be achieved by using their software-managed memory hierarchies. Specifically, the DMA engine can efficiently copy sub-matrices directly from main memory into a higher level of the memory hierarchy. This block copying can also be overlapped with the mini-MMM computation. This approach has been applied before for MMM on the PlayStation 2's vector units [2].

We develop an MMM implementation for a typical DSP, the TI C6713, that is faster than the vendor implementation by using: (1) blocked computation, and (2) explicit block copies from main memory to scratch pad memory via DMA, performed in parallel with the mini-MMM computation.
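To make the blocked decomposition concrete, the following is a minimal C sketch of a blocked MMM built around a mini-MMM kernel. It is not the paper's code: the paper calls TI's DSPF_sp_mat_mul() for the mini-MMM and copies blocks explicitly, whereas this sketch simply indexes sub-matrices in place; the function names, the row-major layout, and the assumption that the matrix dimension is a multiple of the block size are illustrative choices.

```c
#include <stddef.h>

#define NB 28  /* block size; the paper later selects 28x28 for the C6713's 4 KB L1 data cache */

/* Mini-MMM: C += A * B on one NB x NB sub-matrix of row-major matrices with
   leading dimension n. In the paper this role is played by TI's
   DSPF_sp_mat_mul(); this plain-C loop is only a stand-in. */
static void mini_mmm(const float *A, const float *B, float *C, size_t n)
{
    for (size_t i = 0; i < NB; i++)
        for (size_t k = 0; k < NB; k++) {
            float a = A[i * n + k];
            for (size_t j = 0; j < NB; j++)
                C[i * n + j] += a * B[k * n + j];
        }
}

/* Blocked MMM: C = A * B for n x n matrices with n a multiple of NB.
   C must be zero-initialized; each iteration of the three block loops
   updates one NB x NB block of C. */
void blocked_mmm(const float *A, const float *B, float *C, size_t n)
{
    for (size_t i = 0; i < n; i += NB)
        for (size_t j = 0; j < n; j += NB)
            for (size_t k = 0; k < n; k += NB)
                mini_mmm(&A[i * n + k], &B[k * n + j], &C[i * n + j], n);
}
```

The key property of this structure is that the three innermost loops touch only NB×NB blocks, so their working set can stay resident in the data cache (or, later, in the scratch pad).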
DSP microarchitecture features

The TMS320C6713 DSP [3] is the latest implementation in the C67x family of high-end floating-point (FP) DSP chips from Texas Instruments and is available at speeds up to 300 MHz. The C6713 has two single-precision FP adders, two FP comparators, and two FP multipliers, which give it a peak performance of 1800 MFLOPS (six FP units at 300 MHz). The peak theoretical performance for MMM is 1200 MFLOPS because only the adders and multipliers (four units) are used.

The C6713 has two levels of cache and a 192 KB scratch pad SRAM, referred to by TI as "L2 mapped RAM." Data in main memory are cached normally by the L1 caches and the L2 cache. Data explicitly allocated in or copied into the scratch pad are also cached in the L1 caches when used. The 4 KB L1 data cache has a 4-cycle latency, and the 64 KB unified L2 cache has an 8-cycle latency and can also be used as scratch pad memory.

The C6713 DMA engine allows for background memory block transfers. It can also perform transfers to and from the scratch pad, which allows data to be loaded directly into the memory hierarchy. Figure 1 illustrates the major components of the C6713 DSP.

Figure 1: The TI C6713 DSP microarchitecture (8-wide VLIW core, 4 KB L1 instruction cache, 4 KB L1 data cache, 64 KB L2 cache, 192 KB scratch pad, DMA engine, memory interface, buffered serial ports, general-purpose I/O).

Measurement methodology

All performance results are presented as MFLOPS, measured on the TI Code Composer Studio 3.0 cycle-accurate simulator. Before each MMM implementation is measured, we touch the values of the input matrices to recreate the cache state as if the matrices had been produced by a preceding function. The L2 cache was configured at its largest possible size of 64 KB. The TI TMS320C6x C/C++ Compiler 5.0 was used to compile our C code implementation. Currently, our MMM implementations only support matrices whose dimensions are multiples of the block size. Preliminary implementations without this limitation see only minor losses of performance.

Blocked MMM

The first optimization we perform is to implement a blocked MMM to address the main shortcoming of the TI MMM for matrices larger than the data caches.

We use two approaches to determine the best block size for our blocked MMM: a model-driven optimization by Yotov et al. [6], and a search similar to ATLAS [5]. The model-driven approach predicts that a block size of 28×28 is best for the 4 KB L1 data cache of the C6713. The results of the empirical search match the model's prediction, as plotted in Figure 2.

Our blocked MMM function uses the TI MMM function for the mini-MMM computation, and a block size of 28×28. The performance of this implementation is plotted in Figure 3 for matrices up to 336×336. The performance of the TI MMM degrades as the input matrices no longer fit in the data caches, while the blocked MMM suffers far less due to its superior data locality. The blocked MMM is slower than the TI MMM for small matrices because it performs memory block copies.

Blocked MMM with scratch pad and DMA

The block copy operations that are required by a blocked MMM implementation can be performed faster by using the DMA engine of the C6713 instead of explicitly copying the blocks with the processor.
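The following is a minimal sketch of how such DMA block copies can be double-buffered against the mini-MMM computation. It is not the paper's or TI's code: dma_copy_async() and dma_wait() are illustrative placeholders for programming the C6713's DMA controller (implemented here with memcpy so the sketch is self-contained), and the packed-block layout and buffer placement are assumptions.

```c
#include <stddef.h>
#include <string.h>

#define NB 28
#define BLOCK_BYTES (NB * NB * sizeof(float))

/* Illustrative stand-ins for the C6713 DMA engine. On the real DSP these
   would start and await a background block transfer; here they copy
   synchronously with memcpy just so the sketch compiles and runs. */
static void dma_copy_async(void *dst, const void *src, size_t bytes) { memcpy(dst, src, bytes); }
static void dma_wait(void) { /* real code would wait for the DMA-complete event */ }

/* Mini-MMM on contiguous (packed) NB x NB blocks: C += A * B. */
static void mini_mmm_packed(const float *A, const float *B, float *C)
{
    for (int i = 0; i < NB; i++)
        for (int k = 0; k < NB; k++) {
            float a = A[i * NB + k];
            for (int j = 0; j < NB; j++)
                C[i * NB + j] += a * B[k * NB + j];
        }
}

/* Two block buffers enable double buffering: while the mini-MMM consumes one
   buffer, the DMA engine fills the other. On the C6713 these arrays would be
   placed in the 192 KB scratch pad (e.g., via the linker command file). */
static float a_buf[2][NB * NB];

/* Accumulate one block row of A times one block column of B into the packed
   NB x NB block C. a_blocks[k] and b_blocks[k] point to packed blocks in
   main memory; nblocks is the number of blocks along the shared dimension. */
void block_dot_dma(const float *const a_blocks[], const float *const b_blocks[],
                   float *C, size_t nblocks)
{
    int cur = 0;
    dma_copy_async(a_buf[cur], a_blocks[0], BLOCK_BYTES);   /* prefetch the first A block */
    for (size_t k = 0; k < nblocks; k++) {
        dma_wait();                                          /* current A block is now in the scratch pad */
        if (k + 1 < nblocks)                                 /* start the next transfer in the background */
            dma_copy_async(a_buf[1 - cur], a_blocks[k + 1], BLOCK_BYTES);
        mini_mmm_packed(a_buf[cur], b_blocks[k], C);         /* compute while the next block streams in */
        cur = 1 - cur;
    }
}
```

With a real background DMA, the dma_copy_async() call for block k+1 returns immediately, so the transfer of the next A block proceeds while mini_mmm_packed() works on the current one; dma_wait() then only stalls if the transfer takes longer than the computation.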

