Berkeley COMPSCI 170 - Motivation for Improving Matrix Multiplication

CS170 – Spring 2010 – Lecture 4 – Jan 29

1  Motivation for Improving Matrix Multiplication

Now we will consider the best way to implement the usual algorithm for matrix multiplication, the one that takes 2n³ arithmetic operations for n-by-n matrices. To see that there is something interesting to discuss, consider the plot below, which shows the speed of n-by-n matrix multiplication on a Sun Ultra-1/170 (an old machine, but the plots would be similar on modern machines). The horizontal axis is n, and the vertical axis is speed measured in Mflops = millions of floating point arithmetic operations per second; the peak on this machine is 330 Mflops, or 2 per cycle. The top curve, labelled Sun Perf Lib, is a version hand-optimized by the manufacturer especially for this machine. The bottom curve is the naive algorithm (3 nested loops) in C; it is over 10 times slower for large n. Thus, while we might hope that the compiler would be smart enough to take our naive algorithm and make it run fast on any particular machine, we see that compilers have a ways to go.

Each manufacturer typically supplies a library like Sun Perf Lib, carefully hand-optimized for its machines. It includes a large number of routines, not just matrix multiplication, although matrix multiplication is one of the most important, because it is used as part of benchmarks to compare the speed of computers.

Finally, the middle curve, labeled PHIPAC, is the result of a research project here at Berkeley that aimed to automatically generate code as good as hand-generated code for any architecture, and so automate the otherwise tedious hand-optimization process. See www.icsi.berkeley.edu/~bilmes/phipac for information about PHIPAC, and math-atlas.sourceforge.net for a later project called ATLAS, inspired by PHIPAC, that produces widely used code. For example, Matlab has used ATLAS.

The goal of this part of the lecture notes is to explain the basic techniques used both by manufacturers in their hand optimizations and by PHIPAC and ATLAS to obtain these speedups. We remark that there is also active research in the compiler community on automatically applying optimizations of the sort we discuss here, as well as active research at Berkeley in applying these ideas to automatically tune the performance of other important functions.
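For reference, here is a minimal sketch of the naive 3-nested-loop algorithm, essentially the bottom curve in the plot. The function name naive_matmul and the choice of a row-major 1-D array layout are assumptions of this sketch (not from the notes); it computes C = C + A*B, so C is assumed to be initialized, e.g. to zero. The inner loop performs one multiplication and one addition, giving the 2n³ arithmetic operations counted above.

    /* Naive n-by-n matrix multiplication: C = C + A*B.
     * Matrices are stored in row-major order as 1-D arrays of length n*n,
     * so entry (i,j) lives at index i*n + j.  The inner loop does one
     * multiply and one add, so the total arithmetic count is 2*n^3. */
    void naive_matmul(int n, const double *A, const double *B, double *C)
    {
        for (int i = 0; i < n; i++)
            for (int j = 0; j < n; j++)
                for (int k = 0; k < n; k++)
                    C[i*n + j] += A[i*n + k] * B[k*n + j];
    }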
2  What operations should we count?

So far we have counted arithmetic operations – additions and multiplications – in our complexity analysis. This is because our underlying assumption has been that these are the most expensive and most frequent operations in the algorithm, so that if Algorithm 1 did fewer additions and multiplications than Algorithm 2, it was automatically faster. Let us examine this assumption, which is not entirely true.

If you look at the assembly language version of matrix multiplication, which is what the machine executes, then there are many kinds of instructions besides additions and multiplications, some of which are executed just as often as the additions and multiplications. Which ones are most expensive? It turns out that on most modern processors, the instructions that take two floating point numbers (like 3.1416) from two registers, add or multiply them, and put the result back in a register, are typically the fastest the machine performs. Addition is usually 1 cycle, the least any instruction can take, and multiplication is usually as fast or perhaps a little slower.

It turns out that loads and stores, i.e. moving data from memory to registers (which is where all the arithmetic and other "useful work" takes place), are the most expensive, sometimes costing hundreds of cycles or more. The reason is that any computer memory, from the cheapest PC to the biggest supercomputer, is organized as a memory hierarchy. The idea is that it is only possible to do arithmetic on data in a few expensive memory cells, usually called registers, that are right next to the arithmetic unit. They are too expensive and therefore too few to store all your data, so most data must be stored in larger, slower, cheaper memory. Whenever you want to do arithmetic on data not in registers, you have to go get the data from this slower memory, which takes more time.

If all the data were either in registers or in main memory, then there would be an enormous penalty for accessing data not in registers, since main memory is hundreds of times slower. So people almost always build intermediate levels in the memory hierarchy, called caches, that are intermediate in cost, size and speed between registers and main memory. Thus accessing data not in registers but in cache is not as slow as accessing main memory.

Here is an everyday analogy. Suppose you are sitting at your desk, and you need a pencil to write something. The first place you look is the top of your desk, and if a pencil is there, you pick it up and use it with a minimum of overhead. This corresponds to using data in registers. But if the pencil is not on the desk, you have to open the desk drawer and look there. Since you probably keep more pencils in your drawer than on your desk top, you will likely find one, and it will only take a little more time to open the drawer to get it than it would if the pencil were on the desk. This corresponds to using the cache.

Now suppose your desk drawer has no pencils. You have to get up and go to the supply cabinet to get pencils, which takes yet more time. And you should not just get one pencil from the supply cabinet, you should get a small box of pencils, so you'll have them the next time. This corresponds to going to main memory, and fetching a whole cache line of consecutive words of memory, in the expectation that if you need one, you'll need more of them.

Next, if the supply cabinet is empty, you have to go to the store to buy pencils. This takes significantly longer, and you will probably buy a bigger box of pencils, so you don't have to go to the store very often. This corresponds to taking a page fault, and getting a whole page (thousands of bytes) of your data off of disk.

In the worst case, if the store is sold out, they have to go to a wholesale distributor to get a whole carton of pencils, which corresponds to getting your data off of tape, the bottom of the memory hierarchy.

Thus the principle behind memory hierarchies is a perfectly normal economic principle we encounter in everyday life. A typical computer, including a PC, may have two levels of cache (on-chip and off-chip), as well as main memory, internal disk, and external storage such as tape.
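To make the point about cache lines concrete, here is a small sketch in C (not from the original notes) that performs exactly the same arithmetic two ways: summing an n-by-n matrix stored in row-major order either row by row or column by column. Both loops do n² additions, but the row-by-row version reads consecutive addresses, so each cache line fetched from memory supplies several useful values, while the column-by-column version jumps n doubles between accesses and, for large n, can miss the cache on nearly every load. On real machines the two versions can differ in speed by a large factor even though their arithmetic-operation counts are identical, which is exactly why counting only additions and multiplications is not the whole story. The function names are hypothetical.

    /* Both functions compute the same sum of all n*n entries of A,
     * stored in row-major order (entry (i,j) at index i*n + j).
     * They do identical arithmetic but very different memory traffic. */

    double sum_by_rows(int n, const double *A)
    {
        double s = 0.0;
        for (int i = 0; i < n; i++)
            for (int j = 0; j < n; j++)
                s += A[i*n + j];   /* consecutive addresses: good cache-line reuse */
        return s;
    }

    double sum_by_cols(int n, const double *A)
    {
        double s = 0.0;
        for (int j = 0; j < n; j++)
            for (int i = 0; i < n; i++)
                s += A[i*n + j];   /* stride of n doubles: little cache-line reuse */
        return s;
    }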

