CS6810 School of Computing University of Utah

Big Iron
Today's topics:
• Vector Processors and Supercomputers
  » VPs came first; they now exist as GPGPUs (figure source: text Appendix F)
• Supercomputers
  » lots of microprocessors with a fancy interconnect; a look at the top500
• Datacenter "cloud" computing
  » lots of blades w/ a fancy interconnect AND fancy storage systems (this is not DRAM!)

Review
• Roadblocks to parallelism: wide issue & deep pipelines
  » dynamic OOO issue
    • huge # of instructions on the fly
    • quadratic circuit complexity to keep track of everything
      – forwarding, ROB size, # of registers
    • power density kills you
    • performance still limited by the ILP in the program
  » VLIW
    • compiler does most of the scheduling work
    • still a huge # of instructions on the fly
    • power density is still a problem
      – this will continue to be a common theme
    • performance also limited by ILP
• Enhancing parallelism: multi- (threads, cores, sockets)
  » the main game today
  » might be easier to build than to program

1st Supercomputers
• Vector machines: often attributed to Seymour Cray, but he says
  "I'm certainly not inventing vector processors. There are three kinds that I know of existing today. They are represented by the Illiac-IV, the (CDC) Star processor, and the TI (ASC) processor. Those three were all pioneering processors. . . . One of the problems of being a pioneer is you always make mistakes and I never, never want to be a pioneer. It's always best to come second when you can look at the mistakes the pioneers made."
  talk at LLNL, 1976, on the introduction of the CRAY-1
• Alternative programming model: two data types
  » scalar and vector
    • not wildly dissimilar to map-reduce (a Google reinvention)
      – map sub-problems to some set of resources
      – reduce/combine sub-problem results into the final answer
  » APL (Iverson's 1969 book)
    • +/(1, 2, 3) = 6

Replace Loops w/ Vector Instructions
• Vector-vector add
  conventional:
  » 2 pointers to the heads of the two vectors
  » offset with the loop variable
    • A[i] + B[i] for all i
  vector model:
  » Vadd A, B (1 instruction does a lot of work)
  » no loop or instruction-decode overhead
  » hazard checking only required between vector instructions
• Issues
  » each vector has to be contiguous
  » the machine has a native vector length
    • 64 was common
    • pad if the actual vector length is not a multiple of 64
  » scientific programmers embraced the vector model
    • but how do you write a web browser?

2001 Vector Odyssey
• Vector machines out of fashion
• 2002: Japan's Earth Simulator announced
  » a virtual planet
    • predict the impact of environmental change on world climate
  » leads the top500 list
    • widespread US panic @ the government level
      – strategic leadership lost?
      – oh woe is us, or U.S.
    • spurs supercomputer development
      – including new vector machines from Cray
• Now
  » wide SIMD alive and well in GPGPUs
  » short SIMD alive and well in CPUs (SIMD = short vector)
    • same issues apply

Basic Vector Architecture
• 2 parts
  scalar unit
  » similar to a normal CPU
    • OOO: NEC SX/5
    • VLIW: Fujitsu VPP5000
  vector unit
  » multiple FUs (both int & float)
    • deeply pipelined for high clock frequencies
    • particularly true for the FPUs
      – the primary focus for the scientific-computing folks
• 2 basic architecture types
  memory-memory vector processors
  » the early CDC machines
  vector-register processors (vector RISC)
  » everything since about 1980
    • CRAY 1, 2, X-MP, Y-MP, C90, T90, SV1, X1
    • NEC SX/2-SX/8, Fujitsu VP200-VPP5000, Hitachi S820 and S8300
    • Convex C-1 through C-4

Top Level Vector-Register VMIPS
• 64-element Vregs
• 2 read ports, 1 write port: is it enough?

Snippet of Real Machines

VMIPS ISA Snippet 1

VMIPS ISA Snippet 2

DAXPY: MIPS vs. VMIPS
• IC = 6 vs. 600

Performance
• Vector execution time
  f(vector length, structural hazards, data hazards)
  » initiation rate: # of operands consumed or produced per cycle
  » multi-lane architecture
    • each vector lane can carry n values per cycle
      – often 2 or more
    • # of vector lanes * lane width = initiation rate
  » also dependent on pipeline fill and spill
• Convoy (a made-up term)
  a set of independent vector instructions
  » similar to an EPIC VLIW bundle
• Chime
  the time it takes to execute 1 convoy
• Start-up time
  the time it takes to load the vector registers and fill the pipe
• All contribute to execution time

Vector Memory Systems
• Lots of bandwidth required to feed lots of XUs
  » very wide data bus
  » banked memory
    • each bank independently addressed (not interleaved)
    • multiple loads and stores issued per cycle
    • each bank serves a particular load or store
      – assuming no bank conflicts
      – the compiler tries hard to avoid conflicts
    • latency can be high for DRAM-based memory
      – but bandwidth can be quite good
      – early CRAY machines used SRAMs; too expensive today
  » addressing? where are the bank-select bits?

Vector Length Control
• Vec.reg.length != operand.vec.size (OVS)
  MVL = vec.reg.length
  enter VLR
  » specifies the operand vector size for a vector instruction
    • the actual vector size is often not known until run time
    • it may even change based on a call parameter
    • APL: rho(V) gives the length (or the structure, if a vector of vectors of ...)
  » controls the XUs and the Vector_Ld_Store unit
  » VLR value <= vector.reg.length
    • hence not known until run time
    • statically known then compiler can issue
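The vector-length control just described can be sketched as strip mining: an arbitrary operand vector size n is split into one odd-sized chunk of n mod MVL elements followed by full MVL-length chunks, with VLR reset before each chunk. This is a minimal Python sketch, not machine code; the function name and the choice of a vector-vector add as the example operation are illustrative assumptions, and MVL = 64 is the common register length mentioned earlier.

```python
MVL = 64  # maximum vector length: the native vector register size

def vadd_strip_mined(a, b):
    """Element-wise add of two equal-length vectors, processed in
    strips the way a vector machine would set VLR per chunk."""
    n = len(a)
    out = [0.0] * n
    low = 0
    vlr = n % MVL              # first strip handles the odd-sized remainder
    for _ in range(n // MVL + 1):
        # one "vector instruction" operating on vlr <= MVL elements
        out[low:low + vlr] = [a[i] + b[i] for i in range(low, low + vlr)]
        low += vlr
        vlr = MVL              # every remaining strip is full length
    return out
```

Note that when n is an exact multiple of MVL the first strip is zero-length, matching the classic strip-mine loop structure: the loop always runs n // MVL + 1 times.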
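For reference, the DAXPY kernel behind the MIPS vs. VMIPS instruction-count comparison (IC = 6 vs. 600) computes y = a*x + y over a 64-element vector. Below is a minimal Python sketch of the computation only, not of either ISA; the roughly-9-instructions-per-element figure for the scalar loop is an approximation consistent with the ~600 total, while VMIPS expresses the whole kernel in 6 instructions (a scalar load of a, two vector loads, a vector-scalar multiply, a vector-vector add, and a vector store).

```python
def daxpy(a, x, y):
    """DAXPY: y = a*x + y, the classic vector benchmark kernel.
    A scalar loop re-executes its body once per element; a vector
    machine issues a handful of vector instructions for all 64."""
    return [a * xi + yi for xi, yi in zip(x, y)]
```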