U of U CS 6810 - Static Scheduling, VLIW, EPIC & Speculation


Static Scheduling, VLIW, EPIC & Speculation
CS6810, School of Computing, University of Utah

Today's topics
- HW support for better compiler scheduling
- VLIW/EPIC idea
- IPF example

Beating the IPC = 1 Asymptote
- Superscalar
  - static: compiler scheduled; common in embedded space (MIPS, ARM)
  - dynamic: HW scheduling via scoreboard/Tomasulo approach
- VLIW
  - long instruction word contains a set of independent ops
  - key: compiler does schedule and hazard detection
  - adv: each slot goes to a particular type of XU (similar to reservation station role)
  - problem in high-performance practice: need to be conservative w.r.t. run-time activities (data-dependent branch, predicate)
  - fix: add some HW to make a less conservative but probable choice

VLIW History
- As usual, it's not new
- late '60s, early '70s: microcode (same idea, different granularity)
- '80s (textbook inaccurate on this)
  - Cydrome Cydra 5 (Rau, UIUC)
  - Multiflow (Fisher, Yale)
  - mini-super segment: Cray-like performance on a budget
  - killer micro ate them and the companies cratered
  - both Rau and Fisher go to HP to develop PA-WW
  - note: both were compiler types; Fisher inspired by dataflow geeks
- '90s
  - HP wants out of process business; Intel wants a server line
  - HP and Intel jointly develop and produce Itanium
  - 2001: first release of Merced (IA-64)
- Now
  - AMD shocks x86 land w/ 64-bit architecture at MPF 2000
  - poor IA-64 integer performance forces Intel to follow suit
  - IA-64 (IPF) still happening, now all Intel, but Itanic problems persist

Itanic
- Interesting quotes
  - John Dvorak (journalist): article "How the Itanium Killed the Computer Industry"
  - Ashlee Vance (tech columnist): underperformance, product delays turned the product into a joke in the semiconductor industry
  - Donald Knuth: supposed to be terrific until it turned out that the wished-for compilers were basically impossible to write
- However
  - illustrates some interesting architectural tactics
  - approach highly valued in the embedded space
- Tukwila: 4-core IPF
  - what rhymes with Godzilla and has enough cache to take out Tokyo?
  - 4 FB-DIMM channels
  - a move to dominate data center (now called Cloud) apps

Tukwila
- 30 MB L2 total
- 34 GB/s memory b/w
- 96 GB/s skt-to-skt b/w
- 2 threads/core
- target delivery: real soon now (original target was 2007, OUCH)
- QPI is Intel's response to AMD HyperTransport
- 2 FBDs missing on RHS

VLIW Achilles Heel
- Code compatibility
  - backwards compatibility is always a bit of a boat anchor
  - compiler schedules, but what if the machine changes and you don't have the source code? oops
- Solutions
  - Transmeta approach: dynamic object code translation, not wildly different than VM dynamic issue
  - IPF approach: don't be devout about VLIW; add some hardware support to allow some dynamic information

Itanium Example
- Registers
  - 32 64-bit (+ poison bit flag) GPRs
  - 128 82-bit FPRs (2 extra exponent bits over IEEE 754 80-bit standard)
  - 64 1-bit predication flags (single register)
  - 8 64-bit indirect branch registers
  - large set of special-purpose regs: I/O, system memory map, OS interface, rich set of performance counters
- Register stack
  - 128 architected registers: 0-31 are the GPRs, 32-127 are on the stack (cached or not)
  - special HW handles overflow and underflow
  - special instructions manipulate stack frame (save and restore)

IPF Instructions and Slots
- Instruction types
  - A: int ALU
  - I: shifts, bit tests, moves
  - M: memory access
  - F: floats
  - B: branches
  - L+X: extended immediates, stop, nops
- Instruction word slots
  - I: A or I types
  - M: A or M types
  - F: F types
  - B: B types
  - L+X: L+X types

IPF Groups and Bundles
- Instruction group: set of parallel instructions, arbitrary length, w/ explicit stop bit
- Instruction bundle: 128 bits
  - a subset of a group that gets executed per cycle
  - contains pre-decode tag (5 bits): indicates what the bundle contains
  - permutations 5 3 20, 5 bits
  - 3 41-bit instructions in the bundle
  - 2 bundles decoded and executed per cycle on Merced and McKinley
- key: compiler generates the group and organizes code into bundles; HW decides decode and issue rate

IPF Predication and Speculation
- Most instructions predicated on a predicate flag
  - 10 compare types; result goes to 2 predication flags (dual-rail encoding)
- Speculation
  - GPRs have a poison bit indicating data validity; Intel calls them NaTs (Not a Thing)
  - FPRs indicate poison by NaTVal: mantissa = 0, exponent outside legal range (hence the extra exponent bits; interesting choice)
  - advanced loads: loads promoted over stores
    - return value to ALAT table: value, dest reg, and mem addr
    - if a previous store (executing later) matches mem addr, ALAT invalidated and register poisoned
    - interesting wrinkle on the more common write buffer

IPF Pipe
- XUs: 2 I's, M's, F's; 3 B's; 1 L+X
- Issue: 2 bundles, 6 instructions max
- Pipe: 10 macro stages
  - IPG: prefetch 2 bundles
  - Fetch: decode
  - Rotate: rotate bundle to align the stops
  - EXP: hand instructions to the XUs (issue)
  - REN: rename registers
  - WLD: bypass and access reg file
  - REG: checks register scoreboard dependencies; dynamic stall if not cleared
  - EXE: execute
  - DET: detect exceptions and post NaTs
  - WRB: write back

Merced SpecInt Performance

SpecFP is Better
- Newer versions close integer gap, but x86 is still better
- New Tukwila focus on memory and socket-to-socket interconnect may prove to win
- BUT 8-core Nehalem waits in the wings, with QPI as well
- Time will tell

Philips Trimedia TM32
- Pure VLIW for embedded space
- no HW hazard detection: compiler does all; saves runtime energy and delay
- no virtual memory: loads/stores don't generate exceptions; no TLB and alignment issues
- main problem: code bloat
  - instruction memory is limited in embedded devices
  - contribution to leakage current is a potential problem
  - pure SW schedule: large number of explicit NOPs

TM32 Performance on EEMBC

Transmeta Crusoe
- Dynamic code morphing
  - Boris Babayan's idea (St. Petersburg IPM)
  - x86 to VLIW in front end
  - table based, so adds post-manufacture flexibility and some fault tolerance
- 5 slots, RISC style: IU compute IU FU multi-media U memory branch
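The IPF bundle layout from the Groups and Bundles slide (128 bits holding a 5-bit pre-decode tag plus three 41-bit instructions) can be sketched as plain bit-field packing. This is a minimal illustration, not an encoder for real IA-64 opcodes: the function names `pack_bundle` and `unpack_bundle` are hypothetical, and placing the tag in the low-order bits with slot 0 directly above it is an assumption of this sketch rather than something the slides state.

```python
# Sketch of a 128-bit IPF-style bundle: 5-bit tag + three 41-bit slots.
# Assumption (not from the slides): tag sits in the low-order bits,
# followed by slot 0, slot 1, slot 2.

TEMPLATE_BITS = 5
SLOT_BITS = 41
SLOTS_PER_BUNDLE = 3
BUNDLE_BITS = TEMPLATE_BITS + SLOTS_PER_BUNDLE * SLOT_BITS  # 5 + 3*41


def pack_bundle(template, slots):
    """Pack a tag and three raw 41-bit instruction words into one integer."""
    if not 0 <= template < (1 << TEMPLATE_BITS):
        raise ValueError("template must fit in 5 bits")
    if len(slots) != SLOTS_PER_BUNDLE:
        raise ValueError("a bundle holds exactly three instructions")
    word = template
    for index, slot in enumerate(slots):
        if not 0 <= slot < (1 << SLOT_BITS):
            raise ValueError("each instruction must fit in 41 bits")
        word |= slot << (TEMPLATE_BITS + index * SLOT_BITS)
    return word


def unpack_bundle(word):
    """Split a bundle integer back into (template, (slot0, slot1, slot2))."""
    template = word & ((1 << TEMPLATE_BITS) - 1)
    slot_mask = (1 << SLOT_BITS) - 1
    slots = tuple(
        (word >> (TEMPLATE_BITS + index * SLOT_BITS)) & slot_mask
        for index in range(SLOTS_PER_BUNDLE)
    )
    return template, slots
```

The arithmetic is the point: 5 + 3 × 41 accounts for every one of the 128 bits, so the tag is the only place the hardware can learn which slot goes to which type of XU.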
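The dual-rail predication idea from the Predication and Speculation slide (a compare writes two predication flags, and most instructions are guarded by a flag) can be illustrated with a toy model of if-conversion. Everything named here (`compare_eq`, `predicated_add`, `if_converted`, the flag names `p1` and `p2`) is hypothetical and chosen for illustration; it models only one of the compare types and is not IA-64 syntax.

```python
# Toy model of dual-rail predication: one compare sets a flag and its
# complement, and both arms of the branch are issued as guarded instructions.

def compare_eq(preds, p_true, p_false, a, b):
    """Write the compare result to one flag and its complement to another."""
    taken = (a == b)
    preds[p_true] = taken
    preds[p_false] = not taken


def predicated_add(preds, qp, regs, dst, src, imm):
    """Perform regs[dst] = regs[src] + imm only if the guarding flag is set."""
    if preds[qp]:
        regs[dst] = regs[src] + imm


def if_converted(a, b, y):
    """Branch-free form of: x = y + 1 if a == b else y - 1."""
    preds = {}
    regs = {"x": 0, "y": y}
    compare_eq(preds, "p1", "p2", a, b)
    # Both instructions are issued; exactly one of them takes effect.
    predicated_add(preds, "p1", regs, "x", "y", 1)
    predicated_add(preds, "p2", regs, "x", "y", -1)
    return regs["x"]
```

Because the two guarded adds do not depend on each other, a compiler is free to place them in the same instruction group, which is how predication removes the data-dependent branch that would otherwise force a conservative schedule.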
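The advanced-load mechanism from the Predication and Speculation slide (a load promoted over a store records its dest reg and mem addr in the ALAT, and a later matching store invalidates the entry and poisons the register) can be sketched as a small simulation. This follows the slide's description only. The class `Machine` and its methods are hypothetical, and the `check_load` recovery step is an assumption added so the example can show how a poisoned value might be repaired; actual IPF hardware has more detail than this.

```python
# Toy simulation of advanced loads with an ALAT, following the slide text.

class Machine:
    def __init__(self):
        self.memory = {}     # address -> value
        self.regs = {}       # register name -> value
        self.poison = set()  # registers whose contents must not be trusted
        self.alat = {}       # dest register -> address it was loaded from

    def advanced_load(self, reg, addr):
        """Load early (hoisted above stores) and record the load in the ALAT."""
        self.regs[reg] = self.memory.get(addr, 0)
        self.poison.discard(reg)
        self.alat[reg] = addr

    def store(self, addr, value):
        """Write memory; a matching ALAT entry is invalidated and its register poisoned."""
        self.memory[addr] = value
        for reg in [r for r, a in self.alat.items() if a == addr]:
            del self.alat[reg]
            self.poison.add(reg)

    def check_load(self, reg, addr):
        """Assumed recovery step: redo the load if the register was poisoned."""
        if reg in self.poison:
            self.regs[reg] = self.memory.get(addr, 0)
            self.poison.discard(reg)
        return self.regs[reg]
```

A short usage example: after `m.advanced_load("r5", 0x100)`, a store to a different address leaves `r5` usable, while a store to `0x100` poisons it until `m.check_load("r5", 0x100)` refreshes the value.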

