CS6810 School of Computing University of Utah

Static Scheduling, VLIW, EPIC & Speculation
Today’s topics:
  HW support for better compiler scheduling
  VLIW/EPIC idea & IPF example

Beating the IPC=1 Asymptote
• Superscalar
  static/compiler scheduled
    » common in embedded space (MIPS & ARM)
  dynamic scheduled
    » HW scheduling via scoreboard/Tomasulo approach
• VLIW
  long instruction word contains set of independent ops
  key – compiler does scheduling and hazard detection (εδ adv.)
    » each slot goes to a particular type of XU
      • similar to reservation station role
  problem in high performance practice
    » need to be conservative w.r.t. run time activities
      • data dependent branch predicate
    » fix – add some HW to make less conservative but probable choice

VLIW History
• As usual it’s not new
  late 60’s early 70’s – microcode
    » same idea, different granularity
  80’s (textbook inaccurate on this)
    » Cydrome Cydra-5 (Rau/UIUC) & Multiflow (Fisher/Yale)
      • mini-super segment (Cray like performance on a budget)
      • killer micro ate them and the companies cratered
      • both Rau and Fisher go to HP to develop PA-WW
        – note both were compiler types (Fisher inspired by dataflow geeks)
  90’s
    » HP wants out of process business, Intel wants a server line
    » HP & Intel jointly develop and produce Itanium
      • 2001 first release of “Merced” & IA-64
  Now
    » AMD shocks x86 land w/ 64-bit architecture at MPF 2000
    » poor IA-64 integer performance forces Intel to follow suit
    » IA-64 IPF still happening
      • now all Intel but “Itanic” problems persist

“Itanic”
• Interesting quotes:
  John Dvorak (journalist) article
    » “How the Itanium killed the Computer Industry”
  Ashlee Vance (tech columnist)
    » underperformance + product delays
      • “turned the product into a joke in the semiconductor industry”
  Donald Knuth
    » “supposed to be terrific – until it turned out that the wished-for compilers
      were basically impossible to write”
• However
  illustrates some interesting architectural tactics
    » approach highly valued in the embedded space
  Tukwila (4 core IPF)
    » “what rhymes with Godzilla and has enough cache to take out Tokyo?”
    » 4 FB-DIMM channels
      • a move to dominate data-center (now called “Cloud”) apps

Tukwila
  QPI is Intel’s response to AMD HyperTransport
  2 FBD’s missing on RHS
  2 threads/core
  target delivery “real soon now”; original target was 2007 (OUCH)
  34 GB/s memory b/w
  96 GB/s skt-skt b/w
  30MB L2$ total

VLIW Achilles’ Heel
• Code compatibility
  backwards compatibility
    » always a bit of a boat anchor
  compiler schedules, but what if the machine changes and you don’t have the source code? oops
• Solutions
  Transmeta approach
    » dynamic object code translation
      • not wildly different than VM + dynamic issue
  IPF approach
    » don’t be devout about VLIW
      • add some hardware support to allow some dynamic information

Itanium Example
• Registers
  32 GPR’s, 64-bit + poison bit flag
  128 82-bit FPR’s
    » 2 extra exponent bits over IEEE 754 80-bit standard
  64 1-bit predication flags (single register)
  8 64-bit indirect branch registers
  large set of special purpose regs
    » I/O, system, memory map, OS interface
    » rich set of performance counters
• Register stack
  128 architected registers
    » 0-31 are the GPR’s
    » 32-127 are on the stack (cached or not)
      • special HW handles overflow and underflow
    » special instructions manipulate stack frame save and restore

IPF Instructions and Slots
• Instruction types
  A = int ALU
  I = shifts, bit-tests, moves
  M = memory access
  F = floats
  B = branches
  L+X = extended immediates, stop, nops
• Instruction word slots
  I = A or I types
  M = A or M types
  F = F types
  B = B types
  L+X = L+X types

IPF Groups and
Bundles
• Instruction group
  set of parallel instructions
  arbitrary length w/ explicit stop bit
• Instruction bundle = 128 bits
  a subset of a group that gets executed/cycle
  contains pre-decode tag
    » 5 bits indicate what the bundle contains and in what order
      • permutations (5,3) = 20 → 5 bits
    » 3 41-bit instructions in the bundle
  2 bundles decoded and executed per cycle
    » on Merced and McKinley
    » key:
      • compiler generates the group and organizes code into bundles
      • HW decides decode and issue rate

IPF Predication and Speculation
• Most instructions predicated on a predicate flag
  10 compare types
    » result goes to 2 predication flags (dual rail encoding)
• Speculation
  GPR’s have a poison bit (indicating data validity)
    » Intel calls them NaT’s (Not a Thing)
  FPR’s indicate poison by NaTVal
    » mantissa=0, exponent outside legal range
      • hence the extra exponent bits
    » interesting choice
  advanced loads
    » loads promoted over stores
      • return value to ALAT table (value, dest. reg, and mem. addr)
    » if a previous store executing later matches mem addr
      • ALAT invalidated, and register poisoned
    » interesting wrinkle on more common write buffer

IPF Pipe
• XU’s
  2 each of I, M, F; 3 B’s; 1 L+X
• Issue
  2 bundles = 6 instructions max
• Pipe – 10 macro stages
  IPG – prefetch 2 bundles
  Fetch – decode
  Rotate – rotate bundle to align the stops
  EXP – hand instructions to the XU’s – issue
  REN – rename registers
  WLD – bypass and access reg. file
  REG – checks register scoreboard dependencies (dynamic stall if not cleared)
  EXE – execute
  DET – detect
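
As a rough illustration of the bundle format described under IPF Groups and Bundles, the sketch below packs a 5-bit pre-decode tag and three 41-bit instruction slots into one 128-bit word and unpacks it again. The slides give only the field widths. Placing the tag in the low-order bits followed by slots 0 to 2, and the names `pack_bundle` and `unpack_bundle`, are assumptions made for this example, and the slot values are arbitrary placeholders rather than real IPF encodings.

```python
# Field widths taken from the slides: 5-bit tag + 3 x 41-bit instructions = 128 bits.
TEMPLATE_BITS = 5
SLOT_BITS = 41
SLOTS_PER_BUNDLE = 3
BUNDLE_BITS = TEMPLATE_BITS + SLOTS_PER_BUNDLE * SLOT_BITS  # 5 + 123 = 128

# Issue width from the IPF Pipe slide: 2 bundles per cycle.
BUNDLES_PER_CYCLE = 2
MAX_ISSUE = BUNDLES_PER_CYCLE * SLOTS_PER_BUNDLE


def pack_bundle(template, slots):
    """Pack a tag and three slot values into one integer.

    Assumed layout (illustrative): tag in bits 0-4, slot i starting at
    bit 5 + 41*i.
    """
    if not 0 <= template < (1 << TEMPLATE_BITS):
        raise ValueError("template must fit in 5 bits")
    if len(slots) != SLOTS_PER_BUNDLE:
        raise ValueError("a bundle holds exactly 3 instructions")
    bundle = template
    for i, slot in enumerate(slots):
        if not 0 <= slot < (1 << SLOT_BITS):
            raise ValueError("each instruction must fit in 41 bits")
        bundle |= slot << (TEMPLATE_BITS + i * SLOT_BITS)
    return bundle


def unpack_bundle(bundle):
    """Split a 128-bit integer back into (template, [slot0, slot1, slot2])."""
    if not 0 <= bundle < (1 << BUNDLE_BITS):
        raise ValueError("bundle must fit in 128 bits")
    template = bundle & ((1 << TEMPLATE_BITS) - 1)
    slot_mask = (1 << SLOT_BITS) - 1
    slots = [
        (bundle >> (TEMPLATE_BITS + i * SLOT_BITS)) & slot_mask
        for i in range(SLOTS_PER_BUNDLE)
    ]
    return template, slots


if __name__ == "__main__":
    # Placeholder values only; these are not real instruction encodings.
    word = pack_bundle(0x10, [1, 2, 3])
    print(BUNDLE_BITS, MAX_ISSUE, unpack_bundle(word))
```

The constants also reproduce the issue limit from the IPF Pipe slide: two 3-slot bundles per cycle give at most 6 instructions.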