U of U CS 6810 - Lecture Notes - D2439665

Static Scheduling, VLIW, EPIC & Speculation
CS6810, School of Computing, University of Utah

Today's topics:
  HW support for better compiler scheduling
  VLIW/EPIC idea & IPF example

Beating the IPC=1 Asymptote
• Superscalar
  - static/compiler scheduled
    » common in embedded space: MIPS & ARM
  - dynamic scheduled
    » HW scheduling via scoreboard/Tomasulo approach
• VLIW
  - long instruction word contains a set of independent ops
  - key – compiler does the scheduling and hazard detection (εδ adv.)
    » each slot goes to a particular type of XU (execution unit)
      • similar to reservation station role
  - problem in high performance practice
    » compiler needs to be conservative w.r.t. run time activities
      • data dependent branch predicate
    » fix – add some HW to make a less conservative but probable choice

VLIW History
• As usual it's not new
  - late 60's, early 70's – microcode
    » same idea, different granularity
  - 80's (textbook inaccurate on this)
    » Cydrome Cydra-5 (Rau/UIUC) & Multiflow (Fisher/Yale)
      • mini-super segment (Cray-like performance on a budget)
      • killer micro ate them and the companies cratered
      • both Rau and Fisher go to HP to develop PA-WW
        – note both were compiler types (Fisher inspired by dataflow geeks)
  - 90's
    » HP wants out of process business, Intel wants a server line
    » HP & Intel jointly develop and produce Itanium
      • 2001 first release of "Merced" & IA-64
  - Now
    » AMD shocks x86 land w/ 64-bit architecture at MPF 2000
    » poor IA-64 integer performance forces Intel to follow suit
    » IA-64 → IPF still happening
      • now all Intel but "Itanic" problems persist

"Itanic"
• Interesting quotes:
  - John Dvorak (journalist) article
    » "How the Itanium killed the Computer Industry"
  - Ashlee Vance (tech columnist)
    » underperformance + product delays
      • "turned the product into a joke in the semiconductor industry"
  - Donald Knuth
    » "supposed to be terrific – until it turned out that the wished-for compilers were basically impossible to write"
• However
  - illustrates some interesting architectural tactics
    » approach highly valued in the embedded space
  - Tukwila (4 core IPF)
    » "what rhymes with Godzilla and has enough cache to take out Tokyo?"
    » 4 FB-DIMM channels
      • a move to dominate data-center apps, now called "Cloud" apps

Tukwila
• QPI is Intel's response to AMD HyperTransport
• 2 FBD's missing on RHS
• 2 threads/core
• target delivery "real soon now"
  - original target was 2007 (OUCH)
• 34 GB/s memory b/w
• 96 GB/s skt-skt b/w
• 30MB L2$ total

VLIW Achilles' Heel
• Code compatibility
  - backwards compatibility
    » always a bit of a boat anchor
  - compiler schedules, but what if the machine changes and you don't have the source code?
  - oops
• Solutions
  - Transmeta approach
    » dynamic object code translation
      • not wildly different than VM + dynamic issue
  - IPF approach
    » don't be devout about VLIW
      • add some hardware support to allow some dynamic information

Itanium Example
• Registers
  - 32 GPR's, each 64-bit + poison bit flag
  - 128 82-bit FPR's
    » 2 extra exponent bits over IEEE 754 80-bit standard
  - 64 1-bit predication flags (single register)
  - 8 64-bit indirect branch registers
  - large set of special purpose regs
    » I/O, system, memory map, OS interface
    » rich set of performance counters
• Register stack
  - 128 architected registers
    » 0-31 are the GPR's
    » 32-127 are on the stack (cached or not)
      • special HW handles overflow and underflow
    » special instructions manipulate stack frame save and restore

IPF Instructions and Slots
• Instruction types
  - A = int ALU
  - I = shifts, bit-tests, moves
  - M = memory access
  - F = floats
  - B = branches
  - L+X = extended immediates, stop, nops
• Instruction word slots
  - I = A or I types
  - M = A or M types
  - F = F types
  - B = B types
  - L+X = L+X types

IPF Groups and Bundles
• Instruction group
  - set of parallel instructions
  - arbitrary length w/ explicit stop bit
• Instruction bundle = 128 bits
  - a subset of a group that gets executed/cycle
  - contains pre-decode tag
    » 5 bits indicate what the bundle contains and in what order
      • roughly 20 legal slot-type orderings → 5 bits
    » 3 41-bit instructions in the bundle
  - 2 bundles decoded and executed per cycle
    » on Merced and McKinley
    » key:
      • compiler generates the group and organizes code into bundles
      • HW decides decode and issue rate

IPF Predication and Speculation
• Most instructions predicated on a predicate flag
  - 10 compare types
    » result goes to 2 predication flags (dual rail encoding)
• Speculation
  - GPR's have a poison bit (indicating data validity)
    » Intel calls them NaT's (Not a Thing)
  - FPR's indicate poison by NaTVal
    » mantissa=0, exponent outside legal range
      • hence the extra exponent bits
    » interesting choice
  - advanced loads
    » loads promoted over stores
      • return value to the ALAT (Advanced Load Address Table): value, dest. reg, and mem. addr
    » if a store that is earlier in program order but executes later matches the mem. addr
      • ALAT invalidated, and register poisoned
    » interesting wrinkle on more common write buffer

IPF Pipe
• XU's
  - 2 I's, 2 M's, 2 F's, 3 B's, 1 L+X
• Issue 2 bundles = 6 instructions max
• Pipe – 10 macro stages
  - IPG – prefetch 2 bundles
  - Fetch – decode
  - Rotate – rotate bundle to align the stops
  - EXP – hand instructions to the XU's – issue
  - REN – rename registers
  - WLD – bypass and access reg. file
  - REG – checks register scoreboard dependencies (dynamic stall if not cleared)
  - EXE – execute
  - DET – detect
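
The bundle arithmetic from "IPF Groups and Bundles" (a 5-bit pre-decode tag plus three 41-bit instructions = 128 bits) can be sketched in Python as a pack/unpack pair. This is only an illustrative sketch: the function names are hypothetical, and the placement of the tag in the low-order bits followed by slots 0, 1, 2 is an assumption that the slides themselves do not state.

```python
# Sketch of a 128-bit IPF bundle: 5-bit template tag + 3 x 41-bit slots.
# ASSUMPTION: tag occupies the low 5 bits, slots follow in order 0, 1, 2.
TEMPLATE_BITS = 5
SLOT_BITS = 41
SLOTS_PER_BUNDLE = 3
BUNDLE_BITS = TEMPLATE_BITS + SLOTS_PER_BUNDLE * SLOT_BITS  # 5 + 123 = 128

TEMPLATE_MASK = (1 << TEMPLATE_BITS) - 1
SLOT_MASK = (1 << SLOT_BITS) - 1


def pack_bundle(template, slots):
    """Pack a template tag and three raw slot encodings into one integer."""
    if not 0 <= template <= TEMPLATE_MASK:
        raise ValueError("template must fit in 5 bits")
    if len(slots) != SLOTS_PER_BUNDLE:
        raise ValueError("a bundle holds exactly 3 instruction slots")
    word = template
    for i, slot in enumerate(slots):
        if not 0 <= slot <= SLOT_MASK:
            raise ValueError("each slot must fit in 41 bits")
        word |= slot << (TEMPLATE_BITS + i * SLOT_BITS)
    return word


def unpack_bundle(word):
    """Split a 128-bit bundle integer back into (template, (s0, s1, s2))."""
    if not 0 <= word < (1 << BUNDLE_BITS):
        raise ValueError("bundle must fit in 128 bits")
    template = word & TEMPLATE_MASK
    slots = tuple(
        (word >> (TEMPLATE_BITS + i * SLOT_BITS)) & SLOT_MASK
        for i in range(SLOTS_PER_BUNDLE)
    )
    return template, slots
```

For example, `unpack_bundle(pack_bundle(0x10, (1, 2, 3)))` round-trips to the same tag and slot values. The slot contents here are arbitrary integers, not real instruction encodings.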

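The advanced-load behavior from "IPF Predication and Speculation" can be sketched as a toy model: a load hoisted above a store records (value, dest. reg, mem. addr) in the ALAT, and a later store to the same address invalidates the entry and poisons the register. The class and method names below are hypothetical, and the `check` recovery step is an assumption added for illustration (the slides only describe the invalidate-and-poison part).

```python
# Toy model of advanced loads and the ALAT, following the slide's description.
# ASSUMPTIONS: names are illustrative; check() is an added recovery step.
class AlatModel:
    def __init__(self, memory):
        self.memory = dict(memory)  # mem. addr -> value
        self.regs = {}              # dest. reg -> value
        self.poisoned = set()       # registers whose poison bit is set
        self.alat = {}              # dest. reg -> (mem. addr, value)

    def advanced_load(self, dest, addr):
        """Load promoted above earlier stores; record it in the ALAT."""
        value = self.memory.get(addr, 0)
        self.regs[dest] = value
        self.poisoned.discard(dest)
        self.alat[dest] = (addr, value)

    def store(self, addr, value):
        """Store executes; matching ALAT entries are invalidated."""
        self.memory[addr] = value
        for dest, (entry_addr, _) in list(self.alat.items()):
            if entry_addr == addr:
                del self.alat[dest]
                self.poisoned.add(dest)  # speculative value is now stale

    def check(self, dest, addr):
        """Return True if the speculation held; otherwise reload and clear."""
        if dest in self.alat and dest not in self.poisoned:
            return True
        self.regs[dest] = self.memory.get(addr, 0)
        self.poisoned.discard(dest)
        return False
```

A store to an unrelated address leaves the entry intact, so the hoisted load costs nothing extra; only an aliasing store forces the reload. That is the "less conservative but probable choice" trade the slides describe.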