CS 152 Computer Architecture and Engineering
Lecture 23: Putting it all together: Intel Nehalem

Krste Asanovic
Electrical Engineering and Computer Sciences
University of California, Berkeley
http://www.eecs.berkeley.edu/~krste
http://inst.cs.berkeley.edu/~cs152

5/5/2009, CS152 Spring '09

Intel Nehalem
- Review entire semester by looking at most recent microprocessor from Intel
- Nehalem is code name for microarchitecture at heart of Core i7 and Xeon 5500 series server chips
- First released at end of 2008

Nehalem System Example: Apple Mac Pro Desktop 2009
- Two Nehalem chips ("sockets"), each containing four processors ("cores") running at up to 2.93GHz
- Each chip has three DRAM channels attached, each 8 bytes wide at 1.066Gb/s (3 x 8.5GB/s)
- Can have up to two DIMMs on each channel (up to 4GB/DIMM)
- "QuickPath" point-to-point system interconnect between CPUs and I/O. Up to 25.6GB/s per link.
- PCI Express connections for graphics cards and other extension boards. Up to 8GB/s per slot.
- Disk drives attached with 3Gb/s serial ATA link
- Slower peripherals (Ethernet, USB, Firewire, WiFi, Bluetooth, Audio)

Building Blocks to support "Family" of processors

Nehalem Die Photo

Nehalem Memory Hierarchy Overview
- 4-8 cores
- Private L1/L2 per core: each CPU core has 32KB L1 I, 32KB L1 D, and 256KB L2
- 8MB shared L3
- L3 fully inclusive of higher levels (but L2 not inclusive of L1)
- DDR3 DRAM memory controllers: each DRAM channel is 64/72b wide at up to 1.33Gb/s
- Local memory access latency 60ns
- QuickPath system interconnect: each direction is 20b at 6.4Gb/s
- Other sockets' caches kept coherent using QuickPath messages

All Sockets can Access all Data
[Figure: access latencies labeled 60ns and 100ns]

[Figure: pipeline overview]
- In-order fetch
- In-order decode and register renaming
- Out-of-order execution
- Out-of-order completion
- In-order commit
- 2 SMT threads per core

Front-End Instruction Fetch & Decode
- Figure labels: x86 instruction bits; internal uOP bits
- uOP is Intel name for internal RISC-like instruction, into which x86 instructions are translated
- Loop Stream Detector (can run short loops out of the buffer)

Branch Prediction
- Part of instruction fetch unit
- Several different types of branch predictor (details not public)
- Two-level BTB
- Loop count predictor
  - How many backwards taken branches before loop exit
  - Also predictor for length of microcode loops, e.g., string move
- Return Stack Buffer
  - Holds subroutine targets
  - Renames the stack buffer so that it is repaired after mispredicted returns
  - Separate return stack buffer for each SMT thread

x86 Decoding
- Translate up to 4 x86 instructions into uOPs each cycle
- Only first x86 instruction in group can be complex (maps to 1-4 uOPs), rest must be simple (map to one uOP)
- Even more complex instructions jump into microcode engine, which spits out stream of uOPs

Split x86 in small uOPs, then fuse back into bigger units

Loop Stream Detectors save Power

Out-of-Order Execution Engine
- Renaming happens at uOP level (not original macro-x86 instructions)

SMT effects in OoO Execution Core
- Reorder buffer (remembers program order and exception status for in-order commit) has 128 entries divided statically and equally between both SMT threads
- Reservation stations (instructions waiting for operands for execution) have 36 entries competitively shared by threads

Nehalem Virtual Memory Details
- Implements 48-bit virtual address space, 40-bit physical address space
- Two-level TLB
- I-TLB (L1) has shared 128 entries, 4-way associative, for 4KB pages, plus 7 dedicated fully-associative entries per SMT thread for large page (2/4MB) entries
- D-TLB (L1) has 64 entries for 4KB pages and 32 entries for 2/4MB pages, both 4-way associative, dynamically shared between SMT threads
- Unified L2 TLB has 512 entries for 4KB pages only, also 4-way associative
- Additional support for system-level virtual machines

Core's Private Memory System
- Load queue: 48 entries
- Store queue: 32 entries
- Divided statically between SMT threads
- Up to 16 outstanding misses in flight per core

Core Area Breakdown

Quiz Results

Related Courses
- CS61C (strong prerequisite): Basic computer organization, first look at pipelines + caches
- CS152: Computer Architecture, first look at parallel architectures
- CS252: Graduate Computer Architecture, advanced topics
- CS258: Parallel Architectures, Languages, Systems
- CS150: Digital Logic Design
- CS250: Complex Digital Design (chip design)

Advice: Get involved in research
- E.g., RAD Lab (data center), Par Lab (parallel clients)
- Undergrad research experience is the most important part of application to top grad schools

End of CS152
- Final Quiz 6 on Thursday (lectures 19, 20, 21)
- HKN survey to follow
- Thanks for all your feedback. We'll keep trying to make CS152 better!
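
Worked check of the DRAM bandwidth figures quoted on the Nehalem system and memory hierarchy slides. This is a minimal sketch, not part of the original lecture. It assumes the quoted 1.066Gb/s and 1.33Gb/s rates are per data pin, so that an 8-byte-wide channel moves 8 bytes per transfer. The function name `channel_bandwidth_gb_s` and the variable names are hypothetical, chosen only for illustration.

```python
# Peak DRAM bandwidth arithmetic for the figures quoted on the slides.
# Assumption: the quoted Gb/s rate is per data pin, so a channel that is
# 8 bytes (64 data bits) wide delivers 8 bytes per transfer.

def channel_bandwidth_gb_s(width_bytes: float, rate_g_per_s: float) -> float:
    """Peak bandwidth of one channel in GB/s (bytes per transfer x Gtransfers/s)."""
    return width_bytes * rate_g_per_s

# Mac Pro example: three channels per chip, each 8 bytes wide at 1.066Gb/s.
per_channel = channel_bandwidth_gb_s(8, 1.066)
per_chip = 3 * per_channel

# Memory hierarchy slide: channels run at "up to 1.33Gb/s".
per_channel_max = channel_bandwidth_gb_s(8, 1.33)

print(f"per channel at 1.066Gb/s: {per_channel:.2f} GB/s")
print(f"per chip (3 channels):    {per_chip:.2f} GB/s")
print(f"per channel at 1.33Gb/s:  {per_channel_max:.2f} GB/s")
```

Under that assumption, the per-channel result rounds to the 8.5GB/s the slide quotes, and three channels give roughly 25.6GB/s of peak DRAM bandwidth per chip. These are peak rates; sustained bandwidth would be lower.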