CS 152 Computer Architecture and Engineering
Lecture 23: Putting It All Together: Intel Nehalem
April 27, 2010 (CS152, Spring 2010)
Krste Asanovic
Electrical Engineering and Computer Sciences, University of California, Berkeley
http://www.eecs.berkeley.edu/~krste
http://inst.cs.berkeley.edu/~cs152

Intel Nehalem
• Review the entire semester by looking at the most recent microprocessor from Intel
• Nehalem is the code name for the microarchitecture at the heart of the Core i7 and the Xeon 5500 series server chips
• First released at the end of 2008
• Figures and information from Intel and from David Kanter at Real World Technologies

Nehalem System Example: Apple Mac Pro Desktop 2009
• Two Nehalem chips ("sockets"), each containing four processors ("cores") running at up to 2.93 GHz
• Each chip has three DRAM channels attached, each 8 bytes wide at 1.066 Gb/s (3 × 8.5 GB/s); up to two DIMMs on each channel (up to 4 GB per DIMM)
• "QuickPath" point-to-point system interconnect between CPUs and I/O, up to 25.6 GB/s per link
• PCI Express connections for graphics cards and other expansion boards, up to 8 GB/s per slot
• Disk drives attached with 3 Gb/s Serial ATA links
• Slower peripherals (Ethernet, USB, FireWire, WiFi, Bluetooth, audio)

Building Blocks to Support a "Family" of Processors

Nehalem Die Photo

Pipeline organization:
• In-order fetch
• In-order decode and register renaming
• Out-of-order execution
• In-order commit
• Out-of-order completion
• 2 SMT threads per core

Front-End Instruction Fetch & Decode
• µOP is Intel's name for the internal RISC-like instructions into which x86 instructions are translated (x86 instruction bits in, internal µOP bits out)
• Loop Stream Detector (can run short loops out of the buffer)

Branch Prediction
• Part of the instruction fetch unit
• Several different types of branch predictor
  – Details not public
• Two-level BTB
• Loop count predictor
  – Predicts how many backwards-taken branches occur before loop exit
  – (Also a predictor for the length of microcode loops, e.g., string move)
• Return Stack Buffer
  – Holds subroutine return targets
  – The stack buffer is renamed so that it is repaired after mispredicted returns
  – Separate return stack buffer for each SMT thread

x86 Decoding
• Translates up to 4 x86 instructions into µOPs each cycle
• Only the first x86 instruction in a group can be complex (maps to 1-4 µOPs); the rest must be simple (each maps to one µOP)
• Even more complex instructions jump into the microcode engine, which emits a stream of µOPs

Split x86 Instructions into Small µOPs, Then Fuse Them Back into Bigger Units

Loop Stream Detectors Save Power

Out-of-Order Execution Engine
• Renaming happens at the µOP level, not on the original macro-x86 instructions

SMT Effects in the OoO Execution Core
• The reorder buffer (remembers program order and exception status for in-order commit) has 128 entries, divided statically and equally between the two SMT threads
• The reservation stations (instructions waiting for operands before execution) have 36 entries, competitively shared by the threads

Nehalem Memory Hierarchy Overview
• Each CPU core has a 32KB L1 D$, a 32KB L1 I$, and a private 256KB L2$
• An 8MB L3$ is shared by the 4-8 cores on a chip
• On-chip DDR3 DRAM memory controllers; each DRAM channel is 64/72 bits wide at up to 1.33 Gb/s
• QuickPath system interconnect to other sockets and I/O; each direction is [email protected]/s
• L1/L2 are private per core; the L3 is fully inclusive of the higher levels (but the L2 is not inclusive of the L1)
• Other sockets' caches are kept coherent using QuickPath messages
• Local memory access latency is ~60 ns

All Sockets Can Access All Data
• ~60 ns to local memory, ~100 ns to memory attached to the other socket

Core's Private Memory System
• Load queue: 48 entries
• Store queue: 32 entries
• Both are divided statically between the SMT threads
• Up to 16 outstanding misses in flight per core

Cache Hierarchy Latencies
• L1: 32KB, 8-way, latency 4 cycles
• L2: 256KB, 8-way, latency <12 cycles
• L3: 8MB, 16-way, latency 30-40 cycles
• DRAM: latency ~180-200 cycles
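These latency steps can be observed from software with a simple pointer-chasing microbenchmark. The sketch below is not from the lecture: it assumes a POSIX C environment (clock_gettime), picks working-set sizes that straddle the 32KB L1, 256KB L2, and 8MB L3, and reports nanoseconds per dependent load, which the reader can convert to cycles at the clock rate of the part (e.g., ~2.93 GHz for the Mac Pro chips above). The sizes, iteration count, and timer are illustrative choices, not Nehalem specifics.

/*
 * chase.c - minimal pointer-chasing latency probe (illustrative sketch).
 * Builds a randomly permuted cyclic list of pointers so that every load
 * depends on the previous one and the hardware prefetchers cannot guess
 * the next address, then measures the average time per load as the
 * working set grows past each cache level.
 */
#include <stdio.h>
#include <stdlib.h>
#include <time.h>

static double now_sec(void)
{
    struct timespec ts;
    clock_gettime(CLOCK_MONOTONIC, &ts);
    return ts.tv_sec + ts.tv_nsec * 1e-9;
}

int main(void)
{
    /* Working sets chosen to land in L1 (32KB), L2 (256KB), L3 (8MB), DRAM. */
    const size_t sizes[] = { 16 << 10, 128 << 10, 4 << 20, 64 << 20 };
    const long iters = 10 * 1000 * 1000;

    for (size_t s = 0; s < sizeof(sizes) / sizeof(sizes[0]); s++) {
        size_t n = sizes[s] / sizeof(void *);
        void **chase = malloc(n * sizeof(void *));
        size_t *perm = malloc(n * sizeof(size_t));
        if (!chase || !perm) return 1;

        /* Random cyclic permutation (simple Fisher-Yates shuffle). */
        for (size_t i = 0; i < n; i++) perm[i] = i;
        for (size_t i = n - 1; i > 0; i--) {
            size_t j = (size_t)rand() % (i + 1);
            size_t t = perm[i]; perm[i] = perm[j]; perm[j] = t;
        }
        /* Each slot holds the address of the next slot in the cycle. */
        for (size_t i = 0; i < n; i++)
            chase[perm[i]] = &chase[perm[(i + 1) % n]];
        free(perm);

        /* Chase the pointers: each load's address comes from the last load. */
        void **p = &chase[0];
        double t0 = now_sec();
        for (long i = 0; i < iters; i++)
            p = (void **)*p;
        double ns = (now_sec() - t0) / iters * 1e9;

        printf("%6zu KB working set: %6.1f ns per load (end=%p)\n",
               sizes[s] >> 10, ns, (void *)p);
        free(chase);
    }
    return 0;
}

Compile with optimization (e.g., cc -O2 chase.c); the dependent chain cannot be deleted because the final pointer is printed. The random permutation matters: with a sequential walk, the prefetchers hide most of the L2/L3/DRAM latency and the steps largely disappear.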
Nehalem Virtual Memory Details
• Implements a 48-bit virtual address space and a 40-bit physical address space
• Two-level TLB
• The L1 I-TLB has 128 shared, 4-way associative entries for 4KB pages, plus 7 dedicated fully-associative entries per SMT thread for large-page (2/4MB) entries
• The L1 D-TLB has 64 entries for 4KB pages and 32 entries for 2/4MB pages, both 4-way associative and dynamically shared between the SMT threads
• The unified L2 TLB has 512 entries for 4KB pages only, also 4-way associative
• Additional support for system-level virtual machines
(A back-of-the-envelope reach calculation for these TLB sizes appears after the last slide below.)

Virtualization Support
• TLB entries are tagged with a virtual machine and address space ID
  – No need to flush on context switches between VMs
• The hardware page table walker can walk the guest-physical to host-physical mapping tables
  – Fewer traps to the hypervisor

Core Area Breakdown

Related Courses
• CS 61C: basic computer organization, first look at pipelines and caches
• CS 152: computer architecture, first look at parallel architectures (CS 61C is a strong prerequisite)
• CS 150: digital logic design
• CS 250: complex digital design (chip design)
• CS 252: graduate computer architecture, advanced topics
• CS 258: parallel architectures, languages, systems

Advice: Get Involved in Research
For example:
• RAD Lab - data center
• Par Lab - parallel clients
• AMP Lab - algorithms, machines, people
• LoCAL - networking energy
• Undergraduate research experience is the most important part of an application to top grad schools, and it is fun too

End of CS152
• Final Quiz 5 on Thursday (lectures 19, 20, 21)
• HKN survey to follow
• Thanks for all your feedback - we'll keep trying to make CS152 [...]
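As referenced in the virtual-memory notes above, here is a back-of-the-envelope sketch, not from the lecture, that turns the quoted TLB entry counts into "reach", i.e. how much memory each structure can map before it starts missing. It assumes the 2MB large-page size (the slide quotes 2/4MB) and uses only the numbers listed on that slide.

/*
 * tlb_reach.c - back-of-the-envelope TLB reach for the Nehalem TLB
 * parameters quoted in the virtual-memory notes (illustrative sketch;
 * the 2MB large-page size is an assumption, the slide says 2/4MB).
 * Reach = entries x page size = memory mapped without a TLB miss.
 */
#include <stdio.h>

int main(void)
{
    const long KB = 1024L, MB = 1024L * KB;

    const struct { const char *name; long entries; long page; } tlb[] = {
        { "L1 I-TLB, 4KB pages (shared)",             128, 4 * KB },
        { "L1 I-TLB, large pages (7 per SMT thread)",   7, 2 * MB },
        { "L1 D-TLB, 4KB pages",                       64, 4 * KB },
        { "L1 D-TLB, large pages",                     32, 2 * MB },
        { "Unified L2 TLB, 4KB pages only",           512, 4 * KB },
    };

    for (size_t i = 0; i < sizeof(tlb) / sizeof(tlb[0]); i++)
        printf("%-44s %4ld x %4ld KB = %6ld KB reach\n",
               tlb[i].name, tlb[i].entries, tlb[i].page / KB,
               tlb[i].entries * tlb[i].page / KB);
    return 0;
}

The contrast is the point of the dedicated large-page entries: 7 entries at 2MB map 14MB, far more than the 512KB covered by all 128 small-page entries, which is one reason large pages help TLB-bound workloads.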

