Berkeley COMPSCI 152 - Lecture Notes - D2267323


School: University of California, Berkeley

Course: COMPSCI 152 - Computer Architecture and Engineering

Pages: 24

CS 152 Computer Architecture and Engineering
Lecture 23: Putting it all together: Intel Nehalem
Krste Asanovic
Electrical Engineering and Computer Sciences
University of California, Berkeley
http://www.eecs.berkeley.edu/~krste
http://inst.cs.berkeley.edu/~cs152
5/5/2009, CS152-Spring’09

Intel Nehalem
• Review entire semester by looking at most recent microprocessor from Intel
• Nehalem is code name for microarchitecture at heart of Core i7 and Xeon 5500 series server chips
• First released at end of 2008

Nehalem System Example: Apple Mac Pro Desktop 2009
• Two Nehalem chips (“sockets”), each containing four processors (“cores”) running at up to 2.93GHz
• Each chip has three DRAM channels attached, each 8 bytes wide at 1.066Gb/s (3*8.5GB/s)
• Can have up to two DIMMs on each channel (up to 4GB/DIMM)
• “QuickPath” point-to-point system interconnect between CPUs and I/O. Up to 25.6 GB/s per link.
• PCI Express connections for graphics cards and other extension boards. Up to 8 GB/s per slot.
• Disk drives attached with 3Gb/s serial ATA link
• Slower peripherals (Ethernet, USB, Firewire, WiFi, Bluetooth, Audio)

Building Blocks to support “Family” of processors
[Figure]

Nehalem Die Photo
[Figure]

Nehalem Memory Hierarchy Overview
[Figure labels: 4-8 CPU cores, each with 32KB L1 D$, 32KB L1 I$ and 256KB L2$; 8MB shared L3$; DDR3 DRAM memory controllers; QuickPath system interconnect]
• QuickPath: each direction is [email protected]/s
• Each DRAM channel is 64/72b wide at up to 1.33Gb/s
• Private L1/L2 per core
• L3 fully inclusive of higher levels (but L2 not inclusive of L1)
• Other sockets’ caches kept coherent using QuickPath messages
• Local memory access latency ~60ns

All Sockets can Access all Data
[Figure; latency labels: ~60ns, ~100ns]

[Figure: core pipeline; labels follow]
• In-Order Fetch
• In-Order Decode and Register Renaming
• Out-of-Order Execution
• Out-of-Order Completion
• In-Order Commit
• 2 SMT Threads per Core

Front-End Instruction Fetch & Decode
• µOP is Intel name for internal RISC-like instruction, into which x86 instructions are translated
[Figure labels: x86 instruction bits; internal µOP bits; Loop Stream Detector (can run short loops out of the buffer)]

Branch Prediction
• Part of instruction fetch unit
• Several different types of branch predictor
  – Details not public
• Two-level BTB
• Loop count predictor
  – How many backwards taken branches before loop exit
  – (Also predictor for length of microcode loops, e.g., string move)
• Return Stack Buffer
  – Holds subroutine targets
  – Renames the stack buffer so that it is repaired after mispredicted returns
  – Separate return stack buffer for each SMT thread

x86 Decoding
• Translate up to 4 x86 instructions into µOPs each cycle
• Only first x86 instruction in group can be complex (maps to 1-4 µOPs), rest must be simple (map to one µOP)
• Even more complex instructions jump into microcode engine, which spits out stream of µOPs

Split x86 in small µOPs, then fuse back into bigger units
[Figure]

Loop Stream Detectors save Power
[Figure]

Out-of-Order Execution Engine
• Renaming happens at µOP level (not original macro-x86 instructions)

SMT effects in OoO Execution Core
• Reorder buffer (remembers program order and exception status for in-order commit) has 128 entries divided statically and equally between both SMT threads
• Reservation stations (instructions waiting for operands for execution) have 36 entries competitively shared by threads

Nehalem Virtual Memory Details
• Implements 48-bit virtual address space, 40-bit physical address space
• Two-level TLB
• I-TLB (L1) has shared 128 entries 4-way associative for 4KB pages, plus 7 dedicated fully-associative entries per SMT thread for large page (2/4MB) entries
• D-TLB (L1) has 64 entries for 4KB pages and 32 entries for 2/4MB pages, both 4-way associative, dynamically shared between SMT threads
• Unified L2 TLB has 512 entries for 4KB pages only, also 4-way associative
• Additional support for system-level virtual machines

Core’s Private Memory System
[Figure]
• Load queue 48 entries
• Store queue 32 entries
• Divided statically between SMT threads
• Up to 16 outstanding misses in flight per core

Core Area Breakdown
[Figure]

Quiz Results

Related Courses
• CS61C: Basic computer organization, first look at pipelines + caches
• CS 152: Computer Architecture, first look at parallel architectures
• CS 258: Parallel Architectures, Languages, Systems
• CS 150: Digital Logic Design
• CS 250: Complex Digital Design (chip design)
• CS 252: Graduate Computer Architecture, Advanced Topics
[Figure label: Strong Prerequisite]

Advice: Get involved in research
E.g.,
• RAD Lab - data center
• Par Lab - parallel clients
• Undergrad research experience is the most important part of application to top grad schools.

End of CS152
• Final Quiz 6 on Thursday (lectures 19, 20, 21)
• HKN survey to follow.
• Thanks for all your feedback - we’ll keep trying to make CS152
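
The Mac Pro slide quotes 3*8.5GB/s of DRAM bandwidth per socket. The sketch below reproduces that arithmetic. It assumes the slide's "1.066Gb/s" is the per-pin transfer rate, so an 8-byte channel moves 8 bytes per transfer. The function name `peak_bandwidth_gb_s` is made up for illustration and does not come from the slides.

```python
def peak_bandwidth_gb_s(width_bytes, giga_transfers_per_s):
    """Peak bandwidth in GB/s: bytes per transfer times 10^9 transfers per second."""
    return width_bytes * giga_transfers_per_s


CHANNELS_PER_SOCKET = 3  # "three DRAM channels" per chip, from the slide

# Assumption: 1.066 Gb/s per pin, 8 bytes (64 data bits) per channel.
per_channel = peak_bandwidth_gb_s(8, 1.066)
per_socket = CHANNELS_PER_SOCKET * per_channel

# The memory hierarchy slide quotes "up to 1.33Gb/s" for faster parts.
faster_channel = peak_bandwidth_gb_s(8, 1.33)

print(f"per channel:    {per_channel:.2f} GB/s")
print(f"per socket:     {per_socket:.2f} GB/s")
print(f"faster channel: {faster_channel:.2f} GB/s")
```

These are peak figures only; sustained bandwidth depends on access pattern and DRAM timing, which the slides do not cover.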
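
The entry counts on the virtual memory slide can be turned into "TLB reach", the amount of memory each TLB can map without a miss. This is a rough sketch rather than anything stated on the slides: it treats the "2/4MB" large pages as 2MB, and the helper name `reach_bytes` is hypothetical.

```python
KB = 1024
MB = 1024 * KB


def reach_bytes(entries, page_bytes):
    """Memory covered when every TLB entry maps a distinct page."""
    return entries * page_bytes


# Entry counts are taken from the slide; 2MB large pages are an assumption.
tlb_reach = {
    "L1 I-TLB, 4KB pages": reach_bytes(128, 4 * KB),
    "L1 D-TLB, 4KB pages": reach_bytes(64, 4 * KB),
    "L1 D-TLB, 2MB pages": reach_bytes(32, 2 * MB),
    "L2 TLB, 4KB pages": reach_bytes(512, 4 * KB),
}

# Address-space sizes implied by the 48-bit virtual / 40-bit physical widths.
virtual_space = 2 ** 48
physical_space = 2 ** 40

for name, size in tlb_reach.items():
    print(f"{name}: {size // KB} KB")
print(f"virtual space:  {virtual_space // 2 ** 40} TiB")
print(f"physical space: {physical_space // 2 ** 40} TiB")
```

With 4KB pages the unified L2 TLB covers 2MB, a quarter of the 8MB shared L3, which is one way to see why large pages matter for big working sets.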
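
The grouping rule on the x86 Decoding slide (up to 4 instructions per cycle, only the first may be complex) can be sketched as a toy cycle counter. This is a simplified model under stated assumptions, not Intel's actual decoder: an instruction needing more than 4 µOPs is modelled as occupying one cycle by itself on the microcode path, and fetch bandwidth, µOP fusion and the loop stream detector are ignored. The function name `decode_cycles` is hypothetical.

```python
def decode_cycles(uop_counts):
    """Count decode cycles for a sequence of x86 instructions.

    uop_counts[i] is the number of uOPs instruction i translates into.
    """
    cycles = 0
    i = 0
    n = len(uop_counts)
    while i < n:
        cycles += 1
        if uop_counts[i] > 4:
            # Assumption: a microcoded instruction takes a cycle by itself.
            i += 1
            continue
        # The first slot accepts a complex instruction (1-4 uOPs).
        i += 1
        taken = 1
        # The remaining three slots accept only single-uOP instructions.
        while i < n and taken < 4 and uop_counts[i] == 1:
            i += 1
            taken += 1
    return cycles


print(decode_cycles([2, 1, 1, 1]))  # complex instruction leads the group
print(decode_cycles([1, 2, 1, 1]))  # complex instruction starts a new group
```

In this model the order of instructions matters: a complex instruction that is not first in its group ends the current group early and starts the next one.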
