Berkeley COMPSCI 152 - Lecture 19 – Advanced Processors III

School: University of California, Berkeley

Course: COMPSCI 152 - Computer Architecture and Engineering

UC Regents Fall 2005 © UCB
CS 152 L19: Advanced Processors III

CS 152 Computer Architecture and Engineering
Lecture 19 – Advanced Processors III
2005-11-3
John Lazzaro (www.cs.berkeley.edu/~lazzaro)
www-inst.eecs.berkeley.edu/~cs152/
TAs: David Marquardt and Udam Saini

Last Time: Dynamic Scheduling
[Figure: reorder buffer with columns Inst #, src1 #, src1 val, src2 #, src2 val, dest #, dest val; store unit (to memory), load unit (from memory), ALU #1, ALU #2.]
Each line holds physical <src1, src2, dest> registers for an instruction, and controls when it executes.
Execution engine works on the physical registers, not the architecture registers.
Common Data Bus: <reg #, reg val>

Today: Throughput and multiple threads
Goal: Use multiple instruction streams to improve (1) throughput of machines that run many programs, (2) execution time of multi-threaded programs.
Difficulties: Gaining full advantage requires rewriting applications, OS, libraries.
Ultimate limiter: Amdahl's law (application dependent). Memory system performance.
Example: Sun Niagara (32 instruction streams on a chip).

Throughput Computing
Multithreading: Interleave instructions from separate threads on the same hardware. Seen by OS as several CPUs.
Multi-core: Integrating several processors that (partially) share a memory system on the same chip.

Multi-Threading (static pipelines)

Recall: Bypass network prevents stalls
[Figure: pipeline datapath with stages ID (Decode), EX, MEM, WB and bypass mux/logic fed from WB.]
Instead of bypass: interleave threads on the pipeline to prevent stalls ...

Introduced in 1964 by Seymour Cray
4 CPUs, each run at 1/4 clock.
Many variants ...

Multi-Threading (dynamic scheduling)

Power 4 (predates Power 5 shown Tuesday)
Single-threaded predecessor to Power 5. 8 execution units in out-of-order engine, each may issue an instruction each cycle.

For most apps, most execution units lie idle
[Figure: execution-unit utilization for an 8-way superscalar. From: Tullsen, Eggers, and Levy, "Simultaneous Multithreading: Maximizing On-chip Parallelism," ISCA 1995.]
Observation: Most hardware in an out-of-order CPU concerns physical registers. Could several instruction threads share this hardware?

Simultaneous Multi-threading ...
[Figure: issue slots per cycle (cycles 1-9) across units M, M, FX, FX, FP, FP, BR, CC. Panels: "One thread, 8 units" and "Two threads, 8 units".]
M = Load/Store, FX = Fixed Point, FP = Floating Point, BR = Branch, CC = Condition Codes

Power 4 vs. Power 5
[Figure: Power 4 and Power 5 pipelines.]
Power 5 adds: 2 fetch (PC), 2 initial decodes; 2 commits (architected register sets).

Power 5 data flow ...
Why only 2 threads? With 4, one of the shared resources (physical registers, cache, memory bandwidth) would be prone to bottleneck.

Power 5 thread performance ...
Relative priority of each thread controllable in hardware.
For balanced operation, both threads run slower than if they "owned" the machine.

This Friday: Memory System Checkoff
[Figure: Instruction Cache, Data Cache, DRAM Controller, DRAM, and Test Vectors, connected by the IC Bus, IM Bus, DC Bus, and DM Bus.]
Run your test vector suite on the Calinx board, display results on LEDs.

Multi-Core

Recall: Superscalar utilization by a thread
[Figure: execution-unit utilization for an 8-way superscalar.]
Observation: In many cases, the on-chip cache and DRAM I/O bandwidth are also underutilized by one CPU. So, let 2 cores share them.

Most of Power 5 die is shared hardware
[Figure: die photo showing Core #1, Core #2, and shared components.]
Shared components: L2 cache, L3 cache control, DRAM controller.

Core-to-core interactions stay on chip
(1) Threads on two cores that use shared libraries conserve L2 memory.
(2) Threads on two cores share memory via L2 cache operations. Much faster than 2 CPUs on 2 chips.

Sun Niagara

The case for Sun's Niagara ...
[Figure: execution-unit utilization for an 8-way superscalar.]
Observation: Some apps struggle to reach a CPI <= 1. For throughput on these apps, a large number of single-issue cores is better than a few superscalars.

Niagara: 32 threads on one chip
8 cores: single-issue, 6-stage pipeline, 4-way multi-threaded, fast crypto support.
Shared resources: 3MB on-chip cache, 4 DDR2 interfaces, 32G DRAM, 20 Gb/s, 1 shared FP unit, GB Ethernet ports.
Die size: 340 mm² in 90 nm. Power: 50-60 W.
Sources: Hot Chips, via EE Times, Infoworld. J Schwartz weblog (Sun COO).
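
The slides name Amdahl's law as the ultimate limiter on what multiple instruction streams can buy, and give Niagara's 32 streams as the example. The sketch below is a minimal illustration of that limit, not something taken from the lecture. The function name `amdahl_speedup` and the sample parallel fractions are hypothetical, and treating each of the 32 streams as an independent full-speed processor is an idealization, since Niagara's threads share pipelines, cache, and memory bandwidth.

```python
def amdahl_speedup(parallel_fraction: float, streams: int) -> float:
    """Ideal speedup over one stream under Amdahl's law.

    parallel_fraction: share of the work that can be spread across streams
                       (hypothetical input; the slides call it application dependent).
    streams:           number of instruction streams, assumed here to be
                       independent and equally fast (an idealization).
    """
    if not 0.0 <= parallel_fraction <= 1.0:
        raise ValueError("parallel_fraction must be between 0 and 1")
    if streams < 1:
        raise ValueError("streams must be at least 1")
    serial_fraction = 1.0 - parallel_fraction
    # The serial part runs at the original speed; the parallel part is divided up.
    return 1.0 / (serial_fraction + parallel_fraction / streams)


if __name__ == "__main__":
    # Illustrative fractions only; real applications must be measured.
    for p in (0.50, 0.90, 0.99):
        print(f"parallel fraction {p:.2f}: speedup on 32 streams = "
              f"{amdahl_speedup(p, 32):.2f}")
```

Under these assumptions, an application whose work is 90% parallel gains less than 8x from 32 streams, which is why the benefit is application dependent.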
