UCSB CS 240A - Computer Symposium - D2441033



School name University of California, Santa Barbara

Course CS 240A - Applied Parallel Computing

Pages 146


Unformatted text preview:

Jim Held
Intel Fellow & Director, Tera-scale Computing Research
“Single-chip Cloud Computer”
An experimental many-core processor from Intel Labs
Intel Labs Single-chip Cloud Computer Symposium, February 12, 2010

Agenda
10:00 Welcome and Opening Remarks
10:15 SCC Hardware Architecture Overview
11:15 Today’s SCC Software Environment
12:15 Buffet Lunch – Informal discussions
13:15 Message Passing on the SCC
13:45 Software-Managed Coherency
14:15 Application “Deep Dive”: JavaScript Farm on SCC
14:45 Break
15:00 Plans for future SCC access
15:30 Q&A
16:30 Adjourn

Motivations for SCC
• Many-core processor research
  – High-performance power-efficient fabric
  – Fine-grain power management
  – Message-based programming support
• Parallel programming research
  – Better support for scale-out model servers
    > Operating system, communication architecture
  – Scale-out programming model for client
    > Programming languages, runtimes

Jason Howard
Advanced Microprocessor Research, Intel Labs
SCC Architecture and Design Overview

Agenda
• Feature set
• Architecture overview
  – Core
  – Interconnect fabric
  – Memory model & message passing
  – I/O and system overview
• Design overview
  – Tiled design methodology
  – Clocking
  – Power management
• Results
• Summary

SCC Feature Set
• First silicon with 48 IA cores on a single die
• Power envelope 125W with cores @ 1GHz, mesh @ 2GHz
• Message passing architecture
  > No coherent shared memory
  > Proof of concept for a scalable many-core solution
• Next-generation 2D mesh interconnect
  > Bisection B/W 1.5Tb/s to 2Tb/s, avg. power 6W to 12W
• Fine-grain dynamic power management
  > Off-die voltage regulators (VRs)

Die Architecture
• 2-core clusters (tiles) in a 6x4 2-D mesh
[Figure: die block diagram. Memory controllers MC0 to MC3, the System Interface, and the VRC sit around the mesh. Each tile contains a router, two IA-32 cores (Core0, Core1), a 256KB L2$ per core, and a 16KB MPB; links are 16B wide.]

Core Memory Management
• Core cache coherency is restricted to the private memory space
  – Maintaining cache coherency for the shared memory space is under software control
• Each core has an address Look Up Table (LUT) extension
  – Provides address translation and routing information
• LUT must fit within the core and memory controller constraints
• LUT boundaries are dynamically programmed
[Figure: CORE0 LUT example. Entries 0 to 255 are divided into Private, Shared, and Boot regions (one region labeled 1GB); the entries map to MC0, the VRC, and the MPBs.]

On-Die 2D Mesh
• 16B wide data links + 2B sideband
  > Target frequency: 2GHz
  > Bisection bandwidth: 2 Tb/s
  > Latency: 4 cycles (2ns)
• 2 message classes and 8 VCs
• Low-power circuit techniques
  > Sleep, clock gating, voltage control, low-power RF
  > Low-power 5-port crossbar design
• Speculative VC allocation
• Route pre-computation
• Single-cycle switch allocation

Router Architecture
Frequency           2GHz @ 1.1V
Latency             4 cycles
Link width          16 bytes
Bandwidth           64GB/s per link
Architecture        8 VCs over 2 message classes
Power consumption   500mW @ 50°C
[Figure: four-cycle router pipeline (Cycle 1 to Cycle 4) from In-Port 0 over 16B links, with stages labeled Input Arbitration, Switch Arbitration, FIFO, Route Pre-compute, and VC Allocation.]

Message Passing on SCC
• Message passing is done through the shared memory space
• Two classes of shared memory:
  – Off-die DRAM: uncacheable shared memory, which results in high-latency message passing
  – On-die message passing buffers (MPB): low-latency message passing
    > On-die dedicated message buffers placed in each tile to improve message passing performance
    > Message bandwidth improved to 1 GB/s

Message Passing Protocol
• Cores communicate through small, fast messages
  – L1 to L1 data transfers
  – New Message Passing Data Type (MPDT)
• Message Passing Buffer (MPB) – 16KB
  – 1 MPB per tile for 384KB of on-die shared memory
  – MPB size coincides with L1 caches
[Figure: message passing protocol between Core A and Core B through a 16KB message passing buffer, spanning the coherent and non-coherent memory spaces. Steps: 1. Data move (Core A L1$), 2. MP write miss, 3. MPB write, 4. MP read miss, 5. MPB read, 6. Data move (Core B L1$).]

Dedicated Message Buffers
• Cache line transfers into the L1 cache of the receiving core are implemented through on-die message passing buffers
• Each tile has a 16KB MPB
• Part of the shared memory space is statically mapped into the MPB in each tile rather than into a memory controller
• Messages larger than the MPB can still go out to main memory
[Figure: two transfer schemes between a send core and a receive core, each with a mesh interface and message buffer: "Local write, remote read" and "Remote write, local read".]

System Interface
• JTAG access to the config system while in reset/debug
  > Done on power reset from the Management Console PC
  > Configuring memory controller etc.
  > Reset cores with default configuration
• Management Console PC can use memory-mapped registers to modify default behavior
  > Configuration and voltage control registers
  > Message passing buffers
  > Memory mapping
• Preload image and reset rather than PC bootstrap
  > BIOS & firmware a work in progress

SCC System Overview
[Figure: SCC die with a mesh of tiles, one router (R) per tile, and four memory controllers (MC), each attached to DIMMs. The System Interface connects to a System FPGA, which connects over PCIe to the Management Console PC. Also labeled: JTAG I/O, JTAG bus, PLL, RPC.]

SCC Full Chip
Technology     45nm process
Interconnect   1 poly, 9 metal (Cu)
Transistors    Die: 1.3B, Tile: 48M
Tile area      18.7mm2
Die area       567.1mm2
[Figure: die photo, 21.4mm x 26.5mm, showing the SCC tiles, four DDR3 MCs, PLL + I/O, RPC, JTAG, and System Interface + I/O.]

SCC Dual-core Tile
• 2 P54C cores (16K L1$/core)
• 256K L2$ per core
• 8K message passing buffer
• Clock Crossing FIFOs between the Mesh Interface Unit and the Router
• Tile area 18.7mm2
• Core area 3.9mm2
• Cores and uncore units @ 1GHz
• Router @ 2GHz
[Figure: tile floorplan with labels P54C (x2), L2$ + CC (x2), GCU, MIU, MPB, CCF, Router.]

Router Tile Clock Gating
[Figure: clock gating for a router tile, with SIF, MC0 to MC3, VRC, and PLL shown around the die. Enable signals: Port_En, Router_En, L2$0_En, L2$1_En, Tile_EN, Core1_En. Gated clocks: Port CLKs [4:0], L2$0 clk, L2$1 clk, Core0 clk, Core1 clk.]

Clock Distribution
• Balanced H-tree clock distribution
• Designed to provide a 4GHz clock to tile entry points
• Simulated skew for adjacent tiles: 5ps
• Cross-die skew irrelevant

Voltage and Frequency Islands
• 27 Frequency Islands (FI)
• 8 Voltage Islands (VI)

SCC Clock Crossing FIFO (CCF)
• 6 entry deep FIFO,
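
The mesh figures quoted on the On-Die 2D Mesh and Router Architecture slides can be cross-checked with a few lines of arithmetic. The sketch below is not from the slides. It assumes that the 64GB/s per-link figure counts both directions of a 16B link, and that the bisection cut splits the 6x4 mesh across its longer dimension so that it crosses 4 links. Under those assumptions the slide numbers are mutually consistent.

```python
# Back-of-the-envelope check of the mesh numbers quoted on the slides.
# Assumptions (not stated on the slides): a link is counted in both
# directions, and the bisection cut crosses min(cols, rows) links.

LINK_WIDTH_BYTES = 16        # data width per link (slides)
MESH_COLS, MESH_ROWS = 6, 4  # 6x4 tile mesh (slides)
ROUTER_LATENCY_CYCLES = 4    # router pipeline depth (slides)
MPB_KB_PER_TILE = 16         # message passing buffer per tile (slides)


def link_bandwidth_gbytes(freq_hz, directions=2):
    """Bandwidth of one mesh link in GB/s (both directions by default)."""
    return LINK_WIDTH_BYTES * freq_hz * directions / 1e9


def bisection_bandwidth_tbits(freq_hz):
    """Bandwidth in Tb/s across a cut that halves the mesh along its longer side."""
    links_cut = min(MESH_COLS, MESH_ROWS)
    return links_cut * link_bandwidth_gbytes(freq_hz) * 8 / 1e3


def router_latency_ns(freq_hz):
    """Router traversal time in nanoseconds."""
    return ROUTER_LATENCY_CYCLES / freq_hz * 1e9


def total_mpb_kb():
    """On-die shared memory provided by all MPBs together, in KB."""
    return MESH_COLS * MESH_ROWS * MPB_KB_PER_TILE


if __name__ == "__main__":
    for freq in (1.5e9, 2e9):
        print(f"{freq / 1e9:.1f} GHz: "
              f"{link_bandwidth_gbytes(freq):.0f} GB/s per link, "
              f"{bisection_bandwidth_tbits(freq):.3f} Tb/s bisection, "
              f"{router_latency_ns(freq):.2f} ns router latency")
    print(f"Total MPB capacity: {total_mpb_kb()} KB")
```

At 2GHz this gives 64GB/s per link and about 2Tb/s of bisection bandwidth, and at 1.5GHz about 1.5Tb/s, which would account for the "1.5Tb/s to 2Tb/s" range on the feature-set slide if the mesh clock spans 1.5GHz to 2GHz (an inference, not something the slides state).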
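
The LUT described on the Core Memory Management slide (per-core address translation plus routing information, with dynamically programmed boundaries) can be illustrated with a small model. This is a minimal sketch under stated assumptions, not the SCC register layout: it assumes 32-bit core addresses whose top 8 bits select one of the 256 LUT entries (so each entry covers 16MB), and the `LutEntry` fields, the `Lut` class, and the example region layout are all hypothetical.

```python
from dataclasses import dataclass

LUT_ENTRIES = 256                 # indices 0..255, as in the CORE0 LUT figure
ADDR_BITS = 32                    # assumption: 32-bit core addresses
SEGMENT_BITS = ADDR_BITS - 8      # assumption: top 8 bits index the LUT
SEGMENT_SIZE = 1 << SEGMENT_BITS  # 16MB per entry under these assumptions


@dataclass(frozen=True)
class LutEntry:
    target: str  # hypothetical routing label, e.g. "MC0", "MPB", "VRC"
    base: int    # hypothetical base address inside the target


class Lut:
    """Toy per-core lookup table: core address -> (target, target address)."""

    def __init__(self):
        self.entries = [None] * LUT_ENTRIES

    def program(self, first, last, target, base=0):
        """Map slots first..last (inclusive) to consecutive segments of target.

        Calling this again on the same slots models reprogramming the
        region boundaries at run time.
        """
        for i, slot in enumerate(range(first, last + 1)):
            self.entries[slot] = LutEntry(target, base + i * SEGMENT_SIZE)

    def translate(self, core_addr):
        """Return (target, address) for a core address, or raise if unmapped."""
        if not 0 <= core_addr < (1 << ADDR_BITS):
            raise ValueError("address outside the 32-bit core address space")
        slot = core_addr >> SEGMENT_BITS
        offset = core_addr & (SEGMENT_SIZE - 1)
        entry = self.entries[slot]
        if entry is None:
            raise LookupError(f"LUT slot {slot} is not programmed")
        return entry.target, entry.base + offset


if __name__ == "__main__":
    lut = Lut()
    lut.program(0, 63, "MC0")     # hypothetical: 64 x 16MB = 1GB private region
    lut.program(192, 192, "MPB")  # hypothetical: one slot of shared MPB space
    print(lut.translate(0x01000010))
    print(lut.translate(0xC0000020))
```

Splitting the address into a slot index and an offset keeps translation to a single table read, which fits the slide's point that the LUT has to stay within core and memory controller constraints.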
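
The "local write, remote read" scheme from the Dedicated Message Buffers slide, together with the fallback to main memory for messages larger than the MPB, can be sketched as a toy software model. Everything named here (`Tile`, `SharedDram`, `send`, `receive`, the `tag` parameter, and the `ready_len` flag) is hypothetical. It stands in for the synchronization real software would have to provide and does not model caches, the MPDT type, or the mesh.

```python
MPB_BYTES = 16 * 1024  # per-tile message passing buffer size (slides)


class Tile:
    """Toy tile: an on-die MPB plus a 'message pending' length flag."""

    def __init__(self):
        self.mpb = bytearray(MPB_BYTES)
        self.ready_len = None  # None means no message is pending


class SharedDram:
    """Hypothetical stand-in for uncacheable off-die shared memory."""

    def __init__(self):
        self.slots = {}


def send(sender, dram, tag, payload):
    """Local write: place payload in the sender's own MPB if it fits.

    Larger messages go to off-die memory instead. Returns the path
    taken, "mpb" or "dram".
    """
    if len(payload) <= MPB_BYTES:
        sender.mpb[:len(payload)] = payload
        sender.ready_len = len(payload)
        return "mpb"
    dram.slots[tag] = bytes(payload)
    return "dram"


def receive(sender, dram, tag):
    """Remote read: pull the message from the sender's MPB or from DRAM."""
    if tag in dram.slots:
        return dram.slots.pop(tag)
    if sender.ready_len is not None:
        data = bytes(sender.mpb[:sender.ready_len])
        sender.ready_len = None  # mark the buffer as free again
        return data
    raise LookupError("no message pending")


if __name__ == "__main__":
    core_a_tile, dram = Tile(), SharedDram()
    print(send(core_a_tile, dram, "small", b"hello"))
    print(receive(core_a_tile, dram, "small"))
    print(send(core_a_tile, dram, "big", bytes(MPB_BYTES + 1)))
    print(len(receive(core_a_tile, dram, "big")))
```

In the other scheme on the slide, "remote write, local read", the sender would write into the receiver's MPB; in this model that amounts to passing the receiver's `Tile` to `send` and `receive` instead of the sender's.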
