GT ECE 4893 - A Many-Core x86 Architecture for Visual Computing from Intel - D2208488




School name Georgia Tech

Course ECE 4893 - Special Topics

Pages 23

This preview shows page 1-2-22-23 out of 23 pages.



Larrabee: A Many-Core x86 Architecture for Visual Computing from Intel
Prof. Hsien-Hsin S. Lee
School of Electrical and Computer Engineering
Georgia Tech

Disclaimer: The materials of this presentation were obtained from available resources including In-Stat’s Microprocessor Report, Intel’s Larrabee paper published in SIGGRAPH 2008, and certain speculations from both the presenter and several online reviews. The opinions suggested by this slide set are purely the presenter’s interpretation of these materials; they do not necessarily represent the official opinions of Intel, Nvidia, or Georgia Tech.

Vision, Ambition, and Design Goals
• Intel: Software is the New Hardware!
• Intel: x86 ISA makes parallel programming easier
  – Better flexibility and programmability
  – Supports subroutine calls and page faulting
  – Mostly software rendering pipeline, except texture filtering
• Note the general goal for current-day GPGPU designers (well, also Intel’s Larrabee architects)
  – ↑ performance per mm²
  – ↑ performance per watt

The Larrabee Architecture
[Block diagram: in-order x86+SIMD cores, coherent L2$ subsets, fixed function logic, memory & I/O interface]
• Lots of x86 cores (8 to 64?)
• Fully coherent cache hierarchy

Programmable Pipeline Comparison
[Diagram: conventional GPGPU pipeline (based on DirectX 10) with stages Input Data, Vertex Shader, Geometry Shader, Primitive Setup & Rasterization, Pixel Shader, Frame Buffer Blend]
[Diagram: Larrabee’s fully programmable pipeline with stages Input Data, Vertex Shader, Geometry Shader, Primitive Setup, Rasterization, Pixel Shader, Frame Buffer Blend]

X86 Core
• LRB’s “in-order” core is the original Pentium (P54C, i.e., pre-MMX)
  + 64-bit extensions
  + Larger L1 caches
  + A shared L2
  + 4-way multi-threading
  + 16-wide VPU (Vector Processing Unit)
• Rumor has it: this is the thoroughly debugged P54C given back by the Pentagon, who got the original RTL from Intel to develop their radiation-hardened version (which I really doubt)
• Compatibility is the keyword

Single Larrabee Core
[Block diagram: Instruction Decode; Scalar Unit and Scalar Registers; Vector Unit and Vector Registers; L1 I$ (32KB) and D$ (32KB) (to support 4 threads); one local subset of the L2$ (256KB); ring network]

Dual Issue Core
• Rely on compiler to pair two instructions for asymmetric pipes
  – Same as P54C
  – Primary instruction pipe (U-pipe)
    • All instructions
  – Secondary, more restricted pipe (V-pipe)
    • ld, st, simple ALU ops, branches, cache manipulation instructions, vector st
• 1GHz, 32 cores to reach 1 TeraFLOPS

Shared L2, Divided L2
• Each core has a local L2 subset
  – 256KB each
  – Enables parallel lookup among cores
• One core can access others’ subsets directly
• Entire L2 is coherent (no hassle like Cell DMA)
• SIGGRAPH paper shows a 4MB L2, indicating 16 cores

Cache Control Instructions
• Each core can
  – Fast-access its local subset of L2 (256KB)
  – Access others’ L2 shares too
• Control for non-temporal streaming data (SSE)
• Prefetch to L1, or L2 only
• Mark a streaming cache line for early eviction
• Render target kept in L2 (e.g., FB, ZB, SB, etc.)

Ring Network
• Bi-directional ring network
  – All cores, L2, and the block of FF logic are attached to it
  – 512 bits wide in each direction
  – Simpler than mesh, easy wire routing
• One clock cycle for each stop (a hop)
  – Number of nodes between two parties determines latency
  – Worst case: halfway around the ring
• Ring latency is small compared to DRAM access
• When > 16 cores: multiple, hierarchical rings will be needed (think about KSR MPP)

4-Way MT
• Four x86 contexts to support 4 hardware threads
• One thread picked per clock
• MT is especially helpful
  – When compiler fails to schedule code without stalls
  – Upon L1 misses
  – Can hide long vector instruction latency
  – Can switch thread on every clock

Larrabee Multithread Model
[Diagram (Source: MPR): rows of boxes labeled “strand” grouped over a 16-wide vector unit]
• Fiber: software-managed context (hides long, predictable latencies)
  – More fibers: typically up to 8, depending on latency to cover
• Thread: hardware-managed context (hides short, unpredictable latencies)
  – More threads: up to 4 per core, sharing memory via L1 and L2 caches

VPU (1/2)
• 16-wide integer / single-precision FP
• 8-wide double-precision FP
• Ternary operands
  – One source can come from memory
• Free predication on every instruction
  – 16-bit predicate registers: one “enable” per lane
• Gather/scatter instructions
  – Read/write 16 results to/from 16 different offsets
• 1/3 the area of the LRB core!!!

VPU (2/2)
[Block diagram: Vector Registers, Mask Registers, Swizzle, Replicate, 16-wide Vector ALU]

Fixed Function Logic (1/3)
• Modern GPGPUs have the following done in HW
  – Texture filtering, display processing, post-shader alpha blending, rasterization, interpolation, etc.
• LRB does all of it in SW except in the Texture Sampler Units
  – Much faster than the software approach (12x ~ 40x)
    • Texture filtering still most commonly uses 8-bit operations
    • Efficiently selecting an unaligned 2x2 quad requires specialized pipelined gather logic
    • Filtering on the VPU requires an impractical amount of RF b/w
    • On-the-fly texture decompression is drastically more efficient in dedicated hardware

Fixed Function Logic (2/3)
• Similar to typical GPU texture logic
  – 32KB texture cache per core
  – Supports all the usual operations
    • DX10 compressed texture format
    • Mipmapping
    • Anisotropic filtering

Fixed Function Logic (3/3)
• Cores pass commands to the texture units through the L2$ and receive
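
The headline figures on the Dual Issue Core, Shared L2, and Ring Network slides can be checked with a few lines of arithmetic. The sketch below is illustrative only: the factor of 2 FLOPs per lane per clock (one multiply-add counted as two operations) is an assumption that the slides do not state, the 16-stop ring is a hypothetical size, and the function names are made up for this example.

```python
def peak_gflops(cores, lanes, clock_ghz, flops_per_lane_per_clock=2):
    """Peak single-precision GFLOPS for the whole chip.

    flops_per_lane_per_clock=2 is an ASSUMPTION (one multiply-add per lane
    per clock, counted as two operations); the slides give only the total.
    """
    return cores * lanes * flops_per_lane_per_clock * clock_ghz


def cores_from_l2(total_l2_kb, subset_kb=256):
    """Core count implied by a total L2 size, at one 256KB subset per core."""
    return total_l2_kb // subset_kb


def worst_case_hops(stops):
    """Worst-case hop count on a bidirectional ring: halfway around."""
    return stops // 2


print(peak_gflops(cores=32, lanes=16, clock_ghz=1.0))  # 1024.0, i.e. ~1 TFLOPS
print(cores_from_l2(4 * 1024))                         # 16
print(worst_case_hops(16))                             # 8 (hypothetical 16-stop ring)
```

At one clock per hop, the last figure would mean at most 8 clocks of ring latency for that hypothetical ring, which is in line with the slide's remark that ring latency is small compared to a DRAM access.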

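The predication and gather bullets on the VPU (1/2) slide can be modelled in plain Python. This is a behavioural sketch, not Larrabee's actual instruction set: the function names, the flat-list memory, the choice that disabled lanes keep the destination's old value, and the choice that disabled gather lanes read as 0 are all assumptions made for illustration.

```python
LANES = 16  # VPU width for integer / single-precision operations


def masked_add(dst, a, b, mask):
    """Lane-wise a + b under a 16-bit predicate mask.

    Bit i of `mask` is the "enable" for lane i. Disabled lanes keep the
    old value of `dst` (an ASSUMPTION of this sketch).
    """
    return [a[i] + b[i] if (mask >> i) & 1 else dst[i] for i in range(LANES)]


def gather(memory, base, offsets, mask=0xFFFF):
    """Read up to 16 values from 16 different offsets.

    Disabled lanes read as 0 (an ASSUMPTION of this sketch).
    """
    return [memory[base + offsets[i]] if (mask >> i) & 1 else 0
            for i in range(LANES)]


memory = list(range(100, 200))            # hypothetical flat memory
offsets = [3 * i for i in range(LANES)]   # strided, non-contiguous accesses
loaded = gather(memory, base=0, offsets=offsets)

even_lanes = 0x5555                       # enable lanes 0, 2, 4, ...
result = masked_add([0] * LANES, loaded, [1] * LANES, even_lanes)
print(result[:4])  # [101, 0, 107, 0]
```

Because the mask is just another operand, a branch inside a 16-wide loop body can be expressed as two masked operations with complementary masks instead of a jump, which is one way to read the slide's "free predication on every instruction".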