Spring 2010
Prof. Hyesoon Kim

• Reading assignment [LRB]
• http://www.ece.neu.edu/groups/nucar/GPGPU/GPGPU-2/Seiler.pdf

• Economic reason: Game market is getting bigger
• Technology reason: Power wall, memory wall. Heterogeneous architectures show promising results

Detour: OOO

• 4-way SMT in-order processor with cache coherence
• Extended X86 ISA
• Fixed functions: texture filtering

http://users.ece.gatech.edu/~lanterma/mpg08/Larrabee_ECE4893.pdf

• Pentium processor, in-order
• Extended X86 (64-bit, new instructions)
• 4-way SMT
• 32KB I-cache, 32KB D-cache (statically partitioned)
• 256KB local L2 cache (subset of the L2 cache)

• U-pipe, V-pipe
• Primary pipeline (U-pipe): all instructions
• Secondary pipeline (V-pipe): limited instructions: loads, stores, simple ALU operations, branches, cache manipulation instructions, and vector stores
• Rely on the compiler's pairing: VLIWish again

• Use cache as extended register file storage
• Target for stream applications
• Each core can
  – Fast-access its local subset of L2 (256KB)
  – Access other cores' L2 subsets too
• Control for non-temporal streaming data (SSE)
• Prefetch to L1, or L2 only
• Mark a streaming cache line for early eviction
• Render target kept in L2 (e.g., FB, ZB, SB, etc.)

• Global L2 cache is divided into a 256KB local L2 per core
• Data written by a CPU core is stored in its own L2 cache subset and is flushed from other subsets, if necessary

• Bidirectional ring network (<16)
  – Even cycle, odd cycle: one clock per hop
• Each ring data path is 512 bits wide per direction
• L2 cache insertion requires cache coherence checking
• Memory and fixed function access

• 16-wide SIMD unit
  – 16-wide single precision
  – 8-wide double precision
• Hardware scatter/gather operations: 16 elements are loaded from or stored to up to 16 different addresses that are specified in another vector register
• New instructions: fused multiply-add, and the standard logical operations, including instructions to extract non-byte-aligned
• Data can be replicated from L2 cache directly
• Free numeric type conversion and data replication while reading from memory
• Mask registers: predicated
• 3 source operands, one of them can come from L1 directly

• Loads and stores from non-contiguous addresses
• 16 data values can be loaded from or stored to addresses held in another vector register

[Figure: gather example with 16 lanes (0-15), an address vector (Addr), memory, and the destination register (dest)]

• All operations participate in the computation in the vector units
• Mask bits decide the write-enable signal

[Figure: masked operation on Src1, Src2, and Src3 into dest, with write-enable mask 1110011111110000]

[Figure: masked gather with the same address vector, memory, and dest; mask 1110011111110000, one LD per lane gated by MASK_bit]

• Use FIFO for load balancing
• No rasterization
• Texture filtering
  – 32KB texture cache per core
  – Core passes commands through L2 cache
• Texture units perform virtual-to-physical page translation

http://www.ece.neu.edu/groups/nucar/GPGPU/GPGPU-2/Seiler.pdf

• Implementation hierarchy (your names may vary)
  – Strand: runs a data kernel in one (or more) VPU lane(s)
  – Fiber: SW-managed group of strands, like co-routines
  – Thread: HW-managed, swaps fibers to cover long latencies
  – Core: runs multiple threads to cover short latencies
• Comparison to GPU data parallelism
  – Same mechanisms as used in GPUs, except…
  – Larrabee allows SW scheduling (except for HW threads)

http://www.ece.neu.edu/groups/nucar/GPGPU/GPGPU-2/Seiler.pdf

• New data types
• Vector arithmetic, logical, and shift
• Vector mask generation
• Vector load/store
• Stream data handles
  – Cache control instructions
    • Prefetch into L1 or L2 caches
    • Early evictions
• CPU/GPU memory space

• Team members (1-2)
• Schedule
  – 4/12 (1 paragraph project proposal)
  – 4/14 proposal 1st feedback
  – 4/16 detailed proposal description meeting
  – 4/23 progress meeting
  – 4/28 final project presentation
    • 10 min for each team
  – 4/30 project submission
• Programs
  – Game related programs
  – CUDA, Nintendo DS
• Architecture
  – Architecture survey
  – Hardware architecture simple
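
Looking back at the vector unit slides, the masked scatter/gather and the write-enable behavior of the mask registers can be sketched in software. The following is a minimal Python model, not Larrabee code: the function names (mask_from_bits, gather, scatter, masked_fma), the lane ordering of the mask string (leftmost bit taken as lane 0), and the example addresses are all assumptions made for illustration. Only the 16-lane width, the idea that every lane computes while the mask gates the write, and the mask value 1110011111110000 come from the slides.

```python
from typing import List, Sequence

VLEN = 16  # lanes in the 16-wide single-precision vector unit


def mask_from_bits(bits: str) -> List[bool]:
    """Per-lane write enables from a bit string.

    Assumption: the leftmost character corresponds to lane 0.
    """
    if len(bits) != VLEN or set(bits) - {"0", "1"}:
        raise ValueError("mask must be a string of exactly 16 '0'/'1' characters")
    return [b == "1" for b in bits]


def gather(memory: Sequence[float], addrs: Sequence[int],
           mask: Sequence[bool], dest: Sequence[float]) -> List[float]:
    """Masked gather: lane i loads memory[addrs[i]] if mask[i] is set,
    otherwise it keeps the old value dest[i]."""
    return [memory[a] if m else d for a, m, d in zip(addrs, mask, dest)]


def scatter(memory: List[float], addrs: Sequence[int],
            mask: Sequence[bool], src: Sequence[float]) -> None:
    """Masked scatter: lane i stores src[i] to memory[addrs[i]] if mask[i] is set.

    Assumption: on duplicate addresses the highest-numbered lane wins.
    """
    for a, m, s in zip(addrs, mask, src):
        if m:
            memory[a] = s


def masked_fma(src1: Sequence[float], src2: Sequence[float], src3: Sequence[float],
               mask: Sequence[bool], dest: Sequence[float]) -> List[float]:
    """Fused multiply-add with predication: every lane computes
    src1*src2 + src3, but only lanes with the mask bit set write the result."""
    computed = [a * b + c for a, b, c in zip(src1, src2, src3)]
    return [r if m else d for r, m, d in zip(computed, mask, dest)]


if __name__ == "__main__":
    mask = mask_from_bits("1110011111110000")        # mask value from the slide
    memory = [100 + a for a in range(64)]            # hypothetical memory contents
    addrs = [(7 * i) % 64 for i in range(VLEN)]      # hypothetical address vector
    print(gather(memory, addrs, mask, [0] * VLEN))
    print(masked_fma([2.0] * VLEN, [float(i) for i in range(VLEN)],
                     [1.0] * VLEN, mask, [-1.0] * VLEN))
```

Lanes whose mask bit is 0 keep their previous destination value in both gather and masked_fma, which is one way to read "mask bits decide the write-enable signal": the computation still happens in every lane, and the mask only controls which results are written.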