GT CS 4803 - DESIGN GAME CONSOLE
School name Georgia Tech

Spring 2010, Prof. Hyesoon Kim

• Reading assignment [LRB]
  – http://www.ece.neu.edu/groups/nucar/GPGPU/GPGPU-2/Seiler.pdf

• Economic reason: the game market is getting bigger
• Technology reason: power wall, memory wall. Heterogeneous architectures show promising results

Detour: OOO

• 4-way SMT in-order processor with cache coherence
• Extended x86 ISA
• Fixed functions: texture filtering
http://users.ece.gatech.edu/~lanterma/mpg08/Larrabee_ECE4893.pdf

• Pentium processor, in-order
• Extended x86 (64-bit, new instructions)
• 4-way SMT
• 32KB I-cache, 32KB D-cache (statically partitioned)
• 256KB local L2 cache (a subset of the L2 cache)

• U-pipe, V-pipe
• Primary pipeline (U-pipe): all instructions
• Secondary pipeline (V-pipe): limited instructions: loads, stores, simple ALU operations, branches, cache manipulation instructions, and vector stores
• Relies on the compiler's pairing: VLIW-ish again

• Use the cache as extended register file storage
• Target: stream applications
• Each core can
  – Fast-access its local subset of L2 (256KB)
  – Access other cores' L2 subsets too
• Control for non-temporal streaming data (SSE)
• Prefetch to L1, or to L2 only
• Mark a streaming cache line for early eviction
• Render targets kept in L2 (e.g., FB, ZB, SB)

• Global L2 cache is divided into 256KB local L2 subsets, one per core
• Data written by a CPU core is stored in its own L2 cache subset and is flushed from other subsets, if necessary

• Bi-directional ring network (<16 cores)
  – Even cycle, odd cycle: one clock per hop
• Each ring data path is 512 bits wide per direction
• L2 cache insertion requires cache coherence checking
• Memory and fixed-function access

• 16-wide SIMD unit
  – 16-wide single precision
  – 8-wide double precision
• Hardware scatter/gather operations: 16 elements are loaded from or stored to up to 16 different addresses that are specified in another vector register
• New instructions: fused multiply-add and the standard logical operations, including instructions to extract non-byte-aligned
• Data can be replicated from L2 cache directly
• Free numeric type conversion and data replication while reading from memory
• Mask registers: predicated
• 3 source operands; one of them can come from L1 directly

• Loads and stores from non-contiguous addresses
• 16 data values can be loaded from or stored to addresses held in another vector register
[Figure: gather example showing lanes 0 to 15, an address vector (Addr), memory, and the destination register (dest)]

• All lanes participate in the computation in the vector unit
• Mask bits decide the write-enable signal
[Figure: masked operation with Src1, Src2, Src3, dest, and Write_enable; mask bits 1110011111110000]
[Figure: masked gather; each lane's load (LD) is gated by its MASK_bit; mask bits 1110011111110000]

• Use FIFO for load balancing
• No rasterization
• Texture filtering
  – 32KB texture cache per core
  – Core passes commands through the L2 cache
• Texture units perform virtual-to-physical page translation
http://www.ece.neu.edu/groups/nucar/GPGPU/GPGPU-2/Seiler.pdf

• Implementation hierarchy (your names may vary)
  – Strand: runs a data kernel in one (or more) VPU lane(s)
  – Fiber: SW-managed group of strands, like co-routines
  – Thread: HW-managed, swaps fibers to cover long latencies
  – Core: runs multiple threads to cover short latencies
• Comparison to GPU data parallelism
  – Same mechanisms as used in GPUs, except…
  – Larrabee allows SW scheduling (except for HW threads)
http://www.ece.neu.edu/groups/nucar/GPGPU/GPGPU-2/Seiler.pdf

• New data types
• Vector arithmetic, logical, and shift
• Vector mask generation
• Vector load/store
• Stream data handles
  – Cache control instructions
• Prefetch into L1 or L2 caches
• Early evictions
• CPU/GPU memory space

• Team members (1-2)
• Schedule
  – 4/12 (1-paragraph project proposal)
  – 4/14 proposal 1st feedback
  – 4/16 detailed proposal description meeting
  – 4/23 progress meeting
  – 4/28 final project presentation
    • 10 min for each team
  – 4/30 project submission
• Programs
  – Game-related programs
  – CUDA, Nintendo DS
• Architecture
  – Architecture survey
  – Hardware architecture simple
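
The vector-unit bullets above describe two mechanisms that are easier to follow with a small model: a gather, where each of the 16 lanes loads from an address held in another vector register, and predication, where every lane computes but the mask bits decide which lanes are written back. The following is a minimal Python sketch of those semantics only, not Larrabee code. The function names, the lane-0-is-leftmost bit order, and the example addresses and memory contents are assumptions made for illustration; only the 16-lane width and the mask pattern 1110011111110000 come from the notes.

```python
from typing import List, Sequence

VECTOR_WIDTH = 16  # single-precision lanes per vector register, per the notes

# Mask pattern printed in the slide figures. Treating the leftmost bit as
# lane 0 is an assumption; the notes do not state the bit order.
MASK = [int(bit) for bit in "1110011111110000"]


def masked_gather(memory: Sequence[float], addresses: Sequence[int],
                  mask: Sequence[int], dest: Sequence[float]) -> List[float]:
    """Lane i loads memory[addresses[i]] if mask[i] is set, else keeps dest[i]."""
    assert len(addresses) == len(mask) == len(dest) == VECTOR_WIDTH
    return [memory[a] if m else d for a, m, d in zip(addresses, mask, dest)]


def masked_fma(src1: Sequence[float], src2: Sequence[float],
               src3: Sequence[float], mask: Sequence[int],
               dest: Sequence[float]) -> List[float]:
    """Fused multiply-add with three sources and a write-enable mask.

    Every lane computes src1 * src2 + src3; the mask only gates write-back,
    so masked-off lanes keep their previous dest value.
    """
    assert (len(src1) == len(src2) == len(src3)
            == len(mask) == len(dest) == VECTOR_WIDTH)
    computed = [a * b + c for a, b, c in zip(src1, src2, src3)]
    return [new if m else old for new, m, old in zip(computed, mask, dest)]


if __name__ == "__main__":
    # Hypothetical memory image and address vector, chosen only for the demo.
    memory = [10.0 * i for i in range(32)]
    addresses = [3, 5, 7, 9, 13, 10, 14, 8, 9, 2, 0, 3, 11, 2, 6, 4]
    old_dest = [-1.0] * VECTOR_WIDTH

    print(masked_gather(memory, addresses, MASK, old_dest))
    print(masked_fma([2.0] * VECTOR_WIDTH,
                     [float(i) for i in range(VECTOR_WIDTH)],
                     [1.0] * VECTOR_WIDTH, MASK, [0.0] * VECTOR_WIDTH))
```

Lanes whose mask bit is 0 keep the old destination value in both functions, which is the write-enable behavior the mask bullets describe. Real hardware would issue the gather's loads through the cache hierarchy rather than indexing a list, so this models the result, not the timing.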

