COSC 6374 Parallel Computations
Introduction to CUDA GPU Architectures
Edgar Gabriel
Fall 2010

References
- Intel Larrabee
  [1] L. Seiler, D. Carmean, E. Sprangle, T. Forsyth, M. Abrash, P. Dubey, S. Junkins, A. Lake, J. Sugerman, R. Cavin, R. Espasa, E. Grochowski, T. Juan, P. Hanrahan: "Larrabee: a many-core x86 architecture for visual computing", ACM Trans. Graph., Vol. 27, No. 3, August 2008, pp. 1-15.
      http://softwarecommunity.intel.com/UserFiles/en-us/File/larrabee_manycore.pdf
- Nvidia GT200
  [2] David Kanter: "Nvidia GT200: Inside a Parallel Processor", 09/08/2008.
      http://www.realworldtech.com/page.cfm?ArticleID=RWT090808195242&p=1
- Nvidia Fermi
  [3] David Kanter: "Inside Fermi: Nvidia's HPC Push".
      http://www.realworldtech.com/page.cfm?ArticleID=RWT093009110932&p=1

Larrabee Motivation
- Comparison of two architectures with the same number of transistors
- Half the performance of a single stream for the simplified core
- 40x increase for multi-stream executions

                      2 out-of-order cores    10 in-order cores
  Instruction issue   4                       2
  VPU per core        4-wide SSE              16-wide
  L2 cache size       4 MB                    4 MB
  Single stream       4 per clock             2 per clock
  Vector throughput   8 per clock             160 per clock

Larrabee Overview
- Many-core visual computing architecture
- Based on x86 CPU cores
  - Extended version of the regular x86 instruction set
  - Supports subroutines and page faulting
- Number of x86 cores can vary depending on the implementation and processor version
- Fixed-function units for texture filtering
  - Other graphical operations such as rasterization or post-shader blending are done in software

Larrabee Overview II
(Image source: [1])

Overview of a Larrabee Core I
(Image source: [1])
- x86 core derived from the Pentium processor
  - No out-of-order execution
- Standard Pentium instruction set with the addition of
  - 64-bit instructions
  - Instructions for pre-fetching data into L1 and L2
    cache
- Support for 4 simultaneous threads, with separate registers for each thread
- Each core is augmented with a wide vector processor (VPU)
- 32 KB L1 instruction cache, 32 KB L1 data cache
- 256 KB local subset of the L2 cache
- Coherent L2 cache across all cores

Vector Processing Unit in Larrabee
- 16-wide VPU executing integer, single-precision and double-precision floating point operations
- VPU supports gather/scatter operations
  - The 16 elements can be loaded from or stored to up to 16 different addresses
- Support for predicated instructions using a mask control register (if-then-else statements)

Inter-Processor Ring Network
- Bi-directional ring network
- 512 bits wide per direction
- Routing decisions are made before a message is injected into the network

Larrabee Programming Models
- Most applications can be executed without modification due to the full support of the x86 instruction set
- Support for POSIX threads to create multiple threads
  - API extended by thread affinity parameters
- Recompiling code with Larrabee's native compiler will automatically generate code that uses the VPUs
- Alternative parallel approaches
  - Intel Threading Building Blocks
  - Larrabee-specific OpenMP directives

Larrabee Performance
(Image source: [1])

Nvidia GT200
- A GT200 is a multi-core chip with a two-level hierarchy
  - focuses on high throughput on data-parallel workloads
- 1st level of hierarchy: 10 Thread Processing Clusters (TPC)
- 2nd level of hierarchy: each TPC has
  - 3 Streaming Multiprocessors (SM); an SM corresponds to 1 core in a conventional processor
  - a texture pipeline (used for memory access)
- Global Block Scheduler
  - issues thread blocks to SMs with available capacity
  - simple round-robin algorithm, but taking resource availability (e.g. of shared memory) into account

Nvidia GT200
(Image source: [2])

Nvidia GT200 streaming multiprocessor I
- Instruction fetch, decode and issue logic
- 8 32-bit ALUs (often referred to as Streaming Processors (SP), or confusingly called a "core" by Nvidia)
- 8 branch units
  - a thread encountering a branch will stall until it is resolved (no speculation)
  - branch delay: 4 cycles
- two 64-bit special units for less frequent operations
  - 64-bit operations are 8-12 times slower than 32-bit operations
- 1 special function unit for unusual instructions
  - transcendental functions, interpolations, reciprocal square roots
  - take anywhere from 16 to 32 cycles to execute

Nvidia GT200 streaming multiprocessor II
- single issue with SIMD capabilities
- can execute up to 8 thread blocks / 1024 threads concurrently
- does not support speculative execution or branch prediction
- Instructions are scoreboarded to reduce stalls
- Each SP has access to 2048 register file entries, each with 32 bits
  - a double-precision number has to use two adjacent registers
  - register file can be used by up to 128 threads concurrently

Nvidia GT200 streaming multiprocessor III
(Image source: [2])

Nvidia GT200 streaming multiprocessor IV
- Execution units of an SM run at twice the frequency of the fetch and issue logic, as well as of the memory and registers
- 64 KB register file that is partitioned across all SPs
- 16 KB shared memory that can be used for communication between the threads running on the SPs of the same SM
  - organized in 4096 entries, 16 banks, 32-bit bank width
  - accessing shared memory is as fast as accessing a register

Load/Store operations
- Generated in SMs, but handled by the SM controller in the TPC
- load pipeline
  - shares hardware with the texture pipeline
  - shared by three (3) SMs
  - mutually exclusive usage of load and texture pipelines
- effective address calculation and mapping of 40-bit virtual addresses to physical addresses by the MMU
- Texture cache
  - 2-D addressing
  - read-only caches without cache
    coherence
  - entire cache hierarchy is invalidated if a data item is modified
  - texture caches are used to save bandwidth and power; they are not really faster than texture memory

Load/Store operations II
(Image source: [2])

Generalized Memory Model

CUDA Memory Model II
- cudaError_t cudaMalloc(void** devPtr, size_t size)
  - Allocates size bytes of device (global) memory pointed to by *devPtr
  - Returns cudaSuccess for no error
cudaError