SBU CSE 591 - Basics on Architecture and Programming


CSE 591: GPU Programming
Basics on Architecture and Programming
Klaus Mueller
Computer Science Department
Stony Brook University

Recommended Literature
• text book, reference book
• more general books on parallel programming
• programming guides, available from nvidia.com

Course Topic Tag Cloud
Architecture, Limits of parallel programming, Thread management, Memory, Kernels, Device control, Algorithms, OpenCL, Example applications, Performance tuning, Debugging, Host, CUDA, Parallel programming

Speedup Curves
[two slides of speedup-curve plots, omitted from this text preview] ... but wait, there is more to this:

Amdahl's Law
Governs the theoretical speedup:

    S = 1 / ((1 − P) + P/N)

• P: parallelizable portion of the program
• S: speedup
• N: number of parallel processors

P determines the theoretically achievable speedup.
• example (assuming infinite N): P = 90% → S = 10; P = 99% → S = 100

How many processors to use?
• when P is small → a small number of processors will do
• when P is large (embarrassingly parallel) → a high N is useful
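A quick numeric check of the law (an illustrative host-side C/CUDA sketch, not from the slides; the function name amdahl_speedup is ours):

    #include <stdio.h>

    /* Amdahl's Law: overall speedup when a fraction P of the program
       is parallelized across N processors. */
    static double amdahl_speedup(double P, double N)
    {
        return 1.0 / ((1.0 - P) + P / N);
    }

    int main(void)
    {
        double inf = 1e9;  /* stand-in for "infinite" N */
        printf("P = 0.90        -> S = %.1f\n", amdahl_speedup(0.90, inf)); /* ~10  */
        printf("P = 0.99        -> S = %.1f\n", amdahl_speedup(0.99, inf)); /* ~100 */
        printf("P = 0.90, N = 8 -> S = %.2f\n", amdahl_speedup(0.90, 8.0)); /* ~4.7 */
        return 0;
    }

Note how even P = 90% caps the speedup at 10 no matter how many processors are added: the serial fraction dominates, which is why it deserves the tuning effort.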
Focus Efforts on the Most Beneficial
Optimize the program portion with the most 'bang for the buck':
• look at each program component
• don't be ambitious in the wrong place

Example: a program with 2 independent parts, A and B (the slide shows execution-time bars for the original program, for B sped up 5×, and for A sped up 2×). Sometimes one gains more with less: when B accounts for most of the runtime, the 5× speedup of B shortens the total more than the 2× speedup of A.

Beyond Theory...
Further limits come from the mismatch of the parallel program and the parallel platform; these are man-made 'laws', subject to change with new architectures:
• memory access patterns: data access locality and strides vs. memory banks (a sketch contrasting coalesced and strided access appears at the end of this preview)
• memory access efficiency: arithmetic intensity vs. cache sizes and hierarchies
• enabled granularity of program parallelism: MIMD vs. SIMD
• hardware support for specific tasks → on-chip ASICs
• support for hardware access → drivers, APIs

Device Transfer Costs
Transferring the data to the device is also important:
• the computational benefit of a transfer plays a large role
• transfer costs are (or can be) significant

Adding two N×N matrices:
• transfer to and back from the device: 3N² elements
• number of additions: N²
→ operations-to-transfer ratio = 1/3, i.e., O(1)

Multiplying two N×N matrices:
• transfer to and back from the device: 3N² elements
• number of multiplications and additions: N³
→ operations-to-transfer ratio = O(N), which grows with N
(a matrix-add sketch at the end of this preview makes this accounting concrete)

Conclusions: Programming Strategy
Use the GPU to complement CPU execution:
• recognize the parallel program segments and parallelize only these
• leave the sequential (serial) portions on the CPU
• the PPP, 'Peach of Parallel Programming' (Kirk/Hwu): sequential portions (do not bite), parallel portions (enjoy)

Overall GPU Architecture (G80)
[block diagram omitted from this text preview: Host → Input Assembler → Thread Execution Manager, feeding 16 streaming multiprocessors (SMs); each SM is a block of 8 stream processors (SPs) with a parallel data cache; texture and load/store units connect the SMs to global memory]
• 768 MB off-chip (GDDR) DRAM (on-board)
• memory bandwidth: 86.4 GB/s (GPU)
• 4 GB/s bandwidth (GPU ↔ CPU, PCI Express)

GPU Architecture Specifics
Additional hardware:
• each SP has a multiply-add (MAD) unit and one extra multiply unit
• special floating-point function units (SQRT, TRIG, ...)

Massive multi-thread support:
• CPUs typically run 2 or 4 threads/core
• G80 can run up to 768 threads/SM → 12,000 threads/chip
• G200 can run 1024 threads/SM → 30,000 threads/chip

G80 (2008)
• GeForce 8-series (8800 GTX, etc.)
• 128 SP (16 SM × 8 SP/SM)
• 500 Gflops (768 MB DRAM)

G200 (2009)
• GeForce GTX 280, etc.
• 240 SP
• 1 Tflops (1 GB DRAM)

NVIDIA Quadro: the professional version of the consumer GeForce series

NVIDIA Fermi Architecture
New GeForce 400 series:
• GTX 480, etc.
• up to 512 SP (16 SM × 32 SP/SM), but typically < 500 (the GTX 480 has 480 SP)
• 1.3 Tflops (1.5 GB DRAM)

Important features:
• C++, support for C, Fortran, Java, Python, OpenCL, ...
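To illustrate the 'memory access patterns' point above, here is a hedged CUDA sketch (ours, not from the slides) contrasting a coalesced read, where consecutive threads touch consecutive addresses, with a strided read that defeats coalescing:

    #include <cuda_runtime.h>
    #include <stdio.h>

    #define N (1 << 20)

    // Coalesced: consecutive threads read consecutive elements, so a
    // warp's loads fall in contiguous memory and combine into few
    // memory transactions.
    __global__ void copyCoalesced(const float *in, float *out, int n)
    {
        int k = blockIdx.x * blockDim.x + threadIdx.x;
        if (k < n) out[k] = in[k];
    }

    // Strided: consecutive threads read elements `stride` apart, so each
    // load in a warp lands in a different memory segment.
    __global__ void copyStrided(const float *in, float *out, int n, int stride)
    {
        int k = blockIdx.x * blockDim.x + threadIdx.x;
        if (k < n) out[k] = in[((size_t)k * stride) % n];
    }

    int main(void)
    {
        float *in, *out;
        cudaMalloc(&in,  N * sizeof(float));
        cudaMalloc(&out, N * sizeof(float));
        dim3 block(256), grid((N + 255) / 256);
        copyCoalesced<<<grid, block>>>(in, out, N);    // fast path
        copyStrided<<<grid, block>>>(in, out, N, 32);  // slow path
        cudaDeviceSynchronize();
        printf("done\n");
        return 0;
    }

Timing the two launches (e.g., with cudaEvent timers) typically shows the strided kernel reaching only a fraction of the coalesced kernel's bandwidth; the exact gap depends on the architecture.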
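The matrix-add transfer accounting maps directly onto code. Below is a minimal CUDA sketch (ours, not from the slides): 3N² elements cross the PCI Express bus (two inputs in, one result out) while the kernel performs only N² additions, so the O(1) operations-to-transfer ratio is visible in the program's structure. It also shows the recommended host/kernel split: serial setup stays on the CPU, the data-parallel loop becomes the kernel.

    #include <stdio.h>
    #include <stdlib.h>
    #include <cuda_runtime.h>

    #define N 1024  /* matrix dimension (illustrative) */

    // One thread per element: C = A + B.
    __global__ void matAdd(const float *A, const float *B, float *C, int n)
    {
        int i = blockIdx.y * blockDim.y + threadIdx.y;  // row
        int j = blockIdx.x * blockDim.x + threadIdx.x;  // column
        if (i < n && j < n)
            C[i * n + j] = A[i * n + j] + B[i * n + j];
    }

    int main(void)
    {
        size_t bytes = (size_t)N * N * sizeof(float);
        float *hA = (float*)malloc(bytes), *hB = (float*)malloc(bytes),
              *hC = (float*)malloc(bytes);
        for (int k = 0; k < N * N; ++k) { hA[k] = 1.0f; hB[k] = 2.0f; }

        float *dA, *dB, *dC;
        cudaMalloc(&dA, bytes); cudaMalloc(&dB, bytes); cudaMalloc(&dC, bytes);

        // 2 N^2 elements transferred in ...
        cudaMemcpy(dA, hA, bytes, cudaMemcpyHostToDevice);
        cudaMemcpy(dB, hB, bytes, cudaMemcpyHostToDevice);

        dim3 block(16, 16);
        dim3 grid((N + 15) / 16, (N + 15) / 16);
        matAdd<<<grid, block>>>(dA, dB, dC, N);  // ... for only N^2 additions ...

        // ... plus N^2 elements back out: 3 N^2 transfers in total.
        cudaMemcpy(hC, dC, bytes, cudaMemcpyDeviceToHost);

        printf("C[0] = %f (expect 3.0)\n", hC[0]);
        cudaFree(dA); cudaFree(dB); cudaFree(dC);
        free(hA); free(hB); free(hC);
        return 0;
    }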
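The per-chip limits quoted in the architecture slides (SM count, threads per SM, memory size) can be queried at runtime through the CUDA runtime API; a minimal sketch:

    #include <stdio.h>
    #include <cuda_runtime.h>

    int main(void)
    {
        cudaDeviceProp prop;
        // Query the first CUDA device (device 0).
        cudaGetDeviceProperties(&prop, 0);
        printf("Device:            %s\n",  prop.name);
        printf("Multiprocessors:   %d\n",  prop.multiProcessorCount);
        printf("Max threads/SM:    %d\n",  prop.maxThreadsPerMultiProcessor);
        printf("Max threads/block: %d\n",  prop.maxThreadsPerBlock);
        printf("Global memory:     %zu MB\n", prop.totalGlobalMem >> 20);
        return 0;
    }

On a G80-class device this reports 16 multiprocessors and 768 threads per SM, matching the numbers above.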

