Cell Programming Tips & Techniques
Cell Programming Workshop
Cell Ecosystem Solutions Enablement
Systems and Technology Group

Class Objectives – Things you will learn
Key programming techniques to exploit Cell hardware organization and language features for
– SPU
– SIMD

Class Agenda
Review relevant SPE features
SPU Programming Tips
– Level of programming (assembler, intrinsics, auto-vectorization)
– Overlap DMA with computation (double, multiple buffering)
– Dual-issue rate (instruction scheduling)
– Design for limited local store
– Branch hints or elimination
– Loop unrolling and pipelining
– Integer multiplies (avoid 32-bit integer multiplies)
– Shuffle byte instructions for table look-ups
– Avoid scalar code
– Choose the right SIMD strategy
– Load / store only by quadword
SIMD Programming Tips

Review Cell Architecture

Cell Processor

Course Agenda
Cell Blade Products
Cell Blade Family of Servers
Cell Blade Architecture
Cell Blade Overview
– Critical signals, link speed and bandwidth
– Power consumption
– Hardware components
Blade and blade center assembly
Example of a Cell blade with maximum interconnection capability
Options: InfiniBand
Trademarks: Cell Broadband Engine™ is a trademark of Sony Computer Entertainment, Inc.
References: Dan Brokenshire, BE Programming Tips

Key SPE Features

SPE – Single-Ported Local Memory

SPU Programming Tips
Level of programming (assembler, intrinsics, auto-vectorization)
Overlap DMA with computation (double, multiple buffering)
Dual-issue rate (instruction scheduling)
Design for limited local store
Branch hints or elimination
Loop unrolling and pipelining
Integer multiplies (avoid 32-bit integer multiplies)
Shuffle byte instructions for table look-ups
Avoid scalar code
Choose the right SIMD strategy
Load / store only by quadword

Programming Levels on Cell BE
Expert level
– Assembler: high performance, high effort
More ease of programming
– C compiler, vector data types, intrinsics; the compiler schedules instructions and allocates registers
Auto-SIMDization
– For scalar loops; the user should support it with alignment directives; the compiler provides feedback about SIMDization
Highest degree of ease of use
– User-guided parallelization necessary; Cell BE looks like a single processor
Trade-off: performance vs. effort
Requirements for the compiler increase with each level

Overlap DMA with computation
Double- or multi-buffer code or (typically) data
Example for double buffering n+1 data blocks:
– Use multiple buffers in local store
– Use a unique DMA tag ID for each buffer
– Use fence commands to order DMAs within a tag group
– Use barrier commands to order DMAs within a queue

Start DMAs from SPU
Use SPE-initiated DMA transfers rather than PPE-initiated DMA transfers, because
– there are more SPEs than the one PPE
– the PPE can enqueue only eight DMA requests, whereas each SPE can enqueue 16

Instruction Scheduling

Instruction Starvation Situation
[Figure: instruction buffers filled with paired FP and MEM ops feeding the dual-issue instruction logic; refill is initiated after the first buffer is half empty]
There are 2 instruction buffers
– up to 64 ops along the fall-through path
First buffer is half-empty
– can initiate refill
When the MEM port is continuously used
– starvation occurs (no ops left in buffers)

Instruction Starvation Prevention
[Figure: instruction buffer filled with paired FP and MEM ops feeding the dual-issue instruction logic; refill is initiated after half empty; a window marks where an IFETCH must be issued, given the refill/IFETCH latency, before it is too late to hide the latency]
SPE has an explicit IFETCH op
– which initiates an instruction fetch
Scheduler monitors the starvation situation
– when the MEM port is continuously used
– insert an IFETCH op within the (red) window
Compiler design
– scheduler must keep track of code layout

Design for Limited Local Store
The Local Store holds up to 256 KB for
– the program, stack, local data structures, and DMA buffers
Most performance optimizations put pressure on the local store (e.g. multiple DMA buffers)
Use plug-ins (program kernels downloaded at run time) to build complex function servers in the LS

Branch Optimizations
SPE
– Heavily pipelined, so high penalty for branch misses (18
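
The double-buffering steps listed under "Overlap DMA with computation" (multiple buffers in local store, one DMA tag ID per buffer) can be sketched in portable C as follows. This is only an illustration of the control flow, not the workshop's code. The names dma_get, dma_wait, compute, sum_blocks_double_buffered and the block size BLOCK_ELEMS are hypothetical. A synchronous memcpy stands in for the asynchronous transfer, so no real overlap happens here. On an SPU the stand-ins would presumably be replaced by mfc_get, mfc_write_tag_mask and mfc_read_tag_status_all from spu_mfcio.h, with 128-byte-aligned buffers.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define BLOCK_ELEMS 256 /* hypothetical block size: 256 x 4 bytes = 1 KB */

/* Stand-in for an asynchronous DMA "get" into local store.
 * Assumption: on an SPU this would be mfc_get(ls, ea, size, tag, 0, 0),
 * which returns immediately. memcpy completes synchronously, so this
 * sketch shows the buffer/tag bookkeeping only. */
static void dma_get(void *ls, const void *ea, size_t size, unsigned tag)
{
    (void)tag;
    memcpy(ls, ea, size);
}

/* Stand-in for waiting on one tag group. Assumption: on an SPU this
 * would be mfc_write_tag_mask(1 << tag); mfc_read_tag_status_all(); */
static void dma_wait(unsigned tag)
{
    (void)tag;
}

/* Example computation on one block: sum its elements. */
static int64_t compute(const int32_t *block)
{
    int64_t sum = 0;
    for (size_t i = 0; i < BLOCK_ELEMS; i++)
        sum += block[i];
    return sum;
}

/* Process nblocks blocks of BLOCK_ELEMS ints using two buffers.
 * The buffer index doubles as the DMA tag ID, one tag per buffer. */
int64_t sum_blocks_double_buffered(const int32_t *main_mem, size_t nblocks)
{
    static int32_t buf[2][BLOCK_ELEMS]; /* the two local-store buffers */
    int64_t total = 0;
    unsigned cur = 0;

    assert(main_mem != NULL || nblocks == 0);
    if (nblocks == 0)
        return 0;

    /* Prime the pipeline: start fetching block 0 into buffer 0. */
    dma_get(buf[cur], main_mem, sizeof buf[cur], cur);

    for (size_t i = 0; i < nblocks; i++) {
        unsigned nxt = cur ^ 1u;

        /* Start fetching block i+1 into the other buffer ... */
        if (i + 1 < nblocks)
            dma_get(buf[nxt], main_mem + (i + 1) * BLOCK_ELEMS,
                    sizeof buf[nxt], nxt);

        /* ... then wait only for the current buffer's tag and compute. */
        dma_wait(cur);
        total += compute(buf[cur]);

        cur = nxt; /* swap buffers */
    }
    return total;
}
```

Calling sum_blocks_double_buffered(data, 4) on an array of 4 * BLOCK_ELEMS integers sums all four blocks while alternating between the two buffers. With a real asynchronous DMA, the fetch of block i+1 would proceed while block i is being computed, at the cost of a second buffer in the 256 KB local store.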