UCSD CSE 231 - Optimizing Compiler for a CELL Processor


Optimizing Compiler for a CELL Processor
Critical Review by Daniel Tracy

CELL Anatomy
• One PPE with an L2 cache and eight SPEs, connected by the Element Interconnect Bus

SPEs vs Multi-core
• Each SPE has a 256K local memory with explicit communication to main memory
• Conventional multi-core designs use hardware-coherent loads/stores to main memory

Vector Operations
• 128 data registers (v0..v127), each 128 bits wide
• One register holds 16 bytes, 8 halfwords, 4 integers, 4 floats, or 2 doubles

SPE Execution Resources
• 256K local memory, 128 bits wide
• Limited dual-issue across two slots: computation, vector permute, communication, branch

Goals of XL Compiler
• Compile C/Fortran code into parallel multi-SPE code
• Utilize OpenMP pragmas
• Insert run-time system to mimic shared memory
• Increase efficiency of SPE utilization
• Utilize traditional vector compilation techniques
• Force scalar code to execute on vector units

Partition Manager
• Code and data must share a small footprint (256K local memory vs. external memory)
• Traditional code overlay system employed

Data Locality: Software Caching
• Cache directory in the 256K local memory: a tag compare maps an actual (system) address to a local address
• Cache operations performed by the vector unit!
• More than twofold common-case overhead

Memory Representation: Scalar vs Vector
• Vectors: uniform data types, 16-byte aligned (e.g., four integers)
• Scalars: arbitrary types and layouts (e.g., structs, stack, etc.)

Executing Scalar Code on Vector Units
• A scalar occupies one slot of a vector register, and operations expect it in the primary slot

Alignment: Operations Between Scalars

Scalar Store: Masking Into Destination
• A scalar store touches one slot of a 16-byte line (result-4, result, result+4, result+8), so the surrounding bytes must be masked and preserved

Avoiding this Insanity?
• Only applies to code we fail to autovectorize
• Only applies to scalars we can't register-allocate
• Allocate 128 bits for all variables

Miscellany
• Scheduling optimizations must account for limited dual-issue behavior
• Branch optimization involves injection of branch hint instructions for loops
• Autovectorization techniques utilized for SPE code generation

The techniques described for these problems are merely the application of previous work.

Experiments: Base Optimizations
[Figure 6: SPE optimizations. Relative reductions in execution time (Original, +Bundle, +Branch Hint, +Ifetch) for Huffman, FFT, IDEA, LU, VLD, Linpack, Convolution, OnerayXY, Mat Mult, Saxpy, and the average.]

In our code partitioning approach, the SPE program is divided into multiple partitions by the compiler. The home location of code partitions, just as with data in our Software Cache approach, is system memory. When the compiler applies this partitioning, it reserves a small portion of the SPE local storage for the code partition manager. The reserved memory is divided into two segments: one holds the continuously resident partition manager, while the other holds the currently active code partition. The partition manager is responsible for loading partitions from their home location in system memory into local storage when necessary, normally during an inter-partition function call or an inter-partition return. The compiler modifies the original SPE program to replace each inter-partition call with a call to the partition manager. Thus, the partition manager is able to take over control and handle the transition from the current partition to the target partition.
The partition manager also makes sure an inter-partition return will return to the partition manager first.

Currently, the partitioning algorithm is call-graph based, which means the basic unit of partitioning is a function. The compiler transforms the call graph into an affinity graph, with edge weights representing call-edge frequency, and then applies a maximum spanning tree algorithm under a certain resource limit, typically the (adjustable) code buffer size.

With code partitioning, we currently see respectable performance when executing partitioned functions on a single SPE relative to execution on the PPE. In the automotive suite of the EEMBC benchmark, on a single SPE, we see a slowdown of between 2 and 10%. CJPEG, with code and data sizes of 1M, slows down 2.7 times with both software cache and code partitioning enabled. Given the preliminary nature of this work, these results are encouraging.

There are several opportunities we are currently exploring to improve the overall performance of our code partitioning algorithm. Effectively, the algorithm largely depends on the accuracy of the affinity (call-edge frequency); to achieve the best results, profiling can be used instead of static estimation. Also, using the actual partition size rather than the size estimated conservatively in the compiler could significantly improve the utilization of the local code buffer. Prefetching is of course the most promising optimization and has the potential to hide the latency incurred when fetching partitions from main memory, but prefetching requires multiple buffers, implying a much smaller partition size limit. The net effect depends on the prefetching algorithm and the accuracy of the cost model applied.

[Figure 7: Simdization speedups. Speedup factors: Linpack 2.4, Swim-l2 2.5, FIR 2.9, Autcor 2.9, Dot Product 7.5, Checksum 8.1, Alpha Blending 11.4, Saxpy 25.3, Mat Mult 26.2; average 9.9.]

7. MEASUREMENTS

We first evaluate the optimized SPE code generation techniques presented in Section 4 using a cycle-accurate simulator. Figure 6 presents the reduction in program execution time for each optimization, relative to the performance of the original compiler (standard optimizations at the O3 level, scheduled for the SPE resource and latency model). We report an average reduction of 22%, ranging from 11 to 51%. The benchmark programs used here are highly optimized, simdized kernels representative of typical workloads executing on the SPEs. Kernels include a variable length decoding (VLD) from MPEG decoding, a Huffman compression and decompression, an IDEA encryption, and a ray tracing (OnerayXY). Numerical kernels include an FFT, a 7x7 short integer convolution, a 64x64 float matrix multiply, a Saxpy, an LU decomposition, and a solver kernel of Linpack.

Bundling for dual issue results in an 11% average reduction in execution time, ranging from 2 to 22%. Large reduction percentages indicate benchmarks with large amounts of instruction-level parallelism and no lucky instruction alignment (where random instruction layout did not satisfy

