Performance Analysis of H.264 Encoder on TMS320C64x and ARM 9E
Nikshep Patil

Project objectives
- Understand the major blocks of the H.264 encoder [2]
- Understand the Texas Instruments [16] TMS320C64x DSP architecture
- Understand the ARM 9E [18] DSP architecture
- Port the H.264 encoder to the two platforms
- Analyze the performance of the encoder on the two processors in terms of MIPS
- Identify and optimize the most computationally expensive blocks separately for both DSP cores
- Achieve a MIPS reduction of about 30%

Part 1: H.264 encoder

H.264 Encoder Profiles [4]
- Seven prominent profiles:
  - Baseline profile
  - Main profile
  - Extended profile
  - High Profile
  - High 10 Profile
  - High 4:2:2 Profile
  - High 4:4:4 Profile

Profile structure of H.264 [3]
Fig. 1: The specific coding parts of the profiles in H.264 [3]

H.264 Encoder baseline profile [4]
- Primarily designed for:
  - Low processing power platforms
  - Error-prone transmission environments
- Features:
  - Low coding efficiency
  - I and P slice coding
  - Enhanced error resilience coding such as flexible macroblock ordering (FMO), arbitrary slice ordering (ASO) and redundant slices (RS)
  - Context adaptive variable length coding (CAVLC)
- Features not included in the baseline profile:
  - B slices
  - SI or SP slices
  - Interlace coding tools
  - Context adaptive binary arithmetic coding (CABAC)

Baseline profile (continued) [3]
- Major applications: video conferencing, mobile video streaming

Layers of H.264 encoder [7]
- The H.264 encoder is organized into two layers:
  - Network abstraction layer (NAL): packets containing an integer number of bytes with a header
    - Video coding layer NAL units
    - Non-video coding layer NAL units
  - Video coding layer (VCL): the coded video bitstream

Video data hierarchy [4]
- Video data is organized as: Picture, Slices, Macroblocks, Sub-macroblocks, Blocks, Pixels
- The pixel is the most basic building block of a digital image

H.264 Encoder block diagram
Fig. 2: The block diagram of the H.264 encoder [3]

Encoding process
- The major encoding steps are:
  - Intra prediction
  - Inter prediction
  - Transform and quantization
  - Entropy coding
  - Deblocking filter

Intra prediction
- Performed in the pixel domain
- Pixel values are predicted as linear interpolations of pixels from the adjacent edges of neighboring macroblocks that have already been decoded
- For luma samples, the prediction block may be formed for each 4x4 subblock, each 8x8 block, or a 16x16 macroblock
- 9 prediction modes (8 directional plus DC) for each 4x4 and 8x8 luma block
- 4 prediction modes for a 16x16 luma block
- 4 prediction modes for chroma blocks
Figure: Intra coding prediction modes for 4x4 blocks [3]

Inter prediction
- Generates a predicted version of a rectangular array of pixels by choosing another, similarly sized rectangular array of pixels from a previously decoded reference picture
- Macroblocks are partitioned into smaller sub-blocks
- A large partition size is appropriate for homogeneous areas of the frame; a small partition size is beneficial for detailed areas
- A 16x16 macroblock can be partitioned in four ways: 16x16, 16x8, 8x16 or 8x8
- An 8x8 sub-block can be partitioned in four ways: 8x8, 8x4, 4x8 or 4x4
Figure: Inter prediction, macroblock and sub-macroblock partitions [3]

Sub-pixel motion compensation
- Provides significantly better compression performance than integer-pixel compensation
- Increases complexity
- Increases coding efficiency at high bitrates and high video resolutions
- For the luma component, sub-pixel samples at half-pixel positions are generated first; they are interpolated from neighboring integer-pixel samples using a 6-tap FIR filter with weights (1, -5, 20, 20, -5, 1)/32
- Quarter-pixel samples are produced using bilinear interpolation between neighboring half-pixel or integer-pixel samples
- For the 4:2:0 video format, 1/8-pixel samples are required for the chroma component; these samples are linearly interpolated between integer-pixel chroma samples
Figure: Half-pixel and quarter-pixel interpolation [3]

Integer transform
- The residual signal, which still contains spatial redundancy, is split into 4x4 or 8x8 blocks
- The 4x4 transform removes the need for multiplications

Hierarchical transform structure
- The 4x4 blocks are first transformed with an integer DCT operation
- The DC coefficients of neighboring 4x4 transforms for the luma blocks are then grouped into 4x4 blocks and transformed again by a Hadamard transform
- A 4x4 Walsh-Hadamard transform is used for luma DC coefficients in 16x16 Intra mode
- A 2x2 Walsh-Hadamard transform is used for chroma DC coefficients

Quantization
- The quantized signal Y is obtained from the input signal X using the relation
  Y = X · ROUND(SF / Qstep)
  - X is the input signal
  - Y is the output signal
  - Qstep is the quantization step size, which is selected by the quantization parameter (QP)
- The quantization parameter varies from 0 to 51, allowing a total of 52 quantization step sizes
- The scaling operations for the quantization step sizes are arranged with logarithmic step size increments
- An increment of QP by 6 corresponds to a doubling of the quantization step size

Entropy coding
- The syntax elements other than the residual data are encoded with Exp-Golomb codes
- A more sophisticated method, CAVLC, is employed for coding the residual data
- In CAVLC, inter-symbol redundancies are exploited by switching VLC tables for various syntax elements depending on already transmitted coding symbols
- The increased adaptivity allows improved coding in comparison to schemes using a single VLC table
- However, symbol probabilities greater than 0.5 are not handled efficiently; this may prevent the use of symbols with a smaller alphabet size for coding the residual data

Deblocking filter
- Two major sources can introduce blocking artifacts in H.264:
  - The block-based transform in intra and inter prediction coding, together with the coarse quantization of the transform coefficients
  - The motion-compensated prediction loop

Deblocking filter (continued)
- The deblocking filter reduces the blocking artifacts at block boundaries
- The luma deblocking filter process is performed on four 16-sample edges
- The chroma deblocking filter process is performed on two 8-sample edges
Figure: Deblocking filter, boundaries to be filtered in a MB (luma: solid line, chroma: dotted line) [3]

Part 2: Digital signal processors

TI TMS320C64x DSP [9]
- 32-bit DSP engine
- The C64x uses an advanced very long instruction word (VLIW) architecture called VelociTI.2
- The architecture exploits parallelism: it contains multiple execution units running in parallel, which allow it to execute multiple instructions in a single clock cycle
- Available at clock speeds of up to 1.1 GHz

The major blocks of the C64x CPU
- Two general-purpose register files (A and B)
- Eight functional units (.L1, .L2, .S1, .S2, .M1, .M2, .D1 and .D2)
- Two load-from-memory data paths LD1
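
Half-pixel interpolation is a typical candidate for the "computationally expensive blocks" named in the project objectives, so a small reference model is useful before porting to either DSP. The sketch below applies the 6-tap luma filter from Part 1 to a single row of samples. The function names are hypothetical, 8-bit samples are assumed, and the rounding offset (+16 before the shift by 5) and the clipping follow the usual H.264 formulation rather than anything stated on the slides.

```python
def clip_pixel(value, max_value=255):
    """Clamp a filtered value to the valid sample range (8-bit assumed)."""
    return max(0, min(max_value, value))


def half_pel_row(samples):
    """Half-pixel luma samples for one row of integer-pixel samples.

    Output i lies halfway between samples[i + 2] and samples[i + 3].
    Weights are (1, -5, 20, 20, -5, 1) / 32, with rounding and clipping.
    """
    taps = (1, -5, 20, 20, -5, 1)
    out = []
    for i in range(len(samples) - 5):
        acc = sum(t * s for t, s in zip(taps, samples[i:i + 6]))
        out.append(clip_pixel((acc + 16) >> 5))  # divide by 32 with rounding
    return out


def quarter_pel(a, b):
    """Quarter-pixel sample: rounded average of two neighboring samples."""
    return (a + b + 1) >> 1


if __name__ == "__main__":
    row = [0, 0, 0, 255, 255, 255]
    half = half_pel_row(row)
    print(half, quarter_pel(row[2], half[0]))
```

Each output sample costs six multiply-accumulates, which is why this loop is a natural target for the parallel functional units of a VLIW core.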
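
The logarithmic arrangement of quantization step sizes from the quantization slide can also be illustrated with a short sketch. In H.264 the step size doubles each time the quantization parameter (0 to 51) increases by 6, so six base values are enough to generate all 52 step sizes. The base values below are taken from the commonly published H.264 step-size table, not from the slides, and the names `qstep` and `quantize` are hypothetical. `quantize` is a plain floating-point scalar quantizer for illustration only; a real encoder uses integer multipliers and shifts that fold in the transform scaling factor.

```python
# Step sizes for QP 0..5 (assumed from the commonly published H.264 table).
BASE_QSTEP = (0.625, 0.6875, 0.8125, 0.875, 1.0, 1.125)


def qstep(qp):
    """Quantization step size for a quantization parameter in 0..51."""
    if not 0 <= qp <= 51:
        raise ValueError("QP must be in the range 0..51")
    # The step size doubles for every increase of 6 in QP.
    return BASE_QSTEP[qp % 6] * (1 << (qp // 6))


def quantize(coeff, qp):
    """Illustrative scalar quantization, rounding half away from zero."""
    magnitude = int(abs(coeff) / qstep(qp) + 0.5)
    return magnitude if coeff >= 0 else -magnitude


if __name__ == "__main__":
    for qp in (0, 6, 12, 28, 51):
        print(qp, qstep(qp), quantize(100, qp))
```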