OPTIMIZATION OF H.264 BASELINE DECODER ON ARM9TDMI PROCESSOR

by

SANDYA BASAVANAHALLI SHESHADRI

Presented to the Faculty of the Graduate School of The University of Texas at Arlington in Partial Fulfillment of the Requirements for the Degree of

MASTER OF SCIENCE IN ELECTRICAL ENGINEERING

THE UNIVERSITY OF TEXAS AT ARLINGTON
December 2005

ACKNOWLEDGEMENTS

I am grateful to the many people who helped me shape this thesis. I am greatly indebted to Dr. K. R. Rao for his support, guidance, and encouragement from the very beginning of my work. His courses, Digital Image Processing and Video Coding Standards, made my understanding of video processing much clearer. He also helped me gain industry experience at FastVDO LLC, Columbia, MD, where I am gaining a lot of real-world experience. I thank him for all his support.

Dr. Pankaj Topiwala, president of FastVDO LLC, generously allowed me to run tests on FastVDO's Baseline decoder and to use all the resources and licenses for ADS v1.2, without which this work would not have been possible. I am greatly thankful to him.

I received many helpful comments and suggestions from Patrick Rault, Basavaraj Mudigoudar, Sachin Patil, and Tarun Batia, my colleagues at FastVDO. I thank Dr. Devarajan and Dr. Wang for agreeing to serve on my committee and review my thesis.

Last but not least, I thank my parents and my brother for their love and support in every walk of my life.

November 18, 2005

ABSTRACT

OPTIMIZATION OF H.264 BASELINE DECODER ON ARM9TDMI PROCESSOR

Publication No.

Sandya Basavanahalli Sheshadri, MS
The University of Texas at Arlington, 2005
Supervising Professor: Dr. K. R. Rao

With its newly introduced features and its advancements to pre-existing features, the emerging H.264 video coding standard achieves significant improvements in coding performance over all existing standards in a wide variety of applications. The coding efficiency advantages of H.264, however, come at the expense of higher computational complexity. H.264
decoders can exhibit more than double the complexity of H.263 decoders. Furthermore, previous studies have shown that fractional-pixel motion compensation interpolation and loop filtering consume a significant amount of computational power in emerging H.264 decoders. Since these operations are part of the baseline profile of H.264, there is a need to evaluate new ways of minimizing the complexity of H.264 decoders on low-complexity devices. In particular, new wireless devices have both complexity and bit-rate constraints, yet the range of these constraints differs from that of traditional systems (e.g., powerful PCs that are networked over the best-effort Internet). Under common operational scenarios, a low-complexity wireless handheld may have significantly greater complexity/power constraints than bit-rate limitations (e.g., over a wireless access LAN).

This thesis analyzes the bottlenecks of H.264 decoders on the ARM9TDMI processor, which is targeted at mobile devices, using performance profiling tools. Optimizations are performed to achieve real-time decoding. The code is built with the RealView Compiler for ARM and ported to Symbian using Metrowerks CodeWarrior for Symbian V3.0 to achieve real-time H.264 decoding on the Nokia 6630 cellphone. The compiler flags were optimized for speed.

TABLE OF CONTENTS

ACKNOWLEDGEMENTS
ABSTRACT
LIST OF ILLUSTRATIONS
LIST OF TABLES
LIST OF ACRONYMS

Chapter

1. INTRODUCTION
   1.1 Overview: H.264 Video Coding Standard
   1.2 Applications and Design Feature Highlights
   1.3 Layered Structure
      1.3.1 Network Abstraction Layer
      1.3.2 Video Coding Layer
      1.3.3 YCbCr color space and 4:2:0 sampling
      1.3.4 Division of the picture into macroblocks
      1.3.5 Slices and slice groups
   1.4 H.264 Codec
      1.4.1 Encoder (forward path)
      1.4.2 Encoder (reconstruction path)
      1.4.3 Decoder
   1.5 Intra-frame Prediction
   1.6 Inter-frame Prediction
      1.6.1 Inter-frame Prediction in P Slices
   1.7 Transform, Scaling, and Quantization
   1.8 Entropy Coding
   1.9 In-Loop Deblocking Filter

2. ARM9TDMI
   2.1 About the ARM9TDMI
   2.2 Programmer's Model
      2.2.1 Hardware Fundamentals
      2.2.2 Instruction set extension spaces
      2.2.3 Pipeline implementation and interlocks
      2.2.4 Memory Interface

3. DESIGNING AND OPTIMIZING FOR ARM9TDMI
   3.1 Optimization Techniques
      3.1.1 Basic C Data Types
      3.1.2 Conditional Execution
      3.1.3 Comparison with zero
      3.1.4 Loops
      3.1.5 Register Allocation

4. H.264/AVC BASELINE PROFILE DECODER COMPLEXITY ANALYSIS
   4.1 Introduction
   4.2 Experiment Setup
   4.3 Experimental Analysis
      4.3.1 Results of Performance Analysis of H.264 Decoder
      4.3.2 Optimization of IDCT
      4.3.3 Optimizing inter-frame prediction
      4.3.4 Optimization of De-blocking Filter

5. CONCLUSIONS AND FUTURE WORK

Appendix

A. IMPLEMENTATION STEPS OF 4X4 IDCT

REFERENCES
BIOGRAPHICAL INFORMATION

LIST OF ILLUSTRATIONS

Figure

1.1 Profiles in H.264 [22]
1.2 Layered Structure of H.264 Video [16]
1.3 Progressive and interlaced frames and fields [16]
1.4 4:2:0 sampling [39]
1.5 Subdivision of a picture into slices when not using FMO [16]
1.6 Subdivision of a QCIF frame into slices when using FMO [16]
1.7 AVC Encoder [9]
1.8 AVC Decoder [9]
1.9 Left: Intra 4x4 prediction is conducted for samples a-p of a block using samples A-Q. Right: 8 prediction directions for Intra 4x4 prediction [16]
1.10 Five of the nine Intra 4x4 prediction modes [16]
1.11 Segmentations of the macroblock for motion compensation. Top: segmentation of macroblocks; bottom: segmentation of 8x8 partitions [16]
1.12 Filtering for fractional-sample accurate motion compensation. Upper-case letters indicate samples on the full-sample grid, while lower-case letters indicate samples in between at fractional-sample positions [16]
1.13 Multi-frame motion compensation [16]
1.14 Repeated transform for chroma blocks. The four blocks numbered 0 to 3 indicate the four chroma blocks of a chroma component of a macroblock
1.15 Principle of deblocking filter [9]
1.16 Performance of the deblocking filter for highly compressed pictures. Left: without deblocking filter; right: with deblocking filter [16]
2.1 Processor Block Diagram [37]
2.2 ARM9TDMI processor core instruction pipeline [37]
4.1 Video sequences used for performance analysis [39]
4.2 Percentage of the total execution time spent in major subroutines for Foreman QCIF sequence
4.3 Fast implementation of the H.264 inverse transform. No multiplications are needed, only