Implementation of DWT using SSE Instruction Set
Mehta, Ami
Muller, Gilles

Lifting based 2D-DWT
- Lifting
- 1D horizontal lifting
- 1D vertical lifting
- Fixed point
- (9,7) tap biorthogonal filter
- Lossy compression
- High compression levels

2D DWT Matrices layout
- Mallat strategy
- Uses an auxiliary matrix to store the results of the horizontal filtering.
- No memory scattering: horizontal high- and low-frequency components are not interleaved in memory. This allows better exploitation of the SIMD parallelism.

Optimizations
- Cache
- The 2 matrices are aligned on the cache row size (128 bits = 16 B) to allow data fetching in one cycle.
- Input and output matrices are juxtaposed in memory to prevent associativity conflicts in a direct-mapped cache.
[Figure: cache layout without alignment / cache layout with alignment]

Optimizations …
- SIMD code
- Uses SSE2.
- Computes 4 pixels in parallel using fixed-point arithmetic.
- Profiling the C code showed that the column transform and cache access caused the main bottleneck.
- In the DWT, intermediate values are reused: intermediate computations are kept instead of being recalculated.

Results
- Image size of 1024 x 1024
- Profiling results obtained using VTune Analyzer©
- Cycles per uop improves from 3.38 to 2.28
- Improvement of 32.5%

Results …

Thank
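
The SIMD and lifting slides above can be illustrated with a minimal sketch of one vertical lifting step that processes 4 columns per iteration in fixed point. This is not the authors' code. The function names (predict_rows_scalar, predict_rows_simd, mullo_epi32_sse2) are hypothetical, and the Q13 fixed-point format is an assumption, since the slides do not state the precision used. The constant is the first 9/7 lifting coefficient, alpha ≈ -1.586134342, scaled by 2^13. Because SSE2 has no packed 32-bit low multiply, the sketch emulates one with _mm_mul_epu32; when SSE2 is unavailable it falls back to the scalar loop.

```c
#include <assert.h>
#include <stdint.h>
#if defined(__SSE2__)
#include <emmintrin.h>
#endif

/* Assumed Q13 fixed-point format (not specified in the slides). */
#define FIX_SHIFT 13
#define FIX_ROUND (1 << (FIX_SHIFT - 1))
/* First 9/7 lifting coefficient, alpha ~= -1.586134342, times 2^13, rounded. */
#define ALPHA_Q13 (-12994)

/* Scalar reference for one odd (high-pass) row:
 * hi[x] += alpha * (lo0[x] + lo1[x]), where lo0 and lo1 are the even rows
 * above and below. Relies on >> being an arithmetic shift for negative
 * values, which is implementation-defined in C but holds on common compilers. */
void predict_rows_scalar(int32_t *hi, const int32_t *lo0,
                         const int32_t *lo1, int width)
{
    assert(width >= 0);
    for (int x = 0; x < width; ++x)
        hi[x] += (ALPHA_Q13 * (lo0[x] + lo1[x]) + FIX_ROUND) >> FIX_SHIFT;
}

#if defined(__SSE2__)
/* Low 32 bits of a 4-lane 32-bit multiply, built from SSE2's 2-lane
 * unsigned multiply. The low half is identical for signed operands. */
static __m128i mullo_epi32_sse2(__m128i a, __m128i b)
{
    __m128i even = _mm_mul_epu32(a, b);                    /* lanes 0, 2 */
    __m128i odd  = _mm_mul_epu32(_mm_srli_si128(a, 4),
                                 _mm_srli_si128(b, 4));    /* lanes 1, 3 */
    return _mm_unpacklo_epi32(
        _mm_shuffle_epi32(even, _MM_SHUFFLE(0, 0, 2, 0)),
        _mm_shuffle_epi32(odd,  _MM_SHUFFLE(0, 0, 2, 0)));
}

/* Same step, 4 columns per iteration. Unaligned loads keep the sketch safe
 * for any pointer; with 16-byte aligned rows, _mm_load_si128 and
 * _mm_store_si128 could be used instead. */
void predict_rows_simd(int32_t *hi, const int32_t *lo0,
                       const int32_t *lo1, int width)
{
    const __m128i alpha = _mm_set1_epi32(ALPHA_Q13);
    const __m128i round = _mm_set1_epi32(FIX_ROUND);
    int x = 0;
    for (; x + 4 <= width; x += 4) {
        __m128i a = _mm_loadu_si128((const __m128i *)(lo0 + x));
        __m128i b = _mm_loadu_si128((const __m128i *)(lo1 + x));
        __m128i h = _mm_loadu_si128((const __m128i *)(hi + x));
        __m128i p = mullo_epi32_sse2(alpha, _mm_add_epi32(a, b));
        p = _mm_srai_epi32(_mm_add_epi32(p, round), FIX_SHIFT);
        _mm_storeu_si128((__m128i *)(hi + x), _mm_add_epi32(h, p));
    }
    /* Remaining columns (width not a multiple of 4) use the scalar loop. */
    predict_rows_scalar(hi + x, lo0 + x, lo1 + x, width - x);
}
#else
/* Fallback when SSE2 is not available: plain scalar loop. */
void predict_rows_simd(int32_t *hi, const int32_t *lo0,
                       const int32_t *lo1, int width)
{
    predict_rows_scalar(hi, lo0, lo1, width);
}
#endif
```

Working on whole rows in the vertical pass keeps every load contiguous, which is the property the "no memory scattering" layout is meant to provide. The remaining 9/7 lifting steps have the same shape with different constants.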