Finding Body Parts with Vector Processing

Cynthia Bruyns
Bryan Feldman
CS 252

Introduction

Take an existing algorithm for tracking human motion and speed it up by computing on the GPU.
Demonstrate that many vision algorithms are prime candidates for vector processing.
[Figure: results after false candidates have been removed]

Demo

Vision Algorithms

Often computationally expensive: searching over many pixels for objects at many orientations and scales
    • E.g. [((1024x768) pixels) x 3 colors] x [12 orientations] x [5 scales]
Very often highly parallelizable

Limb Finding

Goal: find candidate limbs
Limbs look like long dark rectangles on light backgrounds or long light things on dark backgrounds
1. Convolution with filter (convolve using FFT)
    • Response indicates how much pixels go from low to high intensity
    • Convolve over all three color channels so as not to miss red/blue transitions of the same intensity

Algorithm specifics

[Figure: image convolved (*) with the filter]
2. For every pixel location, get respconv from "left" and "right" and put the result into a new matrix, resplimb

Algorithm specifics

[Figure: -respconv and respconv at the marked locations (x), combined into resplimb]

Algorithm specifics

3. Find local maximums: for every pixel, replace with the max of its local neighbors. If resplimb = locMax, it's a max.

resplimb:
    .50 .25 .40 .23
    .75 .41 .98 .75
    .11 .43 .15 .23
    .78 .34 .13 .15

locMax:
    .75 .98 .98 .98
    .75 .98 .98 .98
    .78 .98 .98 .98
    .78 .87 .23 .23

GPU

It's a good choice because each operation is per pixel: SIMD-like
Data stored in texture buffers, equivalent to local cache
Clean instruction set and a developing interface language to exploit vector operations
Justify your gaming habits

GPU dataflow model

Hardware supports several data types for bandwidth optimization, e.g. 32-bit floating point, half, etc.
Data passed to main memory stages via binding
[Diagram blocks: Application, Vertex Processor, Assembly & Rasterization, Fragment Processor, Framebuffer Operations, Framebuffer, Textures]

Fragment processor has high resource limits

1024 instructions
512 constants or uniform parameters
    • Each constant counts as one instruction
16 texture units
    • Reuse as many times as desired
No branching
    • But can do a lot with condition codes
No indexed reads from registers
    • Use texture reads instead
No memory writes

The algorithm

Draw invokes the fragment programs
The texture becomes a data structure: use two for framebuffers to avoid RAW hazards
[Diagram blocks: Image and Mask, each with an FFT fragment program; Convolution Program; Cylinder Program; Find Max Program; labeled "For each orientation to search"]

Results

[Chart: time versus image size (256, 512, 1024) for CPU orig, CPU FFT, and GPU]
(CPU: 2.53 GHz P4; GPU: Nvidia FX5900)
Mask size fixed (22x13), vary image size
*Additional GPU optimizations possible

Results – log scale

(CPU: 2.53 GHz P4; GPU: Nvidia FX5900)
Mask size fixed (22x13), vary image size
Labeled values: 42.7 sec, 252.1 sec
*Additional GPU optimizations possible

Results

Image size fixed (512x512), vary mask size
Varying mask sizes allow for varying limb sizes on the same image

Results

Comments

GPU and image processing are a good match
Time to move memory from CPU to GPU is cumbersome, but can be overcome
Non-uniformity of installations and products; exact specifications are hearsay

Acknowledgements

Kenneth Moreland
Deva Ramanan
Okan
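
Sketch: step 3 (find local maximums) on the CPU

Step 3 of the limb-finding algorithm can be illustrated with a few lines of plain Python. This is a minimal CPU sketch, not the fragment program from the slides. It assumes a 3x3 neighborhood clipped at the image border, which the slides do not state explicitly. The function names `local_max_filter` and `find_local_maxima` are hypothetical, and the input grid is the resplimb example from the slide.

```python
def local_max_filter(grid):
    """Replace every pixel with the max of its 3x3 neighborhood.

    Assumption: the neighborhood is 3x3 and is clipped at the borders.
    """
    rows, cols = len(grid), len(grid[0])
    out = []
    for r in range(rows):
        out_row = []
        for c in range(cols):
            neighborhood = [
                grid[rr][cc]
                for rr in range(max(0, r - 1), min(rows, r + 2))
                for cc in range(max(0, c - 1), min(cols, c + 2))
            ]
            out_row.append(max(neighborhood))
        out.append(out_row)
    return out


def find_local_maxima(grid):
    """Return (row, col) of every pixel that equals its neighborhood max."""
    loc_max = local_max_filter(grid)
    return [
        (r, c)
        for r, row in enumerate(grid)
        for c, value in enumerate(row)
        if value == loc_max[r][c]
    ]


# resplimb values from the slide's worked example
resp_limb = [
    [0.50, 0.25, 0.40, 0.23],
    [0.75, 0.41, 0.98, 0.75],
    [0.11, 0.43, 0.15, 0.23],
    [0.78, 0.34, 0.13, 0.15],
]

maxima = find_local_maxima(resp_limb)
print(maxima)
```

Each output pixel depends only on a small, fixed set of input pixels, which is why this step fits the per-pixel fragment-processor model: on the GPU the inner neighborhood loop would become texture reads around the current fragment. Note that with an equality test, neighboring pixels with identical values on a plateau are all reported as maxima.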