Mapping DSP Algorithms to a General-Purpose Out-of-Order Processor
ECE 734
Ilhyun Kim
Donghyun Baik

Outline
– Introduction
– Out-of-order execution overview
– Dependence graph changes for mapping to a GPP
– To-do list
– Expected results

Introduction
Why DSP applications are implemented on a GPP
– Lower development cost
  • Commodity part, lower maintenance cost, faster turnaround
– The GPP meets the performance requirements
  • GPPs have become fast enough; in the past, only dedicated DSP chips could deliver this performance
Problems with algorithm transformations
– They were designed for faster operations and efficient hardware
– On a GPP, software has no control over the hardware configuration
– Some transformations are effective, while others are not
Problems with extracting parallelism
– Duplicated effort at every layer (source code, compiler, processor)
– The machine searches for independent operations only within a narrow scope
What are efficient ways to map an algorithm to a GPP?
– Understand how a GPP executes instructions
– Understand how software can improve performance

Out-of-order execution overview
Dynamic parallelism extraction
– Instruction reordering
– Dynamically searching for independent operations within a limited scope
– Trying to keep all available hardware resources (functional units) busy

for (i = 1..m)
  for (j = 1..n)
    c(i,j) = c(i,j-1) + c(i-1,j)

[Figure: the loop's operations issued to the functional units]

Dependence graph changes for mapping to a GPP

for (i = 1..m)
  for (j = 1..n)
    c(i,j) = c(i,j-1) + c(i-1,j)

[Figure: dependence graph of the loop, with computation, memory-access, and control-related nodes, covered by a sequence of sliding instruction windows]

To-do list
Infrastructure
– Build a perfect-machine-model simulator that tracks only the computations in the algorithm, measuring the ideal execution time of a compiled binary under the assumption of perfect parallelism
– Build a profile tool that locates an instruction of interest among the instructions in the binary
The effect of single assignment
– Characterization on various machine configurations
The effect of unfolding
The effect of SIMD parallelism
Optimization techniques for the Alpha architecture based on the characterization data
Optimizing an existing DSP application: the MPEG-2 decoder

Expected Results
The single-assignment transformation does not help
– Instead, recycle storage space whenever possible
– The hardware already performs single assignment (via register renaming)
The unfolding transformation
– Works on iteration-independent loops with trivial computations
– Reduces loop-index overhead (even on iteration-dependent loops)
– Do not unfold loops with non-trivial computations
SIMD parallelism to reduce memory traffic
– The Alpha has no SIMD instruction set, but it does have a 64-bit datapath and 64-bit instructions
– Read or write 64 bits at a time by splitting and merging narrow words
– There are more computation units (4) than memory units (2), so reducing memory operations helps performance
Performance improvement of the MPEG-2 decoder based on the optimizations above