U of U CS 5785 - Digital Signal Processors


  • Better SHARC data, such as VLIW format and the number and kinds of functional units
  • Pictures of the pipeline would help
  • Build up the zero-overhead loop example better
  • FIR inner loop in ColdFire
  • Mine more material from bsdi.com
  • Update costs for ADSP chips
  • Tie in audio stuff to DSP stuff better
  • DSP perf of various stages or similar?

Today
  • Digital signal processors
  • VLIW
  • SHARC details
  • Quick look at audio processing

Digital Signal Processors
  • Microcontrollers are optimized for control-intensive apps
      • Average general-purpose application branches every seven instructions
      • Branches often not very predictable
      • Memory accesses often not very predictable
  • DSPs are optimized for math, loops, and data movement
      • Both fixed-point and floating-point math
      • Fast loop operations for simple loop structures
      • Lots of I/O
      • Instructions and memory accesses very predictable

Important DSPs
  • Texas Instruments: TMS320C2000, TMS320C5000, and TMS320C6000
  • Motorola StarCore: DSP56300, DSP56800, and MSC8100
  • Agere Systems: DSP16000 series
  • Analog Devices SHARC: ADSP-2100 and ADSP-21000

At the low end…
  • DSP: All key arithmetic ops in 1 cycle
    GPP: Often some math (multiply at least) is multiple-cycle
  • DSP: Support for 8 and 16 bit quantities as both integers and fractions
    GPP: Fixed word size, integer only
  • DSP: HW support for managing numerical fidelity (saturation, flexible rounding, etc.)
    GPP: These are often implemented in SW

At the high end…
  • DSP: Up to 8 arithmetic units
    GPP: 1-3 arithmetic units
  • DSP: Highly specialized functional units (multiply and accumulate, Viterbi, etc.)
    GPP: General-purpose functional units (integer, floating point, etc.)
  • DSP: Very limited use of dynamic features (branch prediction, superscalar, etc.)
    GPP: Extensive use of dynamic features

More CPU vs. DSP
  • DSPs are Harvard architecture even at the high end
      • No high-end CPU is Harvard architecture
  • DSPs offer better cache control
      • Lockable cache regions
      • Cache can be turned into scratchpad RAM
      • Scratchpad == explicitly addressable fast RAM

SHARC
  • High-performance DSP architecture
  • Similarities to MCF52233
      • Separate instruction and data memories
      • Some pipelining (3 stage vs. 4)
  • SHARC is more CISC than ColdFire
      • CISC main idea:
          • Give people complex instructions that match what they are trying to do
          • This gives good performance and high code density
      • SHARC:
          • Instructions are highly specialized for DSP

Quick VLIW Intro
  • VLIW == Very Long Instruction Word
  • Aggressive superscalar, out-of-order processors like P4 and Athlon
      • Single operation per instruction
      • Get high IPC through superscalar and out-of-order execution
      • Requires lots of logic (and energy) to detect and avoid problematic dependencies
  • VLIW
      • Dependencies detected and avoided at compile time
      • VLIW can get high IPC with simpler HW
      • Compiler technology is difficult
      • Also, compiler becomes very sensitive to the architectural details and program structure

More SHARC Stuff
  • Supports saturating ALU operations
  • Can issue some computations in parallel
      • Dual add-subtract
      • Multiplication and dual add/subtract
      • Floating-point multiply and ALU operation
  • Example SHARC instruction:

        R6 = R0*R4, R9 = R8 + R12, R10 = R8 - R12;

Parallelism Example
  • We want to compute:

        if (a>b) y = c-d; else y = c+d;

  • Strategy: Compute both results in parallel and then pick the right one

        ! Load values (DM == data memory)
        R1=DM(_a); R2=DM(_b);
        R3=DM(_c); R4=DM(_d);
        ! Compute both sum and difference (R12 = c+d, R0 = c-d)
        R12 = R3+R4, R0 = R3-R4;
        ! Choose which one to save
        COMP(R1,R2);
        IF LE R0=R12;
        DM(_y) = R0;    ! Write to y

SHARC Addressing
  • Immediate value

        R0 = DM(0x20000000);

  • Direct load

        R0 = DM(_a);    ! Loads contents of _a

  • Direct store

        DM(_a) = R0;    ! Stores R0 at _a

  • Post-modify with update
      • Used to sweep through a buffer
      • I register holds base address
      • M register/immediate holds modifier value

        R0 = DM(I3,M3)  ! Load
        DM(I2,1) = R1   ! Store

Data in Program Memory
  • Can put constant data in program memory to read two values per cycle:

        F0 = DM(M0,I0), F1 = PM(M8,I9);

  • Compiler allows the programmer to control which memory values are stored in

Circular Buffers
  • Fundamental data structure for DSP
  • New sample always overwrites oldest sample

  [Figure: circular buffer holding samples 519 through 526; reading sample 527 from the ADC overwrites sample 519, the oldest entry]

SHARC Circular Buffers
  • Uses special Data Address Generator registers:
      • L register gets buffer size
      • B register gets buffer base address
      • I, M registers in post-modify mode
      • I is automatically wrapped around the circular buffer when it reaches B+L

SHARC Zero Overhead Loop
  • No cost for jumping back to start of loop
      • Hardware decrements counter, compares, then jumps back

        ! LCNTR=30: loop length
        ! L:        last instruction in loop
        ! LCE:      termination condition (Loop Counter Expired)
        LCNTR=30, DO L UNTIL LCE;
            R0=DM(I0,M0), F2=PM(I8,M8);
            R1=R0-R15;
        L:  F4=F2+F3;

  • Nested loops also handled
      • HW provides a 6-deep loop counter stack

FIR in Detail
  1. Obtain sample from ADC, generate interrupt
  2. Move the sample into the input circular buffer
  3. Update the pointer for the circular buffer
  4. Zero the accumulator
  5. Loop through all coefficients
      1. Fetch coefficient from coefficient circular buffer
      2. Update pointer to coefficient circular buffer
      3. Fetch sample from input circular buffer
      4. Update the pointer to the input circular buffer
      5. Multiply coefficient and sample
      6. Add result to accumulator
  6. Move output sample to a holding buffer
  7. Move output sample from holding buffer to DAC

FIR Inner Loop in C

    int fir_inner (void)
    {
        int i, f;
        for (i=0, f=0; i<N; i++)
            f = f + c[i]*x[i];
        return f;
    }

FIR Inner in SHARC

    ! loop setup
        I0=a;       ! I0 points to a[0]
        M0=1;       ! set up increment
        I8=b;       ! I8 points to b[0]
        M8=1;       ! set up postincrement mode
    ! loop body
        LCNTR=N, DO loopend UNTIL LCE;
            R1=DM(I0,M0), R2=PM(I8,M8);
            R8=R1*R2;
    loopend: R12=R12+R8;

FIR Inner in ColdFire

    fir_inner:
        link    a6,#0
        moveq   #0,d2
        moveq   #0,d0
        lea
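
Steps 2 through 5 of the FIR walk-through, combined with the circular-buffer idea, can be sketched in portable C as follows. This is only an illustration, not code from the slides: the names `fir_state` and `fir_step` and the filter length `NTAPS` are hypothetical. The modulo on `newest` imitates in software the wraparound that the SHARC performs in hardware when the I register reaches B+L.

```c
#include <assert.h>
#include <stddef.h>

#define NTAPS 4  /* hypothetical filter length, chosen only for illustration */

typedef struct {
    int    samples[NTAPS];  /* input circular buffer */
    size_t newest;          /* index of the most recent sample */
} fir_state;

/* Store one new sample over the oldest one, then return the sum of
 * c[k] * x[n-k] for k = 0..NTAPS-1, where x[n] is the new sample. */
int fir_step(fir_state *s, const int c[NTAPS], int sample)
{
    assert(s != NULL && c != NULL);

    /* Steps 2-3: advance the pointer with wraparound and overwrite
     * the oldest sample. */
    s->newest = (s->newest + 1) % NTAPS;
    s->samples[s->newest] = sample;

    /* Step 4: zero the accumulator. */
    int acc = 0;

    /* Step 5: walk the coefficients forward and the samples backward
     * in time, wrapping the sample index at the start of the buffer. */
    size_t idx = s->newest;
    for (size_t k = 0; k < NTAPS; k++) {
        acc += c[k] * s->samples[idx];
        idx = (idx == 0) ? NTAPS - 1 : idx - 1;
    }
    return acc;
}
```

Feeding a single 1 followed by zeros into a zero-initialized `fir_state` returns the coefficients one at a time, which is a quick sanity check on the indexing. On a general-purpose processor the wraparound and the loop branch cost instructions on every tap; the slides' point is that the SHARC address generators and zero-overhead loop hardware remove that cost.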


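The low-end comparison notes that DSPs saturate in hardware while general-purpose processors often do it in software. A minimal sketch of what that software version might look like for 16-bit values is shown below; the function name `sat_add16` is hypothetical and is not taken from the slides or from any particular library.

```c
#include <stdint.h>

/* Add two 16-bit values and clamp the result to the int16_t range
 * instead of letting it wrap around on overflow. */
int16_t sat_add16(int16_t a, int16_t b)
{
    /* Do the addition in 32 bits so the true sum is always representable. */
    int32_t sum = (int32_t)a + (int32_t)b;

    if (sum > INT16_MAX)
        return INT16_MAX;   /* positive overflow: clamp to the largest value */
    if (sum < INT16_MIN)
        return INT16_MIN;   /* negative overflow: clamp to the smallest value */
    return (int16_t)sum;
}
```

The widening, two comparisons, and branches here are the overhead that a saturating ALU removes: on a DSP the clamp happens as part of the single-cycle add.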