An FPGA Implementation of the Fast Minimum-Redundancy Prefix CodingIntroductionExperimental SetupTechniques UsedSimple Dual-Port Block RAM in FPGAComplex Control/Conditional LoopPerformance SummaryPerformance Summary (cont)VerificationConclusionFuture WorkReferencesAn FPGA Implementation of the Fast Minimum-Redundancy Prefix CodingChia-Hsiung (Eric) ChenThis work is co-advised by Dr. Bormin Huang of Space Science and Engineering Center (SSEC), UW-MadisonIntroduction• Minimum-redundancy prefix code– Linear-time, prefix-free entropy coder• Does not require sending of codeword table during data transmission– Real-time space data compression– Low complexity, high compression ratio, and lossless– 5 stage design5/4/2009 2Experimental Setup• Transform Matlab minimum-redundancy prefix coder to FPGA hardware implementation• Target FPGA: Altera Stratix III EP3SE50F484C2• Language used: Verilog• Software environment:– Quartus II 9.0 web ed. (synthesis & timing analysis)– ModelSim SE 6.3f (functional & post P&R simulation)5/4/2009 3Techniques Used• Algorithm transformation• Incorporate simple dual port block RAM to enable parallelism (1R, 1W) -> throughput– Bypass logic• Pipeline to reduce cycle time• Combined condition evaluation• Trading cycle # for cycle time– 256 cycle out of ~9000 cycles for 200MHz->250MHz• Special logic to improve/reduce worst/average case cycle countrd_addr_iwr_addr_idata_iclk_p_idata_owen_p_i5/4/2009 4Simple Dual-Port Block RAM in FPGA• RAM operation characteristics• Efficient scheduling with bypassingAddrA AddrB AddrCRd_AddrData OutDataA DataBR D/M WR D/M WR D/M WR D/M WR D/M W5/4/2009 5Complex Control/Conditional Loop5/4/2009 6Performance SummaryBlock Stage Worst case Latency DescriptionC0 - 1024 3 input 2048 samplesC1a 2048 4b 256 2c 1025* 3 < max(freq)d 256 4 max(non-zero symbol)e 256 4 # of non-zero symbolsC2a 256*4 5+4if(# of non-zero symbols) > 2 (# of non-zero symbols-2)*4b 256*2 2if(# of non-zero symbols) > 2 (# of non-zero symbols-2)*2c 256 3 # of non-zero symbolsC3 - 256 2 # of non-zero symbolsC4 - (2048) (?) ongoingEstimated total cycle 8997+(?)5/4/2009 7Performance Summary (cont)• Current maximum working frequency: ~250MHz– From C0 to C3• Estimated throughput: ~56 Msamples/s– Time required for 2048 samples (worst case) = 9000/250M = 36 μs– Throughput = 1/(36 μs)×2048 = 56.89 Msamples/s• As compared to 1/((9000-256-2)/200M) × 2048 = 46.85M5/4/2009 8Verification• Both functional and post P&R gate-level simulation are performed to verify the designShould remain 0!Stream-in answer5/4/2009 9Conclusion• Brief intro to minimum-redundancy prefix coder• High performance hardware implementation of the minimum-redundancy prefix coder– Experimental setup and techniques used in hardware implementation– Performance evaluation5/4/2009 10Future Work• Complete C4• Evaluate incorporation of true dual port RAM– Enable 2R/2W operation– Unlock loop unroll option and (possibly) reduce cycle count– Complex bypass circuit? Increased cycle time?5/4/2009 11References• B. Huang, “Fast Minimum-redundancy Prefix Coding for Real-Time Space Data Compression," Proc. SPIE, 6683, 66830H, 2007.• Lossless Data Compression. Recommendation for Space Data System Standards, CCSDS 121.0-B-1. Blue Book. Issue 1. Washington, D.C.: CCSDS, May 1997.• D.A. Huffman, “A Method for the Construction of Minimum-Redundancy Codes,” Proc. Inst. Radio Eng., vol. 40, iss. 9, pp. 1098-1101, Sept. 1952.5/4/2009
View Full Document