MIT 6.375 - Hardware-Software Codesign

Hardware-Software Codesign (L12-1)

Kermin Fleming
Computer Science & Artificial Intelligence Lab
Massachusetts Institute of Technology
Many slides produced by: Arvind, Myron King, Man Cheuk Ng, Angshuman Parashar
March 14, 2011, http://csg.csail.mit.edu/6.375


Hello world (L12-2)

Software version:

    int main(int argc, char* argv[]) {
        int n = atoi(argv[1]);
        for (int i = 0; i < n; i++)
            printf("Hello world\n");
        return 0;
    }

Hardware version (BSV):

    module mkHello#(TOP_LEVEL_WIRES wires);
        CHANNEL_IFC channel <- mkChannel(wires);  // has a software counterpart
        Reg#(Bit#(8)) count <- mkReg(0);
        Reg#(Bit#(5)) state <- mkReg(0);

        rule init (count == 0);
            count <= channel.recv();
            state <= 0;
        endrule

        rule hello (count != 0);
            case (state)
                0:  channel.send('H');
                1:  channel.send('e');
                2:  channel.send('l');
                3:  channel.send('l');
                ...
                16: count <= count - 1;
            endcase
            if (state != 16) state <= state + 1;
            else state <= 0;
        endrule
    endmodule


Today's Lecture (L12-3)

- Case Study: IMDCT
- Interfacing with HW
- Extracting Parallelism
- Automated Solutions
  - Bluespec Inc. SCE-MI
  - Intel/MIT LEAP RRR


Ogg Vorbis Pipeline (L12-4)

[Diagram: Bits -> Stream Parser -> Residue Decoder and Floor Decoder -> IMDCT -> Windowing -> PCM Output]

- Ogg Vorbis is an audio compression format roughly comparable to other compression formats, e.g. MP3, AAC, WMA
- Input is a stream of compressed bits
- Parsed into frame residues and floor "predictions"
- The summed frequency results are converted to time-valued sequences
- Final frames are windowed to smooth out irregularities
- IMDCT takes the most computation


IMDCT (L12-5)

Suppose we want to use hardware to accelerate FFT/IFFT computation.

    Array imdct(int N, Array vx) {
        // preprocessing loop
        for (i = 0; i < N; i++) {
            vin[i]     = convertLo(i, N, vx[i]);
            vin[i + N] = convertHi(i, N, vx[i]);
        }
        // do the IFFT
        vifft = ifft(2*N, vin);
        // postprocessing loop
        for (i = 0; i < N; i++) {
            int idx = bitReverse(i);
            vout[idx] = convertResult(i, N, vifft[i]);
        }
        return vout;
    }


IMDCT (L12-6)

    Array imdct(int N, Array vx) {
        // preprocessing loop
        for (i = 0; i < N; i++) {
            vin[i]     = convertLo(i, N, vx[i]);
            vin[i + N] = convertHi(i, N, vx[i]);
        }
        // call the hardware (replaces "do the IFFT")
        vifft = call_hw(2*N, vin);   // was: vifft = ifft(2*N, vin)
        // postprocessing loop
        for (i = 0; i < N; i++) {
            int idx = bitReverse(i);
            vout[idx] = convertResult(i, N, vifft[i]);
        }
        return vout;
    }

- Implement or find a hardware IFFT
- How will the HW/SW communication work?
- How do we explore design alternatives?


HW Accelerator in a system (L12-7)

[Diagram: Software (CPU) <-> Bus (PCI Express) <-> HW IFFT Accelerator 1, HW IFFT Accelerator 2]

- Communication via bus
  - DMA transfer
- Accelerators are all multiplexed on bus
  - Possibly introduces conflicts
  - Fair sharing of bus bandwidth


The HW Interface (L12-8)

SW calls turn into a set of memory-mapped calls through the bus (PCI Express).

Three communication tasks:
- Set size of IFFT: setSize
- Enter data stream: inputData
- Take output out: outputData


Data Compatibility Issue (L12-9)

IFFT takes Complex fixed-point numbers. How do we represent such numbers in C and in RTL?

C++:

    template <typename F, typename I>
    struct FixedPt {
        F fract;
        I integer;
    };

    template <typename T>
    struct Complex {
        T rel;
        T img;
    };

Verilog:

    typedef struct {
        bit [31:0] fract;
        bit [31:0] integer;
    } FixedPt;

    typedef struct {
        FixedPt rel;
        FixedPt img;
    } Complex_FixedPt;


Data Compatibility (L12-10)

- Keeping HW and SW representations consistent is tedious and error prone
  - Issues of endianness (bit and byte)
  - Layout changes based on C compiler (gcc vs. icc vs. msvc)
- Some SW representations do not have a natural HW analog
  - What is a pointer?
  - Do we disallow passing trees and lists directly?
- Ideally translation should be automatically generated

Let us assume that the data compatibility issues have been solved and focus on control issues.


First Attempt at Acceleration (L12-11)

    Array imdct(int N, Array<Complex<FixedPt<int,int>>> vx) {
        // preprocessing loop
        for (i = 0; i < N; i++) {
            vin[i]     = convertLo(i, N, vx[i]);
            vin[i + N] = convertHi(i, N, vx[i]);
        }
        pcie_ifc.setSize(2*N);           // sets size
        for (i = 0; i < 2*N; i++)
            pcie_ifc.put(vin[i]);        // sends 1 element
        for (i = 0; i < 2*N; i++)
            vifft[i] = pcie_ifc.get();   // gets 1 element
        // postprocessing loop
        for (i = 0; i < N; i++) {
            int idx = bitReverse(i);
            vout[idx] = convertResult(i, N, vifft[i]);
        }
        return vout;
    }

Software blocks until a response exists.


Exposing more details (L12-12)

    // mem-mapped hw register
    volatile int* hw_flag;
    // mem-mapped hw frame buffer
    volatile int* fbuffer;

    Array imdct(int N, Array<Complex<FixedPt<int,int>>> vx) {
        ...
        assert(*hw_flag == IDLE);
        for (cnt = 0; cnt < n; cnt++)
            fbuffer[cnt] = frame[cnt];
        *hw_flag = GO;
        while (*hw_flag != IDLE) { ; }
        for (cnt = 0; cnt < n*2; cnt++)
            frame[cnt] = fbuffer[cnt];
        ...
    }

What happens if SW has a cache?


Issues (L12-13)

- Are the internal hardware conditions exposed correctly by the hw_flag control register?
- Blocking SW is problematic:
  - Prevents the processor from doing anything while the accelerator is in use
  - Hard to pipeline the accelerator
  - Does not handle variation in timing well


Driving a Pipelined HW (L12-14)

    int pid = fork();
    if (pid) {
        // producer process
        while (...) {
            for (i = 0; i < 2*N; i++)
                pcie.put(vin[i]);
        }
    } else {
        // consumer process
        while (...) {
            for (i = 0; i < 2*N; i++)
                v[i] = pcie.get();
        }
    }

- Multiple processes exploit pipeline parallelism in the IFFT accelerator
- How does the BSV exert back pressure on the producer thread?
- How does the consumer thread exert back pressure on the BSV module?
- What if our frames are really large: could the HW begin working before the entire frame is transmitted?


Data Parallelism 1 (L12-15)

    SyncQueue<Complex> workQ;
    int pid = fork();
    // both threads do same work
    while (...) {
        Complex<FixedPt> vin = workQ.pop();
        for (i = 0; i < 2*N; i++)
            pcie.put(vin[i]);
        for (i = 0; i < 2*N; i++)
            v[i] = pcie.get();
    }

- How do we isolate each thread's use of the HW accelerator?
- Do two synchronization points (workQ and the HW accelerator) cause our design to deadlock?


Data Parallelism 2

    PCIE get_hw(int pid) {
        if (pid == 0) return pcieA;
        else return pcieB;
    }

    SyncQueue<Complex> workQ;
    int pid = fork();
    // both threads do same work
    while (...) {
        Complex<FixedPt> vin = workQ.pop();
        for (i = 0; i < 2*N; i++) …

- By giving each thread its own HW accelerator, we have further increased data parallelism
- If the HW is not the bottleneck, this could be a waste of resources
- Do we multiplex the use of the physical BUS …
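
The Data Compatibility slides note that endianness and compiler-dependent struct layout make it error prone to hand a C++ struct directly to hardware. One common way around this, sketched below, is to serialize each sample byte by byte into an explicitly defined wire format instead of copying the struct's memory image. Everything specific here is an assumption made for illustration and does not come from the slides: the 16-byte little-endian format, the field order, and the names putU32LE, getU32LE, serialize, deserialize, and checkRoundTrip.

```cpp
#include <cassert>
#include <cstdint>

// Software-side view of one sample. The 32-bit widths mirror the
// bit [31:0] fract / integer fields on the Verilog side of the slide.
struct FixedPt {
    uint32_t fract;    // fraction bits
    uint32_t integer;  // integer bits, kept as a raw 32-bit pattern
};

struct ComplexFixedPt {
    FixedPt rel;  // real part
    FixedPt img;  // imaginary part
};

// ASSUMPTION (not from the slides): a 16-byte wire format with fields in
// the order rel.fract, rel.integer, img.fract, img.integer, each stored
// least significant byte first.
constexpr int kWireBytes = 16;

// Write v into dst[0..3], least significant byte first, on any host.
inline void putU32LE(uint8_t* dst, uint32_t v) {
    for (int i = 0; i < 4; ++i) {
        dst[i] = static_cast<uint8_t>(v >> (8 * i));
    }
}

// Read four bytes written by putU32LE back into a 32-bit value.
inline uint32_t getU32LE(const uint8_t* src) {
    uint32_t v = 0;
    for (int i = 0; i < 4; ++i) {
        v |= static_cast<uint32_t>(src[i]) << (8 * i);
    }
    return v;
}

// Serialize field by field, so struct padding and host byte order
// never reach the wire. out must have room for kWireBytes bytes.
inline void serialize(const ComplexFixedPt& c, uint8_t* out) {
    putU32LE(out + 0,  c.rel.fract);
    putU32LE(out + 4,  c.rel.integer);
    putU32LE(out + 8,  c.img.fract);
    putU32LE(out + 12, c.img.integer);
}

// Inverse of serialize. in must hold kWireBytes bytes.
inline ComplexFixedPt deserialize(const uint8_t* in) {
    ComplexFixedPt c;
    c.rel.fract   = getU32LE(in + 0);
    c.rel.integer = getU32LE(in + 4);
    c.img.fract   = getU32LE(in + 8);
    c.img.integer = getU32LE(in + 12);
    return c;
}

// Usage example: a sample survives a trip through the wire format.
inline void checkRoundTrip() {
    const ComplexFixedPt c = {{0x11223344u, 0xFFFFFFFFu}, {0x80000000u, 1u}};
    uint8_t wire[kWireBytes];
    serialize(c, wire);
    assert(wire[0] == 0x44 && wire[3] == 0x11);  // low byte goes first
    const ComplexFixedPt d = deserialize(wire);
    assert(d.rel.fract == c.rel.fract && d.rel.integer == c.rel.integer);
    assert(d.img.fract == c.img.fract && d.img.integer == c.img.integer);
}
```

The hardware side would need to unpack the same 16 bytes in the same order. Writing both sides by hand is exactly the tedium the slides describe, which is why they suggest generating the translation automatically.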

