Practical Multi-access by Exploiting Spatial Diversity in 802.11b
Huimin Zeng
ECE 734 Spring 2010

Outline
● Introduction
● Overview of problem
● Processing requirement
● FIR filter design with SIMD
● Multicore scheduling
● Conclusion

Introduction
● Demand for high data rates in wireless access
● Exploiting diversity is the key to increasing capacity
● Basic categories
  – Frequency
  – Time
  – Code
  – Space

Spatial Multi-access
● A base station (BS) equipped with an antenna array can separate the signals from multiple mobile stations that share the same frequency band and time slot
● In strong multipath propagation, a spatial diversity implementation is preferred over beamforming
● The challenge in the decoding phase is the intersymbol interference from multiple users

A Simple Scenario
● Design a practical spatial multi-access scheme in 802.11b
  – The BS tries to decode two overlapped frames from two locations
  – The mobile devices are not modified

[Figure: stations ST 1 and ST 2 transmit x1 and x2 to the BS over channels h11, h12, h21 and h22; the BS receives y1 and y2]

    y1 = h11*x1 + h21*x2
    y2 = h12*x1 + h22*x2

How to decode overlapped frames?
● By applying Interference Alignment

[Figure: signal components of y1 and y2 before alignment, and of y1 and R*y2 after alignment]

Decode Procedure
● Use the clean preamble of the first-arrived frame, Pa, of the two overlapped frames to align the signals
● Subtract the two aligned signals to obtain a signal that carries only the second frame, Pb, and decode Pb
● Re-encode Pb and cancel it from the original signal to obtain a signal that carries only Pa, then decode Pa

DSP for 802.11b PHY layer
● Transmission (from MAC): Scramble → QPSK Mod → DS-SS → Up sampling
● Reception (to MAC): Down sampling → DS-SS Decode → QPSK Demod → De-Scramble
● Data rates labeled along each chain, from the MAC side to the sampled side: 2 Mbps, 2 Mbps, 32 Mbps, 352 Mbps, 1.4 Gbps
● Reception of overlapped frames: Decode Pb → Regenerate Pb → Decode Pa

Processing Requirement
● High system throughput
  – 1.4 Gbps
● High computation intensity
  – At N operations per bit, 1.4 × N G operations per second are required
● Real-time requirement

Sora Platform
● The radio control board has a maximum throughput of PCIe x32, which is 64 Gbps
● Sora software supports
  – Multi-core processing
  – Intel SSE
● Therefore, in this project, I focus on applying multi-core scheduling and SIMD to optimize the processing speed

Channel Equalizer for Interference Alignment
● To reduce the complexity, I choose an MMSE equalizer, which is a sub-optimal linear filter. It minimizes the difference between the output signals and the known transmitted training signals
  – $c = \arg\min_{c} \lVert x - c \ast r \rVert$

[Figure: block diagram with Received Signal, Linear FIR Filter, Coefficient Training and Equalized Signal]

Algorithm to compute c

    // Initialize {c[j]}:
    c[0] = 1
    for each j <> 0
        c[j] = 0
    end for

    // Training:
    for each sample index i do
        for k = 0 .. K - 1 do
            x_est[i+k] = 0
            for l = 0 .. L do
                x_est[i+k] = x_est[i+k] + c[l] * y[i+k-l]
            end for
            err[k] = x[i+k] - x_est[i+k]

            for j = 0 .. L do
                d[j] = 0
            end for
            w = 0
            for j = 0 .. L do
                d[j] = d[j] + step * err[k] * conj(y[i+k-j])
                w = w + |y[i+k-j]|^2
            end for
        end for
        for j = 0 .. L do
            c[j] = c[j] - d[j] / w
        end for
    end for

Dynamic Range Analysis
● Inputs: x and y, 16 bits × 2 per sample
● Outputs: c, 16 bits per coefficient, 16 taps

    Variable   Data type     Dynamic range   Additional        Total register
                             (bits)          fractional bits   length (bits)
    {c_j}      fixed point   16              -                 16
    {x_i}      integer       16              -                 16
    {y_i}      integer       16              -                 16
    {x_est}    fixed point   20              1                 21
    {err}      fixed point   21              1                 22
    {d}        fixed point   34              1                 35
    {w}        integer       36              0                 36

FIR filter design with SIMD
● The FIR filter is used in both the training and equalizing stages. It is described as:
  $y_n = \sum_{i=0}^{L} c_i x_{n-i}$
● Intel SSE supports 128-bit packed vectors and each FIR sample takes 32 bits, so 4 calculations can be performed simultaneously.

FIR filter design (cont.)

     0     0     0     c0    c1    c2    c3    c4    c5    c6    c7    ...    output
     x18   x17   x16   x15  (x14   x13   x12   x11) (x10   x9    x8    ...    y15
           x18   x17   x16   x15  (x14   x13   x12   x11) (x10   x9    ...    y16
                 x18   x17   x16   x15  (x14   x13   x12   x11) (x10   ...    y17
                       x18   x17   x16   x15  (x14   x13   x12   x11)  ...    y18

FIR filter design (cont.)
● Memory layout of the FIR filter coefficients

    0     0     0     c0
    0     0     c0    c1
    0     c0    c1    c2
    c0    c1    c2    c3
    c1    c2    c3    c4
    ....
    c12   c13   c14   c15
    c13   c14   c15   0
    c14   c15   0     0
    c15   0     0     0

FIR filter design (cont.)
● SSE2 code

    // Load four 32-bit samples
    movdqa xmm0, [esi]

    // Compute the four results with the first four rows of the FIR filter coefficient table
    mov    edx, Coff          // reset coefficient index
    mov    edi, Buff          // reset temporary accumulation buffer index
    movdqa xmm1, xmm0
    pmullw xmm1, [edx]        // edx is the coefficient index
    paddd  xmm1, [edi]        // edi is the temporary accumulation buffer index
    movdqa xmm2, xmm0
    pmullw xmm2, [edx+4]
    paddd  xmm2, [edi+4]
    movdqa xmm3, xmm0
    pmullw xmm3, [edx+8]
    paddd  xmm3, [edi+8]
    movdqa xmm4, xmm0
    pmullw xmm4, [edx+12]
    paddd  xmm4, [edi+12]

FIR filter design (cont.)

    // Extract the outputs from the four registers and pack them into a single 128-bit output
    paddd  xmm1, [ecx]        // ecx is the mask index
    paddd  xmm2, [ecx+4]
    paddd  xmm3, [ecx+8]
    paddd  xmm4, [ecx+12]
    paddd  xmm1, xmm2
    paddd  xmm1, xmm3
    paddd  xmm1, xmm4
    movdqa [ebx], xmm1        // ebx is the output memory address

FIR filter design (cont.)

    movdqa xmm1, [edi]
    // Multiply the samples by each of the remaining rows of the FIR filter coefficient table
    // and update the temporary accumulation buffer
    mov    eax, 19            // set total number of iterations
    next_row:
    movdqa xmm1, xmm0
    pmullw xmm1, [edx+16]
    paddd  xmm1, [edi+16]     // edi is the temporary accumulation buffer index
    movdqa [edi], xmm1        // store the temporary accumulated result
    add    edx, 16            // next coefficient index
    add    edi, 32            // next temporary accumulation buffer index
    dec    eax
    jnz    next_row

Multi-core Scheduling
● Still working on it …

Conclusion
● Spatial diversity exploitation is a good additional technique to increase the capacity in LAN multi-access
● Although diversity gain implies an increase in computational complexity, it can be implemented even with general purpose
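
Worked example for the Decode Procedure slide. The three steps (align, subtract and decode Pb, re-encode and cancel to decode Pa) can be sketched numerically for the two-antenna model y1 = h11*x1 + h21*x2, y2 = h12*x1 + h22*x2. This is a minimal sketch under assumptions that are not in the slides: flat (single-tap) channels that are already known, no noise, and a hypothetical QPSK mapping. The names `QPSK`, `slice_qpsk` and `decode_overlapped` are illustrative only. In the slides the alignment is derived from the clean preamble of Pa rather than from known channel values.

```python
# Hypothetical unit-energy QPSK constellation (the mapping is an assumption).
S = 2 ** -0.5
QPSK = [complex(S, S), complex(-S, S), complex(-S, -S), complex(S, -S)]


def slice_qpsk(sym):
    """Hard decision: return the constellation point nearest to sym."""
    return min(QPSK, key=lambda p: abs(sym - p))


def decode_overlapped(y1, y2, h11, h12, h21, h22):
    """Recover (x1, x2) from y1 = h11*x1 + h21*x2 and y2 = h12*x1 + h22*x2.

    x1 plays the role of the first-arrived frame Pa, x2 the role of Pb.
    """
    # Step 1 (align): scale y2 by R so that its x1 term equals the one in y1.
    R = h11 / h12
    x1_hat, x2_hat = [], []
    for a, b in zip(y1, y2):
        # Step 2 (subtract): the x1 terms cancel, leaving (h21 - R*h22) * x2.
        residual = a - R * b
        s2 = slice_qpsk(residual / (h21 - R * h22))
        # Step 3 (re-encode and cancel): remove h21*x2 from y1, leaving h11*x1.
        s1 = slice_qpsk((a - h21 * s2) / h11)
        x1_hat.append(s1)
        x2_hat.append(s2)
    return x1_hat, x2_hat


# Small demonstration with arbitrary (assumed) channel values.
h11, h12, h21, h22 = 0.9 + 0.2j, 0.4 - 0.7j, -0.3 + 0.8j, 1.1 + 0.1j
x1 = [QPSK[0], QPSK[2], QPSK[1], QPSK[3]]
x2 = [QPSK[3], QPSK[3], QPSK[0], QPSK[1]]
y1 = [h11 * a + h21 * b for a, b in zip(x1, x2)]
y2 = [h12 * a + h22 * b for a, b in zip(x1, x2)]
x1_hat, x2_hat = decode_overlapped(y1, y2, h11, h12, h21, h22)
print(x1_hat == x1, x2_hat == x2)
```

The subtraction only works when h21 - R*h22 is non-zero, that is, when the two stations' channels to the antenna pair are not proportional. With multipath channels the scalar R would presumably be replaced by the FIR equalizer described on the Channel Equalizer slide.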
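
Worked example for the FIR filter design slides. The coefficient table layout (rows "0 0 0 c0", "0 0 c0 c1", ..., "c15 0 0 0") can be checked with a scalar reference model. This is a sketch rather than the SSE2 implementation: it assumes each group of four samples is stored newest first, as in the diagram (x18 x17 x16 x15), that the four lane products of a row are summed into one accumulator entry, and that the input length is a multiple of 4. The function names are hypothetical.

```python
def build_coeff_table(c):
    """Return the sliding rows of 4 coefficients, zero-padded at both ends.

    For 16 taps this yields 19 rows: [0,0,0,c0], [0,0,c0,c1], ..., [c15,0,0,0].
    """
    padded = [0, 0, 0] + list(c) + [0, 0, 0]
    return [padded[r:r + 4] for r in range(len(c) + 3)]


def fir_direct(c, x):
    """Reference FIR: y[n] = sum_i c[i] * x[n - i], with x[m] = 0 for m < 0."""
    return [
        sum(c[i] * x[n - i] for i in range(len(c)) if n - i >= 0)
        for n in range(len(x))
    ]


def fir_blocked(c, x):
    """FIR computed four samples at a time using the coefficient table."""
    if len(x) % 4 != 0:
        raise ValueError("sketch assumes the input length is a multiple of 4")
    table = build_coeff_table(c)
    # Accumulation buffer, playing the role of the temporary buffer in the slides.
    acc = [0] * (len(x) + len(c) - 1)
    for n0 in range(0, len(x), 4):
        # One 4-sample vector, newest sample first: (x[n0+3], ..., x[n0]).
        block = [x[n0 + 3], x[n0 + 2], x[n0 + 1], x[n0]]
        for r, row in enumerate(table):
            # Row r of the table contributes to output y[n0 + r].
            acc[n0 + r] += sum(a * b for a, b in zip(row, block))
    return acc[:len(x)]


coeffs = list(range(1, 17))                      # 16 illustrative taps
samples = [(7 * k) % 11 - 5 for k in range(32)]  # arbitrary integer samples
print(fir_blocked(coeffs, samples) == fir_direct(coeffs, samples))
```

Each 4-sample vector is combined with every row of the table, so one load of the samples updates 19 partial outputs. In this model, the first four rows complete the four outputs aligned with the current vector, and the remaining rows carry partial sums forward to later outputs.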