Contents
- Compiling for Communication Processors
- Background and Collaborators
- Outline of Talk
- Intel eXchange Architecture®
- IXP Processing Element (PE)
- Cisco Toaster
- POS IPv4 DiffServ Forwarding
- Network Processor Performance
- Dimensions of Concurrent Memory Access in a Network Processor
- Slides 10-20 (untitled)
- IXP Compiler “Autopartitioning” Mode
- Mixed-Mode Development
- Packet Processing Stage (PPS)
- Pipes
- Example: IPv4 Receive PPS
- Example: Pipe Usage
- Performance Specification
- Example: Performance Specification
- Auto-Partitioning Algorithm
- IXP C Compiler: Pruning the Search Space
- Context Pipelining
- Context Pipelining (Flow Graph)
- Context Pipelining Algorithm
- Software Pipelining
- CAM
- Memory Access Vectorization
- Accessing Multiple Channels
- Slide 38 (untitled)
- Demo Application: IPv4 Forwarding
- MicroEngine Allocation
- The Rx Stage: Source Code
- The Receive (Rx) Stage: Generated Microcode
- Autopartitioning Exploration

Slide 1: Compiling for Communication Processors
Luddy Harrison

Slide 2: Background and Collaborators
- Most of these compiler ideas are embodied in the next-generation compiler for Intel's IXA®.
- My collaborators on IXP C were: Prashant Chandra (Intel), Bo Huang (Intel), Jason Dai (Intel), Paul Li (Intel).

Slide 3: Outline of Talk
- Architecture of two (quite distinct) network processors: the Intel IXP and the Cisco Toaster
- Architectural support for concurrent memory access
- Corresponding programming mechanisms to realize that concurrency
- Comparison of these mechanisms on the IXP and the Toaster
- Intel's autopartitioning IXP compiler: architecture of the compiler, major transformations, and exploration of compilation alternatives

Slide 4: Intel eXchange Architecture®
Diagram: an array of PEs connected to SRAM 1, SRAM 2, and DRAM channels, plus MSF, hash, CRC, and PCI units; adjacent PEs are coupled by nearest-neighbor registers and signals.

Slide 5: IXP Processing Element (PE)
Diagram: one PE holds eight hardware contexts (0-7) that share an ALU, local memory, instruction memory, and a CAM.

Slide 6: Cisco Toaster
Diagram: a 3x3 array of PEs, with an SRAM channel per column.

Slide 7: POS IPv4 DiffServ Forwarding
In ASM, stages 1 and 2 are further divided into sub-stages.
Diagram blocks: Pkt Rx, Rx PPS, Csix Tx, QM, Csix Sch, WRED, Tx PPS, QM
PPS, Sch PPS, Diffserv PPS, SRTCM Meter, IPv4 Fwd, 6-tuple Classifier

Slide 8: Network Processor Performance
- Throughput / bandwidth: packets in / out per second
- The latency of a packet through the system is immaterial.
- Increasing effective memory bandwidth is, roughly speaking, the sole aim of the architecture.

Slide 9: Dimensions of Concurrent Memory Access in a Network Processor
- Multiple memory channels, accessible from a single PE (instruction scheduling) or from multiple PEs (context pipelining)
- Pipelined memory channel (multithreading)
- Wide memory accesses (vectorization)
- Reuse of dependent accesses and concurrent issue of independent accesses (CAM)

Slide 10: Pipelined Memory Controller / Multithreading
Diagram: loads to channel X for packets 1-4 are issued back to back, each followed later by the use of its result. Here the concurrency is within a single memory controller (multiple accesses are pipelined).

Slide 11: Multithreading à la IXP
Threads 1-3 each run the same sequence:
  R1 = R2 + R3
  T1 = LOAD R1 (signal 1)
  R4 = R5 + R6
  R7 = R8 - R9
  WAIT (signal 1)
  R10 = T1 + R4
  ...
The WAIT spans the time for the LOAD to complete; while one thread waits, the others execute. (All signals and registers are context-relative.)

Slide 12: Multithreading à la Toaster
Rows 1-3 each run the same sequence:
  R1 = R2 + R3
  T1 = LOAD R1
  R4 = R5 + R6
  R7 = R8 - R9
  R1 = T1 + R4
The machine is synchronous; there are no hardware interlocks. Here, the rows act as threads.

Slide 13: Multiple Memory Controllers / Context Pipelining
Diagram: PEs 1-4 load from channels A-D respectively, and packet data is pipelined from PE to PE over time (pkt 1 reaches PE 1 first, then PE 2, and so on). Here the concurrency is over different memory controllers. Note that the loads for a single packet need not be overlapped.

Slide 14: load channel X (packet 1), load channel Y
(packet 1), load channel Z (packet 1) — each load followed by its use.
This illustrates multiple memory controllers / instruction scheduling: here, the concurrency is between different memory controllers accessible by a single processing element. The loads for a single packet must be overlapped to realize this kind of parallelism.

Slide 15: Context Pipelining and Multithreading on the Toaster
Diagram: the PE/SRAM grid of slide 6 again, with context pipelining along one axis of the array and multithreading along the other.

Slide 16: Memory Access Vectorization
  x = p->a;       // load 32
  ...
  y = p->b;       // load 32
becomes
  x_y = p->a_b;   // load >= 64
  IF x = p->a; y = p->b;  — unconditional; same bandwidth utilization
  IF x_y = p->a_b;        — speculative; increased bandwidth utilization

Slide 17: Content-Addressable Memory à la IXP
Diagram: keys 1-N are associated with local-memory (LM) indices 1-N; LM index i selects line i of local memory, and entries are replaced LRU. The line size and format are determined by software.

Slide 18: Without CAM
Each packet performs, inside a critical section:
  R = load A        (load latency)
  S = modify(R)
  store A, R        (store latency)
The load and store latencies are serialized inside the critical section, packet after packet.

Slide 19: CAM — Hit Case
  i = lookup A (miss)
  L = load A
  LM[i] = modify L
  i = lookup A (hit)
  L = LM[i]
  LM[i] = modify L
  i = lookup A (hit)
  L = LM[i]
  LM[i] = modify L
  store A, LM[i]
Like a cache, this permits multiple logical accesses to the same location to share a single physical access. We pay for one load and one store. The modifications must be sequenced, of course.

Slide 20: CAM — Miss Case
  i = lookup A (miss)
  L = load A
  i = lookup B (miss)
  L = load B
  i = lookup C (miss)
  L = load C
  LM[i] = modify L
  store A, LM[i]
Here we are able to overlap all loads and stores!
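The bookkeeping behind the hit and miss cases can be modeled in plain C. This is a sketch only: `cam_t`, `cam_lookup`, and `NLINES` are invented names standing in for the IXP's hardware CAM and local memory, not real microcode constructs. The model counts the physical loads and stores actually issued.

```c
#include <assert.h>
#include <string.h>

/* Minimal C model of the CAM + local-memory pattern (illustrative names). */

#define NLINES 4          /* CAM entries / local-memory lines */

typedef struct {
    unsigned key[NLINES]; /* CAM tags (memory addresses)       */
    int      valid[NLINES];
    unsigned lm[NLINES];  /* one local-memory word per line    */
    int      lru[NLINES]; /* larger value = more recently used */
    int      clock;
    int      loads;       /* physical loads issued             */
    int      stores;      /* physical stores issued            */
} cam_t;

/* Return the local-memory index for address 'key'.  A hit reuses the
   cached line; a miss evicts the LRU line (writing it back if valid)
   and issues exactly one physical load. */
static int cam_lookup(cam_t *c, unsigned key, unsigned mem[])
{
    int i, victim = 0;

    for (i = 0; i < NLINES; i++)
        if (c->valid[i] && c->key[i] == key) {  /* hit: no memory access */
            c->lru[i] = ++c->clock;
            return i;
        }
    for (i = 1; i < NLINES; i++)                /* miss: choose LRU victim */
        if (c->lru[i] < c->lru[victim])
            victim = i;
    if (c->valid[victim]) {                     /* write back the evictee */
        mem[c->key[victim]] = c->lm[victim];
        c->stores++;
    }
    c->key[victim]   = key;
    c->valid[victim] = 1;
    c->lm[victim]    = mem[key];                /* the one physical load */
    c->loads++;
    c->lru[victim]   = ++c->clock;
    return victim;
}
```

In this model, three read-modify-writes of the same address (the hit case) cost a single load, while lookups of three distinct addresses (the miss case) each issue an independent load that the hardware is free to overlap.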
The CAM is being used to break dependences. The remaining steps overlap in the same way:
  LM[i] = modify L
  store B, LM[i]
  LM[i] = modify L
  store C, LM[i]

Slide 21: IXP Compiler “Autopartitioning” Mode
- "Whole-chip" programming
- An application is made up of "packet processing stages" (PPS) written in IXP C
- Each PPS has an associated performance specification
- The compiler handles the mapping of a PPS to MEs / the XScale
- Standard, sequential C semantics
- The compiler manages multithreading and synchronization
- The compiler manages low-level hardware resources
- User-managed resources are presented as a library (e.g., PCI)
Diagram: PPS0.c/.h through PPSm.c/.h, each with its performance spec, feed the autopartitioning compiler, then the assembler and linker, targeting ME0 through MEn and the XScale.

Slide 22: Mixed-Mode Development
Diagram: IXP2xxx block diagram — RDRAM controller, Intel® XScale™ core, media switch fabric interface, PCI, QDR SRAM controller, scratch memory, hash unit, multi-threaded (x8) microengine array (MEv2 1-16), per-engine memory, CAM,
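As a rough illustration of the programming model slide 21 describes — sequential C stage bodies connected by pipes, with the compiler (not the programmer) responsible for mapping each stage onto microengine threads — here is a hypothetical plain-C sketch. The `pipe_t` type and the `pipe_put`/`pipe_get`/`ttl_pps` names are all invented stand-ins, not IXP C syntax.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical stand-in for an IXP C pipe: a small ring of 32-bit words. */
typedef struct {
    uint32_t data[64];
    int head, tail;       /* ring indices; capacity 64 entries */
} pipe_t;

static void pipe_put(pipe_t *p, uint32_t v)
{
    p->data[p->tail++ & 63] = v;
}

static int pipe_get(pipe_t *p, uint32_t *v)
{
    if (p->head == p->tail) return 0;   /* pipe empty */
    *v = p->data[p->head++ & 63];
    return 1;
}

/* A toy PPS with ordinary sequential semantics: treat the top byte of
   each 32-bit "packet" as a TTL; decrement and forward, or drop. */
static void ttl_pps(pipe_t *in, pipe_t *out)
{
    uint32_t pkt;
    while (pipe_get(in, &pkt)) {
        uint32_t ttl = pkt >> 24;
        if (ttl > 1)
            pipe_put(out, ((ttl - 1) << 24) | (pkt & 0x00ffffffu));
        /* else: TTL expired, drop the packet */
    }
}
```

The point of the autopartitioning mode is that a stage body like `ttl_pps` is written once with standard C semantics; replicating it across contexts and inserting the synchronization is the compiler's job.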