Contents
- Compiling for Communication Processors
- Background and Collaborators
- Outline of Talk
- Intel eXchange Architecture®
- IXP Processing Element (PE)
- Cisco Toaster
- POS IPv4 DiffServ Forwarding
- Network Processor Performance
- Dimensions of Concurrent Memory Access in a Network Processor
- Slides 10-20 (untitled)
- IXP Compiler “Autopartitioning” Mode
- Mixed-Mode Development
- Packet Processing Stage (PPS)
- Pipes
- Example: IPv4 Receive PPS
- Example: Pipe Usage
- Performance Specification
- Example: Performance Specification
- Auto-Partitioning Algorithm
- IXP C Compiler: Pruning the Search Space
- Context Pipelining
- Context Pipelining (Flow Graph)
- Context Pipelining Algorithm
- Software Pipelining
- CAM
- Memory Access Vectorization
- Accessing Multiple Channels
- Slide 38 (untitled)
- Demo Application: IPv4 Forwarding
- MicroEngine Allocation
- The Rx Stage: Source Code
- The Receive (Rx) Stage: Generated Microcode
- Autopartitioning Exploration

Slide 1: Compiling for Communication Processors
Luddy Harrison

Slide 2: Background and Collaborators
- Most of these compiler ideas are embodied in the next-generation compiler for Intel's IXA®.
- My collaborators on IXP C were: Prashant Chandra (Intel), Bo Huang (Intel), Jason Dai (Intel), Paul Li (Intel).

Slide 3: Outline of Talk
- Architecture of two (quite distinct) network processors: the Intel IXP and the Cisco Toaster
- Architectural support for concurrent memory access
- Corresponding programming mechanisms to realize that concurrency
- Comparison of these mechanisms on the IXP and the Toaster
- Intel's autopartitioning IXP compiler: architecture of the compiler, major transformations, and exploration of compilation alternatives

Slide 4: Intel eXchange Architecture®
Diagram: an array of PEs connected to SRAM 1, SRAM 2, and DRAM channels, plus MSF, hash, CRC, and PCI units; adjacent PEs are coupled by nearest-neighbor registers and signals.

Slide 5: IXP Processing Element (PE)
Diagram: one PE holds eight hardware contexts (0-7) that share an ALU, local memory, instruction memory, and a CAM.

Slide 6: Cisco Toaster
Diagram: a 3x3 array of PEs, with an SRAM channel per column.

Slide 7: POS IPv4 DiffServ Forwarding
In ASM, stages 1 and 2 are further divided into sub-stages.
Diagram blocks: Pkt Rx, Rx PPS, Csix Tx, QM, Csix Sch, WRED, Tx PPS, QM
PPS, Sch PPS, Diffserv PPS, SRTCM Meter, IPv4 Fwd, 6-tuple Classifier

Slide 8: Network Processor Performance
- Throughput / bandwidth: packets in / out per second
- The latency of a packet through the system is immaterial.
- Increasing effective memory bandwidth is, roughly speaking, the sole aim of the architecture.

Slide 9: Dimensions of Concurrent Memory Access in a Network Processor
- Multiple memory channels, accessible from a single PE (instruction scheduling) or from multiple PEs (context pipelining)
- Pipelined memory channel (multithreading)
- Wide memory accesses (vectorization)
- Reuse of dependent accesses and concurrent issue of independent accesses (CAM)

Slide 10: Pipelined Memory Controller / Multithreading
Diagram: loads to channel X for packets 1-4 are issued back to back, each followed later by the use of its result. Here the concurrency is within a single memory controller (multiple accesses are pipelined).

Slide 11: Multithreading à la IXP
Threads 1-3 each run the same sequence:
  R1 = R2 + R3
  T1 = LOAD R1 (signal 1)
  R4 = R5 + R6
  R7 = R8 - R9
  WAIT (signal 1)
  R10 = T1 + R4
  ...
The WAIT spans the time for the LOAD to complete; while one thread waits, the others execute. (All signals and registers are context-relative.)

Slide 12: Multithreading à la Toaster
Rows 1-3 each run the same sequence:
  R1 = R2 + R3
  T1 = LOAD R1
  R4 = R5 + R6
  R7 = R8 - R9
  R1 = T1 + R4
The machine is synchronous; there are no hardware interlocks. Here, the rows act as threads.

Slide 13: Multiple Memory Controllers / Context Pipelining
Diagram: PEs 1-4 load from channels A-D respectively, and packet data is pipelined from PE to PE over time (pkt 1 reaches PE 1 first, then PE 2, and so on). Here the concurrency is over different memory controllers. Note that the loads for a single packet need not be overlapped.

Slide 14: load channel X (packet 1), load channel Y
(packet 1), load channel Z (packet 1) — each load followed by its use.
This illustrates multiple memory controllers / instruction scheduling: here, the concurrency is between different memory controllers accessible by a single processing element. The loads for a single packet must be overlapped to realize this kind of parallelism.

Slide 15: Context Pipelining and Multithreading on the Toaster
Diagram: the PE/SRAM grid of slide 6 again, with context pipelining along one axis of the array and multithreading along the other.

Slide 16: Memory Access Vectorization
  x = p->a;       // load 32
  ...
  y = p->b;       // load 32
becomes
  x_y = p->a_b;   // load >= 64
  IF x = p->a; y = p->b;  — unconditional; same bandwidth utilization
  IF x_y = p->a_b;        — speculative; increased bandwidth utilization

Slide 17: Content-Addressable Memory à la IXP
Diagram: keys 1-N are associated with local-memory (LM) indices 1-N; LM index i selects line i of local memory, and entries are replaced LRU. The line size and format are determined by software.

Slide 18: Without CAM
Each packet performs, inside a critical section:
  R = load A        (load latency)
  S = modify(R)
  store A, R        (store latency)
The load and store latencies are serialized inside the critical section, packet after packet.

Slide 19: CAM — Hit Case
  i = lookup A (miss)
  L = load A
  LM[i] = modify L
  i = lookup A (hit)
  L = LM[i]
  LM[i] = modify L
  i = lookup A (hit)
  L = LM[i]
  LM[i] = modify L
  store A, LM[i]
Like a cache, this permits multiple logical accesses to the same location to share a single physical access. We pay for one load and one store. The modifications must be sequenced, of course.

Slide 20: CAM — Miss Case
  i = lookup A (miss)
  L = load A
  i = lookup B (miss)
  L = load B
  i = lookup C (miss)
  L = load C
  LM[i] = modify L
  store A, LM[i]
Here we are able to overlap all loads and stores!
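The bookkeeping behind the hit and miss cases can be modeled in plain C. This is a sketch only: `cam_t`, `cam_lookup`, and `NLINES` are invented names standing in for the IXP's hardware CAM and local memory, not real microcode constructs. The model counts the physical loads and stores actually issued.

```c
#include <assert.h>
#include <string.h>

/* Minimal C model of the CAM + local-memory pattern (illustrative names). */

#define NLINES 4          /* CAM entries / local-memory lines */

typedef struct {
    unsigned key[NLINES]; /* CAM tags (memory addresses)       */
    int      valid[NLINES];
    unsigned lm[NLINES];  /* one local-memory word per line    */
    int      lru[NLINES]; /* larger value = more recently used */
    int      clock;
    int      loads;       /* physical loads issued             */
    int      stores;      /* physical stores issued            */
} cam_t;

/* Return the local-memory index for address 'key'.  A hit reuses the
   cached line; a miss evicts the LRU line (writing it back if valid)
   and issues exactly one physical load. */
static int cam_lookup(cam_t *c, unsigned key, unsigned mem[])
{
    int i, victim = 0;

    for (i = 0; i < NLINES; i++)
        if (c->valid[i] && c->key[i] == key) {  /* hit: no memory access */
            c->lru[i] = ++c->clock;
            return i;
        }
    for (i = 1; i < NLINES; i++)                /* miss: choose LRU victim */
        if (c->lru[i] < c->lru[victim])
            victim = i;
    if (c->valid[victim]) {                     /* write back the evictee */
        mem[c->key[victim]] = c->lm[victim];
        c->stores++;
    }
    c->key[victim]   = key;
    c->valid[victim] = 1;
    c->lm[victim]    = mem[key];                /* the one physical load */
    c->loads++;
    c->lru[victim]   = ++c->clock;
    return victim;
}
```

In this model, three read-modify-writes of the same address (the hit case) cost a single load, while lookups of three distinct addresses (the miss case) each issue an independent load that the hardware is free to overlap.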
The CAM is being used to break dependences. The remaining steps overlap in the same way:
  LM[i] = modify L
  store B, LM[i]
  LM[i] = modify L
  store C, LM[i]

Slide 21: IXP Compiler “Autopartitioning” Mode
- "Whole-chip" programming
- An application is made up of "packet processing stages" (PPS) written in IXP C
- Each PPS has an associated performance specification
- The compiler handles the mapping of a PPS to MEs / the XScale
- Standard, sequential C semantics
- The compiler manages multithreading and synchronization
- The compiler manages low-level hardware resources
- User-managed resources are presented as a library (e.g., PCI)
Diagram: PPS0.c/.h through PPSm.c/.h, each with its performance spec, feed the autopartitioning compiler, then the assembler and linker, targeting ME0 through MEn and the XScale.

Slide 22: Mixed-Mode Development
Diagram: IXP2xxx block diagram — RDRAM controller, Intel® XScale™ core, media switch fabric interface, PCI, QDR SRAM controller, scratch memory, hash unit, multi-threaded (x8) microengine array (MEv2 1-16), per-engine memory, CAM,
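As a rough illustration of the programming model slide 21 describes — sequential C stage bodies connected by pipes, with the compiler (not the programmer) responsible for mapping each stage onto microengine threads — here is a hypothetical plain-C sketch. The `pipe_t` type and the `pipe_put`/`pipe_get`/`ttl_pps` names are all invented stand-ins, not IXP C syntax.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical stand-in for an IXP C pipe: a small ring of 32-bit words. */
typedef struct {
    uint32_t data[64];
    int head, tail;       /* ring indices; capacity 64 entries */
} pipe_t;

static void pipe_put(pipe_t *p, uint32_t v)
{
    p->data[p->tail++ & 63] = v;
}

static int pipe_get(pipe_t *p, uint32_t *v)
{
    if (p->head == p->tail) return 0;   /* pipe empty */
    *v = p->data[p->head++ & 63];
    return 1;
}

/* A toy PPS with ordinary sequential semantics: treat the top byte of
   each 32-bit "packet" as a TTL; decrement and forward, or drop. */
static void ttl_pps(pipe_t *in, pipe_t *out)
{
    uint32_t pkt;
    while (pipe_get(in, &pkt)) {
        uint32_t ttl = pkt >> 24;
        if (ttl > 1)
            pipe_put(out, ((ttl - 1) << 24) | (pkt & 0x00ffffffu));
        /* else: TTL expired, drop the packet */
    }
}
```

The point of the autopartitioning mode is that a stage body like `ttl_pps` is written once with standard C semantics; replicating it across contexts and inserting the synchronization is the compiler's job.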