Stanford University
EE482C Advanced Computer Architecture & Organization
Project Report: Multi-Node Programming
June 6, 2002
Henry FU, Harn Hua NG, Yeow Cheng ONG

1. Project goals

In this project, methods for mapping stream programs onto multiple stream-processing nodes are developed and evaluated. Specifically, these methods are used to partition data and/or instructions across the nodes, to communicate data and state information that coordinate the processors, and to perform load balancing.

1.1 Application

The example chosen for this project is IP packet routing, specifically longest-prefix address matching. This operation is commonly used in Internet routers to assign each packet the next-hop address along its path to its final destination.

IP Packet: Destination Address ⇒ Routing Table Entry: Network Address | Network Mask | Next-Hop Address

Figure 1a. Mask and Match to Find Corresponding Next-Hop Address

1.2 Findings

IP address longest-prefix matching is a suitable application for illustrating multi-node stream processing, because packet traffic lends itself to data-level, thread-level and instruction-level parallelism. For pipelined configurations, the speedup increases linearly with the number of processors. For parallel configurations, synchronization and communication overheads cause the rate of increase of the speedup to fall off as the number of processors grows.

2. Implementation

2.1 Metric

The execution time of a single Stream Processor configuration is compared against that of a multi-node configuration. Speedup is plotted against the number of Imagines in the configuration to show the effect on execution time; it is computed as the single-Imagine cycle count divided by the average per-Imagine cycle count of the multi-node configuration.

2.2 Setup

To implement the IP routing application, the development environment provided by the Imagine Stream Processor Project at Stanford University is used. The code is written in StreamC and KernelC and simulated in the IDebug simulator. A performance-measurement feature known as profiling is used to record execution times in cycle counts.

Consider the baseline case in Figure 2a, where only one host process and one Imagine are used. In this configuration there is only one parallel lane and one pipeline stage: all routing table entries and all data packets are given to a single Imagine processing node. All multi-node performance results are normalized to this baseline case. Both parallel and pipelined configurations are tested for each subsequent node count.

Figure 2a. Baseline configuration

The multi-node pipelined versions include two Imagines with two pipeline stages, as shown in Figure 2b, where the routing table is split in two, half going to each Imagine. Both Imagines process the destination addresses in the packet traffic before the correct next-hop address is computed. The case with four Imagines and four pipeline stages is shown in Figure 2c, where the routing table entries are split into four parts.

Figure 2b. 2 Imagines, 2 pipeline stages
Figure 2c. 4 Imagines, 4 pipeline stages

The parallel versions of the multi-node configuration include two Imagines with two parallel lanes (Figure 2d) and four Imagines with four parallel lanes (Figure 2e), where the packet traffic is split into two and four parts respectively. In the parallel versions, the output data from all the Imagines are sent to the last Imagine, which combines them before producing the final output.

Figure 2d. 2 Imagines, 2 parallel lanes
Figure 2e. 4 Imagines, 4 parallel lanes

Our code was divided into the following modules:
• (StreamC) Main function on the host processor
• (KernelC) Address-matching kernel
• (KernelC) Kernel to combine outputs from other Imagines in a parallel configuration

2.3 Test Data

Routing table entries are captured from a major ISP router, ner-routes.bbnplanet.net, from which 119,967 routing table entries are obtained. A subset of these entries is randomly selected as test data. The corresponding result files are generated by a program written in C, for easy comparison with the output produced by the Imagine application.

2.4 Longest Matching Prefix Algorithm

In the kernel, a batch of eight data packets is given to the eight clusters present in one Imagine processor, i.e. one packet per cluster. The kernel first loops through all the routing table entries, logically left-shifting each mask until it becomes zero to determine that entry's mask length. Routing table entries, together with their mask lengths, are then communicated to all eight clusters in batches of eight, so that each cluster sees every entry. The longest prefix match is then found using the following logic:

    match = (packet destination address AND network mask) XOR (routing table entry's network address)

If the resulting match is zero, the entry's mask length is compared to the longest matching mask length recorded so far for that destination address; if the new mask length is greater, the entry's next hop replaces the previously recorded next-hop address. At the end of the routing table stream, a stream of length eight is output, containing the longest mask length and next-hop address for each of the eight data packets. This stream is then passed as input to the next node in line, so the very last Imagine in the line holds the correct next-hop addresses for the eight data packets.

Figure 2f. Mask and Match Logic
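As a concrete illustration of the mask-and-match logic above, here is a minimal scalar C sketch. The report's actual kernel is written in KernelC and processes eight packets at once (one per cluster); this single-packet version, and names such as route_entry and lpm_lookup, are our own illustrative assumptions, not code from the report.

    #include <stdint.h>
    #include <stdio.h>

    /* One routing table entry: network address, network mask, next hop. */
    typedef struct {
        uint32_t network;   /* network address */
        uint32_t mask;      /* network mask, e.g. 0xFFFF0000 for /16 */
        uint32_t next_hop;  /* next-hop address */
    } route_entry;

    /* Mask length: logically left-shift the mask until it becomes zero,
     * counting the shifts, as the kernel does. */
    static int mask_length(uint32_t mask)
    {
        int len = 0;
        while (mask != 0) {
            mask <<= 1;
            len++;
        }
        return len;
    }

    /* Longest-prefix match over the table:
     * match = (dst AND mask) XOR network; zero means a hit.
     * Among hits, keep the entry with the greatest mask length. */
    static uint32_t lpm_lookup(uint32_t dst, const route_entry *table,
                               int n, uint32_t default_hop)
    {
        int best_len = -1;
        uint32_t best_hop = default_hop;
        for (int i = 0; i < n; i++) {
            uint32_t match = (dst & table[i].mask) ^ table[i].network;
            int len = mask_length(table[i].mask);
            if (match == 0 && len > best_len) {
                best_len = len;
                best_hop = table[i].next_hop;
            }
        }
        return best_hop;
    }

    int main(void)
    {
        /* Illustrative two-entry table (addresses are made up). */
        route_entry table[] = {
            { 0xC0A80000u, 0xFFFF0000u, 0x0A000001u },  /* 192.168.0.0/16 -> 10.0.0.1 */
            { 0xC0A80100u, 0xFFFFFF00u, 0x0A000002u },  /* 192.168.1.0/24 -> 10.0.0.2 */
        };
        uint32_t dst = 0xC0A80107u;                      /* 192.168.1.7 */
        /* Both entries match, but the /24 is longer:
         * prints "next hop: 0x0A000002" (10.0.0.2). */
        printf("next hop: 0x%08X\n", lpm_lookup(dst, table, 2, 0));
        return 0;
    }

Shifting the mask left until it reaches zero yields the prefix length because a network mask is a contiguous run of ones starting at the most significant bit; the XOR result is zero exactly when the masked destination equals the entry's network address.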
3. Results & Findings

Pipelined Algorithm Execution Time (cycles)

# Packets  # Entries  # Imagines  Imagine 0  Imagine 1  Imagine 2  Imagine 3  Avg/Img    Speed Up
1024       1024       1           6697600    -          -          -          6697600    1
1024       1024       2           3281408    3420288    -          -          3350848    1.99877762
1024       1024       4           1569280    1716224    1716224    1708160    1677472    3.99267469
32         1024       1           209300     -          -          -          209300     1
32         1024       2           102544     106884     -          -          104714     1.99877762
32         1024       4           49040      53632      53632      53380      52421      3.99267469
8          1024       1           52325      -          -          -          52325      1
8          1024       2           25636      26721      -          -          26178.5    1.99877762
8          1024       4           12260      13408      13408      13345      13105.25   3.99267469
8          512        1           25636      -          -          -          25636      1
8          512        2           12260      13408      -          -          12834      1.99750662
8          512        4           5950       6342       6720       6720       6433       3.98507695
8          64         1           …
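Each Speed Up value follows directly from the cycle counts: the single-Imagine count divided by the multi-node Avg/Img count. A short illustrative C check (ours, not part of the report), reproducing the 1024-packet, 1024-entry figures:

    #include <stdio.h>

    int main(void)
    {
        /* Cycle counts for 1024 packets x 1024 entries, from the table above. */
        double t1   = 6697600.0;                       /* 1 Imagine         */
        double avg2 = (3281408.0 + 3420288.0) / 2.0;   /* 2-Imagine Avg/Img */
        double avg4 = (1569280.0 + 1716224.0
                     + 1716224.0 + 1708160.0) / 4.0;   /* 4-Imagine Avg/Img */

        printf("2-Imagine speedup: %.8f\n", t1 / avg2);   /* 1.99877762 */
        printf("4-Imagine speedup: %.8f\n", t1 / avg4);   /* 3.99267469 */
        return 0;
    }

The near-2x and near-4x factors are consistent with the finding in section 1.2 that pipelined speedup scales linearly with the number of Imagines.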

