Unformatted text preview:

COSC 6374 Parallel Computation Performance Modeling Edgar Gabriel Spring 2009 Edgar Gabriel Motivation Can we estimate the costs for a parallel code in order to Evaluate quantitative and qualitative differences between different implementation alternatives Understand the parameters effecting the performance of the application Understanding relevant hardware characteristics Restrictions Any analytical model can not replace real measurements since parallel systems are too complex and unpredictable COSC 6374 Parallel Computation Edgar Gabriel 1 How to model collective operations E g MPI Bcast strongly depending on the algorithm used to implement the operation One process root process distributes the same data items to all members within a process group communicator Linear Algorithm the root process sends one message to each process in the communicator if rank root for i 0 i size i if i root MPI Send buf cnt dat i TAG comm else MPI Recv buf cnt dat root TAG comm stat COSC 6374 Parallel Computation Edgar Gabriel Linear Algorithm I Hockney s Model s message size l latency b bandwidth 1 t s l s b Estimate of the execution time according to Hockney s model for p processes 1 t s p l s p 1 b 4 1 COSC 6374 Parallel Computation Edgar Gabriel 2 Linear Algorithm II Using non blocking operations if rank root for i 0 i size i MPI Isend buf cnt dat i TAG comm req i MPI Recv buf cnt dat root TAG comm stat if rank root MPI Waitall size req statuses Formula 4 1 is now arbitrarily wrong Several communications simultaneously ongoing Maximum optimal number of messages depending on message size and network parameters COSC 6374 Parallel Computation Edgar Gabriel How does communication really work I Two protocols usually used internally Eager protocol message is sent immediately to the receiver without waiting for the according receive to be posted Usually used for short messages e g 1 KB in Open MPI Rendezvous protocol Send a header to receiver Wait for an acknowledgment receive has started Send message data Avoids having to buffer large messages on the receiver process unexpected messages COSC 6374 Parallel Computation Edgar Gabriel 3 How communication really works II Three levels of buffering Application level e g MPI Bsend MPI library level unexpected message queues System buffering System buffering works similarly to file systems e g for sockets data is copied into socket buffer before sending MPI Send returns as soon as data is in the socket buffer No way to alternate this data anymore so it is safe to return control to the application COSC 6374 Parallel Computation Edgar Gabriel How communication really works III For a short message socket buffer size sbsize Data copied into socket buffer write operation on the according socket called MPI Send returns control to the application in a time which is shorter than the network latency For a long message Large message is split into chunks of size sbsize A chunk of the data is copied into socket buffer and sent As soon as the receiving process acknowledges the receipt of the data chunk the next chunk is copied into socket buffer etc COSC 6374 Parallel Computation Edgar Gabriel 4 How communication really works IV So transfer of a large message looks like Sending a small chunk Wait Sending a small chunk Wait This behavior is not modeled by Hockney but e g by the LogGP model Based on LogGP one should split a large message into smaller chunks and send them simultaneously for a bcast operation Hide the gap by using a different channel COSC 6374 Parallel Computation Edgar Gabriel Multi segmented linear algorithm nmgs cnt scnt if rank root for j 0 j nmsgs j tbuf buf j scnt for i 0 i size i MPI Isend tbuf scnt dat i TAG comm req 2 j i for j 0 j nmsgs j tbuf buf j scnt MPI Irecv tbuf scnt dat root TAG comm rreq j if rank root MPI Waitall size nmsgs req statuses MPI Waitall nmsgs rreq rstatuses COSC 6374 Parallel Computation Edgar Gabriel 5 Binary and Binomial Trees 0 0 2 1 4 3 8 9 6 5 10 11 12 14 13 2 3 4 7 8 9 10 5 11 6 12 13 14 15 Number of messages increase with every iteration network saturated starting from a certain number of messages message segmenting can improve the performance as well COSC 6374 Parallel Computation Edgar Gabriel Chain Algorithms 1 2 3 0 1 2 3 0 1 2 3 0 1 2 3 0 1 2 3 n n n n 0 Segment a message and pass them from one process to another Performs very well for very large messages 7 1 n COSC 6374 Parallel Computation Edgar Gabriel 6 k Chain Algorithm 0 e g k 5 1 2 3 4 5 COSC 6374 Parallel Computation Edgar Gabriel Hockney s Model t s l s b 14 1 l latency of the network b bandwidth of the network How can we determine the latency and the bandwidth Ping pong benchmark process A sends a message to process B process B sends message back Advantage does not require synchronized clocks between A and B Disadvantage assumes symmetric communication performance costs A B costs B A To determine latency execute ping pong benchmark for cnt 0 COSC 6374 Parallel Computation Edgar Gabriel 7 Ping pong benchmark comm MPI COMM WORLD for i 1 i MAX MSG LEN i 2 t1 MPI Wtime for j 0 j MAX MEASUREMENTS j if rank 0 MPI Send buf i MPI INT 1 1 comm MPI Recv buf i MPI INT 1 1 comm status else if rank 1 MPI Recv buf i MPI INT 0 1 comm status MPI Send buf i MPI INT 0 1 comm t2 MPI Wtime if rank 0 printf Msg len d avg exec lf bandw d n i t2 t1 2 MAX MEASUREMENTS i sizeof int t2 t1 2 MAX MEASUREMENTS 6374 Parallel Computation COSC Edgar Gabriel 15 Ping pong benchmark II COSC 6374 Parallel Computation Edgar Gabriel 8 Ping pong benchmark II To determine bandwidth have to determine the saturation point Required message length does depend on the network bandwidth COSC 6374 Parallel Computation Edgar Gabriel 17 LogP Model published by Culler et all Parameters L upper bound on the latency o overhead defined as the length of the time that a process is engaged in the transmission or reception of a message During this time the process can not perform other operations g gap defined as the minimum time interval between consecutive message transmissions or receptions The reciprocal time of g corresponds to the per process communication bandwidth P number of processors COSC 6374 Parallel Computation Edgar Gabriel 9 LogP II Start sending Sender Message enters network o Message leaves network Receiver L o End receiving t L 2o Costs for sending a messages 19 1 COSC 6374 Parallel Computation Edgar Gabriel LogP II Start sending g Sender o g o Receiver L o o End receiving Costs for sending


View Full Document

UH COSC 6374 - Performance Modeling

Documents in this Course
Load more
Loading Unlocking...
Login

Join to view Performance Modeling and access 3M+ class-specific study document.

or
We will never post anything without your permission.
Don't have an account?
Sign Up

Join to view Performance Modeling and access 3M+ class-specific study document.

or

By creating an account you agree to our Privacy Policy and Terms Of Use

Already a member?