U of I CS 420 - Optimal Tracing and Replay for Debugging Message-Passing Parallel Programs - D1365433

Home> Schools> University of Illinois> Computer Science (CS) > CS 420> Optimal Tracing and Replay for Debugging Message-Passing Parallel Programs

U of I CS 420 - Optimal Tracing and Replay for Debugging Message-Passing Parallel Programs

School name University of Illinois

Course Cs 420- Parallel Progrmg: Sci & Engrg

Pages 10

Download Save

Unformatted text preview:

Optimal Tracing Robert Dept and Replay for Debugging Parallel Programs Barton H B Netzer of Computer Brown Box Providence Science University Computer University Dept WI St 53706 b art cs wise edu original bug This non repeatability m akes it difficult to use traditional sequential debugging techniques that require repeated execution Therefore a mechanism for tracing and then replaying a program execution is an essential p art of parallel progr am debugging A critical cost in such a mechanism is the cost of tracing In this paper we present a trace and replay mechanism for message passing p ar allel programs that is optimal in the common c ase and has good performance in the worst case We significantly reduce the number of messages that need be traced improving by up to two orders of magnitude e arlier techniques that trace every message With such a reduction long nmning programs can now be replayed that could not have been previously replayed Our tracing works by making decisions at run time tracing often the minimal set of messages needed to provide replay Experiments show that only 1910of the messages usually need be traced Abstract A common debugging strategy involves reexecuting a program on a given input over and over each time gaining more information about bugs Such techniques can fail on message passing parallel programs Because of variations in message latencies and process scheduling different runs on the given input may produce different results This non repeatability is a serious debugging problem since an execution cannot always be reproduced to track down bugs This paper presents a technique for tracing and replaying message passing programs for debugging Our technique is optimal in the common case and has good performance in the worst case By m aking run time tracing decisions we trace only a fraction of the total number of messages g aining two orders of magnitude reduction over traditional techniques which trace every message Experiments indicate that only 1 of the messages often need be aced These traces are sufficient to provide replay allowing an execution to be reproduced any number of times for debugging Our work is novel in that we use run time decisions to detect and trace only those messages that introduce nondeterminacy With our strategy large reductions in trace size allow long running programs to be replayed that were previously unmanageable In addition the reduced tracing requirements alleviate tracing bottlenecks allowing executions to be debugged with substantially lower execution time overhead In a trace and replay scheme the order in which messages are delivered but not their contents is first traced during execution These traces are then used during re execution to force each message to be delivered to the same operation as during the traced execution Tracing the original execution is necessary because some messages may race with others introducing nondeterminacy into the execution Two messages race if they are simultaneously in tmnsit and either could arrive first at some receive operation If the original order in which racing messages are delivered is not recorded their order cannot always be reproduced during replay However by tracing the original message deliveries and then forcing them to occur during replay the computation and all its messages can will be exactly reproduced An erroneous execution then be repeatedly replayed to analyze the execution more carefully and gain information about bugs 1 Introduction Parallel progr ams can be nondeterministic When processes communicate by passing messages variations in scheduling and message latencies can cause two executions of the same program on the same input to produce different results Such nondeterminacy may be intended but it can cause serious problems while debugging subsequent executions of the program may not reproduce the Our main result is a tracing strategy using on thefly detection of racing messages that is optimal in most t Interactions TtNs work was supported m part by National Scrence Foundation grants CCR 88 15928 and CCR 9 100968 Office of Naval Research grant NOOO14 89 J 1222 and a grant from Sequent Computer Systems Inc with the external enwronment must also be repro duced such as return values from system calls However these interactions must be reproduced to replay sequential programs as well 502 3 00 1992 IEEE Sciences of Wisconsin Madison Madison 02912 rn cs brown edu 1063 9535 92 P Miller 1210 W Dayton 1910 RI Message Passing cases and effective even in the worst c ase Instead of tracing every message as earlier schemes propose our technique checks each message to determine if it races with another and traces only one of the racing messages When a message is received a race check is performed by analyzing the execution order between the previous receive operation in the same process and the message necessary for this sender The ordering information check is m ain ined during execution by appending vectortimesmmpson tousermessages This strategy iseffective because the racing messages are exactly those that introduce nondeterminacy into the execution Our work is novel in that only the racing messages are traced In contmst e arlier trace and replay schemes for message p assing programs require tracing every message Replay w as first introduced by Curtis and Wittie in the BugNet system for debugging distributed C proLeBlanc and Mellor Crummey 4 also grams l addressed replay but considered both sh ared memory and message passing parallel programs They trace only the order in which messages are delivered and not their contents By reproducing only the order of message deliveries their contents and hence the original computation will also be reproduced However both of these schemes require emitting some type of trace for every message Tracing every message can require huge amounts of storage for long running programs m aking them impossible to debug In addition as processors become faster and p arallel machines become l arger tracing becomes an increasing bottleneck To replay the execution for debugging we must first trace the order in which the messages are delivered and then use this trace to force a re execution to exhibit the same message deliveries Earlier trace and replay schemes propose tracing the order in which all messages are delivered 1 4 For example they would record that Msgl was delivered to the first Reev in P2 and that Msg2 w as delivered to the second Recv

View Full Document