Partial Orders for Parallel Debugging C J Fidge Department of Computer Science Australian National University Canberra ACT Australia ABSTRACT Parallel programs differ from sequential programs primarily in that the temporal relationsbips between events are ouly partially defined However for a given distributed computation debugging utilities typically linearize the observed set of events into a total ordering thus losing information and allowing potentially capturable temporal errors to We explore use of the partially ordered relation happened before to escape detection augment both centralized and distributed parallel debuggers to ensure that such errors are always detected and that the results produced by the debugger are unaffected by the non determinism inherent in the partial ordering This greatly reduces the number of tests required during debugging Assertions are based on time intervals rather than treating events as dimensionless points 1 INTRODUCTION This paper presents techniques for reliably detecting temporal errors in a distributed program where a temporal error is defined to be a violation of the intended partial ordering of events This is proposed as a supplement to existing debugging utilities as a way of increasing their consistency in the face of non determinism and thus reduce the number of tests required for a programmer to be satisfied that a program is correct As a compromise between efficieucy and usefulness the approach is developed methodically from first principles by starting with the simplest possible temporal relationship happened before for atomic dimensionless eve s and extended to finite time intervals The presentation is motivated by two styles of assertion one event and one state based Two different implementations are considered Firstly a centralized system is discussed based on existing debugging systems A more attractive distributed approach is then presented which attempts to minimize the impact of the debugging code on the program under test It is assumed that message passing is the only form of inter process communication Both synchronous and asynchronous communication are discussed although this work was originally motivated by consideration of synchronous systems Note that we are only concerned with relative event orderings no consideration is made of the issues associated with real time or performance debugging 2 MOTIVATION Faced with a black box parallel program and a set of assertions specifying the intended behaviour of the program a natural approach to debugging is to insert an observer or monitor process which receives reports from the program under test and checks that the assertions hold 2 The observer may or may not delay the program while checkiug the assertions This philosophy also underlies those systems which display significant events on a graphics termiua 1 where the onus is on the user to ensure that the program is seen to behave correctly or those in which a linear truce is generated for post mortem analysis 7 This work was motivated by the observation that any such system implicitly totally orders the events of interest This has a number of associat ed problems Firstly it is impossible for an observer process to distinguish between the case where the temporal ordering between two events is and is not enforced by a causal relationship 183 Indeed it is even possible for an observer in an asynchronous system to perceive an incorrect total ordering due to overtaking even if FIFO queueiug is guaranteed for each inter process channel i assumeb precedes a event a Admittedly this can not occur in the synchronous case or if an asynchronous observer blocks the observed processes by the use of ack signals Additionally the debugger itself may be non deterministically affected by the global time interleaving of events For exactly the same computation with the same inputs following the same control paths and generating the same results the debugger may produce different results due to differences in interleaving and delays in reporting events For example assuming synchronous communication b event a event a This forces the user to execute the same test again and again to ensure that the debugger has not missed a temporal error simply due to a fortuitous interleaving of events Experience in this area has shown that it is necessary to run a particular test several times with different process priorities in the hopes of forcing an error to manifest itself 7 It is important to note that this situation exists even with the use of reproducibility tools that do not totally order event traces section 3 1 Also the perceived ordering differs for each process in the system This problem is inherent in any attempt to add avzriliary communications to a parallel system to distribute information about past events As a degenerate example two processes simply informing each other of events will inevitably infer different temporal orderings in the absence of any additional information assume a precedes eventa assume b precedes b f a 1 event Finally the observer itself may alter the behaviour of the system being studied 9 This so called probe eflect 6 is particularly noticeable when t he observer blocks processes The stand taken here is that the interleaving of concurrent events should not affect the results produced by a parallel debugging system Any pronouncement on the correctness or otherwise of a particular test must be valid for any possible event ordering defined by that computation Additionally the perceived event ordering must be consistent for all observers We aim to achieve this by always working with the entire partial ordering of events defined by the computation rather than a single arbitrarily selected total ordering 3 BACKGROUND This section reviews related work on which the current assumptions made presentation is based and defines some of the 3 1 Reproducibility The value of reproducible tests for para llel progmms as a means of controlling non determinism during debugging is now widely recognized 16 and we accept it as a fundamental capability This effectively allows efficiency issues to be disregarded by a ssuming that the debugging tools are added to prerecorded 184 tests in particular an observer can block monitored processes without altering the system behaviour D21 Reproducibility is assumed to exist in its least intrusive form in which each process separately records the non deterministic choices made and uses them to guide later
View Full Document
Unlocking...