Stanford CS 295 - Study Notes

Using Correlated Surprise to Infer Shared Influence

Adam J. Oliner, Ashutosh V. Kulkarni, and Alex Aiken
Stanford University*
Department of Computer Science
{oliner, ashutosh.kulkarni, aiken}@cs.stanford.edu

* This work was supported in part by NSF grant CCF-0915766 and the DOE High-Performance Computer Science Fellowship.

Abstract

We propose a method for identifying the sources of problems in complex production systems where, due to the prohibitive costs of instrumentation, the data available for analysis may be noisy or incomplete. In particular, we may not have complete knowledge of all components and their interactions. We define influences as a class of component interactions that includes direct communication and resource contention. Our method infers the influences among components in a system by looking for pairs of components with time-correlated anomalous behavior. We summarize the strength and directionality of shared influences using a Structure-of-Influence Graph (SIG). This paper explains how to construct a SIG and use it to isolate system misbehavior, and presents both simulations and in-depth case studies with two autonomous vehicles and a 9,024-node production supercomputer.

1 Introduction

Consider a complex production system in which something goes wrong: a performance glitch, a strange result, or an outright crash. How might we identify the source of the problem? A fundamental difficulty is that the costs of instrumentation in production systems are often prohibitive. Significant systems are invariably constructed from many interacting subsystems, and we cannot expect to have measurements from every component. In fact, in many systems we will not even know of all the components, or of all the interactions among the components we do know. This paper is about analyzing systems as they are, generating a potentially partial diagnosis from whatever data is available.

Our method requires only that some of the components in the system are instrumented to generate timestamped measurements of their behavior. The type of measurement may depend on the type of component (e.g., a laser sensor may be instrumented differently than a hard disk), so we need a way to compare measurements of different components in a uniform way. We address this issue, and the related question of how to summarize different kinds of measurements from a single component, by mapping all components' behavior onto a single dimension: surprise. That is, our method quantifies how anomalous each component's behavior is, producing an anomaly signal that measures deviation from a model of normal behavior for that component. An important feature of our anomaly signals is that they are real-valued: the degree to which a component's behavior is anomalous is retained, rather than being discretized into "normal" and "abnormal," as in the common approach.
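The paper does not give an implementation at this point, but the mapping from raw measurements to a real-valued surprise score can be illustrated with a minimal sketch. The rolling-window Gaussian model, the window length, and the z-score below are illustrative assumptions, not the authors' actual per-component models:

```python
import numpy as np

def anomaly_signal(measurements: np.ndarray, window: int = 100) -> np.ndarray:
    """Map a stream of timestamped measurements to a real-valued anomaly signal.

    Each sample is scored by its absolute z-score against a rolling Gaussian
    model of "normal" behavior (the preceding `window` samples), so the degree
    of surprise is preserved rather than discretized into normal/abnormal.
    """
    signal = np.zeros(len(measurements), dtype=float)
    for t in range(window, len(measurements)):
        history = measurements[t - window:t]
        mu, sigma = history.mean(), history.std()
        # Guard against a constant history, where the z-score is undefined.
        signal[t] = abs(measurements[t] - mu) / sigma if sigma > 0 else 0.0
    return signal
```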
When two anomaly signals are correlated, meaning that two components tend to exhibit surprising behavior around the same time, we say that the components share an influence. This correlation can arise from a number of interactions, including direct communication and contention for a shared resource. Not all interactions are instantaneous, so we use effect delays (how long an anomaly in one component tends to take to manifest itself in another) to establish directionality. Correlation is a pairwise relationship and delay is directional, so the most natural structure to summarize influence is a graph. A Structure-of-Influence Graph (SIG) encodes strong influence as an edge between components, with optional directionality to represent a delay.
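As a sketch of how such a graph might be assembled from the anomaly signals above: the maximum lag, the Pearson correlation measure, and the edge threshold here are assumptions for illustration, and the signals are assumed to be resampled onto a common timeline. The paper's actual construction is given in Section 3.

```python
import numpy as np
from itertools import combinations

def lagged_correlation(a: np.ndarray, b: np.ndarray, max_lag: int):
    """Find the strongest Pearson correlation between anomaly signals a and b
    over shifts of up to max_lag samples, and the shift where it occurs.
    A positive lag means surprise in `a` tends to precede surprise in `b`.
    Assumes a and b are the same length (aligned to one timeline)."""
    best_corr, best_lag = 0.0, 0
    for lag in range(-max_lag, max_lag + 1):
        if lag >= 0:
            x, y = a[:len(a) - lag], b[lag:]
        else:
            x, y = a[-lag:], b[:len(b) + lag]
        # Skip degenerate alignments (too short, or constant signals).
        if len(x) > 2 and x.std() > 0 and y.std() > 0:
            c = np.corrcoef(x, y)[0, 1]
            if abs(c) > abs(best_corr):
                best_corr, best_lag = c, lag
    return best_corr, best_lag

def build_sig(signals: dict, max_lag: int = 50, threshold: float = 0.5):
    """Build a SIG as a list of edges (src, dst, lag) between components whose
    anomaly signals correlate above `threshold`. The edge points from the
    component whose surprise leads to the one whose surprise follows; a lag
    of zero corresponds to an undirected edge."""
    edges = []
    for a, b in combinations(sorted(signals), 2):
        corr, lag = lagged_correlation(signals[a], signals[b], max_lag)
        if abs(corr) >= threshold:
            edges.append((a, b, lag) if lag >= 0 else (b, a, -lag))
    return edges
```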
Passively collected data, if devoid of hints like "component A sent a message to component B," cannot be used to infer causality: the strongest possible mathematical statement is that the behavior of one component is correlated with that of another. An advantage of using statistical correlation is that it enables asking "what-if" queries after the fact. For example, it is easy to add a new "component" whose anomaly signal is large around the time the bad behavior was observed. Other, real, components that share influence with this synthetic component are likely candidates for contributors to the problem.
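A hypothetical sketch of such a what-if query, reusing the build_sig sketch above; the rectangular pulse shape, its width, and the "what-if" component name are all illustrative assumptions rather than anything prescribed by the paper:

```python
import numpy as np

def what_if_signal(length: int, event_times, width: int = 10) -> np.ndarray:
    """Synthesize an anomaly signal that is large (a rectangular pulse of
    2*width samples per event) around the times the bad behavior was
    observed, and zero elsewhere."""
    signal = np.zeros(length)
    for t in event_times:
        signal[max(0, t - width):t + width] = 1.0
    return signal

# Usage sketch: add the synthetic component and rebuild the SIG. Any real
# component that acquires an edge with "what-if" is a candidate contributor.
# ("disk" and the event time 4200 are placeholders.)
# signals["what-if"] = what_if_signal(len(signals["disk"]), event_times=[4200])
# suspects = [e for e in build_sig(signals) if "what-if" in e[:2]]
```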
Our goal is to generate a structure, informed by models of component behavior, that enables a user to more easily answer prediction and diagnosis questions. The SIG method has several desirable properties:

• Building a SIG requires no intrusive instrumentation, no expert knowledge of the components, and no knowledge about communication channels (e.g., the destination of a message), shared resources, or message content. Our method is passive and can treat components as black boxes.

• Influence describes correlation, not causality. A key feature of our approach is to drop the assumption that we can observe all component interactions and to focus on the correlations among the behaviors we can observe (see Section 2).

• By working directly with a real-valued, rather than binary, anomaly signal, our method degrades gracefully when data is noisy or incomplete.

• Our experimental results show that SIGs can detect influence in complex systems that exhibit resource contention, loops and bidirectional influence, time-delayed effects, and asynchronous communication.

In this paper, we present the SIG method and work through an example (Section 3); perform several controlled experiments using a simulator to explore parameters like message drop rate, timing noise, and number of intermediate components (Section 4); describe the central case study of the paper, in which we took passively collected measurements from two autonomous vehicles and built SIGs that enabled us to identify the source of a critical bug (Section 5); and briefly present a significantly different second example, isolating a bug in a production supercomputer (Section 6).

2 Related Work

There is an extensive body of work on system modeling, especially on inferring the causal or dependency structure of distributed systems. Our method distinguishes itself from previous work in various ways, but primarily in that we look for influences rather than dependencies.

Dependency graphs, or some probabilistic variant (e.g., Bayesian networks), are frequently proposed for prediction and diagnosis of computer systems. There have been a number of recent attempts at dependency modeling in distributed systems. Pinpoint [7, 8] and Magpie [3] track communication dependencies with the aim of isolating the root cause of misbehavior; they require