Stanford CS 295 - Study Notes

Using Correlated Surprise to Infer Shared Influence

Adam J. Oliner, Ashutosh V. Kulkarni, and Alex Aiken
Stanford University*
Department of Computer Science
{oliner, ashutosh.kulkarni, aiken}@cs.stanford.edu

* This work was supported in part by NSF grant CCF-0915766 and the DOE High-Performance Computer Science Fellowship.

Abstract

We propose a method for identifying the sources of problems in complex production systems where, due to the prohibitive costs of instrumentation, the data available for analysis may be noisy or incomplete. In particular, we may not have complete knowledge of all components and their interactions. We define influences as a class of component interactions that includes direct communication and resource contention. Our method infers the influences among components in a system by looking for pairs of components with time-correlated anomalous behavior. We summarize the strength and directionality of shared influences using a Structure-of-Influence Graph (SIG). This paper explains how to construct a SIG and use it to isolate system misbehavior, and presents both simulations and in-depth case studies with two autonomous vehicles and a 9,024-node production supercomputer.

1 Introduction

Consider a complex production system in which something goes wrong: a performance glitch, a strange result, or an outright crash. How might we identify the source of the problem? A fundamental difficulty is that the costs of instrumentation in production systems are often prohibitive. Significant systems are invariably constructed from many interacting subsystems, and we cannot expect to have measurements from every component. In fact, in many systems we will not even know of all the components, or of all the interactions among the components we do know. This paper is about analyzing systems as they are, generating a potentially partial diagnosis from whatever data is available.

Our method requires only that some of the components in the system are instrumented to generate timestamped measurements of their behavior. The type of measurement may depend on the type of component (e.g., a laser sensor may be instrumented differently than a hard disk), so we need a way to compare measurements of different components in a uniform way. We address this issue, and the related question of how to summarize different kinds of measurements from a single component, by mapping all components' behavior onto a single dimension: surprise. That is, our method quantifies how anomalous each component's behavior is, producing an anomaly signal that measures deviation from a model of normal behavior for that component. An important feature of our anomaly signals is that they are real-valued: the degree to which a component's behavior is anomalous is retained, rather than being discretized into "normal" and "abnormal," as in the common approach.
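The paper does not give an implementation at this point, but the mapping from raw measurements to a real-valued surprise score can be illustrated with a minimal sketch. The rolling-window Gaussian model, the window length, and the z-score below are illustrative assumptions, not the authors' actual per-component models:

```python
import numpy as np

def anomaly_signal(measurements: np.ndarray, window: int = 100) -> np.ndarray:
    """Map a stream of timestamped measurements to a real-valued anomaly signal.

    Each sample is scored by its absolute z-score against a rolling Gaussian
    model of "normal" behavior (the preceding `window` samples), so the degree
    of surprise is preserved rather than discretized into normal/abnormal.
    """
    signal = np.zeros(len(measurements), dtype=float)
    for t in range(window, len(measurements)):
        history = measurements[t - window:t]
        mu, sigma = history.mean(), history.std()
        # Guard against a constant history, where the z-score is undefined.
        signal[t] = abs(measurements[t] - mu) / sigma if sigma > 0 else 0.0
    return signal
```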
When two anomaly signals are correlated, meaning that two components tend to exhibit surprising behavior around the same time, we say that the components share an influence. This correlation can arise from a number of interactions, including direct communication and contention for a shared resource. Not all interactions are instantaneous, so we use effect delays (how long an anomaly in one component tends to take to manifest itself in another) to establish directionality. Correlation is a pairwise relationship and delay is directional, so the most natural structure to summarize influence is a graph. A Structure-of-Influence Graph (SIG) encodes strong influence as an edge between components, with optional directionality to represent a delay.
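As a sketch of how such a graph might be assembled from the anomaly signals above: the maximum lag, the Pearson correlation measure, and the edge threshold here are assumptions for illustration, and the signals are assumed to be resampled onto a common timeline. The paper's actual construction is given in Section 3.

```python
import numpy as np
from itertools import combinations

def lagged_correlation(a: np.ndarray, b: np.ndarray, max_lag: int):
    """Find the strongest Pearson correlation between anomaly signals a and b
    over shifts of up to max_lag samples, and the shift where it occurs.
    A positive lag means surprise in `a` tends to precede surprise in `b`.
    Assumes a and b are the same length (aligned to one timeline)."""
    best_corr, best_lag = 0.0, 0
    for lag in range(-max_lag, max_lag + 1):
        if lag >= 0:
            x, y = a[:len(a) - lag], b[lag:]
        else:
            x, y = a[-lag:], b[:len(b) + lag]
        # Skip degenerate alignments (too short, or constant signals).
        if len(x) > 2 and x.std() > 0 and y.std() > 0:
            c = np.corrcoef(x, y)[0, 1]
            if abs(c) > abs(best_corr):
                best_corr, best_lag = c, lag
    return best_corr, best_lag

def build_sig(signals: dict, max_lag: int = 50, threshold: float = 0.5):
    """Build a SIG as a list of edges (src, dst, lag) between components whose
    anomaly signals correlate above `threshold`. The edge points from the
    component whose surprise leads to the one whose surprise follows; a lag
    of zero corresponds to an undirected edge."""
    edges = []
    for a, b in combinations(sorted(signals), 2):
        corr, lag = lagged_correlation(signals[a], signals[b], max_lag)
        if abs(corr) >= threshold:
            edges.append((a, b, lag) if lag >= 0 else (b, a, -lag))
    return edges
```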
Passively collected data, if devoid of hints like "component A sent a message to component B," cannot be used to infer causality: the strongest possible mathematical statement is that the behavior of one component is correlated with that of another. An advantage of using statistical correlation is that it enables asking "what-if" queries after the fact. For example, it is easy to add a new "component" whose anomaly signal is large around the time the bad behavior was observed. Other, real, components that share influence with this synthetic component are likely candidates for contributors to the problem.
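A hypothetical sketch of such a what-if query, reusing the build_sig sketch above; the rectangular pulse shape, its width, and the "what-if" component name are all illustrative assumptions rather than anything prescribed by the paper:

```python
import numpy as np

def what_if_signal(length: int, event_times, width: int = 10) -> np.ndarray:
    """Synthesize an anomaly signal that is large (a rectangular pulse of
    2*width samples per event) around the times the bad behavior was
    observed, and zero elsewhere."""
    signal = np.zeros(length)
    for t in event_times:
        signal[max(0, t - width):t + width] = 1.0
    return signal

# Usage sketch: add the synthetic component and rebuild the SIG. Any real
# component that acquires an edge with "what-if" is a candidate contributor.
# ("disk" and the event time 4200 are placeholders.)
# signals["what-if"] = what_if_signal(len(signals["disk"]), event_times=[4200])
# suspects = [e for e in build_sig(signals) if "what-if" in e[:2]]
```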
Our goal is to generate a structure, informed by models of component behavior, that enables a user to more easily answer prediction and diagnosis questions. The SIG method has several desirable properties:

• Building a SIG requires no intrusive instrumentation, no expert knowledge of the components, and no knowledge about communication channels (e.g., the destination of a message), shared resources, or message content. Our method is passive and can treat components as black boxes.

• Influence describes correlation, not causality. A key feature of our approach is to drop the assumption that we can observe all component interactions and to focus on the correlations among the behaviors we can observe (see Section 2).

• By working directly with a real-valued, rather than binary, anomaly signal, our method degrades gracefully when data is noisy or incomplete.

• Our experimental results show that SIGs can detect influence in complex systems that exhibit resource contention, loops and bidirectional influence, time-delayed effects, and asynchronous communication.

In this paper, we present the SIG method and work through an example (Section 3); perform several controlled experiments using a simulator to explore parameters like message drop rate, timing noise, and number of intermediate components (Section 4); describe the central case study of the paper, in which we took passively collected measurements from two autonomous vehicles and built SIGs that enabled us to identify the source of a critical bug (Section 5); and briefly present a significantly different second example, isolating a bug in a production supercomputer (Section 6).

2 Related Work

There is an extensive body of work on system modeling, especially on inferring the causal or dependency structure of distributed systems. Our method distinguishes itself from previous work in various ways, but primarily in that we look for influences rather than dependencies.

Dependency graphs, or some probabilistic variant (e.g., Bayesian networks), are frequently proposed for prediction and diagnosis of computer systems. There have been a number of recent attempts at dependency modeling in distributed systems. Pinpoint [7, 8] and Magpie [3] track communication dependencies with the aim of isolating the root cause of misbehavior; they require