Unformatted text preview:

iKen BirmanCornell University. CS5410 Fall 2008. State Machines: Historyy Idea was first proposed by Leslie Lamport in 1970’sy Builds on notion of a finite‐state automatonyWe model the program of interest as a black box with inputs such as timer events and messagesyAssume that the program is completely deterministicyAssume that the program is completely deterministicy Our goal is to replicate the program for fault‐toleranceySo: make multiple copies of the state machineSo: make multiple copies of the state machiney Then design a protocol that, for each event, replicates the event and delivers it in the same order to each copyy The copies advance through time in synchronyState MachineProgramin state StEvent eProgramin state in state St+1State MachineEvent eReplica GroupProgramin state StProgramin state StProgramin state StProgramin state Programin state Programin state in state St+1in state St+1in state St+1A simple fault‐tolerance concepty We replace a single entity P with a set y Now our set can tolerate faults that would have caused P idi iP to stop providing servicey Generally, thinking of hardware faultsySoftware faults might impact all replicas in lockstep!ySoftware faults might impact all replicas in lock‐step!ySide discussion:ySide discussion:y Why do applications fail? H ardware? Software? (Sidebar) Why do systems fail?y A topic studied by many researchersy They basically concluded that bugs are the big issuey Even the best software, coded with cleanroomtechniques, will exhibit significant bug rates yHardware an issue too of course!yHardware an issue too, of course!y Sources of bugs?y Poor coding, inadequate testingoo cod g, adequate test gy Vague specifications, including confusing documentation that was misunderstood when someone hd d ii had to exte nd a pre‐existing systemy Bohrbugs and Heisenbugs(Sidebar) Why do systems fail?y Bohrbug:y Term reminds us of Bohr’s model of the nucleus:ldlly A solid little nuggety If you persist, you’ll manage to track it downyLike a binary searchyLike a binary search(Sidebar) Why do systems fail?y Heisenbug:y Term reminds us of Heisenberg’ s model of the nucleus:f ’ k bhl dy A wave function: can’t know both location and momentumy Every time you try to test the program, the test seems to change its behaviorchange its behaviory Often occurs when the “bug” is really a symptom of some much earlier problemMost studies?y Early systems dominated by Bohrbugsy Mature systems show a mixy Many problems introduced by attempts to fix other bugsy Persistent bugs usually of Heisenbug varietyO l id di it ft yOver long periods, upgrading environment can often destabilize a legacy system that worked perfectly wellyCloud scenarioCloud scenarioy “Rare” hardware and environmental events are actually very common in huge data centersDeterminism assumptiony State machine replication isy Easy to understandy Relatively easy to implementy Used in a CORBA “fault‐tolerance” standardyBt th b f kd tiyBut there are a number of awkward assumptionsy Determinism is the first of thesey Question: How deterministic is a modern application, coded in a language such as Java?coded in a language such as Java?Sources of non‐determinismy Thre ads and thread scheduling (parallelism)y Precise time when an interrupt is delivered, or when user input will be processedinput will be processedy Values read from system clock, or other kinds of operating system managed resources (like process status data, CPU y g ( p ,load, etc)y If multiple messages arrive on multiple input sockets, the d i hi h th ill b b th order in which they will be seen by the processy When the garbage collector happens to runy “Constants” like my IP address or port numbers assigned Constants like my IP address, or port numbers assigned to my sockets by the operating systemNon‐determinism explains pHeisenbug problemsy Many Heisenbugs are just vanilla bugs, buty They occur early in the executionyAnd they damage some data structurey The application won’t touch that structure until much later when some nondeterministic thing happenslater, when some non‐deterministic thing happensy But then it will crashySo the crash symptoms vary from run to runySo the crash symptoms vary from run to runy People on the “sustaining support” team tend to try and fix the symptoms and often won’t understand code well enough to understand the true cause(Sidebar) Life of a programy Coded by a wizard who really understood the logicy But she moved to other projects before finishingy Handed off to Q/Ay Q/A did a reasonable job, but worked with inadequate test suite so coverage was spottytest suite so coverage was spottyy For example, never tested clocks that move backwards in time, or TCP connections that break when both ends ,are actually still healthyy In field, such events DO occur, but attempts to fix h j dd d li d b!them just added complexity and more bugs!Overcoming non ‐determinismy One option: disallow non‐determinismy This is what Lamport did, and what CORBA does tooy But how realistic is it?yWorry: what if something you use “encapsulates” a nonyWorry: what if something you use encapsulates a non‐deterministic behavior, unbeknownst to you?y Modern development styles: big applications created p y g ppfrom black box components with agreed interfacesy We lack a “test” for determinism!Overcoming non ‐determinismy Another option: each time something non‐deterministic is about to happen, turn it into an eventF l h d h yFor example, suppose that we want to read the system clockyIf we simply read it ever y replica gets different resultyIf we simply read it, every replica gets different resulty But if we read one clock and replicate the value, they see the same resulty Trickier: how about thread scheduling?y With multicore hardware, the machine itself isn’t ddeterministic!More issuesy For input from the network, or devices, we need some kind of rela y mechanismShi h d h k h diySomething that reads the network, or the devicey Then passes the events to the group of replicasy The rela y mechanism itself won’t be fault‐tolerant: should this worry us?yy For example, if we want to rela y something typed by a user, it starts at a single place (his keyboard)Implementing event replicationy One option is to use a protocol like the Oracle


View Full Document

CORNELL CS 5410 - State Machine Concept

Download State Machine Concept
Our administrator received your request to download this document. We will send you the file to your email shortly.
Loading Unlocking...
Login

Join to view State Machine Concept and access 3M+ class-specific study document.

or
We will never post anything without your permission.
Don't have an account?
Sign Up

Join to view State Machine Concept 2 2 and access 3M+ class-specific study document.

or

By creating an account you agree to our Privacy Policy and Terms Of Use

Already a member?