Fault-tolerant Computing
6.033 Spring 2007
April 4, 2007

Where are we in 6.033?
• Modularity to control complexity
• Names are the glue to compose modules
• Strong form of modularity: client/server
    • Limit propagation of errors
• Implementations of client/server:
    • In a single computer using virtualization
    • In a network using protocols
• Compose clients and services using names
    • DNS

How to respond to failures?
• Failures are contained; they don't propagate
    • Benevolent failures
• Can we do better?
    • Keep computing despite failures?
    • Defend against malicious failures (attacks)?
• Rest of semester: handle these "failures"
    • Fault-tolerant computing
    • Computer security

Fault-tolerant computing
• General introduction: today
    • Replication/Redundancy
• The hard case: transactions
    • Updating permanent data in the presence of concurrent actions and failures
• Replication revisited: consistency

Availability in practice
• Carrier airlines (2002 FAA fact book)
    • 41 accidents, 6.7M departures
    • 99.9993% availability
• 911 phone service (1993 NRIC report)
    • 29 minutes per line per year
    • 99.994%
• Standard phone service (various sources)
    • 53+ minutes per line per year
    • 99.99+%
• End-to-end Internet availability
    • 95% - 99.6%

Disk failure conditional probability distribution
[Figure: bathtub curve. Labels: infant mortality, 1 / (reported MTTF), expected operating lifetime, burn out]

Human Mortality Rates (US, 1999)
[Figure. From: L. Gavrilov & N. Gavrilova, "Why We Fall Apart," IEEE Spectrum, Sep. 2004. Data from http://www.mortality.org]

Fail-fast disk

    failfast_get (data, sn) {
        get (s, sn);
        if (checksum(s.data) = s.cksum) {
            data ← s.data;
            return OK;
        } else {
            return BAD;
        }
    }

Careful disk

    careful_get (data, sn) {
        tries ← 0;
        while (tries < 10) {
            r ← failfast_get (data, sn);
            if (r = OK) return OK;
            tries++;
        }
        return BAD;
    }

Durable disk (RAID 1)

    durable_get (data, sn) {
        r ← disk1.careful_get (data, sn);
        if (r = OK) return OK;
        r ← disk2.careful_get (data, sn);
        signal(repair disk1);
        return
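
The three disk layers above can be sketched in runnable form. This is a minimal Python illustration, not the course's implementation: the in-memory `Disk` class, the `transient_errors` counter used to simulate flaky reads, the CRC-32 checksum, and the `repairs` list standing in for `signal(repair disk1)` are all assumptions made for the example. The slide's `durable_get` is cut off after `return`, so the value returned after falling back to the mirror is also an assumption.

```python
import zlib
from dataclasses import dataclass

OK, BAD = "OK", "BAD"
MAX_TRIES = 10  # retry bound taken from the careful_get pseudocode


@dataclass
class Sector:
    data: bytes
    cksum: int


class Disk:
    """Hypothetical in-memory disk used only for this illustration."""

    def __init__(self):
        self.sectors = {}
        self.transient_errors = 0  # number of upcoming reads to corrupt

    def put(self, sn, data):
        # Store the data together with its checksum (CRC-32 is an assumption).
        self.sectors[sn] = Sector(data, zlib.crc32(data))

    def raw_get(self, sn):
        s = self.sectors[sn]
        if self.transient_errors > 0:
            self.transient_errors -= 1
            # Simulated read error: data no longer matches the stored checksum.
            return Sector(b"garbage", s.cksum)
        return s

    def failfast_get(self, sn):
        # Fail-fast: detect a bad read and report it instead of returning bad data.
        s = self.raw_get(sn)
        if zlib.crc32(s.data) == s.cksum:
            return OK, s.data
        return BAD, None

    def careful_get(self, sn):
        # Careful: retry the fail-fast read to mask transient errors.
        for _ in range(MAX_TRIES):
            status, data = self.failfast_get(sn)
            if status == OK:
                return OK, data
        return BAD, None


def durable_get(disk1, disk2, sn, repairs):
    # Durable (RAID 1): fall back to the mirror when the first disk fails.
    status, data = disk1.careful_get(sn)
    if status == OK:
        return OK, data
    status, data = disk2.careful_get(sn)
    repairs.append("disk1")  # stands in for signal(repair disk1)
    return status, data  # assumption: report whatever the mirror returned
```

Each layer masks a class of failure that the layer below only detects: the checksum turns silent corruption into a reported error, retries hide transient errors, and the mirror hides a persistent failure of one disk.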