Complexity revisited: learning from failures
Frans Kaashoek
Lec 26 --- Last one! 5/16/07
Credit: Jerry Saltzer

6.033 in one slide
• Client/server
• RPC
• File abstraction
• Virtual memory
• Threads
• Coordination
• Protocol layering
• Routing protocols
• Reliable packet delivery
• Names
• Replication protocols
• Transactions
• Verify/Sign
• Encrypt/Decrypt
• ACL and capabilities
• Speaks for
Case studies of successful systems: LISP, UNIX, X Windows, MapReduce, Ethernet, Internet, WWW, RAID, DNS, …
Principles: End-to-end argument, Open Design, …

Today: Why do systems fail anyway?
• Complexity in computer systems has no hard edge
• Learning from failures: common problems
• Fighting back: avoiding the problems
• Admonition & 6.033 theme song

Too many objectives
• Ease of use
• Availability
• Scalability
• Flexibility
• Mobility
• Security
• Networked
• Maintainability
• Performance
• Durable
• …
Lack systematic methods
Many objectives + Few Methods + High d(technology)/dt = Very high risk of failure
The tarpit [Brooks, Mythical Man-Month]

Complexity: no hard edge
• It just gets worse, worse, and worse …
[Figure: plot with axes labeled "Increasing function" and "Subjective complexity"]

Learn from failure
“The concept of failure is central to design process, and it is by thinking in terms of obviating failure that successful designs are achieved…” [Petroski]

Keep digging principle
• Complex systems fail for complex reasons
– Find the cause …
– Find a second cause …
– Keep looking …
– Find the mind-set.
[Petroski, Design Paradigms]

Pharaoh Sneferu’s Pyramid project
Try 1: Meidum (52° angle)
Try 2: Dashur/Bent (52° to 43.5° angle)
Try 3: Red pyramid (the right angle: 43°)

United Airlines/Univac
• Automated reservations, ticketing, flight scheduling, fuel delivery, kitchens, and general administration
• Started 1966, target 1968, scrapped 1970, spent $50M
• Second-system effect (First: SABRE) (Burroughs/TWA repeat)

CONFIRM
• Hilton, Marriott, Budget, American Airlines
• Hotel reservations linked with airline and car rental
• Started 1988, scrapped 1992, $125M
• Second system
• Dull tools (machine language)
• Bad-news diode
[Communications of the ACM 1994]

IBM Workplace OS for PPC
• Mach 3.0 + binary compatibility with AIX + DOS, MacOS, OS/400 + new clock mgmt + new RPC + new I/O + new CPU
• Started in 1991, scrapped 1996 ($2B)
• 400 staff on kernel, 1500 elsewhere
• “Sheer complexity of class structure proved to be overwhelming”
• Inflexibility of frozen class structure
• Big-endian/Little-endian not solved
[Fleisch HotOS 1997]

Advanced Automation System
• US Federal Aviation Administration
• Replaces 1972 Air Route Traffic Control System
• Started 1982, scrapped 1994 ($6B)
• All-or-nothing
• Changing specifications
• Grandiose expectations
• Contract monitors viewed contractors as adversaries
• Congressional meddling

London Ambulance Service
• Ambulance dispatching
• Started 1991, scrapped in 1992 (20 lives lost in 2 days, 2.5M)
• Unrealistic schedule (5 months)
• Overambitious objectives
• Unidentifiable project manager
• Low bidder had no experience
• No testing/overlap with old system
• Users not consulted during design
[Report of the Inquiry Into The London Ambulance Service 1993]

More, too many to list …
• Portland, Oregon, Water Bureau, 30M, 2002
• Washington D.C., Payroll system, 34M, 2002
• Southwick air traffic control system, $1.6B, 2002
• Sobey’s grocery inventory, 50M, 2002
• King’s County financial mgmt system, 38M, 2000
• Australian submarine control system, 100M, 1999
• California lottery system, 52M
• Hamburg police computer system, 70M, 1998
• Kuala Lumpur total airport management system, $200M, 1998
• UK Dept. of Employment tracking, $72M, 1994
• Bank of America Masternet accounting system, $83M, 1988
• FBI virtual case, 2004
• FBI Sentinel case management software, 2006

Recurring problems
• Excessive generality and ambition
• Bad ideas get included
• Second-system effect
• Mythical Man-Month
• Wrong modularity
• Bad-news diode
• Incommensurate scaling

Fighting back: control novelty
• Source of excessive novelty:
– Second-system effect
– Technology is better
– Idea worked in isolation
– Marketing pressure
• Some novelty is necessary; the difficult part is saying No.
• Don’t be afraid to re-use existing components
– Don’t reinvent the wheel
– Even if it takes some massaging

Fighting back: adopt sweeping simplifications
• Processor, Memory, Communication
• Dedicated servers
• N-level memories
• Best-effort network
• Delegate administration
• Fail-fast, pair-and-compare
• Don’t overwrite
• Transactions
• Sign and encrypt

Fighting back: design for iteration, iterate the design
• Something simple working soon
– Find out what the real problems are
• One new problem at a time
• Use iteration-friendly design
– E.g., Failure/attack models
“Every successful complex system is found to have evolved from a successful simple system”

Fighting back: find bad ideas fast
• Question requirements
– “And ferry itself across the Atlantic” [LHX light attack helicopter]
• Try ideas out, but don’t hesitate to scrap
• Understand the design loop
Requires strong, knowledgeable management

The design loop
• Find flaws fast!
[Figure: design loop with stages initial design, draft design, coding, testing, deployed; feedback times labeled min, hours, days, weeks, months]

Fighting back: find flaws fast
• Plan, plan, plan (CHIPS, Intel processors)
• Simulate, simulate, simulate
– Boeing 777 and F-16
• Design reviews, coding reviews, regression tests, daily/hourly builds, performance measurements
• Design the feedback system:
– Alpha and beta tests
– Incentives, not penalties, for reporting errors

Fighting back: conceptual integrity
• One mind controls the design
– Macintosh
– VisiCalc spreadsheet
– UNIX
– Linux
• Good esthetics yields more successful systems
– Parsimonious, Orthogonal, Elegant, Readable, …
• Few