Berkeley COMPSCI 294 - Lecture Notes

This preview shows page 1-2-16-17-18-33-34 out of 34 pages.


CS 294-8
Principles of Fault Tolerant Computing
Kathy Yelick
http://www.cs.berkeley.edu/~yelick/294
CS294, Yelick Introduction, p1

Slide titles:
•Today’s Agenda
•Ubiquitous Computing
•Computing Landscape
•The “Post-PC” Era
•The problem space: big data
•Application: Tornado Response
•Application: Smart Buildings
•Application: Earthquakes
•Application: Transportation
•Summary of Post-PC Era
•Slide 12
•eBay Crash
•EECS Servers Crash
•Ariane 5 Rocket Crash
•The Therac-25 Failure
•Tele Denmark
•Lampson: Systems Challenges
•Hennessy: The “New World” Focus
•Slide 20
•Aspects of Reliability
•Availability vs. Reliability
•Fault Recovery
•Two Keys to Availability
•Causes of Faults: Tandem
•Causes of Faults: Vax
•The Fail-Fast Principle
•The Heisenbug Hypothesis
•Slide 29
•What is Fault Tolerance?
•What is Meant by “Principles”?
•Course Goals
•Administrivia
•This Week

Today’s Agenda
•Motivation and trends
•Examples of failures
•Background in reliability computing
•Course Overview
•Administrivia

Ubiquitous Computing
•Computing everywhere:
  –Desktop, laptop, palmtop, cars, cellphones
•Input devices everywhere:
  –Sensors, cameras, microphones
•Connectivity everywhere:
  –Rapid growth of bandwidth in the interior of the net
  –Broadband to the home and office
  –Wireless technologies such as CDMA, satellite, laser
•Increased reliance on computers is inevitable
•Computer systems will become invisible only when they are reliable

Computing Landscape
[Figure; recoverable labels: Scalable, Available Internet Services; Info. appliances; Client; Server; Sensors. Source: Culler, 1999]

The “Post-PC” Era
The post-PC era is built on two technologies:
1) Mobile consumer electronic devices
  –e.g., successor to PDA, cell phone, wearable computers
2) Infrastructure to support such devices
  –e.g., successor to big fat web servers, databases

The problem space: big data
•Big demand for enormous amounts of data
  –today: enterprise and internet applications
    •online applications: e-commerce, mail, web, archives
    •enterprise decision-support, data mining databases
  –future: richer data and more of it
    •computational & storage back-ends for mobile devices
    •more multimedia content
    •more use of historical data to provide better services
•Two key application domains:
  –storage: public, private, and institutional data
  –search: building static indexes, dynamic discovery

Application: Tornado Response
•CAPS at University of Oklahoma
•Currently 12 radars in Oklahoma area
  –Improve warning time: saved 800 lives?
•Two problems:
  –Real-time computation and response
    •Some local to one radar
    •Better algorithms involve coordination
  –Archival of data for experimentation and long-term analyses (“data mining”)
    •Petabytes per year

Application: Smart Buildings
•Buildings adapt to occupants and save energy
  –Save $55 billion in the U.S.
  –Reduce carbon emissions by 35 million metric tons
•Sensors with wireless connections
•Integrated with server to record history information and do prediction

Application: Earthquakes
•Reduce the risk and improve response to earthquakes
•Use millions of MEMS sensors in buildings, ground, bridges, etc.
•Front-end processing in sensor
•Tied to back-end databases and computational models
•Building will “self-diagnose” after an earthquake.

Application: Transportation
•Traveler information service
  –Limited form exists in Europe
  –Prototype from PATH project in LA
•“Mine” sensor data from roads to predict travel times
•Traffic management: apply controls (traffic ramp meters) in real time
•Improve long-term highway planning

Summary of Post-PC Era
•Computing and data in the extremes: tiny devices and enormous “utility-style” servers
•Applications in entertainment and business will continue
•New applications that make computing transparent in the environment require reliability

Today’s Agenda
•Motivation and trends
•Examples of failures
•Background in reliability computing
•Course Overview
•Administrivia

eBay Crash
•eBay: giant internet auction house
  –A top 10 internet business
  –Market value of $22 billion
  –3.8 million users as of March 1999
  –Bidding allowed 24x7
•June 6, 1999
  –eBay system is unavailable for 22 hours, with problems ongoing for several days
  –Stock drops by 6.5%, $3-5 billion lost revenues
  –Problems blamed on Sun server software
•Similar to EECS server downtime?
•Shorter downtimes common

EECS Servers Crash
•Department servers are offline for 3-7 days
  –Cause is disk failure coupled with incompatible RAID software
•Power failure also resulted in lost data in a separate event
  –UPS not purchased due to staff turnover

Ariane 5 Rocket Crash
•Ariane 5 and its payload destroyed about 40 seconds after liftoff
•Error due to software bug:
  –Conversion of floating point to 16-bit int
  –Out-of-range error generated but not handled
•Testing of full system under actual conditions not done due to budget limits
•Estimated cost: 120 million DM
Source: Risks Digest

The Therac-25 Failure
•Therac-25 is a linear accelerator used for radiation therapy
•More dependent on software for safety than predecessors (Therac-20, Therac-6)
•Machine reliably treated thousands of patients, but occasionally there were serious accidents, involving major injuries and 1 death.
•Software problems:
  –No locks on shared variables (race conditions).
  –Timing sensitivity in user interface.
  –Wrap-around on counters.
Source: Fox and Dill, 1999

Tele Denmark
•Tele Denmark Internet, ISP
•August 31, 1999
  –Internet service down for 3 hours
  –Truck drove into the power supply cabinet at Tele Denmark
  –Where were the UPSs?
    •Old ones had been disconnected for upgrade
    •New ones were on the truck!
Source: Risks Digest & rec.humor

Lampson: Systems Challenges
•Systems that work
  –Meet their specs
  –Always available
  –Adapt to environment & evolve over time
  –Made from unreliable components
  –Grow without practical limit
•Credible simulations or analysis
•Writing good specs
•Testing
•Performance
  –Understanding when it
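
As a hedged illustration of the Ariane 5 slide (conversion of floating point to a 16-bit int, with an out-of-range error generated but not handled), the Python sketch below contrasts a conversion whose range error propagates with one whose caller handles it. All function names and the saturating fallback are assumptions made for illustration; this is not the flight software, which the slides do not show.

```python
INT16_MIN, INT16_MAX = -32768, 32767


def to_int16(value: float) -> int:
    """Convert to a signed 16-bit integer, raising if the value does not fit."""
    truncated = int(value)  # truncates toward zero; raises OverflowError on infinity
    if not INT16_MIN <= truncated <= INT16_MAX:
        raise OverflowError(f"{value!r} does not fit in a 16-bit integer")
    return truncated


def to_int16_saturating(value: float) -> int:
    """Hypothetical fallback: clamp to the nearest representable 16-bit value.

    NaN is not handled here; int(nan) raises ValueError.
    """
    if value >= INT16_MAX:
        return INT16_MAX
    if value <= INT16_MIN:
        return INT16_MIN
    return int(value)


def read_unhandled(value: float) -> int:
    """The range error is generated but nothing catches it, so it propagates."""
    return to_int16(value)


def read_handled(value: float) -> int:
    """The range error is caught and replaced by a defined fallback value."""
    try:
        return to_int16(value)
    except OverflowError:
        return to_int16_saturating(value)


if __name__ == "__main__":
    print(read_handled(123.9))
    print(read_handled(40000.0))
    try:
        read_unhandled(40000.0)
    except OverflowError as err:
        print("unhandled path would have failed:", err)
```

Whether clamping is an acceptable fallback depends on the system; the sketch only shows that the out-of-range case needs an explicit, tested handler somewhere in the call chain.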

