Berkeley COMPSCI 294 - Recovery Oriented Computing

Slide 1: Recovery Oriented Computing
Dave Patterson
University of California at Berkeley
[email protected]
http://roc.CS.Berkeley.EDU/
September 2001

Slide 2: Outline
• What have we been doing
• Motivation for a new challenge: making things work (including endorsements)
• What have we learned
• New challenge: Recovery-Oriented Computing
• Examples: benchmarks, prototypes

Slide 3: Goals, Assumptions of the Last 15 Years
• Goal #1: Improve performance
• Goal #2: Improve performance
• Goal #3: Improve cost-performance
• Assumptions
  – Humans are perfect (they don't make mistakes during installation, wiring, upgrade, maintenance, or repair)
  – Software will eventually be bug-free (good programmers write bug-free code)
  – Hardware MTBF is already very large (~100 years between failures), and will continue to increase

Slide 4: After 15 Years of Improving Performance
• Availability is now a vital metric for servers!
  – Near-100% availability is becoming mandatory
    » for e-commerce, enterprise apps, online services, ISPs
  – But service outages are frequent
    » 65% of IT managers report that their websites were unavailable to customers over a 6-month period; 25% reported 3 or more outages
  – Outage costs are high
    » social effects: negative press, loss of customers who "click over" to a competitor
Source: InternetWeek, 4/3/2000

Slide 5: Downtime Costs (per Hour)
• Brokerage operations: $6,450,000
• Credit card authorization: $2,600,000
• eBay (1 outage, 22 hours): $225,000
• Amazon.com: $180,000
• Package shipping services: $150,000
• Home shopping channel: $113,000
• Catalog sales center: $90,000
• Airline reservation center: $89,000
• Cellular service activation: $41,000
• Online network fees: $25,000
• ATM service fees: $14,000
Sources: InternetWeek, 4/3/2000; Fibre Channel: A Comprehensive Introduction, R. Kembel, 2000, p. 8 ("...based on a survey done by Contingency Planning Research").

Slide 6: Jim Gray: Trouble-Free Systems
• Manager
  – Sets goals
  – Sets policy
  – Sets budget
  – System does the rest
• Everyone is a CIO (Chief Information Officer)
• Build a system
  – used by millions of people each day
  – administered and managed by a half-time person
    » On hardware fault, order replacement part
    » On overload, order additional equipment
    » Upgrade hardware and software automatically
"What Next? A Dozen Remaining IT Problems," Turing Award Lecture, FCRC, May 1999. Jim Gray, Microsoft.

Slide 7: Lampson: Systems Challenges
• Systems that work
  – Meeting their specs
  – Always available
  – Adapting to a changing environment
  – Evolving while they run
  – Made from unreliable components
  – Growing without practical limit
• Credible simulations or analysis
• Writing good specs
• Testing
• Performance
  – Understanding when it doesn't matter
"Computer Systems Research: Past and Future," keynote address, 17th SOSP, Dec. 1999. Butler Lampson, Microsoft.

Slide 8: Hennessy: What Should the "New World" Focus Be?
• Availability
  – Both appliance & service
• Maintainability
  – Two functions:
    » Enhancing availability by preventing failure
    » Ease of SW and HW upgrades
• Scalability
  – Especially of service
• Cost
  – Per device and per service transaction
• Performance
  – Remains important, but it's not SPECint
"Back to the Future: Time to Return to Longstanding Problems in Computer Systems?" Keynote address, FCRC, May 1999. John Hennessy, Stanford.

Slide 9: The Real Scalability Problems: AME
• Availability
  – Systems should continue to meet quality-of-service goals despite hardware and software failures
• Maintainability
  – Systems should require only minimal ongoing human administration, regardless of scale or complexity; today, the cost of maintenance is 10X the cost of purchase
• Evolutionary growth
  – Systems should evolve gracefully in terms of performance, maintainability, and availability as they are grown/upgraded/expanded
• These are problems at today's scales, and will only get worse as systems grow

Slide 10: Total Cost of Ownership (IBM)
• HW management: 3%
• Environmental: 14%
• Downtime: 20%
• Purchase: 20%
• Administration: 13%
• Backup/restore: 30%
  – Administration: all people time
  – Backup/restore: devices, media, and people time
  – Environmental: floor space, power, air conditioning

Slide 11: Lessons Learned from Past Projects That Might Help AME
• We know how to improve performance (and cost)
  – Run the system against a workload, measure, innovate, repeat
  – Benchmarks standardize workloads, lead to competition, and evaluate alternatives; they turn debates into numbers
• Major improvements in hardware reliability
  – Disk MTBF: from 50,000 hours in 1990 to 1,200,000 hours in 2000
  – PC motherboards: from 100,000 to 1,000,000 hours
• Yet everything has an error rate
  – Well-designed and manufactured HW: >1% fail/year
  – Well-designed and tested SW: >1 bug / 1,000 lines
  – Well-trained people doing routine tasks: 1%-2% error rate
  – Well-run collocation site (e.g., Exodus): 1 power failure per year, 1 network outage per year

Slide 12: Lessons Learned from Past Projects for AME
• Maintenance of machines (with state) is expensive
  – ~5X to 10X the cost of the HW
  – Stateless machines can be trivial to maintain (Hotmail)
• System admins primarily keep the system available
  – System + clever human working during failure = uptime
  – They also plan for growth, perform software upgrades and configuration, fix performance bugs, and do backups
• Software upgrades are necessary, but dangerous
  – SW bugs get fixed and new features added, but what about stability?
  – Admins try to skip upgrades, or to be the last to apply one

Slide 13: Lessons Learned from the Internet
• Realities of the Internet service environment:
  – Hardware and software failures are inevitable
    » hardware reliability is still imperfect
    » software reliability is thwarted by rapid evolution
    » Internet system scale exposes second-order failure modes
  – System failure modes cannot be modeled or predicted
    » commodity components do not fail cleanly
    » black-box system design thwarts models
    » unanticipated failures are normal
  – Human operators are imperfect
    » human error accounts for ~50% of all system failures
Sources: Gray86, Hamilton99, Menn99, Murphy95, Perrow99, Pope86

Slide 14: Lessons Learned from Past Projects for AME
[Chart: Cause of System Crashes, 1985/1993/2001 - categories: hardware failure (est.), operating system failure, system management (actions + N/problem), other (app, power, network failure)]
• Failures due to people are up, but hard to measure
  – VAX crashes in '85 and '93 [Murp95]; extrapolated to '01
  – HW/OS: 70% of crashes in '85, down to 28% in '93; in '01, 10%?
  – How get administrator to
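The figures on Slides 5, 9, and 11 can be combined into a quick back-of-the-envelope calculation. The sketch below is not from the slides; it only assumes the standard steady-state definition of availability (MTTF / (MTTF + MTTR)) and independent device failures, and plugs in the deck's numbers (1,200,000-hour disk MTBF, $6.45M/hour brokerage downtime cost). The fleet size and the MTTF/MTTR split are illustrative choices.

```python
# Back-of-the-envelope availability math using the slide figures.
# Assumptions (not from the slides): steady-state availability model,
# independent device failures, and an illustrative 1,000-disk fleet.

HOURS_PER_YEAR = 24 * 365  # 8,760

def availability(mttf_hours: float, mttr_hours: float) -> float:
    """Steady-state availability = MTTF / (MTTF + MTTR)."""
    return mttf_hours / (mttf_hours + mttr_hours)

def annual_downtime_hours(avail: float) -> float:
    """Expected unavailable hours per year at a given availability."""
    return (1.0 - avail) * HOURS_PER_YEAR

def expected_fleet_failures_per_year(n_devices: int, mtbf_hours: float) -> float:
    """Expected failures/year for n independent devices with the given MTBF."""
    return n_devices * HOURS_PER_YEAR / mtbf_hours

# Slide 11: disk MTBF improved from 50,000 h (1990) to 1,200,000 h (2000).
# Even so, a hypothetical 1,000-disk site still sees several failures a year:
print(expected_fleet_failures_per_year(1000, 1_200_000))  # ~7.3 failures/year

# Slide 5: brokerage downtime costs ~$6,450,000/hour. At "three nines"
# (illustrative MTTF=999 h, MTTR=1 h), the expected annual outage bill is:
a = availability(mttf_hours=999, mttr_hours=1)   # 0.999
print(annual_downtime_hours(a) * 6_450_000)      # ~8.76 h/year -> ~$56.5M/year
```

This is why the deck argues that hardware MTBF gains alone do not solve the problem: at scale, failures remain routine, and the per-hour cost of each one dominates.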

