Berkeley COMPSCI 294 - Recovery Oriented Computing

Recovery Oriented Computing
Outline
Goals, Assumptions of last 15 years
After 15 years improving Performance
Downtime Costs (per Hour)
Jim Gray: Trouble-Free Systems
Lampson: Systems Challenges
Hennessy: What Should the “New World” Focus Be?
The real scalability problems: AME
Total Cost of Ownership (IBM)
Lessons learned from Past Projects which might help AME
Lessons learned from Past Projects for AME
Lessons learned from Internet
Lessons learned from Past Projects for AME
Lessons learned from Past Projects for AME
Learning from other fields: PSTN
Lessons learned from Past Projects for AME
Lessons Learned from Other Fields
Lessons Learned from Other Fields
Lessons Learned from Other Fields
Human Error
Human Error: Automation irony
Other Fields
Lessons Learned from Other Cultures
Outline
Recovery-Oriented Computing Hypothesis
Tentative ROC Principles: #1 Isolation and Redundancy
Tentative ROC Principles #2 Online verification
Tentative ROC Principles #3 Undo support
Tentative ROC Principles #4 Diagnosis Support
Overview towards AME via ROC
Rest of Talk
What about claims of 5 9s?
“Microsoft fingers technicians for crippling site outages”
What is uptime of HP.com?
Traditional HA vs. Internet reality
How does ROC differ from Fault Tolerant Computing (FTC)?
Benchmarking availability
Example: single-fault in SW RAID
Software RAID: QoS behavior
Software RAID: QoS behavior
Software RAID: maintainability
Initial Applications
Conclusion
An Approach to Recovery-Oriented Computers (ROC)
An Approach to ROC
An Approach to ROC
An Approach to ROC
An Approach to ROC
ISTORE-1 Brick
Cost of Bandwidth, Safety
Disk Limit: Bus Hierarchy
Clusters and TPC Software 8/’00
Clusters and TPC-C Benchmark
Cost of Storage System v. Disks
SCSI v. IDE $/GB
Availability benchmark methodology
Stage 4: Diagnosis aids
Diagnosis aids
Total Cost of Ownership

Slide 1
Recovery Oriented Computing
Dave Patterson
University of California at Berkeley
[email protected]
http://roc.CS.Berkeley.EDU/
September 2001

Slide 2
Outline
• What have we been doing
• Motivation for a new Challenge: making things work (including endorsements)
• What have we learned
• New Challenge: Recovery-Oriented Computer
• Examples: benchmarks, prototypes

Slide 3
Goals, Assumptions of last 15 years
• Goal #1: Improve performance
• Goal #2: Improve performance
• Goal #3: Improve cost-performance
• Assumptions
  – Humans are perfect (they don’t make mistakes during installation, wiring, upgrade, maintenance or repair)
  – Software will eventually be bug free (good programmers write bug-free code)
  – Hardware MTBF is already very large (~100 years between failures), and will continue to increase

Slide 4
After 15 years improving Performance
• Availability is now a vital metric for servers!
  – near-100% availability is becoming mandatory
    » for e-commerce, enterprise apps, online services, ISPs
  – but, service outages are frequent
    » 65% of IT managers report that their websites were unavailable to customers over a 6-month period
      • 25%: 3 or more outages
  – outage costs are high
    » social effects: negative press, loss of customers who “click over” to competitor
Source: InternetWeek 4/3/2000

Slide 5
Downtime Costs (per Hour)
• Brokerage operations           $6,450,000
• Credit card authorization      $2,600,000
• Ebay (1 outage 22 hours)         $225,000
• Amazon.com                       $180,000
• Package shipping services        $150,000
• Home shopping channel            $113,000
• Catalog sales center              $90,000
• Airline reservation center        $89,000
• Cellular service activation       $41,000
• On-line network fees              $25,000
• ATM service fees                  $14,000
Source: InternetWeek 4/3/2000 + Fibre Channel: A Comprehensive Introduction, R. Kembel 2000, p.8. “...based on a survey done by Contingency Planning Research.”

Slide 6
Jim Gray: Trouble-Free Systems
• Manager
  – Sets goals
  – Sets policy
  – Sets budget
  – System does the rest.
• Everyone is a CIO (Chief Information Officer)
• Build a system
  – used by millions of people each day
  – Administered and managed by a ½ time person.
    » On hardware fault, order replacement part
    » On overload, order additional equipment
    » Upgrade hardware and software automatically.
“What Next? A dozen remaining IT problems”
Turing Award Lecture, FCRC, May 1999
Jim Gray, Microsoft

Slide 7
Lampson: Systems Challenges
• Systems that work
  – Meeting their specs
  – Always available
  – Adapting to changing environment
  – Evolving while they run
  – Made from unreliable components
  – Growing without practical limit
• Credible simulations or analysis
• Writing good specs
• Testing
• Performance
  – Understanding when it doesn’t matter
“Computer Systems Research-Past and Future”
Keynote address, 17th SOSP, Dec. 1999
Butler Lampson, Microsoft

Slide 8
Hennessy: What Should the “New World” Focus Be?
• Availability
  – Both appliance & service
• Maintainability
  – Two functions:
    » Enhancing availability by preventing failure
    » Ease of SW and HW upgrades
• Scalability
  – Especially of service
• Cost
  – per device and per service transaction
• Performance
  – Remains important, but it’s not SPECint
“Back to the Future: Time to Return to Longstanding Problems in Computer Systems?”
Keynote address, FCRC, May 1999
John Hennessy, Stanford

Slide 9
The real scalability problems: AME
• Availability
  – systems should continue to meet quality of service goals despite hardware and software failures
• Maintainability
  – systems should require only minimal ongoing human administration, regardless of scale or complexity: Today, cost of maintenance = 10X cost of purchase
• Evolutionary Growth
  – systems should evolve gracefully in terms of performance, maintainability, and availability as they are grown/upgraded/expanded
• These are problems at today’s scales, and will only get worse as systems grow

Slide 10
Total Cost of Ownership (IBM)
HW management      3%
Environmental     14%
Downtime          20%
Purchase          20%
Administration    13%
Backup Restore    30%
• Administration: all people time
• Backup Restore: devices, media, and people time
• Environmental: floor space, power, air conditioning

Slide 11
Lessons learned from Past Projects which might help AME
• Know how to improve performance (and cost)
  – Run system against workload, measure, innovate, repeat
  – Benchmarks standardize workloads, lead to competition, evaluate alternatives; turns debates into numbers
• Major improvements in Hardware Reliability
  – 1990 Disks 50,000 hour MTBF to 1,200,000 in 2000
  – PC motherboards from 100,000 to 1,000,000 hours
• Yet Everything has an error rate
  – Well designed and manufactured
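
The disk MTBF figures on Slide 11 can be related to the point that "everything has an error rate" with a little arithmetic. The sketch below is only an illustration, not something taken from the slides: it assumes a constant failure rate (so expected failures per year is roughly hours per year divided by MTBF), and the function names and the fleet size of 1,000 disks are hypothetical.

```python
# Illustrative only: assumes a constant failure rate, which the slides do not state.
HOURS_PER_YEAR = 24 * 365  # 8,760 hours; leap years ignored


def mtbf_years(mtbf_hours: float) -> float:
    """Convert an MTBF quoted in hours into years between failures for one unit."""
    return mtbf_hours / HOURS_PER_YEAR


def expected_failures_per_year(mtbf_hours: float, units: int) -> float:
    """Expected number of failures per year across `units` identical devices."""
    if mtbf_hours <= 0:
        raise ValueError("mtbf_hours must be positive")
    return units * HOURS_PER_YEAR / mtbf_hours


if __name__ == "__main__":
    fleet = 1000  # hypothetical fleet size, chosen only for the example
    # MTBF values are the 1990 and 2000 disk figures quoted on Slide 11.
    for label, mtbf in (("1990 disk", 50_000), ("2000 disk", 1_200_000)):
        print(
            f"{label}: {mtbf_years(mtbf):.1f} years MTBF per unit, "
            f"{expected_failures_per_year(mtbf, fleet):.1f} "
            f"expected failures/year across {fleet} disks"
        )
```

Under this simplified model, even a 1,200,000-hour MTBF leaves a fleet of 1,000 disks expecting about seven failures a year, so large installations still see failures regularly.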

