ECE 254 / CPS 225
Fault Tolerant and Testable Computing Systems
Introduction
Copyright 2004 Daniel J. Sorin
Duke University

General Course Information
• Professor: Daniel J. Sorin
  – [email protected]
  – http://www.ee.duke.edu/~sorin
• Course info
  – http://www.ee.duke.edu/~sorin/ece254/
• Office hours
  – 1111 Hudson Hall
  – Times??
• Prerequisite
  – ECE 152 or CPS 104 or knowing how a computer works

Course Objectives
• Learn about fault-tolerant computer systems
  – Both hardware and software (slight emphasis on hardware)
• Learn how to read/evaluate research papers
• Learn how to perform research

Course Guidelines
• Students are responsible for:
  – Homework - 35% of grade
  – Midterm exam - 20% of grade
  – Final exam - 25% of grade
  – Individual or group project - 20% of grade

Project
• The project is a half-semester-long assignment
  – Written proposal (no more than 3 pages), due Oct 21
  – Written progress report (<= 3 pages), Nov 11
  – Final document in conference/journal format (<= 10 pages), Nov 30
  – Final presentation (in class), Nov 30 / Dec 2
• Groups of 2 or 3 are OK
• Get started early! Talk to me about project ideas.
  – In case you missed it: GET STARTED EARLY!

Academic Misconduct
• I will not tolerate academically dishonest work. This includes cheating on homework or exams and plagiarism on the project.
• Be careful on the project to cite prior work and to give proper credit to others' research.
• Ask me if you have any questions. Not knowing the rules does not make misconduct OK!

Course Topics
• Introduction
  – Terminology and metrics
  – Faults and their causes
• General Fault Tolerance Concepts
  – Redundancy (spatial, temporal, information, etc.)
  – Error detection
  – Forward recovery & backward recovery
• Applying Theory to Real Systems
  – Hardware
  – Software
• Modeling/Evaluation
• Testing and Design for Test

Outline (of Intro)
• Motivation for Fault Tolerance
• Goals of Fault Tolerant Computing
• Challenges (why this isn’t easy)
• Some Examples

Motivation
• Extreme fault tolerance has always been around
  – NASA’s deep space probes
  – Medical computing devices (e.g., pacemakers)
  – But this had been a niche market until recently
• But now fault tolerance is becoming more important
  – More reliance on computers
    • Extreme fault tolerance
      – Car controllers (e.g., anti-lock brakes), etc.
    • High fault tolerance
      – Commercial servers (databases, web servers), file servers, etc.
    • Some fault tolerance
      – Desktops, laptops (really!), PDAs, etc.

Sh*t Happens
• We wouldn’t need fault tolerance otherwise!
• Physical problems
  – Melted wire, toasted chip, etc.
• Design flaws
  – Incorrect logic (e.g., Pentium’s FDIV), buggy software
• Operator error
  – Incorrect software installation, accidental use of rm -R *

Failures During Lifetime
• Three phases of system lifetime
  – Infant mortality
  – Normal lifetime
  – Wear-out period
• Failures follow famous “bathtub curve”
[Figure: bathtub curve. Vertical axis: failure rate. Time-axis labels: ~20 weeks, 5-25 years. From Figure 1-4 in textbook.]

Outline
• Motivation for Fault Tolerance
• Goals of Fault Tolerant Computing
• Challenges (why this isn’t easy)
• Some Examples

Goals of Fault Tolerant Systems
• How can we deal with problems?
• Option 1: Make problems less likely
  – Tough to do!
  – Testing and design for test (DFT) can help avoid physical defects
  – Careful design reviews can help avoid design flaws
  – Training and practice can help avoid operator error
• Option 2: Fail, but don’t corrupt anything
  – Example: ATM should shut down instead of passing out money
• Option 3: Transparently tolerate problems
  – Use hardware and/or software to mask fault effects
  – Key: use redundancy (a.k.a. spares or backups)
  – Example: having a co-pilot on an airplane

Metrics of Goodness: Reliability
• Reliability: R(t) = probability that the system has been operating correctly and continuously from time 0 until time t, given that it was operating correctly at time 0
• Useful for measuring systems that can’t be repaired or that will cause a catastrophe if they fail
  – Examples: satellites, pacemakers
• One related measurement of reliability is the Mean Time To Failure (MTTF)
• Related metric is Failures in Time (FIT)
  – FIT = number of failures per 10^9 hours
  – (No, I don’t know where this name came from, or why 10^9 hours)

Metrics of Goodness: Availability
• Availability: A(t) = probability that the system is operating correctly at time t
• Useful for measuring systems that can be repaired or that aren’t mission critical
  – Examples: file servers, desktops
• Availability = MTTF / (MTTF + MTTR)
  – MTTR = Mean Time To Repair
  – Mean Time Between Failures (MTBF): MTBF := MTTF + MTTR
• One unit of measurement is the “number of nines”
  – E.g., 5 nines means that A(t) = 0.99999
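
The reliability and availability metrics above can be worked through numerically. The following is a minimal Python sketch, not part of the original slides. All function names are hypothetical. Two of the helpers rest on an assumption the slides do not state: that the failure rate is constant, so that MTTF = 10^9 / FIT hours and R(t) = exp(-t / MTTF). The availability helper computes the long-run ratio MTTF / (MTTF + MTTR) rather than a time-dependent A(t).

```python
import math

HOURS_PER_YEAR = 24 * 365  # assumption: a 365-day year


def availability(mttf_hours: float, mttr_hours: float) -> float:
    """Availability = MTTF / (MTTF + MTTR), as on the slide."""
    return mttf_hours / (mttf_hours + mttr_hours)


def mtbf(mttf_hours: float, mttr_hours: float) -> float:
    """MTBF := MTTF + MTTR, as on the slide."""
    return mttf_hours + mttr_hours


def number_of_nines(a: float) -> float:
    """Express availability as a count of nines: -log10(1 - A)."""
    return -math.log10(1.0 - a)


def downtime_minutes_per_year(a: float) -> float:
    """Expected downtime per year implied by availability A."""
    return (1.0 - a) * HOURS_PER_YEAR * 60


def fit_to_mttf_hours(fit: float) -> float:
    """Convert FIT (failures per 10^9 hours) to MTTF in hours.

    ASSUMPTION (not from the slides): constant failure rate.
    """
    return 1e9 / fit


def reliability(t_hours: float, mttf_hours: float) -> float:
    """R(t) = exp(-t / MTTF).

    ASSUMPTION (not from the slides): constant failure rate,
    i.e. the flat middle section of the bathtub curve.
    """
    return math.exp(-t_hours / mttf_hours)


if __name__ == "__main__":
    a = availability(mttf_hours=9999.0, mttr_hours=1.0)
    print("availability:", a)
    print("nines:", number_of_nines(a))
    print("downtime (min/year) at 5 nines:", downtime_minutes_per_year(0.99999))
    print("MTTF (hours) at 1000 FIT:", fit_to_mttf_hours(1000.0))
```

With these definitions, a system with an MTTF of 9999 hours and an MTTR of 1 hour has four nines of availability, and five nines leaves roughly 5.26 minutes of downtime per year.
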

Other Metrics of Goodness
• Safety: won’t fail in a dangerous way
• Maintainability: ease of maintaining system
  – Somewhat vague metric
• Testability: ease of testing system
  – Can quantify it by how many possible faults we can test for
  – Will cover this in more detail towards end of semester
• Dependability: ???
  – A truly vague term, which can be useful, e.g., the International Symposium on Dependable Systems and Networks (DSN)
• Lots of -abilities … few of which are clearly and consistently defined

Outline
• Motivation for Fault Tolerance
• Goals of Fault Tolerant Computing
• Challenges (why this isn’t easy)
• Some Examples

Why This Isn’t Easy
• Fault tolerance can be solved to any arbitrary degree if you’re willing to throw resources at the problem
• Resources:
  – System performance
  – Cost
  – Power
• Example: laptop running PowerPoint
  – Buy 2 different laptops (Pentium 4 and an iMac)
  – Run them

