Duke ECE 254 - Fault Tolerant and Testable Computing Systems

ECE 254 / CPS 225
Fault Tolerant and Testable Computing Systems
Faults and Their Causes
Copyright 2004 Daniel J. Sorin
Duke University

Outline
• Intro and Terminology
• Causes of Faults
• Fault Models
• Two Research Papers

Why We Study Faults
• Know thy enemy!
• If we don’t understand faults, it is much more difficult to design systems that can tolerate them
  – We at least have to know how to model them

Faults
• Fault: incorrect state of hardware or software resulting from physical defect, design flaw, or operator error
• Faults introduced during system design
  – Pentium’s incorrect floating point division design
  – Bug in software could cause infinite loop
• Faults introduced during manufacturing
  – Bad solder connection between chip pin and motherboard
  – Broken wire within chip
• Faults that occur during operation
  – Cosmic ray knocks charge off DRAM cell
  – System administrator incorrectly installs new software

Errors
• Error: manifestation of a fault
  – Bit in main memory is a 0 instead of a 1 (due to cosmic ray)
  – Software pointer that mistakenly points to NULL (due to bug)
• But not all faults lead to errors!
  – Trees falling in empty forests don’t make sounds
• Examples of masked faults
  – Cosmic ray knocks charge off logic signal, but after it had been correctly latched in and saved
  – Buggy software that isn’t reached

Failures
• Failure: system level effect of an error (user-visible)
  – System produces incorrect result of computation (e.g., 2+2=5)
  – System “hangs” (e.g., Blue Screen of Death)
• Not all errors lead to failures!
• Examples of masked errors
  – Bit flip in memory location that’s not accessed again
  – NULL pointer that’s not referenced again
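The distinction between an error that becomes a failure and an error that stays masked can be sketched with a toy memory model. This is only an illustration: the helper names `flip_bit` and `run`, the memory contents, and the choice of which addresses are read are all assumptions made up for the example, not part of the course material.

```python
# Toy model (all names and values are hypothetical, for illustration only).

def flip_bit(word: int, bit: int) -> int:
    """Model a transient fault by inverting one bit of a stored word."""
    return word ^ (1 << bit)

def run(memory: list[int], addresses_read: list[int]) -> int:
    """The 'computation': add up only the words the program actually reads."""
    return sum(memory[a] for a in addresses_read)

golden = [2, 2, 7]              # fault-free memory contents
reads = [0, 1]                  # the program computes memory[0] + memory[1]
expected = run(golden, reads)   # fault-free result

# Case 1: the fault hits a word that is read later.
# The corrupted word is the error; the wrong sum is the user-visible failure.
hit_live = golden.copy()
hit_live[1] = flip_bit(hit_live[1], 0)          # stored 2 becomes 3
failure = run(hit_live, reads) != expected      # 2 + 3 instead of 2 + 2

# Case 2: the fault hits a word that is never read again.
# The state is still wrong (an error), but the result is unaffected (masked).
hit_dead = golden.copy()
hit_dead[2] = flip_bit(hit_dead[2], 0)          # stored 7 becomes 6
masked = run(hit_dead, reads) == expected

print("failure observed:", failure)
print("error masked:", masked)
```

In both cases the same kind of fault occurs; what differs is whether the corrupted state is ever consumed by the computation.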
Fault → Error → Failure Examples
• Cosmic ray knocks charge off of DRAM cell
  → Error: bit flip in memory
  → Failure: computation produces incorrect result
• Software bug could allow for NULL pointer
  → Bug gets exercised and we get NULL pointer
  → Program seg faults when it tries to access pointer

Duration of Faults/Errors
• Transient (soft): occurs once and disappears
  – E.g., Cosmic ray knocks charge off transistor → bit flip
  – Tend to be due to transient physical phenomena
  – Also known as Single Event Upset (SEU)
• Intermittent: occurs occasionally
  – E.g., Loose connection → occasionally open circuit
  – E.g., Bug in little-used software for rounding → incorrect data
• Permanent (hard): occurs and doesn’t go away
  – E.g., Broken connection → always open circuit

Masking
• Logical
  – E.g., if a fault flips a bit from 0 to 1 and it is then ANDed with a bit that is 0, then the fault cannot manifest itself as an error
• Functional
  – E.g., incorrect data is produced by an instruction that gets squashed due to a branch misprediction
  – E.g., the destination register of a NOP is corrupted by a fault

Outline
• Intro and Terminology
• Causes of Faults and Trends
• Fault Models
• Two Research Papers

Physical Defects: Transient Phenomena
• Cosmic radiation (refer to Ziegler’s paper)
  – High energy particles that constantly bombard Earth
  – May have enough energy to disrupt charge on transistor (Qcrit)
  – Used to be only a problem for DRAM, but becoming a problem for SRAM and even for logic (as Qcrit decreases)
  – Trends:
    » Qcrit decreasing
    » Prob increasing that a cosmic ray that hits a transistor will disrupt its charge
    » Transistor size decreasing → smaller probability that a cosmic ray will hit a particular transistor
    » More transistors per system → greater probability of fault
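The trends listed for cosmic radiation pull in different directions, so a rough calculation can help show how they might combine. The sketch below assumes that upsets in different transistors are independent and that each transistor is upset with probability p = P(hit) × P(disrupt | hit) over some fixed interval, giving a system-level probability of 1 − (1 − p)^N for N transistors. The function name and every number used are made-up assumptions chosen only to illustrate the shape of the trade-off; they are not measured data.

```python
import math

def system_upset_probability(p_hit: float,
                             p_disrupt_given_hit: float,
                             n_transistors: int) -> float:
    """Probability that at least one of n independent transistors is upset.

    Assumes independence between transistors (a simplification).
    """
    p = p_hit * p_disrupt_given_hit
    # 1 - (1 - p)^N, written with log1p/expm1 to stay accurate for tiny p.
    return -math.expm1(n_transistors * math.log1p(-p))

# Hypothetical "older" process: larger transistors, higher Qcrit, fewer devices.
older = system_upset_probability(p_hit=1e-12,
                                 p_disrupt_given_hit=0.1,
                                 n_transistors=100_000_000)

# Hypothetical "newer" process: each transistor is hit less often (smaller),
# a hit is more likely to disrupt it (lower Qcrit), and there are more of them.
newer = system_upset_probability(p_hit=0.5e-12,
                                 p_disrupt_given_hit=0.3,
                                 n_transistors=400_000_000)

print(f"older process: {older:.3e}")
print(f"newer process: {newer:.3e}")
```

With these invented numbers the smaller per-transistor hit probability is outweighed by the lower Qcrit and the larger transistor count, so the system-level probability rises. Different assumed values could change the size of the effect.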
Physical Defects: Transient Phenomena
• Alpha particle radiation
  – Similar to cosmic rays, but radiation comes from metal decay
  – Often, the metal housing of the computer is the source
  – Lead solder joints also a problem → want to use “old lead”
  – Trends (same as for cosmic radiation):
    » Qcrit decreasing
    » Prob increasing that an alpha particle that hits a transistor will disrupt its charge
    » Transistor size decreasing → smaller probability that an alpha particle will hit a particular transistor
    » More transistors per system → greater probability of fault

Physical Defects: Transient Phenomena
• Electromagnetic Interference (EMI)
  – Electromagnetic waves from other sources (e.g., microwave oven, power lines, etc.) can cause transient disruptions
  – EMI can induce electrical current on wires and thus change the signals on wires
• There are other sources of transient faults, but they tend to be less significant

Physical Defects: Manufacturing Defects
• Manufacturing is not a perfect process, especially for microprocessors
  – It’s not easy to manufacture something with dimensions on the order of <100nm
  – Many stages of chip processing which have to be done perfectly and avoid contamination
• And testing doesn’t filter out all defective systems
  – Often impossible to test for every possible defect in a reasonable amount of time
  – Also, testing won’t detect defects that don’t manifest immediately

Physical Defects: Manufacturing Defects
• Manufacturing flaws
  – Bad solder connection between chip and board
  – VLSI defects (e.g., broken wire, bad via, etc. – see ECE 261)
  – Trends:
    » Flaws may decrease as manufacturing process matures
    » But flaws increase at start of each new process
    » Tougher to avoid VLSI defects as dimensions shrink
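One way to get a feel for why testing cannot cover everything in a reasonable amount of time is to count input patterns. Exhaustively exercising a combinational block with n inputs takes 2^n patterns, which is only a proxy for "every possible defect" but already grows out of reach quickly. The sketch below is a back-of-the-envelope estimate; the function name and the tester rate of one billion patterns per second are assumptions for illustration, not figures from the course.

```python
SECONDS_PER_YEAR = 365 * 24 * 3600

def exhaustive_test_years(n_inputs: int, patterns_per_second: float) -> float:
    """Years needed to apply all 2^n input patterns at the given rate."""
    patterns = 2 ** n_inputs
    return patterns / patterns_per_second / SECONDS_PER_YEAR

# Hypothetical tester applying 1e9 patterns per second.
for n in (20, 32, 64):
    years = exhaustive_test_years(n, 1e9)
    print(f"{n:2d} inputs: {years:.3e} years")
```

At the assumed rate, a 20-input block is covered in about a millisecond, while a 64-input block would take centuries, which is why practical testing relies on fault models and selected test patterns instead of exhaustive coverage.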
Physical Defects: Manufacturing Defects
• VLSI fabrication process variability
  – During fab, there’s some amount of variability in dimensions
    » Thickness of gate oxide dielectric
    » Length of channel
    » Area of via
    » Etc.
[Figure: transistor cross-section labeled gate, gate oxide, source, drain, channel, and n+ regions]

Physical Defects: Manufacturing Defects
• Variability can lead to undesirable behavior
  – Gate thickness falls below usable threshold → leakage current
  – Wire resistance is too high → signal

