
ECE 254 / CPS 225
Real Systems: Fault Tolerant Hardware
Introduction
Copyright 2004 Daniel J. Sorin
Duke University

Outline
• Microprocessors
• Memory
• Disks
• Networks
• Multiprocessors

Microprocessor Errors/Failures
• Error models
  – Transient stuck-at (bit flip) on transistor or wire
  – Hard stuck-at on transistor or wire
  – Chipkill: whole chip is dead (e.g., due to power/ground short)
• Failure models
  – Incorrect instruction → trap/exception
  – Incorrect output
  – Dead chip (no output and/or smoke output)

Microprocessor Fault Tolerance
• There ain’t much!
• Most common microprocessors are designed to maximize performance per dollar
  – Intel Pentium4
  – Intel Itanium
  – Sun UltraSPARC III
  – IBM Power4
  – Compaq Alpha (discontinued)
• Microprocessors may have some limited error detection/correction in their L2 or L3 caches
• Note: microprocessors are designed with hardware for performing built-in self-test (BIST). We will cover this topic towards the end of the semester.

Fault Tolerance in Custom Microprocessors
• Most systems are built from commodity microprocessors
  – Off-the-shelf parts are cost-efficient
  – And, even if they’re not very reliable individually, we can design reliable systems out of unreliable parts (remember Teramac!)
• However, custom microprocessors may be built for those systems which require very high availability and/or reliability
• Examples:
  – DEC VAX (see chapter 7)
  – IBM mainframe microprocessors (e.g., G5 and G6)

Fault Tolerance in the VAX
• DEC’s VAX was a very successful family of systems
  – Follow-ons to DEC’s PDP-11 computer
  – Forerunner of the DEC/Compaq/Intel Alpha processor (now dead)
  – VAX is known today for being the epitome of CISC-ness
• Could detect and sometimes tolerate many faults
  – Illegal instruction execution
  – Trying to access restricted memory
  – Arithmetic exceptions (which may be due to faults)
  – Power failure
  – Etc.
• Tries to provide info with trap/interrupt
  – Places fault type info into a known location
• Maintains registers specifically for error monitoring

More About the VAX (1978-1987)
• Early VAX-11/750 and VAX-11/780 had the following FT
  – Built-in self-test (executed at power-on)
  – ECC on main memory
  – Multiple-bit parity on cache, TLB, and a few other structures
  – Parity bits on the SBI (synchronous backplane interconnect = bus)
  – Field-replaceable unit (FRU) is the chip (instead of board)
• In the later VAX 8600 and 8700, more FT was added
  – Instruction retry
  – Better diagnostics through error logging and analysis
  – Online self-test of floating point unit (F-box in VAX lingo)
  – Error handling via a microcode routine (“micro-routine”)
  – Micro-diagnostics to self-test system and diagnose faults to FRUs
  – System diagnostic bus (SDB) for console control/observation

IBM RAS
• “RAS Strategy for IBM S/390 G5 and G6” (Mueller et al.)

Transient Memory Errors
• Transient error models
  – Single bit error (single event upset: SEU)
  – Burst of bit errors (errors in contiguous bits)
• We used to only worry about DRAM, but now we have to worry about soft errors in SRAM, too
  – Remember the Ziegler paper!

Permanent Memory Errors
• Error models
  – Single bit or multi-bit stuck-at
  – Memory chip failure (“chipkill”)
• Chipkill memory failures
  – Chipkill is a fail-stop permanent error/failure model
  – Only applies to memory that is not on the processor chip
    » Off-chip L2 or L3 cache
    » DRAM main memory

Tolerating Transient Memory Errors
• Almost uniformly tolerated with EDC/ECC
• If EDC, then we need a higher-level mechanism to recover from errors
  – So then why use EDC instead of ECC?
• What kinds of EDC/ECC are appropriate for our transient error models?
  – Parity
    » Single bit
    » Multiple bit
    » Two-dimensional
  – CRC
  – Hamming code
• Which EDC/ECC are NOT appropriate for memory?

Tolerating Permanent Memory Bit Errors
• Caches (SRAM) and memories (DRAM) inherently have lots of redundancy
  – Lots of bits … why not just provision some spares?
  – Then, if a hard fault is detected, map out faulty bits and replace with spare bits
  – Disks have been doing this for a long time, but this is a relatively recent development for cache and DRAM
• Design issue: granularity of mapping
  – What is the field-replaceable unit?
    » Bit
    » Row
    » Column
  – What are the trade-offs in choosing a granularity?

Tolerating Chipkill Memory Errors
• Requires that we can reconstruct the data on the dead chip from redundant data on other chips
• Should sound a bit like RAID protection for disks

RAID-M / Chipkill Memory
• “A White Paper on the Benefits of Chipkill-Correct ECC for PC Server Main Memory” (Dell)

Disk Errors
• Error models
  – Transient single bit error
  – Transient burst of bit errors
  – Permanently bad sector (from defect or fault)
    » In general, disks don’t consider finer granularities
  – Permanently bad disk (because of storage medium or controller)

Disk Fault Tolerance
• Disks are often considered the “stable storage” on which we save critical data
  – E.g., databases write their important data to disks
• We sometimes back up critical disk systems with tape
  – E.g., your home directory for your account on EE or CS system
  – Periodically (e.g., nightly, weekly) log diffs to tape
• Disks are generally protected with
  – Information redundancy (EDC/ECC)
  – Physical redundancy

Disk Physical Redundancy
• Physical redundancy at different granularities
• Sector-level redundancy
  – Disks come with more sectors than specified
  – Can map out a sector with a hard fault and
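
Returning to the reconstruction idea on the Tolerating Chipkill Memory Errors slide, which the slides compare to RAID protection for disks: a minimal sketch of how a dead unit's data might be rebuilt from redundant data on the surviving units. It assumes simple bytewise XOR parity held on one extra chip (or disk). This is an illustration of the principle only; real chipkill-correct memory typically uses symbol-based ECC rather than plain XOR, and the names `xor_parity` and `reconstruct` are hypothetical.

```python
from functools import reduce


def xor_parity(chunks):
    """Return the bytewise XOR of a list of equal-length byte strings."""
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*chunks))


def reconstruct(surviving_chunks, parity):
    """Rebuild the one missing chunk from the survivors plus the parity chunk.

    XOR-ing the parity with every surviving chunk cancels their
    contributions and leaves only the missing chunk's bytes.
    """
    return xor_parity(list(surviving_chunks) + [parity])


# Hypothetical example: four data chips plus one parity chip (assumed layout).
data_chips = [b"\x12\x34", b"\xab\xcd", b"\x0f\xf0", b"\x55\xaa"]
parity_chip = xor_parity(data_chips)

dead = 2  # index of the chip that failed (fail-stop, so we know which one)
survivors = [chunk for i, chunk in enumerate(data_chips) if i != dead]
recovered = reconstruct(survivors, parity_chip)
```

The sketch relies on the failure being fail-stop, as the chipkill model on the slides states: the system knows which chip is dead, so a single parity chip is enough to rebuild it. Locating an unknown faulty chip would need a stronger code.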

