TAMU CSCE 614 - Lec06-IO2

CPSC614: Graduate Computer Architecture
I/O 2: Failure Terminology, Examples, Gray Paper and a little Queueing Theory
Prof. Lawrence Rauchwerger
Based on lectures by Prof. David A. Patterson, UC Berkeley

Review Storage
• Disks:
  – Extraordinary advance in capacity/drive, $/GB
  – Currently 17 Gbit/sq. in.; can continue past 100 Gbit/sq. in.?
  – Bandwidth, seek time not keeping up: 3.5 inch form factor makes sense? 2.5 inch form factor in near future? 1.0 inch form factor in long term?
• Tapes
  – No investment, must be backwards compatible
  – Are they already dead?
  – What is a tapeless backup system?

Review: RAID Techniques: Goal was performance, popularity due to reliability of storage
• Disk Mirroring, Shadowing (RAID 1)
  – Each disk is fully duplicated onto its "shadow"
  – Logical write = two physical writes
  – 100% capacity overhead
• Parity Data Bandwidth Array (RAID 3)
  – Parity computed horizontally
  – Logically a single high data bw disk
• High I/O Rate Parity Array (RAID 5)
  – Interleaved parity blocks
  – Independent reads and writes
  – Logical write = 2 reads + 2 writes

Outline
• Reliability Terminology
• Examples
• Discuss Jim Gray’s Turing paper

Definitions
• Examples of why precise definitions are so important for reliability
• Is a programming mistake a fault, error, or failure?
  – Are we talking about the time it was designed or the time the program is run?
  – If the running program doesn’t exercise the mistake, is it still a fault/error/failure?
• If an alpha particle hits a DRAM memory cell, is it a fault/error/failure if it doesn’t change the value?
  – Is it a fault/error/failure if the memory doesn’t access the changed bit?
  – Did a fault/error/failure still occur if the memory had error correction and delivered the corrected value to the CPU?

IFIP Standard terminology
• Computer system dependability: quality of delivered service such that reliance can be placed on service
• Service is observed actual behavior as perceived by other system(s) interacting with this system’s users
• Each module has ideal specified behavior, where service specification is agreed description of expected behavior
• A system failure occurs when the actual behavior deviates from the specified behavior
• A failure occurs because of an error, a defect in a module
• The cause of an error is a fault
• When a fault occurs it creates a latent error, which becomes effective when it is activated
• When an error actually affects the delivered service, a failure occurs (time from error to failure is error latency)

Fault v. (Latent) Error v. Failure
• A fault creates one or more latent errors
• Properties of errors are
  – a latent error becomes effective once activated
  – an error may cycle between its latent and effective states
  – an effective error often propagates from one component to another, thereby creating new errors
• An effective error is either a formerly-latent error in that component or it propagated from another error
• A component failure occurs when the error affects the delivered service
• These properties are recursive, and apply to any component in the system
• An error is the manifestation in the system of a fault; a failure is the manifestation on the service of an error

Fault v. (Latent) Error v. Failure
• An error is the manifestation in the system of a fault; a failure is the manifestation on the service of an error
• Is a programming mistake a fault, error, or failure?
  – Are we talking about the time it was designed or the time the program is run?
  – If the running program doesn’t exercise the mistake, is it still a fault/error/failure?
• A programming mistake is a fault
• The consequence is an error (or latent error) in the software
• Upon activation, the error becomes effective
• When this effective error produces erroneous data which affect the delivered service, a failure occurs

Fault v. (Latent) Error v. Failure
• An error is the manifestation in the system of a fault; a failure is the manifestation on the service of an error
• If an alpha particle hits a DRAM memory cell, is it a fault/error/failure if it doesn’t change the value?
  – Is it a fault/error/failure if the memory doesn’t access the changed bit?
  – Did a fault/error/failure still occur if the memory had error correction and delivered the corrected value to the CPU?
• An alpha particle hitting a DRAM can be a fault
• If it changes the memory, it creates an error
• The error remains latent until the affected memory word is read
• If the error in the affected word affects the delivered service, a failure occurs

Fault v. (Latent) Error v. Failure
• An error is the manifestation in the system of a fault; a failure is the manifestation on the service of an error
• What if a person makes a mistake, data is altered, and service is affected?
• fault:
• error:
• latent:
• failure:

Fault Tolerance vs Disaster Tolerance
• Fault-Tolerance (or more properly, Error-Tolerance): mask local faults (prevent errors from becoming failures)
  – RAID disks
  – Uninterruptible Power Supplies
  – Cluster Failover
• Disaster Tolerance: masks site errors (prevent site errors from causing service failures)
  – Protects against fire, flood, sabotage, ...
  – Redundant system and service at remote site.
  – Use design diversity
From Jim Gray’s “Talk at UC Berkeley on Fault
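
The fault, latent error, effective error, failure chain used in the DRAM slides can be sketched in a few lines of Python. This is only an illustrative reading of the terminology, not something from the lecture: the function name `classify_strike`, the `Outcome` record, and the three boolean inputs are hypothetical, and treating a value corrected by error correction as "no failure" is an assumption drawn from the definition that a failure is a deviation of the delivered service.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Outcome:
    fault: bool            # the triggering event happened
    error_created: bool    # stored state is wrong (starts out latent)
    error_effective: bool  # the latent error was activated
    failure: bool          # delivered service deviated from the specification


def classify_strike(bit_flipped: bool, word_read: bool, ecc_corrects: bool) -> Outcome:
    """Classify one alpha-particle strike on a DRAM cell (hypothetical model)."""
    fault = True                                # the strike itself is the fault
    error_created = bit_flipped                 # an error exists only if the bit changed
    error_effective = error_created and word_read   # activation: the bad word is read
    # Assumption: if ECC delivers the corrected value, the service is unaffected.
    failure = error_effective and not ecc_corrects
    return Outcome(fault, error_created, error_effective, failure)


if __name__ == "__main__":
    # Bit flipped, never read: the error stays latent.
    print(classify_strike(bit_flipped=True, word_read=False, ecc_corrects=False))
    # Bit flipped, read, no ECC: the error reaches the CPU and becomes a failure.
    print(classify_strike(bit_flipped=True, word_read=True, ecc_corrects=False))
    # Bit flipped, read, ECC corrects: effective error, but no failure.
    print(classify_strike(bit_flipped=True, word_read=True, ecc_corrects=True))
```

Under this model the three questions on the slide get three different answers: a strike that changes nothing is a fault with no error, an unread flipped bit is a latent error, and a read that ECC corrects is an effective error that is masked before it becomes a failure.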

