
DISKS & Storage

Today's topics:
• Faults & RAS
• RAID models
• Some underlying disk technology
  ◦ very brief – more complicated than you might guess
  ◦ more depth will appear in CS7810

Reliability

• RAS
  ◦ reliability – absence of observable faults (hard, soft, human)
    » redundancy is always the key here
  ◦ availability – a system-level concept
    » does it still supply the service?
    » how much degradation under certain fault models?
  ◦ serviceability
    » can the system be repaired while it's running?
      • lots of engineering issues to enable hot-swap

Faults

• Categories
  ◦ HW
    » did something break?
      • several types: wire, component, connector, power supply, cooling, …
  ◦ design
    » a bug in either software or hardware
      • check the known errors in any current µP
        – software workarounds are key until the next fab run
  ◦ operational
    » most common: a screw-up by operations/maintenance staff
  ◦ environmental
    » power or network loss, fire, flood, sabotage, …

Fault Types

• Transient
  ◦ non-recurring
    » causes
      • environmental noise event – lightning
      • alpha particle strike
    » basically impossible to find, so you need to compensate by design
      • parity, CRC, …, reboot
• Intermittent
  ◦ recurring but somewhat rare
    » cross-talk
    » transistor malfunction at a certain temperature (rare)
  ◦ again, compensate by design
• Permanent
  ◦ something just breaks and stays broken
  ◦ finding these is typically easy
  ◦ compensate & service to meet the RAS target

Failure Reality

• The system is what we care about
  ◦ it is the sum of its components – weakest-link theory applies
  ◦ N components fail roughly N times more often than one
    » think early multi-engine airplanes
  ◦ today a small number of components has increased system reliability
  ◦ a somewhat surprising IC property
    » IC failure rate has remained fairly flat
      • even with Moore's-law growth in transistor count
    » we are likely entering a different era
      • how do we build reliable systems from flaky components?
      • a hot current research topic
• Metrics

FIT Metric

• 1 FIT = 1 failure in 10^9 hours
  ◦ FIT ::= failure in time (per billion hours)
    » a billion hours ≈ 114,155 years
    » 3–5 year expected lifetime
    » need ~10^-5 FIT reliability
• MTTF = MTBF
  ◦ calculating MTBF
    » r_i = FIT rate of the i-th component
    » q_i = quantity of the i-th component
    » n = number of distinct component types
    » MTBF = 10^9 / ( Σ_{i=1..n} q_i · r_i ) hours
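A minimal sketch of that calculation in Python (the component list, quantities, and FIT rates below are made-up numbers for illustration only, not figures from the lecture):

    # Sketch: turn per-component FIT rates into a system MTBF (q_i, r_i as above).
    # Component names, quantities, and FIT values are hypothetical examples.

    FIT_HOURS = 1e9  # 1 FIT = 1 failure per 10^9 device-hours

    components = [
        # (q_i: quantity, r_i: FIT rate per component)
        (1,  500),    # CPU          (hypothetical)
        (8,  100),    # DRAM DIMMs   (hypothetical)
        (4,  300),    # disks        (hypothetical)
        (1, 1000),    # power supply (hypothetical)
    ]

    system_fit = sum(q * r for q, r in components)   # sum of q_i * r_i
    mtbf_hours = FIT_HOURS / system_fit              # MTBF = 10^9 / system FIT
    print(f"system FIT rate: {system_fit}")
    print(f"system MTBF: {mtbf_hours:,.0f} hours "
          f"({mtbf_hours / (24 * 365):.1f} years)")
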
Improving Reliability

• Make better parts
  ◦ doable in some cases, and a huge cost adder
• Use fewer parts
  ◦ a natural consequence of higher levels of integration
• Employ redundancy
  ◦ the common choice
    » 2x – OK as long as the two copies agree (detection only)
    » 3x – vote, and 1 copy can fail
    » Nx – vote, and up to ⌈N/2⌉ − 1 copies can fail
  ◦ duplicate what?
    » bits, components, wires, gates, …
    » a huge choice set
      • bits and components are the common choices today
      • wires and gates may be in our future – if intra-IC devices become flaky
• Bottom line – Pandora's box just opened
  ◦ Dan Siewiorek's book is an excellent reference text

Failure Model

• No design makes sense without a reasonable failure model
  ◦ it's amazing how many times this mistake is made
  ◦ how reliable does your system have to be, and what are the consequences of failure?
    » note the difference between a PC and nuclear power plant monitors
  ◦ characterize your components
    » the MTBF equation comes into play
• Examples
  ◦ transistors and wires fail on a chip
    » highly localized
  ◦ noise → burst errors in transmission
  ◦ disk → oxide deterioration affects an area
    » the area is likely to expand over time

Reliability, Disks, and Modern Systems

• Think selfishly
  ◦ which would be the bigger disaster?
    » losing your files
    » losing your PC
    » if they are the same, you really should fix this YESTERDAY
• The point
  ◦ we view disk storage as archival in most cases
  ◦ backups are increasingly on disk
    » commercial archives are often tape-based for "old stuff"
      • cheaper, but a pain in the tuckus to retrieve from the cave
  ◦ checkpoints are always on disk
  ◦ an NVRAM option may be cost-effective in the future
    » more on this next lecture
• So let's look at disk reliability
  ◦ and then a brief glance at the underlying technology

RAID

• 1987 – Redundant Array of Inexpensive Disks
  ◦ Patterson, Gibson, Katz @ UCB
    » Gibson is now at CMU
    » Katz made it happen while he was at DARPA
    » now it's everywhere
• Reliability through redundancy
  ◦ the key idea is to stripe data over more than 1 disk
  ◦ avoid disaster on a single-point failure
    » e.g. head crash, AWOL controller, …
    » even better
      • make sure the disks are physically separate
        – otherwise an EMP or earthquake takes out the whole warehouse
  ◦ the striping model determines the RAID type
    » it also improves access time for large files
      • no additional seeks between tracks
    » it also impacts cost

RAID 0

• No redundancy
  ◦ hence a bit of a misnomer
  ◦ cheap, but unable to withstand a single failure
    » except for those correctable with block CRCs
• The access advantage is the only benefit

[RAID 0 striping diagram; source: Wikipedia]

RAID 1

• Mirroring
  ◦ files are on both disks
  ◦ a CRC check-block option means that if one disk fails, you'll know
    » you're betting that both won't fail concurrently
  ◦ note an interesting option
    » read from whichever disk delivers first
      • if taken, this destroys arm synchronization, which will penalize writes
      • as usual – you want to optimize the common case, which is read access
  ◦ most expensive
    » 2x disks for x capacity
    » w.r.t. RAID 0
      • read energy is minimized – the same as RAID 0
      • write energy doubles over RAID 0
      • large block …
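A rough sketch of the two layouts above (the stripe width, block-level mapping, and function names are illustrative assumptions, not any particular controller's interface): RAID 0 spreads logical blocks round-robin across the disks, while RAID 1 places every block on both disks and can satisfy a read from either copy.

    # Sketch: RAID 0 striping vs. RAID 1 mirroring address maps.
    # Disk count, block numbering, and function names are illustrative only.

    N_DISKS = 4  # assumed RAID 0 stripe width

    def raid0_map(logical_block: int) -> tuple[int, int]:
        """RAID 0: stripe blocks round-robin; no redundancy.
        Returns (disk index, block offset on that disk)."""
        return logical_block % N_DISKS, logical_block // N_DISKS

    def raid1_map(logical_block: int) -> list[tuple[int, int]]:
        """RAID 1: every block lives on both mirrors (2x disks for x capacity);
        writes must hit both copies, a read can be served by either one."""
        return [(0, logical_block), (1, logical_block)]

    for b in range(8):
        print(f"RAID 0: logical block {b} -> (disk, offset) {raid0_map(b)}")
    print(f"RAID 1: logical block 5 lives at {raid1_map(5)}")
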

