CS6810 School of Computing University of Utah

DISKS & Storage
• Today’s topics:
    Faults & RAS
    RAID models
    Some underlying disk technology
        » very brief – more complicated than you might guess
        » more depth will appear in CS7810

Reliability
• RAS
    reliability – absence of observable faults (hard, soft, human)
        » redundancy is always the key here
    availability – system level concept
        » does it still supply the service
        » how much degradation under certain fault models
    serviceability
        » can the system be repaired while it’s running
            • lots of engineering issues to enable hot-swap

Faults
• Categories
    HW
        » did something break
            • several types: wire, component, connector, power supply, cooling, …
    design
        » bug in either software or hardware
            • check known errors in any current microprocessor
                – software workarounds are key until next fab run
    operational
        » most common: screw-up by operations/maintenance staff
    environmental
        » power or network loss, fire, flood, sabotage, …

Fault Types
• Transient
    » non-recurring
    » causes
        • environmental noise event – lightning
        • alpha particle strike
    » basically impossible to find so you need to compensate by design
        • parity, CRC, …, reboot
• Intermittent
    » recurring but somewhat rare
        • cross-talk
        • transistor malfunction at a certain temp that is rare
    » again compensate by design
• Permanent
    » something just breaks and stays broken
    » finding these is typically easy
    » compensate & service to meet RAS target

Failure Reality
• System is what we care about
    sum of its components – weakest link theory applies
    N components fail N times more often
        » think early multi-engine airplanes
    today, small numbers of components have increased system reliability
    somewhat surprising IC property
        » IC failure rate has remained fairly flat
            • even w/ Moore’s law growth of transistors
        » we are likely
          entering a different era
            • how to build reliable systems from flaky components?
            • hot current research topic
• Metrics

FIT Metric
• 1 FIT = 1 failure in 10^9 hours
    FIT ::= failure in time (billion hours)
        » billion hours = 114,155 years
        » 3-5 year expected lifetime
        » need ~10^-5 FIT reliability
• MTTF = MTBF
    calculating MTBF
        » r_i is the FIT rate of the ith component
        » q_i is the quantity of the ith component
        » n is the total number of components

Improving Reliability
• Make better parts
    doable in some cases & huge cost adder
• Use fewer parts
    natural consequence of higher levels of integration
• Employ redundancy
    common choice
        » 2x – OK as long as we agree
        » 3x – vote and 1 can fail
        » Nx – vote and ⌈N/2⌉-1 can fail (a majority must still agree)
    duplicate what?
        » bits, components, wires, gates, …
        » huge choice set
            • bits and components are common choices today
            • wires and gates may be in our future
                – if intra-IC devices become flaky
• Bottom line – Pandora’s box just opened
    Dan Siewiorek’s book is an excellent reference text

Failure Model
• No design makes sense without a reasonable failure model
    amazing how many times this mistake is made
    how reliable does your system have to be & what are the consequences of failure
        » note difference between PC and nuclear power plant monitors
    characterize your components
        » MTBF equation comes into play
• Examples
    transistors and wires fail on a chip
        » highly localized
    noise burst errors in transmission
    disk oxide deterioration affects an area
        » area likely to expand over time

Reliability, Disks, and Modern Systems
• Think selfishly
    what would be a bigger disaster
        » losing your files
        » losing your PC
        » if they are the same, you really should fix this YESTERDAY
• The point
    we view disk storage as archival in most cases
    backups are increasingly on disk
        » commercial archives are
          often tape based for “old stuff”
            • cheaper but a pain in the tuckus to retrieve from the cave
    checkpoints are always on disk
    NVRAM option may be cost effective in the future
        » more on this next lecture
• So let’s look at disk reliability and then a brief glance at the underlying technology

RAID
• 1987 – Redundant array of inexpensive disks
    Patterson, Gibson, Katz @ UCB
        » Gibson now at CMU
        » Katz made it happen while he was at DARPA
        » now it’s everywhere
• Reliability through redundancy
    key idea is to stripe data over more than 1 disk
    avoid disaster on a single point failure
        » e.g. head crash, AWOL controller, …
        » even better
            • make sure disks are physically separate
                – EMP or earthquake takes out a warehouse
    striping model determines RAID type
        » also improves access time for large files
            • no additional seeks between tracks
        » also impacts cost

RAID 0
• No redundancy
    hence a bit of a misnomer
    cheap but unable to withstand a single failure
        » except for those correctable w/ block CRCs
            • access advantage is the only benefit
    source: Wikipedia

RAID 1
• Mirroring
    files on both disks
    CRC check block option says if one disk fails you’ll know
        » you’re betting that both won’t fail concurrently
    note interesting option
        » read disk that delivers first
            • if taken this destroys arm synchronization which will penalize writes
            • as usual – you want to optimize the common case which is read access
    most expensive
        » 2x disks for x capacity
        » w.r.t. RAID 0
            • read energy minimized – same as RAID 0
            • write energy doubles over RAID 0
            • large block
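
A minimal sketch of the striping (RAID 0) and mirroring (RAID 1) ideas from the RAID slides, assuming in-memory lists stand in for disks and a failed disk is modeled as None. The function names (stripe, unstripe, mirror, read_mirror) are hypothetical, and round-robin block placement is an assumed striping policy, not something the slides specify.

```python
def stripe(blocks, n_disks):
    """RAID 0 style: block i is placed on disk i % n_disks (round-robin)."""
    disks = [[] for _ in range(n_disks)]
    for i, block in enumerate(blocks):
        disks[i % n_disks].append(block)
    return disks


def unstripe(disks):
    """Reassemble striped data; any failed disk (None) loses the data."""
    if any(d is None for d in disks):
        raise OSError("RAID 0: a disk failed and there is no redundancy")
    n = len(disks)
    total = sum(len(d) for d in disks)
    return [disks[i % n][i // n] for i in range(total)]


def mirror(blocks):
    """RAID 1 style: two full copies, so 2x disks for x capacity."""
    return [list(blocks), list(blocks)]


def read_mirror(disks):
    """Read from the first surviving copy; fails only if both are gone."""
    for d in disks:
        if d is not None:
            return list(d)
    raise OSError("RAID 1: both copies failed")


if __name__ == "__main__":
    data = ["b0", "b1", "b2", "b3", "b4"]

    striped = stripe(data, 2)
    print(striped)            # blocks alternate between the two disks
    print(unstripe(striped))  # original order is recovered

    mirrored = mirror(data)
    mirrored[0] = None        # simulate a single-disk failure
    print(read_mirror(mirrored))  # data survives on the other copy
```

The sketch ignores CRC checks, arm synchronization, and energy, which the slides also discuss; it only shows why one failure is fatal for striping alone but survivable with mirroring.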
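
For the FIT Metric slide, a rough sketch of an MTBF calculation. It assumes the usual series model in which component FIT rates add: system FIT is the sum of r_i × q_i over the n component types, and MTBF = 10^9 / system FIT hours. That formula is an assumption here, and the parts list is made up purely for illustration.

```python
HOURS_PER_YEAR = 24 * 365      # ignores leap years
FIT_HOURS = 1e9                # 1 FIT = 1 failure per 10^9 hours


def system_fit(components):
    """Sum r_i * q_i over (fit_rate, quantity) pairs (assumed series model)."""
    return sum(r * q for r, q in components)


def mtbf_hours(components):
    """MTBF in hours under the assumption that FIT rates simply add."""
    total = system_fit(components)
    if total <= 0:
        raise ValueError("total FIT rate must be positive")
    return FIT_HOURS / total


if __name__ == "__main__":
    # Hypothetical parts list: (FIT rate r_i, quantity q_i)
    parts = [(100.0, 4), (50.0, 16), (1000.0, 2)]

    print("billion hours in years:", int(FIT_HOURS / HOURS_PER_YEAR))
    print("system FIT:", system_fit(parts))
    print("MTBF (hours):", mtbf_hours(parts))
    print("MTBF (years):", mtbf_hours(parts) / HOURS_PER_YEAR)
```

Because the rates add, more components means a shorter MTBF, which is the "N components fail N times more often" point from the Failure Reality slide.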