DISKS & Storage
CS6810, School of Computing, University of Utah

Today's topics:
• Faults & RAS
• RAID models
• Some underlying disk technology
  - very brief – more complicated than you might guess
  - more depth will appear in CS7810

Reliability

• RAS
  - reliability – absence of observable faults (hard, soft, human)
    » redundancy is always the key here
  - availability – a system-level concept
    » does it still supply the service?
    » how much degradation under certain fault models?
  - serviceability
    » can the system be repaired while it's running?
• lots of engineering issues to enable hot-swap

Faults

• Categories
  - HW
    » did something break?
      • several types: wire, component, connector, power supply, cooling, …
  - design
    » bug in either software or hardware
      • check the known errors in any current uP
        – software workarounds are key until the next fab run
  - operational
    » most common: a screw-up by operations/maintenance staff
  - environmental
    » power or network loss, fire, flood, sabotage, …

Fault Types

• Transient
  - non-recurring
    » causes
      • environmental noise event – lightning
      • alpha particle strike
    » basically impossible to find, so you need to compensate by design
      • parity, CRC, …, reboot
• Intermittent
  - recurring but somewhat rare
    » cross-talk
    » transistor malfunction at a certain temperature, which is rare
  - again, compensate by design
• Permanent
  - something just breaks and stays broken
  - finding these is typically easy
  - compensate & service to meet the RAS target

Failure Reality

• The system is what we care about
  - sum of its components – weakest-link theory applies
  - N components fail N times more often
    » think early multi-engine airplanes
  - today a small number of components has increased system reliability
  - somewhat surprising IC property
    » IC failure rate has remained fairly flat
      • even w/ Moore's-law growth of transistors
    » we are likely entering a different era
      • how to build reliable systems from flaky components?
      • a hot current research topic
• Metrics

FIT Metric

• 1 FIT = 1 failure in 10^9 hours
  - FIT ::= failures in time (per billion hours)
    » a billion hours = 114,155 years
    » 3-5 year expected lifetime
    » need ~10^-5 FIT reliability
• MTTF = MTBF
  - calculating MTBF
    » r_i = FIT rate of the ith component
    » q_i = quantity of the ith component
    » n = number of component types
    » MTBF = 10^9 / (q_1·r_1 + q_2·r_2 + … + q_n·r_n) hours
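The FIT-to-MTBF arithmetic above reduces to a few lines. The sketch below is illustrative only, not part of the lecture: it assumes the standard series (sum-of-FITs) model for MTBF, and the component quantities and FIT rates are made-up example values.

# Minimal sketch of the FIT / MTBF bookkeeping (illustrative values only).

HOURS_PER_YEAR = 24 * 365      # 8,760 hours
BILLION_HOURS = 1e9            # 1 FIT = 1 failure per 10^9 device-hours

# Sanity check of the number quoted on the FIT Metric slide:
print(BILLION_HOURS / HOURS_PER_YEAR)   # ~114,155 years

# Hypothetical bill of materials: (quantity q_i, FIT rate r_i) per component type.
components = [
    (1, 500.0),   # e.g. one disk drive
    (8,  50.0),   # e.g. eight DRAM DIMMs
    (1, 100.0),   # e.g. one power supply
]

# Series ("weakest link") model: system FIT = sum of q_i * r_i,
# and MTBF = 10^9 / system FIT hours.
system_fit = sum(q * r for q, r in components)
mtbf_hours = BILLION_HOURS / system_fit
print(f"system FIT = {system_fit:.0f}, "
      f"MTBF = {mtbf_hours:.0f} hours (~{mtbf_hours / HOURS_PER_YEAR:.1f} years)")

Note how the quantity term matters: the eight hypothetical DIMMs contribute almost as much to the system FIT as the single disk, which is the "N components fail N times more often" point from the Failure Reality slide.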
Improving Reliability

• Make better parts
  - doable in some cases & a huge cost adder
• Use fewer parts
  - a natural consequence of higher levels of integration
• Employ redundancy
  - common choices
    » 2x – OK as long as the copies agree
    » 3x – vote, and 1 can fail
    » Nx – vote, and (N/2)-1 can fail
  - duplicate what?
    » bits, components, wires, gates, …
    » huge choice set
      • bits and components are common choices today
      • wires and gates may be in our future – if intra-IC devices become flaky
• Bottom line – Pandora's box just opened
  - Dan Siewiorek's book is an excellent reference text

Failure Model

• No design makes sense without a reasonable failure model
  - amazing how many times this mistake is made
  - how reliable does your system have to be, and what are the consequences of failure?
    » note the difference between a PC and nuclear power plant monitors
  - characterize your components
    » the MTBF equation comes into play
• Examples
  - transistors and wires fail on a chip
    » highly localized
  - noise
    » burst errors in transmission
  - disk
    » oxide deterioration affects an area
      • the area is likely to expand over time

Reliability, Disks, and Modern Systems

• Think selfishly
  - what would be the bigger disaster?
    » losing your files
    » losing your PC
    » if they are the same, you really should fix this YESTERDAY
• The point
  - we view disk storage as archival in most cases
  - backups are increasingly on disk
    » commercial archives are often tape-based for "old stuff"
      • cheaper, but a pain in the tuckus to retrieve from the cave
  - checkpoints are always on disk
  - an NVRAM option may be cost-effective in the future
    » more on this next lecture
• So let's look at disk reliability
  - and then a brief glance at the underlying technology

RAID

• 1987 – Redundant Array of Inexpensive Disks
  - Patterson, Gibson, Katz @ UCB
    » Gibson is now at CMU
    » Katz made it happen while he was at DARPA
    » now it's everywhere
• Reliability through redundancy
  - the key idea is to stripe data over more than 1 disk
  - avoid disaster on a single-point failure
    » e.g. head crash, AWOL controller, …
    » even better
      • make sure the disks are physically separate – an EMP or earthquake can take out a whole warehouse
  - the striping model determines the RAID type
    » it also improves access time for large files
      • no additional seeks between tracks
    » it also impacts cost

RAID 0

• No redundancy
  - hence a bit of a misnomer
  - cheap, but unable to withstand a single failure
    » except for those correctable w/ block CRCs
• the access advantage is the only benefit (see the mapping sketch below)

[figure: RAID 0 striping diagram – source: Wikipedia]

RAID 1

• Mirroring
  - files are on both disks
  - a CRC check-block option means you'll know if one disk fails
    » you're betting that both won't fail concurrently
  - note an interesting option
    » read from whichever disk delivers first
      • if taken, this destroys arm synchronization, which will penalize writes
      • as usual – you want to optimize the common case, which is read access
  - most expensive
    » 2x disks for x capacity
    » w.r.t. RAID 0
      • read energy minimized – same as RAID 0
      • write energy doubles over RAID 0
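To make the striping and mirroring models concrete, here is a toy block-mapping sketch. It is not from the lecture: the disk count, block numbering, and function names are assumptions for illustration, and real arrays stripe in multi-block chunks and add CRC/rebuild machinery that is omitted here.

# Toy address mapping for RAID 0 (striping) and RAID 1 (mirroring).
# Hypothetical layout: N_DISKS spindles, one block per stripe unit.

N_DISKS = 4

def raid0_map(logical_block):
    """RAID 0: blocks are striped round-robin across all disks (no redundancy)."""
    disk = logical_block % N_DISKS       # which spindle holds the block
    offset = logical_block // N_DISKS    # block index on that spindle
    return disk, offset

def raid1_map(logical_block):
    """RAID 1: each block lives on both disks of the mirrored pair.
    A write must update both copies; a read can be served by whichever
    copy delivers first (the read optimization noted on the RAID 1 slide)."""
    return [(0, logical_block), (1, logical_block)]

for lb in range(8):
    print(lb, "RAID0 ->", raid0_map(lb), "RAID1 ->", raid1_map(lb))

The RAID 0 mapping shows where the large-file access advantage comes from: consecutive blocks land on different spindles, so a long sequential transfer keeps every disk busy. The RAID 1 mapping shows why capacity cost and write energy double while read energy stays the same.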


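Returning to the N-way redundancy options on the Improving Reliability slide: a majority voter is the usual way the "3x – vote, and 1 can fail" case is realized. The sketch below is illustrative only; the replica values and the voter function are assumptions, not anything from the lecture.

# Majority voting over N redundant copies (triple-modular redundancy when N = 3).
# With N replicas, a strict majority can out-vote a minority of failed copies.

from collections import Counter

def vote(replicas):
    """Return the majority value, or None if no strict majority exists."""
    value, count = Counter(replicas).most_common(1)[0]
    return value if count > len(replicas) / 2 else None

# Hypothetical example: copies of a stored word, one corrupted by a fault.
print(vote([0xCAFE, 0xCAFE, 0x0BAD]))   # 0xCAFE wins 2-to-1 -> 51966
print(vote([0xCAFE, 0x0BAD]))           # 2x copies disagree: detect but not correct -> None

This is also where the "Nx – vote, and (N/2)-1 can fail" line comes from: a strict majority needs more than N/2 agreeing copies, so for even N up to (N/2)-1 copies can fail and still be out-voted.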