HPDC 15, 21 June 2006

Fault Tolerance of Tornado Codes for Archival Storage
Matthew Woitaszek
matthew.woitaszek@colorado.edu

Key Points
- Tornado Codes can provide fault tolerant storage
  - Single site applications: better than RAID
  - Distributed applications: better than replication
  - Verify, then trust
- Tornado Codes can work with existing Data Grids
  - Low computational overhead
  - Behind-the-scenes, server-side fault tolerance
  - No reason to alter interfaces: use existing Data Grid tools

Outline
- Background and Motivation
- Experimental Method and Results
- Applications to Distributed and Federated Storage
- Conclusions and Future Work

Storage for High Performance Computing
- Archival Storage
  - Tape silo systems
- Working Storage
  - Disk systems
- Collaborative Storage
  - Massive disk or tape arrays
  - Shared using Grid technology
  - Distributed data stewarding
[Figure: system diagram; labels: Enterprise 6000 (two systems), Archive Management, Origin 3200, Grid Front End, GridFTP Servers]

Motivation: Filesystem Features
- Typical desired features
  - Performance: no waiting
  - Availability: no downtime
  - Reliability: no data loss
  - Inexpensive: no cost
  - Scalability: no limits
- Apply Tornado Codes to distributed archival storage
  - Prime directive: never lose data
  - Optimize performance and resource utilization
  - Leverage existing legacy technology
  - Support emerging technologies and solutions (Grid, MAID)

LDPC Codes and Tornado Codes
- Low-Density Parity-Check (LDPC) codes
  - Gallager (1963), Luby (1990s)
- Cascaded irregular LDPC graphs are Tornado Codes
  - Average degree of connectivity
  - Distribution of degree
  - Randomly generated
- Simple XOR operation has low computational overhead
[Figure: bipartite graph of Data Nodes and Check Nodes; node labels A through H]

Storage Applications: Safety and Speed
- Tornado Code advantages
  - Fault tolerance
  - Performance optimizations
- Retrieve C or G
  - If one node is dead, the decision is easy
  - If both nodes are available, choose the node with less waiting
[Figure: Data Nodes and Check Nodes; node labels A, B, C, G]

Related Work
Typhoon
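
Illustrative sketch (not from the slides): the XOR check-node operation and the "Retrieve C or G" choice described on the earlier slides can be shown in a few lines of Python. The relation G = A XOR B XOR C is an assumption made for illustration, since the slides show the node labels but not the graph edges. The function names (`make_check`, `recover_missing`, `choose_node`) and the use of a per-node waiting time in seconds are hypothetical as well; this is a toy single-check example, not a Tornado Code implementation.

```python
from functools import reduce


def xor_blocks(blocks):
    """XOR a sequence of equal-length byte blocks together, byte by byte."""
    blocks = list(blocks)
    if not blocks:
        raise ValueError("need at least one block")
    length = len(blocks[0])
    if any(len(b) != length for b in blocks):
        raise ValueError("all blocks must have the same length")
    return bytes(reduce(lambda x, y: x ^ y, column) for column in zip(*blocks))


def make_check(data_blocks):
    """Build one check block as the XOR of its neighboring data blocks.

    Assumption: the check node is connected to exactly these data nodes.
    """
    return xor_blocks(data_blocks)


def recover_missing(check_block, surviving_blocks):
    """Rebuild the single missing data block of a check node.

    XOR-ing the check block with every surviving neighbor cancels them out
    and leaves the missing block.
    """
    return xor_blocks([check_block, *surviving_blocks])


def choose_node(waits):
    """Pick the node to read from.

    `waits` maps a node name to its estimated waiting time in seconds,
    or to None if the node is dead (hypothetical interface).
    """
    available = {name: w for name, w in waits.items() if w is not None}
    if not available:
        raise RuntimeError("no node available")
    return min(available, key=available.get)


if __name__ == "__main__":
    a, b, c = b"\x01\x02", b"\x10\x20", b"\xff\x00"
    g = make_check([a, b, c])            # assumed: G = A xor B xor C
    assert recover_missing(g, [a, b]) == c
    print(choose_node({"C": None, "G": 4.0}))   # only G is alive
    print(choose_node({"C": 3.0, "G": 1.5}))    # both alive, G waits less
```

Reading through G is only useful if the other neighbors of G (here A and B) are also available, so a fuller version would compare the direct read of C against the slowest block needed for the rebuild.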

