CS 152 Computer Architecture and Engineering Lecture 16 Error Correcting Codes 2006 10 24 John Lazzaro www cs berkeley edu lazzaro TAs Udam Saini and Jue Sun www inst eecs berkeley edu cs152 CS 152 L16 Error Correcting Codes UC Regents Fall 2006 UCB F05 TA Notes Final Project Checkof The SyncMeisters had everything working besides the test file hammer However hammer is the hardest test so in a sense their project was a bug or two away from working Writing back to the regfile when a stall occurs on the cache just before or after needing to write seems to mess up their processor The other tests worked fine on board CS 152 L16 Error Correcting Codes UC Regents Fall 2006 UCB When cache bugs make it to product Testing our financial trading system we found a case where our software would get a bad calculation Once a week or so Eventually the problem turned out to be a failure in a CPU cache line refresh This was a hardware design fault in the PC The test suite included running the code for two weeks at maximum update rate without error so this bug was found CS 152 L16 Error Correcting Codes Eric Ulevik UC Regents Fall 2006 UCB Today Computing in an imperfect world Detecting and correcting RAM bit errors Replacing lost network packets recovering from disk drive failure Detecting arbitrary bit errors in network packets CS 152 L16 Error Correcting Codes UC Regents Fall 2006 UCB DRAM Challenge Cosmic Rays Bit Line n Word Line Vdd Cell capacitor holds 25 000 electrons or less Cosmic rays that constantly bombard us can release the oxide charge oxide n pCosmic ray CS 152 L16 Error Correcting Codes UC Regents Fall 2006 UCB Can this happen in SRAM Gnd Vdd Cosmic ray discharge Gnd sC Vdd Gnd A race Can P1 restore middle node to Vdd before P2 flips CS 152 L16 Error Correcting Codes Vdd P1 P2 Gnd UC Regents Fall 2006 UCB Practical efect of a cosmic ray ADDIU R1 R0 7 SW R1 100 R0 Address 100 0b00 0111 Cosm ic ray hit LW R1 100 R0 Address 100 0b00 0011 After LW R1 holds 3 but it should 7 Bit flipshold on memory holding instructions are bad too CS 152 L16 Error Correcting Codes UC Regents Fall 2006 UCB To detect errors add P a parity bit Extra parity bit for every word Not seen by software Hardware computes it on every write so that the number of 1 s in every P 33 bit word is even even parity Address 100 0b00 0111 1 Does this work if two bits flip If three Cosm ic ray hit Address 100 0b00 0011 1 On a read count the number of 1s If odd a bithalt flipped So the program and reboot Application may know if this bit matters CS 152 L16 Error Correcting Codes UC Regents Fall 2006 UCB Error Correction Hamming Codes Richard Hamming Computing pioneer Famous quote Computers are not for numbers Computers are CS 152 L16 Error Correcting Codes UC Regents Fall 2006 UCB Trick Compute parity of subsets of bits Consider 4 bit words D D D D 0 1 10 Add 3 parity bits P P P Each parity bit computed on a subset of bits P D xor D xor D 0 P D xor D xor D 0 P D xor D xor D 0 1 xor 1 xor 1 xor 1 0 xor 0 1 xor 0 1 xor Use this word bit arrangement D D D P D P P 0 1 1 0 01 1 Just believe for now we will justify later CS 152 L16 Error Correcting Codes UC Regents Fall 2006 UCB Case 1 No cosmic ray hits D D D P D P P 0 1 1 0 01 1 We No errors write D D D P D P P Later we but how do 0 1 1 0 01 1 read we know On readout we compute that P xor D xor D xor D 0 xor 0 xor 1 xor 1 0 C P xor D xor D xor D 1 xor 0 xor 1 xor 0 0 C P xor D xor D xor D 1 xor 0 xor 1 xor 0 0 C 1 0 If C C C 0 no errors These equations come from how we computed P P P P D xor D xor D 0 P D xor D xor D 0 P D xor D xor D 0 CS 152 L16 Error Correcting Codes 1 xor 1 xor 1 xor 1 0 xor 0 1 xor 0 1 xor UC Regents Fall 2006 UCB Case 2 A cosmic ray hits D D D P D P P 0 1 1 0 01 1 We Cosmic ray write hit D1 But D D D P D P P Later we how do we 0 1 0 0 01 1 read know that On readout we compute P xor D xor D xor D 0 xor 0 xor 1 xor 0 1 C C C C b101 5 P xor D xor D xor D 1 xor 0 xor 1 xor 0 0 C P xor D xor D xor D 1 xor 0 xor 0 xor 0 1 C What 1 0 Note we number the least significant bit with 1 not 0 0 is reserved for no errors CS 152 L16 Error Correcting Codes 7 654 3 2 1 D D D P D P P 0 1 0 0 01 1 does 5 mean The position of the flipped bit To repair just UC Regents Fall 2006 UCB Why did we choose 3 parity bits Consider 4 bit words D D D D 0 1 10 Add 3 parity bits P P P A C in C C C exists for each Pi i Observation The C C C bits need to encode the no error condition plus a number for each bit both data and parity bits For p parity bits and d data bits d p 1 2 CS 152 L16 Error Correcting Codes p UC Regents Fall 2006 UCB Why did we arrange bits as we did Consider 4 bit words D D D D Add 3 parity bits P P P How do we re arrange bits With this order an odd parity means an error in 1 3 5 or 7 So P0 is C C C the right CS 152 L16 Error Correcting Codes A C in C C C exists for each Pi i Start by numberi D D D P D P P ng 1 to D D D P D P P 7 D D D P D P P Etc D D D P D P P each bit narrows An odd parity means down the a mistake must be in suspect 2 3 6 or 7 the bits four numbers 7 6 54 3 2 1 UC Regents Fall 2006 UCB Why did we arrange bits as we did Consider 4 bit words D D D D P D xor D …
View Full Document
Unlocking...