
Lecture 6: Floating Point Arithmetic, Absolute and Relative Error
AMath 352, Fri., Apr. 9

Rounding

There are four rounding modes in the IEEE standard. If $x$ is a real number that cannot be stored exactly, then it is replaced by a nearby floating point number according to one of the following rules:

- Round down: $\mathrm{round}(x)$ is the largest floating point number that is less than or equal to $x$.
- Round up: $\mathrm{round}(x)$ is the smallest floating point number that is greater than or equal to $x$.
- Round towards 0: $\mathrm{round}(x)$ is either $\mathrm{round\_down}(x)$ or $\mathrm{round\_up}(x)$, whichever lies between 0 and $x$. Thus if $x$ is positive then $\mathrm{round}(x) = \mathrm{round\_down}(x)$, while if $x$ is negative then $\mathrm{round}(x) = \mathrm{round\_up}(x)$.
- Round to nearest: $\mathrm{round}(x)$ is either $\mathrm{round\_down}(x)$ or $\mathrm{round\_up}(x)$, whichever is closer. In case of a tie, it is the one whose least significant (rightmost) bit is 0.

The default is round to nearest.

Rounding, Cont.

Using double precision, the number $1/10 = 1.100\overline{1100}_2 \times 2^{-4}$ is replaced by

0 | 01111111011 | 1001100110011001100110011001100110011001100110011001

using round down or round towards 0, while it becomes

0 | 01111111011 | 1001100110011001100110011001100110011001100110011010

using round up or round to nearest. Note also the exponent field, which is the binary representation of 1019, or 1023 plus the exponent $-4$.

Absolute Rounding Error

The absolute rounding error associated with a number $x$ is defined as $|\mathrm{round}(x) - x|$.

In double precision, if $x = \pm (1.b_1 \ldots b_{52} b_{53} \ldots)_2 \times 2^E$, where $E$ is within the range of representable exponents ($-1022$ to $1023$), then the absolute rounding error associated with $x$ is less than $2^{-52} \times 2^E$ for any rounding mode; the worst rounding errors occur if, for example, $b_{53} = b_{54} = \ldots = 1$ and round towards 0 is used.

For round to nearest, the absolute rounding error is less than or equal to $2^{-53} \times 2^E$, with the worst case being attained if, say, $b_{53} = 1$ and $b_{54} = \ldots = 0$; in this case, if $b_{52} = 0$, then $x$ would be replaced by $1.b_1 \ldots b_{52} \times 2^E$, while if $b_{52} = 1$, then $x$ would be replaced by this number plus $2^{-52} \times 2^E$.

Note that $2^{-52}$ is machine $\epsilon$ for double precision.
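The stored form of 1/10 and the round-to-nearest error bound can be checked directly. The following is a minimal sketch, assuming a Python interpreter whose `float` is an IEEE double and whose literal `0.1` is converted with round to nearest; the helper name `double_fields` is hypothetical and not part of the lecture.

```python
import struct
from fractions import Fraction


def double_fields(x):
    """Split a double into its sign, exponent, and fraction bit strings."""
    # Reinterpret the 8 bytes of the double as an unsigned 64-bit integer.
    (n,) = struct.unpack(">Q", struct.pack(">d", x))
    bits = f"{n:064b}"
    return bits[0], bits[1:12], bits[12:]


sign, exponent, fraction = double_fields(0.1)
print(sign, exponent, fraction)
print("biased exponent:", int(exponent, 2))  # 1023 plus the exponent -4

# Fraction(0.1) is the exact rational value of the stored double, so the
# difference below is the exact absolute rounding error |round(x) - x|.
abs_err = abs(Fraction(0.1) - Fraction(1, 10))

# Round-to-nearest bound 2^-53 * 2^E with E = -4.
bound = Fraction(1, 2**53) * Fraction(1, 2**4)
print("error within bound:", abs_err <= bound)
```

Rational arithmetic is used for the error so that the check itself introduces no further rounding.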
Relative Rounding Error

Usually one is interested not in the absolute rounding error but in the relative rounding error, defined as

$$\frac{|\mathrm{round}(x) - x|}{|x|}.$$

Since we have seen that $|\mathrm{round}(x) - x| < \epsilon \times 2^E$ when $x$ is of the form $\pm m \times 2^E$, $1 \le m < 2$, it follows that the relative rounding error is less than

$$\frac{\epsilon \times 2^E}{m \times 2^E} \le \epsilon.$$

For round to nearest, the relative rounding error is less than or equal to $\epsilon/2$. This means that for any real number $x$ in the range of numbers that can be represented by normalized floating point numbers, we can write

$$\mathrm{round}(x) = x(1 + \delta), \quad \text{where } |\delta| < \epsilon \text{ (or } \le \epsilon/2 \text{ for round to nearest)}.$$

What You Really Need to Know

The IEEE standard requires that the result of an operation (addition, subtraction, multiplication, or division) on two floating point numbers must be the correctly rounded value of the exact result. For numerical analysts, this is important. It implies that if $a$ and $b$ are floating point numbers and $\oplus$, $\ominus$, $\otimes$, and $\oslash$ represent floating point addition, subtraction, multiplication, and division, then we will have

$$
\begin{aligned}
a \oplus b &= \mathrm{round}(a + b) = (a + b)(1 + \delta_1), \\
a \ominus b &= \mathrm{round}(a - b) = (a - b)(1 + \delta_2), \\
a \otimes b &= \mathrm{round}(ab) = ab(1 + \delta_3), \\
a \oslash b &= \mathrm{round}(a/b) = (a/b)(1 + \delta_4),
\end{aligned}
$$

where $|\delta_i| < \epsilon$ (or $\le \epsilon/2$ for round to nearest), $i = 1, \ldots, 4$. This is important in the analysis of many algorithms.

Correctly Rounded Floating Point Operations

While the idea of correctly rounded floating point operations sounds quite natural and reasonable, it turns out that it is not so easy to accomplish.

Consider subtracting the single precision number $1.1\ldots1_2 \times 2^{-1}$ from 1:

$$
\begin{array}{rl}
 & 1.00000000000000000000000\,|\phantom{1} \times 2^{0} \\
- & 0.11111111111111111111111\,|\,1 \times 2^{0} \\
\hline
 & 0.00000000000000000000000\,|\,1 \times 2^{0}
\end{array}
$$

The result is $1.0_2 \times 2^{-24}$, a perfectly good floating point number, so the IEEE standard requires that we compute this number exactly. In order to do this, a guard bit is needed to keep track of the 1 to the right of the register after the second number is shifted. Cray computers used to get this wrong because they had no guard bit.

A tricky case: Subtract $1.0\ldots01_2 \times 2^{-25}$ from 1:

$$
\begin{array}{rl}
 & 1.00000000000000000000000\,| \times 2^{0} \\
- & 0.00000000000000000000000\,|\,0100000000000000000000001 \times 2^{0} \\
\hline
 & 0.11111111111111111111111\,|\,1011111111111111111111111 \times 2^{0}
\end{array}
$$

Renormalizing and using round to nearest, the result is $1.1\ldots1_2 \times 2^{\ldots}$ …
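The relation $a \oplus b = (a + b)(1 + \delta)$ with $|\delta| \le \epsilon/2$ can be checked numerically in double precision. The following is a minimal sketch, assuming the interpreter performs correctly rounded IEEE double arithmetic with round to nearest; the operands `0.1` and `0.3` and the helper name `relative_error` are illustrative choices, not taken from the lecture.

```python
import sys
from fractions import Fraction

# Machine epsilon for doubles (2^-52), held as an exact rational.
EPS = Fraction(sys.float_info.epsilon)


def relative_error(computed, exact):
    """Exact relative error |computed - exact| / |exact| of a float result."""
    return abs(Fraction(computed) - exact) / abs(exact)


# a and b are floating point numbers; A and B are their exact rational values.
a, b = 0.1, 0.3
A, B = Fraction(a), Fraction(b)

deltas = {
    "add": relative_error(a + b, A + B),
    "sub": relative_error(a - b, A - B),
    "mul": relative_error(a * b, A * B),
    "div": relative_error(a / b, A / B),
}

for name, delta in deltas.items():
    # Each |delta_i| should be at most eps/2 under round to nearest.
    print(name, float(delta), delta <= EPS / 2)
```

The exact results are computed from the stored values of `a` and `b`, so each `delta` measures only the rounding of the single operation and not the error already present in the operands.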
