
Lecture 6: Floating Point Arithmetic, Absolute and Relative Error

AMath 352
Fri., Apr. 9

Rounding

There are four rounding modes in the IEEE standard. If $x$ is a real number that cannot be stored exactly, then it is replaced by a nearby floating point number according to one of the following rules:

- Round down. $\mathrm{round}(x)$ is the largest floating point number that is less than or equal to $x$.
- Round up. $\mathrm{round}(x)$ is the smallest floating point number that is greater than or equal to $x$.
- Round towards 0. $\mathrm{round}(x)$ is either $\text{round-down}(x)$ or $\text{round-up}(x)$, whichever lies between 0 and $x$. Thus if $x$ is positive then $\mathrm{round}(x) = \text{round-down}(x)$, while if $x$ is negative then $\mathrm{round}(x) = \text{round-up}(x)$.
- Round to nearest. $\mathrm{round}(x)$ is either $\text{round-down}(x)$ or $\text{round-up}(x)$, whichever is closer. In case of a tie, it is the one whose least significant (rightmost) bit is 0.

The default is round to nearest.

Rounding, Cont.

Using double precision, the number $\frac{1}{10} = (1.1001100\ldots)_2 \times 2^{-4}$ is replaced by

    0 01111111011 1001100110011001100110011001100110011001100110011001

using round down or round towards 0, while it becomes

    0 01111111011 1001100110011001100110011001100110011001100110011010

using round up or round to nearest. [Note also the exponent field, which is the binary representation of 1019, or 1023 plus the exponent $-4$.]

Absolute Rounding Error

The absolute rounding error associated with a number $x$ is defined as $|\mathrm{round}(x) - x|$.

In double precision, if $x = \pm(1.b_1 \ldots b_{52} b_{53} \ldots)_2 \times 2^E$, where $E$ is within the range of representable exponents ($-1022$ to $1023$), then the absolute rounding error associated with $x$ is less than $2^{-52} \times 2^E$ for any rounding mode; the worst rounding errors occur if, for example, $b_{53} = b_{54} = \ldots = 1$ and round towards 0 is used.

For round to nearest, the absolute rounding error is less than or equal to $2^{-53} \times 2^E$, with the worst case being attained if, say, $b_{53} = 1$ and $b_{54} = \ldots = 0$; in this case, if $b_{52} = 0$, then $x$ would be replaced by $(1.b_1 \ldots b_{52})_2 \times 2^E$, while if $b_{52} = 1$, then $x$ would be replaced by this number plus $2^{-52} \times 2^E$. Note that $2^{-52}$ is machine $\epsilon$ for double precision.

Relative Rounding Error

Usually one is interested not in the absolute rounding error but in the relative rounding error, defined as $|\mathrm{round}(x) - x| / |x|$.

Since we have seen that $|\mathrm{round}(x) - x| < \epsilon \times 2^E$ when $x$ is of the form $\pm m \times 2^E$, $1 \le m < 2$, it follows that the relative rounding error is less than $\epsilon \times 2^E / (m \times 2^E) \le \epsilon$. For round to nearest, the relative rounding error is less than or equal to $\epsilon/2$. This means that for any real number $x$ (in the range of numbers that can be represented by normalized floating point numbers), we can write $\mathrm{round}(x) = x(1 + \delta)$, where $|\delta| < \epsilon$ (or $\le \epsilon/2$ for round to nearest).

What You Really Need to Know

The IEEE standard requires that the result of an operation (addition, subtraction, multiplication, or division) on two floating point numbers must be the correctly rounded value of the exact result. For numerical analysts, this is important. It implies that if $a$ and $b$ are floating point numbers and $\oplus$, $\ominus$, $\otimes$, and $\oslash$ represent floating point addition, subtraction, multiplication, and division, then we will have

$$
\begin{aligned}
a \oplus b &= \mathrm{round}(a + b) = (a + b)(1 + \delta_1) \\
a \ominus b &= \mathrm{round}(a - b) = (a - b)(1 + \delta_2) \\
a \otimes b &= \mathrm{round}(ab) = (ab)(1 + \delta_3) \\
a \oslash b &= \mathrm{round}(a/b) = (a/b)(1 + \delta_4),
\end{aligned}
$$

where $|\delta_i| < \epsilon$ (or $\le \epsilon/2$ for round to nearest), $i = 1, \ldots, 4$. This is important in the analysis of many algorithms.

Correctly Rounded Floating Point Operations

While the idea of correctly rounded floating point operations sounds quite natural and reasonable, it turns out that it is not so easy to accomplish.

Consider subtracting the single precision number $(1.1 \ldots 1)_2 \times 2^{-1}$ from 1:

$$
\begin{array}{rl}
  & 1.00000000000000000000000\phantom{\;1} \times 2^0 \\
- & 0.11111111111111111111111\;1 \times 2^0 \\
\hline
  & 0.00000000000000000000000\;1 \times 2^0
\end{array}
$$

The result is $1.0_2 \times 2^{-24}$, a perfectly good floating point number, so the IEEE standard requires that we compute this number exactly. In order to do this, a guard bit is needed to keep track of the 1 to the right of the register after the second number is shifted. Cray computers used to get this wrong because they had no guard bit.

Correctly Rounded Floating Point Operations, Cont.

It turns out that correctly rounded arithmetic can be achieved using 2 guard bits and a sticky bit to flag some tricky cases.

A tricky case: Subtract $(1.0 \ldots 01)_2 \times 2^{-25}$ from
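
The double precision bit patterns quoted for 1/10 on the Rounding slides, and the relative error bounds that follow from them, can be checked with a short Python sketch. This is only an illustration, not part of the lecture: it assumes that Python's `float` is an IEEE double with round to nearest as the active mode (true on common platforms) and that `math.nextafter` is available (Python 3.9 or later). The helper name `double_bits` is made up for this example.

```python
import math
import struct
import sys
from fractions import Fraction


def double_bits(x: float) -> str:
    """Format the 64 bits of a double as 'sign exponent fraction'."""
    (n,) = struct.unpack(">Q", struct.pack(">d", x))
    bits = f"{n:064b}"
    return f"{bits[0]} {bits[1:12]} {bits[12:]}"


x = Fraction(1, 10)                  # the exact real number 1/10
nearest = 0.1                        # the literal is rounded to nearest
down = math.nextafter(nearest, 0.0)  # neighbouring double just below it

print(double_bits(down))     # pattern for round down / round towards 0
print(double_bits(nearest))  # pattern for round up / round to nearest

# Fraction(float) converts exactly, so no extra rounding enters here.
eps = Fraction(sys.float_info.epsilon)        # machine epsilon, 2**-52
rel_nearest = abs(Fraction(nearest) - x) / x  # relative rounding error
rel_down = abs(Fraction(down) - x) / x
print(rel_nearest <= eps / 2, rel_down < eps)
```

Exact rational arithmetic via `fractions.Fraction` is used so that the error measurement is not itself subject to rounding.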

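
The model from "What You Really Need to Know", in which each correctly rounded operation returns the exact result times $(1 + \delta)$ with $|\delta| \le \epsilon/2$ under round to nearest, can be observed numerically. The following Python sketch is a hedged illustration rather than anything from the lecture: it assumes Python's `float` arithmetic is IEEE double precision with round to nearest, and that the exact results are nonzero and in the normalized range. The function name `relative_deltas` and the sample operands are chosen only for this example.

```python
import sys
from fractions import Fraction

EPS = Fraction(sys.float_info.epsilon)  # machine epsilon, 2**-52


def relative_deltas(a: float, b: float) -> dict:
    """Return delta per operation, where computed = exact * (1 + delta)."""
    fa, fb = Fraction(a), Fraction(b)   # exact values of the two doubles
    exact = {"+": fa + fb, "-": fa - fb, "*": fa * fb, "/": fa / fb}
    computed = {"+": a + b, "-": a - b, "*": a * b, "/": a / b}
    # Assumes every exact result is nonzero, so the division is defined.
    return {op: Fraction(computed[op]) / exact[op] - 1 for op in exact}


deltas = relative_deltas(0.1, 0.3)
for op, d in deltas.items():
    # Show each delta and whether it respects the round-to-nearest bound.
    print(op, float(d), abs(d) <= EPS / 2)
```

Here `a` and `b` are taken to be the stored doubles themselves, matching the slide's assumption that the operands are already floating point numbers.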