1 Floating Point Arithmetic and Dense Linear Algebra Lecture 12 CS 594 Jack Dongarra 2 Question: Suppose we want to compute using four decimal arithmetic: S = 1.000 + 1.000x104 – 1.000x104 What’s the answer? Ariane 5 rocket June 1996 exploded when a 64 bit fl pt number relating to the horizontal velocity of the rocket was converted to a 16 bit signed integer. The number was larger than 32,767, and thus the conversion failed. $500M rocket and cargo2 3 Two sources of numerical error 1) Round off error 2) Truncation error 4 Round off Error Caused by representing a number approximately3 5 Problems created by round off error 28 Americans were killed on February 25, 1991 by an Iraqi Scud missile in Dhahran, Saudi Arabia. The patriot defense system failed to track and intercept the Scud. Why? 6 Problem with Patriot missile Clock cycle of 1/10 seconds was represented in 24-bit fixed point register created an error of 9.5 x 10-8 seconds. The battery was on for 100 consecutive hours, thus causing an inaccuracy of .1 (decimal) = 00111101110011001100110011001101 (binary)4 7 Problem (cont.) The shift calculated in the ranging system of the missile was 687 meters. The target was considered to be out of range at a distance greater than 137 meters. 8 Truncation error Error caused by truncating or approximating a mathematical procedure.5 9 Example of Truncation Error Taking only a few terms of a Maclaurin series to approximate If only 3 terms are used, 10 Defining Floating Point Arithmetic Representable numbers Scientific notation: +/- d.d…d x rexp sign bit +/- radix r (usually 2 or 10, sometimes 16) significand d.d…d (how many base-r digits d?) exponent exp (range?) others? Operations: arithmetic: +,-,x,/,... » how to round result to fit in format comparison (<, =, >) conversion between different formats » short to long FP numbers, FP to integer exception handling » what to do for 0/0, 2*largest_number, etc. binary/decimal conversion » for I/O, when radix not 106 11 IEEE Floating Point Arithmetic Standard 754 (1985) - Normalized Numbers Normalized Nonzero Representable Numbers: +- 1.d…d x 2exp Macheps = Machine epsilon = 2-#significand bits = relative error in each operation smallest number such that fl( 1 + ) > 1 OV = overflow threshold = largest number UN = underflow threshold = smallest number +- Zero: +-, significand and exponent all zero Why bother with -0 later Format # bits #significand bits macheps #exponent bits exponent range ---------- -------- ----------------------- ------------ -------------------- ---------------------- Single 32 23+1 2-24 (~10-7) 8 2-126 - 2127 (~10+-38) Double 64 52+1 2-53 (~10-16) 11 2-1022 - 21023 (~10+-308) Double >=80 >=64 <=2-64(~10-19) >=15 2-16382 - 216383 (~10+-4932) Extended (80 bits on all Intel machines) 12 IEEE Floating Point Arithmetic Standard 754 - “Denorms” Denormalized Numbers: +-0.d…d x 2min_exp sign bit, nonzero significand, minimum exponent Fills in gap between UN and 0 Underflow Exception occurs when exact nonzero result is less than underflow threshold UN Ex: UN/3 return a denorm, or zero7 13 IEEE Floating Point Arithmetic Standard 754 - +- Infinity +- Infinity: Sign bit, zero significand, maximum exponent Overflow Exception occurs when exact finite result too large to represent accurately Ex: 2*OV return +- infinity Divide by zero Exception return +- infinity = 1/+-0 sign of zero important! Also return +- infinity for 3+infinity, 2*infinity, infinity*infinity Result is exact, not an exception! 14 IEEE Floating Point Arithmetic Standard 754 - NAN (Not A Number) NAN: Sign bit, nonzero significand, maximum exponent Invalid Exception occurs when exact result not a well-defined real number 0/0 sqrt(-1) infinity-infinity, infinity/infinity, 0*infinity NAN + 3 NAN > 3? Return a NAN in all these cases Two kinds of NANs Quiet - propagates without raising an exception Signaling - generate an exception when touched » good for detecting uninitialized data8 15 Error Analysis Basic error formula fl(a op b) = (a op b)*(1 + d) where » op one of +,-,*,/ » |d| macheps » assuming no overflow, underflow, or divide by zero Example: adding 4 numbers fl(x1+x2+x3+x4) = {[(x1+x2)*(1+d1) + x3]*(1+d2) + x4}*(1+d3) = x1*(1+d1)*(1+d2)*(1+d3) + x2*(1+d1)*(1+d2)*(1+d3) + x3*(1+d2)*(1+d3) + x4*(1+d3) = x1*(1+e1) + x2*(1+e2) + x3*(1+e3) + x4*(1+e4) where each |ei| <~ 3*macheps get exact sum of slightly changed summands xi*(1+ei) Backward Error Analysis - algorithm called numerically stable if it gives the exact result for slightly changed inputs Numerical Stability is an algorithm design goal 16 Backward error Approximate solution is exact solution to modified problem. How large a modification to original problem is required to give result actually obtained? How much data error in initial input would be required to explain all the error in computed results? Approximate solution is good if it is exact solution to “nearby” problem. f f(x) x Forward error Backward error f’(x) f’ f x’9 17 Sensitivity and Conditioning Problem is insensitive or well conditioned if relative change in input causes commensurate relative change in solution. Problem is sensitive or ill-conditioned, if relative change in solution can be much larger than that in input data. Cond = |Relative change in solution| / |Relative change in input data| = |[f(x’) – f(x)]/f(x)| / |(x’ – x)/x| Problem is sensitive, or ill-conditioned, if cond >> 1. When function f is evaluated for approximate input x’ = x+h instead of true input value of x. Absolute error = f(x + h) – f(x) h f’(x)
View Full Document