Floating-Point for CS 267    February 8, 1996 11:50 am    Slide 1

What can you learn about Floating-Point Arithmetic in One Hour?

by Prof. W. Kahan
Univ. of Calif. @ Berkeley
prepared for CS 267 ( Profs. J.W. Demmel of UCB & A. Edelman of MIT )
8 Feb. 1996

Slide 2

Numbers in Computers:
( Character Strings ... get Converted to or from ... )
    Integers
    Fixed-Point
    Floating-Point

Slide 3

Integers
..., -3, -2, -1, 0, 1, 2, 3, ...
In all programming languages.
+ , - , x are Exact unless they Overflow.
Overflow thresholds determined by
    (un)signed
    Radix ( 2 or 10 )
    wordsize ( 1 byte, 2 bytes, 4 bytes, 8 bytes, ... )
    ( cf. type ).
Division ==> Quotient and Remainder.

Slide 4

Fixed-Point
-0.712 , 1.539 , 27.962 , 745.288 , ...
Provided directly in COBOL, ADA; otherwise simulated.
+ , - , x by Integer are exact unless they Overflow.
x , / Rounded Off to a fixed number of digits after the point.
{ Available numbers } = { integers } / ( Scale Factor ) ;
Scale Factor = Power of 2 or 10 , selected by programmer to determine a format or type.

Slide 5

Floating-Point
-7.12 E-01 , 1.539 E 00 , 2.7962 E 01 , 7.45288 E 02 , ... ( cf. “Scientific Notation” )
Called REAL, float, DOUBLE PRECISION, ...
Every arithmetic operation is rounded off to fit a Destination Format or Type depending upon language conventions and computer register-architecture ( ... Compiler ).
Too Big for destination ==> Overflow.
Nonzero but Too Tiny ==> Underflow.
( Despite rounding, some operations are Exact ; e.g., X := -Y . )

Slide 6

Logarithmic Floating-Point
{ Available values } = ± (10 or 2)^{ Fixed-Point numbers }
Absent Over/Underflow, x and / are Exact, and
Distributive Law X·(Y+Z) = X·Y + X·Z persists.
But
    Subtract is difficult to implement to near-full precision.
    Add, subtract are slow unless precision is short, < 6 sig. dec.
    Can’t represent small integers 2 and 3 exactly.
Used only in a few embedded systems.

Slide 7

Conventional Floating-Point
{ Available values } = { long integers }·Radix^{ short integers }
Radix = 2 or 10 or 16 .
Some also have ∞ , NaN / Indefinite / Reserved Operand.

Models of Roundoff
Let operation • come from { + , - , x , / } ; then, absent Over/Underflow,
    Computed[ X•Y ] = ( X•Y )·( 1 + β ) for some tiny β ;
    | β | < Radix^( - #Sig. Digits ) roughly ,
except for CRAY X-MP, Y-MP, C90, J90 which have peculiar arithmetic.

Slide 8

CRAY X-MP, Y-MP, C90, J90 have peculiar arithmetic. e.g.:
    1·X can Overflow if | X | is big enough, ≈ 10^2466
    Abbreviated multiply, composite divide: X/Y --> ≈ X·(1/Y) .
Consequently, absent Over/Underflow or 0/0 ,
    -1 ≤ X/√( X^2 + Y^2 ) ≤ 1 despite 5 rounding errors
on all H-P calculators since 1976 and on EVERY commercially significant computer EXCEPT a CRAY.
( Proof of inequality easy only with IEEE 754. )

Slide 9

CRAYs Lack GUARD DIGIT for Subtraction:
Pretend 4 sig. dec.; compute 1.000 - 0.9999 :

With guard digit:
      1.000
    - 0.9999
    --------
      0.0001   -->  1.000 · 10^-4

Without guard digit:
      1.000    -->    1.000
    - 0.9999   -->  - 0.999
    --------        -------
                      0.001   -->  1.000 · 10^-3

Violates Theorem: If P and Q are floating-point numbers in the same format, and if 1/2 ≤ P/Q ≤ 2 , then P - Q is computable Exactly unless it Underflows ( which it can’t in IEEE 754 ).

Slide 10

Programs that can FAIL only on a CRAY for lack of a guard bit:
    Computations with Divided Differences.
    Area and Angles of a Triangle, given its side-lengths.
    Roundoff suppression in solutions of Initial-Value Problems.
    Software simulations of Doubled-Double precision.
    Divide-and-Conquer Symmetric Eigenproblems ( Ming Gu’s ), cured in LAPACK by performing operation X := (X+X) - X to shear off X’s last digit only on CRAYs ( and hex. IBM 3090 ).

Slide 11

Why is CRAY’s arithmetic so Aberrant ?
Aberration “justified” by misapplication of principles behind ...
Backward Error Analysis:
The computed value F(X) of a desired function f(X) is often acceptable if F(X) = f(X’) for some (unknown) X’ practically indistinguishable from X .
For example, the solution f of the linear
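The guard-digit example on Slide 9 and the exact-subtraction theorem that follows it can be tried out in a few lines of Python. This is only a sketch under stated assumptions: the 4-digit decimal machine is simulated with the standard `decimal` module (whose correctly rounded subtraction stands in for "with guard digit"), the no-guard-digit case is modeled by chopping the smaller operand to the larger one's last digit position, and Python floats are assumed to be IEEE 754 doubles. The function names are illustrative, not from the slides.

```python
from decimal import Decimal, Context, ROUND_DOWN, ROUND_HALF_EVEN
from fractions import Fraction
import random

# Hypothetical 4-significant-digit decimal machine, as in the slide's example.
CTX = Context(prec=4, rounding=ROUND_HALF_EVEN)

def subtract_with_guard(p: Decimal, q: Decimal) -> Decimal:
    """Form the exact difference, then round once to 4 digits.
    (Stand-in for a guard digit; it gives the same answer in this example.)"""
    return CTX.subtract(p, q)

def subtract_without_guard(p: Decimal, q: Decimal) -> Decimal:
    """Chop q to p's last digit position before subtracting.
    Assumes |p| >= |q| and that p is written with all 4 digits, e.g. '1.000'."""
    unit = Decimal(1).scaleb(p.as_tuple().exponent)   # 0.001 for p = 1.000
    q_chopped = q.quantize(unit, rounding=ROUND_DOWN) # 0.9999 becomes 0.999
    return CTX.subtract(p, q_chopped)

def subtraction_is_exact(p: float, q: float) -> bool:
    """True if the rounded double difference equals the exact rational one."""
    return Fraction(p - q) == Fraction(p) - Fraction(q)

def theorem_holds(trials: int = 10_000, seed: int = 0) -> bool:
    """Spot-check: 1/2 <= P/Q <= 2 implies P - Q is exact (IEEE doubles assumed)."""
    rng = random.Random(seed)
    for _ in range(trials):
        q = rng.uniform(1.0, 1000.0)
        p = q * rng.uniform(0.5, 2.0)
        # Rounding of the product may push P/Q just outside [1/2, 2]; skip those.
        if not (Fraction(1, 2) <= Fraction(p) / Fraction(q) <= 2):
            continue
        if not subtraction_is_exact(p, q):
            return False
    return True
```

With `p = Decimal("1.000")` and `q = Decimal("0.9999")`, `subtract_with_guard(p, q)` equals `Decimal("0.0001")` while `subtract_without_guard(p, q)` equals `Decimal("0.001")`, ten times too big, matching the slide. A random spot-check such as `theorem_holds()` is of course evidence, not a proof.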
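The roundoff model from Slide 7, Computed[ X•Y ] = ( X•Y )·( 1 + β ), can likewise be checked numerically. The sketch below assumes Python floats are IEEE 754 doubles with 53 significant bits rounded to nearest, so the bound Radix^( - #Sig. Digits ) becomes 2^-53; the operand ranges are chosen arbitrarily to stay clear of Over/Underflow, and all names are illustrative.

```python
from fractions import Fraction
import operator
import random

# Assumed bound for IEEE 754 double: Radix = 2, 53 significant bits.
BOUND = Fraction(1, 2**53)

OPS = {"+": operator.add, "-": operator.sub,
       "x": operator.mul, "/": operator.truediv}

def relative_error(op: str, x: float, y: float) -> Fraction:
    """Return beta such that Computed[x op y] = (x op y) * (1 + beta)."""
    computed = Fraction(OPS[op](x, y))           # rounded result, converted exactly
    exact = OPS[op](Fraction(x), Fraction(y))    # exact rational result
    if exact == 0:
        return Fraction(0)  # an exact zero is also computed as zero
    return (computed - exact) / exact

def max_relative_error(trials: int = 2000, seed: int = 1) -> Fraction:
    """Largest |beta| seen over random operands and all four operations."""
    rng = random.Random(seed)
    worst = Fraction(0)
    for _ in range(trials):
        x = rng.uniform(-1e6, 1e6)
        y = rng.uniform(1.0, 1e6)   # nonzero, so division is always defined
        for op in OPS:
            worst = max(worst, abs(relative_error(op, x, y)))
    return worst
```

Under these assumptions `max_relative_error()` should come out nonzero but no larger than `BOUND`, which is what the model predicts for hardware that rounds each operation correctly.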