15-213
"The course that gives CMU its Zip!"

Floating Point Arithmetic
Sept. 24, 1998

Topics
- IEEE Floating Point Standard
- Rounding
- Floating Point Operations
- Mathematical properties
- Alpha floating point

class10.ppt, CS 213 F'98

Floating Point Puzzles
For each of the following C expressions, either:
- Argue that it is true for all argument values
- Explain why it is not true
Declarations: int x, float f, double d. Assume neither d nor f is NaN.
- x == (int)(float) x
- x == (int)(double) x
- f == (float)(double) f
- d == (float) d
- f == -(-f)
- 2/3 == 2/3.0
- d < 0.0  =>  ((d*2) < 0.0)
- d > f  =>  -f > -d
- d * d >= 0.0
- (d+f)-d == f

IEEE Floating Point
IEEE Standard 754
- Established in 1985 as a uniform standard for floating point arithmetic
  - Before that, many idiosyncratic formats
- Supported by all major CPUs
Driven by Numerical Concerns
- Nice standards for rounding, overflow, underflow
- Hard to make go fast
  - Numerical analysts predominated over hardware types in defining the standard

Fractional Binary Numbers
Bit positions and their weights:
    b_i   b_{i-1}   ...   b_2   b_1   b_0   .   b_{-1}   b_{-2}   b_{-3}   ...   b_{-j}
    2^i   2^{i-1}   ...    4     2     1         1/2      1/4      1/8     ...   2^{-j}
Representation
- Bits to the right of the "binary point" represent fractional powers of 2
- Represents the rational number $\sum_{k=-j}^{i} b_k \cdot 2^k$

Fractional Binary Number Examples
    Value     Representation
    5-3/4     101.11_2
    2-7/8     10.111_2
    63/64     0.111111_2
Observation
- Divide by 2 by shifting right
- Numbers of the form 0.111111..._2 are just below 1.0
  - Use notation $1.0 - \epsilon$
Limitation
- Can only exactly represent numbers of the form $x / 2^k$
- Other numbers have repeating bit representations
    Value     Representation
    1/3       0.0101010101[01]..._2
    1/5       0.001100110011[0011]..._2
    1/10      0.0001100110011[0011]..._2

Floating Point Representation
Numerical Form
- $(-1)^s \times m \times 2^E$
  - Sign bit s determines whether the number is negative or positive
  - Mantissa m is normally a fractional value in the range [1.0, 2.0)
  - Exponent E weights the value by a power of two
Encoding
    | s | exp | significand |
- MSB is the sign bit
- Exp field encodes E
- Significand field encodes m
Sizes
- Single precision: 8 exp bits, 23 significand bits (32 bits total)
- Double precision: 11 exp bits, 52 significand bits (64 bits total)

"Normalized" Numeric Values
Condition
- exp != 000...0 and exp != 111...1
Exponent coded as biased value
- E = Exp - Bias
  - Exp: unsigned value denoted by exp
  - Bias: bias value (single precision: 127, double precision: 1023)
Mantissa coded with implied leading 1
- m = 1.xxx...x_2
  - xxx...x: bits of significand
  - Minimum when 000...0 (m = 1.0)
  - Maximum when 111...1 (m = $2.0 - \epsilon$)
  - Get the extra leading bit for "free"

Normalized Encoding Example
Value
- float F = 15213.0;
- 15213_10 = 11101101101101_2 = 1.1101101101101_2 × $2^{13}$
Significand
- m   = 1.1101101101101_2
- sig = 11011011011010000000000_2
Exponent
- E    = 13
- Bias = 127
- Exp  = 140 = 10001100_2
Floating Point Representation (Class 02)
    Hex:      4    6    6    D    B    4    0    0
    Binary:   0100 0110 0110 1101 1011 0100 0000 0000
    140:       100 0110 0
    15213:               110 1101 1011 01   (leading 1 is implied, not stored)

Denormalized Values
Condition
- exp = 000...0
Value
- Exponent value E = -Bias + 1
- Mantissa value m = 0.xxx...x_2
  - xxx...x: bits of significand
Cases
- exp = 000...0, significand = 000...0
  - Represents value 0
  - Note that there are distinct values +0 and -0
- exp = 000...0, significand != 000...0
  - Numbers very close to 0.0
  - Lose precision as they get smaller
  - "Gradual underflow"

Interesting Numbers
    Description                 Exp       Significand   Numeric Value
    Zero                        00...00   00...00       0.0
    Smallest Pos. Denorm.       00...00   00...01       $2^{-\{23,52\}} \times 2^{-\{126,1022\}}$
    Largest Denormalized        00...00   11...11       $(1.0 - \epsilon) \times 2^{-\{126,1022\}}$
    Smallest Pos. Normalized    00...01   00...00       $1.0 \times 2^{-\{126,1022\}}$
    One                         01...11   00...00       1.0
    Largest Normalized          11...10   11...11       $(2.0 - \epsilon) \times 2^{\{127,1023\}}$
- Smallest Pos. Denorm.: single ≈ $1.4 \times 10^{-45}$, double ≈ $4.9 \times 10^{-324}$
- Largest Denormalized: single ≈ $1.18 \times 10^{-38}$, double ≈ $2.2 \times 10^{-308}$
- Smallest Pos. Normalized: just larger than the largest denormalized
- Largest Normalized: single ≈ $3.4 \times 10^{38}$, double ≈ $1.8 \times 10^{308}$

Memory Referencing Bug (From Class 01)
Example

    main()
    {
        long int a[2];
        double d = 3.14;
        a[2] = 1073741824;   /* Out of bounds reference */
        printf("d = %.15g\n", d);
        exit(0);
    }

Output by platform and compiler flag:
            Alpha                    MIPS               Sun
    -g      5.30498947741318e-315    3.1399998664856    3.14
    -O      3.14                     3.14               3.14

Referencing Bug on Alpha
Alpha stack frame (-g): d, a[1], a[0]

    long int a[2];
    double d = 3.14;
    a[2] = 1073741824;

Optimized Code
- Double d stored in register
- Unaffected by errant write
Alpha -g
- 1073741824 = 0x40000000 = $2^{30}$
- Overwrites all 8 bytes with value 0x0000000040000000
- Denormalized value: $2^{30} \times$ (smallest denorm $2^{-1074}$) $= 2^{-1044}$
- ≈ $5.305 \times 10^{-315}$

Referencing Bug on MIPS
MIPS stack frame (-g): d, a[1], a[0]

    long int a[2];
    double d = 3.14;
    a[2] = 1073741824;

MIPS -g
- Overwrites lower 4 bytes with value 0x40000000
- Original value 3.14 represented as 0x40091eb851eb851f
- Modified value represented as 0x40091eb840000000
- Exp = 1024, so E = 1024 - 1023 = 1
- Mantissa difference: .0000011eb851f_16
- Integer value: 11eb851f_16 = 300,647,711_10
- Difference = $2^{1} \times 2^{-52} \times 300{,}647{,}711 \approx 1.34 \times 10^{-7}$
- Compare to 3.140000000 - 3.139999866 = 0.000000134

Special Values
Condition
- exp = 111...1
Cases
- exp = 111...1, significand = 000...0
  - Represents value infinity
  - Result of an operation that overflows
  - Both positive and negative
  - E.g., 1.0/0.0 = -1.0/-0.0 = +infinity, 1.0/-0.0 = -infinity
- exp = 111...1, significand != 000...0
  - Not-a-Number (NaN)
  - Represents the case when no numeric value can be determined
  - E.g., sqrt(-1)
  - No fixed meaning assigned to significand bits

Special Properties of Encoding
FP Zero Same as Integer Zero
- All bits = 0
Can (Almost) Use Unsigned Integer Comparison
- Must first compare sign bits
- NaNs problematic
  - Will be greater than any other values
  - What should comparison yield?
- Otherwise OK
  - Denorm vs. normalized
  - Normalized vs. infinity

Floating Point Operations
Conceptual View
- First compute the exact result
- Make it fit into the desired precision
  - Possibly overflow if exponent too large
  - Possibly round to fit into significand
Rounding Modes (illustrated with dollar rounding)
                              \$1.40   \$1.60   \$1.50   \$2.50   -\$1.50
    Zero                      \$1.00   \$1.00   \$1.00   \$2.00   -\$1.00
    Round down (-infinity)    \$1.00   \$1.00   \$1.00   \$2.00   -\$2.00
    Round up (+infinity)      \$2.00   \$2.00   \$2.00   \$3.00   -\$1.00
    Nearest Even (default)    \$1.00   \$2.00   \$2.00   \$2.00   -\$2.00

A Closer Look at Round-To-Even
Default Rounding Mode
- Hard to get any other …