15-213
"The course that gives CMU its Zip!"

Floating Point Arithmetic
September 28, 2000

Topics
  - IEEE Floating Point Standard
  - Rounding
  - Floating Point Operations
  - Mathematical properties
  - IA32 floating point

class10.ppt, CS 213 F'00

Floating Point Puzzles

For each of the following C expressions, either:
  - Argue that it is true for all argument values
  - Explain why it is not true

Declarations:
    int x = …;
    float f = …;
    double d = …;

Assume neither d nor f is NaN.

  - x == (int)(float) x
  - x == (int)(double) x
  - f == (float)(double) f
  - d == (float) d
  - f == -(-f);
  - 2/3 == 2/3.0
  - d < 0.0  ⇒  ((d*2) < 0.0)
  - d > f  ⇒  -f > -d
  - d * d >= 0.0
  - (d+f)-d == f

IEEE Floating Point

IEEE Standard 754
  - Established in 1985 as a uniform standard for floating point arithmetic
  - Before that, many idiosyncratic formats
  - Supported by all major CPUs

Driven by Numerical Concerns
  - Nice standards for rounding, overflow, underflow
  - Hard to make go fast
  - Numerical analysts predominated over hardware types in defining the standard

Fractional Binary Numbers

    Bit:     b_i   b_(i-1)  ...  b_2  b_1  b_0  .  b_-1  b_-2  b_-3  ...  b_-j
    Weight:  2^i   2^(i-1)  ...  4    2    1       1/2   1/4   1/8   ...  2^-j

Representation
  - Bits to the right of the "binary point" represent fractional powers of 2
  - Represents the rational number  sum_{k=-j}^{i} b_k × 2^k

Fractional Binary Number Examples

    Value     Representation
    5 3/4     101.11₂
    2 7/8     10.111₂
    63/64     0.111111₂

Observation
  - Divide by 2 by shifting right
  - Numbers of the form 0.111111…₂ are just below 1.0
  - Use the notation 1.0 - ε

Limitation
  - Can only exactly represent numbers of the form x/2^k
  - Other numbers have repeating bit representations

    Value     Representation
    1/3       0.0101010101[01]…₂
    1/5       0.001100110011[0011]…₂
    1/10      0.0001100110011[0011]…₂

Floating Point Representation

Numerical Form
  - (-1)^s × M × 2^E
  - Sign bit s determines whether the number is negative or positive
  - Significand M is normally a fractional value in the range [1.0, 2.0)
  - Exponent E weights the value by a power of two

Encoding

    | s | exp | frac |

  - MSB is the sign bit
  - exp field encodes E
  - frac field encodes M

Sizes
  - Single precision: 8 exp bits, 23 frac bits (32 bits total)
  - Double precision: 11 exp bits, 52 frac bits (64 bits total)

"Normalized" Numeric Values

Condition
  - exp ≠ 000…0 and exp ≠ 111…1

Exponent coded as a biased value
  - E = Exp - Bias
  - Exp: unsigned value denoted by exp
  - Bias: bias value
      - Single precision: 127 (Exp: 1…254, E: -126…127)
      - Double precision: 1023 (Exp: 1…2046, E: -1022…1023)
      - In general: Bias = 2^(m-1) - 1, where m is the number of exponent bits

Significand coded with an implied leading 1
  - M = 1.xxx…x₂
  - xxx…x: bits of frac
  - Minimum when frac = 000…0 (M = 1.0)
  - Maximum when frac = 111…1 (M = 2.0 - ε)
  - Get the extra leading bit for "free"

Normalized Encoding Example

Value
  - Float F = 15213.0;
  - 15213₁₀ = 11101101101101₂ = 1.1101101101101₂ × 2^13

Significand
  - M    = 1.1101101101101₂
  - frac = 11011011011010000000000₂

Exponent
  - E    = 13
  - Bias = 127
  - Exp  = 140 = 10001100₂

Floating Point Representation (Class 02)

    Hex:     4    6    6    D    B    4    0    0
    Binary:  0100 0110 0110 1101 1011 0100 0000 0000
    140:      100 0110 0
    15213:               110 1101 1011 01

Denormalized Values

Condition
  - exp = 000…0

Value
  - Exponent value E = -Bias + 1
  - Significand value M = 0.xxx…x₂
  - xxx…x: bits of frac

Cases
  - exp = 000…0, frac = 000…0
      - Represents the value 0
      - Note that there are distinct values +0 and -0
  - exp = 000…0, frac ≠ 000…0
      - Numbers very close to 0.0
      - Lose precision as they get smaller
      - "Gradual underflow"

Interesting Numbers

    Description               exp     frac    Numeric Value
    Zero                      00…00   00…00   0.0
    Smallest Pos. Denorm.     00…00   00…01   2^-{23,52} × 2^-{126,1022}
                                              Single ≈ 1.4 × 10^-45
                                              Double ≈ 4.9 × 10^-324
    Largest Denormalized      00…00   11…11   (1.0 - ε) × 2^-{126,1022}
                                              Single ≈ 1.18 × 10^-38
                                              Double ≈ 2.2 × 10^-308
    Smallest Pos. Normalized  00…01   00…00   1.0 × 2^-{126,1022}
                                              Just larger than largest denormalized
    One                       01…11   00…00   1.0
    Largest Normalized        11…10   11…11   (2.0 - ε) × 2^{127,1023}
                                              Single ≈ 3.4 × 10^38
                                              Double ≈ 1.8 × 10^308

Summary of Floating Point Real Number Encodings

    NaN | -∞ | -Normalized | -Denorm | -0 +0 | +Denorm | +Normalized | +∞ | NaN

Special Values

Condition
  - exp = 111…1

Cases
  - exp = 111…1, frac = 000…0
      - Represents the value ∞ (infinity)
      - Operation that overflows
      - Both positive and negative
      - E.g., 1.0/0.0 = -1.0/-0.0 = +∞, 1.0/-0.0 = -∞
  - exp = 111…1, frac ≠ 000…0
      - Not-a-Number (NaN)
      - Represents the case when no numeric
        value can be determined
      - E.g., sqrt(-1)

Tiny floating point example

8-bit Floating Point Representation
  - the sign bit is in the most significant bit
  - the next four bits are the exponent, with a bias of 7
  - the last three bits are the frac

Same General Form as IEEE Format
  - normalized, denormalized
  - representation of 0, NaN, infinity

    Bit:    7 | 6 … 3 | 2 … 0
    Field:  s |  exp  | frac

Values related to the exponent

    Exp   exp    E     2^E
     0    0000   -6    1/64    (denorms)
     1    0001   -6    1/64
     2    0010   -5    1/32
     3    0011   -4    1/16
     4    0100   -3    1/8
     5    0101   -2    1/4
     6    0110   -1    1/2
     7    0111    0    1
     8    1000   +1    2
     9    1001   +2    4
    10    1010   +3    8
    11    1011   +4    16
    12    1100   +5    32
    13    1101   +6    64
    14    1110   +7    128
    15    1111   n/a           (inf, NaN)

Dynamic Range

    s  exp   frac   E     Value
  Denormalized numbers
    0  0000  000    -6    0
    0  0000  001    -6    1/8 × 1/64 = 1/512      closest to zero
    0  0000  010    -6    2/8 × 1/64 = 2/512
    …
    0  0000  110    -6    6/8 × 1/64 = 6/512
    0  0000  111    -6    7/8 × 1/64 = 7/512      largest denorm
  Normalized numbers
    0  0001  000    -6    8/8 × 1/64 = 8/512      smallest norm
    0  0001  001    -6    9/8 × 1/64 = 9/512
    …
    0  0110  110    -1    14/8 × 1/2 = 14/16
    0  0110  111    -1    15/8 × 1/2 = 15/16      closest to 1 below
    0  0111  000     0    8/8 × 1 = 1
    0  0111  001     0    9/8 × 1 = 9/8           closest to 1 …
    0  0111  010     0    10/8 × 1 = 10/8
    …
    0  1110  110     7    14/8 × 128 = 224
    0  1110  111     7    15/8 × 128 = 240
    0  1111  000    n/a   inf

Special Properties of Encoding

FP Zero Same as Integer Zero
  - All bits = 0

Can (Almost) Use Unsigned Integer Comparison
  - Must first compare sign bits
  - Must consider -0 = 0
  - NaNs problematic
      - Will be greater than any other values
      - What should comparison yield?