CS61C: Machine Structures
Lecture 16: Floating Point Numbers II
2/24/2006
John Wawrzynek (www.cs.berkeley.edu/~johnw)
www-inst.eecs.berkeley.edu/~cs61c/
CS 61C L16 Floating Point II, Wawrzynek, Spring 2006, UCB

IEEE 754 Floating Point Standard (review)
• Biased notation: the bias is the number subtracted from the stored field to get the real value
    • IEEE 754 uses a bias of 127 for single precision
    • Subtract 127 from the Exponent field to get the actual value of the exponent
    • 1023 is the bias for double precision
• Summary (single precision):

    bit 31     bits 30-23     bits 22-0
    S          Exponent       Significand
    1 bit      8 bits         23 bits

• Value: (-1)^S × (1 + Significand) × 2^(Exponent - 127)
• Double precision is identical in form, except with an exponent bias of 1023 (and wider fields: 11 exponent bits, 52 significand bits)

Example: Converting Binary FP to Decimal

    0 0110 1000 101 0101 0100 0011 0100 0010

• Sign: 0 => positive
• Exponent: 0110 1000_two = 104_ten
    • Bias adjustment: 104 - 127 = -23
• Significand:
    1 + 1×2^-1 + 0×2^-2 + 1×2^-3 + 0×2^-4 + 1×2^-5 + ...
    = 1 + 2^-1 + 2^-3 + 2^-5 + 2^-7 + 2^-9 + 2^-14 + 2^-15 + 2^-17 + 2^-22
    = 1.0 + 0.666115
• Represents: 1.666115_ten × 2^-23 ≈ 1.986 × 10^-7 (about 2/10,000,000)

Example: Converting Decimal to FP

    -2.340625 × 10^1

1. Denormalize: -23.40625
2. Convert integer part: 23 = 16 + (7 = 4 + (3 = 2 + (1))) = 10111_two
3. Convert fractional part: .40625 = .25 + (.15625 = .125 + (.03125)) = .01101_two
4. Put parts together and normalize: 10111.01101 = 1.011101101 × 2^4
5. Convert exponent: 127 + 4 = 131 = 1000 0011_two

    1 1000 0011 011 1011 0100 0000 0000 0000

Representation for ±∞
• In FP, divide by 0 should produce ±∞, not overflow.
• Why? It is OK to do further computations with ∞; e.g., X/0 > Y may be a valid comparison.
• IEEE 754 represents ±∞:
    • Most positive exponent reserved for ∞
    • Significand all zeroes

Representation for 0
• Represent 0?
    • Exponent all zeroes
    • Significand all zeroes
• What about the sign? Both cases are valid.

    +0: 0 00000000 00000000000000000000000
    -0: 1 00000000 00000000000000000000000
Special Numbers
• What have we defined so far? (Single precision)

    Exponent    Significand    Object
    0           0              0
    0           nonzero        ???
    1-254       anything       +/- fl. pt. #
    255         0              +/- ∞
    255         nonzero        ???

• Professor Kahan had clever ideas: "Waste not, want not"
    • We'll talk about Exp = 0 or 255 with Sig != 0 later

Precision and Accuracy
• Don't confuse these two terms!
• Precision is a count of the number of bits in a computer word used to represent a value.
• Accuracy is a measure of the difference between the actual value of a number and its computer representation.
• High precision permits high accuracy but doesn't guarantee it. It is possible to have high precision but low accuracy.
• Example: float pi = 3.14;
    • pi will be represented using all 24 bits of the significand (highly precise), but is only an approximation (not accurate).

Administrivia
• Midterm 1: 1 Pimentel, tonight, 6-8pm sharp
    • Open book/notes, but no electronic devices of any kind
• Don't forget to work on homework and start project 3 over the weekend.

Representation for Not a Number
• What do I get if I calculate sqrt(-4.0) or 0/0?
    • If ∞ is not an error, these shouldn't be either.
    • Called Not a Number (NaN)
    • Exponent = 255, Significand nonzero
• Why is this useful?
    • Hope NaNs help with debugging?
    • They contaminate: op(NaN, X) = NaN

Special Numbers (cont'd)
• What have we defined so far? (Single precision)

    Exponent    Significand    Object
    0           0              0
    0           nonzero        ???
    1-254       anything       +/- fl. pt. #
    255         0              +/- ∞
    255         nonzero        NaN

Representation for Denorms (1/2)
• Problem: There's a gap among representable FP numbers around 0
    • Smallest representable positive number: a = 1.0..._two × 2^-126 = 2^-126
    • Second smallest representable positive number: b = 1.000......1_two × 2^-126 = 2^-126 + 2^-149
    • a - 0 = 2^-126
    • b - a = 2^-149
• [Figure: number line showing the gaps around 0, with a and b marked]

Representation for Denorms (2/2)
• Solution: We still haven't used Exponent = 0, Significand nonzero
    • Denormalized number: no implied leading 1, implicit exponent = -126
    • Smallest representable positive number: a = 2^-149
    • Second smallest representable positive number: b = 2^-148
• [Figure: number line around 0 with the gap filled in]

Rounding
• When we perform math on real numbers, we have to worry about rounding to fit the result in the significand field.
• The FP hardware carries two extra bits of precision, and then rounds to get the proper value.
• Rounding also occurs when converting a double to a single-precision value, or a floating-point number to an integer.

IEEE FP Rounding Modes
• Round towards +∞
    • ALWAYS round "up": 2.001 → 3, -2.001 → -2
• Round towards -∞
    • ALWAYS round "down": 1.999 → 1, -1.999 → -2
• Truncate
    • Just drop the last bits (round towards 0)
• Round to (nearest) even
    • Normal rounding, almost

Round to Even
• Round like you learned in grade school
• Except if the value is right on the borderline, in which case we round to the nearest EVEN number
    • 2.5 → 2
    • 3.5 → 4
• Ensures fairness in calculation
    • This way, half the time we round up on a tie, and the other half of the time we round down
    • Tends to balance out inaccuracies
• This is the default rounding mode

Casting floats to ints and vice versa
• (int) floating-point expression
    • Coerces and converts it to an integer; C truncates toward zero rather than rounding to the nearest integer

        i = (int) (3.14159 * f);

• (float) expression
    • Converts an integer to the nearest floating point value

        f = f + (float) i;

int → float → int

    if (i == (int)((float) i)) {
        printf("true");
    }

• Will not always print "true"
• Most large values of integers don't have exact floating point representations!
• What about double?

float → int → float

    if (f == (float)((int) f)) {
        printf("true");
    }

• Will not always print "true"
• Small floating point numbers (< 1) don't have integer representations
• For other numbers, rounding errors
Floating Point Fallacy
• FP add associative: FALSE!
    • x = -1.5 × 10^38, y = 1.5 × 10^38, and z = 1.0
    • x + (y + z) = -1.5×10^38 + (1.5×10^38 + 1.0)
                  = -1.5×10^38 + (1.5×10^38) = 0.0
    • (x + y) + z = (-1.5×10^38 + 1.5×10^38) + 1.0
                  = (0.0) + 1.0 = 1.0
• Therefore, floating point add is not associative!
    • Why? The FP result approximates the real result!
    • This example …