Floating Point Numbers
Summer 2008
CMPE12 – Summer 2008 – Slides by ADB

Fractional numbers
  Fractional numbers – fixed point
  Floating point numbers – the IEEE 754 floating point standard
  Floating point operations
  Rounding modes

Positional representation of fractional numbers
  In base 10
    Position      3      2      1      0    .   -1      -2      -3      -4
    Multiplier    10^3   10^2   10^1   10^0     10^-1   10^-2   10^-3   10^-4
    The decimal point sits between positions 0 and -1.
    [Figure: an example number with digits 65431 placed against these positions]

Positional representation of fractional numbers
  In base 2
    Position      3      2      1      0    .   -1      -2      -3      -4
    Multiplier    2^3    2^2    2^1    2^0      2^-1    2^-2    2^-3    2^-4
    The binary point sits between positions 0 and -1.
    [Figure: an example number with bits 10110 placed against these positions]

Fractional numbers – fixed point
  Fixed-point representation
    How much information is necessary to store?
    How do you choose a format for the bits?
  Fixed-point operations
    Addition: align the binary points, and add straight down
    Multiplication: ???

Decimal to binary conversion
  Convert A = 3.1415 (base 10) to base 2

Fixed-point number density
  [Figure: fixed-point number density]

Scientific notation
  In base 10
    Example: 3.0 × 10^8
  In base 2
    Example: –1.00101 × 2^4 (= –18.5 in base 10)
  The general form
    r = Sign × signiFicand × base^Exponent

Single-precision IEEE 754 floating-point numbers
    Bit 31    Bits 30-23    Bits 22-0
    Sign      exponent      significand
  One-bit sign
  Eight-bit exponent
  23-bit significand
    That's the fractional part

Single-precision IEEE 754 floating-point numbers
  Normalized numbers: only one non-zero bit to the left of the binary point
    Adjust the exponent as needed
    r = (–1)^S × F × 2^E
  Implicit leading 1 in the significand (the "hidden bit")
    r = (–1)^S × (1 + F) × 2^E
  Bias notation to represent the exponent
    With the bias B = 127
    r = (–1)^S × (1 + F) × 2^(E – B)

How to convert a base-10 number into IEEE 754 single-precision floating point
  Convert the number to binary
    The integer part
    And the fractional part
  Normalize
  Isolate the hidden one
  Remove the significand's hidden one
  Add the bias to the exponent
  Represent the number

Example: 12.625
  Convert to binary: 12.625 = 1100.101 (base 2)
  Normalize: 1.100101 × 2^3
  Remove the hidden one: F = 100101 followed by 17 zeros
  Add the bias to the exponent: E = 3 + 127 = 130 = 10000010 (base 2)
  The end: 0 10000010 10010100000000000000000 (0x414A0000)

Double-precision IEEE 754 floating-point numbers
    Bit 63    Bits 62-52    Bits 51-0
    Sign      exponent      significand
  One-bit sign
  Eleven-bit exponent
  52-bit significand
    That's the fractional part

Summary of IEEE 754 formats
                                     Precision
    Parameter                        Single    Single Ext.    Double    Double Ext.
    Bits for F (incl. hidden bit)    24        ≥ 32           53        ≥ 64
    Bits for E                       8         ≥ 11           11        ≥ 15
    Emax                             +127      ≥ +1023        +1023     ≥ +16383
    Emin                             –126      ≤ –1022        –1022     ≤ –16382
    Total bits                       32        ≥ 43           64        ≥ 79
  For every precision, there are reserved exponents, used for special quantities:
    Emin – 1 (i.e., E = 0 with bias) is used for zero and denorms
    Emax + 1 (i.e., E = 255 or E = 2047, with bias) is used for NaN and infinity

Special quantities: Infinity
  This special quantity avoids a halt on overflow
    Much safer than returning the largest possible number
  Representation:
    E = Emax + 1
      E = 255 in single precision, with bias
      E = 2047 in double precision, with bias
    F = 0
    Sign (+∞ or –∞)

Special quantities: Infinity
  Examples of operations involving ±Inf
    Operation      Result
    1/0            +Inf
    –1/0           –Inf
    4 – Inf        –Inf
    Sqrt(+Inf)     +Inf
    1/Inf          +0

Special quantities: NaN (Not a Number)
  This special quantity avoids a halt on invalid operations
  Representation:
    E = Emax + 1
      E = 255 in single precision, with bias
      E = 2047 in double precision, with bias
    F ≠ 0

Special quantities: NaN (Not a Number)
  Examples of operations that return NaN
    Operation    NaN produced by
    +            Inf + (–Inf)
    ×            0 × Inf
    /            0/0, Inf/Inf
    Sqrt         Sqrt(x) when x < 0

Special quantities: Zero
  Representation:
    E = Emin – 1 (i.e., E = 0)
    F = 0
    Sign: +0 or –0

Special quantities: Zero
  Examples of operations that involve ±0
    Operation    Result
    +0/3         +0
    –0/3         –0
    3/+0         +Inf
    3/–0         –Inf
    1/–Inf       –0
    log(+0)      –Inf
    log(–0)      –Inf in C99 and IEEE 754-2008 (some earlier treatments suggest NaN)

Floating point numbers range
  What is the largest number we can represent in IEEE 754 single-precision floating point?
  What is the smallest number?

Floating point numbers range
  What is the largest number we can represent in IEEE 754 double-precision floating point?
  What is the smallest number?

Floating point numbers: density
  Fact 1: Floats are not reals
    E.g., 2/3
  Fact 2: Floats are not decimals
    E.g., 0.1 (base 10) = 1.1001100… × 2^–4 (base 2)
  Fact 3: Not even all the integers in the range are represented
    E.g., 100,000,001 (base 10) = 101 1111 0101 1110 0001 0000 0001 (base 2),
    which needs 27 significant bits, more than the 24 available in single precision

Floating point numbers: density
  Close to 0: high density
  Far from 0: low density
  [Figure: floating-point number density]

Special quantities: Denormals
  These are numbers smaller in magnitude than 2^(Emin)
    Fill the gap between 2^(Emin) and 0 (gradual underflow)
  Representation:
    E = Emin – 1 (i.e., E = 0)
    F ≠ 0
  The number represented is 0.f-1 f-2 … f-23 × 2^(Emin)

Special quantities: Denormals
  From 0 to 2^(Emin)
  [Figure: denormals filling the interval from 0 to 2^(Emin)]

Summary of IEEE 754 numbers
    Single precision            Double precision
    Exponent    Significand     Exponent    Significand     Object
    0           0               0           0               Zero
    0           ≠ 0             0           ≠ 0             Denormal
    1 – 254     anything        1 – 2046    anything        Normalized number
    255         0               2047        0               Infinity
    255         ≠ 0             2047        ≠ 0             NaN
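
The single-precision layout, the conversion steps, and the summary table of special quantities can be checked with a short program. The following is a minimal Python sketch, not part of the original slides: the function names (float_to_bits, split_fields, classify, decode) are hypothetical helpers introduced here for illustration, and the sketch assumes Python's standard struct module, which packs a value as an IEEE 754 single when given the "f" format.

```python
import struct

BIAS = 127  # single-precision exponent bias B from the slides


def float_to_bits(x: float) -> int:
    """Round x to single precision and return its 32-bit pattern."""
    return struct.unpack(">I", struct.pack(">f", x))[0]


def split_fields(bits: int):
    """Split a 32-bit pattern into (sign, biased exponent, fraction)."""
    sign = bits >> 31               # bit 31
    exponent = (bits >> 23) & 0xFF  # bits 30-23
    fraction = bits & 0x7FFFFF      # bits 22-0
    return sign, exponent, fraction


def classify(bits: int) -> str:
    """Name the object, following the summary table of IEEE 754 numbers."""
    _, exponent, fraction = split_fields(bits)
    if exponent == 0:
        return "zero" if fraction == 0 else "denormal"
    if exponent == 255:
        return "infinity" if fraction == 0 else "NaN"
    return "normalized"


def decode(bits: int) -> float:
    """Rebuild the value from its fields using the formulas on the slides."""
    sign, exponent, fraction = split_fields(bits)
    kind = classify(bits)
    if kind == "NaN":
        return float("nan")
    if kind == "infinity":
        magnitude = float("inf")
    elif exponent == 0:
        # Zero and denormals: 0.F x 2^Emin, with Emin = 1 - BIAS = -126
        magnitude = (fraction / 2**23) * 2.0 ** (1 - BIAS)
    else:
        # Normalized: (1 + F) x 2^(E - B), the 1 being the hidden bit
        magnitude = (1 + fraction / 2**23) * 2.0 ** (exponent - BIAS)
    return -magnitude if sign else magnitude


bits = float_to_bits(12.625)
print(f"{bits:#010x}")     # 0x414a0000
print(split_fields(bits))  # (0, 130, 4849664)
print(decode(bits))        # 12.625
```

Under the same assumptions, decode(0x7F7FFFFF) gives the largest finite single-precision value, (2 – 2^–23) × 2^127, and decode(0x00000001) gives the smallest positive denormal, 2^–149. Calling decode(float_to_bits(100000001.0)) returns 100000000.0, which illustrates that not every integer in the range is representable.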