Berkeley COMPSCI 61C - Lecture Notes

CS 61C L16 Floating Point II (1) Wawrzynek Spring 2006 © UCB

CS61C – Machine Structures
Lecture 16 - Floating Point Numbers II
2/24/2006
John Wawrzynek (www.cs.berkeley.edu/~johnw)
www-inst.eecs.berkeley.edu/~cs61c/

IEEE 754 Floating Point Standard (review)
° Biased Notation, where bias is the number subtracted to get the real number
  • IEEE 754 uses bias of 127 for single precision
  • Subtract 127 from Exponent field to get actual value for exponent
  • 1023 is bias for double precision
° Summary (single precision):

    Bit 31    Bits 30-23    Bits 22-0
    S         Exponent      Significand
    1 bit     8 bits        23 bits

    (-1)^S x (1 + Significand) x 2^(Exponent-127)

° Double precision uses the same formula, except with exponent bias of 1023 (and wider fields: 11-bit Exponent, 52-bit Significand)

Example: Converting Binary FP to Decimal

    0 0110 1000 101 0101 0100 0011 0100 0010

° Sign: 0 => positive
° Exponent:
  • 0110 1000_two = 104_ten
  • Bias adjustment: 104 - 127 = -23
° Significand:
    1 + 1x2^-1 + 0x2^-2 + 1x2^-3 + 0x2^-4 + 1x2^-5 + ...
    = 1 + 2^-1 + 2^-3 + 2^-5 + 2^-7 + 2^-9 + 2^-14 + 2^-15 + 2^-17 + 2^-22
    = 1.0 + 0.666115
° Represents: 1.666115_ten * 2^-23 ~ 1.986 * 10^-7 (about 2/10,000,000)

Example: Converting Decimal to FP

    -2.340625 x 10^1

1. Denormalize: -23.40625
2. Convert integer part:
   23 = 16 + ( 7 = 4 + ( 3 = 2 + ( 1 ) ) ) = 10111_2
3. Convert fractional part:
   .40625 = .25 + ( .15625 = .125 + ( .03125 ) ) = .01101_2
4. Put parts together and normalize:
   10111.01101 = 1.011101101 x 2^4
5. Convert exponent: 127 + 4 = 10000011_2

    1 1000 0011 011 1011 0100 0000 0000 0000

Representation for +/- Infinity
° In FP, divide by zero should produce +/- infinity, not overflow.
° Why?
  • OK to do further computations with infinity, e.g., X/0 > Y may be a valid comparison
° IEEE 754 represents +/- infinity
  • Largest positive exponent reserved for infinity
  • Significands all zeroes

Representation for 0
° Represent 0?
  • exponent all zeroes
  • significand all zeroes
  • What about sign? Both cases valid.

    +0: 0 00000000 00000000000000000000000
    -0: 1 00000000 00000000000000000000000

Special Numbers
° What have we defined so far? (Single Precision)

    Exponent    Significand    Object
    0           0              0
    0           nonzero        ???
    1-254       anything       +/- fl. pt. #
    255         0              +/- infinity
    255         nonzero        ???

° Professor Kahan had clever ideas; “Waste not, want not”
  • We’ll talk about Exp=0,255 & Sig!=0 later

Precision and Accuracy
Precision is a count of the number of bits in a computer word used to represent a value.
Accuracy is a measure of the difference between the actual value of a number and its computer representation.
Don’t confuse these two terms!
High precision permits high accuracy but doesn’t guarantee it. It is possible to have high precision but low accuracy.
Example:
    float pi = 3.14;
pi will be represented using all 24 bits of the significand (highly precise), but is only an approximation (not accurate).

Administrivia
° Midterm 1, 1 Pimentel, Tonight 6-8pm sharp
  • Open Book/Notes, but no electronic devices of any kind!
° Don’t forget to work on homework and start project 3 over the weekend.

Representation for Not a Number
° What do I get if I calculate sqrt(-4.0) or 0/0?
  • If infinity is not an error, these shouldn’t be either.
  • Called Not a Number (NaN)
  • Exponent = 255, Significand nonzero
° Why is this useful?
  • Hope NaNs help with debugging?
  • They contaminate: op(NaN,X) = NaN

Special Numbers (cont’d)
° What have we defined so far? (Single Precision)

    Exponent    Significand    Object
    0           0              0
    0           nonzero        ???
    1-254       anything       +/- fl. pt. #
    255         0              +/- infinity
    255         nonzero        NaN

Representation for Denorms (1/2)
° Problem: There’s a gap among representable FP numbers around 0
  • Smallest representable pos num:
    a = 1.0…_2 * 2^-126 = 2^-126
  • Second smallest representable pos num:
    b = 1.000……1_2 * 2^-126 = 2^-126 + 2^-149
  • a - 0 = 2^-126
  • b - a = 2^-149

[Figure: number line from - to + marking 0, a, and b, with the gaps around 0 labeled]

Representation for Denorms (2/2)
° Solution:
  • We still haven’t used Exponent = 0, Significand nonzero
  • Denormalized number: no (implied) leading 1, exponent = -126.
  • Smallest representable pos num:
    a = 2^-149
  • Second smallest representable pos num:
    b = 2^-148

[Figure: number line from - to + around 0]

Rounding
° When we perform math on real numbers, we have to worry about rounding to fit the result in the significand field.
° The FP hardware carries two extra bits of precision, and then rounds to get the proper value
° Rounding also occurs when converting:
  • a double to a single precision value, or
  • a floating point number to an integer
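
The single-precision encodings covered so far (the field layout, the bias of 127, and the Exponent/Significand table for zero, denorms, infinity, and NaN) can be sketched in C. This is an illustrative sketch rather than code from the lecture: the names fp_kind, classify_bits, decode_bits, bits_to_float, and scale_by_pow2 are hypothetical, and it assumes that float on the target machine is an IEEE 754 single-precision value.

```c
#include <assert.h>
#include <math.h>     /* only for the INFINITY and NAN macros */
#include <stdint.h>
#include <string.h>

/* Hypothetical names; the slides define only the bit layout and the table. */
typedef enum { KIND_ZERO, KIND_DENORM, KIND_NORMAL, KIND_INFINITY, KIND_NAN } fp_kind;

/* Classify a 32-bit pattern using the Exponent/Significand table. */
fp_kind classify_bits(uint32_t bits) {
    uint32_t exponent    = (bits >> 23) & 0xFFu;  /* bits 30-23 */
    uint32_t significand = bits & 0x7FFFFFu;      /* bits 22-0  */
    if (exponent == 0)   return significand == 0 ? KIND_ZERO : KIND_DENORM;
    if (exponent == 255) return significand == 0 ? KIND_INFINITY : KIND_NAN;
    return KIND_NORMAL;
}

/* Multiply x by 2^e using exact doublings or halvings
   (the standard ldexp function does the same job). */
static double scale_by_pow2(double x, int e) {
    for (; e > 0; e--) x *= 2.0;
    for (; e < 0; e++) x *= 0.5;
    return x;
}

/* Evaluate (-1)^S x (1 + Significand) x 2^(Exponent-127) by hand.
   Denorms drop the implied leading 1 and use a fixed exponent of -126. */
double decode_bits(uint32_t bits) {
    uint32_t sign     = bits >> 31;
    uint32_t exponent = (bits >> 23) & 0xFFu;
    double   fraction = (bits & 0x7FFFFFu) / 8388608.0;  /* divide by 2^23 */
    double   magnitude;

    switch (classify_bits(bits)) {
    case KIND_ZERO:     magnitude = 0.0; break;
    case KIND_DENORM:   magnitude = scale_by_pow2(fraction, -126); break;
    case KIND_NORMAL:   magnitude = scale_by_pow2(1.0 + fraction, (int)exponent - 127); break;
    case KIND_INFINITY: magnitude = INFINITY; break;
    default:            return NAN;  /* KIND_NAN */
    }
    return sign ? -magnitude : magnitude;
}

/* Reinterpret the same 32 bits as the machine's float, for comparison.
   Assumption: float is a 32-bit IEEE 754 single-precision value. */
float bits_to_float(uint32_t bits) {
    float f;
    assert(sizeof f == sizeof bits);
    memcpy(&f, &bits, sizeof f);
    return f;
}
```

Under those assumptions, decode_bits(0x34554342u) evaluates the bit pattern from the binary-to-decimal example (about 1.986 * 10^-7), and decode_bits(0xC1BB4000u) gives -23.40625, the pattern built in the decimal-to-FP example.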

IEEE FP Rounding Modes
° Round towards +infinity
  • ALWAYS round “up”: 2.001 → 3, -2.001 → -2
° Round towards -infinity
  • ALWAYS round “down”: 1.999 → 1, -1.999 → -2
° Truncate
  • Just drop the last bits (round towards 0)
° Round to (nearest) even
  • Normal rounding, almost

Round to Even
° Round like you learned in grade school
° Except if the value is right on the borderline, in which case we round to the nearest EVEN number
  • 2.5 → 2
  • 3.5 → 4
° Ensures fairness on calculation
  • This way, half the time we round up on a tie, the other half of the time we round down
  • Tends to balance out inaccuracies
° This is the default rounding mode

Casting floats to ints and vice versa
° (int) floating point expression
  • Coerces and converts it to the nearest integer (C uses truncation)
    i = (int) (3.14159 * f);
° (float) expression
  • converts integer to nearest floating point
    f = f + (float) i;

int → float → int

    if (i == (int)((float) i)) {
        printf("true");
    }

° Will not always print “true”
° Most large values of integers don’t have exact floating point representations
° What about double?

float → int → float
° Will not always print “true”
° Small
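
The round-to-even rule, C's truncating (int) cast, and the int → float → int round trip can be checked with a few small C helpers. This is a minimal sketch, not code from the lecture: the names via_float, via_double, and truncate_to_int are hypothetical, and it assumes a 32-bit IEEE 754 float, a 64-bit IEEE 754 double, and the default round-to-nearest-even mode. The results are widened to 64 bits because (float)i can round up to 2^31, which does not fit in a 32-bit int.

```c
#include <assert.h>
#include <stdint.h>

/* int -> float -> integer. A float has a 24-bit significand, so integers
   above 2^24 may be rounded to a neighbouring representable value. */
int64_t via_float(int32_t i) {
    float f = (float)i;
    return (int64_t)f;
}

/* int -> double -> integer. A double has a 53-bit significand, so every
   32-bit integer is represented exactly. */
int64_t via_double(int32_t i) {
    double d = (double)i;
    return (int64_t)d;
}

/* C's (int) cast truncates: it drops the fractional part, rounding toward 0.
   The argument must be within the range of int. */
int truncate_to_int(double x) {
    return (int)x;
}
```

Under those assumptions, via_float(16777217) returns 16777216 and via_float(16777219) returns 16777220: both inputs sit exactly halfway between two representable floats, and each tie goes to the neighbour whose significand is even. via_double returns its argument unchanged for every 32-bit integer, which answers the "What about double?" question for the int → double → int direction.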

