UA CSC 453 - Lexical Analysis I - D2177462

Home> Schools> University of Arizona> Computer Science Main (CSC) > CSC 453> Lexical Analysis I

DOC PREVIEW

UA CSC 453 - Lexical Analysis I

School name University of Arizona

Course Csc 453- Compilers and Systems Software

Pages 49

This preview shows page 1-2-3-23-24-25-26-47-48-49 out of 49 pages.

Save

View full document

Premium Document

Do you want full access? Go Premium and unlock all 49 pages.

Access to all documents

Download any document

Ad free experience

Subscribe for instant access Get instant access

View full document

Premium Document

Do you want full access? Go Premium and unlock all 49 pages.

Access to all documents

Download any document

Ad free experience

Subscribe for instant access Get instant access

View full document

Premium Document

Do you want full access? Go Premium and unlock all 49 pages.

Access to all documents

Download any document

Ad free experience

Subscribe for instant access Get instant access

View full document

Premium Document

Do you want full access? Go Premium and unlock all 49 pages.

Access to all documents

Download any document

Ad free experience

Subscribe for instant access Get instant access

View full document

Premium Document

Do you want full access? Go Premium and unlock all 49 pages.

Access to all documents

Download any document

Ad free experience

Subscribe for instant access Get instant access

View full document

Premium Document

Do you want full access? Go Premium and unlock all 49 pages.

Access to all documents

Download any document

Ad free experience

Subscribe for instant access Get instant access

View full document

Premium Document

Do you want full access? Go Premium and unlock all 49 pages.

Access to all documents

Download any document

Ad free experience

Subscribe for instant access Get instant access

View full document

Premium Document

Do you want full access? Go Premium and unlock all 49 pages.

Access to all documents

Download any document

Ad free experience

Subscribe for instant access Get instant access

View full document

Premium Document

Do you want full access? Go Premium and unlock all 49 pages.

Access to all documents

Download any document

Ad free experience

Subscribe for instant access Get instant access

View full document

Premium Document

Do you want full access? Go Premium and unlock all 49 pages.

Access to all documents

Download any document

Ad free experience

Subscribe for instant access Get instant access

Premium Document

Do you want full access? Go Premium and unlock all 49 pages.

Access to all documents

Download any document

Ad free experience

Subscribe for instant access Get instant access

Unformatted text preview:

CSc 453Compilers and Systems Software3 : Lexical Analysis IDepartment of Computer ScienceUniversity of [email protected]° 2009 Christian CollbergCompiler PhasesWe are here!ASTasmSemanticAnalyserInterm.Code GenIRMachineCode GenIRASTLexertokenssourceerrorserrorsParsererrorsOptimizeCompiler Phases – Lexical analysisThe lexer reads the source file and divides the text into lexicalunits (tokens), such as:Reserved words BEGIN, IF,. . .identifiers x, StringTokenizer,. . .special characters +, ∗, −, ^,. . .numbers 30, 3.14,. . .comments (* text *),strings "text".Lexical errors (such as ’illegal character’, ’undelimitedcharacter string’, ’comment without end’) are reported.Lexical Analysis of EnglishThe sentence"The boy’s cowbell won’t play."would be translated to the list of tokensthe, boy+possessive, cowbell, will, not, playLexical Analysis of JavaThe sentence"x = 3.14 * (9.0+y);"would be translated to the list of tokens<ID,x>, EQ, <FLOAT,3.14>, STAR, LPAREN,<FLOAT,9.0>, PLUS, <ID,y>, RPAREN, SEMICOLONExample – Lexical AnalysisBreak up the source code (a text file) and into tokens.Source CodeStream of TokensPROCEDURE Foo ();VAR i : INTEGER;BEGINi := 1;WHILE i < 20 DOPRINT i * 2;i := i * 2 + 1;ENDDO;END Foo;PROCEDURE, <id,Foo>, LPAR, RPAR, SC,VAR, <id,i>, COLON, <id,INTEGER>,SC,BEGIN, <id,i>,CEQ,<int,1>,SC,WHILE, <id,i>, LT, <int,20>,DO,PRINT, <id,i>, MUL, <int,2>, SC,<id,i>, CEQ, <id,i>, MUL, <int,2>, PLUS,<int,1>, SC, ENDDO, SC, END, <id,Foo>,ProblemsFree vs. Fixed FormatMost languages are free format, i.e. it does not matter whereon a line of text a certain token occurs.FORTRAN (at least early versions) uses a fixed format wherethe first 6 characters on the input line is a label, and the lastcharacters (columns 72-80) a comment.A "C" in the first column indicates a comment line.Any character in column 6 indicates a continuation line.C Compute the determinant:det = a(1,1) * a(2,2) * a(3,3) + a(1,2) * a(2,3) * a(3,1)& + a(2,1) * a(3,2) * a(1,3) - a(3,1) * a(2,2) * a(1,3)& - a(2,1) * a(1,2) * a(3,3) - a(1,1) * a(3,2) * a(2,3)Free vs. Fixed Format. . .Python, Occam, and some functional languages useindentation to indicate nesting:def quicksort(list, start, end):if start < end:split = partition(list, start, end)quicksort(list, start, split-1)quicksort(list, split+1, end)else:returnWhitespaceIn most modern languages whitespace (blanks and tabs) aresignificant. FORTRAN and Algol-68 are different: whitespace maybe added anywhere to improve readability. The FORTRANstatementDO 5 I = 1.25is an assignment statement, meaning the same as:DO5I = 1.25This statement, on the other hand, is a loop statement:DO 5 I = 1,25...5 CONTINUEWhitespace. . .An error in a single FORTRAN statement resulted in the loss ofthe first American probe to Venus (the Mariner I).....DO 5 K = 1. 3T(K) = W0Z = 1.0/(X**2)*B1**2+3.0977E-4*B0**2D(K) = 3.076E-2*2.0*(1.0/X*B0*B1+3.0977E-4**(B0**2-X*B0*B1))/ZE(K) = H**2*93.2943*W0/SIN(W0)*ZH = D(K)-E(K)5 CONTINUEThis is now considered an urban legend.BufferingIf done incorrectly, lexical analysis can be an expensive phaseof the compiler – It is the only phase which actually considerseach and every character of the program.It is, for example, crucial not to read one character at a timefrom the input file. Rather, a large block of the input text filemust be read and but into abuffer. This buffer is then usedto provide the lexer with character.Sometimes the lexer may also need to look ahead at charactersto come before deciding on what token appears next in thetext. The buffer is useful in such circumstances also.KeywordsMost languages have reserved keywords, which means thatthese words may not be redefined by the user.PL/I does not reserve keywords which makes it difficult forthe lexer to distinguish between user-defined identifiers andkeywords:IF THEN THEN THEN = ELSE; ELSE ELSE = THEN;Error handlingWhat do we do when an error is encountered during lexicalanalysis?Panic Skip characters until a well-formed token isfound.Replace Replace an incorrect character.Delete Delete an incorrect character.Insert Insert a missing character.Transpose Switch two characters.CommunicationThe Lexer may communicate with the parser in many differentways.Lexical analysis might, for example, run as a special passwriting the tokens on a temporary file which is read by theparser.Or – and this is probably the most common situation – theparser makes a procedure call to the lexer whenever a token isneeded.The Lexer and the Parser could also run as two concurrentprocesses communicating over a pipe.Transition DiagramsTransition DiagramsE84569IDENT1011INT01723ENDASSIGNCOLONletter digitordigitorletterletter−EotherNDdigitotherdigit:otherotherotherother=TYPE TokenType = (Assign, End, ...);VAR s : (State0, State1, ...);c : CHAR;PROCEDURE GetToken () : TokenType;CASE s OFState0 : c := NextChar();CASE c OF":" : s := State1|"E" : s := State4|"0" .. "9" : s := State10|ELSE : s := State8END|State1 : c := NextChar();IF c="=" THEN s := State2ELSE s := State3 END|State2 : RETURN Assign|State3 : PutChar(c); RETURN Colon|State4 : c := NextChar();IF c = "N" THEN s := State5ELSE s := State8 END|State5 : c := NextChar();IF c = "D" THEN s := State6ELSE s := State8 END|State6 : c := NextChar();IF IsLetterOrDigit(c)THEN s := State8ELSE s := State7 END;|State7 : PutChar(c); RETURN End|State8 : c := NextChar();IF NOT IsLetterOrDigit(c)THEN s := State9 END|State9 : PutChar(c); RETURN Ident|State10 : c := NextChar();IF NOT IsDigit(c)THEN s := State11 END|State11 : PutChar(c); RETURN Int|END;END GetToken;Regular Grammars and LexicalAnalysisRegular GrammarsA grammar is regular if all rules are of the formA → aBA → aBy convention, the symbols A, B, C , . . . are non-terminals,a, b, c, . . . are terminals, and α, β, γ, . . . are strings of symbols.Regular grammars are used to describe the lexical structure ofprograms, i.e. what tokens look like.Regular Grammars. . .The following grammar describes C identifiers:id → letter | letter SS→ letter | letter SS→ digit | digit Sdigit → 0 | 1 | · · · | 9letter→ A | · · · | Z |a | · · · | zHere’s a derivation of the identifier cow5:id ⇒ letter S ⇒ c S ⇒ c letter S ⇒ c o S ⇒c o letter S ⇒ c o w S ⇒ c o w digit ⇒ cow 5Regular Grammars. . .This is a grammar for floating point numbers. As written, it isnot quite regular: We treat digitas a terminal.float → +float1 | - float1 | float1float1→ digit float1 |

View Full Document