CS421 Lecture 5–Lexical AnalysisCS421 Lecture 5 Lexical Analysis▶Today’s class▶Today s class▶ Lexing▶FiniteState Machine as Lexer▶Finite-State Machine as LexerCompiler OutlineCompiler Outline▶Front-End▶FrontEnd▶ Takes Input Source Code▶Returns Abstract Syntax Tree▶Returns Abstract Syntax Tree▶ Back-End▶ Takes Abstract Syntax Tree▶ Returns Machine Executable Binary CodeCompiler Outline (Figure)Compiler Outline (Figure)Manual and automatic methodsManual and automatic methods▶ We will study how to write lexers and yparsers. For each, we will give a manual technique and an automatic one:q▶ Lexing:▶Manual: Finite-state machinesManual: Finitestate machines▶ Automatic: Regular expressions – ocamllex▶Parsing▶Parsing▶ Manual: Top-down (recursive descent) parsingAt ti Btt(LR(1))l▶Automatic: Bottom-up (LR(1)) -ocamlyaccLexerLexer▶Divide input into“tokens”▶Divide input into tokens▶ Tokens are smallest units that are useful for parsing E g parser needs to know iffor parsing. E.g. parser needs to know if “while” keyword appears; doesn’t need to know that it is made up of characters w hknow that it is made up of characters w, h, etc.Wh ? Effi i▶Why? Efficiency▶ Simpler to specify grammatical structure, and il i fkimplement parser, in terms of tokensLexer Input & OutputLexer Input & Output▶Lexer Input▶Lexer Input▶ Character stream in the form of▶Input Stream or▶Input Stream, or▶ String▶Lexer Output▶Lexer Output▶ Stream of tokens, or▶List of tokens▶List of tokensTokenstype token = EOF | BOOLEAN | BREAK | CASE | CHAR | CLASS | CONST | CONTINUE| DO | DOUBLE | ELSE | EXTENDS | FINAL | FINALLY | FLOAT | FOR | DEFAULT | IMPLEMENTS | IMPORT | INT | NEW | IF | PUBLIC | SWITCH | RETURN | VOID | STATIC | WHILE | THIS| SWITCH | RETURN | VOID | STATIC | WHILE | THIS| NULL_LITERAL | LPAREN | RPAREN | LBRACE | RBRACE | LBRACK | RBRACK | SEMICOLON | COMMA | DOT | EQ | GT | LT | NOT | COMP | QUESTION | COLON | EQEQ | LTEQ | GTEQ | NOTEQ | ANDAND | OROR| PLUSPLUS | MINUSMINUS | PLUS | MINUS | MULT | DIV | AND | OR | XOR | MOD | LSHIFT | RSHIFT | URSHIFT | PLUSEQ | MINUSEQ | MULTEQ | DIVEQ | ANDEQ | OREQ | XOREQ | MODEQ | LSHIFTEQ | RSHIFTEQ|| || | | || URSHIFTEQ| BOOLEAN_LITERAL of bool| INTEGER_LITERAL of int| FLOAT LITERAL of float| FLOAT_LITERAL of float| IDENTIFIER of string| STRING_LITERAL of stringExampleExample▶Input▶Input“class MP1 { public static void main ( ……”▶ Output[CLASS; IDENTIFIER“MP1”; LBRACE; PUBLIC;[CLASS; IDENTIFIER MP1 ; LBRACE; PUBLIC; STATIC; VOID; IDENTIFIER “main”; LPAREN; …… ]Lexing with FSMLexing with FSM▶Words recognized by corresponding finite▶Words recognized by corresponding finite state automaton▶Deterministic Finite Automaton (DFA)▶Deterministic Finite Automaton (DFA)▶ A directed graph whose verticesare labeled from a setTokens U {Error}and whoseedgesfrom a set Tokens U {Error}and whose edgesare labeled with sets of characters. Also, if the outgoing edges from vertex v are {e1,…,en},outgoing edges from vertex v are {e1, …, en}, then the sets label(e1), …, label(en) are disjoint. Also, a vertex u specified as the start vertex.Example 1Example 1▶DFA for keywords▶DFA for keywords class case finallyExample 2Example 2▶DFA for Operators▶DFA for Operators ; { + += < <= << <<=Example 3Example 3▶DFA for integer constants▶DFA for integer constantsExample 4Example 4▶DFA for integers and decimal▶DFA for integers and decimalCompleting the DFACompleting the DFA▶Need to create a single DFA for all tokens–▶Need to create a single DFA for all tokens recall that all outgoing edges must have disjoint label setsdisjoint label sets.▶ DFA labels are similar to tokens in the token data type but not necessarily identicaldata type, but not necessarily identical▶ For keyword & identifiers:▶ Instead of creating the DFA shown earlier, create a small DFA and use action to distinguish keywordsdistinguish keywordsImplementing lexing with a DFAImplementing lexing with a DFA▶ Define a transition function▶ state x character -> state u {-1}▶ Label▶ state -> token u {error}▶ Assume start state = 0Implementing lexing with a DFAImplementing lexing with a DFAFunction to get a single token:(state x string) getnexttoken() {s = 0; tokenchars = “”;hil ( ) {while (true) {c = peek at next charif (move(s c) ==1)if (move(s,c) == -1)return (s,tokenchars)move c from input to tokencharsmove c from input to tokencharss = move(s,c)}Implementing lexing with a DFAImplementing lexing with a DFAtoken list gettokens() {tokenlis = []while (true) {c = peek at next charpif (c == eofchar) {tokenlis = tokenlis @ [EOF]breakbreak}(s, tokenchars) = getnexttoken()perform action based on s and tokenchars}return tokenlis}Typical lexer actionsTypical lexer actions▶Recall that label(s) is a token or error▶Recall that label(s) is a token or error. Action depends on that label, e.g.:▶Error: Represents an erroneous input; abort▶Error: Represents an erroneous input; abort.▶ LTLT:▶ IDENT:▶ COMMENTMore DFAsMore DFAsMore DFAsMore DFAsMore DFAsMore
View Full Document