1SyntaxAmer DiwanWhat is syntax?• Syntax– Determines the form of programs• Semantics– Determines the meaning of programs– Not all syntactically correct programs are “meaningful”– e.g., int i; i.m();is syntactically correct but semantically meaningless2Compilation and syntaxCharacter streamScanner (lexical anal)Token streamParser (syntax analysis)Parse treeSemantic analysis andIR generationASTMachine-independent optimizationsModified intermediate formTarget code generationAssembly or machine languageMachine-specificoptimizations Modified assemblySyntaxCSCI4555Lexical analysis• What are tokens?– Variables (x, i, theCount, ...)– Keywords (while, begin, ...)– Numbers (10, 1.5E10, ...)– Other symbols ({, ...)• How do languages specify legal tokens?(what is a legal variable name?)3Specifying tokens with regular expressions• What is a regular expression?– A character (e.g., opening_brace = {)– An empty string, denoted ?– Two regular expressions next to each other (e.g., auto_incr = plus plus)– Two regular expressions separated by a vertical bar (e.g., binary_op = plus | minus)– A regular expression followed by a kleene-star (e.g., integer = digit digit*)Examples of regular expressions• variable_name = (a|b|...|z|A|B|...|Z)(a|b|...|z|A|B|...|Z|0|...|9)*What kinds of variable names does this allow?• digit = 0|1|...|9unsigned_integer = digit digit*unsigned_number = unsigned_integer((. unsigned_integer) | ?)4Recognizing tokens: Ad-hoc approach(small grammars)• getWord(InputStream is) {...}• getNumber(InputStream is) {...}• matchKeyword(String s) {if (s.equal(“BEGIN”)) ...else if (s.equal(“END”)) ......}• if is_num(next_char) return new Number(getNumber())else { word = getWord();if matchKeyword(word) then ...else return new Identifier(word)}Recognizing tokens(larger grammars)• Build (or have a tool build) a scanner based on a DFAstartcharacterdigit•Want the longest token•myVarand not m or my or myV or ...5Language choices may complicate lexical analysis• FORTRAN (pre ‘90)– DO 5 I = 1.25assigns a value 1.25 to variable DO5I (spaces are ignored)– May lead to programmer errors!• Pascal: is 4.– Start of a floating point number (4.51)?– Half way through a subrange (4..20)?– Need to look ahead 2 charactersParsing• What are the possible statements of a language?– Addition (a+b)– While loop (while (b) { ... })– For loop (for (i = 0; i < 10; ++i) { ...})– Class declaration (class S extends T {...})– ...• How do languages specify legal statements?6Context-free grammars• regular expressions + recursion• expr -> identifier | number | - expr| expr operator exproperator -> + | - | * | /• Symbols that appear on lhs are non-terminals• Symbols that appear only on rhs are terminals– A scanner returns a stream of terminals to the parserParse tree for -x+y expr- exprexpr operator expridentifier(x) identifier(y)+Does this match what you would mean with -x+y?expr-> identifier | number | - expr| exproperator exproperator -> + | - | * | /7Another parse for -x+yexprexpr operator expridentifier(y)+- expridentifier(x)Does this match what you would mean with -x+y?expr-> identifier | number | - expr| exproperator exproperator -> + | - | * | /Ambiguity in grammars• When more than one production can be used– Resolution mechanisms:• Precedence (* has higher precedence than +)• Associativity (10 - 4 - 3 means (10 - 4) - 3)• Let’s rewrite the grammar to eliminate ambiguities810-4-3number (10)number(4)number(3)-a nonterminala nonterminal-Desired parse:“-” groups to the “left” (left associative)instead of expr -> expr operator expr, haveexpr -> expr operator term | termterm -> identifier | number10-4*3number (10)number(4)number(3)*a nonterminala nonterminal-Desired parse:Note how the higher precedence operators come lower in the derivation treeexpr -> expr add_op term | termterm -> term mult_op factor | factorfactor -> identifier | number | - factor | ( expression)9Parsing using the grammarTop down parsing• Predict non-terminal and match– id_list -> id id_list_tailid_list_tail -> , id id_list_tail | ;match_id_list() {match_id(); match_id_list_tail(); }match_id_list_tail() {if (cur_token == comma) {match_comma(); match_id(); match_id_list_tail();} else if (cur_token == semicolon) {match_semicolon();} else error();– More intuitive (when grammar is suitable)ParsingWhat next?• What does the parser in the previous slide do?– x, y, z; Yup!– x,, y, z: Nope!– x y, z: Nope!• What else needs to happen in a parser?– Build a parse or abstract syntax tree for later use10Building an AST• E.g., AST for:<ul><li> item 1 </li><li> item 2</li></ul> symbol tableitemsparentListNodetextparenttextparentPlainTextNodePlainTextNode“item 1” “item 2”StringStringitem valuenextitem valuenextparent parentListItemListItemsymbol tablesymbol tableTop down parsingWhen is the grammar suitable?• When you can predict the next production with a fixed (small) number of lookaheads• How many lookaheads do we need?– id_list -> id id_list_tailid_list_tail -> , id id_list_tail | ;– id_list -> id_list, id; | id;– expr -> expr add_op term | termterm -> term mult_op factor | factorfactor -> identifier | number | - factor | (expression)11An example• A -> Aa | Aab | cAnother example• S -> a S a | epsilon12Bottom-up parsing• Look at token stream and construct non-terminals– id_list -> id id_list_tailid_list_tail -> , id id_list_tail | ;– Parsing A,B• “A”: doesn’t match the rhs of any production• “A,”: ditto• “A,B”: ditto• “A,B;”: “;” matches a rhs of id_list_tail• “A,B id_list_tail”: ditto• “A id_list_tail”: matches rhs of id_list. Done!More on bottom-up parsing• Can handle more grammars than top-down since there is no prediction• We will look at only top-down in this course13Language design and parsing• Language design can have a significant effect on the complexity of parsers– Languages that are not amenable to straightforward parsing may also interfere with human readability– e.g., Pascal if-then-else statementsif (x) then if (y) thenwrite(“y’s then”);elsewrite (“whose else??”);Relationship to other courses• CSCI 4555: – More emphasis on using parser generation tools for larger grammars than we will use here14Relationship to readings• Reading covers parsing and lexical analysis in more detail– Gives more examples for both–
View Full Document