UT CS 345 - Lexical and Syntactic Analysis - D2189880

Home> Schools> University of Texas at Austin> Computer Science (CS) > CS 345> Lexical and Syntactic Analysis

DOC PREVIEW

UT CS 345 - Lexical and Syntactic Analysis

School name University of Texas at Austin

Course Cs 345- Programming Languages

Pages 39

This preview shows page 1-2-3-18-19-37-38-39 out of 39 pages.

Save

View full document

Premium Document

Do you want full access? Go Premium and unlock all 39 pages.

Access to all documents

Download any document

Ad free experience

Subscribe for instant access Get instant access

View full document

Premium Document

Do you want full access? Go Premium and unlock all 39 pages.

Access to all documents

Download any document

Ad free experience

Subscribe for instant access Get instant access

View full document

Premium Document

Do you want full access? Go Premium and unlock all 39 pages.

Access to all documents

Download any document

Ad free experience

Subscribe for instant access Get instant access

View full document

Premium Document

Do you want full access? Go Premium and unlock all 39 pages.

Access to all documents

Download any document

Ad free experience

Subscribe for instant access Get instant access

View full document

Premium Document

Do you want full access? Go Premium and unlock all 39 pages.

Access to all documents

Download any document

Ad free experience

Subscribe for instant access Get instant access

View full document

Premium Document

Do you want full access? Go Premium and unlock all 39 pages.

Access to all documents

Download any document

Ad free experience

Subscribe for instant access Get instant access

View full document

Premium Document

Do you want full access? Go Premium and unlock all 39 pages.

Access to all documents

Download any document

Ad free experience

Subscribe for instant access Get instant access

View full document

Premium Document

Do you want full access? Go Premium and unlock all 39 pages.

Access to all documents

Download any document

Ad free experience

Subscribe for instant access Get instant access

Premium Document

Do you want full access? Go Premium and unlock all 39 pages.

Access to all documents

Download any document

Ad free experience

Subscribe for instant access Get instant access

Unformatted text preview:

Lexical and Syntactic AnalysisReading AssignmentSyntaxGrammarsExample: Decimal NumbersDerivation of 352 as an IntegerLeftmost DerivationChomsky HierarchyRegular GrammarsLexical AnalysisRegular ExpressionsExamples of Tokens in CReserved Keywords in CAutomatic Scanner GenerationFinite State AutomataDFA for C IdentifiersTraversing a DFAContext-Free GrammarsSyntactic CorrectnessCFG For Floating Point NumbersCFG For Balanced ParenthesesCFG For Decimal Numbers (Redux)Recursive Descent ParsingShift-Reduce ParsingConcrete vs. Abstract SyntaxExpression NotationMixed Expression NotationPostfix, Prefix, Mixfix in Java and CExpression Compilation ExampleSyntactic AmbiguityRemoving AmbiguityThis Grammar Is UnambiguousLeft- and Right-Recursive GrammarsYacc Expression Grammar“Dangling Else” AmbiguitySolving the Dangling Else AmbiguityShift-Reduce Conflicts in YaccAvoiding Yacc WarningMore Powerful Grammarsslide 1Vitaly ShmatikovCS 345Lexical and Syntactic Analysisslide 2Reading AssignmentMitchell, Chapters 4.1C Reference Manual, Chapters 2 and 7slide 3SyntaxSyntax of a programming language is a precise description of all grammatically correct programs• Precise formal syntax was first used in ALGOL 60Lexical syntax• Basic symbols (names, values, operators, etc.)Concrete syntax• Rules for writing expressions, statements, programsAbstract syntax• Internal representation of expressions and statements, capturing their “meaning” (i.e., semantics)slide 4GrammarsA meta-language is a language used to define other languagesA grammar is a meta-language used to define the syntax of a language. It consists of:• Finite set of terminal symbols• Finite set of non-terminal symbols• Finite set of production rules• Start symbol• Language = (possibly infinite) set of all sequences of symbols that can be derived by applying production rules starting from the start symbolBackus-NaurForm (BNF)slide 5Example: Decimal NumbersGrammar for unsigned decimal integers• Terminal symbols: 0, 1, 2, 3, 4, 5, 6, 7, 8, 9• Non-terminal symbols: Digit, Integer• Production rules:– Integer → Digit | Integer Digit– Digit → 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9• Start symbol: IntegerCan derive any unsigned integer using this grammar• Language = set of all unsigned decimal integersShorthand forInteger → DigitInteger → Integer Digitslide 6Integer ⇒ Integer Digit⇒ Integer 2⇒ Integer Digit 2⇒ Integer 5 2⇒ Digit 5 2⇒ 3 5 2Rightmost derivationAt each step, the rightmostnon-terminal is replacedDerivation of 352 as an IntegerProduction rules:Integer → Digit | Integer DigitDigit → 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9slide 7Leftmost DerivationInteger ⇒ Integer Digit⇒ Integer Digit Digit⇒ Digit Digit Digit⇒ 3 Digit Digit⇒ 3 5 Digit⇒ 3 5 2At each step, the leftmostnon-terminal is replacedProduction rules:Integer → Digit | Integer DigitDigit → 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9slide 8Chomsky HierarchyRegular grammars• Regular expressions, finite-state automata• Used to define lexical structure of the languageContext-free grammars• Non-deterministic pushdown automata• Used to define concrete syntax of the languageContext-sensitive grammarsUnrestricted grammars• Recursively enumerable languages, Turing machinesslide 9Regular GrammarsLeft regular grammar• All production rules have the formA →ωor A → Bω– Here A, B are non-terminal symbols, ω is a terminal symbolRight regular grammar• A →ωor A →ωBExample: grammar of decimal integersNot a regular language: {anbn| n ≥ 1 } (why?)What about this: “any sequence of integers where ( is eventually followed by )”?slide 10Lexical AnalysisSource code = long string of ASCII charactersLexical analyzer splits it into tokens• Token = sequence of characters (symbolic name) representing a single terminal symbolIdentifiers: myVariable …Literals: 123 5.67 true …Keywords: char sizeof …Operators: + - * / …Punctuation: ; , } { …Discards whitespace and commentsslide 11Regular Expressionsx character x \x escaped character, e.g., \n{ name } reference to a nameM | N M or NM N M followed by NM* 0 or more occurrences of MM+ 1 or more occurrences of M[x1… xn] One of x1… xn• Example: [aeiou] – vowels, [0-9] - digitsslide 12Examples of Tokens in CLexical analyzer usually represents each token by a unique integer code• “+” { return(PLUS); } // PLUS = 401• “-” { return(MINUS); } // MINUS = 402• “*” { return(MULT); } // MULT = 403• “/” { return(DIV); } // DIV = 404Some tokens require regular expressions• [a-zA-Z_][a-zA-Z0-9_]* { return (ID); } // identifier• [1-9][0-9]* { return(DECIMALINT); }• 0[0-7]* { return(OCTALINT); }• (0x|0X)[0-9a-fA-F]+ { return(HEXINT); }slide 13Reserved Keywords in Cauto, break, case, char, const, continue, default, do, double, else, enum, extern, float, for, goto, if, int, long, register, return, short, signed, sizeof, static, struct, switch, typedef, union, unsigned, void, volatile, wchar_t, whileC++ added a bunch: bool, catch, class, dynamic_cast, inline, private, protected, public, static_cast, template, this, virtual and othersEach keyword is mapped to its own tokenslide 14Automatic Scanner GenerationLexer or scanner recognizes and separates lexical tokens• Parser usually calls lexer when it’s ready to process the next symbol (lexer remembers where it left off)Scanner code usually generated automatically • Input: lexical definition (e.g., regular expressions)• Output: code implementing the scanner– Typically, this is a deterministic finite automaton (DFA) • Examples: Lex, Flex (C and C++), JLex (Java)slide 15Finite State AutomataSet of states• Usually represented as graph nodesInput alphabet + unique “end of program” symbolState transition function• Usually represented as directed graph edges (arcs)• Automaton is deterministic if, for each state and each input symbol, there is at most one outgoing arc from the state labeled with the input symbolUnique start stateOne or more final (accepting) statesslide 16DFA for C Identifiersslide 17Traversing a DFAConfiguration = state + remaining inputMove = traversing the arc exiting the state that corresponds to the leftmost input symbol, thereby consuming itIf no

View Full Document