DOC PREVIEW
UVA CS 415 - Scanning

This preview shows page 1-2-15-16-31-32 out of 32 pages.

Save
View full document
View full document
Premium Document
Do you want full access? Go Premium and unlock all 32 pages.
Access to all documents
Download any document
Ad free experience
View full document
Premium Document
Do you want full access? Go Premium and unlock all 32 pages.
Access to all documents
Download any document
Ad free experience
View full document
Premium Document
Do you want full access? Go Premium and unlock all 32 pages.
Access to all documents
Download any document
Ad free experience
View full document
Premium Document
Do you want full access? Go Premium and unlock all 32 pages.
Access to all documents
Download any document
Ad free experience
View full document
Premium Document
Do you want full access? Go Premium and unlock all 32 pages.
Access to all documents
Download any document
Ad free experience
View full document
Premium Document
Do you want full access? Go Premium and unlock all 32 pages.
Access to all documents
Download any document
Ad free experience
Premium Document
Do you want full access? Go Premium and unlock all 32 pages.
Access to all documents
Download any document
Ad free experience

Unformatted text preview:

ScanningParsing & ScanningCharacters vs Tokens (review)Why Separate the Scanner and Parser?TokensTypical Tokens in Programming LanguagesPrinciple of Longest MatchReview: Languages & Automata Theory (in one slide)Regular Expressions and FAsRegular ExpressionsFundamental REsOperations on REsAbbreviationsExamplesMore ExamplesExampleRecognizing REsFinite State AutomatonExample: FSA for “cat”DFA vs NFAFAs in ScannersExample: DFA for hand-written scannerScanner DFA Example (1)Scanner DFA Example (2)Scanner DFA Example (3)Scanner DFA Example (4)Implementing a Scanner by Hand – Token RepresentationSimple Scanner ExampleScanner getToken() methodgetToken() (2)getToken() (3)getToken (4)1ScanningAaron BloomfieldCS 415Fall 20052Parsing & Scanning•In real compilers the recognizer is split into two phases–Scanner: translate input characters to tokens•Also, report lexical errors like illegal characters and illegal symbols–Parser: read token stream and reconstruct the derivationScanner Parsersource tokens3Characters vs Tokens (review)•Input text// this statement does very littleif (x >= y) y = 42;•Token StreamIF LPAREN ID(x) GEQ ID(y)RPARENID(y) BECOMES INT(42) SCOLON4Why Separate the Scanner and Parser?•Simplicity & Separation of Concerns–Scanner hides details from parser (comments, whitespace, input files, etc.)–Parser is easier to build; has simpler input stream•Efficiency–Scanner can use simpler, faster design•(But still often consumes a surprising amount of the compiler’s total execution time)5Tokens•Idea: we want a distinct token kind (lexical class) for each distinct terminal symbol in the programming language–Examine the grammar to find these•Some tokens may have attributes–Examples: integer constant token will have the actual integer (17, 42, ?) as an attribute; identifiers will have a string with the actual id6Typical Tokens in Programming Languages•Operators & Punctuation–+ - * / ( ) { } [ ] ; : :: < <= == = != ! …–Each of these is a distinct lexical class•Keywords–if while for goto return switch void …–Each of these is also a distinct lexical class (not a string)•Identifiers–A single ID lexical class, but parameterized by actual id•Integer constants–A single INT lexical class, but parameterized by int value•Other constants, etc.7Principle of Longest Match•In most languages, the scanner should pick the longest possible string to make up the next token if there is a choice•Examplereturn foobar != hohum;should be recognized as 5 tokensnot more (i.e., not parts of words or identifiers, or ! and = as separate tokens)RETURN ID(foobar) NEQ ID(hohum) SCOLON8Review: Languages & Automata Theory (in one slide)•Alphabet: a finite set of symbols•String: a finite, possibly empty sequence of symbols from an alphabet•Language: a set, often infinite, of strings•Finite specifications of (possibly infinite) languages–Automaton – a recognizer; a machine that accepts all strings in a language (and rejects all other strings)–Grammar – a generator; a system for producing all strings in the language (and no other strings)•A language may be specified by many different grammars and automata•A grammar or automaton specifies only one language9Regular Expressions and FAs•The lexical grammar (structure) of most programming languages can be specified with regular expressions–(Sometimes a little cheating is needed)•Tokens can be recognized by a deterministic finite automaton–Can be either table-driven or built by hand based on lexical grammar10Regular Expressions•Defined over some alphabet Σ–For programming languages, commonly ASCII or Unicode•If re is a regular expression, L(re ) is the language (set of strings) generated by re11Fundamental REsre L(re ) Notesa { a } Singleton set, for each a in Σε { ε } Empty string{ } Empty language12Operations on REs•Precedence: * (highest), concatenation, | (lowest)•Parentheses can be used to group REs as neededre L(re ) Notesrs L(r)L(s) Concatenationr|s L(r) L(s) Combination (union)r* L(r)* 0 or more occurrences (Kleene closure)13Abbreviations•The basic operations generate all possible regular expressions, but there are common abbreviations used for convenience. Typical examples:Abbr. Meaning Notesr+ (rr*) 1 or more occurrencesr? (r | ε) 0 or 1 occurrence[a-z] (a|b|…|z) 1 character in given range[abxyz] (a|b|x|y|z) 1 of the given characters14Examplesre Meaning+ single + character! single ! character= single = character!= 2 character sequence<= 2 character sequencehogwash 7 character sequence15More Examplesre Meaning[abc]+[abc]*[0-9]+[1-9][0-9]*[a-zA-Z][a-zA-Z0-9_]*16Example•Possible syntax for numeric constantsdigit ::= [0-9]digits ::= digit+number ::= digits ( . digits )? ( [eE] (+ | -)? digits ) ?17Recognizing REs•Finite automata can be used to recognize strings generated by regular expressions•Can build by hand or automatically–Not totally straightforward, but can be done systematically–Tools like Lex, Flex, and JLex do this automatically, given a set of REs18Finite State Automaton•A finite set of states–One marked as initial state–One or more marked as final states–States sometimes labeled or numbered•A set of transitions from state to state–Each labeled with symbol from Σ, or ε•Operate by reading input symbols (usually characters)–Transition can be taken if labeled with current symbol–ε-transition can be taken at any time•Accept when final state reached & no more input–Scanner slightly different – accept longest match even if more input•Reject if no transition possible or no more input and not in final state (DFA)19Example: FSA for “cat”a tc20DFA vs NFA•Deterministic Finite Automata (DFA)–No choice of which transition to take under any condition•Non-deterministic Finite Automata (NFA)–Choice of transition in at least one case–Accept if some way to reach final state on given input–Reject if no possible way to final state21FAs in Scanners•Want DFA for speed (no backtracking)•Conversion from regular expressions to NFA is easy•There is a well-defined procedure for converting a NFA to an equivalent DFA22Example: DFA for hand-written scanner•Idea: show a hand-written DFA for some typical programming language


View Full Document

UVA CS 415 - Scanning

Download Scanning
Our administrator received your request to download this document. We will send you the file to your email shortly.
Loading Unlocking...
Login

Join to view Scanning and access 3M+ class-specific study document.

or
We will never post anything without your permission.
Don't have an account?
Sign Up

Join to view Scanning 2 2 and access 3M+ class-specific study document.

or

By creating an account you agree to our Privacy Policy and Terms Of Use

Already a member?