DOC PREVIEW
Berkeley COMPSCI 164 - Building a Scanner

This preview shows page 1-2-3-4 out of 13 pages.

Save
View full document
View full document
Premium Document
Do you want full access? Go Premium and unlock all 13 pages.
Access to all documents
Download any document
Ad free experience
View full document
Premium Document
Do you want full access? Go Premium and unlock all 13 pages.
Access to all documents
Download any document
Ad free experience
View full document
Premium Document
Do you want full access? Go Premium and unlock all 13 pages.
Access to all documents
Download any document
Ad free experience
View full document
Premium Document
Do you want full access? Go Premium and unlock all 13 pages.
Access to all documents
Download any document
Ad free experience
Premium Document
Do you want full access? Go Premium and unlock all 13 pages.
Access to all documents
Download any document
Ad free experience

Unformatted text preview:

1Ras Bodik, CS 164, Fall 2004 1Building a ScannerCS164, Fall 2004Ras Bodik, CS 164, Fall 20042Administrativia• Extra credit for bugs in project assignments– in starter kits and handouts– TAs are final arbiters of what’s a bug– only the first student to report the bug gets creditWhat does a lexer do?Ras Bodik, CS 164, Fall 20044Recall: The Structure of a Compilerscannerparsercheckercode genDecaf program (stream of characters)stream of tokensAbstract Syntax Tree (AST) AST with annotations (types, declarations)maybe x86Ras Bodik, CS 164, Fall 20045Recall: Lexical Analysis• The input is just a sequence of characters. Example:if (i == j)z = 0;elsez = 1;• More accurately, the input is string:\tif (i == j)\n\t\tz = 0;\n\telse\n\t\tz = 1;• Goal: find lexemes and map them to tokens:1. partition input string into substrings (called lexemes), and2. classify them according to their role (role = token)Ras Bodik, CS 164, Fall 20046Continued• Lexer input:\tif(i==j)\n\t\tz = 0;\n\telse\n\t\tz = 1;• partitioned into these lexemes:\t if ( i == j ) \n\t\t z = 0 ; \n\t else \n\t\t z = 1 ;• mapped to a sequence of tokensIF, LPAR, ID(“i”), EQUALS, ID(“j”) …•Notes: – whitespace lexemes are dropped, not mapped to tokens• is this the same fatal mistake as in FORTRAN? (see Lecture 1)– some tokens have attributes: the lexeme and/or line number• why do we need them?2Ras Bodik, CS 164, Fall 20047What’s a Token?• A token is a syntactic category–In English:noun, verb, adjective, …– In a programming language:Identifier, Integer, Keyword, Whitespace, …• Parser relies on the token distinctions: – identifiers are treated differently than keywords– but all identifiers are treated the same, regardless of what lexeme created them Ras Bodik, CS 164, Fall 20048What are lexemes?• Webster: – “items in the vocabulary of a language”• cs164: – same: items in the vocabulary of a language:• numbers, keywords, identifiers, operators, etc.– strings into which the input string is partitioned.How to build a scanner for Decaf?Ras Bodik, CS 164, Fall 200410Writing the lexer•Not by hand– tedious, repetitious, error-prone, non-maintainable• Instead, we’ll build a lexer generator– once we have the generator, we’ll only describe the lexemes and their tokens …• that is, provide Decaf’s lexical specification (the What)– … and generate code that performs the partitioning• generated code hides repeated code (the How)Ras Bodik, CS 164, Fall 200411Code generator: key benefit& The scanner generator allows the programmer to focus on:•Whatthe lexer should do,• rather than Howit should be done.• what: declarative programming• how: imperative programmingRas Bodik, CS 164, Fall 200412Imperative scanner (in Java)• Let’s first build the scanner in Java, by hand:• to see how it is done, and where’s the repetitiouscode that we want to hide• A simple scanner will do. Only four tokens:“*”TIMES“+”PLUS“==“EQUALSa sequence of one or more letters or digits starting with a letterIDLexemeTOKEN3Ras Bodik, CS 164, Fall 200413Imperative scannerc=nextChar();if (c == ‘=‘) { c=nextChar(); if (c == ‘=‘) {return EQUALS;}}if (c == ‘+’) { return PLUS; }if (c == ‘*’) { return TIMES; }if (c is a letter) { c=NextChar(); while (c is a letter or digit) { c=NextChar(); }undoNextChar(c);return ID;}Ras Bodik, CS 164, Fall 200414Imperative scanner• You could write your entire scanner in this style– and for small scanners this style is appropriate• This code looks simple and clean, but try to add – tokens that start with the same string: “if” and “iffy”– C-style comments: /* anything here /* nested comments */ */– string literals with escape sequences: “…\t … \”…”– error handling, e.g., badly formed strings (see PA2)• Look at StreamTokenizer.nextToken() for an example of real imperative scanner– in Eclipse, type Ctrl+Shift+T.– Enter StreamTokenizer. – Press F4. In the Hierarchy view, select method nextToken.Ras Bodik, CS 164, Fall 200415Maximal munch rule• What is the need for undoNextChar()?– it performs look-ahead, to determine whether the ID lexeme can be grown further• This is an example of maximal much rule:– this rule followed by all scanners– the rule: the input character stream is partitioned into lexemes that are as large as possible– Ex.: in Java, “iffy” is not partitioned into “if” (the IF keyword) and “fy” (ID), but into “iffy” (ID)Ras Bodik, CS 164, Fall 200416Imperative Lexer: what vs. howc=nextChar();if (c == ‘=‘) { c=nextChar(); if (c == ‘=‘) {return EQUALS;}}if (c == ‘+’) { return PLUS; }if (c == ‘*’) { return TIMES; }if (c is a letter) { c=NextChar(); while (c is a letter or digit) { c=NextChar(); }undoNextChar(c);return ID;} ) little logic, much plumbingRas Bodik, CS 164, Fall 200417Identifying the plumbing (the how)c=nextChar();if (c == ‘=‘) { c=nextChar(); if (c == ‘=‘) {return EQUALS;}}if (c == ‘+’) { return PLUS; }if (c == ‘*’) { return TIMES; }if (c is a letter) { c=NextChar(); while (c is a letter or digit) { c=NextChar(); }undoNextChar(c);return ID;} ) characters read always the same wayRas Bodik, CS 164, Fall 200418Identifying the plumbing (the how)c=nextChar();if (c == ‘=‘) { c=nextChar(); if (c == ‘=‘) {return EQUALS;}}if (c == ‘+’) { return PLUS; }if (c == ‘*’) { return TIMES; }if (c is a letter) { c=NextChar(); while (c is a letter or digit) { c=NextChar(); }undoNextChar(c);return ID;} ) tokens are always return-ed4Ras Bodik, CS 164, Fall 200419Identifying the plumbing (the how)c=nextChar();if (c == ‘=‘) { c=nextChar(); if (c == ‘=‘) {return EQUALS;}}if (c == ‘+’) { return PLUS; }if (c == ‘*’) { return TIMES; }if (c is a letter) { c=NextChar(); while (c is a letter or digit) { c=NextChar(); }undoNextChar(c);return ID;} ) the lookahead is explicitRas Bodik, CS 164, Fall 200420Identifying the plumbing (the how)c=nextChar();if (c == ‘=‘) { c=nextChar(); if (c == ‘=‘) {return EQUALS;}}if (c == ‘+’) { return PLUS; }if (c == ‘*’) { return TIMES; }if (c is a letter) { c=NextChar(); while (c is a letter or digit) { c=NextChar(); }undoNextChar(c);return ID;} ) must build decision tree out of nested if’s (yuck!)Ras Bodik, CS 164, Fall 200421Can we hide the plumbing?• That is, can we avoid having to – spell out the details of calls to the input


View Full Document

Berkeley COMPSCI 164 - Building a Scanner

Documents in this Course
Lecture 8

Lecture 8

40 pages

Load more
Download Building a Scanner
Our administrator received your request to download this document. We will send you the file to your email shortly.
Loading Unlocking...
Login

Join to view Building a Scanner and access 3M+ class-specific study document.

or
We will never post anything without your permission.
Don't have an account?
Sign Up

Join to view Building a Scanner 2 2 and access 3M+ class-specific study document.

or

By creating an account you agree to our Privacy Policy and Terms Of Use

Already a member?