Lecture 2: Lexical Analysis

Administrivia
• The newsgroup appears to be functioning and is now managed by CSUA. Visit news.csua.berkeley.edu.
• The lecture page also has readings. Try to read once before lecture.
• Log into your class account ASAP (I still have account forms).
• Start forming teams:
  – Choose a team name (letters, digits, and underscores only, starting with a capital letter).
  – Email me ([email protected]) the name of the team and the class logins of its members (also mail changes).
• Good time to start learning Python (manuals online).

Last modified: Mon Feb 23 14:35:34 2009 CS164: Lecture #2

Review: Front End Compiler Structure

  Source code → Lexical Analysis → Tokens → Parsing → AST → Semantic Analysis → Decorated AST
  (We are here: Lexical Analysis.)

• A sequence of translations that each:
  – Filter out errors
  – Remove or put aside extraneous information
  – Make data more conveniently accessible.
• Strategy: find tools that partially automate this procedure.
• For lexical analysis: convert a description that uses patterns (extended regular expressions) into a program.

Tokens
• A token consists of a syntactic category (like “noun” or “adjective”) plus semantic information (like a particular name).
• Parsing (the “customer”) only needs the syntactic category:
  – “Joe went to the store” and “Harry went to the beach” have the same grammatical structure.
• For programming, the semantic information might be the text of an identifier or numeral.
• Example from Notes:

    if(i== j)
      z = 0; /* No work needed */
    else
      z= 1;

  ⟹

    IF, LPAR, ID("i"), EQUALS, ID("j"), RPAR, ID("z"),
    ASSIGN, INTLIT("0"), SEMI, ELSE, ID("z"), ASSIGN,
    INTLIT("1"), SEMI

Classical Regular Expressions
• Regular expressions denote formal languages, which are sets of strings (of symbols from some alphabet).
• Appropriate since the internal structure is not all that complex yet.
• Expression R denotes language L(R):
  – L(ε) = L("") = {""}.
  – If c is a character, L(c) = {"c"}.
  – If R1, R2 are r.e.s, L(R1R2) = {x1x2 | x1 ∈ L(R1), x2 ∈ L(R2)}.
  – L(R1|R2) = L(R1) ∪ L(R2).
  – L(R*) = L(ε) ∪ L(R) ∪ L(RR) ∪ · · ·.
  – L((R)) = L(R).
• Precedence is ‘*’ (highest), concatenation, union (lowest). Parentheses also provide grouping.

Abbreviations
• Character lists, such as [abcf-mxy] in Java, Perl, or Python.
• Negative character lists, such as [^aeiou].
• Character classes such as . (dot), \d, \s in Java, Perl, Python.
• L(R+) = L(RR*).
• L(R?) = L(ε|R).

Extensions
• “Capture” parenthesized expressions:
  – After m = re.match(r'\s*(\d+)\s*,\s*(\d+)\s*', '12,34'), have m.group(1) == '12', m.group(2) == '34'.
• Lazy vs. greedy quantifiers:
  – re.match(r'(\d+).*', '1234ab') makes group(1) match '1234'.
  – re.match(r'(\d+?).*', '1234ab') makes group(1) match '1'.
• Boundaries:
  – re.search(r'(^abc|qef)', L) matches abc only at the beginning of the string, and qef anywhere.
  – re.search(r'(?m)(^abc|qef)', L) matches abc only at the beginning of the string or of any line.
  – re.search(r'rowr(?=baz)', L) matches an instance of ‘rowr’, but only if ‘baz’ follows (does not match baz).
  – re.search(r'(?<=rowr)baz', L) matches an instance of ‘baz’, but only if immediately preceded by ‘rowr’ (does not match rowr).
• Non-linear patterns: re.search(r'(\S+),\1', L) matches a word followed by the same word after a comma.

Review of Sample Programs
SL/1 “language”:
    + - * / = ; , ( ) < > >= <= -->
    if def else fi while
    identifiers
    decimal numerals
Comments start with # and go to end of line.
(Review of programs in Chapter 2 of Course Notes.)

Problems
• Decimal numerals in C, Java.
• All numerals in C, Java.
• Floating-point numerals.
• Identifiers in C, Java.
• Identifiers in Ada.
• Comments in C++, Java.
• XHTML markups.
• Python bracketing.

Some Problem Solutions
• Decimal numerals in C, Java: 0|[1-9][0-9]*
• All numerals in C, Java: [1-9][0-9]*|0[xX][0-9a-fA-F]+|0[0-7]*
• Floating-point numerals: (\d+\.\d*|\d*\.\d+)([eE][-+]?\d+)?|[0-9]+[eE][-+
• Identifiers in C, Java (ASCII only, no dollar signs): [a-zA-Z_][a-zA-Z_0-9]*
• Identifiers in Ada: [a-zA-Z]([a-zA-Z0-9]|_[a-zA-Z0-9])*
• Comments in C++, Java: //.*|/\*([^*]|\*[^/])*\*+/
  or, using some extended features: //.*|/\*(.|\n)*?\*/
• Python bracketing: Nothing much you can do here, except to note blanks at the beginnings of lines and to do some programming in the actions.
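A few of the problem solutions can be checked mechanically with Python's re module. The sketch below is only illustrative: the dictionary keys and the helper name matches are made up here, and the identifier patterns are written with explicit underscores, on the assumption that this is what the C/Java and Ada solutions intend.

```python
import re

# Patterns taken from the problem solutions; the keys are illustrative labels.
# Assumption: the identifier patterns contain underscores where shown.
PATTERNS = {
    "decimal": r"0|[1-9][0-9]*",
    "c_identifier": r"[a-zA-Z_][a-zA-Z_0-9]*",
    "ada_identifier": r"[a-zA-Z]([a-zA-Z0-9]|_[a-zA-Z0-9])*",
    "comment": r"//.*|/\*(.|\n)*?\*/",
}


def matches(kind, text):
    """Return True if the named pattern matches the whole of text."""
    return re.fullmatch(PATTERNS[kind], text) is not None


if __name__ == "__main__":
    print(matches("decimal", "42"))             # True
    print(matches("decimal", "007"))            # False: leading zeros
    print(matches("ada_identifier", "a__b"))    # False: doubled underscore
    print(matches("comment", "/* a\nb */"))     # True: (.|\n) spans lines
```

Note that re.fullmatch forces the pattern to cover the entire string, so it is the wrong tool for seeing where a lazy comment pattern stops; re.match(PATTERNS["comment"], text).group(0) shows the shortest comment starting at the beginning of text.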
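The strategy of turning a pattern description into a program can be sketched by hand for the if/else example in the Tokens section. This is a minimal sketch, not the course's tool: TOKEN_SPEC, MASTER, and tokenize are hypothetical names, and the token categories are the ones listed in that example. Python's alternation takes the first alternative that matches rather than the longest one, so the order of the entries matters (== before =, keywords before identifiers).

```python
import re

# Hypothetical token specification: (category, pattern) pairs, tried in order.
TOKEN_SPEC = [
    ("COMMENT", r"/\*(?:.|\n)*?\*/"),   # C-style comment, discarded
    ("SKIP", r"\s+"),                   # whitespace, discarded
    ("IF", r"if\b"),
    ("ELSE", r"else\b"),
    ("ID", r"[a-zA-Z_][a-zA-Z_0-9]*"),
    ("INTLIT", r"0|[1-9][0-9]*"),
    ("EQUALS", r"=="),
    ("ASSIGN", r"="),
    ("LPAR", r"\("),
    ("RPAR", r"\)"),
    ("SEMI", r";"),
]

# One big alternation of named groups; m.lastgroup tells us which one matched.
MASTER = re.compile("|".join(f"(?P<{name}>{pat})" for name, pat in TOKEN_SPEC))


def tokenize(source):
    """Return a list of (category, text-or-None) pairs for source."""
    tokens = []
    pos = 0
    while pos < len(source):
        m = MASTER.match(source, pos)
        if m is None:
            raise SyntaxError(f"unexpected character {source[pos]!r} at offset {pos}")
        kind = m.lastgroup
        if kind in ("ID", "INTLIT"):
            tokens.append((kind, m.group()))   # keep the semantic information
        elif kind not in ("COMMENT", "SKIP"):
            tokens.append((kind, None))        # syntactic category is enough
        pos = m.end()
    return tokens


if __name__ == "__main__":
    for token in tokenize("if(i== j)\n  z = 0; /* No work needed */\nelse\n  z= 1;"):
        print(token)
```

Only ID and INTLIT carry semantic information here, mirroring ID("i") and INTLIT("0") in the example; every other token is just its syntactic category.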