DOC PREVIEW
UD CISC 672 - Lexical Analysis - An Introduction

This preview shows page 1-2-3-4-5 out of 16 pages.

Save
View full document
View full document
Premium Document
Do you want full access? Go Premium and unlock all 16 pages.
Access to all documents
Download any document
Ad free experience
View full document
Premium Document
Do you want full access? Go Premium and unlock all 16 pages.
Access to all documents
Download any document
Ad free experience
View full document
Premium Document
Do you want full access? Go Premium and unlock all 16 pages.
Access to all documents
Download any document
Ad free experience
View full document
Premium Document
Do you want full access? Go Premium and unlock all 16 pages.
Access to all documents
Download any document
Ad free experience
View full document
Premium Document
Do you want full access? Go Premium and unlock all 16 pages.
Access to all documents
Download any document
Ad free experience
Premium Document
Do you want full access? Go Premium and unlock all 16 pages.
Access to all documents
Download any document
Ad free experience

Unformatted text preview:

Slide 1Slide 2Slide 3Slide 4Slide 5Slide 6Slide 7Slide 8Slide 9Slide 10Slide 11Slide 12Slide 13Slide 14Slide 15Slide 16Lexical Analysis - An IntroductionThe Front EndThe purpose of the front end is to deal with the input language• Perform a membership test: code  source language?• Is the program well-formed (semantically) ?• Build an IR version of the code for the rest of the compilerThe front end is not monolithicSourcecodeFrontEndErrors MachinecodeBackEndIRThe Front EndScanner •Maps stream of characters into wordsBasic unit of syntaxx = x + y ; becomes <id,x> <eq,=> <id,x> <pl,+> <id,y> <sc,; >• Characters that form a word are its lexeme• Its part of speech (or syntactic category) is called its token type• Scanner discards white space & (often) commentsSourcecodeScannerIRParserErrors tokensSpeed is an issue in scanning use a specialized recognizerThe Front EndParser•Checks stream of classified words (parts of speech) for grammatical correctness• Determines if code is syntactically well-formed•Guides checking at deeper levels than syntax• Builds an IR representation of the codeWe’ll come back to parsing in a couple of lecturesSourcecodeScannerIRParserErrors tokensThe Big PictureWhy study lexical analysis?• We want to avoid writing scanners by hand• We want to harness the theory from classes like CISC 303Goals:To simplify specification & implementation of scannersTo understand the underlying techniques and technologiesScannerScannerGeneratorspecificationssource code parts of speech & wordstables or codeSpecifications written as “regular expressions”Represent words as indices into a global tableRecognizing Words• Finite Automaton (FA) – recognizers that can scan a stream of symbols to find words•An FA is a five-tuple (S,,∂,s0 ,SF ) where • S is the set of states •  is the alphabet• ∂ a set of transition functions where each takes a state and a character and returns another state• s0 is the start state• SF is the set of final states S0 S1 r(0|1|2| … 9)accepting state(0|1|2| … 9)Recognizer for RegisterS2Regular ExpressionsLexical patterns form a regular language *** any finite language is regular ***Regular expressions (REs) describe regular languagesRegular Expression (over alphabet )•  is a RE denoting the set {}•If a is in  , then a is a RE denoting {a}•If x and y are REs denoting L(x) and L(y) thenx |y is an RE denoting L(x)  L(y)xy is an RE denoting L(x)L(y)x* is an RE denoting L(x)*Precedence is closure, then concatenation, then alternationEver type “rm *.o a.out” ?These definitions should be well knownSet Operations (review)O p e r a t i o n D e fi n i t i o n U n i o n o f L a n d M W r i t t e n L  M L  M = { s | s  L o r s  M } C o n c a t e n a t i o n o f L a n d M W r i t t e n L M L M = { s t | s  L a n d t  M } K l e e n e c l o s u r e o f L W r i t t e n L* L* = 0i  Li P o s i t i v e C l o s u r e o f L W r i t t e n L+ L+ = 1i  LiExamples of Regular ExpressionsIdentifiers:Letter  (a|b|c| … |z|A|B|C| … |Z)Digit  (0|1|2| … |9)Identifier  Letter ( Letter | Digit )*Numbers:Integer  (+|-| ) (0| (1|2|3| … |9)(Digit *) )Decimal  Integer . Digit *Real  ( Integer | Decimal ) E (+|-|) Digit *Complex  ( Real , Real )Numbers can get much more complicated!Regular Expressions (the point)Regular expressions can be used to specify the words to be translated to parts of speech by a lexical analyzerUsing results from automata theory and theory of algorithms, we can automatically build recognizers from regular expressionsSome of you may have seen this construction in CISC 303 for string pattern matching We study REs and associated theory to automate scanner construction !Consider the problem of recognizing ILOC register namesRegister  r (0|1|2| … | 9) (0|1|2| … | 9)*• Allows registers of arbitrary number• Requires at least one digitRE corresponds to a recognizer (or DFA)Transitions on other inputs go to an error state, seExample S0 S1 r(0|1|2| … 9)accepting state(0|1|2| … 9)Recognizer for RegisterS2DFA operation•Start in state S0 & take transitions on each input character•DFA accepts a word x iff x leaves it in a final state (S2 )So,•r17 takes it through s0, s1, s2 and accepts•r takes it through s0, s1 and fails•a takes it straight to seExample (continued)S0 S1 r(0|1|2| … 9)accepting state(0|1|2| … 9)Recognizer for RegisterS2Example (continued)To be useful, recognizer must turn into codeseseseseses2ses2ses2ses1seses1s0All others0,1,2,3,4,5,6,7,8,9rChar  next characterState  s0while (Char  EOF) State  (State,Char) Char  next characterif (State is a final state ) then report success else report failureSkeleton recognizerTable encoding REr Digit Digit* allows arbitrary numbers• Accepts r00000 • Accepts r99999• What if we want to limit it to r0 through r31 ?Write a tighter regular expressionRegister  r ( (0|1|2) (Digit | ) | (4|5|6|7|8|9) | (3|30|31) )Register  r0|r1|r2| … |r31|r00|r01|r02| … |r09Produces a more complex DFA• Has more states• Same cost per transition• Same basic implementationWhat if we need a tighter specification?Tighter register specification (continued)The DFA forRegister  r ( (0|1|2) (Digit | ) | (4|5|6|7|8|9) | (3|30|31) )•Accepts a more constrained set of registers•Same set of actions, more states S0 S5 S1 rS4 S3 S6 S2 0,1,230,14,5,6,7,8,9(0|1|2| … 9)Tighter register specification (continued)seseseseses1s0seseseseseseseseseseseseses6seseseses6ses5seseseseseses4seseseseseses3ses3s3s3s3ses2ses4s5s2s2ses1Allothers4-9320,1rTable encoding RE for the tighter register specification Runs in the same skeleton


View Full Document

UD CISC 672 - Lexical Analysis - An Introduction

Documents in this Course
Syllabus

Syllabus

18 pages

Load more
Download Lexical Analysis - An Introduction
Our administrator received your request to download this document. We will send you the file to your email shortly.
Loading Unlocking...
Login

Join to view Lexical Analysis - An Introduction and access 3M+ class-specific study document.

or
We will never post anything without your permission.
Don't have an account?
Sign Up

Join to view Lexical Analysis - An Introduction 2 2 and access 3M+ class-specific study document.

or

By creating an account you agree to our Privacy Policy and Terms Of Use

Already a member?