Lexical Analysis — Part II: Constructing a Scanner from Regular ExpressionsQuick Review Previous class: → The scanner is the first stage in the front end → Specifications can be expressed using regular expressions → Build tables and code from a DFA Scanner Scanner Generator specifications source code parts of speech & words code and tablesGoal • We will show how to construct a finite state automaton to recognize any RE • This Lecture → Convert RE to an nondeterministic finite automaton (NFA) Requires ε-transitions to combine regular subexpressions → Convert an NFA to a deterministic finite automaton (DFA) Use Subset construction Next Lecture → Minimize the number of states Hopcroft state minimization algorithm → Generate the scanner code Additional code can be insertedMore Regular Expressions • All strings of 1s and 0s ending in a 1 ( 0 | 1 )* 1 • All strings over lowercase letters where the vowels (a,e,i,o,u) occur exactly once, in ascending order Cons → (b|c|d|f|g|h|j|k|l|m|n|p|q|r|s|t|v|w|x|y|z) Cons* a Cons* e Cons* i Cons* o Cons* u Cons* • All strings of 1s and 0s that do not contain three 0s in a row:More Regular Expressions • All strings of 1s and 0s ending in a 1 ( 0 | 1 )* 1 • All strings over lowercase letters where the vowels (a,e,i,o,u) occur exactly once, in ascending order Cons → (b|c|d|f|g|h|j|k|l|m|n|p|q|r|s|t|v|w|x|y|z) Cons* a Cons* e Cons* i Cons* o Cons* u Cons* • All strings of 1s and 0s that do not contain three 0s in a row: ( 1* ( ε |01 | 001 ) 1* )* ( ε | 0 | 00 )Non-deterministic Finite Automata Each RE corresponds to a deterministic finite automaton (DFA) • May be hard to directly construct the right DFA What about an RE such as ( a | b )* abb? This is a little different • S1 has two transitions on a This is a non-deterministic finite automaton (NFA) a , b S0 S3 S1 S2 a b bNon-deterministic Finite Automata Each RE corresponds to a deterministic finite automaton (DFA) • May be hard to directly construct the right DFA What about an RE such as ( a | b )* abb? This is a little different • S1 has two transitions on a • S0 has a transition on ε This is a non-deterministic finite automaton (NFA) a , b S0 S1 S4 S2 S3 ε#a b bNondeterministic Finite Automata • An NFA accepts a string x iff ∃ a path though the transition graph from s0 to a final state such that the edge labels spell x • Transitions on ε consume no input • To “run” the NFA, start in s0 and guess the right transition at each choice point with multiple possibilities → Always guess correctly → If some sequence of correct guesses accepts x then accept Why study NFAs? • They are the key to automating the RE→DFA construction • We can paste together NFAs with ε-transitions NFA NFA becomes an NFA εRelationship between NFAs and DFAs DFA is a special case of an NFA • DFA has no ε transitions • DFA’s transition function is single-valued • Same rules will work DFA can be simulated with an NFA → Obviously NFA can be simulated with a DFA (less obvious) • Simulate sets of possible states • Possible exponential blowup in the state space • Still, one state per character in the input streamAutomating Scanner Construction To convert a specification into code: 1 Write down the RE for the input language 2 Build a big NFA 3 Build the DFA that simulates the NFA 4 Systematically shrink the DFA 5 Turn it into code Scanner generators • Lex, Flex, and JLex work along these lines • Algorithms are well-known and well-understood • Key issue is interface to parser (define all parts of speech)Automating Scanner Construction RE→ NFA (Thompson’s construction) • Build an NFA for each term • Combine them with ε-transitions NFA → DFA (subset construction) • Build the simulation DFA → Minimal DFA • Hopcroft’s algorithm DFA →RE (Not part of the scanner construction) • All pairs, all paths problem • Take the union of all paths from s0 to an accepting state minimal DFA RE NFA DFA The Cycle of ConstructionsRE →NFA using Thompson’s Construction Key idea • NFA pattern for each symbol & each operator • Join them with ε transitions in precedence order S0 S1 aNFA for a S0 S1 aS3 S4 bNFA for ab εNFA for a | b S0 S1 S2 aS3 S4 bS5 εε εεS0 S1 ε#S3 S4 ε#NFA for a* aε#ε#Ken Thompson, CACM, 1968 S0 S1 bNFA for b Concatenation Alternation ClosureExample of Thompson’s Construction Let’s try a ( b | c )* 1. a, b, & c 2. b | c 3. ( b | c )* S0 S1 aS0 S1 bS0 S1 cS2 S3 bS4 S5 cS1 S6 S0 S7 ε#ε#ε# ε#ε# ε#ε# ε#S1 S2 bS3 S4 cS0 S5 ε#ε#ε#ε#Example of Thompson’s Construction (cont'd) 4. a ( b | c )* Of course, a human would design something simpler ... S0 S1 ab | c But, we can automate production of the more complex one ... S0 S1 aε#S4 S5 bS6 S7 cS3 S8 S2 S9 ε#ε#ε# ε#ε# ε#ε# ε#NFA →DFA with Subset Construction Need to build a simulation of the NFA Two key functions • Delta(qi , a) is set of states reachable from each state in qi by a → Returns a set of states, for each n ∈ qi of δi (n, a) • ε-closure(si) is set of states reachable from si by ε transitions The algorithm: • Start state derived from n0 of the NFA • Take its ε-closure q0 = ε-closure(n0) • Compute Delta(q, α) for each α ∈ Σ, and take its ε-closure • Iterate until no more states are added Sounds more complex than it is…NFA →DFA with Subset Construction The algorithm: q0 ← ε-closure(n0 ) Q ← {q0} WorkList ← {q0} while ( WorkList ≠ ф ) remove q from WorkList for each α ∈ Σ# t← ε-closure(Delta(q,α)) T[q,α] ← t if ( t ∉ Q ) then add t to Q and WorkList Let’s think about why this works The algorithm halts: 1. Q contains no duplicates (test before adding) 2. 2Q is finite 3. while loop adds to Q, but does not remove from Q (monotone) ⇒ the loop halts Q contains all the reachable NFA states It tries each character in each q. It builds every possible NFA configuration. ⇒ Q and T form the DFANFA →DFA with Subset Construction Example of a fixed-point computation • Monotone construction of some finite set • Halts when it stops
View Full Document