This preview shows page 1-2-3-4-5-6-7-51-52-53-54-55-56-57-58-103-104-105-106-107-108-109 out of 109 pages.
CMSC 330: Organization of Programming Languages Theory of Regular Expressions CMSC 330 - Spring 2011 1CMSC 330 - Spring 2011 2 Introduction ! That’s it for the basics of Ruby • If you need other material for your project, come to office hours or check out the documentation ! Next up: How do regular expressions (REs) really work? • Mixture of a very practical tool (string matching with REs) and some nice theory • A great computer science resultCMSC 330 - Spring 2011 3 A Few Questions About REs ! What does a regular expression represent? • Just a set of strings ! What are the basic components of REs? • E.g., we saw that e+ is the same as ee* ! How are REs implemented? • We’ll see how to build a structure to parse REsCMSC 330 - Spring 2011 4 Definition: Alphabet ! An alphabet is a finite set of symbols • Usually denoted Σ ! Example alphabets: • Binary: • Decimal: • Alphanumeric: Σ = {0,1} Σ = {0,1,2,3,4,5,6,7,8,9} Σ = {0-9,a-z,A-Z}CMSC 330 - Spring 2011 5 Definition: String ! A string is a finite sequence of symbols from Σ • ε is the empty string ("" in Ruby) • |s| is the length of string s |Hello| = 5, |ε| = 0 • Note Ø is the empty set (with 0 elements) Ø ≠ { ε } ≠ ε ! Example strings: • 0101 ∈ Σ = {0,1} (binary) • 0101 ∈ Σ = decimal • 0101 ∈ Σ = alphanumericCMSC 330 - Spring 2011 6 Definition: Concatenation ! Concatenation is indicated by juxtaposition • If s1 = super and s2 = hero, then s1s2 = superhero • Sometimes also written s1·s2 • For any string s, we have sε = εs = s • You can concatenate strings from different alphabets; then the new alphabet is the union of the originals: If s1 = super ∈ Σ1 = {s,u,p,e,r} and s2 = hero ∈ Σ2 = {h,e,r,o}, then s1s2 = superhero ∈ Σ3 = {e,h,o,p,r,s,u}CMSC 330 - Spring 2011 7 Definition: Language ! A language is a set of strings over an alphabet ! Example: The set of phone numbers over the alphabet Σ = {0, 1, 2, 3, 4, 5, 6, 7, 9, (, ), -} • Give an example element of this language • Are all strings over the alphabet in the language? • Is there a Ruby regular expression for this language? ! Example: The set of all strings over Σ • Often written Σ* /\(\d{3,3}\) \d{3,3}-\d{4,4}/ No (123) 456-7890CMSC 330 - Spring 2011 8 Definition: Language (cont.) ! Example: The set of strings of length 0 over the alphabet Σ = {a, b, c} • {s | s ∊ Σ* and |s| = 0} = {ε} ≠ Ø ! Example: The set of all valid Ruby programs • Is there a Ruby regular expression for this language? ! Can REs represent all possible languages? • The answer turns out to be no! • The languages represented by regular expressions are called, appropriately, the regular languages No. Matching (an arbitrary number of) brackets so that they are balanced is impossible. { { { … } } }CMSC 330 - Spring 2011 9 Operations on Languages ! Let Σ be an alphabet and let L, L1, L2 be languages over Σ ! Concatenation L1L2 is defined as • L1L2 = {xy | x ∊ L1 and y ∊ L2} • Example: L1 = {“hi”, “bye”}, L2 = {“1”, “2”} L1L2 = {“hi1”, “hi2”, “bye1”, “bye2”} ! Union is defined as • L1 ∪ L2 = { x | x ∊ L1 or x ∊ L2} • Example: L1 = {“hi”, “bye”}, L2 = {“1”, “2”} L1∪L2 = {“hi”, “bye”, “1”, “2”}CMSC 330 - Spring 2011 10 Operations on Languages (cont.) ! Define Ln inductively as • L0 = {ε} • Ln = LLn-1 for n > 0 ! In other words, • L1 = LL0 = L{ε} = L • L2 = LL1 = LL • L3 = LL2 = LLL • ...CMSC 330 - Spring 2011 11 Examples of Ln ! Let L = {a, b, c} ! Then • L0 = {ε} • L1 = {a, b, c} • L2 = {aa, ab, ac, ba, bb, bc, ca, cb, cc} ! How many elements will L3 have? Ln? How long with each element be in these languages?CMSC 330 - Spring 2011 12 Operations on Languages (cont.) ! Kleene closure is defined as L* = ⋃i ∊N Li (N = {0,1,2,…} is the set of natural numbers) ! In other words... L* is the language (set of all strings) formed by concatenating together zero or more instances of strings from LCMSC 330 - Spring 2011 13 Definition: Regular Expressions ! Given an alphabet Σ, the regular expressions over Σ are defined inductively as regular expression denotes language Ø Ø ε {ε} each element σ ∊ Σ {σ} ConstantsCMSC 330 - Spring 2011 14 Definition: Regular Expressions (cont.) ! Let A and B be regular expressions denoting languages LA and LB, respectively ! There are no other regular expressions over Σ regular expression denotes language AB LALB (A|B) LA ∪ LB A* LA* OperationsCMSC 330 - Spring 2011 15 Precedence ! Order in which operators are applied • In arithmetic Multiplication × > addition + 2 × 3 + 4 = (2 × 3) + 4 = 10 • In regular expressions Kleene closure * > concatenation > union | ab|c = ( a b ) | c = {“ab”, “c”} ab* = a ( b* ) = {“a”, “ab”, “abb”…} a|b* = a | ( b* ) = {“a”, “”, “b”, “bb”, “bbb”…} • Can change order using parentheses ( ) E.g., a(b|c), (ab)*, (a|b)*CMSC 330 - Spring 2011 16 The Language Denoted by an RE ! For a regular expression e, we will write [[e]] to mean the language denoted by e • [[a]] = {a} • [[(a|b)]] = {a, b} ! If s∊[[ RE ]], we say that RE accepts, describes, or recognizes sCMSC 330 - Spring 2011 17 Example 1 ! All strings over Σ = {a, b, c} such that all the a’s are first, the b’s are next, and the c’s last • Example: aaabbbbccc but not abcb ! Regexp: • This is a valid regexp because: a is a regexp ([[a]] = {a}) a* is a regexp ([[a*]] = {ε, a, aa, ...}) Similarly for b* and c* So a*b*c* is a regular expression (Remember that we need to check this way because regular expressions are defined inductively.) a*b*c*CMSC 330 - Spring 2011 18 Which Strings Does a*b*c* Recognize? aabbbcc Yes; aa ∊ [[a*]], bbb ∊ [[b*]], and cc ∊ [[c*]], so entire string is in [[a*b*c*]] abb Yes, abb = abbε, and ε ∊ [[c*]] ac Yes ε Yes aacbc No abcd No -- outside the languageCMSC 330 - Spring 2011 19 Example 2 ! All strings over Σ = {a, b, c} ! Regexp: ! Other
View Full Document