11-722: Grammar FormalismsParsing OverviewAlon LavieJanuary 23, 2006References:Hopcroft and Ullman, “Introduction to Automata Theory, Languages and Computation”.James Allen, “Natural Language Understanding”, 2nd edition.Jurafsky and Martin, “Speech and Language Processing”.Formal Languages• A mathematical abstraction of real languages: Natural Languages andComputer Languages• A language is no more than a set of items called words (the equivalent ofsentences in a natural language)• Languages can be defined declaratively, descriptively or computationally• Formal Language Theory: The study of properties of the various types andclasses of languages using formal mathematical proofs• Fundamental problem - word membership:Given a word w and a language L - is w ∈ L?• what algorithm or computational device is necessary to answer this questiondepends on the class of the language1 11-722 Grammar FormalismsBasic Definitions• Σ: the alphabet - a finite (non-empty) set of atomic symbols– each symbol σ in the set is a letter– letters are denoted by lower case Latin letters a, b, c,...• a word is a string of letters from a given alphabet Σ• |w| denotes the length of word w• ² denotes the empty word: |²| = 0• we only consider words of finite length• Def: a language L is a set (finite or infinite) of words constructed from a given alphabetΣ• Examples: L1= {²} L2= Ø L3= {anbn|n ≥ 0 }• Set Theory and operations apply to formal languages:– union, intersection, complementation, membership–L = {w ∈ Σ∗|w 6∈ L}• Important notation:Σ∗= set of all finite words over the alphabet ΣΣi= set of all words of length i over the alphabet Σ2 11-722 Grammar FormalismsLanguage Classes• Sets of formal languages that can be defined using a particular descriptivedefinition or abstraction of a computational framework• Examples:– The set of languages that can be described by Regular Expressions– The set of languages for which we can construct a Finite State Automaton– The set of languages that can be defined using a Context-free Grammar• Knowing the class to which a language belongs will allow us to developefficient algorithms for processing the language or deciding membership inthe language3 11-722 Grammar FormalismsDeterministic FSA• Formal Definition of a DFSA: A = (Q, Σ, δ, q0, F ) where:– Q is a finite set of states– Σ is a finite alphabet– q0∈ Q is an initial (start) state– F ⊆ Q is a set of final states– δ : Q × Σ → Q is the complete transition function• The language accepted by a DFSA A is defined to be:L(A) = {w ∈ Σ∗| after computing on w, A is in a state q ∈ F }• based on the function δ, we defineˆδ, the function on words that models the computationof a DFSA recursively as follows:–ˆδ : Q × Σ∗→ Q–ˆδ(q, ²) = q ∀q ∈ Q–ˆδ(q, xσ) = δ(ˆδ(q, x), σ)• Formal definition of L(A): L(A) = {w ∈ Σ∗|ˆδ(q0, w) ∈ F }• Def: Regular Language: a language L ⊆ Σ∗is called regular if there exists someDFSA A such that L = L(A)• Examples of regular languages: L = Σ∗L = {²} L = (aab)∗4 11-722 Grammar FormalismsContext-Free Grammars• A descriptive generative formalism for specifying the set of words in alanguage using production rules• Formal Definition: a context-free grammar G = (V, T, P, S)– V is a finite set of variables– T is a finite set of terminal symbols (similar to Σ for FSAs)– P is a set of context-free production rules, each of the formA → α, where α ∈ (V ∪ T )∗– S is a start non-terminal (S ∈ V )• Notations:– we denote elements of V by S, A, B, C...– we denote elements of T by a, b, c...– we denote strings over T∗by w, x, y...– we denote strings over (T ∪ V )∗by α, β, γ...– we denote single variables or terminals by X, Y, Z...5 11-722 Grammar FormalismsContext-Free Grammars• Example: L = {anbn| n ≥ 1}G: S --> a S bS --> a b• in this case the language L(G) could be specified in a succinct mathematicalform - often this is difficult or not possible6 11-722 Grammar FormalismsCFG Derivations• derivations describe the process of using the context-free rules to derive astring of terminal symbols• Definition: let ϕ1, ϕ2∈ (V ∪ T )∗.ϕ1directly derives ϕ2, denoted by: ϕ1=⇒Gϕ2,if ϕ1= αAβ, ϕ2= αγβ and A → γ is a rule in PG• ϕ1derives ϕ2, denoted by ϕ1∗=⇒Gϕ2,if there exists a finite sequence of direct derivations such thatϕ1=⇒Gϕ01=⇒Gϕ02=⇒Gϕ03=⇒G· · · =⇒Gϕ2• ϕ1i=⇒Gϕ2denotes that ϕ1derives ϕ2in exactly i derivation steps• a rightmost derivation is a derivation in which at each step, the rightmostnon-terminal in the string is picked for expansion• similarly for a leftmost derivation7 11-722 Grammar FormalismsContext Free Languages (CFLs)• Formal Definition: the language of a CFG G is defined as:L(G) = {w ∈ T∗| S∗=⇒Gw}• a language L is context-free if there exists a grammar G such that L = L(G)• the set of all such languages is called the set of context-free languages (CFLs)• two grammars G1and G2are called equivalent if L(G1) = L(G2)8 11-722 Grammar FormalismsParse Trees• a Parse Tree is a graphical representation of a derivation• the leaves (yield) of the tree correspond to a terminal string in L(G)• the tree does not represent the derivation order of the non-terminals• the tree does reflect the structure of the input string - what rules were used toderive the various substrings of the input• a parse tree constitutes a proof that a given input string is in L(G)• a grammar G is called ambiguous if there exists a word w ∈ L(G) that hastwo or more different parse trees• There exist CFLs that are inherently ambiguous9 11-722 Grammar FormalismsPushdown Automata• An extension of a FSA that is powerful enough to accept CFLs• The FSA is augmented with a memory storage device in the form of a stack• Formal Definition: a PDA M = (Q, Σ, Γ, δ, q0, Z0, F ) where:– Q, Σ, q0, F are similar to those of a FSA– Γ is a finite set of stack symbols– Z0is a start stack symbol– δ : Q × (Σ ∪ {²}) × Γ → 2Q×Γ∗δ(q, σ, Z) = {(q1, γ1), (q2, γ2), ..., (qm, γm)}• Note that a PDA is non-deterministic: it can make ²-moves on the input• It can also: replace the top element of the stack, “push” an element onto thestack, and “pop” an element from the stack10 11-722 Grammar FormalismsRecognition and
View Full Document