CMSC 330: Organization of Programming Languages Context Free Grammars and Parsing CMSC 330 - Spring 2011 1CMSC 330 - Spring 2011 Where We Are ! Programming language generalities • Types of languages • Compilers and interpreters ! Ruby • An object-oriented scripting language • Text-processing via regular expressions ! Theory of regular languages • Regular expressions • Finite automata (deterministic, nondeterministic) 2CMSC 330 - Spring 2011 Up Next: Syntax of Prog. Langs. ! Structure of syntactic analyzers • Scanning / lexing • Parsing ! Defining programming-language syntax … • Via regular expressions … • … and context-free grammars (CFGs) Definition Derivations (leftmost) Parse tree Ambiguity Associativity & precedence 3CMSC 330 - Spring 2011 Programming Languages ! Syntax • What a program looks like • Needs to be very precise ! Semantics • What a program means • How different parts of a program behave 4Recall: Architecture of Compilers, Interpreters CMSC 330 - Spring 2011 5 Front End Intermediate Representation Back End Parser Static Analyzer Source Compiler / InterpreterFront End ! Responsible for turning source code (sequence of symbols) into representation of program structure ! Parser generates parse trees, which static analyzer processes CMSC 330 - Spring 2011 6 Front End Source Parser Static Analyzer Intermediate Representation Parse TreeParser Architecture ! Scanner /lexer converts sequences of symbols into tokens (keywords, variable names, operators, numbers, etc.) ! Parser (yes, a parser contains a parser!) converts sequences of tokens into parse tree CMSC 330 - Spring 2011 7 Parser Source Scanner Parser Parse Tree Token StreamSpecifications of Syntax ! Tokens specified using regular expressions ! Valid sequences of tokens specified using context-free grammars ! Automated tools (e.g. lex, yacc) convert regular expressions, grammars into scanners, parsers CMSC 330 - Spring 2011 8CMSC 330 - Spring 2011 Motivation for Grammars ! Programs are just strings of text • But they are strings that have a certain structure ! Informal description of structure of a C program • A C program is a list of declarations and definitions • A function definition contains parameters and a body • A function body is a sequence of statements • A statement is either an expression, a while loop, etc… • An expression may be assignment, addition, etc… 9CMSC 330 - Spring 2011 Motivation for Grammars (cont.) ! We want to describe program structure precisely: • Same programs can be processed by different compilers • Different backends can be used with same parser ! Regular expressions lack sufficient expressive power • No regular expression for balanced pairs of ( )’s { “( )”, “(( ))”, “((( )))”, “(((( ))))”, … } is not a regular language ! Context-free grammars (CFGs) offer a solution! 10CMSC 330 - Spring 2011 Context Free Grammar (CFG) ! A way of generating sets of strings (= languages) ! Grammar: S → 0S | 1S | ε • Means every S may be replaced by 0S, 1S, or ε • Example S ⇒ 0S // using S → 0S ⇒ 01S // using S → 1S ⇒ 011S // using S → 1S ⇒ 011 // using S → ε ! Grammar is same as regular expression (0|1)* • Generates / accepts the same set of strings 11CMSC 330 - Spring 2011 Context-Free Grammars (CFGs) ! But CFGs can do a lot more! • S → ( S ) | ε // generates balanced pairs of ( )’s ! In fact, CFGs subsume REs, DFAs, NFAs • There is a CFG that generates any regular language • But REs are a better notation for regular languages ! CFGs can specify programming language syntax • CFGs (mostly) describe the parsing process 12CMSC 330 - Spring 2011 Formal Definition: Context-Free Grammar ! A CFG G is a 4-tuple (Σ, N, P, S) • Σ – alphabet (finite set of symbols, or terminals) Often written in lowercase • N – a finite, nonempty set of nonterminal symbols Often written in uppercase It must be that N ∩ Σ = ∅ • P – a set of productions of the form N → (Σ|N)* Informally: the nonterminal can be replaced by the string of zero or more terminals / nonterminals to the right of the → Can think of productions as rewriting rules • S N – the start symbol 13CMSC 330 - Spring 2011 Backus-Naur Form ! Context-free grammar production rules are also called Backus-Naur Form or BNF • A production like A → B c D is written in BNF as <A> ::= <B> c <D> (Non-terminals written with angle brackets and ::= instead of →) • Often used to describe language syntax ! BNF was designed by • John Backus Chair of the Algol committee in the early 1960s • Peter Naur Secretary of the committee, who used this notation to describe Algol in 1962 14CMSC 330 - Spring 2011 Informal Definition of Acceptance ! A string is accepted by a CFG if there is • Some sequence of applying productions (rewrites) starting at the start symbol that generates the string ! Example • Grammar: S → 0S | 1S | ε Shorthand for S → 0S, S → 1S, S → ε • Sequence generating the string 010 S ⇒ 0S ⇒ 01S ⇒ 010S ⇒ 010 ! Terminology • Such a sequence of rewrites is a derivation or parse • Discovering the derivation is called parsing 15CMSC 330 - Spring 2011 Derivations ! Notation ⇒ indicates a derivation of one step ⇒+ indicates a derivation of one or more steps ⇒* indicates a derivation of zero or more steps ! Example • S → 0S | 1S | ε ! For the string 010 • S ⇒ 0S ⇒ 01S ⇒ 010S ⇒ 010 • S ⇒+ 010 • S ⇒* S 16CMSC 330 - Spring 2011 Practice ! Try to make a grammar which accepts • 0*|1* – 0n1n where n ≥ 0 – 0n1m where m ≤ n ! Give some example strings from this language • S → 0 | 1S 0, 10, 110, 1110, 11110, … • What language is it? 1*0 S → A | B A → 0A | ε B → 1B | ε S → 0S1 | ε S → 0S1 | 0S | ε 17CMSC 330 - Spring 2011 Example: Arithmetic Expressions ! E → a | b | c | E+E | E-E | E*E | (E) • An expression E is either a letter a, b, or c • Or an E followed by +
View Full Document