Unformatted text preview:

Abstract Syntax Trees COMS W4115 Prof Stephen A Edwards Spring 2007 Columbia University Department of Computer Science Parsing and Syntax Trees Parsing decides if the program is part of the language Not that useful we want more than a yes no answer Like most ANTLR parsers can include actions pieces of code that run when a rule is matched Top down parsers actions executed during parsing rules Bottom up parsers actions executed when rule is reduced Actions Simple languages can be interpreted with parser actions class CalcParser extends Parser expr returns int r int a r 0 r mexpr a mexpr r a EOF mexpr returns int r int a r 0 r atom a atom r a atom returns int r r 0 i INT r Integer parseInt i getText Actions In a top down parser actions are executed during the matching routines Actions can appear anywhere within a rule before during or after a match rule before A during B C D after Bottom up parsers restricted to running actions only after a rule has matched Implementing Actions Nice thing about top down parsing grammar is essentially imperative Action code simply interleaved with rule matching Easy to understand what happens when Implementing Actions expr returns int r int a r 0 r mexpr a mexpr r a EOF public final int expr int r int a r 0 r mexpr while LA 1 PLUS match PLUS a mexpr r a match Token EOF TYPE return r What ANTLR builds a mexpr r a Actions Usually actions build a data structure that represents the program Separates parsing from translation Makes modification easier by minimizing interactions Allows parts of the program to be analyzed in different orders Actions Bottom up parsers can only build bottom up data structures Children known first parents later Constructor for any object can require knowledge of children but not of parent Context of an object only established later Top down parsers can build both kinds of data structures What To Build Typically an Abstract Syntax Tree that represents the program Represents the syntax of the program almost exactly but easier for later passes to deal with Punctuation whitespace other irrelevant details omitted Abstract vs Concrete Trees Like scanning and parsing objective is to discard irrelevant details E g comma separated lists are nice syntactically but later stages probably just want lists AST structure almost a direct translation of the grammar Abstract vs Concrete Trees expr mexpr mexpr mexpr atom atom atom INT 3 5 4 expr mexpr mexpr atom atom atom INT 3 INT 5 INT 3 INT 5 INT 4 INT 4 Concrete Parse Tree Abstract Syntax Tree Implementing ASTs Most general implementation ASTs are n ary trees Each node holds a token and pointers to its first child and next sibling Parent Last Sibling Node First Child Next Sibling Example of AST structure if a b a b b a if a b a b b a Typical AST Operations Create a new node Append a subtree as a child a b a a b b a b b b a a Comment on Generic ASTs Is this general purpose structure too general Not very object oriented whole program represented with one type Alternative Heterogeneous ASTs one class per object class BinOp int operator Expr left right class IfThen Expr predicate Stmt thenPart elsePart Heterogeneous ASTs Advantage avoid switch statements when walking tree Disadvantage each analysis requires another method class BinOp int operator Expr left right void typeCheck void constantProp void buildThreeAddr Analyses spread out across class files Classes become littered with analysis code additional annotations Comment on Generic ASTs ANTLR offers a compromise It can automatically generate tree walking code It generates the big switch statement Each analysis can have its own file Still have to modify each analysis if the AST changes Choose the AST structure carefully Building ASTs The Obvious Way to Build ASTs class ASTNode ASTNode Token t void appendChild ASTNode c void appendSibling ASTNode C stmt returns ASTNode n if p expr then t stmt else e stmt n new ASTNode new Token IF n appendChild p n appendChild t n appendChild e The Obvious Way Putting code in actions that builds ASTs is traditional and works just fine But it s tedious Fortunately ANTLR can automate this process Building an AST Automatically with ANTLR class TigerParser extends Parser options buildAST true By default each matched token becomes an AST node Each matched token or rule is made a sibling of the AST for the rule After a token makes the node a root of a subtree After a token prevents an AST node from being built Automatic AST Construction Running class CalcParser extends Parser options buildAST true expr mexpr mexpr EOF mexpr atom atom atom INT on 2 3 4 5 6 gives 2 3 4 5 6 EOF AST Construction with Annotations Running class CalcParser extends Parser options buildAST true expr mexpr mexpr EOF mexpr atom atom atom INT on 6 2 3 4 5 6 gives 2 3 4 5 Choosing AST Structure Designing an AST Structure Sequences of things Removing unnecessary punctuation Additional grouping How many token types Sequences of Things Comma separated lists are common int gcd int a int b int c args arg arg A concrete parse tree Drawbacks args arg int arg a int Many unnecessary nodes b arg Branching suggests recursion int c Harder for later routines to get the data they want Sequences of Things Better to choose a simpler structure for the tree Punctuation irrelevant build a simple list int gcd int a int b int c args arg arg args ARGS args ARGS arg int arg a int b arg int c What s going on here args arg arg args ARGS args Rule generates a sequence of arg nodes Node generation supressed for punctuation parens commas Action uses ANTLR s terse syntax for building trees args ARGS args set the args tree to a new tree whose root is a node of type ARGS and whose child is the old args tree What s going on here int a int b int c args arg arg args ARGS args args arg int args arg a arg b int c int ARGS arg int arg a int arg b int c Removing Unnecessary Punctuation Punctuation makes the syntax readable unambiguous Information represented by structure of the AST Things typically omitted from an AST Parentheses Grouping and precedence associativity overrides Separators commas semicolons Mark divisions between phrases Extra keywords while do if then else one is enough How Many Types of Tokens Since each token is a type plus some text there is some choice Generally want each different construct to have a different token type Different types make sense when each needs different analysis Arithmetic operators usually not that different For the assignment you need to


View Full Document

Columbia COMS W4115 - Abstract Syntax Trees

Documents in this Course
YOLT

YOLT

13 pages

Lattakia

Lattakia

15 pages

EasyQL

EasyQL

14 pages

Photogram

Photogram

163 pages

Espresso

Espresso

27 pages

NumLang

NumLang

6 pages

EMPATH

EMPATH

14 pages

La Mesa

La Mesa

9 pages

JTemplate

JTemplate

238 pages

MATVEC

MATVEC

4 pages

TONEDEF

TONEDEF

14 pages

SASSi

SASSi

16 pages

JTemplate

JTemplate

39 pages

BATS

BATS

10 pages

Synapse

Synapse

11 pages

c.def

c.def

116 pages

TweaXML

TweaXML

108 pages

Load more
Loading Unlocking...
Login

Join to view Abstract Syntax Trees and access 3M+ class-specific study document.

or
We will never post anything without your permission.
Don't have an account?
Sign Up

Join to view Abstract Syntax Trees and access 3M+ class-specific study document.

or

By creating an account you agree to our Privacy Policy and Terms Of Use

Already a member?