CMU 11722 Grammar Formalism - tgrep2_manual - D79823



TGrep2 User Manual
version 1.15
Douglas L. T. Rohde
May 10, 2005

1 Introduction

Statistical analysis of large, syntactically tagged corpora is playing an increasingly important role in language research. In particular, the Penn Treebank (Marcus, Santorini, & Marcinkiewicz, 1993; Marcus et al., 1994) version of the Wall Street Journal and Brown corpora of English text is a frequently studied resource. Until now, the software tool of choice for analyzing such corpora has been the TGrep program, developed by Richard Pito.

The primary function of TGrep is to extract parse trees whose structures match a specified pattern. It is, essentially, grep for trees. However, working with TGrep and talking to others who use it in their research resulted in a wish list of improvements. Eventually, I undertook to rewrite TGrep in its entirety. The result is TGrep2.[1] TGrep2 is almost completely backward compatible with TGrep, but introduces a number of new features, including the following major enhancements:

• Rather than simply having a set of required relationships and a set of prohibited relationships, nodes can have full boolean expressions of relationships to other nodes.
• Nodes can be given unique labels and may then be referred to by those labels in the pattern specification or in selecting trees for printing.
• Patterns are no longer restricted to simple tree architectures. The use of node labels and segmented patterns allows links in a pattern to form back-edges as well, permitting cycles of links.
• Customizable output formats allow a variety of information to be reported in a flexible manner.
• Multiple search patterns may be specified and one can retrieve the first subtree matching any pattern, the first subtree matching each pattern, or all subtrees matching any patterns, or all matches between subtrees and patterns.
• Subtrees can be reported using a code rather than by printing the whole structure. The trees themselves can later be retrieved using the codes.
• A variety of new links have been added and the immediately-precedes link now has a more conventional meaning.
• Macros can be defined and used to simplify pattern specification.
• TGrep2 corpus files are substantially smaller than TGrep corpora.

[1] The design and implementation of TGrep2 is Copyright (C) 2001-2005 Douglas Rohde. Comments, questions, or bug reports should be sent to [email protected]. TGrep2 is distributed under the conditions of the GNU General Public License, version 2.

2 Preparing Corpora

Before using TGrep2, a special corpus file must be created in a format that is a bit different from that of the corpus files used by TGrep. Corpus files are generated using the -p option to TGrep2. Following the -p are two arguments, one giving the name of a text input file containing the corpus trees and one giving the name of the binary output file to which the corpus will be written. TGrep2 corpus files normally end in the ".t2c" extension, but that is not required.

The input file contains sentences in parenthesized tree format.[2] Each tree must begin with an open parenthesis. Here is an example of an input file with three very short sentences:

    (TOP (NP (NP (NN Research)) (CC and) (NP (NN development))))
    (TOP (NP (NP (NN Budget)) (VP (VBD increased))))
    # This is a comment.
    (TOP (NP (JJ Stretch) (NN yarn) (NNS machines)))

There is currently a limit of 255 children per node and 65,535 nodes per sentence due to the format in which corpus files are stored. The limit on children per node can be raised to 65,535 by using the -K flag when building the corpus.

Sentences parsed in Combinatory Categorial Grammar (CCG) style will have a somewhat different format. When building corpora in this style, the -g flag should precede -p. See Section 6.1 for more details.

Comments can appear in the input file if they are on a line starting with a #. By default, comments in the input file are not stored in the corpus file. But if the -C option is specified prior to -p, the comment immediately preceding each sentence will be associated with the sentence and recorded in the corpus. The comments can later be printed when a match is made during a search.

If either the input or output files have the .gz, .bz, .bz2, or .Z extension, they will automatically be decompressed or compressed, respectively. If the input file is compressed, these extensions need not be specified. To write to standard output, "-" can be given in place of the output file. However, the input file cannot be read from a pipe or standard input. The input must come from a static file.

Here is one procedure for converting a TGrep corpus file to a TGrep2 corpus file:

    % export TGREP_CORPUS=wsj_mrg.crp
    % tgrep -n __ | grep . | gzip > wsj_mrg.txt.gz
    % tgrep2 -C -p wsj_mrg.txt wsj_mrg.t2c.gz

To combine all of the Brown corpus .mrg files in the Treebank 3 distribution into a TGrep2 corpus file, you can do the following. Notice that the label TOP will be inserted at the head of each main sentence tree:

    % cd parsed/mrg/brown
    % cat */*.mrg | grep -v '^\*' | sed 's/^( /(TOP /' > brown.txt
    % tgrep2 -p brown.txt brown.t2c

On most machines, it is not much slower (and is sometimes faster) to use a gzipped compressed corpus file rather than an uncompressed one because of the tradeoff between the decompression time and the reduced disk access. When uncompressed, TGrep2 corpus files are about 20% the size of TGrep corpus files. And when compressed, they are under 5% the size of an uncompressed TGrep corpus. Bunzip2 (.bz2) will achieve about twice as good compression as gunzip on corpus files, but the decompression may take over twice as long during querying.

[2] Although the ability to search for tree structures may be helpful in many other domains, TGrep2 is most often used for processing corpora of parsed sentences. Therefore, the terms tree and sentence are used interchangeably throughout this manual, since each sentence is associated with a single parse tree.

3 Command-line Arguments

TGrep2 is used as follows:

    % tgrep2 [options] <pattern>...

Following the options, which are listed below, should be one or more pattern specifications. A pattern specification is either a pattern itself or the name of a file containing patterns. If there is a readable file matching the argument, the patterns will be taken from that file. Otherwise, the argument will be treated as a pattern. If more than one pattern or pattern file is given, they are effectively concatenated.

There are two exceptions to the usual use
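
Because the input format in Section 2 is strict (each tree begins with an open parenthesis, and comments sit on lines starting with #), it can help to sanity-check a text file before passing it to tgrep2 -p. The following is only a sketch: check_trees is a hypothetical helper, not part of TGrep2, and it assumes that parentheses never occur inside words (the Penn Treebank writes them as -LRB- and -RRB-). It does not test the limits on children per node or nodes per sentence.

```shell
#!/bin/sh
# check_trees: hypothetical pre-flight check for a TGrep2 text input file.
# Reads the named files (or standard input) and prints "trees=N comments=M".
# Returns non-zero if parentheses are unbalanced or if text appears
# outside of any tree. Assumes words never contain "(" or ")".
check_trees() {
  awk '
    /^#/ { comments++; next }              # comment lines start with "#"
    {
      n = length($0)
      for (i = 1; i <= n; i++) {
        c = substr($0, i, 1)
        if (c == "(") {
          depth++
        } else if (c == ")") {
          if (depth == 0) { bad = 1; exit }  # close with no matching open
          depth--
          if (depth == 0) trees++            # a top-level tree just closed
        } else if (depth == 0 && c !~ /[ \t\r]/) {
          bad = 1; exit                      # text outside any tree
        }
      }
    }
    END {
      if (bad || depth != 0) {
        print "malformed input near line " NR
        exit 1
      }
      printf "trees=%d comments=%d\n", trees, comments
    }
  ' "$@"
}

# Demo on the three-sentence example from Section 2.
printf '%s\n' \
  '(TOP (NP (NP (NN Research)) (CC and) (NP (NN development))))' \
  '(TOP (NP (NP (NN Budget)) (VP (VBD increased))))' \
  '# This is a comment.' \
  '(TOP (NP (JJ Stretch) (NN yarn) (NNS machines)))' | check_trees
# prints: trees=3 comments=1
```

A file such as brown.txt could be checked in the same way with check_trees brown.txt before building the corpus. A character-level scan in awk was chosen so that the sketch runs in any POSIX shell without extra tools.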
