CSC 9010: Text Mining Applications
Fall, 2003
Introduction to GATE
Dr. Paula Matuszek
©2003 Paula Matuszek. Taken primarily from a presentation by Lin Lin. http://webster.cs.uga.edu/~lin/GlobalInfoSys/GATE.ppt

Slides:
  What is GATE?
  Who Uses GATE?
  How Can GATE Help?
  What are GATE Components?
  GATE as an architecture
  LRs: Corpora, Documents, and Annotations
  Document Processing in GATE
  Built-in GATE Components
  Develop Language Processing Functionality using GATE
  CREOLE
  PRs: ANNIE
  ANNIE IE Modules
  ANNIE Components
  ANNIE Component: Tokenizer
  Tokenizer Rule
  Example Tokenizer Rule
  ANNIE Component: Gazetteer
  Example Gazetteer List
  ANNIE Component: Semantic Tagger
  ANNIE Component: Sentence Splitter
  ANNIE Component: OrthoMatcher
  Create a New Resource
  Example: Create a New Component Called GoldFish
  Example: Create GoldFish Using BootStrap Wizard
  GoldFish: default files created
  Create an Application with PRs
  Additional Facilities
  Embedding ANNIE

What is GATE?
  GATE stands for General Architecture for Text Engineering.
  The theory behind GATE is SALE (Software Architecture for Language Engineering):
    – computer processing of human language
    – computer infrastructure for software development

Who Uses GATE?
  Scientists performing experiments that involve processing human language
  Developers building applications with language processing components
  Teachers and students of courses about language and language computation

How Can GATE Help?
  It specifies an architecture, or organizational structure, for language processing software.
  It provides a framework, or class library, that implements the architecture and can be used to embed language processing capabilities in diverse applications.
  It provides a development environment, built on top of the framework, made up of convenient graphical tools for developing components.

What are GATE Components?
  Reusable software chunks with well-defined interfaces
  The same component idea is used in JavaBeans and Microsoft's .NET

GATE as an architecture
  The architecture breaks down into three types of components:
    – LanguageResources (LRs) represent entities such as lexicons, documents, corpora, annotation schemas, or ontologies;
    – ProcessingResources (PRs) represent entities that are primarily algorithmic, such as parsers, generators, or n-gram modelers;
    – VisualResources (VRs) represent visualization and editing components that participate in GUIs.

LRs: Corpora, Documents, and Annotations
  A Corpus in GATE is a Java Set whose members are Documents.
  Documents are modeled as content plus annotations plus features.
  Annotations are organized in graphs, which are modeled as Java Sets of Annotations.
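
The corpus/document/annotation model described above can be pictured with a few plain Java classes. This is only a minimal sketch: the class names GateModelSketch, Document, and Annotation, and the fields type, start, end, and features, are assumptions made for illustration and are not the real GATE classes or their API.

```java
import java.util.HashMap;
import java.util.LinkedHashSet;
import java.util.Map;
import java.util.Set;

/** Illustrative model only; these are hypothetical classes, NOT the real GATE API. */
public class GateModelSketch {

    /** An annotation: a typed span over the document content, with its own features. */
    static final class Annotation {
        final String type;
        final int start; // offset of the first character covered
        final int end;   // offset just past the last character covered
        final Map<String, Object> features = new HashMap<>();

        Annotation(String type, int start, int end) {
            this.type = type;
            this.start = start;
            this.end = end;
        }
    }

    /** A document: content plus annotations plus features. */
    static final class Document {
        final String content;
        final Set<Annotation> annotations = new LinkedHashSet<>();
        final Map<String, Object> features = new HashMap<>();

        Document(String content) {
            this.content = content;
        }

        /** Returns the stretch of content that an annotation covers. */
        String textOf(Annotation a) {
            return content.substring(a.start, a.end);
        }
    }

    public static void main(String[] args) {
        Document doc = new Document("GATE was developed in Sheffield.");
        doc.features.put("format", "plain text");

        Annotation city = new Annotation("Location", 22, 31);
        city.features.put("kind", "city");
        doc.annotations.add(city);

        // A corpus is modeled here as a Set whose members are documents.
        Set<Document> corpus = new LinkedHashSet<>();
        corpus.add(doc);

        for (Document d : corpus) {
            for (Annotation a : d.annotations) {
                System.out.println(a.type + ": " + d.textOf(a));
            }
        }
    }
}
```

Keeping annotations as offsets into unchanged content, rather than as markup inserted into the text, lets several processing steps add their own annotations to the same document without interfering with each other.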

Document Processing in GATE
  A document:
    – may come in formats including XML, RTF, email, HTML, SGML, and plain text;
    – is identified and converted into GATE annotation format;
    – is processed by PRs;
    – has its results stored in a serial data store (based on Java serialization) or as XML.

Built-in GATE Components
  Resources for common LE data structures and algorithms, including documents, corpora, and various annotation types
  A set of language analysis components for Information Extraction (e.g. ANNIE)
  A range of data visualization and editing components

Develop Language Processing Functionality using GATE
  Functionality is developed by programming, by developing Language Resources (such as grammars) that are used by existing Processing Resources, or by a mixture of both.
  The development environment is used for:
    – visualization of the data structures produced and consumed during processing
    – debugging
    – performance measurement

CREOLE
  A Collection of REusable Objects for Language Engineering
  The set of resources integrated with GATE
  All the resources are packaged as Java Archive (JAR) files, plus some XML configuration data.

PRs: ANNIE
  A family of Processing Resources for language analysis included with GATE
  Stands for A Nearly-New Information Extraction system.
  Uses finite-state techniques to implement various tasks: tokenization, semantic tagging, verb phrase chunking, and so on.

ANNIE IE Modules
  [Figure not reproduced]

ANNIE Components
  Tokenizer
  Gazetteer
  Sentence Splitter
  Part of Speech Tagger
    – produces a part-of-speech tag as an annotation on each word or symbol.
  Semantic Tagger
  OrthoMatcher Coreference Module

ANNIE Component: Tokenizer
  Token Types
    – word, number, symbol, punctuation, and spaceToken.
  A tokenizer rule has a left hand side and a right hand side.

Tokenizer Rule
  Operations used on the LHS:
    – | (or)
    – * (0 or more occurrences)
    –
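
To make the five token types concrete, here is a minimal sketch of a tokenizer that labels text with them. It is not the ANNIE tokenizer and does not use its LHS/RHS rule language; the classification is hard-coded by character class instead. The names TokenTypeSketch, Token, typeOf, and tokenize, and the particular set of punctuation characters, are assumptions chosen for illustration.

```java
import java.util.ArrayList;
import java.util.List;

/** Illustrative only: a hand-coded stand-in, not the ANNIE tokenizer or its rules. */
public class TokenTypeSketch {

    /** A token: its text plus one of the five token types. */
    static final class Token {
        final String text;
        final String type;

        Token(String text, String type) {
            this.text = text;
            this.type = type;
        }

        @Override
        public String toString() {
            return type + "(" + text + ")";
        }
    }

    /** Maps one character to a token type (assumed classification). */
    static String typeOf(char c) {
        if (Character.isLetter(c)) return "word";
        if (Character.isDigit(c)) return "number";
        if (Character.isWhitespace(c)) return "spaceToken";
        if (".,;:!?'\"()[]{}-".indexOf(c) >= 0) return "punctuation";
        return "symbol";
    }

    /** Groups runs of letters, digits, or whitespace; other characters stand alone. */
    static List<Token> tokenize(String text) {
        List<Token> tokens = new ArrayList<>();
        int i = 0;
        while (i < text.length()) {
            String type = typeOf(text.charAt(i));
            boolean groups = !type.equals("punctuation") && !type.equals("symbol");
            int j = i + 1;
            while (groups && j < text.length() && typeOf(text.charAt(j)).equals(type)) {
                j++;
            }
            tokens.add(new Token(text.substring(i, j), type));
            i = j;
        }
        return tokens;
    }

    public static void main(String[] args) {
        for (Token t : tokenize("Pay $25 now!")) {
            System.out.println(t);
        }
    }
}
```

A rule-based tokenizer expresses the same decisions as data (patterns on the left, resulting token type on the right), so the behavior can be changed by editing rules rather than code.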