
Comparing Ontology-based and Corpus-based Domain Annotations in WordNet
A paper by Bernardo Magnini, Carlo Strapparava, Giovanni Pezzulo, and Alfio Gliozzo
Presented by Rabee Ali Alshemali

Motive
Domain information is an emerging topic of interest in relation to WordNet.

Proposal
An investigation into comparing and integrating ontology-based and corpus-based domain information.

WordNet Domains (Magnini and Cavaglià 2000)
An extension of WordNet 1.6.
Provides a lexical resource where WordNet synsets have been manually annotated with domain labels such as Medicine, Sport, and Architecture.
The annotation reflects the lexico-semantic criteria adopted by the humans involved in the annotation and takes advantage of existing conceptual relations in WordNet.

Question
How well does this annotation reflect the way synsets occur in a certain text collection?

Why is this important?
It is particularly relevant when we want to use manual annotation for text processing tasks, e.g. Word Sense Disambiguation.

Example to Illustrate
Consider the following synset: {heroin, diacetyl morphine, horse, junk, scag, smack}
It is annotated with the Medicine domain because heroin is a drug, and that is maybe best described as medical knowledge.

Example to Illustrate (Cont.)
On the other hand, on the text side, if we consider a news collection (the Reuters corpus, for example), the word heroin is likely to occur in the context of either:
Crime news
Administrative news
and without any strong relation to the medical field.

The moral behind the example
We can clearly see the difference:
Manual annotation considers the technical use of the word.
Text, on the other hand, records a wider context of use.

How to reconcile?
Both sources carry relevant information, so supporting ontology-based domain annotations with corpus-based distribution will probably give the best potential for content-based text analysis.

What is needed?
First step: a methodology is required to automatically acquire domain information for synsets in WordNet from a categorized corpus.
The Reuters
corpus is used because it is free and neatly organized by means of topic codes, which makes comparisons with WordNet Domains easier.
Optimal goal: a large-scale automatic acquisition of domain information for WordNet synsets.
However, the investigation was limited to a small set of topic codes.

Why is domain information interesting?
Due to its utility in many scenarios, such as:
Word Sense Disambiguation (WSD), where information from domain labels is used to establish semantic relations among word senses.
Text Categorization (TC), where categories are represented as symbolic labels.

WordNet Domains
Domains have been used to mark technical usages of words.
In dictionaries, domain labels are used only for a small portion of the lexicon.
Therefore, WordNet Domains is an attempt to extend the coverage of domain labels within an already existing lexical database, WordNet (version 1.6).
Synsets have been annotated with at least one domain label, selected from a set of about 200 hierarchically organized labels.

WordNet Domains
Labels shown from the domain hierarchy: PHILOSOPHY, ARCHAEOLOGY, ASTROLOGY, RELIGION, PALEOGRAPHY, THEOLOGY, MYTHOLOGY, OCCULTISM, DOCTRINES, PSYCHOLOGY, PSYCHOANALYSIS, LITERATURE, PHILOLOGY, LINGUISTICS, GRAMMAR, HISTORY, HERALDRY, PHOTOGRAPHY, ART, THEATRE, MUSIC

WordNet Domains
Information brought by domains is complementary to what is already in WordNet.
Three key observations:
1. A domain may include synsets of different syntactic categories.
For example, the Medicine domain groups together senses from nouns, such as doctor#1 and hospital#1, and also from verbs, such as operate#1.

WordNet Domains
2. A domain may include senses from different WordNet sub-hierarchies.
For example, the Sport domain contains senses such as:
athlete#1 from life_form#1
game_equipment#1 from physical_object#1
sport#1 from act#2
playing_field#1 from location#1

WordNet Domains
3. Domains may group senses of the same word into homogeneous clusters, with the side effect of reducing word polysemy.

WordNet Domains
The word bank has 10 different senses.
Three of them (1, 3, and 6) can be grouped
under the Economy domain, while 2 and 7 both belong to the Geography and Geology domains.
Reduction of the polysemy from 10 to 7 senses.

Sense  Synset and Gloss                                                           Domains
1      depository financial institution, bank, banking concern, banking company  Economy
2      bank (sloping land)                                                        Geography, Geology
3      bank (a supply or stock held in reserve)                                   Economy
4      bank, bank building (a building)                                           Architecture, Economy
5      bank (an arrangement of similar objects)                                   Factotum
6      savings bank, coin bank, money box                                         Economy
7      bank (a long ridge or pile)                                                Geography, Geology
8      bank (the funds held by a gambling house)                                  Economy, Play
9      bank, cant, camber (a slope in the turn of a road)                         Architecture
10     bank (a flight maneuver)                                                   Transport

Procedure for synset annotation
It is an inheritance-based procedure to automatically mark synsets.
A small number of high-level synsets are manually annotated with their pertinent domains.
An automatic procedure exploits WordNet relations (i.e. hyponymy, antonymy, meronymy) to extend the manual assignments to all reachable synsets.
Example
o Consider the following synset: {beak, bill, neb, nib}
o It will be automatically marked with the code Zoology, starting from the synset bird and following the part-of relation.

Issues (oh man, why do there always have to be issues?)
o Wrong propagation. Consider:
barber_chair#1 is part of barbershop#1
barbershop#1 is annotated with Commerce
barber_chair#1 would wrongly inherit the same domain
Therefore, in such cases, the inheritance procedure has to be blocked to prevent wrong propagation.

How to fix?
The inheritance procedure allows the declaration of exceptions.
Example: assign shop#1 to Commerce with exception [part, isa, shop#1]
This assigns the synset shop#1 to Commerce but excludes the parts of the children of shop#1, such as barbershop#1.

Issues (Cont.)
FACTOTUM: a number of WordNet synsets do not belong to a specific domain but can appear in many of them. Therefore, a Factotum label is created for this purpose.
It includes two types of synsets:
1. Generic synsets
2. Stop sense synsets

Generic Synsets
They are hard to classify in a particular domain.
Examples:
man#1: an adult male person (vs. woman)
man#3: any human being (generic)
date#1: day of the month
date#3: appointment, engagement
They are placed high in the hierarchy; many verb synsets belong to this category.

Stop Sense Synsets
Include non-polysemous words.
Behave as stop words, since they don't contribute to the overall sense of a text.
Examples: numbers, weekdays, colors.

Specialistic vs Generic Usages
About 250 domain labels in WordNet Domains.
Some synsets occur
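
Sketch: inheritance with exceptions
The inheritance-based annotation procedure described earlier (manual seeds, propagation along WordNet relations, exceptions that block wrong propagation) can be illustrated with a small sketch. This is not the authors' implementation. The toy graph, the function name propagate, and the way an exception is encoded (a set of relation types that are not followed once the walk is below the seed synset) are all assumptions made for illustration.

```python
from collections import deque

# Toy relation graph: synset -> list of (relation, related synset).
# Synset names and edges are illustrative assumptions, not real WordNet data.
EDGES = {
    "bird#1": [("part", "beak#1")],
    "shop#1": [("isa", "barbershop#1")],
    "barbershop#1": [("part", "barber_chair#1")],
}


def propagate(seeds, edges, exceptions=None):
    """Extend manually assigned domains to every reachable synset.

    seeds      : dict mapping a manually annotated synset to its domain
    edges      : dict mapping a synset to a list of (relation, synset) pairs
    exceptions : dict mapping a seed to the relation types that must not be
                 followed from the seed's descendants (hypothetical encoding)
    """
    exceptions = exceptions or {}
    labels = {}
    for seed, domain in seeds.items():
        blocked = exceptions.get(seed, set())
        queue = deque([(seed, 0)])
        seen = {seed}
        while queue:
            synset, depth = queue.popleft()
            labels.setdefault(synset, set()).add(domain)
            for relation, child in edges.get(synset, []):
                # The exception applies below the seed, not at the seed itself.
                if depth >= 1 and relation in blocked:
                    continue
                if child not in seen:
                    seen.add(child)
                    queue.append((child, depth + 1))
    return labels


seeds = {"bird#1": "Zoology", "shop#1": "Commerce"}
plain = propagate(seeds, EDGES)
guarded = propagate(seeds, EDGES, exceptions={"shop#1": {"part"}})

print(plain.get("barber_chair#1"))    # {'Commerce'}: the wrong propagation
print(guarded.get("barber_chair#1"))  # None: blocked by the exception
print(guarded["beak#1"])              # {'Zoology'}: still inherited from bird#1
```

Without the exception, barber_chair#1 inherits Commerce through barbershop#1. With it, barbershop#1 keeps Commerce but its parts are left unlabeled, which mirrors the shop#1 example above.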



UA CSC 620 - Study Notes
