UA CSC 620 - Comparing Ontology-based and Corpus-based Domain Annotations in WordNet


Comparing Ontology-based and Corpus-based Domain Annotations in WordNet

A paper by: Bernardo Magnini, Carlo Strapparava, Giovanni Pezzulo, Alfio Gliozzo
Presented by: Rabee Ali Alshemali

Motive
• Domain information is an emerging topic of interest in relation to WordNet.

Proposal
• An investigation into comparing and integrating ontology-based and corpus-based domain information.

WordNet Domains (Magnini and Cavaglia 2000)
• An extension of WordNet 1.6.
• Provides a lexical resource in which WordNet synsets have been manually annotated with domain labels such as Medicine, Sport, and Architecture.
• The annotation reflects the lexico-semantic criteria adopted by the human annotators and takes advantage of existing conceptual relations in WordNet.

Question
• How well does this annotation reflect the way synsets occur in a given text collection?
• Why is this important? It is particularly relevant when we want to use manual annotation for text processing tasks (e.g. Word Sense Disambiguation).

Example to Illustrate
• Consider the following synset: {heroin, diacetyl morphine, horse, junk, scag, smack}.
• It is annotated with the Medicine domain because heroin is a drug, and that is arguably best described as medical knowledge.

Example to Illustrate (cont.)
• On the text side, if we consider a news collection (the Reuters corpus, for example), the word heroin is likely to occur in the context of either:
  - Crime news
  - Administrative news
• In neither case is there a strong relation with the medical field.

The moral behind the example
• We can clearly see the difference:
  - Manual annotation considers the technical use of the word.
  - Text, on the other hand, records a wider context of use.

How to reconcile?
• Both sources carry relevant information, so supporting ontology-based domain annotations with corpus-based distribution will probably give the best potential for content-based text analysis.

What is needed?
• First step: a methodology is required to automatically acquire domain information for synsets in WordNet from a categorized corpus.
• The Reuters corpus is used because it is free and neatly organized by means of topic codes, which makes comparisons with WordNet Domains easier.

Optimal Goal
• A large-scale automatic acquisition of domain information for WordNet synsets.
• However, the investigation was limited to a small set of topic codes.

Why is domain information interesting?
• Due to its utility in many scenarios, such as:
  - Word Sense Disambiguation (WSD), where information from domain labels is used to establish semantic relations among word senses.
  - Text Categorization (TC), where categories are represented as symbolic labels.

WordNet Domains
• Domains have been used to mark technical usages of words.
• In dictionaries, they are used only for a small portion of the lexicon. Therefore:
• WordNet Domains is an attempt to extend the coverage of domain labels within an already existing lexical database.
• WordNet (version 1.6) synsets have been annotated with at least one domain label selected from a set of about 200 hierarchically organized labels.

WordNet Domains (figure: fragment of the label hierarchy; labels shown on the slide)
DOCTRINES, PSYCHOLOGY, MYTHOLOGY, OCCULTISM, PALEOGRAPHY, THEOLOGY, ART, LITERATURE, GRAMMAR, PSYCHOANALYSIS, LINGUISTICS, RELIGION, ASTROLOGY, HISTORY, ARCHAEOLOGY, PHILOSOPHY, HERALDRY, MUSIC, PHILOLOGY, THEATRE, PHOTOGRAPHY

WordNet Domains
• Information brought by domains is complementary to what is already in WordNet.

Three key observations
1- A domain may include synsets of different syntactic categories. For example:
  - The Medicine domain groups together senses from nouns, such as doctor#1 and hospital#1, and also from verbs, such as operate#1.

WordNet Domains
2- A domain may include senses from different WordNet sub-hierarchies. For example, the Sport domain contains senses such as:
  - athlete#1, from life_form#1
  - game_equipment#1, from physical_object#1
  - sport#1, from act#2
  - playing_field#1, from location#1

WordNet Domains
3- Domains may group senses of the same word into homogeneous clusters, with the side effect of a reduction in word polysemy.

WordNet Domains
• The word “bank” has 10 different senses.
• Three of them (#1, #3, and #6) can be grouped under the Economy domain.
• Senses #2 and #7 both belong to the Geography and Geology domains.
• Result: a reduction of the polysemy from 10 to 7 senses.

Sense  Synset and Gloss                                                      Domains
#1     depository financial institution, bank, banking, banking company      Economy
#2     bank (sloping land …)                                                 Geography, Geology
#3     bank (a supply or stock held in reserve)                              Economy
#4     bank, bank building (a building …)                                    Architecture, Economy
#5     bank (an arrangement of similar objects …)                            Factotum
#6     savings bank, coin bank, money box                                    Economy
#7     bank (a long ridge or pile …)                                         Geography, Geology
#8     bank (the funds held by a gambling house …)                           Economy, Play
#9     bank, cant, camber (a slope in the turn of a road …)                  Architecture
#10    bank (a flight maneuver …)                                            Transport

Procedure for synset annotation
• It is an inheritance-based procedure to automatically mark synsets.
• A small number of high-level synsets are manually annotated with their pertinent domains.
• An automatic procedure exploits WordNet relations (e.g. hyponymy, antonymy, meronymy, …) to extend the manual assignments to all reachable synsets.

Example
o Consider the following synset: {beak, bill, neb, nib}
o It will be automatically marked with the code Zoology, starting from the synset {bird} and following the “part_of” relation.

Issues
Oh man, why do there always have to be issues!? :o)
• Wrong propagation. Consider:
  - barber_chair#1 is “part_of” barber_shop#1
  - barber_shop#1 is annotated with Commerce
  - barber_chair#1 would wrongly inherit the same domain.
• Therefore, in such cases, the inheritance procedure has to be blocked to prevent wrong propagation.

How to fix
• The inheritance procedure allows the declaration of “exceptions”.
• Example: assign shop#1 to Commerce with exception [part, isa, shop#1], which assigns the synset shop#1 to Commerce but excludes the parts of the children of shop#1, such as barber_shop#1.

Issues (cont.)
• FACTOTUM: a number of WordNet synsets do not belong to a specific domain but can appear in many of them. Therefore, a Factotum label is created for this purpose.
• It includes two types of synsets:
  1- Generic synsets
  2- Stop sense synsets

Generic Synsets
• They are hard to classify in a particular domain.
• Examples:
  - man#1: an adult male person (vs. woman)
  - man#3: any human being (generic)
  - date#1: day of the month
  - date#3: appointment,
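
As a rough illustration of the inheritance-based annotation procedure and its exception mechanism described in these slides, here is a minimal Python sketch. It is not the authors' tool: the toy relation graph, the function name propagate, and the encoding of an exception as a (seed synset, relation path) pair are all assumptions made for illustration, since the slides do not say how [part, isa, shop#1] is actually represented. In this sketch a synset reached from the seed by exactly the blocked relation sequence is neither labelled nor expanded.

```python
from collections import deque

# Toy relation graph: (source synset, relation, target synset).
# Synset names and edges are illustrative, not read from WordNet.
EDGES = [
    ("bird#1", "part", "beak#1"),
    ("shop#1", "isa", "barber_shop#1"),
    ("barber_shop#1", "part", "barber_chair#1"),
]


def propagate(seeds, edges, exceptions=()):
    """Extend manual seed labels along relations, skipping blocked paths.

    seeds: dict mapping a manually annotated synset to its domain.
    exceptions: iterable of (seed, relation_path) pairs (hypothetical
        encoding); a synset reached from `seed` by exactly that relation
        sequence is not labelled and is not expanded further.
    """
    blocked = {(seed, tuple(path)) for seed, path in exceptions}

    # Index outgoing edges by source synset.
    outgoing = {}
    for src, rel, dst in edges:
        outgoing.setdefault(src, []).append((rel, dst))

    labels = {}
    for seed, domain in seeds.items():
        labels.setdefault(seed, set()).add(domain)
        queue = deque([(seed, ())])  # (synset, relations followed so far)
        seen = {seed}
        while queue:
            node, path = queue.popleft()
            for rel, nxt in outgoing.get(node, []):
                new_path = path + (rel,)
                if (seed, new_path) in blocked or nxt in seen:
                    continue
                seen.add(nxt)
                labels.setdefault(nxt, set()).add(domain)
                queue.append((nxt, new_path))
    return labels


seeds = {"bird#1": "Zoology", "shop#1": "Commerce"}
# Block "parts of the children of shop#1": follow isa, then part.
exceptions = [("shop#1", ("isa", "part"))]
labels = propagate(seeds, EDGES, exceptions)
```

With the exception in place, beak#1 inherits Zoology from bird#1 and barber_shop#1 inherits Commerce from shop#1, while barber_chair#1 is left unlabelled. Calling propagate(seeds, EDGES) without exceptions reproduces the wrong propagation from the slides, where barber_chair#1 ends up in Commerce. Exact-path matching is a deliberate simplification; a real implementation would also need to decide how exceptions apply to deeper hyponym chains.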

